Using the matrix to help Meta gear up

Just 12 months after it was created, in December 2004, 1 million people were active on Facebook. As of December 2021, it had an average of 1.93 billion daily active users. EPFL is in a unique collaboration with its parent company, Meta, on distributed deep learning research.

For a user base of this size, large-scale automated systems must be used to understand user experience in order to ensure accuracy and success. EPFL’s Machine Learning and Optimization Laboratory (MLO), led by Professor Martin Jaggi, is in active collaboration with Meta Platforms, Inc., Facebook’s parent company, to solve this unique challenge.

With funding from EPFL’s EcoCloud Research Center, MLO collaborates with Meta through internships at the company for MLO researchers and the use by Meta of a pioneering MLO invention: PowerSGD. MLO is helping Meta to analyze and better understand millions of users’ experiences while at the same time respecting user privacy. This requires collaborative learning, that is, privacy-preserving analysis of information from many devices for the training of a neural network that gathers, and even predicts, patterns of behavior.

To do this, a key strategy is to divide the study of these patterns over “the edge”, using both the user’s device and others that sit between it and the data center, as a form of distributed training. This requires a fast flow of information and efficient analysis of the data. PowerSGD is an algorithm that compresses model updates in matrix form, allowing a drastic reduction in the communication required for distributed training. When applied to standard deep learning benchmarks, such as image recognition or transformer models for text, the algorithm saves up to 99% of the communication while retaining good model accuracy.
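The idea of compressing a matrix-shaped update can be illustrated with a minimal sketch. It assumes a low-rank approximation obtained with a single power-iteration step, in the spirit of PowerSGD. The function names (`compress`, `decompress`, `communication_saving`) and the matrix sizes are illustrative and are not taken from MLO's code, and refinements of the published method, such as error feedback, are left out.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def orthonormalize(p):
    """Gram-Schmidt orthonormalization of the columns of p (n x r)."""
    basis = []
    for v in transpose(p):
        for u in basis:
            dot = sum(x * y for x, y in zip(v, u))
            v = [x - dot * y for x, y in zip(v, u)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v] if norm > 1e-12 else v)
    return transpose(basis)


def compress(grad, q):
    """One power-iteration step: factors (p, q_new) with grad ~ p @ q_new^T."""
    p = orthonormalize(matmul(grad, q))   # n x r
    q_new = matmul(transpose(grad), p)    # m x r
    return p, q_new


def decompress(p, q):
    """Rebuild the approximate n x m update from its two factors."""
    return matmul(p, transpose(q))


def communication_saving(n, m, rank):
    """Fraction of values not sent when the factors replace the full matrix."""
    return 1 - rank * (n + m) / (n * m)


# Hypothetical 3 x 4 update that happens to have rank 1.
u, v = [1.0, 2.0, 3.0], [1.0, 0.0, -1.0, 2.0]
grad = [[a * b for b in v] for a in u]
q0 = [[1.0] for _ in v]                   # starting guess, m x 1
p, q = compress(grad, q0)
approx = decompress(p, q)                 # matches grad up to rounding
```

In this sketch, a 1000 × 1000 update sent at rank 4 needs two factors of 4,000 values each instead of one million values, a saving of about 99%, which is the order of magnitude quoted above.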

PowerSGD was used to speed up training of the XLM-R model by up to 2x. XLM-R is a critical Natural Language Processing model powering most of the text understanding services at Meta. Facebook, Instagram, WhatsApp and Workplace all rely on XLM-R for their text understanding needs. Use cases include: 1) Content Integrity: detecting hate speech, violence, bullying and harassment; 2) Topic Classification: the classification of topics enabling feed ranking of products like Facebook; 3) Business Integrity: detecting any policy violation for Ads across all products; 4) Shops: providing better product understanding and recommendations for shops.

“There are three aspects to the process. The first is to develop gradient compression algorithms to speed up the training, reducing the time required to prepare this information for its transfer to a centralized hub. The second is efficient training of the neural network within a data center – it would normally take several weeks to process all the information, but we distribute the training, reducing computation from months to days,” said MLO doctoral researcher Tao Lin.

Tao Lin of MLO

As a third aspect, privacy is a constant factor under consideration. “We have to distinguish between knowledge and data. We need to ensure users’ privacy by making sure that our learning algorithms can extract knowledge without extracting their data and we can do this through federated learning,” continued Lin.
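The distinction between knowledge and data can be sketched with a toy version of federated averaging. This is an assumption about how such a scheme might look, not a description of the system used at Meta: the one-parameter model, the function names (`local_update`, `federated_round`) and the two small datasets are all hypothetical. Each device fits the model on its own data, and only the resulting weight leaves the device.

```python
def local_update(weight, data, lr=0.1, steps=20):
    """Gradient descent on one device for the model y = w * x."""
    w = weight
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, devices):
    """Devices train locally; the server averages weights, never raw data."""
    updates = [(local_update(global_w, data), len(data)) for data in devices]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets, both generated by y = 3 * x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (1.0, 3.0)],
]

w = 0.0
for _ in range(10):
    w = federated_round(w, devices)
# w is now close to 3, although the server never saw any (x, y) pair.
```

Weighting each device by its number of samples is one common design choice; a plain average would also work when devices hold similar amounts of data.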

The PowerSGD algorithm has been gaining in reputation over the last few years. The developers of deep learning software PyTorch have included it as part of their software suite (PyTorch 1.10), which is used by Meta, OpenAI, Tesla and similar technology corporations that rely on artificial intelligence. The collaboration with Meta is due to run until 2023.

Authors: Tanya Petersen, John Maxwell

Source: EPFL

This content is distributed under a Creative Commons CC BY-SA 4.0 license. You may freely reproduce the text, videos and images it contains, provided that you indicate the author’s name and place no restrictions on the subsequent use of the content. If you would like to reproduce an illustration that does not contain the CC BY-SA notice, you must obtain approval from the author.


Senior Sustainable Computing Infrastructure Specialist

EcoCloud’s mission is to provide a suitable research and cooperation framework for EPFL faculty and their affiliated students, as well as Industrial Affiliates, to collaborate on joint R&D projects targeting the sustainable development of society thanks to IT infrastructure.

In this context, EcoCloud aims to develop and maintain an EPFL-wide cloud computing experimental facility to conduct multi-laboratory research on topics ranging from server and rack hardware technologies (computing, storage and cooling) to innovative software design (ICT stack, VM schedulers, energy-efficient machine learning mapping, etc.).

To manage this experimental facility and the developed multi-laboratory research projects, EcoCloud Center is looking to recruit a:

Senior Sustainable Computing Infrastructure Specialist

Main duties and responsibilities:

  • Administration of different types of servers for research and prototyping tasks, as well as experimental facility management
  • System administration / integration in the scope of research and teaching projects
  • Architect suitable monitoring infrastructures for the configured experimental setup.
  • IT support to the research groups affiliated with EcoCloud using the experimental facility.
  • Purchasing of all IT material including: Servers, monitoring instrumentation infrastructure, cables, and consumables for the Experimental Facility
  • Configuration of 20-25 custom server racks (equipped with configurable cooling loops outside of the mission-critical scope of the data center, top-of-rack switches, inter-rack networking, and space for additional measurement equipment)
  • Setup of a number of experimental servers for power and thermal monitoring.
  • Installation and maintenance of equipment (the thermal modeling setup in POWERLab/ESL, including accurate thermal camera equipment) in the new experimental facility.
  • Develop custom monitoring infrastructures with sensors and hardware built-in of servers and racks for energy, power consumption, temperature, humidity, and vibrations monitoring.
  • Monitor hardware boxes and, for actual computing hardware fabrication, develop a new multi-FPGA setup infrastructure.
  • Make the experimental infrastructure accessible to well-trained IT personnel so that they can reprogram and monitor this hardware acceleration system, including 10 additional desks for experimental equipment, and oversee the work done in this space.

Your profile:

  • PhD in Computer Engineering or Computer Science or Electrical Engineering
  • Experience in coordinating the development of research experiments and protocols.
  • Experience in implementation of software infrastructure for the storage of information.
  • Experience in designing and implementing the infrastructure to store and retrieve the results of the experiments (particularly using remote means).
  • Solid project and IT management skills
  • Knowledge of C, C++, Linux scripting, server administration tools, VMWare, XEN, OpenMP, FPGA EDA tools (Xilinx), Cloud Computing Systems (AWS, Oracle Cloud, Microsoft Azure, and Google Cloud).
  • Fluent in English (written and spoken)
  • Experience in working with industrial partners
  • Knowledge of EPFL infrastructure is a plus
  • Strong interpersonal skills, service-mindedness
  • Ability to mentor other resources
  • Creative, self-driven and autonomous

We offer:

  • A stimulating, dynamic, interdisciplinary, and multicultural environment with various academic challenges
  • An opportunity to evolve in a framework promoting advanced technology
  • Possibilities for further education

Start date: to be agreed upon

Term of employment: Unlimited (CDI)

Remark:
Only candidates who applied through the EPFL website or our partner Jobup’s website will be considered. Files sent by agencies without a mandate will not be taken into account.



EcoCloud Annual Event

Date of the event:
24th May, 2022

Lausanne Palace Hotel 
Rue du Grand Chêne, 7-9
CH-1002 Lausanne
+41 21 331 31 31


8:00 – 8:30 Pick-up badges/registration + welcome coffee
8:30 Introduction
Anna Fontcuberta i Morral, David Atienza and Ed Bugnion (EPFL)
Session 1 Sustainable smart cities, transport systems and agriculture
9:00 Digital Twins at the Service of Sustainable Cities & Transport Systems: Challenges?
Frédéric Dreyer (EPFL)
9:15 Towards a Data-driven Operational Digital Twin for Railway Wheels
Olga Fink (EPFL)
9:35 Computational aspects of climate/smart agriculture and forestry
Marilyn Wolf (University of Nebraska-Lincoln)
10:00 Data production in renewable energy hubs
François Maréchal (EPFL)
10:30 Coffee break
Session 2 Energy-constrained and sustainable deep learning
11:00 Advanced computational imaging: the rise and fall of deep neural nets
Michael Unser (EPFL)
11:35 Carbon-Aware Deep Learning to Promote Model Optimization: An Application in Global Health
Annie Hartley (EPFL)
12:00 – 13:00 Standing lunch
Session 3 Energy-constrained and trustworthy computing systems
13:00 A cryptocurrency to Save the Planet
Rachid Guerraoui (EPFL)
13:30 Solving Hard Optimization Problems with Light
Francesca Parmigiani (Microsoft)
14:00 – 15:00 Break
Session 4 Sustainable Scientific Computing
15:00 Some aspects of scientific computing and data management in climate and environmental applications
Michael Lehning (EPFL) 
15:25 Optical Computing With Optical Fibers
Christophe Moser (EPFL)
15:50 Heating up while cooling down: How Decentralized Multi-Functional Computing Infrastructures can contribute to reach Switzerland’s climate goals?
Berat Denizdurduran (DeepSquare)
16:10 Challenges in solving high profile, high impact scientific problems with HPC: the tokamak fusion reactors and the Square Kilometer Array cases
Gilles Fourestey (EPFL)
16:30 – 18:30 Poster session + networking aperitif


We stand with Ukraine

EcoCloud strongly condemns Russia’s military invasion and acts of war in Ukraine, as well as the dreadful violation of international humanitarian and human rights law. We are really shocked by the tragedy currently unfolding in Ukraine, and we fully support everyone affected by the war.

The EcoCloud community calls on the different governments to take immediate action to protect everyone in that country, particularly including its civilian population and people affiliated with its universities. Now more than ever, we must promote our societal values (justice, freedom, respect, community, and responsibility) and confront this situation collectively and peacefully to end this nonsensical war.


ASPLOS is back – in person

After going virtual since 2020, ASPLOS is returning to Lausanne for the 2022 edition, 28th February to 4th March.

The 2022 edition of ASPLOS marks its 40th anniversary. In 1982, ASPLOS emerged as the ultimate conference for researchers from a variety of software and hardware system communities to collaborate and engage technically. It has been ahead of the curve in technologies such as RISC and VLIW processors, small and large-scale multiprocessors, clusters and networks-of-workstations, optimizing compilers, RAID, and network-storage system designs.

Today, as we enter the post-Moore era, ASPLOS 2022 re-establishes itself as the premier platform for cross-layer design and optimization to address fundamental challenges in computer system design in the coming years.

We look forward to welcoming you at the SwissTech Center, EPFL.

© 2022 Cyber-Defence Campus

Heterogeneous computing creates new electrical-level vulnerabilities

Under the initiative of the armasuisse – Cyber-Defence Campus, a team of EPFL scientists, including CYD Doctoral Fellow Dina Mahmoud of PARSA, recently presented the first proof-of-concept for undervolting-based fault injection from the programmable logic of a field programmable gate array (FPGA) to the software executing on a processing system in the same system-on-chip (SoC). The team also proposes a number of future research directions, which, if addressed, should help to ensure the security of today’s heterogeneous computing systems.

Most cyberattacks, such as ransomware, exploit vulnerabilities in software. While often neglected, hardware-based attacks can be just as powerful, on top of being more difficult to patch, as the underlying vulnerability remains in the deployed hardware. Hardware attacks in which adversaries have physical access to their target devices have long been investigated. However, with the World Wide Web and the possibility of accessing computing resources remotely in the cloud, remotely controlled hardware attacks have become a reality. Examples of remote attacks include fault-injection attacks, which cause computation or data manipulation errors, and side-channel attacks, which extract secrets from power or electromagnetic side channels.

With Moore’s law losing pace in recent years, customizable hardware combining various types of processing units in one heterogeneous system has become a global trend to increase performance. Since heterogeneous computing is a relatively recent phenomenon, not all security vulnerabilities have been fully understood or investigated. To better understand the landscape of cybersecurity in relation to heterogeneous systems, we surveyed state-of-the-art research on electrical-level attacks and defenses. We focused on exploits which leverage vulnerabilities caused by the electrical signals or their coupling. For example, demanding more power than the power supply can provide results in lowered voltage for the entire system; the undervolting can affect the functioning of the circuits (e.g., in a computer) and cause faults. Alternatively, an adversary can monitor minute variations in the voltage waveform and use them to classify or even fully uncover the operations executed by the victim. Our survey, which will appear in ACM Computing Surveys, addresses the electrical-level attacks on central processing units (CPUs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs), the three processing units frequently combined in heterogeneous platforms. We discuss whether electrical-level attacks targeting only one processing unit can extend to the heterogeneous system as a whole and highlight open research directions necessary for ensuring the security of these systems in the future.

In the survey, we discuss a number of system-level vulnerabilities which have not been investigated yet. One of the open research questions we highlight is the possibility of inter-component fault-injection attacks. In our subsequent work, which will be presented in March at the Design, Automation and Test in Europe conference (DATE 2022), we demonstrate the feasibility of such an attack. We show the first undervolting attack in which circuits, implemented using the FPGA programmable logic, act as an aggressor while the CPU, residing on the same system-on-chip, is the victim. We program the FPGA to deploy malicious hardware circuits in order to force the FPGA to draw considerable current and cause a drop in the power supply voltage. Since the power supply is shared, the obtained voltage drop propagates across the entire chip. As a result, the computation performed by the CPU faults. If exploited in a remote setting, this attack can lead to denial of service or a data breach. With these findings, we further confirm the need for continuing research on the security of heterogeneous systems in order to prevent such attacks.
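The chain of events described above (high current draw, a drop on the shared supply, a fault in the CPU) can be pictured with a deliberately simplified toy model. It assumes a purely resistive voltage drop on one shared rail; every number and name in it (`rail_voltage`, `cpu_faults`, the resistance, currents and threshold) is hypothetical and chosen only for illustration, and it ignores the transient effects that a real attack would depend on.

```python
def rail_voltage(v_nominal, resistance_ohm, currents_amp):
    """Voltage on a shared rail after a simple resistive (IR) drop."""
    return v_nominal - resistance_ohm * sum(currents_amp)


def cpu_faults(voltage, v_min):
    """In this toy model, the CPU miscomputes once the rail dips below v_min."""
    return voltage < v_min


# Hypothetical values: 1.0 V rail, 10 milliohm supply path, 0.9 V threshold.
V_NOMINAL, R_SUPPLY, V_MIN = 1.0, 0.01, 0.9
CPU_CURRENT = 2.0

idle = rail_voltage(V_NOMINAL, R_SUPPLY, [CPU_CURRENT, 1.0])     # benign FPGA load
attack = rail_voltage(V_NOMINAL, R_SUPPLY, [CPU_CURRENT, 15.0])  # power-hungry circuits

print(cpu_faults(idle, V_MIN), cpu_faults(attack, V_MIN))
```

The point of the sketch is only that the CPU's own current never changes: the fault comes entirely from what the neighbouring FPGA logic draws from the shared supply.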


Funding
The CYD Fellowships are supported by armasuisse Science and Technology.

References
Mahmoud, Dina Gamaleldin Ahmed Shawky; Hussein, Samah; Lenders, Vincent; Stojilović, Mirjana: FPGA-to-CPU Undervolting Attacks. 25th Design, Automation and Test in Europe – DATE 2022, Antwerp, Belgium [Virtual], March 14-23, 2022.

Mahmoud, Dina G.; Lenders, Vincent; Stojilović, Mirjana: Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era. ACM Computing Surveys, Volume 55, Issue 3, April 2022, Article No.: 58. DOI: 10.1145/3498337


Intel funds EcoCloud Midgard-based research

An exciting new development in the progress of Midgard, a novel re-envisioning of the virtual memory abstraction ubiquitous to computer systems, sees a tech leader funding research that will bring together experts from Yale, the University of Edinburgh and EcoCloud at EPFL.

Global semiconductor manufacturer Intel is sponsoring an EcoCloud-led project entitled “Virtual Memory for Post-Moore Servers”, which is part of its wider research into improving power performance and total cost of ownership for servers in large-scale data centers.

What is the Post-Moore era?

Moore’s law was conceived in 1965 by Gordon Moore, who would later become CEO of Intel. Moore predicted that the number of transistors in a dense integrated circuit would double roughly every two years. This has been remarkably accurate up to now, but we are reaching the stage where physical limitations will curtail this pattern within the next couple of years: we are approaching the “Post-Moore era”. Many observers are optimistic about the continuation of technological progress in a variety of other areas, including new chip architectures, quantum computing and artificial intelligence.
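To give a feel for what doubling every two years means, here is a small worked calculation. The function name `transistor_growth` is made up for this illustration, and the two-year doubling period is simply the figure quoted above.

```python
def transistor_growth(years, doubling_period_years=2.0):
    """Growth factor in transistor count after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)


decade = transistor_growth(10)       # five doublings
two_decades = transistor_growth(20)  # ten doublings
print(decade, two_decades)
```

Five doublings in a decade give a 32-fold increase, and ten doublings in two decades give a 1,024-fold increase, which is why even a modest slowdown in the doubling period has a large cumulative effect.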

Midgard is a radical new technology which helps provide optimizations for memory in data centers with an innovative, highly efficient namespace for access and protection control. Efficient memory protection is a foundation for virtualization, confidential computing, use of accelerators, and emerging computing paradigms such as serverless computing.

The Midgard layer is an intermediate stratum that renders possible staggering gains in performance for data servers as memory grows, and which is compatible with modern operating systems such as Linux, Android, macOS and Windows.

The project Virtual Memory for Post-Moore Servers aims to disrupt traditional server technology, targeting full-stack evaluation and hardware/software co-design, based on Midgard’s radical approach to virtual memory.

Midgard is a consortium of the following principal investigators at EcoCloud, University of Edinburgh and Yale:
David Atienza, Abhishek Bhattacharjee, Babak Falsafi, Boris Grot and Mathias Payer.


Compusapien: More computing, less energy

© cherezoff / Adobe Stock

Today’s data centres have an efficiency problem – much of their energy is used not to process data, but to keep the servers cool. A new server architecture under development by the EU-funded COMPUSAPIEN project could solve this.

As the digital revolution continues to accelerate, so too does our demand for more computing power. Unfortunately, current semiconductor technology is energy-inefficient, meaning so too are the servers and cloud technologies that depend on them. In fact, as much as 40 % of a server’s energy is used just to keep it cool. “This problem is aggravated by the fact that the complex design of the modern server results in a high operating temperature,” says David Atienza Alonso, who heads the Embedded Systems Laboratory (ESL) at the Swiss Federal Institute of Technology Lausanne (EPFL). “As a result, servers cannot be operated at their full potential without the risk of overheating and system failures.” To tackle this problem, the EU has issued several policies addressing the increasing energy consumption of data centres, including the JRC EU Code for Data Centres. According to Atienza Alonso, meeting the goals of these policies requires an overhaul of computing server architecture and the metrics used to measure their efficiency – which is exactly what the COMPUSAPIEN (Computing Server Architecture with Joint Power and Cooling Integration at the Nanoscale) project aims to do. “The project intends to completely revise the current computing server architecture to drastically improve its energy efficiency and that of the data centres it serves,” explains Atienza Alonso, who serves as the project’s principal investigator.

Cooling conundrum

At the heart of the project, which is supported by the European Research Council, is a disruptive, 3D architecture that can overcome the worst-case power and cooling issues that have plagued servers. What makes this design so unique is its use of a heterogeneous, many-core architecture template with an integrated on-chip microfluidic fuel cell network, which allows the server to simultaneously provide both cooling and power. According to Atienza Alonso, this design represents the ultimate solution to the server cooling conundrum. “This integrated, 3D cooling approach, which uses tiny microfluidic channels to both cool servers and convert heat into electricity, has proved to be very effective,” he says. “This guarantees that 3D many-core server chips built with the latest nanometre-scale process technologies will not overheat and stop working.”

A greener cloud

Atienza Alonso estimates that the new 3D heterogeneous computing architecture template, which recycles the energy spent in cooling with the integrated microfluidic fuel cell array (FCA) channels, could recover 30-40 % of the energy typically consumed by data centres. With more gains expected when the FCA technology is improved in the future, the energy consumption (and environmental impact) of a data centre will be drastically reduced, with more computing being done using the same amount of energy. “Thanks to integration of new optimised computing architectures and accelerators, the next generation of workloads on the cloud (e.g. deep learning) can be executed much more efficiently,” adds Atienza Alonso. “As a result, servers in data centres can serve many more applications using much less energy, thus dramatically reducing the carbon footprint of the IT and cloud computing sector.”




Qualcomm Fellowship for Sid Gupta

As the innovative Midgard technology gains momentum, one of its key researchers has been rewarded for his work by Qualcomm Technologies.

The Qualcomm Innovation Fellowship rewards excellent young researchers in the field of AI and cybersecurity with individual prizes of $40,000, dedicated mentors from the Qualcomm Technologies team and the opportunity to present their work in person to an audience of technical leaders at the company’s HQ in San Diego.

It is therefore with great pleasure that we announce that Siddharth Gupta, a doctoral researcher at PARSA, has been named as one of the recipients of this prestigious prize. 

Siddharth Gupta, supervised by Babak Falsafi and Abhishek Bhattacharjee, has been selected for his proposal: “Rebooting Virtual Memory with Midgard.”

Popular online services have extensive user bases that are generating data at an unprecedented rate. This drastic increase in the dataset sizes has led to servers with TB-scale memory capacity. This proposal addresses the problem that large memories in data centers outgrow Translation Lookaside Buffer (TLB) capacities and need more page table levels in Memory Management Units (MMUs), leading to large latencies near the CPU cores.

Sid proposes to introduce an extra stage of address translation, allowing coarse-grain translation combined with access control near the cores and fine-grain translation to support fragmentation near the memories. The worst latencies are moved away from the CPU cores and are mitigated by the caches, which operate in the intermediate address space.
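The two translation stages can be sketched as follows. This is a loose illustration of the idea as described here, not Midgard's actual design: the `Region` record, the dictionary used as a page table, the set standing in for the cache and all addresses are hypothetical.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096


@dataclass
class Region:
    """A coarse-grain mapping from a virtual range to the intermediate space."""
    virt_base: int
    size: int
    mid_base: int
    writable: bool


def virt_to_intermediate(regions, vaddr, write=False):
    """Stage 1, near the core: range lookup combined with access control."""
    for r in regions:
        if r.virt_base <= vaddr < r.virt_base + r.size:
            if write and not r.writable:
                raise PermissionError("write to read-only region")
            return r.mid_base + (vaddr - r.virt_base)
    raise LookupError("unmapped virtual address")


def intermediate_to_phys(page_table, maddr):
    """Stage 2, near memory: fine-grain, page-by-page lookup."""
    frame = page_table[maddr // PAGE_SIZE]
    return frame * PAGE_SIZE + maddr % PAGE_SIZE


def access(regions, page_table, cached, vaddr):
    """Caches are looked up with intermediate addresses; stage 2 runs on a miss."""
    maddr = virt_to_intermediate(regions, vaddr)
    if maddr in cached:
        return ("hit", maddr)
    return ("miss", intermediate_to_phys(page_table, maddr))


# One 12 KiB region whose pages are scattered (fragmented) in physical memory.
regions = [Region(virt_base=0x10000, size=0x3000, mid_base=0x500000, writable=True)]
page_table = {0x500: 7, 0x501: 2, 0x502: 9}
cached = {0x500008}

print(access(regions, page_table, cached, 0x10008))  # served by the cache
print(access(regions, page_table, cached, 0x11010))  # needs the page-level lookup
```

A single region entry covers the whole range in stage 1, while the page-level table is consulted only when the cache does not already hold the data, which is how the slow part of translation moves away from the core.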
