Silicon device dependability – talk by visiting professor Sani Nassif

Wednesday, 23rd November, 2022

EcoCloud Visiting Professor Sani Nassif gave a talk in the BC Building at EPFL to a captivated audience.

The talk examined trends in silicon device dependability as scaling continues, and proposed areas of cross-domain research that are needed to keep the information infrastructure functioning in the future.

Prof. Babak Falsafi, Visiting Professor Sani Nassif, Prof. Giovanni de Micheli and Prof. David Atienza
Dr. Nassif explained that Vincent Van Gogh would have made an excellent system designer.

Compusapien: More computing, less energy


Today’s data centres have an efficiency problem – much of their energy is used not to process data, but to keep the servers cool. A new server architecture under development by the EU-funded COMPUSAPIEN project could solve this.

As the digital revolution continues to accelerate, so too does our demand for more computing power. Unfortunately, current semiconductor technology is energy-inefficient, meaning so too are the servers and cloud technologies that depend on them. In fact, as much as 40 % of a server’s energy is used just to keep it cool.

“This problem is aggravated by the fact that the complex design of the modern server results in a high operating temperature,” says David Atienza Alonso, who heads the Embedded Systems Laboratory (ESL) at the Swiss Federal Institute of Technology Lausanne (EPFL). “As a result, servers cannot be operated at their full potential without the risk of overheating and system failures.”

To tackle this problem, the EU has issued several policies addressing the increasing energy consumption of data centres, including the JRC EU Code for Data Centres. According to Atienza Alonso, meeting the goals of these policies requires an overhaul of computing server architecture and the metrics used to measure their efficiency – which is exactly what the COMPUSAPIEN (Computing Server Architecture with Joint Power and Cooling Integration at the Nanoscale) project aims to do.

“The project intends to completely revise the current computing server architecture to drastically improve its energy efficiency and that of the data centres it serves,” explains Atienza Alonso, who serves as the project’s principal investigator.

Cooling conundrum

At the heart of the project, which is supported by the European Research Council, is a disruptive 3D architecture that can overcome the worst-case power and cooling issues that have plagued servers. What makes this design unique is its use of a heterogeneous, many-core architecture template with an integrated on-chip microfluidic fuel cell network, which simultaneously cools and powers the server. According to Atienza Alonso, this design represents the ultimate solution to the server cooling conundrum. “This integrated, 3D cooling approach, which uses tiny microfluidic channels to both cool servers and convert heat into electricity, has proved to be very effective,” he says. “This guarantees that 3D many-core server chips built with the latest nanometre-scale process technologies will not overheat and stop working.”

A greener cloud

Atienza Alonso estimates that the new 3D heterogeneous computing architecture template, which recycles the energy spent in cooling with the integrated microfluidic fuel cell array (FCA) channels, could recover 30-40 % of the energy typically consumed by data centres. With further gains expected as the FCA technology improves, the energy consumption (and environmental impact) of a data centre will be drastically reduced, with more computing being done using the same amount of energy. “Thanks to integration of new optimised computing architectures and accelerators, the next generation of workloads on the cloud (e.g. deep learning) can be executed much more efficiently,” adds Atienza Alonso. “As a result, servers in data centres can serve many more applications using much less energy, thus dramatically reducing the carbon footprint of the IT and cloud computing sector.”




A paradigm shift in virtual memory use: Midgard

Researchers at EcoCloud, the EPFL Center for Sustainable Cloud Computing, have pioneered an innovative approach to implementing virtual memory in data centers, which will greatly increase server efficiency.

Virtual memory has always been a pillar for memory isolation, protection and security in digital platforms. The use of virtual memory is non-negotiable, even in widely used hardware accelerators like GPUs, NICs, FPGAs and secure CPU architectures. It is therefore vital that silicon be used as frugally as possible.

As services host more data in server memory for faster access, the traditional virtual memory technologies that look up data in server memory and check for protection have emerged as a bottleneck. Modern graph analytics workloads (e.g., on social media) spend over 20% of their time in virtual memory translation and protection checks. Server virtualization for cloud computing, to help increase utilization of infrastructure and return on investment in data centers, dramatically exacerbates this problem by requiring lookups and protection checks across multiple layers of guest (customer) and host (cloud provider) software.

The way in which virtual memory is assigned in these servers is critical because, with such huge quantities of data involved, changes in strategy can have a massive effect on server efficiency and data security.

“Virtual memory technology has been around since the 1960’s, and lays the foundation for memory protection and security in modern digital platforms,” write the authors of “Rebooting Virtual Memory with Midgard”, a paper they will present next month at ISCA’21, the flagship conference in computer architecture.

Memory has become the most precious silicon product in data centers in recent years, as more services are brought online. Virtual memory traditionally divides up the physical storage into fixed-size units, for optimal capacity management. This division slows down lookups and protection checks as memory capacity increases, because large regions of memory in application software (e.g., GBs) are broken up into millions of pages (e.g., KBs). Modern chips (e.g., the recently announced Apple M1) employ thousands of table entries per processor to do lookups and perform protection checks for each memory access.
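To make the scale of this division concrete, here is a small back-of-the-envelope sketch. The 8 GiB region, the 4 KiB page size and the 2,048-entry lookup table are illustrative assumptions, not figures taken from the paper.

```python
# Illustrative arithmetic only: region size, page size and table size
# are assumed values chosen to match the orders of magnitude in the text.
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of fixed-size pages needed to cover a region (rounded up)."""
    return -(-region_bytes // page_bytes)

def table_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory covered by a lookup table with one page per entry."""
    return entries * page_bytes

region = 8 * GIB      # assumed application memory region
page = 4 * KIB        # assumed page size
entries = 2048        # assumed number of table entries per processor

n_pages = pages_needed(region, page)
reach = table_reach(entries, page)

print(f"pages to cover the region: {n_pages:,}")
print(f"memory covered by the table: {reach // MIB} MiB")
print(f"fraction of the region covered: 1/{region // reach}")
```

Under these assumptions the region splits into roughly two million pages, while a table with a few thousand entries covers only a small fraction of it at any one time.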

Namespaces are used to store unique references for data, in structured hierarchies. Removing some of this hierarchy and reducing the number of translations would represent a net gain in efficiency. The authors propose Midgard, which introduces a namespace for data lookup and memory protection checks in the memory system without making any modifications to the application software or the programming interface in modern platforms (e.g., Linux, Android, macOS/iOS).

With Midgard, data lookups and protection checks are done directly in the Midgard namespace in on-chip memory, and a translation to fixed-size pages is only needed for access to physical memory. Traditional virtual memory has an overhead that grows with memory capacity. Midgard, by contrast, future-proofs virtual memory: as on-chip memory capacity grows in future products, it filters more of the traffic to physical memory, so the overhead of translation and protection checks for physical memory decreases.
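The flow described above can be pictured with a toy model. This is a loose sketch under stated assumptions, not the design from the paper: the class names (`Region`, `ToyMidgard`), the 64-byte cache line, the 4 KiB page and the unbounded cache are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

PAGE = 4096   # assumed page size in bytes
LINE = 64     # assumed cache line size in bytes

@dataclass
class Region:
    """A contiguous range of application memory (hypothetical structure)."""
    virt_base: int
    size: int
    mid_base: int     # where the region starts in the intermediate namespace
    writable: bool

@dataclass
class ToyMidgard:
    regions: list                 # list of Region objects
    page_table: dict              # intermediate page number -> physical frame
    cache: set = field(default_factory=set)   # cached intermediate line numbers
    page_translations: int = 0    # page-level lookups actually performed

    def to_midgard(self, vaddr: int, write: bool = False) -> int:
        """Region-level lookup and protection check (no page tables involved)."""
        for r in self.regions:
            if r.virt_base <= vaddr < r.virt_base + r.size:
                if write and not r.writable:
                    raise PermissionError("write to read-only region")
                return r.mid_base + (vaddr - r.virt_base)
        raise LookupError("address not mapped")

    def access(self, vaddr: int, write: bool = False) -> bool:
        """Return True on a cache hit; translate to a page only on a miss."""
        maddr = self.to_midgard(vaddr, write)
        line = maddr // LINE
        if line in self.cache:
            return True
        # Miss: only now is the fixed-size page translation needed.
        self.page_translations += 1
        _frame = self.page_table[maddr // PAGE]
        self.cache.add(line)
        return False

base = 0x800000
mem = ToyMidgard(
    regions=[Region(virt_base=0x10000, size=2 * PAGE, mid_base=base, writable=False)],
    page_table={base // PAGE: 7, base // PAGE + 1: 9},
)
first = mem.access(0x10010)    # miss: one page translation
second = mem.access(0x10020)   # same cache line: hit, no page translation
print(first, second, mem.page_translations)
```

In this sketch the page table is consulted only when the cache misses, so a larger cache means fewer page-level translations, which is the intuition behind the scaling argument in the text.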

Analytic and empirical results described in the paper show remarkable performance from Midgard when compared to traditional technology, or even rival new technologies (e.g., the larger fixed-size pages used in certain applications). At low loads the Midgard system was 5% behind standard performance, but with 256 MB of aggregate cache capacity it can match and even outperform traditional systems in terms of virtual memory overheads.

Figure 1: The Average Memory Access Time for address translations on low-memory, high-memory and Midgard systems.

The authors conclude: “This paper is the first of several steps needed to demonstrate a fully working system with Midgard. We focused on a proof-of-concept software-modelled prototype of key architectural components. Future work will address the wide spectrum of topics needed to realize Midgard in real systems.”

Rebooting Virtual Memory with Midgard
S. Gupta; A. Bhattacharyya; Y. Oh; A. Bhattacharjee; B. Falsafi and M. Payer
ISCA 2021 48th International Symposium on Computer Architecture, Online conference, June 14-19, 2021.

Midgard etymology: a middle realm between heaven (Asgard) and hell (Helheim)
