Large scale infrastructures for distributed and parallel computing offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry and development, cloud infrastructures and computing centers are being forced to increase in size and to transition to new computing technologies. While the advantage for the users is clear, such evolution imposes significant challenges, such as energy consumption and fault tolerance. Fault tolerance is even more critical in infrastructures built on commodity hardware. Recent works have shown that large scale machines built with commodity hardware experience more failures than previously thought.
Leonardo Bautista Gomez, senior Researcher at the Barcelona Supercomputing Center, will focus on how to guarantee high reliability to high-performance applications running in large infrastructures. In particular, they will cover all the technical content necessary to implement scalable multilevel checkpointing for tightly coupled applications. This will include an overview of the internals of the FTI library, and explain how multilevel checkpointing is implemented today, together with examples that the audience can test and analyze on their own laptops, so that they learn how to use FTI in practice, and ultimately transfer that knowledge to their production systems.