Abstract
The advent of extreme-scale computing platforms will require the use of parallel resources at an unprecedented scale. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically increase their sensitivity to natural radiation, leading to a high rate of hardware faults and thus diminishing their reliability. Fully handling these faults at the computer system level may have a prohibitive computational and energy cost.
High performance computing applications that aim at exploiting all these resources will thus need to be resilient.
In this talk, we will first give an overview of the current trends towards exascale. We will discuss the new challenges to face in terms of platform reliability and the associated variety of possible faults. We will then review some of the solutions that have been proposed to tackle these errors before discussing in more detail some contributions in sparse numerical linear algebra.
First, in the context of computing node crashes, we will discuss possible remedies in the framework of linear system or eigenproblem solutions, which are the innermost numerical kernels in many scientific and engineering applications and also among the most time-consuming parts.
Second, we will discuss a somewhat more challenging problem related to silent transient soft errors, which are produced by natural radiation and consist of a bit-flip in a memory cell that leads to unexpected results at the application level. In that context we will consider the conjugate gradient (CG) method, which is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. We will investigate through extensive numerical experiments the sensitivity of CG to bit-flips and further discuss possible numerical criteria to detect the occurrence of such faults.
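To make the bit-flip setting concrete, here is a minimal sketch, not the experiments or the criteria of the talk. It assumes a dense NumPy matrix, a single bit-flip injected into one entry of the matrix-vector product, and one candidate detection criterion: the gap between the recursively updated residual and the true residual b - Ax. The names flip_bit, cg_with_fault and check_tol, and the test matrix, are hypothetical choices made for illustration.

```python
import struct
import numpy as np

def flip_bit(x, bit):
    """Return the float64 value x with one bit (0..63) flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def cg_with_fault(A, b, fault_iter=None, fault_index=0, fault_bit=52,
                  tol=1e-10, max_iter=200, check_tol=1e-6):
    """Plain CG on A x = b with an optional injected bit-flip.

    Returns (x, detected_at), where detected_at is the first iteration at
    which the residual gap exceeds check_tol, or None if it never does.
    """
    x = np.zeros_like(b)
    r = b - A @ x              # recursively updated residual
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    detected_at = None
    for k in range(max_iter):
        Ap = A @ p
        if k == fault_iter:
            # Assumed fault model: one bit-flip in one entry of A @ p.
            Ap[fault_index] = flip_bit(Ap[fault_index], fault_bit)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        # Candidate criterion (an assumption): relative gap between the
        # true residual and the recursive one. It costs an extra product.
        gap = np.linalg.norm((b - A @ x) - r) / b_norm
        if detected_at is None and (not np.isfinite(gap) or gap > check_tol):
            detected_at = k
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, detected_at

# Small symmetric positive definite test problem (illustrative only).
n = 50
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
_, clean = cg_with_fault(A, b)
_, faulty = cg_with_fault(A, b, fault_iter=1, fault_index=n // 2)
print("fault-free run flagged at:", clean)
print("faulty run flagged at:", faulty)
```

Flipping bit 52, the lowest exponent bit, doubles or halves the affected entry; other bit positions perturb it by very different amounts, which is one reason sensitivity depends on where the flip lands. Evaluating the true residual at every iteration doubles the number of matrix-vector products, so a practical variant would check it only periodically.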
The above-mentioned research activities have been conducted in collaboration with many colleagues, including E. Agullo (Inria), S. Cools (University of Antwerpen), E. Fatih-Yetkin (Kadir Has University), P. Salas (CERFACS), W. Vanroose (University of Antwerpen) and M. Zounon (NAG).