
INRA-SIEGE

Country: France
173 Projects, page 1 of 35
  • Funder: French National Research Agency (ANR) Project Code: ANR-10-BLAN-0301
    Funder Contribution: 503,689 EUR

    The advent of exascale machines will help solve new scientific challenges only if the resilience of large scientific applications deployed on these machines can be guaranteed. With 10,000,000 or more processor cores, the time interval between two consecutive failures is anticipated to be smaller than the typical duration of a checkpoint, i.e., the time needed to save all necessary application and system data. No actual progress can then be expected for a large-scale parallel application, and current fault-tolerant techniques and tools can no longer be used. The main objective of the RESCUE project is to develop new algorithmic techniques and software tools to solve the "exascale resilience problem". Solving this problem implies a departure from current approaches, and calls for yet-to-be-discovered algorithms, protocols and software tools.

    The proposed research follows three main thrusts. The first thrust deals with novel checkpoint protocols. It will include the classification of relevant fault categories and the development of a software package for fault injection into application execution at runtime. The main research activity will be the design and development of scalable and lightweight checkpoint and migration protocols, with on-the-fly storing of key data, distributed but coordinated decisions, etc. These protocols will be validated via a prototype implementation integrated with the public-domain MPICH project.

    The second thrust entails the development of novel execution models, i.e., accurate stochastic models to predict (and, in turn, optimize) the expected performance (execution time or throughput) of large-scale parallel scientific applications.

    In the third thrust, we will develop novel parallel algorithms for scientific numerical kernels. We will profile a representative set of key large-scale applications to assess their resilience characteristics (e.g., identify specific patterns to reduce checkpoint overhead). We will also analyze execution trade-offs based on the replication of crucial kernels and on decentralized ABFT (Algorithm-Based Fault Tolerance) techniques. Finally, we will develop new numerical methods and robust algorithms that still converge in the presence of multiple failures. These algorithms will be implemented as part of a software prototype, which will be evaluated against realistic faults generated via our fault injection techniques.

    We firmly believe that only the combination of these three thrusts (new checkpoint protocols, new execution models, and new parallel algorithms) can solve the exascale resilience problem. We hope to contribute to the solution of this critical problem by providing the community with new protocols, models and algorithms, as well as with a set of freely available public-domain software prototypes.

    The RESCUE project team comprises well-recognized scientists with complementary expertise who are gathered together for the first time. In addition, the project is conducted in collaboration with a selected team of US leaders: Marc Snir and Bill Gropp at the University of Illinois at Urbana-Champaign (Blue Waters project), and Henri Casanova at the University of Hawaii (models for parallel jobs). The collaboration with Marc Snir and Bill Gropp is conducted under the auspices of the INRIA-Illinois Joint Laboratory at Urbana-Champaign, co-headed by Franck Cappello and Marc Snir. The collaboration with Henri Casanova takes place within a joint INRIA-NSF team. This explains why we did not go through a formal ANR-NSF agreement.
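The abstract's claim that the interval between failures can drop below the checkpoint duration can be illustrated with a rough back-of-the-envelope sketch. It assumes independent, exponentially distributed failures with the same rate on every core, so the platform mean time between failures (MTBF) is the per-core MTBF divided by the core count. The 10-year per-core MTBF and the 10-minute checkpoint duration are hypothetical values chosen for illustration and do not come from the project description; only the 10,000,000 core count does.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def platform_mtbf_seconds(core_mtbf_years: float, n_cores: int) -> float:
    """Platform MTBF in seconds.

    Assumes independent, exponentially distributed failures with an
    identical rate on every core, so failure rates simply add up.
    """
    return core_mtbf_years * SECONDS_PER_YEAR / n_cores


def failure_precedes_checkpoint(platform_mtbf: float, checkpoint_seconds: float) -> bool:
    """True if, on average, a failure strikes before one checkpoint can finish."""
    return platform_mtbf < checkpoint_seconds


if __name__ == "__main__":
    n_cores = 10_000_000          # core count quoted in the abstract
    core_mtbf_years = 10.0        # hypothetical per-core MTBF
    checkpoint_seconds = 600.0    # hypothetical checkpoint duration (10 minutes)

    mtbf = platform_mtbf_seconds(core_mtbf_years, n_cores)
    print(f"Platform MTBF: {mtbf:.1f} s")
    print("Failure expected before a checkpoint completes:",
          failure_precedes_checkpoint(mtbf, checkpoint_seconds))
```

Under these assumed numbers the platform MTBF is on the order of tens of seconds, far shorter than the assumed checkpoint, which is the regime the abstract describes.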

  • Funder: French National Research Agency (ANR) Project Code: ANR-24-RRII-0002
    Funder Contribution: 20,000,000 EUR
  • Funder: French National Research Agency (ANR) Project Code: ANR-23-RDIA-0001
    Funder Contribution: 7,947,440 EUR
  • Funder: French National Research Agency (ANR) Project Code: ANR-23-PEIA-0011
    Funder Contribution: 6,651,360 EUR
  • Funder: French National Research Agency (ANR) Project Code: ANR-22-PTCC-0002
    Funder Contribution: 16,531,200 EUR