SC is the International Conference for High Performance Computing, Networking, Storage and Analysis

SC Conference - Activity Details



Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems

Authors:
Xiangyu Dong  (Pennsylvania State University)
Naveen Muralimanohar  (HP Labs)
Norm Jouppi  (HP Labs)
Richard Kaufmann  (HP Labs)
Yuan Xie  (Pennsylvania State University)
Papers Session
Future HPC Architectures
Thursday,  04:30PM - 05:00PM
Room PB256
Abstract:
The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. With checkpoint frequency directly correlated with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to implementing a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP system failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes (DIMM+HDD, DIMM+DIMM, and 3D+3D) to reduce the checkpoint overhead and offer a smooth transition from conventional pure-HDD checkpointing to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with an overhead of less than 4% on a projected exascale system.
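The link between node count, checkpoint frequency, and overhead can be illustrated with a rough sketch. It uses Young's classic first-order approximation, which is a textbook model and not the analysis from the paper. It also assumes independent node failures, so that system MTBF shrinks in proportion to node count. All function names and input values are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_s, node_count):
    """System MTBF assuming independent, identically failing nodes (an assumption)."""
    return node_mtbf_s / node_count


def young_interval(checkpoint_time_s, mtbf_s):
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def overhead_fraction(checkpoint_time_s, mtbf_s, interval_s=None):
    """First-order estimate of lost time: checkpoint writes plus rework after failures.

    Only meaningful when checkpoint_time_s is much smaller than mtbf_s.
    """
    if interval_s is None:
        interval_s = young_interval(checkpoint_time_s, mtbf_s)
    write_cost = checkpoint_time_s / interval_s   # time spent saving state
    rework_cost = interval_s / (2.0 * mtbf_s)     # expected recomputation after a failure
    return write_cost + rework_cost


if __name__ == "__main__":
    node_mtbf_s = 5 * 365 * 24 * 3600  # hypothetical: 5 years per node
    for nodes in (10_000, 100_000):
        mtbf = system_mtbf(node_mtbf_s, nodes)
        # Hypothetical checkpoint write times: slow global store vs. fast local store.
        for label, ckpt_s in (("slow", 600.0), ("fast", 6.0)):
            tau = young_interval(ckpt_s, mtbf)
            ovh = overhead_fraction(ckpt_s, mtbf)
            print(f"nodes={nodes} {label}: interval={tau:.0f}s overhead={ovh:.1%}")
```

Under this model the optimal interval shrinks as the node count grows, and the overhead scales with the square root of the checkpoint write time. That is consistent with the abstract's motivation for a faster checkpoint medium.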
The full paper can be found in the ACM Digital Library and from the IEEE Computer Society.
Sponsors: ACM, IEEE