SC Conference - Activity Details
Optimal Real Number Codes for Fault Tolerant Matrix Operations
Author:
Zizhong Chen
(Colorado School of Mines)
Papers Session
Sustainability and Reliability
Thursday, 11:30AM - 12:00PM
Room PB251
Abstract:
It has recently been demonstrated that a single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing.
Multiple simultaneous processor failures can also be tolerated without checkpointing by encoding matrices with a real-number erasure-correcting code.
However, the floating-point representation of real numbers in today's high-performance computer architectures introduces round-off errors that can be amplified during recovery and, when the number of processors in the system is large, can cause the loss of possibly all significant digits.
In this paper, we present a class of Reed-Solomon-style real-number erasure-correcting codes that have optimal numerical stability during recovery.
We analytically construct the numerically best erasure-correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures.
Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
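
The stability concern described in the abstract can be illustrated with a small sketch. It is not the construction from the paper: the function names (`encode`, `recover_two`, `cond2`, `vandermonde_weights`, `angle_weights`) and both weight choices are assumptions made for illustration only. Each data value stands in for one processor's data, two weighted checksums are kept, and two lost values are recovered by solving a 2x2 linear system. The condition number of that 2x2 system bounds how strongly round-off errors are amplified during recovery.

```python
import math

def encode(data, weights):
    """Return one weighted checksum per row of weights."""
    return [sum(w * x for w, x in zip(row, data)) for row in weights]

def recover_two(data, weights, checksums, p, q):
    """Recover data[p] and data[q] from two checksums.

    The entries data[p] and data[q] are treated as lost and never read.
    """
    rhs = []
    for row, c in zip(weights, checksums):
        # Remove the contribution of the surviving entries.
        known = sum(w * x for j, (w, x) in enumerate(zip(row, data))
                    if j != p and j != q)
        rhs.append(c - known)
    # Solve the remaining 2x2 system with Cramer's rule.
    a, b = weights[0][p], weights[0][q]
    c, d = weights[1][p], weights[1][q]
    det = a * d - b * c
    return (rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det

def cond2(weights, p, q):
    """2-norm condition number of the 2x2 recovery matrix for erasures p, q."""
    a, b = weights[0][p], weights[0][q]
    c, d = weights[1][p], weights[1][q]
    frob2 = a * a + b * b + c * c + d * d
    det = abs(a * d - b * c)
    # Largest squared singular value; cond = s1 / s2 = s1^2 / |det|.
    s1_sq = (frob2 + math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))) / 2.0
    return s1_sq / det

def vandermonde_weights(n):
    """Hypothetical baseline: rows (1, 1, ..., 1) and (1, 2, ..., n)."""
    return [[1.0] * n, [float(j + 1) for j in range(n)]]

def angle_weights(n):
    """Hypothetical alternative: unit vectors at evenly spaced angles in [0, pi)."""
    thetas = [math.pi * j / n for j in range(n)]
    return [[math.cos(t) for t in thetas], [math.sin(t) for t in thetas]]

if __name__ == "__main__":
    n = 1000
    data = [math.sin(j) for j in range(n)]
    p, q = n - 2, n - 1  # two adjacent "failed processors"
    for name, make in (("vandermonde", vandermonde_weights),
                       ("angle", angle_weights)):
        w = make(n)
        xp, xq = recover_two(data, w, encode(data, w), p, q)
        err = max(abs(xp - data[p]), abs(xq - data[q]))
        print(name, "cond:", cond2(w, p, q), "recovery error:", err)
```

With these assumed weights, the recovery matrix for two adjacent erasures is far worse conditioned in the Vandermonde-style baseline than in the unit-vector variant, and the gap widens as `n` grows. This only demonstrates why the choice of code matters numerically; it makes no claim about which codes the paper proves optimal.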