HomeSC is the International Conference for
 High Performnance Computing, Networking, Storage and Analysis
scyourway

SC Conference - Activity Details



FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments

Authors:
Tanzima Z Islam  (Purdue University)
Saurabh Bagchi  (Purdue University)
Rudolf Eigenmann  (Purdue University)
Papers Session
Sustainability and Reliability
Thursday,  10:30AM - 11:00AM
Room PB251
Abstract:
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with FALCON in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
The full paper can be found in the ACM Digital Library and IEEE Computer Society
   Sponsors    ACM    IEEE