SC is the International Conference for High Performance Computing, Networking, Storage and Analysis

SC Conference - Activity Details



Supporting Fault-Tolerance for Time-Critical Events in Distributed Environments

Authors:
Qian Zhu (Ohio State University)
Gagan Agrawal (Ohio State University)
Papers Session
Grid Scheduling
Wednesday, 02:30PM - 03:00PM
Room PB256
Abstract:
In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting the time deadline. Our first contribution in this paper is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success-rate, which is the probability of finishing execution without failures. For the cases where failures do occur, we have developed a hybrid failure-recovery scheme to ensure that the application can still complete within the time interval. Our experimental results show that our scheduling algorithm achieves better benefit than several heuristic-based greedy scheduling algorithms, with negligible overhead. Benefit is further improved by the hybrid failure-recovery scheme, and the success-rate becomes 100%.
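To make the trade-off between benefit and success-rate concrete, here is a minimal Python sketch. It is not the authors' algorithm: it enumerates every subset of a small resource pool by brute force and keeps the Pareto-optimal ones. The `Resource` fields, the benefit proxy (work completed by the deadline), and the assumption that resources fail independently, so that the success-rate is the product of per-resource reliabilities, are all illustrative assumptions rather than details from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from math import prod


@dataclass(frozen=True)
class Resource:
    name: str
    speed: float        # hypothetical: work units processed per second
    reliability: float  # hypothetical: probability of no failure before the deadline


def candidates(pool, deadline):
    """Yield (subset, benefit, success_rate) for every non-empty subset of the pool."""
    for k in range(1, len(pool) + 1):
        for subset in combinations(pool, k):
            # Assumed benefit proxy: total work finished by the deadline.
            benefit = sum(r.speed for r in subset) * deadline
            # Assumed independent failures: the run succeeds only if no resource fails.
            success_rate = prod(r.reliability for r in subset)
            yield subset, benefit, success_rate


def pareto_front(options):
    """Keep the options that no other option beats on both objectives."""
    options = list(options)
    front = []
    for subset, b, p in options:
        dominated = any(
            b2 >= b and p2 >= p and (b2 > b or p2 > p)
            for _, b2, p2 in options
        )
        if not dominated:
            front.append((subset, b, p))
    return front


pool = [
    Resource("a", speed=10.0, reliability=0.99),
    Resource("b", speed=8.0, reliability=0.90),
    Resource("c", speed=2.0, reliability=0.50),
]

for subset, benefit, success_rate in pareto_front(candidates(pool, deadline=10.0)):
    names = ",".join(r.name for r in subset)
    print(f"{names}: benefit={benefit:.1f}, success_rate={success_rate:.3f}")
```

Adding a resource raises the benefit proxy but lowers the success-rate, so no single schedule wins on both objectives. A scheduler then has to pick one point on the front, for example the highest-benefit schedule whose success-rate stays above a threshold. Exhaustive enumeration is exponential in the pool size and is used here only to keep the example short.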
The full paper can be found in the ACM Digital Library and IEEE Computer Society
Sponsors: ACM, IEEE