SC Conference - Activity Details
GPUs: TeraFLOPS or TeraFLAWED?
Authors:
Imran S. Haque (Stanford University)
Vijay S. Pande (Stanford University)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Oregon Ballroom Lobby
Abstract:
The lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a barrier to the acceptance of GPUs as high-performance coprocessors, but the impact of this design choice has not been quantified. In this poster we present MemtestG80, our software for assessing memory error rates on NVIDIA G80-architecture-based GPUs. We also present the results of a large-scale assessment of GPU error rates, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network.
Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey of consumer-grade cards on Folding@home found that, in their installed environments, a majority of tested GPUs exhibit a non-negligible, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist even after controlling for overclocking or environmental proxies for temperature, but depend strongly on board architecture.
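
To illustrate what "pattern-sensitive" means here, the following is a minimal sketch in Python of a write-then-verify memory test. It is not MemtestG80's code or algorithm, and it runs on host memory rather than on a GPU. The names `run_pattern` and `stuck_low_bit` are hypothetical, and the fault is simulated, since ordinary host memory would normally report zero errors.

```python
from array import array

WORD_MASK = 0xFFFFFFFF  # treat each element as a 32-bit word


def run_pattern(memory, pattern, fault=None):
    """Write `pattern` to every word, read it back, and count mismatched words.

    `fault` is a hypothetical hook that simulates a soft error occurring
    between the write and the read; real tests have no such hook.
    """
    value = pattern & WORD_MASK
    for i in range(len(memory)):
        memory[i] = value
    if fault is not None:
        fault(memory)
    return sum(1 for word in memory if word != value)


def stuck_low_bit(memory):
    """Simulated fault (assumption): bit 0 of word 3 always reads back as 0."""
    memory[3] &= 0xFFFFFFFE


# The same simulated fault is invisible to one pattern and visible to another.
zeros_errors = run_pattern(array("L", [0] * 16), 0x00000000, stuck_low_bit)
ones_errors = run_pattern(array("L", [0] * 16), 0xFFFFFFFF, stuck_low_bit)
print(zeros_errors, ones_errors)  # prints "0 1"
```

The all-zeros pattern never exercises the faulty bit, so only the all-ones pattern detects it. This is why a memory tester typically cycles through several patterns instead of relying on one.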