
How we fixed CUDA Error 101: invalid device ordinal ... torch._C._cuda_getDeviceCount() > 0 🤯
This article describes how a team recovered a server after one of its eight GPUs went offline because of a loose power connector. Attempts to work around the failure through configuration changes did not succeed. What worked was unbinding the faulty GPU from the NVIDIA driver directly, a quick fix that brought the server back into service without a reboot. The takeaway is that a simple, direct fix is often the most effective one when troubleshooting.
Saravana Rathinam
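
The fix summarized above, unbinding a single GPU from the NVIDIA driver, is usually done on Linux through the kernel's sysfs interface. The summary does not show the exact commands the team ran, so the following is only a minimal sketch under stated assumptions: the driver is registered under the name `nvidia`, the script runs as root, and the PCI address in the usage note is a hypothetical placeholder. The helper name `unbind_gpu` and its `sysfs_root` parameter are illustrative, not from the article.

```python
import re
from pathlib import Path

# Full PCI address: domain:bus:device.function, e.g. 0000:3b:00.0
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def unbind_gpu(pci_addr: str, sysfs_root: str = "/sys") -> bool:
    """Ask the kernel to detach one PCI device from the nvidia driver.

    Returns True if an unbind request was written, False if the device
    was not bound to the driver in the first place.
    Assumption: the driver directory is named 'nvidia'.
    """
    if not PCI_ADDR.match(pci_addr):
        raise ValueError(f"not a full PCI address: {pci_addr!r}")

    driver_dir = Path(sysfs_root) / "bus" / "pci" / "drivers" / "nvidia"

    # A bound device appears in the driver directory under its address.
    if not (driver_dir / pci_addr).exists():
        return False

    # Writing the address to 'unbind' requests the detach.
    (driver_dir / "unbind").write_text(pci_addr)
    return True
```

On a real machine this would be called as `unbind_gpu("0000:3b:00.0")`, with the address replaced by that of the faulty card (for example, as listed by `lspci`). The `sysfs_root` parameter exists only so the function can be exercised against a temporary directory instead of the live `/sys` tree.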