GPU Problem #1: Why Your PyTorch Training Runs Out of GPU Memory (and How to Actually Debug It)

Source: DEV Community
TL;DR

Your PyTorch training crashes with `CUDA error: out of memory` at 60-70% GPU memory utilization. `nvidia-smi` says you have free memory. `torch.cuda.memory_summary()` shows fragmented blocks. But neither tool tells you why it happened or when it started. Ingero traces every `cudaMalloc` and `cudaFree` call at the kernel level, showing the exact allocation pattern that caused fragmentation — and which line of your Python code triggered it.

The Problem

You're training a model. It works fine for hours, then suddenly:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 15.90 GiB total capacity; 10.24 GiB already allocated; 1.89 GiB free;
11.52 GiB reserved)
```

Wait — 1.89 GiB free, but it can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduc…
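The failure mode is easy to reproduce with a toy free-list allocator. This is a deliberately simplified model of how fragmentation works, not PyTorch's actual caching allocator; all names here (`ToyAllocator`, etc.) are illustrative:

```python
class ToyAllocator:
    """Minimal first-fit allocator over a flat address range (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.allocs = {}   # handle -> (offset, size)
        self.next_id = 0

    def _free_ranges(self):
        """Yield (offset, size) of each contiguous free gap."""
        cursor = 0
        for off, size in sorted(self.allocs.values()):
            if off > cursor:
                yield (cursor, off - cursor)
            cursor = off + size
        if cursor < self.capacity:
            yield (cursor, self.capacity - cursor)

    def malloc(self, size):
        # First-fit: take the first gap large enough for the request.
        for off, gap in self._free_ranges():
            if gap >= size:
                self.next_id += 1
                self.allocs[self.next_id] = (off, size)
                return self.next_id
        raise MemoryError(
            f"cannot allocate {size}: free={self.free_total()} "
            f"largest_block={self.largest_free_block()}"
        )

    def free(self, handle):
        del self.allocs[handle]

    def free_total(self):
        return sum(gap for _, gap in self._free_ranges())

    def largest_free_block(self):
        return max((gap for _, gap in self._free_ranges()), default=0)


pool = ToyAllocator(capacity=16)
# Allocate eight 2-unit blocks back to back, then free every other one.
# This leaves plenty of free memory -- but only in 2-unit gaps.
handles = [pool.malloc(2) for _ in range(8)]
for h in handles[::2]:
    pool.free(h)

print(pool.free_total())          # -> 8 units free in total...
print(pool.largest_free_block())  # -> ...but no gap larger than 2
try:
    pool.malloc(4)                # fails despite 8 free units
except MemoryError as e:
    print("OOM:", e)
```

The `malloc(4)` at the end fails even though half the pool is free — the same shape of failure as the traceback above, where 1.89 GiB free cannot satisfy a 256 MiB request because no single block is that large.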