Running Qwen2.5-32B on RTX 4060 8GB — Beating M4 at 10.8 t/s with llama.cpp
Source: DEV Community
My laptop has an RTX 4060 with 8 GB of VRAM. It's the spec people call "the short straw" for running local LLMs. Still, I wanted to run a 32B model. I'd tried the 7B class. It works, but when you use it for coding assistance, you start running into quality issues. On the other hand, hitting an API racks up monthly costs, and there are times I want to work offline.

I'm aware of the prevailing sentiment that "32B on 8GB is impossible": the entire model's layers won't fit on the GPU. But I'd heard that llama.cpp's hybrid inference (GPU+CPU split) had gotten considerably better over the past year, so I decided to give it a shot with nothing to lose.

Why llama.cpp

There are other inference engines for local LLMs. Ollama is popular and easy to set up, and vLLM has high throughput. I tried Ollama first. Setup was indeed easy, but it doesn't give you fine-grained control over -ngl (the number of layers offloaded to the GPU). With 8 GB of VRAM, I want to tune this one layer at a time, but Ollama decides a
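The layer-by-layer tuning that llama.cpp makes possible can be sketched as a simple loop. This is not the article's exact setup: the GGUF filename and the candidate -ngl values are assumptions for illustration (llama.cpp's -ngl flag itself is real and sets the number of layers offloaded to the GPU).

```shell
# Hypothetical quantized GGUF filename (assumption, not from the article).
MODEL=Qwen2.5-32B-Instruct-Q4_K_M.gguf

# Try increasing GPU offload one step at a time. In practice you would run
# each command and watch llama.cpp's VRAM log (or nvidia-smi) for an
# out-of-memory failure, then keep the largest -ngl that still fits in 8 GB.
for NGL in 16 18 20 22 24; do
  echo "llama-cli -m $MODEL -ngl $NGL -c 4096 -p 'test prompt'"
done
```

The point is the granularity: each -ngl step moves exactly one transformer layer between CPU RAM and VRAM, which is the knob Ollama hides.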