A $550 AMD GPU should be enough to do useful local AI.
But if you actually try it, the stack quickly reminds you that most local inference software was built with NVIDIA in mind. ROCm is uneven on consumer AMD. CUDA assumptions leak into everything. Vulkan paths exist, but they often feel like fallback code, not something anyone is seriously trying to win with.
That is why I started building ZINC, an LLM inference engine in Zig and Vulkan for AMD consumer GPUs.
The hardware story is not the problem. An RX 9070 XT costs about $550 and has the kind of bandwidth and compute that should make it a real local inference card. A Radeon AI PRO R9700 gives you 32 GB of VRAM on the same RDNA4 family. The gap is the software.
That gap is expensive in a very annoying way. You can have a perfectly capable AMD GPU in your machine and still feel locked out of the modern local LLM stack. "Run it locally" too often means "run it locally, as long as your hardware matches the ecosystem's defaults."
The project is aimed at practical local inference and serving on AMD consumer hardware, especially RDNA3 and RDNA4:
- GGUF model loading
- Vulkan compute instead of ROCm or CUDA
- OpenAI-compatible API
- a runtime that treats AMD as the target instead of the fallback
Current benchmarks are measured on an AMD Radeon AI PRO R9700, a 32 GB RDNA4 card. The two models in the current baseline are:
- Qwen3.5-35B-A3B-UD-Q4_K_XL at 33.58 tok/s decode in the CLI path
- Qwen3.5-2B-Q4_K_M at 22.93 tok/s decode in the CLI path
On the raw OpenAI-compatible HTTP path, the current numbers are 33.55 tok/s for the 35B model and 21.88 tok/s for the 2B model.
The 35B result matters most to me because it proves the engine can already do serious work on a single RDNA4 card. I am not claiming that exact 35B setup already runs on a $550 16 GB card today. The point is simpler: consumer-price AMD hardware should be much more useful for local AI than the current software stack allows.
That is also why Zig has been such a good fit.
This project lives in the part of the stack where language choice actually matters. The hot path is Vulkan calls, command recording, synchronization, descriptor management, buffer lifetime, shader packaging, and build orchestration. That is systems work. Zig gives me direct C-ABI access through @cImport, explicit memory control, defer and errdefer for resource cleanup, and a build system that can treat GLSL-to-SPIR-V compilation as part of the program instead of a pile of side scripts. It stays out of the way in exactly the places this project needs it to.
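To make that concrete, here is a simplified sketch (my own illustration, not ZINC's actual code; `bindBufferMemory` is a hypothetical helper) of the pattern that makes Vulkan resource handling pleasant in Zig: errdefer unwinds a partially acquired resource on any later failure, with no manual cleanup ladder.

```zig
const std = @import("std");
const c = @cImport({
    @cInclude("vulkan/vulkan.h");
});

// Hypothetical sketch: create a storage buffer, destroying it
// automatically if any step after creation fails.
fn createDeviceBuffer(dev: c.VkDevice, size: c.VkDeviceSize) !c.VkBuffer {
    var info = std.mem.zeroes(c.VkBufferCreateInfo);
    info.sType = c.VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    info.size = size;
    info.usage = c.VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;

    var buf: c.VkBuffer = undefined;
    if (c.vkCreateBuffer(dev, &info, null, &buf) != c.VK_SUCCESS)
        return error.VulkanCreateBuffer;
    // If anything below fails, the buffer is destroyed before the
    // error propagates -- the error path and happy path share one body.
    errdefer c.vkDestroyBuffer(dev, buf, null);

    try bindBufferMemory(dev, buf); // hypothetical helper; may fail
    return buf;
}
```

In C, the same function needs a goto-cleanup ladder or duplicated teardown in every error branch; in Zig the unwind logic lives next to the acquisition it protects.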
There is also a broader reason I like building this in Zig: the ecosystem is still early enough that projects like this can help define what native AI systems programming in Zig looks like. A lot of AI infrastructure today is Python on top and C++/CUDA underneath. That works, but it also leaves a lot of hidden complexity and a lot of software that is harder to read and modify than it should be. Zig feels like a credible path for smaller, more direct AI runtimes.
The early work on ZINC has been much less glamorous than the hardware thesis.
The first version did not even run all 40 layers of Qwen3.5. It embedded the token, ran a final norm, projected to logits, and still produced text that looked alive enough to fool me for a minute. Later I found an LM head dispatch bug that left 97% of the vocabulary exactly zero. A flash attention binding bug read K-cache floats as page IDs and hung the GPU. The tokenizer skipped GPT-2 byte-to-unicode remapping and turned a five-token prompt into nine tokens. That phase of the project taught me something important: in inference, "it prints text" proves almost nothing.
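For context on that tokenizer bug: GPT-2-style BPE first remaps every raw byte to a printable Unicode codepoint, so the merge table never contains control bytes or raw whitespace. Skipping this remap makes otherwise-valid prompts tokenize into the wrong pieces. A minimal sketch of the standard mapping (my own illustration, not ZINC's code):

```zig
// GPT-2 byte-to-unicode remap: printable bytes map to themselves,
// everything else gets shifted into the 256+ codepoint range.
fn byteToUnicode() [256]u21 {
    var table: [256]u21 = undefined;
    var n: u21 = 0;
    for (0..256) |b| {
        const byte: u21 = @intCast(b);
        const printable = (byte >= '!' and byte <= '~') or
            (byte >= 0xA1 and byte <= 0xAC) or
            (byte >= 0xAE and byte <= 0xFF);
        if (printable) {
            table[b] = byte;
        } else {
            table[b] = 256 + n; // e.g. space (0x20) becomes U+0120
            n += 1;
        }
    }
    return table;
}
```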
Now ZINC produces coherent output on real RDNA4 hardware and runs a full Qwen3.5-35B forward pass with attention, MoE routing, and SSM layers. The biggest bottleneck is decode overhead, not lack of GPU capability. Right now the runtime is still paying too many GPU/CPU synchronization costs per token. The next major step is to collapse that into a pre-recorded decode graph instead of thrashing through per-layer submission and readback.
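The shape of that change, as a hedged sketch rather than ZINC's actual decode loop (`Ctx`, `decode_cmd`, and `token_uniform` are illustrative names): the whole forward pass is recorded into one reusable command buffer at load time, and each decoded token only rewrites a small host-visible buffer and resubmits, leaving a single fence wait per token instead of one per layer.

```zig
// Illustrative sketch only. Assumes ctx.decode_cmd was recorded once,
// covering all layers, and ctx.token_uniform points into mapped memory.
fn decodeToken(ctx: *Ctx, token_index: u32) !void {
    // Token-dependent state lives in a small host-visible buffer,
    // so the pre-recorded command buffer never needs re-recording.
    ctx.token_uniform.* = token_index;

    var submit = std.mem.zeroes(c.VkSubmitInfo);
    submit.sType = c.VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &ctx.decode_cmd;

    if (c.vkQueueSubmit(ctx.queue, 1, &submit, ctx.fence) != c.VK_SUCCESS)
        return error.VulkanSubmit;
    _ = c.vkWaitForFences(ctx.device, 1, &ctx.fence, c.VK_TRUE, std.math.maxInt(u64));
    _ = c.vkResetFences(ctx.device, 1, &ctx.fence);
    // One CPU/GPU sync point per token, then a single logits readback.
}
```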
The interesting thing about the current numbers is that the smaller model is not faster. On this node, the 2B path is slower than the 35B MoE path. That is a useful sign that today's bottleneck is in the decode path and kernel regime, not just in raw parameter count.
That is where contributors can make a real difference.
The most useful help right now is concrete systems work:
- Vulkan command recording and synchronization cleanup
- reducing decode-time CPU/GPU round-trips
- shader work on quantized DMMV and fused kernels
- correctness validation against reference runtimes
- reproducible benchmark setups on RDNA3 and RDNA4
- docs that make the project easier to run and reason about
The codebase is still small enough to read:
- about 5k lines of Zig
- about 2k lines of GLSL
That is intentional. I want this to stay understandable.
I think local AI gets much healthier when it is not effectively locked to one vendor, one driver stack, or one set of assumptions about who the "real" user is. If you care about Zig, Vulkan, AMD GPUs, or local inference on hardware people actually own, I would like your help building that alternative.