Compatibility Check
Can I Run CodeLlama 34B on Apple M1?
Sort of: the Apple M1 can run CodeLlama 34B (Q4_K_M), but only by spilling layers into system RAM, so generation will be slow.
Estimated ~1 token/sec on the Q4_K_M quantization.
Hybrid CPU+GPU
Best variant: Q4_K_M
CPU + GPU hybrid: the GPU's 16 GB falls short of the 22 GB minimum, but 64 GB of system RAM is enough to hold the overflow. Expect significantly slower inference.
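As a rough illustration of how a hybrid split might be sized, the sketch below estimates how many layers could stay on the GPU. The layer count, even per-layer weight split, and VRAM reserve are assumptions for illustration, not figures from this page:

```python
# Sketch: how many transformer layers might fit on a 16 GB GPU.
# All constants are illustrative assumptions, not measured values.

def gpu_layer_split(file_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, keeping a reserve
    for the KV cache and compute buffers."""
    per_layer_gb = file_size_gb / n_layers   # assume weights split evenly
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Q4_K_M file is ~20 GB; assume 48 layers and 16 GB of VRAM.
layers_on_gpu = gpu_layer_split(20.0, 48, 16.0)
print(layers_on_gpu)  # layers not counted here run on the CPU
```

In llama.cpp-style runners, an estimate like this would map onto the `--n-gpu-layers` setting, with the remaining layers executed on the CPU from system RAM.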
| Spec | Value |
|---|---|
| GPU VRAM | 16 GB |
| Min VRAM (best fit) | 22 GB |
| Recommended VRAM | 24 GB |
| Estimated tok/s | ~1 |
Every CodeLlama 34B quantization on Apple M1
Each row runs the compatibility engine against your VRAM, RAM, and the model's requirements.
| Quantization | File Size | Min VRAM | Rec VRAM | Context | Verdict | Estimated tok/s |
|---|---|---|---|---|---|---|
| Q4_K_M (best fit) | 20 GB | 22 GB | 24 GB | 4K / 16K | Hybrid CPU+GPU | ~1 |
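The verdict logic behind the table can be sketched as a simple threshold check. The function below is an assumption about how such a compatibility engine might classify hardware, not the site's actual code; the 2 GB working-memory margin is likewise illustrative:

```python
def verdict(min_vram_gb: float, file_size_gb: float,
            vram_gb: float, ram_gb: float) -> str:
    """Classify how a quantization fits on given hardware (sketch)."""
    if vram_gb >= min_vram_gb:
        return "Full GPU"          # whole model plus cache fits in VRAM
    if vram_gb + ram_gb >= file_size_gb + 2:  # assumed 2 GB margin
        return "Hybrid CPU+GPU"    # spill remaining layers to system RAM
    return "Won't fit"

# Apple M1 row above: 22 GB min VRAM, 20 GB file, 16 GB VRAM, 64 GB RAM.
print(verdict(22, 20, 16, 64))  # → Hybrid CPU+GPU
```

With 24 GB of VRAM the same inputs would clear the 22 GB minimum and return "Full GPU", which is why the recommended-VRAM column lists 24 GB.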
Upgrade options that fit CodeLlama 34B better
Rent a GPU instead of buying one
If the local fit is weak, a cloud GPU gets you running today without a hardware upgrade.