
Build from Source

llama.cpp is a C/C++ inference engine that underpins many popular local LLM tools, including Ollama and LM Studio. Building from source gives you the latest optimizations and lets you enable the backend (CUDA, Metal, and others) that matches your hardware.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# For NVIDIA GPU support (CUDA):
make -j GGML_CUDA=1

# For Apple Silicon (Metal, enabled by default on arm64 macOS; the flag just makes it explicit):
make -j GGML_METAL=1
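
Note that recent llama.cpp releases have deprecated the Makefile in favor of CMake. If make fails on a newer checkout, the equivalent CMake build is roughly as follows (binaries end up in build/bin/ rather than the repo root):

# Equivalent CMake build (CPU only)
cmake -B build
cmake --build build --config Release -j

# With CUDA enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j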

Download a GGUF Model

Models come in GGUF format at different quantization levels. Q4_K_M offers a good balance of quality and size for most users; lower quantizations shrink the file further at some cost in output quality.

# Example: download Llama 3.1 8B Q4_K_M from HuggingFace
# Use huggingface-cli or wget
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir models/
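
If huggingface-cli is not installed, the same file can be pulled directly with wget; Hugging Face serves raw repository files under the resolve/main/ path:

# Same file via a direct download into models/
wget -P models/ \
  https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf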

Run the CLI

Use llama-cli for interactive chat or llama-server for an OpenAI-compatible API.

# Interactive chat (-cnv puts llama-cli in conversation mode)
./llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -c 8192 -cnv --chat-template llama3

# Start an OpenAI-compatible API server
./llama-server -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 8080 -c 8192
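
Once the server is running, any OpenAI-compatible client can point at it. As a quick smoke test with the port above, you can hit the /v1/chat/completions endpoint with curl (the model field is a placeholder; the server answers with whatever model it has loaded):

# Smoke-test the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'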

GPU Offloading

Use -ngl (number of GPU layers) to control how many model layers run on the GPU. Set to a large number to offload everything, or a lower number to split between CPU and GPU when VRAM is limited.

# Full GPU offload (all layers)
./llama-server -m model.gguf -ngl 99

# Partial offload (20 layers on GPU, rest on CPU)
./llama-server -m model.gguf -ngl 20
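
A practical way to tune -ngl on NVIDIA hardware is to start high, then lower it if the model fails to load or VRAM overflows. You can watch memory usage from another terminal while the server runs (a standard nvidia-smi query, not specific to llama.cpp):

# Monitor VRAM while experimenting with different -ngl values
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv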

