Evaluate
Benchmark and eval center
Leaderboards are useful for discovery, not final selection. Always score models on your own task distribution before making a long-term choice.
Benchmark literacy checklist
- Use benchmark scores to shortlist, not to make final deployment decisions.
- Compare models on your domain tasks (coding, support, RAG, internal docs).
- Track both quality and operations metrics (latency, memory fit, failure modes).
- Re-run evaluations after quantization, runtime changes, or model upgrades.
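The operations metrics in the checklist can be captured with a few lines of instrumentation. As a minimal sketch (the `generate` callable is a hypothetical stand-in for however you invoke your model), this times each call and computes a p95 latency without external dependencies:

```python
import time

def p95(latencies_ms):
    # 95th-percentile latency via nearest-rank on the sorted sample
    xs = sorted(latencies_ms)
    k = max(0, int(round(0.95 * (len(xs) - 1))))
    return xs[k]

def timed_call(generate, prompt):
    # generate(prompt) -> str is a placeholder for your model runtime
    t0 = time.perf_counter()
    output = generate(prompt)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return output, elapsed_ms
```

Collect `elapsed_ms` across your eval cases and report `p95(...)` alongside pass rate; re-run the same loop after any quantization or runtime change so the numbers stay comparable.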
Starter eval template
Adapt this schema to build a repeatable eval set across candidate models.
{
  "name": "local-llm-eval-v1",
  "metrics": ["pass_rate", "latency_p95_ms", "format_adherence"],
  "cases": [
    {
      "id": "coding-refactor-001",
      "task": "coding",
      "prompt": "Refactor this function for readability without changing behavior.",
      "must_include": ["unchanged behavior", "clear naming"],
      "must_avoid": ["API breaking changes"]
    },
    {
      "id": "rag-grounding-001",
      "task": "rag",
      "prompt": "Answer only from the provided context and cite line numbers.",
      "must_include": ["grounded answer", "citation"],
      "must_avoid": ["unsupported claims"]
    }
  ]
}
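A template like this only pays off when it runs the same way every time. The sketch below is one possible runner for the schema above, assuming the spec has been parsed into a dict (e.g. with `json.load`) and that `generate` is a hypothetical callable wrapping your model; the substring checks on `must_include` / `must_avoid` are a deliberately simplistic scoring stand-in, and real evals often use stricter matchers or a judge model:

```python
import json

def score_case(case, output):
    # A case passes if every must_include string appears and
    # no must_avoid string appears (case-insensitive substring match).
    text = output.lower()
    included = all(s.lower() in text for s in case.get("must_include", []))
    avoided = not any(s.lower() in text for s in case.get("must_avoid", []))
    return included and avoided

def run_eval(spec, generate):
    # spec is the parsed eval template; generate(prompt) -> str is your model call
    per_case = {}
    for case in spec["cases"]:
        output = generate(case["prompt"])
        per_case[case["id"]] = score_case(case, output)
    return {
        "name": spec.get("name", "unnamed-eval"),
        "pass_rate": sum(per_case.values()) / len(per_case),
        "per_case": per_case,
    }
```

Run it once per candidate model (`run_eval(json.load(open("eval.json")), model_a)`) and compare the resulting pass rates; because the spec is data, the same file doubles as a regression suite after upgrades.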