CyberMetric
Feb 2026
Multiple-choice cybersecurity knowledge evaluation across 10,000 questions
Tihanyi et al. (2024). CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge. IEEE CSR 2024.
About This Benchmark
Sample Questions
Which VPN protocol uses AES-GCM for encryption and integrity and supports seamless reconnection properties similar to IKEv2 MOBIKE?
What is a fundamental difference between rootkits and conventional Trojan horse programs?
Norbert is the security administrator for a public network. In an attempt to detect hacking attempts, he installed a program on his production servers that imitates a well-known operating system vulnerability and reports exploitation attempts to the administrator. What is this type of technique called?
Methodology
Scoring
- Accuracy: Exact match against the correct answer letter (A, B, C, or D)
- Cost: USD per question based on token usage
- Completion: Percentage of questions answered without errors
Task Design
- Each question has 4 answer options (A/B/C/D) with a single correct answer
- Questions span cybersecurity standards, certifications, network security, cryptography, risk management, and more
- Models must output a single answer letter—no explanation needed
- Dataset sourced from NIST standards, CISSP/CEH certs, research papers, and security textbooks
- All questions verified by human cybersecurity experts
Execution
- Frontier providers (OpenAI, Anthropic, Google) evaluated via batch API for cost efficiency
- Open-weight and smaller providers evaluated via real-time API through OpenRouter
- "Thinking" / reasoning mode enabled for all models that support it
- Same prompt template for all models, no per-model tuning
- Full dataset of 10,180 questions evaluated per model
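For the OpenRouter path, requests go through an OpenAI-compatible chat-completions endpoint. A hedged sketch of the request body (model slug, token limit, and temperature are assumptions, not the documented run settings):

```python
def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload (OpenRouter format)."""
    return {
        "model": model,                      # e.g. an OpenRouter model slug
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,                     # only a single letter is expected back
        "temperature": 0,                    # deterministic for scoring runs
    }
```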
Prompt Template
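The exact template isn't reproduced here, but a letter-only template consistent with the task design above might look like the following (wording is illustrative, not the prompt actually used):

```python
PROMPT_TEMPLATE = """\
You are answering a multiple-choice cybersecurity question.
Respond with a single letter: A, B, C, or D. Do not explain.

Question: {question}

A) {a}
B) {b}
C) {c}
D) {d}

Answer:"""

def build_prompt(question: str, options: dict[str, str]) -> str:
    """Fill the template with one question and its four options."""
    return PROMPT_TEMPLATE.format(
        question=question,
        a=options["A"], b=options["B"], c=options["C"], d=options["D"],
    )
```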
Key Findings
Remarkably Tight Spread
All 13 models scored between 86.9% and 90.8%—a gap of just 3.9 percentage points from top to bottom. To test whether this reflects memorization, we gave GPT-5 Mini the questions without answer choices and forced a letter-only response: it scored 25.45%, consistent with random guessing (25%). This suggests the high scores reflect genuine cybersecurity knowledge, not dataset contamination.
Accuracy by Model
Closed vs Open-Weight
Closed models cluster between 88.5–90.8%, open-weight between 86.9–87.8%—a consistent but narrow ~2pp gap. Within each group the spread is even tighter: 2.3pp among closed models, under 1pp among open-weight. GPT-5 Mini (89.3%) nearly matches GPT-5.2 (89.4%), suggesting that model scale provides diminishing returns on static knowledge tasks at this saturation level.
Cost Varies 200x
Qwen3 Coder Next costs just $0.024 per 1,000 questions while Claude Sonnet 4.5 costs $4.77—a 200x difference for only 2.4pp less accuracy. GPT-5 Mini offers the best frontier value at $0.38/1K questions with 89.3% accuracy, and GPT-OSS 120B delivers 87.3% at just $0.03/1K—making high-quality cybersecurity knowledge evaluation accessible at minimal cost.
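The headline ratio follows directly from the per-1,000-question figures quoted above:

```python
# Per-1,000-question costs quoted above (USD)
qwen3_coder_next = 0.024
claude_sonnet_45 = 4.77

ratio = claude_sonnet_45 / qwen3_coder_next
print(round(ratio))  # 199, i.e. roughly 200x
```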
Cost per Task
Reliability
All frontier models achieved 100% completion (GPT-5.2, GPT-5 Mini, Sonnet 4.5, Haiku 4.5, Opus 4.6, Gemini 3.0 Pro). Among OpenRouter models, most maintained >99% completion. GLM-4.7 had the lowest at 97.4% with 264 errors. Gemini 3.0 Flash was excluded entirely due to a batch API failure resulting in 0% completion.
Task Completion Rate
Model Recommendations
- All tested models have broadly internalized cybersecurity fundamentals — choose based on cost and integration needs.
- GPT-5 Mini — Best practical choice for most use cases. Frontier-class accuracy (89.3%) at just $0.38/1K questions—over 10x cheaper than Sonnet 4.5 for a negligible 0.1pp difference.
- GPT-OSS 120B / Qwen3 Coder Next — For bulk or cost-sensitive workloads. At $0.03 and $0.024 per 1K questions respectively, the 2–3pp accuracy gap versus frontier models is unlikely to matter in practice.
- For harder evaluation — Static knowledge recall is effectively solved. To meaningfully differentiate models on cybersecurity, use applied benchmarks that test reasoning and tool use—such as Cybench (agentic CTF) or BOTSv3 (log investigation).
Caveats
- We tested for dataset memorization by giving GPT-5 Mini questions without answer choices (letter-only response). It scored 25.45% on the 2K subset—consistent with random guessing (25%). GPT-5.2 scored 30% on the 80-question subset, within expected variance. We do not find evidence of memorization, though this does not definitively rule it out.
- With answer choices provided but no questions, GPT-5 Mini scored 55%—likely because some distractor options are obviously wrong, reducing the effective choice set. This is a dataset quality consideration rather than a memorization signal.
- Frontier providers (OpenAI, Anthropic, Google) were evaluated via batch API, so per-question latency is not available for these models. Latency data is only reported for models run through OpenRouter.
- Cost estimates are based on published API pricing at the time of evaluation and exclude any batch or volume discounts.
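The "consistent with random guessing" claim in the first caveat can be sanity-checked with a normal approximation to the binomial (n = 2,000 is assumed from the "2K subset" wording):

```python
import math

n = 2000            # size of the "2K subset" (assumed)
p0 = 0.25           # chance accuracy on 4-option questions
observed = 0.2545   # GPT-5 Mini, question-only probe

# z-score under the null hypothesis of random guessing
se = math.sqrt(p0 * (1 - p0) / n)
z = (observed - p0) / se
print(f"z = {z:.2f}")  # ~0.46: well within noise, no evidence of memorization
```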
Other Benchmarks
Real CTF challenges from CSAW competitions covering reverse engineering, forensics, and miscellaneous problem-solving
Defensive security CTF challenges testing forensics, reverse engineering, and miscellaneous security skills
Blue team CTF scenarios testing incident response and threat hunting
Multi-label classification of MITRE ATT&CK tactics and techniques from Sigma rules
