CyberMetric
Feb 2026
Multiple-choice cybersecurity knowledge evaluation across 10,000 questions
Tihanyi et al. (2024). CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge. IEEE CSR 2024.
About This Benchmark
Sample Questions
Which VPN protocol uses AES-GCM for encryption and integrity and supports seamless reconnection properties similar to IKEv2 MOBIKE?
What is a fundamental difference between rootkits and conventional Trojan horse programs?
Norbert is the security administrator for a public network. In an attempt to detect hacking attempts, he installed a program on his production servers that imitates a well-known operating system vulnerability and reports exploitation attempts to the administrator. What is this type of technique called?
Methodology
Scoring
- Accuracy: Exact match against the correct answer letter (A, B, C, or D)
- Cost: USD per question based on token usage
- Completion: Percentage of questions answered without errors
Task Design
- Each question has 4 answer options (A/B/C/D) with a single correct answer
- Questions span cybersecurity standards, certifications, network security, cryptography, risk management, and more
- Models must output a single answer letter—no explanation needed
- Dataset sourced from NIST standards, CISSP/CEH certs, research papers, and security textbooks
- All questions verified by human cybersecurity experts
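A single item in this task design might look like the record below (field names and the sample question are assumptions for illustration, not the dataset's actual schema):

```python
# Illustrative shape of one four-option CyberMetric-style item.
question = {
    "question": "Which of the following is an asymmetric encryption algorithm?",
    "answers": {
        "A": "AES",
        "B": "RSA",
        "C": "ChaCha20",
        "D": "HMAC-SHA256",
    },
    "solution": "B",  # single correct answer letter
}

def is_correct(model_output: str, item: dict) -> bool:
    # Exact match on the answer letter, tolerant of whitespace and case
    return model_output.strip().upper() == item["solution"]
```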
Execution
- Frontier providers (OpenAI, Anthropic, Google) evaluated via batch API for cost efficiency
- Open-weight and smaller providers evaluated via real-time API through OpenRouter
- "Thinking" / reasoning mode enabled for all models that support it
- Same prompt template for all models, no per-model tuning
- Full dataset of 10,180 questions evaluated per model
Prompt Template
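The same template was used for every model with no per-model tuning. A minimal illustrative example of such a letter-only multiple-choice prompt (not necessarily the exact wording used):

```
You are answering a multiple-choice cybersecurity question.
Respond with exactly one letter: A, B, C, or D. Do not explain.

Question: {question}
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}

Answer:
```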
Key Findings
Remarkably Tight Spread
All 13 models scored between 86.9% and 90.8%—a gap of just 3.9 percentage points from top to bottom. To test whether this reflects memorization, we gave GPT-5 Mini the questions without answer choices and forced a letter-only response: it scored 25.45%, consistent with random guessing (25%). This suggests the high scores reflect genuine cybersecurity knowledge, not dataset contamination.
Accuracy by Model
Closed vs Open-Weight
Closed models cluster between 88.5–90.8%, open-weight between 86.9–87.8%—a consistent but narrow ~2pp gap. Within each group the spread is even tighter: 2.3pp among closed models, under 1pp among open-weight. GPT-5 Mini (89.3%) nearly matches GPT-5.2 (89.4%), suggesting that model scale provides diminishing returns on static knowledge tasks at this saturation level.
Cost Varies 200x
Qwen3 Coder Next costs just $0.024 per 1,000 questions while Claude Sonnet 4.5 costs $4.77—a 200x difference for only 2.4pp less accuracy. GPT-5 Mini offers the best frontier value at $0.38/1K questions with 89.3% accuracy, and GPT-OSS 120B delivers 87.3% at just $0.03/1K—making high-quality cybersecurity knowledge evaluation accessible at minimal cost.
Cost per Task
Reliability
All frontier models achieved 100% completion (GPT-5.2, GPT-5 Mini, Sonnet 4.5, Haiku 4.5, Opus 4.6, Gemini 3.0 Pro). Among OpenRouter models, most maintained >99% completion. GLM-4.7 had the lowest at 97.4% with 264 errors. Gemini 3.0 Flash was excluded entirely due to a batch API failure resulting in 0% completion.
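GLM-4.7's figure checks out against the full 10,180-question set:

```python
total_questions = 10_180  # full dataset size (see Execution)
errors = 264              # GLM-4.7 failed calls

completion = 1 - errors / total_questions
print(f"{completion:.1%}")  # → 97.4%
```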
Task Completion Rate
Model Recommendations
- All tested models have broadly internalized cybersecurity fundamentals — Choose based on cost and integration needs.
- GPT-5 Mini — Best practical choice for most use cases. Frontier-class accuracy (89.3%) at just $0.38/1K questions—over 10x cheaper than Sonnet 4.5 for a negligible 0.1pp difference.
- GPT-OSS 120B / Qwen3 Coder Next — For bulk or cost-sensitive workloads. At $0.03 and $0.024 per 1K questions respectively, the 2–3pp accuracy gap versus frontier models is unlikely to matter in practice.
- For harder evaluation — Static knowledge recall is effectively solved. To meaningfully differentiate models on cybersecurity, use applied benchmarks that test reasoning and tool use—such as Cybench (agentic CTF) or BOTSv3 (log investigation).
Caveats
- We tested for dataset memorization by giving GPT-5 Mini questions without answer choices (letter-only response). It scored 25.45% on the 2K subset—consistent with random guessing (25%). GPT-5.2 scored 30% on the 80-question subset, within expected variance. We do not find evidence of memorization, though this does not definitively rule it out.
- With answer choices provided but no questions, GPT-5 Mini scored 55%—likely because some distractor options are obviously wrong, reducing the effective choice set. This is a dataset quality consideration rather than a memorization signal.
- Frontier providers (OpenAI, Anthropic, Google) were evaluated via batch API, so per-question latency is not available for these models. Latency data is only reported for models run through OpenRouter.
- Cost estimates are based on published API pricing at the time of evaluation and exclude any batch or volume discounts.
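The "consistent with random guessing" claim in the first caveat can be sanity-checked with a normal approximation to the binomial, assuming the "2K subset" contains exactly 2,000 questions:

```python
import math

n = 2_000          # assumed size of the "2K" subset
p0 = 0.25          # chance level with 4 options
observed = 0.2545  # GPT-5 Mini with answer choices withheld

se = math.sqrt(p0 * (1 - p0) / n)  # standard error under guessing
z = (observed - p0) / se
print(f"z = {z:.2f}")  # well inside the ±1.96 band at 95% confidence
```

A z-score this small means the 25.45% result is statistically indistinguishable from the 25% chance baseline.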
Other Benchmarks
- BlueBench-Intrusion-001: Real macOS infostealer intrusion spanning incident response, threat hunting, and detection engineering
- Real CTF challenges from CSAW competitions covering reverse engineering, forensics, and miscellaneous problem-solving
- Defensive security CTF challenges testing forensics, reverse engineering, and miscellaneous security skills
- Blue team CTF scenarios testing incident response and threat hunting
- Multi-label classification of MITRE ATT&CK tactics and techniques from Sigma rules
AI for the blue team.
Run Cotool's harness in your environment to get real security work done.
