
CyberMetric

Feb 2026

Multiple-choice cybersecurity knowledge evaluation across 10,180 questions

10,180 questions · 13 models · Feb 2026
Rank  Model              Accuracy  Cost/Question  Completion
1     Claude Opus 4.6    90.8%     $0.0025        100.0%
2     Gemini 3.0 Pro     90.0%     $0.0017        100.0%
3     GPT-5.2            89.4%     $0.0004        100.0%
4     Claude Sonnet 4.5  89.4%     $0.0048        100.0%
5     GPT-5 Mini         89.3%     $0.0004        100.0%
6     Claude Haiku 4.5   88.5%     $0.0040        100.0%
7     Qwen3 235B         87.8%     $0.0008        98.9%
8     GLM-4.7            87.5%     $0.0013        97.4%
9     MiniMax M2.1       87.5%     $0.0006        99.5%
10    GPT-OSS 120B       87.3%     $0.0000        99.2%
(Top 10 of 13 models shown)
Best Accuracy: Claude Opus 4.6 · Best Open Weight: Qwen3 235B
Dataset
CyberMetric-10000
10,180 questions
Standards & Certifications
Network Security
Cryptography
Risk Management
Access Control
Incident Response
Application Security
Cloud Security

About This Benchmark

CyberMetric is a multiple-choice benchmark designed to evaluate cybersecurity knowledge across standards, certifications, network security, cryptography, risk management, and technical security concepts. Questions are sourced from NIST standards, CISSP/CEH certification material, research papers, and security textbooks, then verified by human domain experts. This single-turn evaluation tests what models have internalized about cybersecurity fundamentals during training: no tools, no retrieval, just raw knowledge. We evaluated 13 models from 7 providers across the full 10,180-question CyberMetric-10000 dataset.

Sample Questions

Which VPN protocol uses AES-GCM for encryption and integrity and supports seamless reconnection properties similar to IKEv2 MOBIKE?

A. Secure Shell (SSH)
B. MACsec
C. OpenVPN
D. WireGuard

What is a fundamental difference between rootkits and conventional Trojan horse programs?

A. Rootkits replace existing programs and files on systems, while conventional Trojans are new programs installed into systems that have been compromised.
B. Conventional Trojan horse programs are harder to detect compared to rootkits.
C. Rootkits do not incorporate active mechanisms to prevent them from being noticed.
D. Conventional Trojan horse programs operate at the kernel level of the operating system.

Norbert is the security administrator for a public network. In an attempt to detect hacking attempts, he installed a program on his production servers that imitates a well-known operating system vulnerability and reports exploitation attempts to the administrator. What is this type of technique called?

A. Bear trap
B. Firewall
C. Pseudo-flaw
D. Honey pot

Methodology

Scoring

  • Accuracy: Exact match against the correct answer letter (A, B, C, or D)
  • Cost: USD per question based on token usage
  • Completion: Percentage of questions answered without errors

Task Design

  • Each question has 4 answer options (A/B/C/D) with a single correct answer
  • Questions span cybersecurity standards, certifications, network security, cryptography, risk management, and more
  • Models must output a single answer letter—no explanation needed
  • Dataset sourced from NIST standards, CISSP/CEH certs, research papers, and security textbooks
  • All questions verified by human cybersecurity experts

Execution

  • Frontier providers (OpenAI, Anthropic, Google) evaluated via batch API for cost efficiency
  • Open-weight and smaller providers evaluated via real-time API through OpenRouter
  • "Thinking" / reasoning mode enabled for all models that support it
  • Same prompt template for all models, no per-model tuning
  • Full dataset of 10,180 questions evaluated per model
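A minimal sketch of a real-time request like the OpenRouter path described above, using OpenRouter's OpenAI-compatible chat-completions endpoint. The model id is illustrative, and the harness's actual request parameters (temperature, retries, reasoning settings) are not specified here:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
SYSTEM_PROMPT = (
    "You are an expert in cybersecurity. Answer multiple-choice questions "
    "by selecting the single best option. Provide your answer as a single "
    "uppercase letter (A, B, C, or D)."
)

def build_request(model: str, user_prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with `urllib.request.urlopen` (or an HTTP client of your choice) and reading `choices[0].message.content` from the JSON response completes the loop.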

Prompt Template

System
You are an expert in cybersecurity. Answer multiple-choice questions by selecting the single best option. Provide your answer as a single uppercase letter (A, B, C, or D).
User
[Question text]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Provide your answer on a final line starting with "Answer:" followed by a single letter.
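The user turn above can be rendered per question with a small helper like the following; the line-break layout is an assumption, since the page shows the template collapsed onto one line:

```python
def format_user_prompt(question: str, options: dict[str, str]) -> str:
    """Render one question and its A-D options into the user-turn template."""
    lines = [question]
    for letter in ("A", "B", "C", "D"):
        lines.append(f"{letter}. {options[letter]}")
    lines.append(
        'Provide your answer on a final line starting with "Answer:" '
        "followed by a single letter."
    )
    return "\n".join(lines)
```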

Key Findings

Remarkably Tight Spread

All 13 models scored between 86.9% and 90.8%, a spread of just 3.9 percentage points from top to bottom. To test whether this reflects memorization, we gave GPT-5 Mini a 2,000-question subset without the answer choices and forced a letter-only response: it scored 25.45%, consistent with random guessing (25%). This suggests the high scores reflect genuine cybersecurity knowledge rather than dataset contamination.
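To see why an accuracy this close to 25% is consistent with chance, a quick normal-approximation check helps (the 2,000-question subset size comes from the Caveats section; the one-sample proportion z-test here is a standard check, not necessarily the exact analysis that was run):

```python
import math

def proportion_z(observed: float, n: int, p0: float = 0.25) -> float:
    """z-score of an observed accuracy against a chance rate p0."""
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error of a proportion under H0
    return (observed - p0) / se

z = proportion_z(0.2545, 2000)
# |z| is roughly 0.46, far below the 1.96 threshold, so 25.45% is
# statistically indistinguishable from random guessing at the 5% level.
```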

Accuracy by Model

[Chart: per-model accuracy, grouped by provider (Anthropic, Google, OpenAI, Qwen, Zhipu, MiniMax)]

Closed vs Open-Weight

Closed models cluster between 88.5–90.8%, open-weight between 86.9–87.8%—a consistent but narrow ~2pp gap. Within each group the spread is even tighter: 2.3pp among closed models, under 1pp among open-weight. GPT-5 Mini (89.3%) nearly matches GPT-5.2 (89.4%), suggesting that model scale provides diminishing returns on static knowledge tasks at this saturation level.

Cost Varies 200x

Qwen3 Coder Next costs just $0.024 per 1,000 questions while Claude Sonnet 4.5 costs $4.77—a 200x difference for only 2.4pp less accuracy. GPT-5 Mini offers the best frontier value at $0.38/1K questions with 89.3% accuracy, and GPT-OSS 120B delivers 87.3% at just $0.03/1K—making high-quality cybersecurity knowledge evaluation accessible at minimal cost.
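Per-question cost follows directly from token usage and per-million-token prices. A sketch of the arithmetic, with illustrative placeholder token counts and prices rather than the actual rates used in the evaluation:

```python
def cost_per_question(prompt_tokens: int, completion_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """USD cost of one question from token counts and per-million-token prices."""
    return (prompt_tokens * usd_per_m_input
            + completion_tokens * usd_per_m_output) / 1_000_000

# e.g. a 200-token prompt and 50-token answer at $1/M input, $4/M output
# comes to $0.0004 per question, i.e. $0.40 per 1,000 questions.
cost = cost_per_question(200, 50, 1.0, 4.0)
```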

Cost per Task

[Chart: cost per task by model, grouped by provider (Qwen, OpenAI, MiniMax, Zhipu, Moonshot, Google)]

Reliability

All frontier models achieved 100% completion (GPT-5.2, GPT-5 Mini, Sonnet 4.5, Haiku 4.5, Opus 4.6, Gemini 3.0 Pro). Among OpenRouter models, most maintained >99% completion. GLM-4.7 had the lowest at 97.4% with 264 errors. Gemini 3.0 Flash was excluded entirely due to a batch API failure resulting in 0% completion.

Task Completion Rate

[Chart: task completion rate by model, grouped by provider (Anthropic, Google, OpenAI, MiniMax, Qwen)]

Model Recommendations

  • All tested models have broadly internalized cybersecurity fundamentals. Choose based on cost and integration needs.
  • GPT-5 Mini: the best practical choice for most use cases. Frontier-class accuracy (89.3%) at just $0.38/1K questions, over 10x cheaper than Sonnet 4.5 for a negligible 0.1pp difference.
  • GPT-OSS 120B / Qwen3 Coder Next: for bulk or cost-sensitive workloads. At $0.03 and $0.024 per 1K questions respectively, the 2–3pp accuracy gap versus frontier models is unlikely to matter in practice.
  • For harder evaluation: static knowledge recall is effectively solved. To meaningfully differentiate models on cybersecurity, use applied benchmarks that test reasoning and tool use, such as Cybench (agentic CTF) or BOTSv3 (log investigation).

Caveats

  • We tested for dataset memorization by giving GPT-5 Mini questions without answer choices (letter-only response). It scored 25.45% on the 2K subset—consistent with random guessing (25%). GPT-5.2 scored 30% on the 80-question subset, within expected variance. We do not find evidence of memorization, though this does not definitively rule it out.
  • With answer choices provided but no questions, GPT-5 Mini scored 55%—likely because some distractor options are obviously wrong, reducing the effective choice set. This is a dataset quality consideration rather than a memorization signal.
  • Frontier providers (OpenAI, Anthropic, Google) were evaluated via batch API, so per-question latency is not available for these models. Latency data is only reported for models run through OpenRouter.
  • Cost estimates are based on published API pricing at the time of evaluation and exclude any batch or volume discounts.
