CyberMetric
Feb 2026
Multiple-choice cybersecurity knowledge evaluation across 10,000 questions
Tihanyi et al. (2024). CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge. IEEE CSR 2024.
About This Benchmark
Sample Questions
Which VPN protocol uses AES-GCM for encryption and integrity and supports seamless reconnection properties similar to IKEv2 MOBIKE?
What is a fundamental difference between rootkits and conventional Trojan horse programs?
Norbert is the security administrator for a public network. In an attempt to detect hacking attempts, he installed a program on his production servers that imitates a well-known operating system vulnerability and reports exploitation attempts to the administrator. What is this type of technique called?
Methodology
Scoring
- Accuracy: Exact match against the correct answer letter (A, B, C, or D)
- Cost: USD per question based on token usage
- Completion: Percentage of questions answered without errors
Task Design
- Each question has 4 answer options (A/B/C/D) with a single correct answer
- Questions span cybersecurity standards, certifications, network security, cryptography, risk management, and more
- Models must output a single answer letter—no explanation needed
- Dataset sourced from NIST standards, CISSP/CEH certs, research papers, and security textbooks
- All questions verified by human cybersecurity experts
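A single item in this task design might look like the record below (field names and the sample question are assumptions for illustration, not the dataset's actual schema):

```python
# Illustrative shape of one four-option CyberMetric-style item.
question = {
    "question": "Which of the following is an asymmetric encryption algorithm?",
    "answers": {
        "A": "AES",
        "B": "RSA",
        "C": "ChaCha20",
        "D": "HMAC-SHA256",
    },
    "solution": "B",  # single correct answer letter
}

def is_correct(model_output: str, item: dict) -> bool:
    # Exact match on the answer letter, tolerant of whitespace and case
    return model_output.strip().upper() == item["solution"]
```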
Execution
- Frontier providers (OpenAI, Anthropic, Google) evaluated via batch API for cost efficiency
- Open-weight and smaller providers evaluated via real-time API through OpenRouter
- "Thinking" / reasoning mode enabled for all models that support it
- Same prompt template for all models, no per-model tuning
- Full dataset of 10,180 questions evaluated per model
Prompt Template
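The same template was used for every model with no per-model tuning. A minimal illustrative example of such a letter-only multiple-choice prompt (not necessarily the exact wording used):

```
You are answering a multiple-choice cybersecurity question.
Respond with exactly one letter: A, B, C, or D. Do not explain.

Question: {question}
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}

Answer:
```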
Key Findings
Remarkably Tight Spread
All 13 models scored between 86.9% and 90.8%—a gap of just 3.9 percentage points from top to bottom. To test whether this reflects memorization, we gave GPT-5 Mini the questions without answer choices and forced a letter-only response: it scored 25.45%, consistent with random guessing (25%). This suggests the high scores reflect genuine cybersecurity knowledge, not dataset contamination.
Accuracy by Model
Closed vs Open-Weight
Closed models cluster between 88.5–90.8%, open-weight between 86.9–87.8%—a consistent but narrow ~2pp gap. Within each group the spread is even tighter: 2.3pp among closed models, under 1pp among open-weight. GPT-5 Mini (89.3%) nearly matches GPT-5.2 (89.4%), suggesting that model scale provides diminishing returns on static knowledge tasks at this saturation level.
Cost Varies 200x
Qwen3 Coder Next costs just $0.024 per 1,000 questions while Claude Sonnet 4.5 costs $4.77—a 200x difference for only 2.4pp less accuracy. GPT-5 Mini offers the best frontier value at $0.38/1K questions with 89.3% accuracy, and GPT-OSS 120B delivers 87.3% at just $0.03/1K—making high-quality cybersecurity knowledge evaluation accessible at minimal cost.
Cost per Task
Reliability
All frontier models achieved 100% completion (GPT-5.2, GPT-5 Mini, Sonnet 4.5, Haiku 4.5, Opus 4.6, Gemini 3.0 Pro). Among OpenRouter models, most maintained >99% completion. GLM-4.7 had the lowest at 97.4% with 264 errors. Gemini 3.0 Flash was excluded entirely due to a batch API failure resulting in 0% completion.
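GLM-4.7's figure checks out against the full 10,180-question set:

```python
total_questions = 10_180  # full dataset size (see Execution)
errors = 264              # GLM-4.7 failed calls

completion = 1 - errors / total_questions
print(f"{completion:.1%}")  # → 97.4%
```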
Task Completion Rate
Model Recommendations
- All tested models have broadly internalized cybersecurity fundamentals — Choose based on cost and integration needs.
- GPT-5 Mini — Best practical choice for most use cases. Frontier-class accuracy (89.3%) at just $0.38/1K questions—over 10x cheaper than Sonnet 4.5 for a negligible 0.1pp difference.
- GPT-OSS 120B / Qwen3 Coder Next — For bulk or cost-sensitive workloads. At $0.03 and $0.024 per 1K questions respectively, the 2–3pp accuracy gap versus frontier models is unlikely to matter in practice.
- For harder evaluation — Static knowledge recall is effectively solved. To meaningfully differentiate models on cybersecurity, use applied benchmarks that test reasoning and tool use—such as Cybench (agentic CTF) or BOTSv3 (log investigation).
Caveats
- We tested for dataset memorization by giving GPT-5 Mini questions without answer choices (letter-only response). It scored 25.45% on the 2K subset—consistent with random guessing (25%). GPT-5.2 scored 30% on the 80-question subset, within expected variance. We do not find evidence of memorization, though this does not definitively rule it out.
- With answer choices provided but no questions, GPT-5 Mini scored 55%—likely because some distractor options are obviously wrong, reducing the effective choice set. This is a dataset quality consideration rather than a memorization signal.
- Frontier providers (OpenAI, Anthropic, Google) were evaluated via batch API, so per-question latency is not available for these models. Latency data is only reported for models run through OpenRouter.
- Cost estimates are based on published API pricing at the time of evaluation and exclude any batch or volume discounts.
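The "consistent with random guessing" claim in the first caveat can be sanity-checked with a normal approximation to the binomial, assuming the "2K subset" contains exactly 2,000 questions:

```python
import math

n = 2_000          # assumed size of the "2K" subset
p0 = 0.25          # chance level with 4 options
observed = 0.2545  # GPT-5 Mini with answer choices withheld

se = math.sqrt(p0 * (1 - p0) / n)  # standard error under guessing
z = (observed - p0) / se
print(f"z = {z:.2f}")  # well inside the ±1.96 band at 95% confidence
```

A z-score this small means the 25.45% result is statistically indistinguishable from the 25% chance baseline.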
Other Benchmarks
- BlueBench-Intrusion-001: Real macOS infostealer intrusion spanning incident response, threat hunting, and detection engineering
- Real CTF challenges from CSAW competitions covering reverse engineering, forensics, and miscellaneous problem-solving
- Defensive security CTF challenges testing forensics, reverse engineering, and miscellaneous security skills
- Blue team CTF scenarios testing incident response and threat hunting
- Multi-label classification of MITRE ATT&CK tactics and techniques from Sigma rules
AI for the blue team.
Run Cotool's harness in your environment to get real security work done.
