Sigma Detection Classification

Jan 2026

Multi-label classification of MITRE ATT&CK tactics and techniques from Sigma rules

Metric

View

Provider

2733 Sigma Detection Rules · 12 models · Jan 2026

Rank	Models (12)	F1 Score	Precision	Recall	Cost / Task	Latency
1	Claude Opus 4.5	66.0%	67.5%	69.2%	$0.0038	2.5s
2	Claude Sonnet 4.5	60.0%	57.7%	68.6%	$0.0022	3.1s
3	Gemini 3.0 Pro	57.2%	56.2%	63.6%	$0.0017	25.1s
4	Gemini 3.0 Flash	56.4%	51.3%	71.4%	$0.0004	9.0s
5	GPT-5.2	56.3%	55.8%	62.8%	$0.0012	1.2s
6	GLM 4.7	55.2%	55.2%	60.5%	$0.0039	51.5s
7	GPT-OSS 120B	48.9%	47.3%	56.3%	$0.0007	5.0s
8	DeepSeek v3.2	48.0%	50.3%	48.9%	$0.0001	1.9s
9	MiniMax M2	47.4%	48.2%	50.9%	$0.0005	21.4s
10	Claude Haiku 4.5	47.2%	44.8%	55.3%	$0.0007	0.8s

Best F1 ScoreBest Open Weight

Dataset

SigmaHQ Rules

View Dataset

2733 Sigma Detection Rules

Detection Engineering

MITRE ATT&CK

SIEM

Windows

Linux

Cloud

Network

Application

About This Benchmark

This benchmark evaluates LLMs' intrinsic knowledge of detection engineering and the MITRE ATT&CK framework. Unlike agentic tasks where models can query external resources, this single-turn classification task tests what models have learned about adversary tradecraft during training. Given a Sigma rule's detection logic, title, and description (with MITRE tags stripped), models must predict all applicable technique IDs (e.g., T1059.001). The dataset comprises 2,733 rules from the SigmaHQ repository, covering Windows, Linux, cloud, and network detections.

Sample Tasks

Input•Certutil Base64 Decode Detection

title: File Decoded From Base64/Hex Via Certutil
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        Image|endswith: '\\certutil.exe'
        CommandLine|contains:
            - '-decode'
            - '-decodehex'
    condition: selection

Label•T1027 (Obfuscated Files or Information)

Input•PowerShell Download Cradle

title: Suspicious PowerShell Download Cradle
logsource:
    category: ps_script
    product: windows
detection:
    selection:
        ScriptBlockText|contains:
            - 'IEX'
            - 'Invoke-Expression'
            - 'DownloadString'
    condition: selection

Label•T1059.001 (PowerShell), T1105 (Ingress Tool Transfer)

Key Findings

F1 Score

Claude Opus 4.5 achieved the highest hierarchical F1 score at 66%, followed by Sonnet 4.5 at 60% and a cluster of models around 56-57% (Gemini 3.0 Pro, Gemini 3.0 Flash, GPT-5.2). Among open-weight models, DeepSeek v3.2 led at 48% F1.

F1 Score by Model

Anthropic

Google

OpenAI

Xai

Deepseek

Minimax

Precision vs Recall

Since ground truth labels are community-contributed and often incomplete, recall is the more reliable metric. It measures coverage of human-intended labels, while precision is penalized when models correctly identify techniques the author simply forgot to tag. Gemini 3.0 Flash achieved the highest recall at 71%, followed by Opus 4.5 (69%) and Sonnet 4.5 (69%). Among open-weight models, DeepSeek v3.2 led with 49% recall.

Precision vs Recall

Top-right is best (high precision and recall)

anthropic

google

openai

xai

deepseek

minimax

qwen

Cost Efficiency

Gemini 3.0 Flash offered the best frontier cost efficiency at ~$0.39/1000 samples with 56% F1. Among open-weight models, DeepSeek v3.2 was remarkably cheap at ~$0.08/1000 samples with 48% F1, nearly 5x cheaper than Gemini Flash. For highest accuracy, Opus 4.5 cost ~$3.84/1000 samples for 66% F1.

Cost per Task

Deepseek

Google

Minimax

Anthropic

OpenAI

Qwen

Speed

Haiku 4.5 was the fastest frontier model at 0.75s average per sample, followed by GPT-5.2 at 1.2s. Among open-weight models, Qwen3 235B was fastest at 0.97s, though with lower accuracy (38% F1). DeepSeek v3.2 offered a better speed/accuracy trade-off at 1.9s with 48% F1.

Task Duration (avg)

Anthropic

Qwen

OpenAI

Deepseek

Google

Minimax

Model Recommendations

Gemini 3.0 Flash — Best for maximizing recall. Highest recall (71%) at just ~$0.39/1000 samples. Ideal when missing a technique is costlier than over-predicting.
Claude Opus 4.5 — Best overall F1 (66%) with strong recall (69%). Choose when you need balanced precision and recall.
Claude Sonnet 4.5 — Strong recall (69%) at lower cost than Opus. Good balance for production detection enrichment.
DeepSeek v3.2 — Best open-weight value at ~$0.08/1000 samples. Usable recall (49%) at nearly 50x cheaper than Opus.
GPT-5.2 — Fast frontier option with 63% recall at 1.2s latency. Good for high-throughput pipelines.

Methodology

Scoring

F1 Score: Hierarchical F1 score accounting for MITRE technique hierarchy (partial credit for parent techniques)
Precision: Proportion of predicted techniques that are correct
Recall: Proportion of techniques that were correctly predicted
Cost: USD per sample based on token usage

Task Design

Models receive a Sigma rule (title, description, detection logic) with MITRE tags stripped to prevent label leakage
Output is parsed as a comma-separated list of technique IDs (e.g., T1059, T1059.001)
Ground truth comes from the official attack.* tags authored in Sigma rules
Single-turn prediction tests intrinsic knowledge, no tool use or external lookups

Hierarchical Scoring

MITRE techniques follow a parent/sub-technique hierarchy (e.g., T1059 → T1059.001)
Exact match or more specific prediction (sub-technique when parent expected): 1.0 score
Less specific prediction (parent when sub-technique expected): 0.75 partial credit
Optimal greedy matching assigns predictions to ground truths maximizing total score
Hierarchical precision = (sum of match scores) / (number of predictions)
Hierarchical recall = (sum of match scores) / (number of ground truths)
This approach rewards semantic understanding of attack relationships over exact memorization

Controls

Same prompt template for all models, no per-model tuning
Models instructed to output ONLY comma-separated technique IDs (e.g., T1078, T1003.001)

Prompt Template

System

You are a cybersecurity expert specializing in the MITRE ATT&CK framework. Your task is to analyze detection rules and identify which MITRE ATT&CK technique IDs they are designed to detect. MITRE ATT&CK technique IDs follow the format T#### (e.g., T1078, T1003) or T####.### for sub-techniques (e.g., T1078.003). Analyze the detection rule carefully and output ONLY a comma-separated list of technique IDs, nothing else.

User

Analyze this Sigma detection rule and identify the MITRE ATT&CK technique IDs it detects:

[Sigma Rule YAML]

Caveats

Ground truth labels are community-contributed: rule authors may annotate only a subset of applicable TTPs, making precision a noisy metric. A "false positive" may actually be a valid technique the author omitted.
Recall is the more reliable indicator of model quality, measuring how well the model covers the techniques the human author explicitly intended to assign.
Hierarchical scoring awards 0.75 partial credit for parent techniques, which may inflate scores compared to strict exact-match evaluation.
Some Sigma rules have ambiguous or overly broad MITRE mappings in the ground truth.

Other Benchmarks

Windows Enterprise Intrusion

BlueBench-Intrusion-002: Real multi-host Windows Active Directory intrusion spanning detection engineering, malware analysis, and open-ended incident reporting

Detection Engineering

40 samples · 22 models · Jul 2026