
macOS Threat Investigation

Mar 2026

BlueBench-Intrusion-001: Real macOS infostealer intrusion spanning incident response, threat hunting, and detection engineering

36 tasks · 9 models · Mar 2026
Rank  Model             Acc     Cost/task  Latency
1     GPT-5.4           86.8%   $2.81      36m 49s
2     Claude Opus 4.6   84.2%   $3.90      48m 38s
3     GPT-5.3 Codex     81.6%   $3.33      37m 5s
4     Claude Opus 4.5   75.0%   $4.94      43m 34s
5     GLM-5             71.0%   $3.76      62m 7s
6     Gemini 3.1 Pro    68.8%   $0.86      41m 3s
7     Kimi K2.5         68.4%   $0.36      29m 38s
8     GPT-5.4 Mini      59.2%   $0.42      5m 15s
9     Gemini 3.0 Flash  54.0%   $0.29      88m 53s
36 tasks · 416K+ log events
Investigation Tracks
  • Incident Response · 12 tasks (33%)
  • Threat Hunting · 12 tasks (33%)
  • Detection Engineering · 12 tasks (33%)
Tags: macOS Forensics · Credential Access · Data Exfiltration

About This Benchmark

Unlike CTF-style challenges, this benchmark uses real intrusion data. Agents investigate a genuine Odyssey Stealer compromise captured in a controlled environment. Odyssey is a macOS infostealer delivered via trojanized apps (in this case, a fake Ledger Live app) that harvests credentials, exfiltrates keychain and browser data over HTTP, and installs LaunchDaemon persistence. The dataset contains 416K+ real events across 14 log sources (EDR telemetry, macOS Unified Logs, Zeek network metadata, and security alerts). Agents were given SQL-based query tools to investigate across three tracks: incident response, threat hunting, and detection engineering. The intrusion dataset was developed in partnership with Threat Hunting Labs.
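
To make the setup concrete, an agent hunting the HTTP exfiltration step might issue a query like the sketch below. This is illustrative only: the benchmark's actual schema is not published, so the zeek_http table and its columns are hypothetical stand-ins for Zeek http.log fields.

    -- Hypothetical schema: 'zeek_http' and its columns mirror Zeek
    -- http.log fields but are NOT the benchmark's actual layout.
    -- Surface hosts receiving unusually large outbound POST bodies,
    -- a common signature of HTTP exfiltration.
    SELECT resp_h                AS destination,
           COUNT(*)              AS post_count,
           SUM(request_body_len) AS bytes_sent
    FROM   zeek_http
    WHERE  method = 'POST'
    GROUP  BY resp_h
    ORDER  BY bytes_sent DESC
    LIMIT  20;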

Sample Questions

Incident Response

Q: What persistence artifact was installed? Provide the full plist path.

A: /Library/LaunchDaemons/com.*****.plist (redacted)
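
One way an agent could surface this artifact is to query EDR file events for plist creations under /Library/LaunchDaemons. A minimal sketch, assuming a hypothetical edr_file_events table (the real schema is not published):

    -- Hypothetical 'edr_file_events' table; column names are illustrative.
    SELECT timestamp, process_name, file_path
    FROM   edr_file_events
    WHERE  action = 'create'
      AND  file_path LIKE '/Library/LaunchDaemons/%.plist'
    ORDER  BY timestamp;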

Threat Hunting

Q: What built-in utility is used to decode content immediately before the main payload executes?

A: base64 -d
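
Answering this requires ordering process events in time. A sketch of the hunt, assuming a hypothetical edr_process_events table with epoch-second timestamps (neither is the benchmark's confirmed schema):

    -- Hypothetical 'edr_process_events' table; timestamps assumed to be
    -- epoch seconds. Pair each base64 decode with sibling processes the
    -- same parent spawns within the next 10 seconds.
    SELECT dec.timestamp    AS decode_time,
           nxt.timestamp    AS exec_time,
           nxt.command_line AS next_command
    FROM   edr_process_events dec
    JOIN   edr_process_events nxt
           ON  nxt.ppid = dec.ppid
           AND nxt.pid <> dec.pid
           AND nxt.timestamp BETWEEN dec.timestamp AND dec.timestamp + 10
    WHERE  dec.process_name = 'base64'
      AND  dec.command_line LIKE '%-d%'
    ORDER  BY dec.timestamp;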

Detection Engineering

Q: Write and validate a correlation query that ties local credential validation to a subsequent privileged action within a short time window.

A: [SQL query — executed against the live dataset to validate detection coverage]
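
A hedged sketch of what such a correlation query could look like. Everything here is illustrative: the unified_logs and edr_process_events tables, their columns, and the match strings are assumptions, not the benchmark's schema or the scored solution.

    -- Illustrative only: table names, columns, and match strings are
    -- hypothetical; timestamps assumed to be epoch seconds.
    -- Tie a local authentication success to a privileged command
    -- launched within the following 60 seconds.
    SELECT auth.timestamp AS auth_time,
           priv.timestamp AS action_time,
           priv.command_line
    FROM   unified_logs auth
    JOIN   edr_process_events priv
           ON priv.timestamp BETWEEN auth.timestamp AND auth.timestamp + 60
    WHERE  auth.process = 'opendirectoryd'
      AND  auth.event_message LIKE '%authentication succeeded%'
      AND  (priv.command_line LIKE 'sudo %'
            OR priv.command_line LIKE '%dump-keychain%');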

Methodology

Scoring

  • Accuracy: LLM-judged 0–1 per question using a weighted rubric: technical accuracy (60%), completeness (25%), specificity (15%)
  • Cost: USD per task based on token usage
  • Latency: Wall-clock time to complete each investigation track
  • Completion: Percentage of tracks finished without unrecoverable errors
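
For example, a response judged 1.0 on technical accuracy, 0.8 on completeness, and 0.5 on specificity would score 0.60(1.0) + 0.25(0.8) + 0.15(0.5) = 0.875.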

Setup

  • Logs from a real macOS intrusion loaded into a sandboxed query environment
  • Agents given SQL-based query tools to investigate 416K+ events across 14 log sources
  • 36 scored questions across three investigation tracks: incident response, threat hunting, and detection engineering
  • Questions range from alert triage and kill-chain reconstruction to containment decisions and detection rule authoring
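
The harness internals aren't published, but if the sandboxed query environment is SQLite-backed (an assumption, not something the benchmark states), a natural first step is enumerating the available log-source tables:

    -- Assumes a SQLite-backed sandbox; sqlite_master is SQLite's
    -- built-in catalog of tables.
    SELECT name FROM sqlite_master WHERE type = 'table';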

Controls

  • Same minimal system prompt for all models, no per-model tuning
  • "Thinking" mode enabled where available
  • Agent loops capped at 40 iterations, after which tools are removed to force a final answer
  • All models given identical tool access and data

Key Findings

Accuracy (All Tracks)

GPT-5.4 led at 87%, followed by Opus 4.6 (84%) and GPT-5.3 Codex (82%). GPT-5.4 was the most consistent across tracks (83-92%), while Gemini 3.1 Pro scored 89% on threat hunting but dropped to 57-58% elsewhere.

[Chart: Accuracy by Model, grouped by provider]

Track Breakdown

Each track rewarded different strengths. IR was hardest (top score 83%). Gemini 3.1 Pro dominated threat hunting at 89%. GPT-5.4 and Opus 4.6 tied at 92% on detection engineering. One question stumped every model: locating the hidden credential cache path.

Cost (All Tracks)

IR was the most expensive track across models. Kimi K2.5 had the best value at $0.36/task for 68% accuracy, roughly 8x cheaper than GPT-5.4. GPT-5.4 Mini hit $0.23/task on detection engineering while still scoring 71%.

[Chart: Cost per Task, grouped by provider]

Speed (All Tracks)

GPT-5.4 Mini finished tracks in ~5 minutes. GPT-5.4 and GPT-5.3 Codex averaged ~37 minutes. Gemini 3.0 Flash was slowest at 89 minutes per track.

[Chart: Task Duration (avg), grouped by provider]

Reliability (All Tracks)

Eight of nine models achieved 100% completion. Gemini 3.1 Pro had 2 failures out of 40 runs (95% completion).

[Chart: Task Completion Rate, grouped by provider]

Model Recommendations

  • GPT-5.4 Best for high-stakes investigations. Highest accuracy at 87% with no weak tracks (83-92%) at ~$2.81/task.
  • Kimi K2.5 Best value overall. Achieves 68% accuracy at ~$0.36/task with 100% reliability, and scores 79% on threat hunting specifically.
  • GPT-5.4 Mini Best for rapid triage. Completes tracks in ~5 minutes and scores 71% on detection engineering at just $0.23/task.
  • Gemini 3.1 Pro Best for threat hunting. Scores 89% on that track (highest of any model on any track) at $1.63/task, but weaker on IR and detection engineering.
  • Claude Opus 4.6 Second highest accuracy at 84%, tied for best on detection engineering (92%). Strongest pick in the Anthropic ecosystem.

Caveats

  • Analysis questions use LLM-judged scoring, which introduces some variability compared to exact-match evaluation.
  • Costs are measured per task (12 tasks per track).
  • Gemini 3.1 Pro was in preview at the time of evaluation with tight rate limits and high API response latency. Its wall-clock times may not reflect production performance.

AI for the blue team.

Run Cotool's harness in your environment to get real security work done.

Book a demo