Skip to main content

BOTSv3 Blue Team CTF

Dec 2025

Blue team CTF scenarios testing incident response and threat hunting

Metric
View
Provider
51 scenarios · 15 models · Dec 2025
RankModels (15)AccCostLat
1
GPT-5.2
68.6%$4.0310m 19s
2
GPT-5.1
64.7%$1.675m 54s
3
Claude Opus 4.5
64.7%$5.141m 53s
4
GPT-5
62.8%$1.107m 53s
5
Claude Sonnet 4.5
60.8%$7.226m 14s
6
Claude Haiku 4.5
51.0%$0.794m
7
Gemini 3.0 Pro
51.0%$0.937m 41s
8
GPT-5 Mini
49.0%$0.0213m 28s
9
Qwen3 Coder
43.1%$0.224m
10
MiniMax M2
29.4%$0.154m 15s
Best AccuracyBest Open Weight
Dataset
Splunk BOTSv3
View Dataset(opens in new tab)
51 scenarios
·
2.7M+ log events
Incident Response
Threat Hunting
Alert Triage
Log Analysis
Advanced Peristent Threat (APT)
Cloud Security (AWS/Azure)

About This Benchmark

We reproduced the Splunk BOTSv3 blue team Capture the Flag (CTF) environment, comprising over 2.7M logs spanning 13 months with 51 question-answer pairs. These scenarios test cloud-based attack investigation (AWS, Azure), simulated APT intrusions, incident response, threat hunting, and log analysis. Using Cotool's agent harness, we gave agents access to Splunk tools (search, listDatasets, describeSourceType) to query data and answer forensic questions. These tasks represent a meaningful subset of real SOC workflows.

Sample Questions

Q: What external client IP address is able to initiate successful logins to Frothly using an expired user account?

A: 199.66.91.253

Q: Bud accidentally makes an S3 bucket publicly accessible. What is the event ID of the API call that enabled public access?

A: ab45689d-69cd-41e7-8705-5350402cf7ac

Methodology

Scoring

  • Accuracy: Case-insensitive exact match against ground truth answers
  • Cost: USD per task based on token usage (excluding prompt caching)
  • Latency: Wall-clock time to complete each task
  • Completion: Percentage of tasks finished without unrecoverable errors

Setup

  • Built a Splunk Enterprise instance and indexed the full BOTSv3 dataset
  • Agents given access to three tools: search, listDatasets, and describeSourceType
  • Tasks require agents to query Splunk, analyze results, and produce a final answer

Controls

  • Same minimal system prompt for all models, no per-model tuning
  • "Thinking" mode enabled where available
  • Agent loops capped at 100 iterations (tools removed to force an answer)
  • Some questions modified to include context that humans would have in a CTF setting
  • Questions requiring web search or browser access were removed

Key Findings

Accuracy

GPT-5.2 achieved the highest accuracy at ~69%, followed by GPT-5.1 and Opus 4.5 at 65%. GPT-5 and Sonnet 4.5 scored 63% and 61% respectively. Among open-weight models, Qwen3 Coder led at 43%, while MiniMax M2 and GPT-OSS-120b ranged from 25-29%.

Accuracy by Model

OpenAI
Anthropic
Google
Qwen
Minimax

Speed

Opus 4.5 was the fastest competitive model at just 113s average, roughly half the time of Haiku 4.5 (240s), despite presumably being a larger model. This suggests reasoning efficiency can outweigh raw inference latency in long-horizon agentic tasks.

Task Duration (avg)

OpenAI
Deepseek
Anthropic
Qwen
Minimax
Google

Cost

GPT-5.1 delivered 65% accuracy at just ~$1.67/task, the best cost-to-accuracy ratio among frontier models. GPT-5.2 cost ~$4.03/task for 69% accuracy, while Opus 4.5 cost ~$5.14/task for 65%. Among open-weight models, Qwen3 Coder offered strong value at ~$0.22/task (43% accuracy), with MiniMax M2 and GPT-OSS-120b even cheaper at ~$0.15 and ~$0.10/task respectively.

Cost per Task

OpenAI
Deepseek
Google
Minimax
Qwen
Anthropic

Token Efficiency

Among frontier models, GPT-5 was the most token-efficient at ~793K per task, while Sonnet 4.5 consumed ~2.4M, over 3x more. GPT-5.2 (2.1M) and GPT-5.1 (1.2M) were notably less efficient than their predecessor. Among smaller models, GPT-5 Mini led at just ~87K tokens per task, followed by DeepSeek v3.2 (126K) and GPT-OSS-120b (198K).

Token Usage per Task

OpenAI
Deepseek
Google
Anthropic
Minimax

Reliability

Most models achieved 100% task completion, including GPT-5.2, GPT-5.1, GPT-5, Sonnet 4.5, Haiku 4.5, GPT-5 Mini, Qwen3 Coder, MiniMax M2, GPT-5 Nano, and DeepSeek v3.2. However, some models suffered from many unrecoverable errors, particularly GPT-OSS-120b (69% completion) and the Gemini models (Gemini 3.0 Pro at 92%, Gemini 2.5 Pro at 84%, and Gemini 2.5 Flash at 88%). This suggests potential struggles with long-context log investigation tasks.

Task Completion Rate

OpenAI
Anthropic
Qwen
Minimax
Deepseek

Model Recommendations

  • GPT-5.2 Best for most blue team investigations. Highest accuracy at 69% with 100% task completion reliability.
  • GPT-5.1 Best value for top-tier accuracy. Achieves 65% accuracy at ~1/3 the cost of Opus 4.5 with 100% reliability.
  • Claude Opus 4.5 Best for time-critical investigations where speed is paramount. Fastest model tested with 65% accuracy, though at higher cost.
  • Qwen3 Coder Best for cost-sensitive investigations requiring moderate accuracy. Achieves 43% accuracy at ~$0.22/task with 100% reliability.
  • Claude Haiku 4.5 Best for interactive triage and real-time alert enrichment. Good balance of speed (240s), accuracy (51%), and 100% reliability at low cost.

Caveats

  • Some questions were modified to include context that a human would have possessed in the original CTF setting.
  • Questions requiring web search or browser access were removed to focus the evaluation on SIEM-based investigation.
  • Cost estimates exclude prompt caching benefits, which can substantially reduce effective cost in production.

Scale security operations beyond headcount

Run Cotool's harness in your environment to get real security work done

Book a demo