
⚑ January 04, 2026

Generated: 2026-01-04 17:43 UTC
Total Duration: 43m 0s
Iterations: 5
Judge (classifier) model: gpt-4.1

About this Benchmark

Fast Benchmark: quick regression tests, selected with the markers regression or benchmark, designed to run frequently and catch regressions.

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.

Model Accuracy Comparison

| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-3.1 | 49 | 21 | 0 | 70 | 🟡 70% (49/70) |
| gpt-5 | 37 | 33 | 0 | 70 | 🟡 53% (37/70) |
| gpt-5.1 | 36 | 34 | 0 | 70 | 🟡 51% (36/70) |
| haiku-4.5 | 44 | 26 | 0 | 70 | 🟡 63% (44/70) |
| sonnet-4.5 | 59 | 11 | 0 | 70 | 🟡 84% (59/70) |
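
The Success Rate column is passes divided by total runs, shown as a whole percentage. With 14 evals run for 5 iterations each, every model gets 70 runs. The following is a minimal sketch that recomputes the column from the pass counts above; the function name `success_rate` is illustrative, and rounding to the nearest integer is an assumption about how the report formats the value.

```python
# Pass counts copied from the Model Accuracy Comparison table.
PASSES = {
    "deepseek-3.1": 49,
    "gpt-5": 37,
    "gpt-5.1": 36,
    "haiku-4.5": 44,
    "sonnet-4.5": 59,
}

EVALS = 14       # number of eval IDs listed under Raw Results
ITERATIONS = 5   # from the report header
TOTAL_RUNS = EVALS * ITERATIONS  # runs per model


def success_rate(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage (assumed rounding)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


rates = {model: success_rate(p, TOTAL_RUNS) for model, p in PASSES.items()}

# Print models from highest to lowest success rate.
for model, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{model:<14} {rate}% ({PASSES[model]}/{TOTAL_RUNS})")
```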

Model Cost Comparison

| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-5 | 70 | $0.06 | $0.00 | $0.28 | $4.38 |
| gpt-5.1 | 65 | $0.11 | $0.00 | $0.39 | $7.46 |
| haiku-4.5 | 70 | $0.04 | $0.00 | $0.11 | $2.84 |
| sonnet-4.5 | 70 | $0.17 | $0.01 | $0.34 | $11.66 |
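
One way to read the cost and accuracy tables together is cost per passing run. This is a derived figure that the report itself does not state, so treat the sketch below as an illustration only. deepseek-3.1 is left out because the cost table has no row for it, and the gpt-5.1 figure is approximate because cost was recorded for 65 of its 70 runs.

```python
# Total cost (USD) from the Model Cost Comparison table and pass counts
# from the Model Accuracy Comparison table.
TOTAL_COST = {"gpt-5": 4.38, "gpt-5.1": 7.46, "haiku-4.5": 2.84, "sonnet-4.5": 11.66}
PASSES = {"gpt-5": 37, "gpt-5.1": 36, "haiku-4.5": 44, "sonnet-4.5": 59}


def cost_per_pass(total_cost: float, passed: int):
    """Return USD spent per passing run, or None when nothing passed."""
    if passed == 0:
        return None
    return total_cost / passed


per_pass = {m: cost_per_pass(TOTAL_COST[m], PASSES[m]) for m in TOTAL_COST}

# Print models from cheapest to most expensive per passing run.
for model, cost in sorted(per_pass.items(), key=lambda kv: kv[1]):
    print(f"{model:<12} ${cost:.3f} per passing run")
```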

Model Latency Comparison

| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-3.1 | 87.8 | 6.6 | 167.8 | 93.2 | 143.1 |
| gpt-5 | 33.1 | 3.7 | 639.3 | 25.7 | 46.6 |
| gpt-5.1 | 89.1 | 6.5 | 338.4 | 74.4 | 202.3 |
| haiku-4.5 | 27.5 | 3.1 | 59.8 | 30.4 | 42.2 |
| sonnet-4.5 | 43.6 | 7.6 | 94.5 | 46.6 | 66.2 |
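
P50 and P95 are percentiles of the per-run latencies. They are less sensitive to single outliers than the average, which helps explain rows such as gpt-5, where the maximum is 639.3 s but P95 is 46.6 s. The report does not say which percentile method it uses, so the sketch below assumes the nearest-rank definition and runs on made-up latencies, not benchmark data.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 69 ordinary runs plus one slow outlier.
latencies = [20.0 + i for i in range(69)] + [639.3]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"P50={p50}s  P95={p95}s  Max={max(latencies)}s")
```

With 70 samples, the nearest-rank P95 is the 67th smallest value, so the single outlier affects the maximum and the average but not P95.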

Performance by Tag

Success rate by test category and model:

| Tag | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 | Warnings |
|---|---|---|---|---|---|---|
| benchmark | 🟡 44% (11/25) | 🟡 48% (12/25) | 🟡 36% (9/25) | 🟡 28% (7/25) | 🟡 56% (14/25) | |
| context_window | 🟡 90% (9/10) | 🟡 90% (9/10) | 🟡 50% (5/10) | 🟡 20% (2/10) | 🟡 90% (9/10) | |
| counting | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| datetime | 🟡 87% (13/15) | 🟡 87% (13/15) | 🟡 67% (10/15) | 🟡 47% (7/15) | 🟡 93% (14/15) | |
| easy | 🟡 89% (31/35) | 🟡 63% (22/35) | 🟡 66% (23/35) | 🟡 86% (30/35) | 🟢 100% (35/35) | |
| grafana-dashboard | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| hard | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| kubernetes | 🟡 71% (25/35) | 🟡 43% (15/35) | 🟡 46% (16/35) | 🟡 63% (22/35) | 🟡 86% (30/35) | |
| logs | 🟡 60% (12/20) | 🟡 70% (14/20) | 🟡 50% (10/20) | 🟡 35% (7/20) | 🟡 70% (14/20) | |
| loki | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | |
| medium | 🟡 60% (18/30) | 🟡 50% (15/30) | 🟡 43% (13/30) | 🟡 47% (14/30) | 🟡 80% (24/30) | |
| metrics | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟡 40% (2/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 30% (3/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 40% (4/10) | 🟡 50% (5/10) | |
| question-answer | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| regression | 🟡 84% (38/45) | 🟡 56% (25/45) | 🟡 60% (27/45) | 🟡 82% (37/45) | 🟢 100% (45/45) | |
| runbooks | 🟡 40% (4/10) | 🟡 60% (6/10) | 🟡 60% (6/10) | 🟡 70% (7/10) | 🟢 100% (10/10) | |
| Overall | 🟡 70% (49/70) | 🟡 53% (37/70) | 🟡 51% (36/70) | 🟡 63% (44/70) | 🟡 84% (59/70) | |
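
The per-tag totals add up to far more than the 70 runs in the Overall row, which suggests that an eval can carry several tags and that each run counts once toward every tag it has. The following is a minimal sketch of that kind of aggregation. The eval IDs, tag assignments, and pass counts are hypothetical and do not reflect the real eval metadata.

```python
from collections import defaultdict

ITERATIONS = 5  # from the report header

# Hypothetical evals: (eval id, tags, passes out of ITERATIONS runs).
RESULTS = [
    ("eval_a", {"regression", "easy", "kubernetes"}, 5),
    ("eval_b", {"regression", "logs"}, 3),
    ("eval_c", {"benchmark", "logs", "hard"}, 0),
]


def aggregate_by_tag(results, iterations):
    """Sum passes and runs per tag; a run counts toward each of its tags."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for _eval_id, tags, passed in results:
        for tag in tags:
            totals[tag][0] += passed
            totals[tag][1] += iterations
    return {tag: (p, t) for tag, (p, t) in totals.items()}


by_tag = aggregate_by_tag(RESULTS, ITERATIONS)

for tag in sorted(by_tag):
    passed, total = by_tag[tag]
    print(f"{tag:<12} {round(100 * passed / total)}% ({passed}/{total})")
```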

Raw Results

Status of all evaluations across models. Color coding:

  • 🟢 Passing 100% (stable)
  • 🟡 Passing 1-99%
  • 🔴 Passing 0% (failing)
  • 🔧 Mock data failure (missing or invalid test data)
  • ⚠️ Setup failure (environment/infrastructure issue)
  • ⏱️ Timeout or rate limit error
  • ⏭️ Test skipped (e.g., known issue or precondition not met)
| Eval ID | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|
| 09_crashpod | 🟢 | 🟡 | 🟡 | 🟢 | 🟢 |
| 101_loki_historical_logs_pod_deleted | 🟡 | 🟢 | 🟢 | 🟡 | 🟢 |
| 108_logs_nearby_lines | 🔴 | 🔴 | 🔴 | 🟡 | 🔴 |
| 111_pod_names_contain_service | 🟢 | 🔴 | 🟡 | 🟡 | 🟢 |
| 12_job_crashing | 🟡 | 🟡 | 🟡 | 🟢 | 🟢 |
| 162_get_runbooks | 🟡 | 🟡 | 🟡 | 🟡 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks | 🟢 | 🟢 | 🟡 | 🟢 | 🟢 |
| 179_grafana_big_dashboard_query | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 |
| 24_misconfigured_pvc | 🟢 | 🔴 | 🔴 | 🟡 | 🟢 |
| 43_current_datetime_from_prompt | 🟡 | 🟡 | 🟢 | 🟢 | 🟢 |
| 61_exact_match_counting | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 73a_time_window_anomaly | 🟡 | 🟢 | 🟡 | 🟡 | 🟡 |
| 73b_time_window_anomaly | 🟢 | 🟡 | 🟡 | 🟡 | 🟢 |
| 96_no_matching_runbook | 🟡 | 🟡 | 🟡 | 🟡 | 🟢 |
| SUMMARY | 🟡 70% (49/70) | 🟡 53% (37/70) | 🟡 51% (36/70) | 🟡 63% (44/70) | 🟡 84% (59/70) |
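
The three pass-rate statuses in the color coding can be derived from the pass count alone. The following is a minimal sketch of that mapping; the function name `status_icon` is illustrative. The remaining statuses (mock data failure, setup failure, timeout, skipped) depend on error information that the table does not show, so they are not modeled here.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the report's pass-rate color coding."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return "🟢"  # passing 100% (stable)
    if passed == 0:
        return "🔴"  # passing 0% (failing)
    return "🟡"      # passing 1-99%


# Pass counts out of 5 iterations, as examples.
for passed in (5, 3, 0):
    print(f"{passed}/5 -> {status_icon(passed, 5)}")
```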

Detailed Raw Results

| Eval ID | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|
| 09_crashpod | 🟢 100% (5/5) / ⏱️ 83.5s | 🟡 40% (2/5) / ⏱️ 15.1s / 💰 $0.04 | 🟡 60% (3/5) / ⏱️ 49.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 26.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 38.1s / 💰 $0.13 |
| 101_loki_historical_logs_pod_deleted | 🟡 60% (3/5) / ⏱️ 110.9s | 🟢 100% (5/5) / ⏱️ 29.9s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 196.8s / 💰 $0.24 | 🟡 80% (4/5) / ⏱️ 36.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 45.9s / 💰 $0.17 |
| 108_logs_nearby_lines | 🔴 0% (0/5) / ⏱️ 134.4s | 🔴 0% (0/5) / ⏱️ 33.2s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 131.9s / 💰 $0.18 | 🟡 20% (1/5) / ⏱️ 36.5s / 💰 $0.06 | 🔴 0% (0/5) / ⏱️ 63.4s / 💰 $0.21 |
| 111_pod_names_contain_service | 🟢 100% (5/5) / ⏱️ 95.7s | 🔴 0% (0/5) / ⏱️ 12.4s / 💰 $0.02 | 🟡 40% (2/5) / ⏱️ 67.5s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 24.0s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 53.5s / 💰 $0.23 |
| 12_job_crashing | 🟡 80% (4/5) / ⏱️ 95.0s | 🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.07 | 🟡 20% (1/5) / ⏱️ 85.4s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 31.0s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 48.8s / 💰 $0.16 |
| 162_get_runbooks | 🟡 40% (2/5) / ⏱️ 82.2s | 🟡 60% (3/5) / ⏱️ 159.8s / 💰 $0.13 | 🟡 40% (2/5) / ⏱️ 96.1s / 💰 $0.12 | 🟡 60% (3/5) / ⏱️ 36.0s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 50.5s / 💰 $0.21 |
| 176_network_policy_blocking_traffic_no_runbooks | 🟢 100% (5/5) / ⏱️ 93.0s | 🟢 100% (5/5) / ⏱️ 35.3s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 91.9s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 37.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 49.4s / 💰 $0.21 |
| 179_grafana_big_dashboard_query | 🔴 0% (0/5) / ⏱️ 94.4s | 🔴 0% (0/5) / ⏱️ 21.2s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 175.0s / 💰 $0.20 | 🔴 0% (0/5) / ⏱️ 29.5s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 34.4s / 💰 $0.13 |
| 24_misconfigured_pvc | 🟢 100% (5/5) / ⏱️ 103.6s | 🔴 0% (0/5) / ⏱️ 4.3s / 💰 $0.00 | 🔴 0% (0/5) / ⏱️ 16.9s / 💰 $0.01 | 🟡 20% (1/5) / ⏱️ 9.9s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 42.5s / 💰 $0.14 |
| 43_current_datetime_from_prompt | 🟡 80% (4/5) / ⏱️ 12.9s | 🟡 80% (4/5) / ⏱️ 7.4s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 24.0s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 6.3s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.01 |
| 61_exact_match_counting | 🟢 100% (5/5) / ⏱️ 35.6s | 🟢 100% (5/5) / ⏱️ 16.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 49.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 14.0s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 19.1s / 💰 $0.06 |
| 73a_time_window_anomaly | 🟡 80% (4/5) / ⏱️ 95.6s | 🟢 100% (5/5) / ⏱️ 27.5s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 93.0s / 💰 $0.13 | 🟡 20% (1/5) / ⏱️ 31.8s / 💰 $0.05 | 🟡 80% (4/5) / ⏱️ 48.6s / 💰 $0.20 |
| 73b_time_window_anomaly | 🟢 100% (5/5) / ⏱️ 93.2s | 🟡 80% (4/5) / ⏱️ 28.2s / 💰 $0.07 | 🟡 20% (1/5) / ⏱️ 37.3s / 💰 $0.03 | 🟡 20% (1/5) / ⏱️ 32.9s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 46.3s / 💰 $0.18 |
| 96_no_matching_runbook | 🟡 40% (2/5) / ⏱️ 99.9s | 🟡 60% (3/5) / ⏱️ 41.0s / 💰 $0.13 | 🟡 80% (4/5) / ⏱️ 132.4s / 💰 $0.17 | 🟡 80% (4/5) / ⏱️ 33.5s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 60.6s / 💰 $0.30 |
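
Each cell in the detailed table packs a pass rate, a duration in seconds, and (where available) a cost into one string. For further analysis it can help to split a cell into separate fields. The following is a minimal sketch using a regular expression; `parse_cell` and the returned field names are illustrative and are not part of HolmesGPT.

```python
import re

# Matches e.g. "60% (3/5) ... 49.8s ... $0.06"; the cost part is optional
# because some cells (the deepseek-3.1 column) carry no cost.
CELL_RE = re.compile(
    r"(?P<pct>\d+)%\s*\((?P<passed>\d+)/(?P<total>\d+)\)"
    r".*?(?P<seconds>\d+(?:\.\d+)?)s"
    r"(?:.*?\$(?P<cost>\d+(?:\.\d+)?))?"
)


def parse_cell(cell: str):
    """Split one detailed-results cell into typed fields, or return None."""
    match = CELL_RE.search(cell)
    if match is None:
        return None
    cost = match.group("cost")
    return {
        "pct": int(match.group("pct")),
        "passed": int(match.group("passed")),
        "total": int(match.group("total")),
        "seconds": float(match.group("seconds")),
        "cost": float(cost) if cost is not None else None,
    }


with_cost = parse_cell("🟡 60% (3/5) / ⏱️ 49.8s / 💰 $0.06")
without_cost = parse_cell("🟢 100% (5/5) / ⏱️ 83.5s")
print(with_cost)
print(without_cost)
```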

Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment local-benchmark-20260101-140005.