Claude 4.0 vs 4.5 (n=5)
Generated: 2025-09-30 15:37 UTC
Total Duration: 6h 16m 42s
Iterations: 5
Judge (classifier) model: gpt-4o
About this Benchmark
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| gpt-4o | 295 | 174 | 56 | 525 | 🟡 63% (295/469) |
| gpt-4.1 | 346 | 122 | 57 | 525 | 🟡 74% (346/468) |
| gpt-5 | 360 | 104 | 61 | 525 | 🟡 78% (360/464) |
| sonnet-4-20250514 | 419 | 51 | 55 | 525 | 🟡 89% (419/470) |
| sonnet-4-5-20250929 | 420 | 50 | 55 | 525 | 🟡 89% (420/470) |
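The denominators in the Success Rate column equal Pass + Fail, which suggests that Skip/Error runs are left out of the rate. The following is a minimal sketch under that assumption; the function name `success_rate` is illustrative and is not part of HolmesGPT.

```python
def success_rate(passed: int, failed: int) -> float:
    """Percentage of passing runs among runs that produced a verdict.

    Assumption: skipped/errored runs are excluded from the denominator,
    as the (pass / pass+fail) fractions in the table suggest.
    """
    scored = passed + failed
    if scored == 0:
        raise ValueError("no scored runs")
    return 100.0 * passed / scored


# (pass, fail) counts copied from the accuracy table
counts = {
    "gpt-4o": (295, 174),
    "gpt-4.1": (346, 122),
    "gpt-5": (360, 104),
    "sonnet-4-20250514": (419, 51),
    "sonnet-4-5-20250929": (420, 50),
}

for model, (passed, failed) in counts.items():
    rate = success_rate(passed, failed)
    print(f"{model}: {rate:.0f}% ({passed}/{passed + failed})")
```

Rounded to whole percentages, these values match the Success Rate column.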
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-4o | 468 | $0.14 | $0.01 | $0.85 | $64.90 |
| gpt-4.1 | 468 | $0.11 | $0.02 | $1.07 | $52.00 |
| gpt-5 | 464 | $0.13 | $0.02 | $0.58 | $61.76 |
| sonnet-4-20250514 | 468 | $0.17 | $0.06 | $1.05 | $80.54 |
| sonnet-4-5-20250929 | 467 | $0.16 | $0.06 | $0.64 | $75.56 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| gpt-4o | 49.0 | 8.9 | 278.2 | 43.5 | 94.7 |
| gpt-4.1 | 53.8 | 5.2 | 236.8 | 48.2 | 109.3 |
| gpt-5 | 190.3 | 22.5 | 1136.0 | 158.1 | 442.5 |
| sonnet-4-20250514 | 89.6 | 10.4 | 879.7 | 64.8 | 231.5 |
| sonnet-4-5-20250929 | 73.0 | 10.6 | 663.3 | 60.0 | 154.6 |
Performance by Tag
Success rate by test category and model:
| Tag | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 | sonnet-4-5-20250929 | Warnings |
|---|---|---|---|---|---|---|
| chain-of-causation | 🔴 0% (0/30) | 🟡 3% (1/30) | 🟡 40% (12/30) | 🟡 63% (19/30) | 🟡 70% (21/30) | ⚠️ 50 skipped |
| context_window | 🟡 57% (20/35) | 🟡 77% (27/35) | 🟡 83% (29/35) | 🟡 86% (30/35) | 🟡 77% (27/35) | |
| counting | 🟢 100% (20/20) | 🟢 100% (20/20) | 🟡 95% (19/20) | 🟢 100% (20/20) | 🟢 100% (20/20) | |
| database | 🔴 0% (0/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | ⚠️ 75 skipped |
| datadog | 🟡 75% (15/20) | 🟡 80% (16/20) | 🟡 95% (18/19) | 🟢 100% (20/20) | 🟢 100% (20/20) | ⚠️ 1 skipped |
| datetime | 🟡 65% (13/20) | 🟡 65% (13/20) | 🟡 95% (19/20) | 🟡 75% (15/20) | 🟡 85% (17/20) | ⚠️ 50 skipped |
| easy | 🟡 97% (175/180) | 🟡 96% (173/180) | 🟡 80% (144/179) | 🟡 97% (174/180) | 🟡 96% (172/180) | ⚠️ 1 skipped |
| hard | 🟡 11% (8/70) | 🟡 29% (20/70) | 🟡 57% (40/70) | 🟡 77% (54/70) | 🟡 80% (56/70) | ⚠️ 150 skipped |
| kafka | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 50 skipped |
| kubernetes | 🟡 55% (129/235) | 🟡 71% (168/235) | 🟡 69% (163/235) | 🟡 89% (208/235) | 🟡 87% (205/235) | ⚠️ 25 skipped |
| logs | 🟡 62% (80/130) | 🟡 67% (87/129) | 🟡 77% (100/130) | 🟡 75% (98/130) | 🟡 82% (106/130) | ⚠️ 176 skipped |
| medium | 🟡 51% (112/219) | 🟡 70% (153/218) | 🟡 82% (176/215) | 🟡 87% (191/220) | 🟡 87% (192/220) | ⚠️ 133 skipped |
| network | 🟡 45% (9/20) | 🟡 60% (12/20) | 🟡 85% (17/20) | 🟢 100% (20/20) | 🟢 100% (20/20) | |
| numerical | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 29% (13/45) | 🟡 44% (20/45) | 🟡 53% (24/45) | 🟡 49% (22/45) | 🟡 42% (19/45) | |
| prometheus | 🟡 65% (13/20) | 🟡 95% (19/20) | 🟢 100% (20/20) | 🟢 100% (20/20) | 🟡 80% (16/20) | |
| question-answer | 🟢 100% (20/20) | 🟢 100% (20/20) | 🟡 95% (19/20) | 🟢 100% (20/20) | 🟢 100% (20/20) | |
| runbooks | 🟡 73% (22/30) | 🟡 73% (22/30) | 🟡 93% (28/30) | 🟢 100% (30/30) | 🟡 97% (29/30) | ⚠️ 25 skipped |
| slackbot | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 25 skipped |
| traces | 🔴 0% (0/25) | 🟡 4% (1/25) | 🟡 40% (10/25) | 🟡 56% (14/25) | 🟡 64% (16/25) | |
| transparency | 🟡 71% (50/70) | 🟡 71% (50/70) | 🟡 84% (59/70) | 🟡 81% (57/70) | 🟡 84% (59/70) | ⚠️ 25 skipped |
| Overall | 🟡 63% (295/469) | 🟡 74% (346/468) | 🟡 78% (360/464) | 🟡 89% (419/470) | 🟡 89% (420/470) | ⚠️ 284 skipped |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
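The pass-rate part of the color coding above can be sketched as a small function. This is an illustrative sketch, not HolmesGPT's reporting code: the name `status_icon` is hypothetical, and the white circle for evals with no scored runs is an assumption based on how the tables mark empty cells.

```python
GREEN = "\U0001F7E2"   # 🟢 passing 100% (stable)
YELLOW = "\U0001F7E1"  # 🟡 passing 1-99%
RED = "\U0001F534"     # 🔴 passing 0% (failing)
NO_DATA = "\u26AA"     # ⚪ no scored runs (assumption, see lead-in)


def status_icon(passed: int, total: int) -> str:
    """Map a pass count over scored runs to the legend's color."""
    if total == 0:
        return NO_DATA
    if passed == total:
        return GREEN
    if passed == 0:
        return RED
    return YELLOW


# With n=5 iterations, only 5/5 counts as stable.
for passed in (5, 4, 0):
    print(f"{status_icon(passed, 5)} {100 * passed // 5}% ({passed}/5)")
```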
Detailed Raw Results
| Eval ID | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 | sonnet-4-5-20250929 |
|---|---|---|---|---|---|
| 01_how_many_pods | 🟢 100% (5/5) / ⏱️ 31.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 33.2s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 43.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 33.6s / 💰 $0.08 |
| 02_what_is_wrong_with_pod | 🟢 100% (5/5) / ⏱️ 42.9s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 123.9s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 53.5s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 67.5s / 💰 $0.10 |
| 03_what_is_the_command_to_port_forward | 🟢 100% (5/5) / ⏱️ 61.2s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 53.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 68.9s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 43.0s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 51.9s / 💰 $0.09 |
| 04_related_k8s_events | 🟢 100% (5/5) / ⏱️ 39.4s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.06 | 🟡 80% (4/5) / ⏱️ 69.1s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 58.8s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 62.4s / 💰 $0.09 |
| 05_image_version | 🟢 100% (5/5) / ⏱️ 43.6s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 56.3s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 73.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 38.1s / 💰 $0.09 |
| 09_crashpod | 🟢 100% (5/5) / ⏱️ 43.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 37.3s / 💰 $0.06 | 🟡 80% (4/5) / ⏱️ 92.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 73.0s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 64.8s / 💰 $0.14 |
| 100a_historical_logs | 🔴 0% (0/5) / ⏱️ 52.3s / 💰 $0.12 | 🔴 0% (0/5) / ⏱️ 54.8s / 💰 $0.07 | 🔴 0% (0/5) / ⏱️ 500.6s / 💰 $0.29 | 🔴 0% (0/5) / ⏱️ 116.2s / 💰 $0.27 | 🔴 0% (0/5) / ⏱️ 98.0s / 💰 $0.19 |
| 100b_historical_logs_nonstandard_label | 🔴 0% (0/5) / ⏱️ 50.9s / 💰 $0.11 | 🔴 0% (0/5) / ⏱️ 57.5s / 💰 $0.07 | 🔴 0% (0/5) / ⏱️ 363.3s / 💰 $0.22 | 🔴 0% (0/5) / ⏱️ 157.0s / 💰 $0.18 | 🔴 0% (0/5) / ⏱️ 102.9s / 💰 $0.17 |
| 101_historical_logs_pod_deleted | 🔴 0% (0/5) / ⏱️ 53.2s / 💰 $0.12 | 🔴 0% (0/5) / ⏱️ 53.8s / 💰 $0.08 | 🔴 0% (0/5) / ⏱️ 268.6s / 💰 $0.16 | 🔴 0% (0/5) / ⏱️ 97.5s / 💰 $0.16 | 🔴 0% (0/5) / ⏱️ 85.9s / 💰 $0.15 |
| 103_logs_transparency_default_limit | 🔴 0% (0/5) / ⏱️ 63.1s / 💰 $0.15 | 🔴 0% (0/5) / ⏱️ 105.8s / 💰 $0.39 | 🟢 100% (5/5) / ⏱️ 137.1s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 81.0s / 💰 $0.41 | 🟢 100% (5/5) / ⏱️ 74.6s / 💰 $0.12 |
| 104a_postgres_root_issue | 🔴 0% (0/5) / ⏱️ 48.3s / 💰 $0.18 | 🟡 60% (3/5) / ⏱️ 85.6s / 💰 $0.35 | 🟢 100% (5/5) / ⏱️ 233.2s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 71.9s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 106.0s / 💰 $0.24 |
| 107_log_filter_http_status_code | 🟡 40% (2/5) / ⏱️ 54.0s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 57.4s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 472.2s / 💰 $0.30 | 🟢 100% (5/5) / ⏱️ 127.5s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 100.3s / 💰 $0.24 |
| 108_logs_nearby_lines | 🔴 0% (0/5) / ⏱️ 64.2s / 💰 $0.17 | 🔴 0% (0/5) / ⏱️ 57.6s / 💰 $0.23 | 🟡 40% (2/5) / ⏱️ 345.4s / 💰 $0.26 | 🟡 20% (1/5) / ⏱️ 111.3s / 💰 $0.36 | 🔴 0% (0/5) / ⏱️ 89.7s / 💰 $0.22 |
| 109_logs_transparency_not_found | 🟡 80% (4/5) / ⏱️ 47.2s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 44.5s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 135.7s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 44.4s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 48.1s / 💰 $0.10 |
| 10_image_pull_backoff | 🟢 100% (5/5) / ⏱️ 47.3s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 55.9s / 💰 $0.10 | 🟡 60% (3/5) / ⏱️ 99.9s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 59.4s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 60.0s / 💰 $0.13 |
| 110_k8s_events_image_pull | 🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 42.4s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 100.1s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 72.1s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 53.4s / 💰 $0.10 |
| 111_disabled_datadog_traces | 🔴 0% (0/5) / ⏱️ 40.5s / 💰 $0.03 | 🟡 60% (3/5) / ⏱️ 39.6s / 💰 $0.03 | 🟡 80% (4/5) / ⏱️ 235.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 87.4s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 44.8s / 💰 $0.06 |
| 111_pod_names_contain_service | 🟢 100% (5/5) / ⏱️ 71.3s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 68.3s / 💰 $0.10 | 🟡 40% (2/5) / ⏱️ 210.5s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 77.3s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 66.9s / 💰 $0.16 |
| 112_find_pvcs_by_uuid | 🔴 0% (0/5) / ⏱️ 45.8s / 💰 $0.12 | 🟡 20% (1/5) / ⏱️ 58.2s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 147.8s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 67.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 88.6s / 💰 $0.13 |
| 114_checkout_latency_tracing_rebuild[0] | 🔴 0% (0/5) / ⏱️ 69.7s / 💰 $0.20 | 🟡 20% (1/5) / ⏱️ 89.3s / 💰 $0.16 | 🟡 40% (2/5) / ⏱️ 377.2s / 💰 $0.34 | 🟡 20% (1/5) / ⏱️ 148.2s / 💰 $0.31 | 🟡 40% (2/5) / ⏱️ 173.2s / 💰 $0.52 |
| 115_checkout_errors_tracing[0] | 🔴 0% (0/5) / ⏱️ 87.3s / 💰 $0.22 | 🔴 0% (0/5) / ⏱️ 93.8s / 💰 $0.21 | 🟡 40% (2/5) / ⏱️ 265.8s / 💰 $0.20 | 🟡 20% (1/5) / ⏱️ 136.2s / 💰 $0.30 | 🟡 20% (1/5) / ⏱️ 255.3s / 💰 $0.51 |
| 11_init_containers | 🟢 100% (5/5) / ⏱️ 45.3s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 54.0s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 139.5s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 65.4s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 73.8s / 💰 $0.11 |
| 121_new_relic_checkout_errors_tracing[0] | 🔴 0% (0/5) / ⏱️ 35.5s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 40.7s / 💰 $0.05 | 🟡 60% (3/5) / ⏱️ 530.6s / 💰 $0.41 | 🟡 40% (2/5) / ⏱️ 189.5s / 💰 $0.48 | 🟡 60% (3/5) / ⏱️ 145.3s / 💰 $0.41 |
| 122_new_relic_checkout_latency_tracing_rebuild[0] | 🔴 0% (0/5) / ⏱️ 42.6s / 💰 $0.20 | 🔴 0% (0/5) / ⏱️ 65.4s / 💰 $0.25 | 🟡 40% (2/5) / ⏱️ 583.9s / 💰 $0.36 | 🟢 100% (5/5) / ⏱️ 293.4s / 💰 $0.41 | 🟢 100% (5/5) / ⏱️ 156.6s / 💰 $0.39 |
| 123_new_relic_checkout_errors_tracing[0] | 🔴 0% (0/5) / ⏱️ 63.9s / 💰 $0.11 | 🔴 0% (0/5) / ⏱️ 50.6s / 💰 $0.06 | 🟡 20% (1/5) / ⏱️ 343.2s / 💰 $0.31 | 🟢 100% (5/5) / ⏱️ 155.5s / 💰 $0.44 | 🟢 100% (5/5) / ⏱️ 124.7s / 💰 $0.37 |
| 12_job_crashing | 🟡 60% (3/5) / ⏱️ 49.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 41.9s / 💰 $0.08 | 🟡 80% (4/5) / ⏱️ 184.2s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 92.1s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 65.8s / 💰 $0.14 |
| 13a_pending_node_selector_basic | 🟢 100% (5/5) / ⏱️ 52.8s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 52.0s / 💰 $0.10 | 🟡 20% (1/5) / ⏱️ 84.3s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 119.2s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 54.6s / 💰 $0.11 |
| 13b_pending_node_selector_detailed | 🔴 0% (0/5) / ⏱️ 42.9s / 💰 $0.13 | 🟡 80% (4/5) / ⏱️ 45.4s / 💰 $0.09 | 🟡 40% (2/5) / ⏱️ 110.8s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 66.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 63.6s / 💰 $0.14 |
| 14_pending_resources | 🟢 100% (5/5) / ⏱️ 58.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 70.5s / 💰 $0.10 | 🟡 20% (1/5) / ⏱️ 70.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 114.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 80.2s / 💰 $0.13 |
| 159_prometheus_high_cardinality_cpu[0] | 🟢 100% (5/5) / ⏱️ 39.3s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 51.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 231.2s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 66.9s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 59.8s / 💰 $0.16 |
| 159_prometheus_high_cardinality_cpu[1] | 🟡 60% (3/5) / ⏱️ 43.7s / 💰 $0.20 | 🟡 80% (4/5) / ⏱️ 46.6s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 154.2s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 84.9s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 82.1s / 💰 $0.19 |
| 159_prometheus_high_cardinality_cpu[2] | 🔴 0% (0/5) / ⏱️ 35.1s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 50.6s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 130.8s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 155.1s / 💰 $0.22 | 🟡 20% (1/5) / ⏱️ 53.2s / 💰 $0.19 |
| 15_failed_readiness_probe | 🟢 100% (5/5) / ⏱️ 42.9s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 45.5s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 141.4s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 88.2s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 52.1s / 💰 $0.14 |
| 16_failed_no_toolset_found | 🔴 0% (0/5) / ⏱️ 46.5s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 38.1s / 💰 $0.03 | 🔴 0% (0/5) / ⏱️ 64.5s / 💰 $0.02 | 🟡 60% (3/5) / ⏱️ 38.1s / 💰 $0.06 | 🔴 0% (0/5) / ⏱️ 32.5s / 💰 $0.06 |
| 17_oom_kill | 🟢 100% (5/5) / ⏱️ 55.6s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 59.1s / 💰 $0.08 | 🟡 60% (3/5) / ⏱️ 116.0s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 71.5s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 61.3s / 💰 $0.12 |
| 19_detect_missing_app_details | 🟢 100% (5/5) / ⏱️ 78.8s / 💰 $0.44 | 🟡 80% (4/5) / ⏱️ 66.1s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 267.1s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 102.3s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 95.1s / 💰 $0.16 |
| 20_long_log_file_search | 🟢 100% (5/5) / ⏱️ 56.0s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 57.5s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 126.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 123.3s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 84.4s / 💰 $0.11 |
| 21_job_fail_curl_no_svc_account | 🟡 80% (4/5) / ⏱️ 51.1s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 79.9s / 💰 $0.16 | 🟡 80% (4/5) / ⏱️ 174.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 74.5s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 66.5s / 💰 $0.19 |
| 23_app_error_in_current_logs | 🟢 100% (5/5) / ⏱️ 82.7s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 91.4s / 💰 $0.30 | 🟢 100% (5/5) / ⏱️ 249.1s / 💰 $0.19 | 🟡 80% (4/5) / ⏱️ 78.8s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 76.9s / 💰 $0.17 |
| 24_misconfigured_pvc | 🟢 100% (5/5) / ⏱️ 60.4s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 89.5s / 💰 $0.13 | 🔴 0% (0/5) / ⏱️ 58.4s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 88.4s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 112.6s / 💰 $0.17 |
| 24a_misconfigured_pvc_basic | 🟡 80% (4/5) / ⏱️ 51.4s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 72.4s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 30.1s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 75.8s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 68.9s / 💰 $0.16 |
| 24b_misconfigured_pvc_detailed | 🔴 0% (0/5) / ⏱️ 55.8s / 💰 $0.18 | 🟡 20% (1/5) / ⏱️ 59.5s / 💰 $0.12 | 🟡 20% (1/5) / ⏱️ 93.6s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 89.7s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 195.9s / 💰 $0.17 |
| 25_misconfigured_ingress_class | 🔴 0% (0/5) / ⏱️ 48.5s / 💰 $0.13 | 🔴 0% (0/5) / ⏱️ 62.6s / 💰 $0.14 | 🟡 40% (2/5) / ⏱️ 187.9s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 121.1s / 💰 $0.26 | 🟢 100% (5/5) / ⏱️ 100.2s / 💰 $0.35 |
| 26_page_render_times | 🟢 100% (5/5) / ⏱️ 41.5s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 42.0s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 347.0s / 💰 $0.26 | 🟢 100% (5/5) / ⏱️ 73.5s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 48.4s / 💰 $0.16 |
| 27a_multi_container_logs | 🟢 100% (5/5) / ⏱️ 44.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 53.5s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 197.6s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 75.2s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 47.2s / 💰 $0.12 |
| 27b_multi_container_logs | 🟢 100% (5/5) / ⏱️ 56.3s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 64.4s / 💰 $0.08 | 🟡 80% (4/5) / ⏱️ 124.0s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 55.0s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 63.9s / 💰 $0.11 |
| 28_permissions_error | 🟡 60% (3/5) / ⏱️ 22.3s / 💰 $0.04 | 🟡 40% (2/5) / ⏱️ 26.9s / 💰 $0.05 | 🟡 40% (2/5) / ⏱️ 138.5s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 32.5s / 💰 $0.07 | 🔴 0% (0/5) / ⏱️ 27.3s / 💰 $0.07 |
| 33_cpu_metrics_discovery | 🟢 100% (5/5) / ⏱️ 46.7s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 58.9s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 266.9s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 76.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 59.9s / 💰 $0.13 |
| 39_failed_toolset | 🟢 100% (5/5) / ⏱️ 27.2s / 💰 $0.04 | 🟡 40% (2/5) / ⏱️ 40.8s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 251.5s / 💰 $0.19 | 🟡 80% (4/5) / ⏱️ 169.5s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 56.9s / 💰 $0.11 |
| 41_setup_argo | 🟡 80% (4/5) / ⏱️ 49.1s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 35.2s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 171.0s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 29.1s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 30.0s / 💰 $0.06 |
| 42_dns_issues_result_new_tools_no_runbook | 🟡 60% (3/5) / ⏱️ 55.0s / 💰 $0.22 | 🟡 60% (3/5) / ⏱️ 80.6s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 291.8s / 💰 $0.23 | 🟢 100% (5/5) / ⏱️ 163.9s / 💰 $0.36 | 🟢 100% (5/5) / ⏱️ 109.7s / 💰 $0.26 |
| 42_dns_issues_steps_new_tools | 🟢 100% (5/5) / ⏱️ 56.5s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 62.4s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 471.2s / 💰 $0.23 | 🟢 100% (5/5) / ⏱️ 165.8s / 💰 $0.26 | 🟢 100% (5/5) / ⏱️ 157.3s / 💰 $0.31 |
| 43_current_datetime_from_prompt | 🟢 100% (5/5) / ⏱️ 32.6s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 42.0s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 66.7s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 23.5s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 23.4s / 💰 $0.06 |
| 45_fetch_deployment_logs_simple | 🟢 100% (5/5) / ⏱️ 37.8s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 46.1s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 100.0s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 41.4s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 50.7s / 💰 $0.11 |
| 50a_logs_since_last_specific_month | 🟢 100% (5/5) / ⏱️ 41.9s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 51.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 314.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 54.4s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 47.2s / 💰 $0.09 |
| 51_logs_summarize_errors | 🟢 100% (5/5) / ⏱️ 45.9s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 46.7s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 133.0s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 159.3s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 55.1s / 💰 $0.10 |
| 52_logs_login_issues | 🟡 40% (2/5) / ⏱️ 84.3s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 78.5s / 💰 $0.38 | 🟡 60% (3/5) / ⏱️ 152.1s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 69.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 61.7s / 💰 $0.11 |
| 53_logs_find_term | 🟢 100% (5/5) / ⏱️ 37.7s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 46.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 107.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 50.9s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 53.2s / 💰 $0.13 |
| 54_not_truncated_when_getting_pods | 🟢 100% (5/5) / ⏱️ 58.6s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 69.7s / 💰 $0.11 | 🟡 80% (4/5) / ⏱️ 196.2s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 142.2s / 💰 $0.15 | 🟡 80% (4/5) / ⏱️ 65.7s / 💰 $0.11 |
| 57_wrong_namespace | 🔴 0% (0/5) / ⏱️ 40.6s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 47.2s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 145.2s / 💰 $0.08 | 🟡 60% (3/5) / ⏱️ 77.4s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 91.7s / 💰 $0.10 |
| 59_label_based_counting | 🟢 100% (5/5) / ⏱️ 33.8s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 32.9s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 77.8s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 51.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 34.9s / 💰 $0.08 |
| 60_count_less_than | 🟢 100% (5/5) / ⏱️ 85.4s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 56.3s / 💰 $0.06 | 🟡 80% (4/5) / ⏱️ 88.5s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 37.1s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 36.5s / 💰 $0.09 |
| 61_exact_match_counting | 🟢 100% (5/5) / ⏱️ 33.9s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 34.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 60.9s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 32.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 36.0s / 💰 $0.08 |
| 62_fetch_error_logs_with_errors | 🟢 100% (5/5) / ⏱️ 60.6s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 50.9s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 102.9s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 46.9s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 43.3s / 💰 $0.09 |
| 63_fetch_error_logs_no_errors | 🟢 100% (5/5) / ⏱️ 39.9s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 45.1s / 💰 $0.07 | 🟡 60% (3/5) / ⏱️ 138.8s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 46.6s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 39.6s / 💰 $0.07 |
| 64_keda_vs_hpa_confusion | 🔴 0% (0/5) / ⏱️ 71.8s / 💰 $0.42 | 🔴 0% (0/5) / ⏱️ 51.2s / 💰 $0.08 | 🟡 80% (4/5) / ⏱️ 191.3s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 112.3s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 93.1s / 💰 $0.20 |
| 65_health_check_followup | 🟢 100% (5/5) / ⏱️ 50.3s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 69.6s / 💰 $0.22 | 🟡 80% (4/5) / ⏱️ 277.0s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 328.5s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 94.4s / 💰 $0.27 |
| 71_connection_pool_starvation | 🟡 80% (4/5) / ⏱️ 47.1s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 49.2s / 💰 $0.10 | 🟡 20% (1/5) / ⏱️ 152.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 59.9s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 65.3s / 💰 $0.17 |
| 73a_time_window_anomaly | 🔴 0% (0/5) / ⏱️ 48.7s / 💰 $0.15 | 🟡 20% (1/5) / ⏱️ 58.6s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 187.9s / 💰 $0.13 | 🔴 0% (0/5) / ⏱️ 84.2s / 💰 $0.13 | 🟡 40% (2/5) / ⏱️ 81.8s / 💰 $0.15 |
| 73b_time_window_anomaly | 🟡 60% (3/5) / ⏱️ 56.0s / 💰 $0.16 | 🟡 40% (2/5) / ⏱️ 68.9s / 💰 $0.08 | 🟡 80% (4/5) / ⏱️ 165.5s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 189.3s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 67.7s / 💰 $0.14 |
| 76_service_discovery_issue | 🔴 0% (0/5) / ⏱️ 45.5s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 66.1s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 205.8s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 67.4s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 65.1s / 💰 $0.16 |
| 77_liveness_probe_misconfiguration | 🟡 40% (2/5) / ⏱️ 42.7s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 58.6s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 182.8s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 69.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 54.0s / 💰 $0.13 |
| 78a_missing_cpu_limits | 🔴 0% (0/5) / ⏱️ 49.4s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 59.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 206.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 72.8s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 65.1s / 💰 $0.14 |
| 78b_cpu_quota_exceeded | 🔴 0% (0/5) / ⏱️ 55.1s / 💰 $0.18 | 🟡 20% (1/5) / ⏱️ 49.1s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 152.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 73.5s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 61.6s / 💰 $0.14 |
| 79_configmap_mount_issue | 🟢 100% (5/5) / ⏱️ 42.6s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 47.1s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 197.3s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 61.4s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 69.9s / 💰 $0.12 |
| 80_pvc_storage_class_mismatch | 🔴 0% (0/5) / ⏱️ 76.1s / 💰 $0.12 | 🔴 0% (0/5) / ⏱️ 64.0s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 191.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 89.4s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 72.9s / 💰 $0.14 |
| 81_service_account_permission_denied | 🟡 20% (1/5) / ⏱️ 47.2s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 56.2s / 💰 $0.11 | 🟡 80% (4/5) / ⏱️ 198.0s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 103.5s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 73.0s / 💰 $0.17 |
| 82_pod_anti_affinity_conflict | 🟢 100% (5/5) / ⏱️ 55.6s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 61.8s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 173.8s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 77.3s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 108.8s / 💰 $0.14 |
| 83_secret_not_found | 🟢 100% (5/5) / ⏱️ 44.7s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 44.8s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 185.9s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 81.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 85.1s / 💰 $0.12 |
| 84_network_policy_blocking_traffic | 🟡 20% (1/5) / ⏱️ 47.4s / 💰 $0.18 | 🟡 80% (4/5) / ⏱️ 58.1s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 226.9s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 131.7s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 85.7s / 💰 $0.23 |
| 85_hpa_not_scaling | 🔴 0% (0/5) / ⏱️ 42.0s / 💰 $0.11 | 🟡 80% (4/5) / ⏱️ 60.3s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 183.9s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 67.2s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 68.2s / 💰 $0.17 |
| 86_configmap_like_but_secret | 🟡 80% (4/5) / ⏱️ 50.8s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 58.3s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 227.5s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 76.0s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 158.1s / 💰 $0.15 |
| 89_runbook_missing_cloudwatch | 🟡 80% (4/5) / ⏱️ 44.0s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 31.9s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 258.4s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 55.3s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 47.5s / 💰 $0.11 |
| 90_runbook_basic_selection | 🟢 100% (5/5) / ⏱️ 58.0s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 71.2s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 365.1s / 💰 $0.29 | 🟢 100% (5/5) / ⏱️ 216.9s / 💰 $0.49 | 🟡 80% (4/5) / ⏱️ 138.6s / 💰 $0.47 |
| 91f_datadog_logs_historical_pod | 🔴 0% (0/5) / ⏱️ 46.0s / 💰 $0.16 | 🟡 20% (1/5) / ⏱️ 64.1s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 302.2s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 74.8s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 67.1s / 💰 $0.14 |
| 93_calling_datadog[0] | 🟢 100% (5/5) / ⏱️ 61.2s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 15.6s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 54.2s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 13.7s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 12.5s / 💰 $0.15 |
| 93_calling_datadog[1] | 🟢 100% (5/5) / ⏱️ 73.2s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 12.9s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 63.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 20.4s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 11.8s / 💰 $0.15 |
| 94_runbook_transparency | 🟢 100% (5/5) / ⏱️ 60.9s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 85.7s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 309.8s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 116.3s / 💰 $0.23 | 🟢 100% (5/5) / ⏱️ 94.6s / 💰 $0.24 |
| 96_no_matching_runbook | 🔴 0% (0/5) / ⏱️ 56.5s / 💰 $0.22 | 🔴 0% (0/5) / ⏱️ 128.6s / 💰 $0.55 | 🟡 60% (3/5) / ⏱️ 304.2s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 203.2s / 💰 $0.57 | 🟢 100% (5/5) / ⏱️ 119.7s / 💰 $0.27 |
| 97_logs_clarification_needed | 🟢 100% (5/5) / ⏱️ 18.7s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 30.9s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 32.2s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 95.0s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 21.8s / 💰 $0.06 |
| 99_logs_transparency_custom_time | 🟢 100% (5/5) / ⏱️ 38.0s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 46.2s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 99.7s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 89.6s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 95.6s / 💰 $0.11 |
| 50_logs_since_specific_date | 🟢 100% (5/5) / ⏱️ 20.3s / 💰 $0.10 | 🟢 100% (4/4) / ⏱️ 25.4s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 105.6s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 35.3s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 28.8s / 💰 $0.10 |
| 93_calling_datadog[2] | 🟢 100% (5/5) / ⏱️ 57.6s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 15.2s / 💰 $0.08 | 🟢 100% (4/4) / ⏱️ 72.4s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 13.1s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 11.9s / 💰 $0.15 |
| 93_events_since_specific_date | 🟢 100% (4/4) / ⏱️ 20.2s / 💰 $0.10 | 🟢 100% (4/4) / ⏱️ 19.0s / 💰 $0.06 | ⚪️ - | 🟢 100% (5/5) / ⏱️ 24.3s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 21.0s / 💰 $0.10 |
| 44_slack_statefulset_logs | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 48_logs_since_thursday | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 22_high_latency_dbi_down | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 08_sock_shop_frontend | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104b_postgres_missing_index_pgstat | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104c_postgres_minimal_missing_index | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 105_redis_wrong_data_structure | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 156_kafka_opensearch_latency | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 43_slack_deployment_logs | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 55_kafka_runbook | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 98_logs_transparency_default_time | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
Results are automatically generated and updated weekly. View full traces and detailed analysis in the Braintrust experiment: local-benchmark-20250930-092035.