September 30, 2025
Generated: 2025-09-30 08:59 UTC
Total Duration: 1h 36m 3s
Iterations: 1
Judge (classifier) model: gpt-4o
About this Benchmark
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| gpt-4o | 65 | 29 | 11 | 105 | 🟡 69% (65/94) |
| gpt-4.1 | 70 | 24 | 11 | 105 | 🟡 74% (70/94) |
| gpt-5 | 74 | 19 | 12 | 105 | 🟡 80% (74/93) |
| sonnet-4-20250514 | 91 | 3 | 11 | 105 | 🟡 97% (91/94) |
| sonnet-4-5-20250929 | 87 | 7 | 11 | 105 | 🟡 93% (87/94) |
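The Success Rate column appears to be computed over scored runs only: passes divided by passes plus failures, with skipped or errored runs left out (for example, 65/94 for gpt-4o, where 94 = 65 + 29). The arithmetic can be sketched in Python as below. The function name `success_rate` and the rounding to the nearest whole percent are assumptions chosen to reproduce the figures in the table; this is not the benchmark's actual reporting code.

```python
def success_rate(passed: int, failed: int) -> int:
    """Percentage of scored runs that passed, rounded to a whole percent.

    Assumption: skipped/errored runs are excluded from the denominator,
    which matches the (pass/total) fractions shown in the table.
    """
    scored = passed + failed
    if scored == 0:
        raise ValueError("no scored runs")
    return round(100 * passed / scored)


# (pass, fail) counts copied from the Model Accuracy Comparison table
results = {
    "gpt-4o": (65, 29),
    "gpt-4.1": (70, 24),
    "gpt-5": (74, 19),
    "sonnet-4-20250514": (91, 3),
    "sonnet-4-5-20250929": (87, 7),
}

rates = {model: success_rate(p, f) for model, (p, f) in results.items()}
for model, rate in rates.items():
    print(f"{model}: {rate}%")
```

Because the denominator excludes skips and errors, gpt-5 is scored out of 93 runs while the other models are scored out of 94.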
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-4o | 94 | $0.13 | $0.03 | $0.43 | $12.59 |
| gpt-4.1 | 94 | $0.11 | $0.02 | $0.46 | $9.99 |
| gpt-5 | 93 | $0.13 | $0.02 | $0.47 | $12.12 |
| sonnet-4-20250514 | 94 | $0.17 | $0.06 | $0.58 | $15.66 |
| sonnet-4-5-20250929 | 92 | $0.16 | $0.06 | $0.58 | $14.88 |
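The Avg Cost column is consistent with Total Cost divided by the number of tests that recorded a cost. The sketch below re-derives it and also computes a cost per passed eval by combining the cost and accuracy tables. The cost-per-pass figure is a derived illustration and not a metric the benchmark reports; the helper name `per_unit` is hypothetical. The published totals are already rounded to cents, so recomputed values could differ from the report's in the last digit.

```python
# (tests with a recorded cost, total cost in USD, passed evals),
# copied from the cost and accuracy tables above
costs = {
    "gpt-4o": (94, 12.59, 65),
    "gpt-4.1": (94, 9.99, 70),
    "gpt-5": (93, 12.12, 74),
    "sonnet-4-20250514": (94, 15.66, 91),
    "sonnet-4-5-20250929": (92, 14.88, 87),
}


def per_unit(total_usd: float, count: int) -> float:
    """Divide a total cost by a count and round to cents."""
    return round(total_usd / count, 2)


avg_cost = {m: per_unit(total, tests) for m, (tests, total, _) in costs.items()}
cost_per_pass = {m: per_unit(total, passed) for m, (_, total, passed) in costs.items()}

for model in costs:
    print(f"{model}: avg ${avg_cost[model]:.2f}, per pass ${cost_per_pass[model]:.2f}")
```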
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| gpt-4o | 36.7 | 9.4 | 85.6 | 36.1 | 56.5 |
| gpt-4.1 | 51.9 | 11.7 | 641.0 | 43.3 | 79.0 |
| gpt-5 | 170.2 | 24.3 | 697.1 | 144.3 | 391.2 |
| sonnet-4-20250514 | 73.2 | 11.6 | 654.9 | 55.7 | 160.2 |
| sonnet-4-5-20250929 | 69.5 | 10.3 | 694.5 | 53.5 | 152.7 |
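For readers unfamiliar with the P50 and P95 columns, the sketch below shows one common way to compute percentiles from per-eval durations, the nearest-rank method. The report does not state which percentile method it uses, so this is an assumption, and the function name `percentile` is hypothetical. The sample durations are the gpt-4o timings of the first five evals in Detailed Raw Results, used purely for illustration; the output will not match the table, which covers all scored evals.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative sample only: gpt-4o durations (seconds) for the first five evals
sample = [27.3, 27.0, 27.5, 33.5, 39.1]

p50 = percentile(sample, 50)
p95 = percentile(sample, 95)
print(f"P50={p50}s P95={p95}s")
```

A high Max with a moderate P95 (for example, gpt-4.1 at 641.0 s max versus 79.0 s P95) indicates a small number of outlier runs and not uniformly slow responses.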
Performance by Tag
Success rate by test category and model:
| Tag | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 | sonnet-4-5-20250929 | Warnings |
|---|---|---|---|---|---|---|
| chain-of-causation | 🔴 0% (0/6) | 🔴 0% (0/6) | 🟡 33% (2/6) | 🟢 100% (6/6) | 🟢 100% (6/6) | ⚠️ 10 skipped |
| context_window | 🟡 86% (6/7) | 🟡 43% (3/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟡 86% (6/7) | |
| counting | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| database | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | ⚠️ 15 skipped |
| datadog | 🟡 75% (3/4) | 🟢 100% (4/4) | 🟡 75% (3/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| datetime | 🟢 100% (4/4) | 🟡 50% (2/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | ⚠️ 10 skipped |
| easy | 🟡 97% (35/36) | 🟢 100% (36/36) | 🟡 83% (30/36) | 🟢 100% (36/36) | 🟡 97% (35/36) | |
| hard | 🟡 14% (2/14) | 🟡 36% (5/14) | 🟡 50% (7/14) | 🟢 100% (14/14) | 🟡 93% (13/14) | ⚠️ 30 skipped |
| kafka | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 10 skipped |
| kubernetes | 🟡 60% (28/47) | 🟡 70% (33/47) | 🟡 72% (34/47) | 🟡 98% (46/47) | 🟡 91% (43/47) | ⚠️ 5 skipped |
| logs | 🟡 69% (18/26) | 🟡 69% (18/26) | 🟡 85% (22/26) | 🟡 92% (24/26) | 🟡 88% (23/26) | ⚠️ 35 skipped |
| medium | 🟡 64% (28/44) | 🟡 66% (29/44) | 🟡 86% (37/43) | 🟡 93% (41/44) | 🟡 89% (39/44) | ⚠️ 26 skipped |
| network | 🟡 75% (3/4) | 🟡 25% (1/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟡 75% (3/4) | |
| numerical | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| port-forward | 🟡 44% (4/9) | 🟡 44% (4/9) | 🟡 78% (7/9) | 🟡 89% (8/9) | 🟡 67% (6/9) | |
| prometheus | 🟡 75% (3/4) | 🟡 75% (3/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟡 75% (3/4) | |
| question-answer | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| runbooks | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟢 100% (6/6) | 🟡 83% (5/6) | ⚠️ 5 skipped |
| slackbot | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 5 skipped |
| traces | 🔴 0% (0/5) | 🔴 0% (0/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| transparency | 🟡 71% (10/14) | 🟡 86% (12/14) | 🟡 93% (13/14) | 🟡 86% (12/14) | 🟡 93% (13/14) | ⚠️ 5 skipped |
| Overall | 🟡 69% (65/94) | 🟡 74% (70/94) | 🟡 80% (74/93) | 🟡 97% (91/94) | 🟡 93% (87/94) | ⚠️ 56 skipped |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
Detailed Raw Results
| Eval ID | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 | sonnet-4-5-20250929 |
|---|---|---|---|---|---|
| 01_how_many_pods | 🟢 100% (1/1) / ⏱️ 27.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 30.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 50.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 26.7s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.08 |
| 02_what_is_wrong_with_pod | 🟢 100% (1/1) / ⏱️ 27.0s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 40.0s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 137.4s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 43.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 34.7s / 💰 $0.09 |
| 03_what_is_the_command_to_port_forward | 🟢 100% (1/1) / ⏱️ 27.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 41.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 231.3s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 35.4s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 36.3s / 💰 $0.09 |
| 04_related_k8s_events | 🟢 100% (1/1) / ⏱️ 33.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 37.2s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 64.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 36.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 36.1s / 💰 $0.09 |
| 05_image_version | 🟢 100% (1/1) / ⏱️ 39.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 34.7s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 29.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 36.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 32.0s / 💰 $0.09 |
| 09_crashpod | 🟢 100% (1/1) / ⏱️ 42.2s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 39.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 137.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 160.2s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 55.0s / 💰 $0.14 |
| 100a_historical_logs | 🟢 100% (1/1) / ⏱️ 46.8s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 45.0s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 462.7s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 78.2s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 154.9s / 💰 $0.23 |
| 100b_historical_logs_nonstandard_label | 🔴 0% (0/1) / ⏱️ 38.8s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 39.7s / 💰 $0.08 | 🔴 0% (0/1) / ⏱️ 398.3s / 💰 $0.29 | 🔴 0% (0/1) / ⏱️ 136.4s / 💰 $0.27 | 🔴 0% (0/1) / ⏱️ 88.3s / 💰 $0.19 |
| 101_historical_logs_pod_deleted | 🔴 0% (0/1) / ⏱️ 35.9s / 💰 $0.13 | 🔴 0% (0/1) / ⏱️ 73.8s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 333.2s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 140.9s / 💰 $0.20 | 🔴 0% (0/1) / ⏱️ 71.9s / 💰 $0.15 |
| 103_logs_transparency_default_limit | 🔴 0% (0/1) / ⏱️ 36.7s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 80.7s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 98.6s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 67.0s / 💰 $0.41 | 🟢 100% (1/1) / ⏱️ 50.8s / 💰 $0.12 |
| 104a_postgres_root_issue | 🔴 0% (0/1) / ⏱️ 39.5s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 68.6s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 190.6s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 71.7s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 60.3s / 💰 $0.20 |
| 107_log_filter_http_status_code | 🟢 100% (1/1) / ⏱️ 45.5s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 45.8s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 235.8s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 69.1s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 72.7s / 💰 $0.27 |
| 108_logs_nearby_lines | 🔴 0% (0/1) / ⏱️ 36.5s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 62.5s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 293.6s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 72.8s / 💰 $0.21 | 🔴 0% (0/1) / ⏱️ 88.1s / 💰 $0.29 |
| 109_logs_transparency_not_found | 🔴 0% (0/1) / ⏱️ 85.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 34.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 140.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 347.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 51.1s / 💰 $0.10 |
| 10_image_pull_backoff | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 47.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 185.1s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 50.2s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 50.3s / 💰 $0.13 |
| 110_k8s_events_image_pull | 🟢 100% (1/1) / ⏱️ 31.9s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 37.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 102.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 40.0s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 44.7s / 💰 $0.10 |
| 111_disabled_datadog_traces | 🔴 0% (0/1) / ⏱️ 16.9s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 29.5s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 181.3s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 93.1s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 23.0s / 💰 $0.06 |
| 111_pod_names_contain_service | 🟢 100% (1/1) / ⏱️ 42.2s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 66.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 253.0s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 72.3s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 63.9s / 💰 $0.21 |
| 112_find_pvcs_by_uuid | 🔴 0% (0/1) / ⏱️ 38.6s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 50.8s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 143.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 43.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 42.5s / 💰 $0.11 |
| 114_checkout_latency_tracing_rebuild[0] | 🔴 0% (0/1) / ⏱️ 38.7s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 59.3s / 💰 $0.20 | 🔴 0% (0/1) / ⏱️ 273.2s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 99.8s / 💰 $0.33 | 🟢 100% (1/1) / ⏱️ 80.0s / 💰 $0.32 |
| 115_checkout_errors_tracing[0] | 🔴 0% (0/1) / ⏱️ 37.4s / 💰 $0.20 | 🔴 0% (0/1) / ⏱️ 56.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 218.9s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 101.9s / 💰 $0.33 | 🟢 100% (1/1) / ⏱️ 142.6s / 💰 $0.49 |
| 11_init_containers | 🟢 100% (1/1) / ⏱️ 31.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 67.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 127.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 50.4s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 40.5s / 💰 $0.11 |
| 121_new_relic_checkout_errors_tracing[0] | 🔴 0% (0/1) / ⏱️ 29.5s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 43.0s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 433.5s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 256.7s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 98.2s / 💰 $0.31 |
| 122_new_relic_checkout_latency_tracing_rebuild[0] | 🔴 0% (0/1) / ⏱️ 38.3s / 💰 $0.19 | 🔴 0% (0/1) / ⏱️ 51.3s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 697.1s / 💰 $0.47 | 🟢 100% (1/1) / ⏱️ 87.7s / 💰 $0.42 | 🟢 100% (1/1) / ⏱️ 152.7s / 💰 $0.51 |
| 123_new_relic_checkout_errors_tracing[0] | 🔴 0% (0/1) / ⏱️ 47.9s / 💰 $0.26 | 🔴 0% (0/1) / ⏱️ 39.3s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 341.5s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 102.7s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 133.6s / 💰 $0.58 |
| 12_job_crashing | 🔴 0% (0/1) / ⏱️ 27.7s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 35.1s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 116.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 55.7s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 57.8s / 💰 $0.18 |
| 13a_pending_node_selector_basic | 🟢 100% (1/1) / ⏱️ 31.7s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 42.6s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 27.6s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 49.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 52.5s / 💰 $0.13 |
| 13b_pending_node_selector_detailed | 🔴 0% (0/1) / ⏱️ 35.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 58.2s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 24.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 53.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 55.4s / 💰 $0.15 |
| 14_pending_resources | 🟢 100% (1/1) / ⏱️ 31.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 44.9s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 30.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 124.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 49.2s / 💰 $0.12 |
| 159_prometheus_high_cardinality_cpu[0] | 🟢 100% (1/1) / ⏱️ 30.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 48.4s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 192.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 64.2s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 50.3s / 💰 $0.17 |
| 159_prometheus_high_cardinality_cpu[1] | 🟢 100% (1/1) / ⏱️ 41.3s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 42.4s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 143.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 56.7s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 41.4s / 💰 $0.17 |
| 159_prometheus_high_cardinality_cpu[2] | 🔴 0% (0/1) / ⏱️ 29.2s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 41.3s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 134.7s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 68.5s / 💰 $0.23 | 🔴 0% (0/1) / ⏱️ 43.0s / 💰 $0.17 |
| 15_failed_readiness_probe | 🟢 100% (1/1) / ⏱️ 39.0s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 37.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 203.4s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 72.6s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 46.1s / 💰 $0.12 |
| 16_failed_no_toolset_found | 🔴 0% (0/1) / ⏱️ 54.1s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 25.2s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 47.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 25.5s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 23.8s / 💰 $0.06 |
| 17_oom_kill | 🟢 100% (1/1) / ⏱️ 34.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 61.6s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 35.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 56.5s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 42.4s / 💰 $0.11 |
| 19_detect_missing_app_details | 🟢 100% (1/1) / ⏱️ 48.2s / 💰 $0.43 | 🟢 100% (1/1) / ⏱️ 39.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 242.9s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 96.4s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 54.5s / 💰 $0.11 |
| 20_long_log_file_search | 🟢 100% (1/1) / ⏱️ 41.1s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 46.0s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 91.8s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 55.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 70.1s / 💰 $0.11 |
| 21_job_fail_curl_no_svc_account | 🟢 100% (1/1) / ⏱️ 50.1s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 641.0s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 58.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 56.9s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 54.0s / 💰 $0.24 |
| 23_app_error_in_current_logs | 🟢 100% (1/1) / ⏱️ 48.1s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 64.6s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 206.2s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 76.0s / 💰 $0.58 | 🟢 100% (1/1) / ⏱️ 73.5s / 💰 $0.35 |
| 24_misconfigured_pvc | 🟢 100% (1/1) / ⏱️ 36.0s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 52.8s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 24.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 70.9s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 61.6s / 💰 $0.16 |
| 24a_misconfigured_pvc_basic | 🔴 0% (0/1) / ⏱️ 30.9s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 47.7s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 30.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 55.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 66.7s / 💰 $0.16 |
| 24b_misconfigured_pvc_detailed | 🔴 0% (0/1) / ⏱️ 44.2s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 74.3s / 💰 $0.18 | 🔴 0% (0/1) / ⏱️ 26.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 58.7s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 156.8s / 💰 $0.16 |
| 25_misconfigured_ingress_class | 🔴 0% (0/1) / ⏱️ 51.2s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 51.1s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 285.5s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 83.6s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 78.5s / 💰 $0.30 |
| 26_page_render_times | 🟢 100% (1/1) / ⏱️ 29.7s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 48.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 241.1s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 54.3s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 55.2s / 💰 $0.16 |
| 27a_multi_container_logs | 🟢 100% (1/1) / ⏱️ 32.8s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 48.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 105.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 49.8s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 37.8s / 💰 $0.12 |
| 27b_multi_container_logs | 🟢 100% (1/1) / ⏱️ 30.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 48.2s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 132.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 44.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 37.9s / 💰 $0.11 |
| 28_permissions_error | 🟢 100% (1/1) / ⏱️ 19.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 25.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 92.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 27.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.07 |
| 33_cpu_metrics_discovery | 🟢 100% (1/1) / ⏱️ 26.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 35.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 121.8s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 79.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 45.4s / 💰 $0.13 |
| 39_failed_toolset | 🟢 100% (1/1) / ⏱️ 23.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 41.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 203.3s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 47.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 52.3s / 💰 $0.11 |
| 41_setup_argo | 🟢 100% (1/1) / ⏱️ 21.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 18.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 170.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 20.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 21.2s / 💰 $0.06 |
| 42_dns_issues_result_new_tools_no_runbook | 🟢 100% (1/1) / ⏱️ 42.7s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 68.1s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 267.0s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 84.0s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 119.2s / 💰 $0.40 |
| 42_dns_issues_steps_new_tools | 🟢 100% (1/1) / ⏱️ 49.8s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 55.2s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 391.2s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 94.6s / 💰 $0.27 | ⏱️ 0% (0/1) / ⏱️ 694.5s |
| 43_current_datetime_from_prompt | 🟢 100% (1/1) / ⏱️ 17.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 65.1s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 18.8s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 17.7s / 💰 $0.06 |
| 45_fetch_deployment_logs_simple | 🟢 100% (1/1) / ⏱️ 31.7s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 51.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 103.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 43.1s / 💰 $0.11 |
| 50_logs_since_specific_date | 🟢 100% (1/1) / ⏱️ 13.9s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 18.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 144.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 32.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 25.1s / 💰 $0.10 |
| 50a_logs_since_last_specific_month | 🟢 100% (1/1) / ⏱️ 28.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 34.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 113.9s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 50.0s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 33.8s / 💰 $0.08 |
| 51_logs_summarize_errors | 🟢 100% (1/1) / ⏱️ 31.8s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 31.3s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 105.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 40.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.10 |
| 52_logs_login_issues | 🔴 0% (0/1) / ⏱️ 39.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 104.7s / 💰 $0.44 | 🔴 0% (0/1) / ⏱️ 47.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 65.8s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 59.9s / 💰 $0.11 |
| 53_logs_find_term | 🟢 100% (1/1) / ⏱️ 32.3s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 53.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 73.0s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 41.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 42.5s / 💰 $0.14 |
| 54_not_truncated_when_getting_pods | 🟢 100% (1/1) / ⏱️ 33.7s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 39.0s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 237.9s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 55.1s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 66.3s / 💰 $0.13 |
| 57_wrong_namespace | 🔴 0% (0/1) / ⏱️ 30.9s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 43.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 129.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 45.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 43.5s / 💰 $0.10 |
| 59_label_based_counting | 🟢 100% (1/1) / ⏱️ 27.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 31.2s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 98.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 27.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 26.9s / 💰 $0.08 |
| 60_count_less_than | 🟢 100% (1/1) / ⏱️ 35.3s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 28.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 167.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 35.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 38.2s / 💰 $0.09 |
| 61_exact_match_counting | 🟢 100% (1/1) / ⏱️ 40.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 31.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 65.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 26.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 26.3s / 💰 $0.08 |
| 62_fetch_error_logs_with_errors | 🟢 100% (1/1) / ⏱️ 31.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 42.8s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 121.2s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 39.2s / 💰 $0.09 |
| 63_fetch_error_logs_no_errors | 🟢 100% (1/1) / ⏱️ 30.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 33.6s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 79.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 35.8s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 35.4s / 💰 $0.09 |
| 64_keda_vs_hpa_confusion | 🟢 100% (1/1) / ⏱️ 63.2s / 💰 $0.22 | 🔴 0% (0/1) / ⏱️ 54.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 184.6s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 117.7s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 66.7s / 💰 $0.17 |
| 65_health_check_followup | 🟢 100% (1/1) / ⏱️ 44.5s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 49.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 263.4s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 69.0s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 70.5s / 💰 $0.26 |
| 71_connection_pool_starvation | 🟢 100% (1/1) / ⏱️ 38.7s / 💰 $0.13 | 🔴 0% (0/1) / ⏱️ 58.8s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 161.8s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 57.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 56.8s / 💰 $0.17 |
| 73a_time_window_anomaly | 🟢 100% (1/1) / ⏱️ 42.0s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 34.0s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 157.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 63.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 64.9s / 💰 $0.18 |
| 73b_time_window_anomaly | 🟢 100% (1/1) / ⏱️ 44.2s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 29.7s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 91.4s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 57.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 62.6s / 💰 $0.14 |
| 76_service_discovery_issue | 🟢 100% (1/1) / ⏱️ 40.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 60.8s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 190.0s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 654.9s / 💰 $0.14 | ⏱️ 0% (0/1) / ⏱️ 648.9s |
| 77_liveness_probe_misconfiguration | 🟢 100% (1/1) / ⏱️ 40.1s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 41.7s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 185.6s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 48.8s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 53.5s / 💰 $0.13 |
| 78a_missing_cpu_limits | 🔴 0% (0/1) / ⏱️ 25.9s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 30.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 217.1s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 54.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 58.8s / 💰 $0.14 |
| 78b_cpu_quota_exceeded | 🔴 0% (0/1) / ⏱️ 51.1s / 💰 $0.24 | 🔴 0% (0/1) / ⏱️ 44.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 81.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 53.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 52.6s / 💰 $0.14 |
| 79_configmap_mount_issue | 🟢 100% (1/1) / ⏱️ 31.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 43.4s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 193.0s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 46.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 63.1s / 💰 $0.12 |
| 80_pvc_storage_class_mismatch | 🔴 0% (0/1) / ⏱️ 32.4s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 49.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 95.6s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 81.8s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 57.5s / 💰 $0.15 |
| 81_service_account_permission_denied | 🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 58.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 260.8s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 99.1s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 71.0s / 💰 $0.17 |
| 82_pod_anti_affinity_conflict | 🔴 0% (0/1) / ⏱️ 37.4s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 45.9s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 201.0s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 61.9s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 63.0s / 💰 $0.17 |
| 83_secret_not_found | 🟢 100% (1/1) / ⏱️ 36.1s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 50.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 125.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 44.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 49.3s / 💰 $0.13 |
| 84_network_policy_blocking_traffic | 🟢 100% (1/1) / ⏱️ 38.2s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 79.0s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 238.0s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 99.6s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 59.7s / 💰 $0.15 |
| 85_hpa_not_scaling | 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 327.4s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 58.7s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 56.9s / 💰 $0.17 |
| 86_configmap_like_but_secret | 🟢 100% (1/1) / ⏱️ 48.2s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 45.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 334.3s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 50.8s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 61.6s / 💰 $0.16 |
| 89_runbook_missing_cloudwatch | 🟢 100% (1/1) / ⏱️ 30.0s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 27.4s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 179.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 40.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 46.0s / 💰 $0.10 |
| 90_runbook_basic_selection | 🟢 100% (1/1) / ⏱️ 52.0s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 72.9s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 275.4s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 202.4s / 💰 $0.51 | 🟢 100% (1/1) / ⏱️ 127.9s / 💰 $0.32 |
| 91f_datadog_logs_historical_pod | 🔴 0% (0/1) / ⏱️ 29.0s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 40.1s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 159.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 80.2s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 67.5s / 💰 $0.15 |
| 93_calling_datadog[0] | 🟢 100% (1/1) / ⏱️ 61.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 11.7s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 41.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 13.4s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 10.3s / 💰 $0.15 |
| 93_calling_datadog[1] | 🟢 100% (1/1) / ⏱️ 56.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 12.2s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 89.5s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 11.8s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 10.4s / 💰 $0.15 |
| 93_calling_datadog[2] | 🟢 100% (1/1) / ⏱️ 70.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 13.2s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 11.6s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 11.1s / 💰 $0.15 |
| 94_runbook_transparency | 🟢 100% (1/1) / ⏱️ 45.4s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 63.4s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 368.3s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 119.5s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 79.7s / 💰 $0.20 |
| 96_no_matching_runbook | 🔴 0% (0/1) / ⏱️ 54.3s / 💰 $0.32 | 🔴 0% (0/1) / ⏱️ 158.5s / 💰 $0.46 | 🔴 0% (0/1) / ⏱️ 326.2s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 106.8s / 💰 $0.38 | 🟢 100% (1/1) / ⏱️ 122.3s / 💰 $0.35 |
| 97_logs_clarification_needed | 🟢 100% (1/1) / ⏱️ 15.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 21.5s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 50.5s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 19.7s / 💰 $0.06 |
| 99_logs_transparency_custom_time | 🟢 100% (1/1) / ⏱️ 32.8s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 39.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 88.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 51.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 56.5s / 💰 $0.10 |
| 93_events_since_specific_date | 🟢 100% (1/1) / ⏱️ 9.4s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 16.3s / 💰 $0.07 | ⚪️ - | 🟢 100% (1/1) / ⏱️ 17.8s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 14.6s / 💰 $0.09 |
| 44_slack_statefulset_logs | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 48_logs_since_thursday | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 22_high_latency_dbi_down | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 08_sock_shop_frontend | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104b_postgres_missing_index_pgstat | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104c_postgres_minimal_missing_index | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 105_redis_wrong_data_structure | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 156_kafka_opensearch_latency | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 43_slack_deployment_logs | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 55_kafka_runbook | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 98_logs_transparency_default_time | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment local-benchmark-20250930-072258.