Detailed Description
This benchmark evaluates the BubbleRAN MX-AI agentic AI platform, specifically the SMO Agent that interprets operator intents and enforces control actions in a 5G Open RAN deployment. The dataset includes 150 realistic operational prompts, divided into:
- 100 observability queries covering KPIs, policies, slices, Custom Resource Definitions (CRDs), logs, and topology/context
- 50 control actions, such as UE lifecycle management, blueprint deployment/deletion, and slice PRB reconfiguration
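To make the dataset split concrete, the following is a minimal sketch of what one prompt record might look like. The schema, field names, and example prompts are illustrative assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record schema for one benchmark prompt.
# Field names are assumptions for illustration only.
@dataclass
class BenchmarkPrompt:
    prompt_id: int
    text: str                                   # operator intent in natural language
    task_type: Literal["observability", "control"]
    category: str                               # e.g. "KPIs", "slices", "UE lifecycle"
    expected_outcome: str                       # reference answer or target state change

prompts = [
    BenchmarkPrompt(1, "Show the current PRB utilisation per slice.",
                    "observability", "KPIs", "per-slice PRB utilisation report"),
    BenchmarkPrompt(2, "Reduce slice 2's PRB quota to 30%.",
                    "control", "slice PRB reconfiguration", "slice 2 PRB quota = 30%"),
]
```

In this shape, the full dataset would simply be a list of 150 such records (100 observability, 50 control), which the harness iterates over per LLM backend.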
For each LLM backend, the benchmark reports:
- Observability Coherence (0–5): Scored using an LLM-assisted evaluator (GPTScore) with explicit rubrics. Three expert annotators review scoring disagreements and validate edge cases.
- Action Accuracy (%): Binary correctness per control task (intent → enforced change), aggregated as a percentage.
- End-to-End Latency (seconds): Measured from prompt submission to answer or action completion.
- GPU Footprint / VRAM Usage: Reported for local deployments to analyze coherence–latency–resource trade-offs.
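The four metrics above can be aggregated per backend roughly as follows. This is a minimal sketch under assumed per-task result fields (`coherence`, `success`, `latency_s`, `vram_gb`); the benchmark's actual harness and record format are not specified here.

```python
from statistics import mean

def aggregate(results):
    """Aggregate per-task results into the four reported metrics (illustrative)."""
    obs = [r for r in results if r["task_type"] == "observability"]
    ctl = [r for r in results if r["task_type"] == "control"]
    return {
        # Mean GPTScore-style rubric score over observability queries (0-5 scale)
        "coherence_0_5": round(mean(r["coherence"] for r in obs), 2),
        # Binary pass/fail per control task, reported as a percentage
        "action_accuracy_pct": 100.0 * sum(r["success"] for r in ctl) / len(ctl),
        # Prompt submission to answer/action completion, averaged over all tasks
        "latency_s_mean": round(mean(r["latency_s"] for r in results), 2),
        # Peak VRAM observed during the run (meaningful for local deployments)
        "vram_gb_peak": max(r["vram_gb"] for r in results),
    }

example = [
    {"task_type": "observability", "coherence": 4.5, "success": None,
     "latency_s": 3.2, "vram_gb": 22.1},
    {"task_type": "control", "coherence": None, "success": True,
     "latency_s": 7.8, "vram_gb": 23.4},
    {"task_type": "control", "coherence": None, "success": False,
     "latency_s": 6.1, "vram_gb": 23.0},
]
print(aggregate(example))
```

With this toy input, action accuracy comes out at 50% (one of two control tasks enforced correctly), illustrating the binary-then-aggregate scoring described above.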
The benchmark enables fair comparison between cloud and on-prem LLM backends and quantifies how retrieval and tooling quality affect observability performance. Observability tasks require contextual reasoning across multiple data sources and are inherently more challenging than constrained, tool-based control actions.
Key Features
- Live 5G Open RAN testbed evaluation
- Observability Q&A and closed-loop network control tasks
- Hybrid scoring using GPTScore and expert review
- Multi-metric reporting including coherence, action accuracy, latency, and GPU VRAM usage
- Cloud and on-prem LLM backend comparison
Use Cases
- Benchmark new LLM and SLM backends for SMO-level operations
- Compare retrieval and tool-calling strategies
- Evaluate on-prem deployment feasibility across latency, VRAM, and answer quality
- Measure agent time-to-action against human operators