
The Full Enterprise Pipeline. No Detectable Latency Cost.

Brutor AI Gateway performance benchmark — May 2026

For anyone choosing an AI platform, and anyone using one every day, this report answers one question: how much does enterprise governance cost in latency? We measured every request through Brutor’s full pipeline (auth, RBAC, governance, cost tracking, quota enforcement, guardrails, semantic cache, and routing) against the same models called directly. The answer: nothing your users can feel. That means security, control, and observability don’t have to be a trade-off you negotiate with your engineering team every quarter.

We tested the speed of the Brutor AI Gateway and compared it with competing solutions. Here are the highlights:

OpenAI · median overhead
−64 ms
Proxy faster than direct (within run-to-run variance — the honest read is “no cost”)

Anthropic · median overhead
+0.5 ms
Statistically zero on the OpenAI-translated path

Native pipelines · median overhead
+33–53 ms
Anthropic-native and Claude Code (with beta headers)

Proxy-introduced errors*
0
See footnote at the end of the article.

Now let’s look at the benchmark results in detail.

The headline numbers, with the path that produced them.

Median (P50) latency at concurrency = 1 — the typical user experience. Each row tests the same upstream model both directly (client → provider) and through Brutor (client → Brutor → provider), so the delta is entirely the proxy’s contribution.

| Path | Direct (ms) | Via Brutor (ms) | Overhead |
| --- | --- | --- | --- |
| OpenAI (translation) | 2,437 | 2,372 | −64 ms |
| Anthropic (OpenAI-translated) | 1,438 | 1,438 | +0.5 ms |
| Anthropic (native pipeline) | 1,438 | 1,471 | +33 ms |
| Claude Code (native + beta headers) | 1,438 | 1,491 | +53 ms |

The OpenAI median is negative because Brutor’s pooled HTTP/2 connections are warmer than the benchmark client’s per-call TLS handshake. We don’t lead with that number; the headline read is “the proxy adds no detectable cost at median” — that’s what’s reproducible across runs.
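As a rough illustration of how an overhead figure like the ones in the table can be derived, the sketch below computes a P50 for each path and subtracts. It assumes the overhead is the difference of the two medians; the report does not say whether the benchmark tool does exactly this or takes the median of per-request deltas. The function names `median_ms` and `median_overhead_ms` are hypothetical and not part of Brutor.

```rust
/// Median (P50) of a set of latency samples in milliseconds.
/// Returns None when there are no samples.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting never panics on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        // Even count: average the two middle samples.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Assumed definition: overhead = median via proxy minus median direct.
/// A negative value means the proxied path was faster at the median.
fn median_overhead_ms(direct: &[f64], via_proxy: &[f64]) -> Option<f64> {
    Some(median_ms(via_proxy)? - median_ms(direct)?)
}
```

With made-up samples whose medians are 1,438 ms direct and 1,471 ms via the proxy, `median_overhead_ms` returns `Some(33.0)`, matching the shape of the Anthropic native-pipeline row.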

What that latency buys you.

The ~0–53 ms above is the combined cost of everything below. Most competing AI gateways charge more latency for fewer features.

Auth · <1 ms

JWT decode & API-key lookup

Bearer JWT or service API key validated, scope + expiry checked, tenant context resolved.

RBAC · <1 ms

Resource group access

Workspace + ancestor-chain governance loaded; per-group model availability gate enforced.

Quota · <1 ms

Budget & rate limits

Per-tenant budget cap, per-model RPM/TPM, daily/monthly token ceilings — all atomic.

Governance · <2 ms

Behavioural rules

Temperature ceiling, context-window cap, banned-tool filter, mandatory system-prompt fragments.

Guardrails · 5–15 ms

PII & pattern scan

Bidirectional content-policy scan on prompt and response. Detect-only or block, per group.

Routing · <2 ms

5 strategies, configurable

Weighted, least-busy, latency-based, cost-based, usage-based — picked per workspace.

Streaming · <1 ms / chunk

Wire-format normalization

Anthropic / Bedrock / Groq SSE → OpenAI-compatible chunks, transparently for the client.

Cost · <1 ms · off-path

Usage tracking

Token counting, cost calculation, audit row insert via background writer — never blocks the request.

Observability · <1 ms

Logs + traces + metrics

Structured proxy log, OpenTelemetry span, Prometheus counters. Every request, every time.

Multi-modal · <1 ms

Per-mode policy

Audio voice/format whitelist, image size whitelist, video duration cap, embedding dim guard.

MCP & A2A · <2 ms

Tool + agent gating

Tool whitelist, approval workflow trigger, agent-to-agent delegation chain depth limits.

Semantic cache · 5–20 ms

Vector lookup, infinite hit

Two-layer cache (exact via Redis SHA-256 + semantic via Qdrant similarity). On hit: skip the upstream call entirely. Tenant-scoped, with per-group time-bucketing and exclusion rules. The dashboard tracks Value-per-Token — cost saved per dollar spent on tokens. Real-world tenants typically see 1.3× to 1.8× VPT within weeks.

Brutor Admin Console — Cache and Value-per-Token dashboard for a Resource Group, showing 1.52x VPT score, 44.4% hit rate, tokens saved, and cost savings.
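To make the two-layer lookup concrete, here is a minimal in-memory sketch of the idea: try an exact match on a hash of the prompt first, then fall back to a similarity search scoped to the same tenant. Everything in it is an assumption for illustration. A `HashMap` stands in for Redis, the standard library's `DefaultHasher` stands in for SHA-256, a linear cosine-similarity scan stands in for Qdrant, and the type `TwoLayerCache`, its methods, and the threshold value are hypothetical. Embeddings are supplied by the caller, and time-bucketing and exclusion rules are left out.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// One cached response with the embedding of the prompt that produced it.
struct SemanticEntry {
    embedding: Vec<f32>,
    response: String,
}

/// Hypothetical two-layer cache: exact layer keyed by (tenant, prompt hash),
/// semantic layer holding per-tenant embeddings.
struct TwoLayerCache {
    exact: HashMap<(String, u64), String>,
    semantic: HashMap<String, Vec<SemanticEntry>>,
    threshold: f32, // minimum cosine similarity that counts as a hit
}

/// Stand-in for a SHA-256 digest of the prompt (assumption: std hasher only).
fn prompt_key(prompt: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prompt.hash(&mut hasher);
    hasher.finish()
}

/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

impl TwoLayerCache {
    fn new(threshold: f32) -> Self {
        Self { exact: HashMap::new(), semantic: HashMap::new(), threshold }
    }

    /// Store a response in both layers under the given tenant.
    fn insert(&mut self, tenant: &str, prompt: &str, embedding: Vec<f32>, response: &str) {
        self.exact
            .insert((tenant.to_string(), prompt_key(prompt)), response.to_string());
        self.semantic
            .entry(tenant.to_string())
            .or_default()
            .push(SemanticEntry { embedding, response: response.to_string() });
    }

    /// Layer 1: exact hash match. Layer 2: best same-tenant entry whose
    /// similarity meets the threshold. None means "call the upstream model".
    fn lookup(&self, tenant: &str, prompt: &str, embedding: &[f32]) -> Option<&str> {
        let key = (tenant.to_string(), prompt_key(prompt));
        if let Some(hit) = self.exact.get(&key) {
            return Some(hit.as_str());
        }
        self.semantic
            .get(tenant)?
            .iter()
            .map(|entry| (cosine(&entry.embedding, embedding), entry))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, entry)| entry.response.as_str())
    }
}
```

Because both layers are keyed by tenant, a prompt cached for one tenant never produces a hit for another, which is the tenant-scoping property described above.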

Why proxy CPU isn’t the bottleneck.

The end-to-end numbers above are dominated by network, TLS, and the provider’s own inference. To isolate Brutor’s own work, we run Criterion microbenchmarks against in-memory fixtures. Total proxy-internal CPU per chat request: under 5 ms.

Hot path · per-operation cost

Every operation in the request pipeline measured at hardware speed. The proxy is written in Rust on async Tokio with DashMap-backed concurrent state — CPU-bound work is never the bottleneck.

Cooldown check (DashMap lookup)
~30 ns

Atomic RPM increment
~90 ns

Cache lookup (hit)
~80 ns

Stream normalization (per SSE chunk)
~500 ns

Routing strategy (10 candidates)
~1.5 µs

Anthropic request transformation
~2.5 µs
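An operation like the atomic RPM increment can stay in the tens-of-nanoseconds range because it needs no lock. The sketch below shows one way to do it with a single `AtomicU64` and a compare-and-swap loop. It is a fixed-window counter written for illustration under stated assumptions, not Brutor's implementation: the type `RpmLimiter`, the packing of window and count into one word, and the caller-supplied clock are all hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical lock-free requests-per-minute limiter (fixed window).
struct RpmLimiter {
    limit: u32,
    // High 32 bits: minute index of the current window.
    // Low 32 bits: requests admitted in that window.
    state: AtomicU64,
}

impl RpmLimiter {
    fn new(limit: u32) -> Self {
        Self { limit, state: AtomicU64::new(0) }
    }

    /// Try to admit one request at `now_secs` (seconds since any fixed epoch).
    /// Returns false when the current minute's budget is used up.
    fn try_acquire(&self, now_secs: u64) -> bool {
        let minute = (now_secs / 60) & 0xFFFF_FFFF;
        let mut current = self.state.load(Ordering::Acquire);
        loop {
            let window = current >> 32;
            // A new minute resets the count to zero.
            let count = if window == minute { current & 0xFFFF_FFFF } else { 0 };
            if count >= u64::from(self.limit) {
                return false;
            }
            let next = (minute << 32) | (count + 1);
            // Publish the new state only if no other thread changed it meanwhile;
            // on failure, retry with the value that thread wrote.
            match self.state.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed,
            }
        }
    }
}
```

With a limit of 2, calls at seconds 0 and 30 are admitted, a call at second 59 is rejected, and a call at second 60 is admitted again because a new window has started. Passing the clock in as an argument keeps the sketch deterministic; a real limiter would read a monotonic clock.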

For a 100-token streaming response (1.5–2.0 s end-to-end), proxy CPU cost (under 5 ms) is roughly 0.3 % of total request time or less. The remaining 25–48 ms of end-to-end overhead comes from pooled connection acquisition, TLS handshake amortization, and small kernel-level scheduling jitter.
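The per-operation figures can be added up as a back-of-the-envelope check. The sketch below sums only the six operations listed in the microbenchmark table, assuming each runs once per request except stream normalization, which is assumed to run once per SSE chunk with one chunk per token. Those assumptions, and the function names, are illustrative; the total deliberately excludes everything else inside the under-5 ms budget.

```rust
// Per-operation costs in nanoseconds, copied from the list above.
const COOLDOWN_CHECK_NS: f64 = 30.0;
const RPM_INCREMENT_NS: f64 = 90.0;
const CACHE_LOOKUP_NS: f64 = 80.0;
const STREAM_CHUNK_NS: f64 = 500.0;
const ROUTING_NS: f64 = 1_500.0;
const ANTHROPIC_TRANSFORM_NS: f64 = 2_500.0;

/// Sum of the six listed operations for a response with `chunks` SSE chunks.
/// Assumption: every operation runs once, except per-chunk normalization.
fn listed_ops_ns(chunks: u32) -> f64 {
    COOLDOWN_CHECK_NS
        + RPM_INCREMENT_NS
        + CACHE_LOOKUP_NS
        + ROUTING_NS
        + ANTHROPIC_TRANSFORM_NS
        + f64::from(chunks) * STREAM_CHUNK_NS
}

/// Share of an end-to-end request time, in percent.
fn share_percent(cost_ns: f64, end_to_end_ms: f64) -> f64 {
    cost_ns / (end_to_end_ms * 1_000_000.0) * 100.0
}
```

For 100 chunks the listed operations come to 54,200 ns, about 0.054 ms, which is well under 0.01 % of a 1,500 ms request. Taking the full 5 ms upper bound instead gives 0.25 % of a 2,000 ms request.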

How does Brutor compare to competition?

We could not run an honest like-for-like comparison of Brutor against another AI gateway with the same feature set, simply because no other gateway combines this governance depth with Rust-native speed. Instead, we compared our gateway with leading competitors that take very different approaches to governance and have different architectures and feature sets. Here is what we concluded:

  • Generic API gateways — built in C or Go. They promise single-digit-millisecond overhead and deliver it, because they don’t do anything AI-specific: no model routing, no guardrails, no semantic cache, no per-token cost tracking, no MCP or A2A awareness. If you adopt one for your AI traffic you still have to build the entire governance layer above it. Brutor charges 0–53 ms more than that minimal proxy floor — and ships every one of those capabilities production-ready.
  • Python-based AI gateways — built in Python on FastAPI. They have AI-specific features, but the runtime tax shows: typical end-to-end overhead lands in the 100–500 ms range. That’s the kind of latency users feel; it’s the kind of latency that quietly inflates inference bills on streaming responses; it’s the kind of latency that makes engineering teams quietly route around the gateway for “real-time” use cases — which is exactly when governance matters most.
  • Node-based and edge-worker AI gateways — built on Node.js or Cloudflare Workers. They narrow the latency gap, but the feature set narrows with them: logging, simple caching, basic fallbacks. Anything resembling enterprise governance still has to be built above.

Brutor AI Gateway offers a unique advantage: the full enterprise feature set at generic-API-gateway-class latency — a category of its own. The Rust runtime, async I/O, pooled HTTP/2 connections, and DashMap-backed concurrent state are why — and the consequence is that you don’t have to choose. You don’t trade governance for speed, you don’t trade speed for control, and you don’t have to defend either choice to the team that asks why their requests got slower the day audit logging turned on.

Run the benchmark yourself.

No vendor benchmark is worth anything if you can’t reproduce it. The same tool we used to generate these numbers ships with the proxy — reproduce it in under 5 minutes:

# 1. Set provider keys for direct calls
export BENCHMARK_OPENAI_KEY=sk-…
export BENCHMARK_ANTHROPIC_KEY=sk-ant-…

# 2. Point at your running proxy
export BENCHMARK_PROXY_URL=http://localhost:8100
export BENCHMARK_PROXY_API_KEY=sk_brutor_…

# 3. Run
cargo run --release --bin benchmark

Raw JSON + summary markdown drop into benchmark-results/ with timestamps. Every run is checked into the repo for trend analysis.

What these numbers mean for you.

Our benchmark testing and the resulting numbers address one of the main concerns IT managers have when looking into adopting an AI governance solution — the worry that putting governance, cost control, and guardrails between users and the model will slow things down. The numbers say it doesn’t, in any way users feel.

That changes the conversation. Governance stops being a tax that engineering teams quietly route around to ship something that feels fast. Security and finance stop having to choose between visibility and developer happiness. And your employees get on with making the most of AI — with the controls, audit trails, and cost guardrails your organization needs already running underneath, every request, every time.

If you’re evaluating an AI platform, this is one of the bars to measure against. Explore Brutor AI Platform and all its features


* Pass-through reliability: 100.0%. Zero proxy-introduced errors across 480 requests in the May 2026 run. The only failures observed were upstream HTTP 429s from Anthropic at concurrency ≥ 10 — those occur at identical rates on direct calls, and Brutor faithfully forwards them.
