Optimize cost, reliability, and latency for production workloads by simply choosing a
service_tier. No batch file management, no architectural changes.

Google's Gemini API now offers three inference service tiers that let you trade off cost,
latency, and availability guarantees, all within the same generate_content() call. No
separate endpoints, no pipeline rewrites.
**Flex**: 50% cheaper for latency-tolerant workloads. May queue or return 503 during peak demand. Best for background processing.

```python
config = {"service_tier": "flex"}
```
**Standard**: The default tier, no parameter required. Balanced cost and latency. Use this when you have no specific requirements.

```python
config = {"http_options": ...}  # no service_tier param
```
**Priority**: For critical, user-facing apps. Lowest latency, with automatic fallback to Standard if capacity is constrained.

```python
config = {"service_tier": "priority"}
```
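The three tier configs above can be collapsed into one small helper. A minimal sketch, assuming the dict-style config shown above; the name build_config is hypothetical, not part of the SDK:

```python
def build_config(tier: str, timeout_ms: int = 900_000) -> dict:
    """Build a generate_content config dict for a given service tier.

    Standard is the default tier, so it needs no service_tier key;
    flex and priority are opt-in via the config.
    """
    config = {"http_options": {"timeout": timeout_ms}}  # generous timeout: flex may queue
    if tier != "standard":
        config["service_tier"] = tier  # "flex" or "priority"
    return config
```

You would then pass the result straight into the call, e.g. `config=build_config("flex")`.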
The service_tier parameter requires genai ≥ 2.1.0. Older versions raise a Pydantic
validation error ("Extra inputs are not permitted"). Upgrade with:
pip install -U google-genai
All 15 runs used gemini-2.5-flash-preview with the same prompt, executed sequentially.
The served tier is confirmed from the x-gemini-service-tier response header.
In a separate test run, flex returned a 503 UNAVAILABLE error due to high demand on the preview endpoint. Standard and priority completed successfully. Design flex workloads to handle retries or graceful degradation.
| Metric | Flex | Standard | Priority |
|---|---|---|---|
| Successful runs | 5 / 5 | 5 / 5 | 5 / 5 |
| Avg latency | 3.152s | 2.583s | 2.226s ✓ |
| Min latency | 2.293s | 2.075s | 2.008s ✓ |
| Max latency | 4.964s | 3.574s | 2.429s ✓ |
| Latency spread | 2.671s range | 1.499s range | 0.421s range ✓ |
| Avg output tokens | 26.6 | 29.4 | 26.6 |
| Cost vs standard | ~50% cheaper ✓ | baseline | ~same |
| Failover | no (503 on overflow) | no | auto → standard ✓ |
Full per-run breakdown including token counts. The Served column reflects the actual tier confirmed
via the x-gemini-service-tier response header.
| Tier | Run | OK | Latency | Served | In toks | Out toks | Thoughts | Total |
|---|---|---|---|---|---|---|---|---|
| flex | 1 | yes | 3.010s | flex | 14 | 26 | 168 | 208 |
| flex | 2 | yes | 2.293s | flex | 14 | 32 | 125 | 171 |
| flex | 3 | yes | 4.964s | flex | 14 | 24 | 174 | 212 |
| flex | 4 | yes | 3.009s | flex | 14 | 27 | 187 | 228 |
| flex | 5 | yes | 2.485s | flex | 14 | 24 | 179 | 217 |
| standard | 1 | yes | 2.225s | standard | 14 | 29 | 204 | 247 |
| standard | 2 | yes | 2.747s | standard | 14 | 32 | 221 | 267 |
| standard | 3 | yes | 2.075s | standard | 14 | 25 | 191 | 230 |
| standard | 4 | yes | 3.574s | standard | 14 | 29 | 178 | 221 |
| standard | 5 | yes | 2.294s | standard | 14 | 32 | 236 | 282 |
| priority | 1 | yes | 2.008s | priority | 14 | 23 | 161 | 198 |
| priority | 2 | yes | 2.368s | priority | 14 | 28 | 231 | 273 |
| priority | 3 | yes | 2.268s | priority | 14 | 25 | 207 | 246 |
| priority | 4 | yes | 2.429s | priority | 14 | 26 | 174 | 214 |
| priority | 5 | yes | 2.056s | priority | 14 | 31 | 240 | 285 |
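The summary table's averages and spreads follow directly from these per-run latencies; a quick recomputation, with the values copied from the table above:

```python
flex     = [3.010, 2.293, 4.964, 3.009, 2.485]   # per-run latencies in seconds
standard = [2.225, 2.747, 2.075, 3.574, 2.294]
priority = [2.008, 2.368, 2.268, 2.429, 2.056]

def summarize(latencies):
    """Avg, min, max, and spread (max - min), as in the summary table."""
    return {
        "avg": round(sum(latencies) / len(latencies), 3),
        "min": min(latencies),
        "max": max(latencies),
        "spread": round(max(latencies) - min(latencies), 3),
    }
```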
All benchmarks were run with
flex-genai.py
(full source on GitHub Gist),
a self-contained script that cycles through all three tiers, captures per-run latency and token usage,
and reads the confirmed tier from response headers.
```python
# ── set up per-tier config ──────────────────────────────────────
def run_once(tier: str) -> dict:
    config = {"http_options": {"timeout": 900_000}}  # flex can queue; 900s timeout
    if tier != "standard":
        config["service_tier"] = tier  # "flex" or "priority"
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview",
        contents=prompt,
        config=config,
    )
    ...

# ── verify the tier actually served ──────────────────────────────
def effective_tier(response):
    """Read confirmed tier from x-gemini-service-tier response header."""
    r = getattr(response, "sdk_http_response", None)
    if r is None:
        return None
    h = getattr(r, "headers", None) or {}
    return h.get("x-gemini-service-tier") or h.get("X-Gemini-Service-Tier")
```
Priority inference automatically falls back to Standard when capacity is constrained.
The x-gemini-service-tier header tells you
which tier actually served your request, not which one you asked for.
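One practical consequence: log any mismatch between the requested and served tiers so silent priority-to-standard fallbacks stay visible. A sketch; detect_fallback is a hypothetical helper name, and it assumes the header carries the plain tier name:

```python
def detect_fallback(requested: str, served) -> str:
    """Classify the outcome of a tiered request from the value of the
    x-gemini-service-tier response header (None if the header was absent)."""
    if served is None:
        return "unknown"                     # header missing from the response
    if served == requested:
        return "served-as-requested"
    if requested == "priority" and served == "standard":
        return "fell-back-to-standard"       # the documented priority fallback
    return f"unexpected: asked {requested}, got {served}"
```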
Use service_tier="flex" for background pipelines, batch enrichment, or any job that can tolerate variable latency and occasional retries.
Use service_tier="priority" for real-time user interactions. Automatic failover to Standard means no hard failures.
Flex may return 503 UNAVAILABLE during peak demand (preview behavior). Implement exponential backoff or queue retries.
Always check x-gemini-service-tier in the response headers to confirm which tier actually handled your request.
Priority delivered a 0.42s spread across 5 runs. Standard spread was 1.5s. Flex spread was 2.67s. Use Priority when p99 latency matters.
Unlike the Batch API, Flex uses the same generate_content() call. No file upload, no polling, no async job management.
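The retry advice for flex above can be sketched as a small wrapper with exponential backoff and jitter. This is an illustration, not SDK functionality; it matches on the error message text, whereas real code would catch the SDK's specific exception type:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry fn() with exponential backoff plus jitter on 503-style errors.

    fn is any zero-argument callable that raises an exception whose message
    contains "503" or "UNAVAILABLE" when flex capacity is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            msg = str(exc)
            retryable = "503" in msg or "UNAVAILABLE" in msg
            if not retryable or attempt == max_attempts - 1:
                raise  # non-retryable error, or out of attempts
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

In practice you would wrap the flex generate_content call, e.g. `call_with_backoff(lambda: client.models.generate_content(...))`.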
There's a meaningful caveat worth understanding: Flex inference traffic counts towards your general rate limits. It doesn't offer the extended rate limits that the Batch API provides. On its face, that sounds like a downgrade. It isn't.
Flex shares your standard per-minute and per-day quotas. During sustained high volume, you'll hit the same ceiling you would with standard inference. The Batch API has separate, higher quotas specifically to handle large offline workloads.
But this is a deliberate architectural trade-off, not an oversight. By keeping Flex within the standard request path, Google avoids introducing a second quota system, a separate endpoint, or any new client-side complexity. The whole point is that you don't change anything except one config key.
Flex is the Spot Instance of LLM inference — you accept sheddability during peak surges in exchange for a 50% cost reduction, with none of the operational overhead of managing batch files or TTL logic for context caching.
The 50% cost saving is the headline, but the actual win is engineering hours. Compare what it takes to run the same workload through each approach:
| Dimension | Batch API | Flex Inference |
|---|---|---|
| Setup | Upload input file to Cloud Storage or Files API | Add one config key — done |
| Call style | Create job, poll for completion, fetch output file | Standard await generate_content() |
| Pipeline rewrite | Yes — async job orchestration required | No — existing code works unchanged |
| Rate limits | Extended separate quota | General rate limits (shared with standard) |
| Failure mode | Job can fail; output may be partial | 503 on surge — retry and continue |
| Cost vs standard | ~50% cheaper | ~50% cheaper |
| Time to production | Hours to days (pipeline changes) | Minutes (one config change) |
The rate limit cost is essentially the price of not rewriting your entire data pipeline to support asynchronous batch processing. For most teams running enrichment pipelines or background classification jobs, that trade-off is obvious — you get the same 50% savings without touching your architecture.
Where the Batch API wins is sustained, high-volume offline workloads that genuinely need extended quotas — millions of documents, overnight runs. Flex is for everything in between: workloads that don't need real-time latency but also don't justify a full async pipeline rebuild.
Service tiers are a zero-architecture-change optimization. If you're already calling
generate_content(), you're one config key away from
cutting your inference costs in half or guaranteeing your users the fastest possible response.
The benchmarks show priority is the fastest and most consistent tier (~2.2s avg, 0.4s spread), standard is the reliable default (~2.6s avg), and flex trades latency variance for cost (~3.2s avg, 50% cheaper). During high demand, flex may 503 — design for retries.
And on rate limits: yes, Flex shares your general quota. But the real value isn't just the 50% cost reduction — it's the engineering hours you don't spend building async batch pipelines, managing output files, and rewriting your request flow. Sometimes the best infrastructure decision is the one that lets your team ship something else instead.
Both features are in preview. Flex inference is documented at ai.google.dev/gemini-api/docs/flex-inference and priority inference at ai.google.dev/gemini-api/docs/priority-inference. Full benchmark source: flex-genai.py on GitHub Gist.