Live Production Benchmarks

API Performance & Latency

Sub-200ms TTFB, 1.2-second image generation, 45-second video generation. Real numbers from production, not cherry-picked benchmarks.

API TTFB: <180ms (p50, global)
Image Gen: 1.2s (Seedream, 1024×1024)
Video Gen: ~45s (Kling, 5s clip)
Uptime: 99.97% (rolling 90 days)

Per-Model Latency Benchmarks

Real production p50 numbers across every model we serve. Updated daily from aggregated telemetry.

Seedream V1.5 (Image)

  • TTFB (p50): 142ms
  • Generation (1024×1024): 1.2s
  • Generation (2048×2048): 2.1s
  • Throughput (burst): 48 img/min

Seedream V2.0 (Image)

  • TTFB (p50): 156ms
  • Generation (1024×1024): 1.8s
  • Generation (2048×2048): 3.4s
  • Throughput (burst): 32 img/min

Kling V2.0 (Video)

  • TTFB (p50): 189ms
  • Generation (5s @ 720p): 45s
  • Generation (10s @ 720p): 78s
  • Throughput (sustained): 80 vid/hr

Kling V3.0 (Video)

  • TTFB (p50): 195ms
  • Generation (5s @ 1080p): 38s
  • Generation (10s @ 1080p): 65s
  • Throughput (sustained): 95 vid/hr

Vidu Q1 (Video)

  • TTFB (p50): 172ms
  • Generation (5s @ 720p): 52s
  • Generation (10s @ 720p): 91s
  • Throughput (sustained): 68 vid/hr

Percentile Breakdown

Tail latency matters. Here's what p50 through p99 look like across our core metrics.

Percentile | TTFB | Image Gen (1024px) | Video Gen (5s clip)
p50 | 142ms | 1.18s | 43s
p75 | 168ms | 1.34s | 48s
p90 | 201ms | 1.62s | 56s
p99 | 312ms | 2.41s | 72s

How We Compare

Side-by-side with other AI generation API providers on the metrics that matter most.

Provider | TTFB | Image Gen | Video Gen | Uptime
CreativeAI | 142-195ms | 1.2s | 38-45s | 99.97%
Provider A | 350-600ms | 3.5s | 90-120s | 99.5%
Provider B | 500-900ms | 4.2s | 120-180s | 99.2%

Built for Speed at Scale

The infrastructure behind the numbers, from edge routing to GPU auto-scaling.

Global Edge Routing

Requests hit the nearest PoP before reaching GPU clusters. Median network hop < 40ms worldwide.

Dedicated GPU Pools

No cold starts. Warm model instances on A100 / H100 clusters with auto-scaling for burst traffic.

99.97% Uptime SLA

Redundant inference backends with automatic failover. Status page transparency with 90-day rolling metrics.

Real-Time Monitoring

Per-request latency traces exposed via response headers. Integrate with your own Datadog / Grafana dashboards.

Auto-Scaling

Burst from 10 to 10,000 concurrent requests without pre-warming. Queue-based scheduling with priority tiers.

Async + Webhooks

Fire-and-forget generation with webhook callbacks. No polling overhead: get notified the instant results are ready.

Enterprise Scaling Guide

Burst Traffic & Scaling

Enterprise-grade scaling without pre-warming or capacity planning. Here's exactly what happens when your traffic goes from 10 req/min to 10,000.

Auto-Scaling GPU Capacity

GPU instances spin up within 2-5 seconds of detecting a queue depth increase. No pre-warming required; just send requests.

  • Queue-based scheduling with automatic capacity detection
  • Scale from 10 to 10,000 concurrent requests without advance notice
  • Enterprise tier: guaranteed capacity during traffic spikes

Queue Management

Requests are queued with priority scheduling. Enterprise traffic is never throttled; it always processes first.

  • FIFO queue with priority tier override
  • Queue depth monitoring visible in response headers (see the sketch after this list)
  • Automatic backpressure when approaching capacity limits
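
As a minimal sketch of that queue visibility, the request below dumps response headers and pulls out X-Queue-Wait. The endpoint and header name are taken from the observability section later on this page; the payload fields for /v1/generate are an assumption based on the other examples here.

# Sketch: surface queue wait on a single request (payload fields assumed)
curl -s -D - -o /dev/null \
  -X POST https://api.creativeai.run/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "seedream-v1.5", "prompt": "Product hero shot"}' \
  | grep -i '^X-Queue-Wait'

# Illustrative output: X-Queue-Wait: 12ms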

Multi-Provider Failover

If one provider hits capacity, requests automatically route to backup providers. Your users never see an error.

  • Kling → Seedance → Vidu failover chain for video
  • Seedream → Flux → SDXL failover chain for images
  • Transparent in the response via model_actual and failover_used fields (see the sketch after this list)
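
A hedged way to observe failover in practice: request with model "auto" and read back which model actually served it. The model_actual and failover_used field names come from the list above; the flat JSON shape and the use of jq are assumptions for illustration.

# Check which provider actually handled the request (response shape assumed)
curl -s -X POST https://api.creativeai.run/v1/video/generations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "prompt": "Product reveal animation", "duration": 5}' \
  | jq '{model_actual, failover_used}'

# Illustrative output when the primary provider was saturated:
# { "model_actual": "seedance", "failover_used": true }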

Burst Capacity Limits

Standard tier: 100 concurrent video jobs, 500 concurrent image jobs. Enterprise: custom limits with dedicated pools.

  • Rate limits: Free 10/min, Pro 100/min, Enterprise 1000+/min
  • Video: 5 creation requests/min, 3 concurrent jobs (standard)
  • Daily cap: 100 video generations/24h (standard), unlimited (enterprise)

Customer-Safe Capacity Guidance

Copy-paste ready answers for enterprise buyers asking about scaling, burst traffic, and capacity planning.

What happens if you get overloaded?

Requests queue with visible wait times. Enterprise traffic processes first. If all providers are exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests.

How do I handle burst traffic?

Just send requests. Our auto-scaler handles the rest. For predictable high-volume events (product launches, campaigns), notify us 24h in advance for dedicated capacity allocation.

What's the max throughput?

Standard tier: ~500 video generations/hour sustained. Enterprise: 10,000+ video generations/hour with dedicated GPU pools. Both tiers scale automatically for bursts of up to 3x sustained capacity.

Do I need to pre-warm?

No. Our infrastructure keeps warm instances ready. Just start sending requests; capacity scales within seconds.

Webhook Delivery During High Traffic

Webhooks are the production-safe way to handle burst workloads. No polling loops, no resource contention.

Production Guarantees

  • 3 delivery attempts with exponential backoff (0s, 5s, 30s)
  • 10-second timeout per attempt
  • Idempotent delivery with 7-day dedup window
  • HMAC-SHA256 signed for verification
  • If all delivery attempts fail, the job result remains available via the status API

Burst-safe webhook pattern
# During burst traffic, use webhooks instead of polling
curl -X POST https://api.creativeai.run/v1/video/generations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "prompt": "Product reveal animation",
    "duration": 5,
    "webhook_url": "https://your-app.com/webhooks/creativeai/video"
  }'

# Your webhook receives the result; no polling overhead during traffic spikes

Production Reliability

Reliability & Error Handling

Production systems fail. Here's exactly how CreativeAI handles errors, retries, and edge cases, with customer-safe answers for enterprise due diligence.

API Overload Protection

When request volume exceeds capacity, we fail gracefully with actionable errors, never silent failures or hanging requests.

1. Queue with priority scheduling

Requests enter a priority queue. Enterprise tier always processes first. Standard tier waits with visible queue time in X-Queue-Wait header.

2. Graceful degradation

If queue depth exceeds threshold (10,000 pending), new requests get HTTP 503 with Retry-After header. No request is silently dropped.

3. Multi-provider failover

If the primary model provider is overloaded, we automatically route to backup providers. The response includes a failover_used flag for observability.

4. Capacity scaling

Auto-scaler spins up new GPU instances within 2-5 seconds of detecting queue buildup. For predictable spikes, notify us 24h ahead for dedicated capacity.

Webhook Failure Handling

Webhook delivery uses bounded retries with exponential backoff. If your endpoint is briefly down, we retry automatically, and you can always query job status via the API.

Retry Schedule

Attempt 1: immediate
Attempt 2: 5 seconds
Attempt 3: 30 seconds

Delivery Guarantees

  • Idempotent delivery: the same event_id + terminal status never delivers twice
  • 7-day deduplication window for safe reprocessing
  • HMAC-SHA256 signature in the X-CreativeAI-Signature header for verification (see the sketch after this list)
  • 10-second timeout per attempt, so slow endpoints don't block delivery
  • If all 3 delivery attempts fail, the result remains available via the status API
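
A minimal verification sketch, assuming the signature is the hex-encoded HMAC-SHA256 of the raw webhook body under your endpoint secret. The secret, the payload file, and the exact signing scheme (for example, whether a timestamp is included) are assumptions to confirm against the webhook reference.

# Verify X-CreativeAI-Signature against the raw body (signing scheme assumed)
WEBHOOK_SECRET="your_webhook_secret"                    # placeholder
RECEIVED_SIG="value of the X-CreativeAI-Signature header"

# Compute the expected signature over the exact bytes you received
EXPECTED_SIG=$(openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex < payload.json | awk '{print $NF}')

if [ "$EXPECTED_SIG" = "$RECEIVED_SIG" ]; then
  echo "signature valid - process the event"
else
  echo "signature mismatch - reject the delivery"
fi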

Error Codes & Recovery

Common HTTP error codes, their causes, and recommended actions.

Code | Error | Cause | Recommended Action | Recoverable
429 | Rate Limited | Too many requests in time window | Wait for the X-RateLimit-Reset header value, then retry | Yes
503 | Service Unavailable | All providers at capacity or maintenance | Wait for the Retry-After header (usually 5-60s), then retry | Yes
504 | Gateway Timeout | Upstream provider timeout | Retried automatically; if persistent, check the status page | Yes
400 | Bad Request | Invalid parameters or unsupported model | Check error.message in the response body for details | Fix required
401 | Unauthorized | Missing or invalid API key | Verify the Authorization header contains a valid Bearer token | Fix required
402 | Payment Required | Insufficient credits or subscription expired | Add credits or update the subscription in the dashboard | Fix required
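
A client-side retry sketch for the recoverable codes above: it honors Retry-After when the server provides one and falls back to exponential backoff otherwise. The endpoint and payload are reused from the webhook example; the attempt count and backoff values are illustrative.

# Retry on 429/503/504 using Retry-After, with exponential backoff as a fallback
URL="https://api.creativeai.run/v1/video/generations"
BODY='{"model": "auto", "prompt": "Product reveal animation", "duration": 5}'

for attempt in 1 2 3 4 5; do
  status=$(curl -s -D headers.txt -o response.json -w '%{http_code}' \
    -X POST "$URL" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY")

  case "$status" in
    2*) echo "accepted"; break ;;
    429|503|504)
      # Prefer the server-provided wait; otherwise back off exponentially
      wait=$(grep -i '^Retry-After:' headers.txt | tr -d '\r' | awk '{print $2}')
      [ -z "$wait" ] && wait=$((2 ** attempt))
      echo "got $status, retrying in ${wait}s"
      sleep "$wait"
      ;;
    *) echo "non-retryable error $status"; cat response.json; break ;;
  esac
done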

Customer-Safe Reliability Answers

Copy-paste ready answers for enterprise due diligence on reliability, error handling, and SLA questions.

What happens when your API is overloaded?

We queue requests with priority scheduling (enterprise always first). If all capacity is exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests. Your code can safely retry based on the header value.

How do webhooks handle failures?

We make 3 delivery attempts with exponential backoff (0s, 5s, 30s). Events are idempotent, HMAC-SHA256 signed, and safe to deduplicate by event_id + terminal status. If delivery still fails, you can recover the final result via the status API.

What's the retry behavior?

Synchronous API: you control retries; use the Retry-After header from 429/503 responses. Async/webhooks: we make 3 delivery attempts with exponential backoff (0s, 5s, 30s). We recommend client-side retries with exponential backoff for 429/503/504 errors.

Do you have a status page?

Yes: status.creativeai.run shows real-time system health, incident history, and scheduled maintenance. Subscribe for email or webhook notifications.

What's your SLA for availability?

99.97% uptime over rolling 90 days. Enterprise contracts include financial credits for SLA breaches. We publish real-time status and maintain 7-day incident history for transparency.

Per-Request Observability

Every response ships latency headers you can pipe straight into your monitoring stack.

# Response headers on every API call
X-Request-Duration: 1247ms
X-Queue-Wait: 12ms
X-Inference-Time: 1183ms
X-Model: seedream-v1.5
X-Region: us-east-1

# Pipe into Datadog, Grafana, or your own dashboards
curl -s -o /dev/null -w "TTFB: %{time_starttransfer}\nTotal: %{time_total}" \
  https://api.creativeai.run/v1/generate
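
As a hedged example of wiring those headers into a metrics pipeline, the sketch below reads X-Request-Duration off one response and emits it as a StatsD timing over UDP. The local StatsD agent address, the metric name, and the request payload are illustrative assumptions.

# Forward per-request latency to a StatsD agent (agent address and metric name assumed)
duration=$(curl -s -D - -o /dev/null \
  -X POST https://api.creativeai.run/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "seedream-v1.5", "prompt": "Product hero shot"}' \
  | grep -i '^X-Request-Duration:' | tr -d '\r' | awk '{print $2}' | sed 's/ms$//')

echo "creativeai.request_duration:${duration}|ms" | nc -u -w1 127.0.0.1 8125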

Need a Latency SLA?

Enterprise plans include contractual p99 latency commitments, dedicated GPU pools, and priority queue access. Talk to our solutions team.

Frequently Asked Questions