API Performance & Latency
Sub-200ms TTFB, 1.2-second image generation, 45-second video generation. Real numbers from production, not cherry-picked benchmarks.
API TTFB
<180ms
p50 global
Image Gen
1.2s
Seedream 1024×1024
Video Gen
~45s
Kling 5s clip
Uptime
99.97%
rolling 90 days
Per-Model Latency Benchmarks
Real production p50 numbers across every model we serve. Updated daily from aggregated telemetry.
Seedream V1.5 (Image)
Seedream V2.0 (Image)
Kling V2.0 (Video)
Kling V3.0 (Video)
Vidu Q1 (Video)
Percentile Breakdown
Tail latency matters. Here's what p50 through p99 look like across our core metrics.
| Percentile | TTFB | Image Gen (1024px) | Video Gen (5s clip) |
|---|---|---|---|
| p50 | 142ms | 1.18s | 43s |
| p75 | 168ms | 1.34s | 48s |
| p90 | 201ms | 1.62s | 56s |
| p99 | 312ms | 2.41s | 72s |
How We Compare
Side-by-side with other AI generation API providers on the metrics that matter most.
| Provider | TTFB | Image Gen | Video Gen | Uptime |
|---|---|---|---|---|
| CreativeAI | 142–195ms | 1.2s | 38–45s | 99.97% |
| Provider A | 350–600ms | 3.5s | 90–120s | 99.5% |
| Provider B | 500–900ms | 4.2s | 120–180s | 99.2% |
Built for Speed at Scale
The infrastructure behind the numbers β from edge routing to GPU auto-scaling.
Global Edge Routing
Requests hit the nearest PoP before reaching GPU clusters. Median network hop < 40ms worldwide.
Dedicated GPU Pools
No cold starts. Warm model instances on A100 / H100 clusters with auto-scaling for burst traffic.
99.97% Uptime SLA
Redundant inference backends with automatic failover. Status page transparency with 90-day rolling metrics.
Real-Time Monitoring
Per-request latency traces exposed via response headers. Integrate with your own Datadog / Grafana dashboards.
Auto-Scaling
Burst from 10 to 10,000 concurrent requests without pre-warming. Queue-based scheduling with priority tiers.
Async + Webhooks
Fire-and-forget generation with webhook callbacks. No polling overhead: get notified the instant results are ready.
Burst Traffic & Scaling
Enterprise-grade scaling without pre-warming or capacity planning. Here's exactly what happens when your traffic goes from 10 req/min to 10,000.
Auto-Scaling GPU Capacity
GPU instances spin up within 2-5 seconds of detecting a queue depth increase. No pre-warming required: just send requests.
- Queue-based scheduling with automatic capacity detection
- Scale from 10 to 10,000 concurrent requests without advance notice
- Enterprise tier: guaranteed capacity during traffic spikes
Queue Management
Requests are queued with priority scheduling. Enterprise traffic is never throttled; it always processes first.
- FIFO queue with priority tier override
- Queue depth monitoring visible in response headers
- Automatic backpressure when approaching capacity limits
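The queue-depth headers can be read straight off any response. A minimal sketch (header names are from this page; the sample values are illustrative, and in production you would capture real headers with `curl -sD - -o /dev/null <endpoint>`):

```shell
# Sample response headers (names from this page; values illustrative)
headers='X-Queue-Wait: 12ms
X-Request-Duration: 1247ms'

# Extract the queue wait so you can alert on rising queue depth
wait=$(printf '%s\n' "$headers" | awk -F': ' '/^X-Queue-Wait/ {print $2}')
echo "queue wait: $wait"
```

Feeding this value into your alerting lets you detect backpressure before requests start receiving 503s.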
Multi-Provider Failover
If one provider hits capacity, requests automatically route to backup providers. Your users never see an error.
- Kling → Seedance → Vidu failover chain for video
- Seedream → Flux → SDXL failover chain for images
- Transparent in response via model_actual and failover_used fields
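Checking those fields takes one pass over the response body. A sketch against a hypothetical response (the `model_actual` and `failover_used` field names are from this page; the JSON values are made up):

```shell
# Hypothetical response body; field names are the ones documented above
response='{"id":"gen_123","model_actual":"seedance-v1","failover_used":true}'

actual=""
if printf '%s' "$response" | grep -q '"failover_used":true'; then
  # Pull the model that actually served the request
  # (naive string extraction for the sketch; use jq in production)
  actual=$(printf '%s' "$response" | sed -n 's/.*"model_actual":"\([^"]*\)".*/\1/p')
  echo "failover engaged, served by: $actual"
fi
```

Logging `model_actual` alongside your own request IDs makes failover events visible in your dashboards without any extra API calls.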
Burst Capacity Limits
Standard tier: 100 concurrent video jobs, 500 concurrent image jobs. Enterprise: custom limits with dedicated pools.
- Rate limits: Free 10/min, Pro 100/min, Enterprise 1000+/min
- Video: 5 creation requests/min, 3 concurrent jobs (standard)
- Daily cap: 100 video generations/24h (standard), unlimited (enterprise)
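When a tier's rate limit is hit, the 429 response carries an X-RateLimit-Reset header (named elsewhere on this page). A sketch that computes the back-off, assuming the header holds an epoch-seconds timestamp (that semantics is an assumption, not documented here):

```shell
# Sample values; in production parse reset_at from the X-RateLimit-Reset
# header and take now from: date +%s
reset_at=1700000005
now=1700000000

wait=$(( reset_at - now ))
[ "$wait" -lt 0 ] && wait=0   # clock-skew guard
echo "rate limited: sleeping ${wait}s before retry"
```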
Customer-Safe Capacity Guidance
Copy-paste ready answers for enterprise buyers asking about scaling, burst traffic, and capacity planning.
What happens if you get overloaded?
Requests queue with visible wait times. Enterprise traffic processes first. If all providers are exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests.
How do I handle burst traffic?
Just send requests. Our auto-scaler handles the rest. For predictable high-volume events (product launches, campaigns), notify us 24h in advance for dedicated capacity allocation.
What's the max throughput?
Standard tier: ~500 video generations/hour sustained. Enterprise: 10,000+ video generations/hour with dedicated GPU pools. Both tiers scale automatically for bursts up to 3x sustained capacity.
Do I need to pre-warm?
No. Our infrastructure keeps warm instances ready. Just start sending requests; capacity scales within seconds.
Webhook Delivery During High Traffic
Webhooks are the production-safe way to handle burst workloads. No polling loops, no resource contention.
Production Guarantees
- 3 delivery attempts with exponential backoff (0s, 5s, 30s)
- 10-second timeout per attempt
- Idempotent delivery with 7-day dedup window
- HMAC-SHA256 signed for verification
- If all delivery attempts fail, the job result remains available via the status API
# During burst traffic, use webhooks instead of polling
curl -X POST https://api.creativeai.run/v1/video/generations \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"prompt": "Product reveal animation",
"duration": 5,
"webhook_url": "https://your-app.com/webhooks/creativeai/video"
}'
# Your webhook receives the result: no polling overhead during traffic spikes
Reliability & Error Handling
Production systems fail. Here's exactly how CreativeAI handles errors, retries, and edge cases, with customer-safe answers for enterprise due diligence.
API Overload Protection
When request volume exceeds capacity, we fail gracefully with actionable errors: never silent failures or hanging requests.
Requests enter a priority queue. Enterprise tier always processes first. Standard tier waits with visible queue time in X-Queue-Wait header.
If queue depth exceeds threshold (10,000 pending), new requests get HTTP 503 with Retry-After header. No request is silently dropped.
If primary model provider is overloaded, we automatically route to backup providers. Response includes failover_used flag for observability.
Auto-scaler spins up new GPU instances within 2-5 seconds of detecting queue buildup. For predictable spikes, notify us 24h ahead for dedicated capacity.
Webhook Failure Handling
Webhook delivery uses bounded retries with exponential backoff. If your endpoint is briefly down, we retry automatically, and you can always query job status via the API.
Retry Schedule
- Attempt 1: immediate (0s)
- Attempt 2: after 5s
- Attempt 3: after 30s
Delivery Guarantees
- Idempotent delivery: the same event_id + terminal status never delivers twice
- 7-day deduplication window for safe reprocessing
- HMAC-SHA256 signature in X-CreativeAI-Signature header for verification
- 10-second timeout per attempt, so slow endpoints don't block delivery
- If all 3 delivery attempts fail, the result remains available via the status API
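Verifying the X-CreativeAI-Signature header is a one-liner with openssl. A sketch assuming the signature is a hex-encoded HMAC-SHA256 over the raw request body (the payload, secret, and exact encoding here are illustrative assumptions):

```shell
# Hypothetical webhook payload and signing secret
payload='{"event_id":"evt_123","status":"completed"}'
secret='whsec_example'

# Recompute the HMAC-SHA256 over the raw body; awk keeps only the hex digest
computed=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')

# In production, compare against the X-CreativeAI-Signature header value;
# here we use a stand-in so the sketch is self-contained
received="$computed"
if [ "$computed" = "$received" ]; then
  echo "signature valid"
else
  echo "signature mismatch: reject the event" >&2
fi
```

Always compute the HMAC over the raw bytes you received, before any JSON parsing or re-serialization, or the digests will not match.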
Error Codes & Recovery
Common HTTP error codes, their causes, and recommended actions.
| Code | Error | Cause | Recommended Action | Recoverable |
|---|---|---|---|---|
| 429 | Rate Limited | Too many requests in time window | Wait for the X-RateLimit-Reset header value, then retry | Yes |
| 503 | Service Unavailable | All providers at capacity, or maintenance | Wait for the Retry-After header (usually 5–60s), then retry | Yes |
| 504 | Gateway Timeout | Upstream provider timeout | Retried automatically; if persistent, check the status page | Yes |
| 400 | Bad Request | Invalid parameters or unsupported model | Check error.message in the response body for details | Fix required |
| 401 | Unauthorized | Missing or invalid API key | Verify the Authorization header contains a valid Bearer token | Fix required |
| 402 | Payment Required | Insufficient credits or expired subscription | Add credits or update your subscription in the dashboard | Fix required |
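The Recoverable column above maps directly onto a client-side retry guard. A minimal sketch:

```shell
# Return 0 (retry) for recoverable codes, 1 otherwise, per the table above
should_retry() {
  case "$1" in
    429|503|504) return 0 ;;   # transient: back off, then retry
    *)           return 1 ;;   # other codes need a request fix, not a retry
  esac
}

if should_retry 503; then
  echo "503 is recoverable: wait for Retry-After, then retry"
fi
if ! should_retry 400; then
  echo "400 needs a request fix, not a retry"
fi
```

Pair this guard with the Retry-After or X-RateLimit-Reset header value for the sleep duration rather than a fixed delay.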
Customer-Safe Reliability Answers
Copy-paste ready answers for enterprise due diligence on reliability, error handling, and SLA questions.
What happens when your API is overloaded?
We queue requests with priority scheduling (enterprise always first). If all capacity is exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests. Your code can safely retry based on the header value.
How do webhooks handle failures?
We make 3 delivery attempts with exponential backoff (0s, 5s, 30s). Events are idempotent, HMAC-SHA256 signed, and safe to deduplicate by event_id + terminal status. If delivery still fails, you can recover the final result via the status API.
What's the retry behavior?
Synchronous API: you control retries; use the Retry-After header from 429/503 responses. Async/webhooks: we make 3 delivery attempts with exponential backoff (0s, 5s, 30s). We recommend client-side retries with exponential backoff for 429/503/504 errors.
Do you have a status page?
Yes: status.creativeai.run shows real-time system health, incident history, and scheduled maintenance. Subscribe for email or webhook notifications.
What's your SLA for availability?
99.97% uptime over rolling 90 days. Enterprise contracts include financial credits for SLA breaches. We publish real-time status and maintain 7-day incident history for transparency.
Per-Request Observability
Every response ships latency headers you can pipe straight into your monitoring stack.
# Response headers on every API call
X-Request-Duration: 1247ms
X-Queue-Wait: 12ms
X-Inference-Time: 1183ms
X-Model: seedream-v1.5
X-Region: us-east-1
# Pipe into Datadog, Grafana, or your own dashboards
curl -s -o /dev/null -w "TTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  https://api.creativeai.run/v1/generate
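Those headers can be reshaped into statsd-style metric lines for Datadog or Grafana. A sketch over a captured header dump (header names and values are from this page; the metric names are made up):

```shell
# Sample header dump; capture real ones with: curl -sD - -o /dev/null <endpoint>
headers='X-Request-Duration: 1247ms
X-Inference-Time: 1183ms
X-Model: seedream-v1.5'

# Emit statsd timing lines (hypothetical metric names); pipe the output to your
# agent, e.g. >/dev/udp/127.0.0.1/8125 on hosts running a statsd listener
metrics=$(printf '%s\n' "$headers" | awk -F': ' '
  /^X-Request-Duration/ { gsub(/ms/,"",$2); print "creativeai.request.duration:" $2 "|ms" }
  /^X-Inference-Time/   { gsub(/ms/,"",$2); print "creativeai.inference.time:"   $2 "|ms" }')
printf '%s\n' "$metrics"
```

Tagging each metric with the X-Model and X-Region values as well makes per-model latency regressions easy to spot.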