Live Production Benchmarks

API Performance & Latency

Sub-200ms TTFB, 1.2-second image generation, 45-second video generation. Real numbers from production, not cherry-picked benchmarks.

API TTFB: <180ms (p50, global)
Image Gen: 1.2s (Seedream, 1024×1024)
Video Gen: ~45s (Kling, 5s clip)
Uptime: 99.97% (rolling 90 days)

Per-Model Latency Benchmarks

Real production p50 numbers across every model we serve. Updated daily from aggregated telemetry.

Seedream V1.5 (Image)

  • TTFB (p50): 142ms
  • Generation (1024×1024): 1.2s
  • Generation (2048×2048): 2.1s
  • Throughput (burst): 48 img/min

Seedream V2.0 (Image)

  • TTFB (p50): 156ms
  • Generation (1024×1024): 1.8s
  • Generation (2048×2048): 3.4s
  • Throughput (burst): 32 img/min

Kling V2.0 (Video)

  • TTFB (p50): 189ms
  • Generation (5s @ 720p): 45s
  • Generation (10s @ 720p): 78s
  • Throughput (sustained): 80 vid/hr

Kling V3.0 (Video)

  • TTFB (p50): 195ms
  • Generation (5s @ 1080p): 38s
  • Generation (10s @ 1080p): 65s
  • Throughput (sustained): 95 vid/hr

Vidu Q1 (Video)

  • TTFB (p50): 172ms
  • Generation (5s @ 720p): 52s
  • Generation (10s @ 720p): 91s
  • Throughput (sustained): 68 vid/hr

Percentile Breakdown

Tail latency matters. Here's what p50 through p99 look like across our core metrics.

Percentile | TTFB | Image Gen (1024px) | Video Gen (5s clip)
p50 | 142ms | 1.18s | 43s
p75 | 168ms | 1.34s | 48s
p90 | 201ms | 1.62s | 56s
p99 | 312ms | 2.41s | 72s

How We Compare

Side-by-side with other AI generation API providers on the metrics that matter most.

Provider | TTFB | Image Gen | Video Gen | Uptime
CreativeAI | 142-195ms | 1.2s | 38-45s | 99.97%
Provider A | 350-600ms | 3.5s | 90-120s | 99.5%
Provider B | 500-900ms | 4.2s | 120-180s | 99.2%

Built for Speed at Scale

The infrastructure behind the numbers, from edge routing to GPU auto-scaling.

Global Edge Routing

Requests hit the nearest PoP before reaching GPU clusters. Median network hop < 40ms worldwide.

Dedicated GPU Pools

No cold starts. Warm model instances on A100 / H100 clusters with auto-scaling for burst traffic.

99.97% Uptime SLA

Redundant inference backends with automatic failover. Status page transparency with 90-day rolling metrics.

Real-Time Monitoring

Per-request latency traces exposed via response headers. Integrate with your own Datadog / Grafana dashboards.

Auto-Scaling

Burst from 10 to 10,000 concurrent requests without pre-warming. Queue-based scheduling with priority tiers.

Async + Webhooks

Fire-and-forget generation with webhook callbacks. No polling overhead: get notified the instant results are ready.

Enterprise Scaling Guide

Burst Traffic & Scaling

Enterprise-grade scaling without pre-warming or capacity planning. Here's exactly what happens when your traffic goes from 10 req/min to 10,000.

Auto-Scaling GPU Capacity

GPU instances spin up within 2-5 seconds of detecting a queue depth increase. No pre-warming required; just send requests.

  • Queue-based scheduling with automatic capacity detection
  • Scale from 10 to 10,000 concurrent requests without advance notice
  • Enterprise tier: guaranteed capacity during traffic spikes

Queue Management

Requests are queued with priority scheduling. Enterprise traffic is never throttled; it always processes first.

  • FIFO queue with priority tier override
  • Queue depth monitoring visible in response headers (see the sketch after this list)
  • Automatic backpressure when approaching capacity limits
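
As a minimal sketch of that queue visibility, the request below dumps response headers and pulls out X-Queue-Wait. The endpoint and header name are taken from the observability section later on this page; the payload fields for /v1/generate are an assumption based on the other examples here.

# Sketch: surface queue wait on a single request (payload fields assumed)
curl -s -D - -o /dev/null \
  -X POST https://api.creativeai.run/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "seedream-v1.5", "prompt": "Product hero shot"}' \
  | grep -i '^X-Queue-Wait'

# Illustrative output: X-Queue-Wait: 12ms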

Multi-Provider Failover

If one provider hits capacity, requests automatically route to backup providers. Your users never see an error.

  • Kling → Seedance → Vidu failover chain for video
  • Seedream → Flux → SDXL failover chain for images
  • Transparent in the response via model_actual and failover_used fields (see the sketch after this list)
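
A hedged way to observe failover in practice: request with model "auto" and read back which model actually served it. The model_actual and failover_used field names come from the list above; the flat JSON shape and the use of jq are assumptions for illustration.

# Check which provider actually handled the request (response shape assumed)
curl -s -X POST https://api.creativeai.run/v1/video/generations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "prompt": "Product reveal animation", "duration": 5}' \
  | jq '{model_actual, failover_used}'

# Illustrative output when the primary provider was saturated:
# { "model_actual": "seedance", "failover_used": true }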

Burst Capacity Limits

Standard tier: 100 concurrent video jobs, 500 concurrent image jobs. Enterprise: custom limits with dedicated pools.

  • Rate limits: Free 10/min, Pro 100/min, Enterprise 1000+/min
  • Video: 5 creation requests/min, 3 concurrent jobs (standard)
  • Daily cap: 100 video generations/24h (standard), unlimited (enterprise)

Customer-Safe Capacity Guidance

Copy-paste ready answers for enterprise buyers asking about scaling, burst traffic, and capacity planning.

What happens if you get overloaded?

Requests queue with visible wait times. Enterprise traffic processes first. If all providers are exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests.

How do I handle burst traffic?

Just send requests. Our auto-scaler handles the rest. For predictable high-volume events (product launches, campaigns), notify us 24h in advance for dedicated capacity allocation.

What's the max throughput?

Standard tier: ~500 video generations/hour sustained. Enterprise: 10,000+ video generations/hour with dedicated GPU pools. Both tiers scale automatically for bursts of up to 3x sustained capacity.

Do I need to pre-warm?

No. Our infrastructure keeps warm instances ready. Just start sending requests; capacity scales within seconds.

Webhook Delivery During High Traffic

Webhooks are the production-safe way to handle burst workloads. No polling loops, no resource contention.

Production Guarantees

  • 3 delivery attempts with exponential backoff (0s, 5s, 30s)
  • 10-second timeout per attempt
  • Idempotent delivery with 7-day dedup window
  • HMAC-SHA256 signed for verification
  • If all delivery attempts fail, the job result remains available via the status API

Burst-safe webhook pattern
# During burst traffic, use webhooks instead of polling
curl -X POST https://api.creativeai.run/v1/video/generations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "prompt": "Product reveal animation",
    "duration": 5,
    "webhook_url": "https://your-app.com/webhooks/creativeai/video"
  }'

# Your webhook receives the result; no polling overhead during traffic spikes

Production Reliability

Reliability & Error Handling

Production systems fail. Here's exactly how CreativeAI handles errors, retries, and edge cases, with customer-safe answers for enterprise due diligence.

API Overload Protection

When request volume exceeds capacity, we fail gracefully with actionable errors, never silent failures or hanging requests.

1. Queue with priority scheduling

Requests enter a priority queue. Enterprise tier always processes first. Standard tier waits with visible queue time in X-Queue-Wait header.

2. Graceful degradation

If queue depth exceeds threshold (10,000 pending), new requests get HTTP 503 with Retry-After header. No request is silently dropped.

3. Multi-provider failover

If the primary model provider is overloaded, we automatically route to backup providers. The response includes a failover_used flag for observability.

4. Capacity scaling

Auto-scaler spins up new GPU instances within 2-5 seconds of detecting queue buildup. For predictable spikes, notify us 24h ahead for dedicated capacity.

Webhook Failure Handling

Webhook delivery uses bounded retries with exponential backoff. If your endpoint is briefly down, we retry automatically, and you can always query job status via the API.

Retry Schedule

Attempt 1: immediate
Attempt 2: 5 seconds
Attempt 3: 30 seconds

Delivery Guarantees

  • Idempotent delivery: the same event_id + terminal status never delivers twice
  • 7-day deduplication window for safe reprocessing
  • HMAC-SHA256 signature in the X-CreativeAI-Signature header for verification (see the sketch after this list)
  • 10-second timeout per attempt, so slow endpoints don't block delivery
  • If all 3 delivery attempts fail, the result remains available via the status API
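
A minimal verification sketch, assuming the signature is the hex-encoded HMAC-SHA256 of the raw webhook body under your endpoint secret. The secret, the payload file, and the exact signing scheme (for example, whether a timestamp is included) are assumptions to confirm against the webhook reference.

# Verify X-CreativeAI-Signature against the raw body (signing scheme assumed)
WEBHOOK_SECRET="your_webhook_secret"                    # placeholder
RECEIVED_SIG="value of the X-CreativeAI-Signature header"

# Compute the expected signature over the exact bytes you received
EXPECTED_SIG=$(openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex < payload.json | awk '{print $NF}')

if [ "$EXPECTED_SIG" = "$RECEIVED_SIG" ]; then
  echo "signature valid - process the event"
else
  echo "signature mismatch - reject the delivery"
fi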

Error Codes & Recovery

Common HTTP error codes, their causes, and recommended actions.

Code | Error | Cause | Recommended Action | Recoverable
429 | Rate Limited | Too many requests in time window | Wait for the X-RateLimit-Reset header value, then retry | Yes
503 | Service Unavailable | All providers at capacity or maintenance | Wait for the Retry-After header (usually 5-60s), then retry | Yes
504 | Gateway Timeout | Upstream provider timeout | Retried automatically; if persistent, check the status page | Yes
400 | Bad Request | Invalid parameters or unsupported model | Check error.message in the response body for details | Fix required
401 | Unauthorized | Missing or invalid API key | Verify the Authorization header contains a valid Bearer token | Fix required
402 | Payment Required | Insufficient credits or subscription expired | Add credits or update the subscription in the dashboard | Fix required
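
A client-side retry sketch for the recoverable codes above: it honors Retry-After when the server provides one and falls back to exponential backoff otherwise. The endpoint and payload are reused from the webhook example; the attempt count and backoff values are illustrative.

# Retry on 429/503/504 using Retry-After, with exponential backoff as a fallback
URL="https://api.creativeai.run/v1/video/generations"
BODY='{"model": "auto", "prompt": "Product reveal animation", "duration": 5}'

for attempt in 1 2 3 4 5; do
  status=$(curl -s -D headers.txt -o response.json -w '%{http_code}' \
    -X POST "$URL" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY")

  case "$status" in
    2*) echo "accepted"; break ;;
    429|503|504)
      # Prefer the server-provided wait; otherwise back off exponentially
      wait=$(grep -i '^Retry-After:' headers.txt | tr -d '\r' | awk '{print $2}')
      [ -z "$wait" ] && wait=$((2 ** attempt))
      echo "got $status, retrying in ${wait}s"
      sleep "$wait"
      ;;
    *) echo "non-retryable error $status"; cat response.json; break ;;
  esac
done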

Customer-Safe Reliability Answers

Copy-paste ready answers for enterprise due diligence on reliability, error handling, and SLA questions.

What happens when your API is overloaded?

We queue requests with priority scheduling (enterprise always first). If all capacity is exhausted, you get a 503 with a Retry-After header: no silent failures, no lost requests. Your code can safely retry based on the header value.

How do webhooks handle failures?

We make 3 delivery attempts with exponential backoff (0s, 5s, 30s). Events are idempotent, HMAC-SHA256 signed, and safe to deduplicate by event_id + terminal status. If delivery still fails, you can recover the final result via the status API.

What's the retry behavior?

Synchronous API: you control retries; use the Retry-After header from 429/503 responses. Async/webhooks: we make 3 delivery attempts with exponential backoff (0s, 5s, 30s). We recommend client-side retries with exponential backoff for 429/503/504 errors.

Do you have a status page?

Yes: status.creativeai.run shows real-time system health, incident history, and scheduled maintenance. Subscribe for email or webhook notifications.

What's your SLA for availability?

99.97% uptime over rolling 90 days. Enterprise contracts include financial credits for SLA breaches. We publish real-time status and maintain 7-day incident history for transparency.

Per-Request Observability

Every response ships latency headers you can pipe straight into your monitoring stack.

# Response headers on every API call
X-Request-Duration: 1247ms
X-Queue-Wait: 12ms
X-Inference-Time: 1183ms
X-Model: seedream-v1.5
X-Region: us-east-1

# Pipe into Datadog, Grafana, or your own dashboards
curl -s -o /dev/null -w "TTFB: %{time_starttransfer}\nTotal: %{time_total}" \
  https://api.creativeai.run/v1/generate
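
As a hedged example of wiring those headers into a metrics pipeline, the sketch below reads X-Request-Duration off one response and emits it as a StatsD timing over UDP. The local StatsD agent address, the metric name, and the request payload are illustrative assumptions.

# Forward per-request latency to a StatsD agent (agent address and metric name assumed)
duration=$(curl -s -D - -o /dev/null \
  -X POST https://api.creativeai.run/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "seedream-v1.5", "prompt": "Product hero shot"}' \
  | grep -i '^X-Request-Duration:' | tr -d '\r' | awk '{print $2}' | sed 's/ms$//')

echo "creativeai.request_duration:${duration}|ms" | nc -u -w1 127.0.0.1 8125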

Need a Latency SLA?

Enterprise plans include contractual p99 latency commitments, dedicated GPU pools, and priority queue access. Talk to our solutions team.

Frequently Asked Questions