Velvet AI gateway latency benchmarks

When building AI features, latency is a critical metric to optimize. Velvet’s proxy adds little latency (roughly 200ms per request on average in our benchmark), and enabling our caching feature improves response times by more than 50%.

Latency is nominal, plus a 50% improvement with caching

Latency is the delay between a user action and the system response. When leveraging AI, there are additional factors to consider - including inference speed, token generation, and prompt implementation. Read OpenAI’s docs on latency optimization to learn more.

Velvet operates as a proxy, so it’s critical that we don’t add unnecessary latency to requests. We ran an experiment to test average and p99 latency.

In summary, we found that Velvet’s gateway latency is nominal: between 200 and 300ms per request on average, with minimums as low as 85ms. With caching, we can improve response times by 50% or more on concise chat completions (with increased benefit on longer completions). Velvet’s latency should be imperceptible to end users.

Velvet’s latency benchmarks

We benchmarked Velvet’s gateway latency against direct requests to OpenAI.

Test conditions

  • Network is at a crowded coffee shop with 40-100ms of loaded latency
  • 100 requests per test
  • Concise chat completion example
  • No gaming of results: these are first-shot attempts for 3 scenarios
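
To make the test conditions concrete, here is a minimal sketch of how a run of timed requests could be collected. It is not the harness we used; `time_requests` and `fake_request` are hypothetical names, and the stand-in request just sleeps so the example runs without a network.

```python
import time
from typing import Callable, List


def time_requests(send: Callable[[], object], n: int = 100) -> List[float]:
    """Call `send` n times and return each call's wall-clock duration in ms."""
    durations_ms = []
    for _ in range(n):
        start = time.perf_counter()
        send()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def fake_request() -> None:
    # Hypothetical stand-in for a real chat completion call.
    time.sleep(0.001)


samples = time_requests(fake_request, n=5)
print(f"collected {len(samples)} samples, fastest {min(samples):.2f}ms")
```

In a real comparison, `send` would be swapped for a function that issues the same chat completion directly to the provider, and then again through the gateway, with 100 requests per scenario.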

Definitions

  • Latency: Delay between a user action and the system response
  • Response caching: Return the same response without additional inference cost
  • p99: 99% of requests will be faster than the given number
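
As an illustration of the p99 definition, the sketch below computes percentiles with the nearest-rank method. This is one common convention, assumed here for simplicity; other tools interpolate between samples and may report slightly different values. The `percentile` helper and the sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in ms (1, 2, ..., 100), not benchmark data.
latencies_ms = [float(i) for i in range(1, 101)]
print(percentile(latencies_ms, 99))
print(percentile(latencies_ms, 90))
```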

TLDR

The average latency delta for a chat completion between OpenAI and Velvet is 208ms, with a p99 delta of 231ms. Caching decreases response times by more than 50%.

Average latency, no cache

  • Min delta: 85ms
  • Mean delta: 208ms

Average latency, cached

  • Min delta: -127ms
  • Mean delta: -349ms

Percentile delta between OpenAI and Gateway, no cache

  • p99: 231.347ms
  • p95: 299.669ms
  • p90: 216.082ms

Percentile delta between OpenAI and Gateway, cached

  • p99: -644.526ms (50% decrease)
  • p95: -516.373ms (55.54% decrease)
  • p90: -519.085ms (56.99% decrease)
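
For readers who want to reproduce this kind of comparison, the sketch below shows how a delta and a percent decrease can be derived from two latency measurements. It assumes the delta is gateway latency minus direct latency, so a negative value means the gateway responded faster. The helper names and the input numbers are illustrative, not taken from our benchmark.

```python
def latency_delta(direct_ms: float, gateway_ms: float) -> float:
    """Gateway latency minus direct latency; negative means the gateway was faster."""
    return gateway_ms - direct_ms


def percent_decrease(direct_ms: float, gateway_ms: float) -> float:
    """Reduction in latency relative to the direct request, as a percentage."""
    if direct_ms <= 0:
        raise ValueError("direct_ms must be positive")
    return (direct_ms - gateway_ms) * 100 / direct_ms


# Illustrative values only: a 1000ms direct call served from cache in 400ms.
delta = latency_delta(1000.0, 400.0)
decrease = percent_decrease(1000.0, 400.0)
print(f"delta: {delta}ms, decrease: {decrease}%")
```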

Enable caching to optimize latency

As illustrated in our benchmark results, introducing caching can lead to a meaningful reduction in latency and costs. If you use Velvet, enabling caching is easy. Simply add a 'velvet-cache-enabled' header set to 'true'.
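
As a rough sketch, the request headers could be assembled like this. Only the 'velvet-cache-enabled' header and its 'true' value come from this article; the `build_headers` helper, the bearer-token layout, and the placeholder key are assumptions for illustration, so check the documentation for the exact gateway setup.

```python
from typing import Dict


def build_headers(api_key: str, cache_enabled: bool = True) -> Dict[str, str]:
    """Build request headers, optionally opting in to response caching."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed bearer-token auth
        "Content-Type": "application/json",
    }
    if cache_enabled:
        # Header name and value as described in this article.
        headers["velvet-cache-enabled"] = "true"
    return headers


headers = build_headers("sk-example-key")  # placeholder key, not a real credential
print(headers["velvet-cache-enabled"])
```

Pass the resulting dictionary to whichever HTTP client or SDK you already use to send requests through the gateway.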

Read our article on caching to learn more.

Want to get set up with Velvet? Read our documentation and create a workspace to get started.
