When building AI features, latency is a critical metric to optimize. Velvet’s proxy adds only nominal latency, and with our caching feature enabled, response times improve by more than 50%.
Latency is the delay between a user action and the system’s response. With AI features, there are additional factors to consider, including inference speed, token generation, and how your prompts are constructed. Read OpenAI’s docs on latency optimization to learn more.
Velvet operates as a proxy, so it’s critical that we don’t add unnecessary latency to requests. We ran an experiment to test average and p99 latency.
In summary, we found that Velvet’s gateway latency is nominal: 200-300ms per request on average, with minimums as low as 85ms. With caching, response times improve by 50% or more on concise chat completions (with an even larger benefit on longer completions). Velvet’s latency should be imperceptible to end users.
We benchmarked Velvet’s gateway latency relative to industry standards.
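If you want to run a comparison like this yourself, here’s a minimal sketch of the approach: time a batch of identical chat completions and compute the mean and p99 of the samples. The model, prompt, and run count below are illustrative, not the exact conditions we used.

```typescript
import OpenAI from "openai";

// Time a single chat completion, returning wall-clock latency in ms.
async function timeCompletion(client: OpenAI): Promise<number> {
  const start = performance.now();
  await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [{ role: "user", content: "Say hello in one word." }],
  });
  return performance.now() - start;
}

// Nearest-rank percentile over a set of latency samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

async function benchmark(client: OpenAI, runs = 100): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) samples.push(await timeCompletion(client));
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  console.log(`mean: ${mean.toFixed(0)}ms, p99: ${percentile(samples, 99).toFixed(0)}ms`);
}

// Run once against OpenAI directly, then again against a client whose
// baseURL points at the gateway; the difference between the two reports
// is the gateway's added latency.
await benchmark(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }));
```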
TLDR
The average latency delta for a chat completion between OpenAI and Velvet is 208ms, with a p99 delta of 231ms. Caching decreases response times by more than 50%.
[Benchmark charts: average latency, no cache; average latency, cached; delta between OpenAI and the gateway, no cache; delta between OpenAI and the gateway, cached.]
As the benchmark results illustrate, caching meaningfully reduces both latency and cost. If you use Velvet, enabling caching is easy: just add a 'velvet-cache-enabled' header set to 'true'.
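For example, with the OpenAI Node SDK you can attach the header to every request by setting it as a default header on the client. The gateway base URL below is a placeholder, so check our documentation for your workspace’s actual endpoint; only the 'velvet-cache-enabled: true' header comes from this article.

```typescript
import OpenAI from "openai";

// Route requests through the Velvet gateway with caching enabled.
// NOTE: the baseURL is a placeholder — use the gateway URL from your
// Velvet workspace. Only the cache header is described in this article.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://<your-velvet-gateway>/v1", // hypothetical endpoint
  defaultHeaders: {
    "velvet-cache-enabled": "true", // opt this client into Velvet's cache
  },
});

// Repeated identical requests can now be served from cache, cutting
// response times by 50% or more on concise chat completions.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize our latency benchmark." }],
});
console.log(completion.choices[0].message.content);
```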
Read our article on caching to learn more.
Want to get set up with Velvet? Read our documentation and create a workspace to get started.
Use our data copilot to query your AI request logs with SQL.
Use Velvet to observe, analyze, and optimize your AI features.