YRCloudFile KVCache Test: 13x Performance Boost, Over 4x Latency Reduction

As foundation models like DeepSeek begin to gain traction across industries, the synergy between storage and compute is becoming critical for enterprises aiming to enhance AI inference efficiency and reduce operational costs. KVCache, with its innovative "offload large KV caches from GPU memory to storage" approach, has proven instrumental in boosting inference performance, making it indispensable for building robust AI infrastructure.
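
To see why offloading matters, consider the size of the KV cache itself. The sketch below applies the standard per-token KV cache formula; the Qwen2.5-7B configuration values (28 layers, 4 KV heads via grouped-query attention, head dimension 128, FP16 cache) are assumptions drawn from the public model card, not from our test setup.

```python
# Back-of-the-envelope sketch of why long contexts overflow GPU memory.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for the separate key and value tensors kept at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# Assumed Qwen2.5-7B config: 28 layers, 4 KV heads (GQA), head_dim 128, FP16
per_request = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128,
                             seq_len=20_000)
print(f"KV cache for one 20K-token request: {per_request / 2**30:.2f} GiB")
# Roughly 1 GiB per request: a few dozen concurrent long-context sessions
# exhaust GPU memory on their own, which is exactly the pressure that
# offloading the cache to PB-scale storage relieves.
```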

YanRong Tech has pioneered the integration of KVCache into its YRCloudFile distributed file system. Designed for PB-scale cache expansion, this innovation dramatically improves KV cache hit rates and long-context processing, delivering a cost-efficient solution for large model inference.

To quantify the achievable impact, we simulated realistic workloads using publicly available datasets, industry-standard benchmarking tools, and NVIDIA GPU hardware. The results demonstrate that under the same hardware and TTFT (Time-To-First-Token) latency constraints, YRCloudFile KVCache supports significantly higher concurrent query throughput, delivering concrete, quantifiable value for inference workloads.

Real-World Testing of YRCloudFile KVCache's Inference Performance Gains

To evaluate how extending GPU memory to YRCloudFile KVCache enhances token processing efficiency, we conducted multi-phase tests comparing native vLLM performance against vLLM+YRCloudFile KVCache across varying token counts and configurations.

Test 1: TTFT in Long-Context Scenarios

  • Scenario: Evaluate total response time for a single query with long context inputs (over 20K tokens).
  • GPU: NVIDIA T4
  • Model: Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
  • Method: Simulated QA chatbot queries using identical context and prompts for both configurations (a TTFT-measurement sketch follows this list).
  • Result: YRCloudFile KVCache delivered up to 13x faster TTFT in long-context scenarios. This improvement is attributed to a high cache hit ratio and rapid large-scale data access, enabling superior inference performance.
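
As a rough illustration of the metric being measured, here is a minimal TTFT-measurement sketch against a vLLM server's OpenAI-compatible streaming endpoint. The URL and filler prompt are placeholder assumptions; the actual test used the QA-chatbot simulation described above, not this script.

```python
import time
import requests

def measure_ttft(prompt, url="http://localhost:8000/v1/completions",
                 model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"):
    """Time from sending the request to receiving the first streamed chunk."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line carries the first token
                return time.perf_counter() - start
    return None

# Stand-in for a >20K-token document plus question; the real test used
# identical long contexts across both configurations.
print(f"TTFT: {measure_ttft('...' * 10_000):.2f}s")
```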

With users typically expecting TTFT under 2 seconds, we designed Test 2 to evaluate concurrency under real-world latency constraints.

Test 2: Concurrent Queries Supported at TTFT ≤ 2s

  • Scenario: Compare native vLLM and vLLM+YRCloudFile KVCache concurrent query capacity under fixed GPU and TTFT constraints, across different prompt lengths (--max-prompt-length); a concurrency-sweep sketch follows this list.
  • GPU: NVIDIA L20
  • Model: Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
  • Tool: evalscope, using the longalpaca dataset and varying --max-prompt-length.
  • Result: With TTFT ≤ 2s, YRCloudFile KVCache enables 8x more concurrent requests compared to native vLLM. This allows enterprises to serve more users simultaneously with the same GPU resources, dramatically improving overall system throughput and utilization.
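
A minimal sketch of the shape of this measurement, reusing the measure_ttft() helper from the previous sketch: ramp concurrency upward until the slowest request exceeds the 2-second budget. The exponential ramp and filler prompt are illustrative assumptions, not the evalscope procedure used in the actual test.

```python
from concurrent.futures import ThreadPoolExecutor

def max_concurrency_under_budget(prompt, budget_s=2.0, limit=256):
    """Largest tested concurrency whose slowest request meets the TTFT budget."""
    best, concurrency = 0, 1
    while concurrency <= limit:
        # Fire `concurrency` identical requests at once and collect TTFTs
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            ttfts = list(pool.map(measure_ttft, [prompt] * concurrency))
        if all(t is not None and t <= budget_s for t in ttfts):
            best = concurrency
            concurrency *= 2  # double while every request stays under budget
        else:
            break
    return best

print("max concurrency with TTFT <= 2s:",
      max_concurrency_under_budget("..." * 2_000))
```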

Test 3: TTFT Under High Concurrency

  • Scenario: Measure TTFT at 30 concurrent requests with varying context lengths.
  • GPU: NVIDIA L20
  • Model: Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
  • Tool: evalscope with --dataset longalpaca, dynamic --max-prompt-length, and a fixed concurrency of 30 (see the sketch after this list).
  • Result: Under high concurrency, YRCloudFile KVCache achieved over 4x lower TTFT across different context lengths. This demonstrates its ability to significantly reduce latency and enhance user experience in demanding real-time inference scenarios.
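
The same helper can sketch Test 3's shape: hold concurrency at 30 and sweep the context length, analogous to varying --max-prompt-length in evalscope. The specific prompt lengths and filler text below are placeholder assumptions standing in for longalpaca samples.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

CONCURRENCY = 30  # fixed, matching the test's concurrency setting
for prompt_tokens in (2_000, 8_000, 16_000, 24_000):
    prompt = "..." * prompt_tokens  # rough stand-in for a real long document
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        ttfts = [t for t in pool.map(measure_ttft, [prompt] * CONCURRENCY) if t]
    print(f"{prompt_tokens:>6} tokens -> mean TTFT {mean(ttfts):.2f}s")
```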

Our tests clearly validate the performance advantages of YRCloudFile KVCache in both long-context and high-concurrency inference scenarios: it boosts concurrent request capacity by 8x under strict TTFT constraints while cutting latency by more than 4x. These results reinforce the critical role of compute-storage co-optimization in driving AI inference efficiency. More importantly, they show how extending GPU memory via distributed storage can break traditional compute bottlenecks and unlock order-of-magnitude improvements in resource utilization.

As AI adoption grows, optimizing inference efficiency and cost becomes a competitive necessity. With innovations in storage, caching, and latency reduction, YRCloudFile KVCache redefines the economics of AI inference by transforming storage resources into computational gains through PB-scale cache expansion.
