Enhancing KV Block Management: Request-Level Dumps & Multi-Connector

Nov 3, 2025 by Admin

Hey guys! Today, let's dive into a feature update that changes how the UC connector handles KV blocks. The update moves KV block dumps and commits from the block level to the request level and improves support for multi-connector setups. The payoff is fewer tasks to manage and better compatibility with upstream vLLM.

Motivation Behind the Update

So, in the previous v0.9.2 version, the UC connector used to dispatch dump tasks and commits at the block level. On the surface, this seemed like a good idea because it offered fine-grained control over which KV blocks were successfully offloaded. However, this approach came with a few headaches that we needed to address.

Firstly, it created coupling to vLLM internal fields. To track whether each block was successfully dumped, we had to patch the vLLM source code so the UC connector could access and record per-block status. Those patches meddled with the semantics of internal request fields (like succeed_dumped_blocks). Not only is this risky, but it can also cause compatibility issues with upstream updates. Imagine having to rewrite code every time there's a new release. Not fun, right?

Secondly, we noticed a performance overhead. Dealing with block-level granularity meant a high number of tasks, which in turn degraded the connector's throughput. Managing each block individually created unnecessary overhead, slowing things down. In a world where speed is king, this was simply unacceptable.

Lastly, there were issues with incorrect commits in multi-connector setups. When using MultiKVConnector, the system would sometimes report already-finished request IDs across different connectors. This led UC to incorrectly mark dumped blocks as failed, even when they weren't. These false negatives were a real pain.

To tackle these issues, starting from vLLM version 0.11.0, we switched to request-level dispatch of dump and commit tasks. This ensures compatibility with multi-connector topologies, making everything smoother and more reliable.

Changes Implemented

Let's break down the changes we've made to address these problems. These adjustments aim to provide a more efficient and reliable system.

1. Request-Level Dump & Commit Dispatch

Instead of individually looping over each block and issuing dump calls, the connector now batches all blocks associated with a request. It dispatches one task per request to the underlying connector. This significantly reduces the number of tasks and overhead.

The dump_tasks structure has also been revamped. It used to be {req_id: {block_id: [task]}}, but now it's {req_id: [task]}. This flat structure simplifies task management and reduces complexity.
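
To make the change in shape concrete, here is a minimal sketch of the two layouts and of per-request batching. The names Task and dispatch_request_dumps are hypothetical stand-ins for illustration, not identifiers from uc_connector.py, and the real connector submits work to its storage backend rather than building a plain dataclass.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical handle for one dump operation (not the real UC task type)."""
    req_id: str
    block_ids: List[int]


# Old layout (v0.9.2): tasks nested per block under each request.
OldDumpTasks = Dict[str, Dict[int, List[Task]]]
# New layout: a flat list of tasks per request.
NewDumpTasks = Dict[str, List[Task]]


def dispatch_request_dumps(blocks_by_req: Dict[str, List[int]]) -> NewDumpTasks:
    """Create one dump task per request, covering all of that request's blocks."""
    dump_tasks: NewDumpTasks = {}
    for req_id, block_ids in blocks_by_req.items():
        if not block_ids:
            continue  # nothing to offload for this request
        task = Task(req_id=req_id, block_ids=list(block_ids))
        dump_tasks.setdefault(req_id, []).append(task)
    return dump_tasks


tasks = dispatch_request_dumps({"req-a": [0, 1, 2], "req-b": [7], "req-c": []})
```

In this sketch, req-a has three blocks but maps to a single task, where a per-block layout would have needed one entry per block.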

2. Request-Level Commit Handling

The UC connector now commits all blocks in a request at once. Whether the commit is marked as a success depends on whether all of the request's tasks completed successfully. No more piecemeal commits!

To keep track of the request-level status, we introduced self.success_reqs and self.failed_reqs sets. These sets help us monitor which requests have succeeded or failed, ensuring accurate commits.

When a request finishes, the block-wise commit calls are aggregated from the BlockInfo structure, which streamlines the commit process.
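
The request-level bookkeeping described above can be sketched roughly as follows. This is an illustration under assumptions, not the connector's code: RequestCommitTracker is a hypothetical helper, the commit callback is assumed to take a list of block identifiers plus a success flag, and the plain list of block hashes stands in for whatever BlockInfo actually provides.

```python
from typing import Callable, List, Set


class RequestCommitTracker:
    """Hypothetical helper mirroring the success_reqs / failed_reqs idea."""

    def __init__(self, commit: Callable[[List[str], bool], None]) -> None:
        self.success_reqs: Set[str] = set()
        self.failed_reqs: Set[str] = set()
        self._commit = commit  # assumed signature: commit(block_hashes, is_success)

    def record(self, req_id: str, task_results: List[bool]) -> None:
        """A request succeeds only if every one of its dump tasks succeeded."""
        if task_results and all(task_results):
            self.success_reqs.add(req_id)
        else:
            self.failed_reqs.add(req_id)

    def on_finished(self, req_id: str, block_hashes: List[str]) -> None:
        """Commit all blocks of a finished request in a single call."""
        if req_id in self.success_reqs:
            self._commit(block_hashes, True)
            self.success_reqs.discard(req_id)
        elif req_id in self.failed_reqs:
            self._commit(block_hashes, False)
            self.failed_reqs.discard(req_id)
        # Unknown request IDs are ignored: there is nothing to commit for them.


calls = []
tracker = RequestCommitTracker(lambda hashes, ok: calls.append((list(hashes), ok)))
tracker.record("req-a", [True, True])
tracker.record("req-b", [True, False])
tracker.on_finished("req-a", ["h0", "h1"])
tracker.on_finished("req-b", ["h2"])
tracker.on_finished("req-x", ["h9"])  # never recorded, so no commit is issued
```

Here req-a is committed once as a success, req-b once as a failure, and the unknown req-x triggers no commit at all.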

3. MultiConnector Compatibility

To better handle multi-connector scenarios, we introduced self.current_req and self.last_req. These track the most recent and previous requests during scheduling.

These fields help the UC connector avoid false negatives. When other connectors redundantly report already-finished request IDs, the UC connector can now accurately determine the status and prevent premature commit(..., False) calls on valid blocks.
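
To illustrate the failure mode, here is a small sketch of a guard against redundant finished-request reports. Note that it swaps in a simpler technique: it keeps a set of already-committed IDs instead of using current_req and last_req, because the exact comparison logic for those two fields is not described here. The function name and parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def commits_for_finished(
    finished_ids: List[str],
    pending: Dict[str, bool],
    already_committed: Set[str],
) -> List[Tuple[str, bool]]:
    """Return (req_id, is_success) commits, skipping redundant reports.

    pending maps a request ID to the outcome of its dump tasks.
    """
    commits: List[Tuple[str, bool]] = []
    for req_id in finished_ids:
        if req_id in already_committed:
            continue  # reported again by another connector: do not commit twice
        if req_id not in pending:
            continue  # this connector never dumped anything for the request
        commits.append((req_id, pending.pop(req_id)))
        already_committed.add(req_id)
    return commits


done: Set[str] = set()
pending = {"req-a": True}
first = commits_for_finished(["req-a"], pending, done)
second = commits_for_finished(["req-a"], pending, done)  # redundant report
```

Without such a guard, the second report of req-a would find no pending state and could be mistaken for a failure, which is the false negative described above.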

Modified Files

The primary file modified to implement these changes is:

ucm/integration/vllm/uc_connector.py: This file now contains the core logic for request-level granularity in dump, wait, and commit operations.

Internal Logic Highlights

Here’s a quick comparison of the internal logic before and after the changes:

| Area | Before | After |
| --- | --- | --- |
| `dump_tasks` structure | Nested per block | Flat per request: `Dict[str, List[Task]]` |
| Dump dispatch granularity | Per block | Per request |
| Commit point | Based on `succeed_dumped_blocks` field | Based on `success_reqs` & `finished_reqs` tracking |
| `wait_for_save()` logic | Waits per block, tracks block success | Waits per request, tracks full success/failure |
| Multi-connector support | None (false commits possible) | Uses `last_req` + `current_req` to disambiguate |

Compatibility Notes

Don't worry, these changes won't break your existing setup. Here’s what you need to know about compatibility:

The KVConnectorBase_V1 interface remains unchanged.
The update is fully backward-compatible with single-connector use cases.
We've added necessary safeguards to avoid regressions in MultiConnector environments.

Future Improvements

While we've made significant improvements, there’s always room for more. Here are a few areas we plan to address in the future:

Remove current_req and last_req to achieve complete compatibility with MultiConnector setups.
Improve performance when using MultiConnector in 1p1d scenarios.

In summary, switching to request-level dump and commit tasks provides a more streamlined, efficient, and reliable system. It reduces overhead, prevents incorrect commits, and maintains compatibility with existing setups. Plus, we have future improvements in mind to make it even better. Keep an eye out for those updates!