1mo ago·Beijing·2 min read

DeepSeek ships V4 with a million-token context as KV cache compression reshapes agent economics

The 1.6-trillion-parameter model replaces standard attention with compressed variants, cutting the memory needed to hold a million tokens of context by 98 percent and targeting long-running autonomous workflows.

By A. Hollis Verne · filed from Beijing

A million-token context window is a marketing claim until the memory math lets a developer actually use it. DeepSeek’s V4 changes the cost of running long-horizon autonomous agents by replacing standard attention with compressed variants, cutting the key-value (KV) cache memory needed to hold a million tokens by 98 percent relative to established architectures. The release shifts frontier-model competition from raw benchmark scores toward the unit economics of sustained execution.
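To give a sense of the memory math, the sketch below sizes the KV cache of a conventional attention stack at a million tokens and then applies the article's 98 percent reduction. Only the 61-layer count comes from this article; the eight KV heads, the head dimension of 128, and the function name `kv_cache_bytes` are illustrative assumptions, not V4's or any named model's configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence under standard attention."""
    # Factor of 2: one key vector and one value vector per token, per head, per layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical baseline: 61 layers (from the article), 8 KV heads and a head
# dimension of 128 (assumed for illustration), stored in bfloat16 (2 bytes).
baseline = kv_cache_bytes(
    seq_len=1_000_000,
    n_layers=61,
    n_kv_heads=8,
    head_dim=128,
    bytes_per_elem=2,
)

# A 98 percent reduction leaves 2 percent of the baseline.
reduced = baseline * 2 // 100

print(f"baseline cache:        {baseline / 1e9:.1f} GB")
print(f"after a 98% reduction: {reduced / 1e9:.1f} GB")
```

Under these assumed dimensions the baseline cache runs to roughly 250 GB for a single sequence, while the reduced figure is about 5 GB. The absolute numbers depend entirely on the assumed baseline; the point is the order of magnitude.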

The efficiency gain stems from splitting the model's attention into two interleaved mechanisms across its 61 layers. Compressed Sparse Attention (CSA) compresses the cached sequence by a factor of four and then attends only to a sparse top-k selection of the compressed entries, while Heavily Compressed Attention (HCA) compresses entries by a factor of 128 and attends densely to everything that remains. By alternating these layers and storing most key-value entries in FP8 rather than the standard bfloat16 format, the model avoids the memory wall that typically crashes agents halfway through complex tasks.
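The two compression factors can be turned into rough per-layer counts. The sketch below is a back-of-the-envelope calculation rather than DeepSeek's implementation: it treats every entry as FP8 (the article says "most"), assumes an entry is as wide after compression as before, and uses a hypothetical top-k of 1,024, since the article gives no selection size.

```python
import math

SEQ_LEN = 1_000_000   # context length discussed in the article
CSA_FACTOR = 4        # CSA compression factor, from the article
HCA_FACTOR = 128      # HCA compression factor, from the article
BYTES_BF16 = 2        # bfloat16 uses 2 bytes per element
BYTES_FP8 = 1         # FP8 uses 1 byte per element
TOP_K = 1024          # hypothetical selection size; the article gives no value


def cached_entries(seq_len, factor):
    """Entries left in one layer's cache after compressing by `factor`."""
    return math.ceil(seq_len / factor)


def attended_entries(entries, top_k=None):
    """Entries a query reads: all of them if dense, at most top_k if sparse."""
    return entries if top_k is None else min(entries, top_k)


def relative_layer_memory(factor, bytes_per_elem):
    """Layer cache size as a fraction of an uncompressed bfloat16 layer.

    Assumes each entry has the same width before and after compression.
    """
    return bytes_per_elem / (factor * BYTES_BF16)


csa_entries = cached_entries(SEQ_LEN, CSA_FACTOR)
hca_entries = cached_entries(SEQ_LEN, HCA_FACTOR)

print("CSA: cached", csa_entries, "read per query", attended_entries(csa_entries, TOP_K))
print("HCA: cached", hca_entries, "read per query", attended_entries(hca_entries))
print("CSA layer vs bf16:", relative_layer_memory(CSA_FACTOR, BYTES_FP8))
print("HCA layer vs bf16:", relative_layer_memory(HCA_FACTOR, BYTES_FP8))
```

On these assumptions an HCA layer keeps fewer than eight thousand entries for a million tokens, small enough to read densely, while a CSA layer keeps a quarter of a million and relies on selection to bound the work per query. The per-layer fractions are not meant to reproduce the article's headline percentages, which are measured against different baselines.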

The reported figures set a new operational baseline. At a million tokens, the 1.6-trillion-parameter V4-Pro requires just 27 percent of the single-token inference compute of its predecessor, V3.2, and only 10 percent of its KV cache memory. A smaller 284-billion-parameter Flash variant pushes the cache footprint down to 7 percent. In applied agent benchmarks, the Pro model resolved 80.6 percent of SWE-bench Verified software engineering tasks, placing it within a fraction of a point of Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro.
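One way to read these ratios is as a capacity multiplier. The sketch below normalizes V3.2's million-token KV cache to 100 units and counts how many such sessions fit in a fixed cache budget. It assumes the Flash figure is also measured against V3.2, and it ignores model weights, activations, and compute limits. The names `KV_CACHE_PCT` and `sessions_in_budget` are hypothetical.

```python
# KV cache at a million tokens, as a percentage of V3.2's (figures from the article;
# the Flash figure is read here as relative to V3.2 as well, which is an assumption).
KV_CACHE_PCT = {"V3.2": 100, "V4-Pro": 10, "V4-Flash": 7}


def sessions_in_budget(budget_units, model):
    """Whole million-token sessions whose KV cache fits in the budget."""
    return budget_units // KV_CACHE_PCT[model]


# Hypothetical budget: exactly enough cache memory for one V3.2 session.
budget = KV_CACHE_PCT["V3.2"]

for model in KV_CACHE_PCT:
    print(f"{model}: {sessions_in_budget(budget, model)} concurrent session(s)")
```

With a budget that holds a single V3.2 session, the same memory holds ten V4-Pro sessions or fourteen V4-Flash sessions, before accounting for anything other than the cache.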

The winners are developers building multi-step autonomous workflows—terminal sessions, deep codebase refactors, and prolonged browser automation—who can now maintain coherent reasoning chains across dozens of tool calls without blowing past memory budgets. The losers are hardware providers relying on the assumption that scaling context windows would linearly scale the demand for high-bandwidth memory, as algorithmic compression begins to decouple sequence length from VRAM requirements.

What V4 forecloses is the era in which agentic reasoning had to be aggressively truncated or summarized to keep a model from stalling. What it opens is a path toward continuous execution, in which an agent's record of its own past actions and tool outputs stays in context across days of autonomous operation, bounded only by the compute required to initiate the next step.

Sources (1)

https://huggingface.co/blog/deepseekv4

filed by A. Hollis Verne · drawn from 1 source · April 24, 2026
