Beijing

DeepSeek ships V4 with a million-token context as KV cache compression reshapes agent economics

The 1.6-trillion parameter model abandons standard attention mechanisms to drop memory requirements by 98 percent, targeting long-running autonomous workflows.

A million-token context window (the maximum amount of text, audio, or image data a model can hold in its working memory at one time) is a marketing claim until the memory math allows a developer to actually use it. DeepSeek’s release of V4 structurally alters the cost of running long-horizon autonomous agents by abandoning standard attention mechanisms, dropping the memory required to hold a million tokens by 98 percent compared to established architectures. The release shifts the frontier model competition from raw benchmark intelligence to the unit economics of sustained execution.
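To make the memory math concrete, here is a minimal back-of-the-envelope sketch of how large a key-value cache gets under standard, uncompressed attention, and what a 98 percent reduction would leave. Only the 61-layer count and the 98 percent figure come from the reporting. The number of key-value heads, the head dimension, and the function name `kv_cache_bytes` are hypothetical placeholders, not published V4 specifications.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values under standard attention.

    The leading factor of 2 counts one key vector and one value vector
    per token, per layer, per KV head.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# ASSUMPTION: kv_heads=8 and head_dim=128 are illustrative values only.
baseline = kv_cache_bytes(tokens=1_000_000, layers=61, kv_heads=8, head_dim=128)

# The article's headline figure: a 98 percent drop in cache memory.
reduced = baseline * (1 - 0.98)

print(f"uncompressed bf16 cache: {baseline / 2**30:.1f} GiB")
print(f"after a 98% reduction:   {reduced / 2**30:.1f} GiB")
```

Under these assumed dimensions, a single million-token session goes from a cache that needs several accelerators to one that fits comfortably on a single card, which is the difference the article is pointing at.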

The efficiency gain stems from splitting the model's attention into two interleaved mechanisms across its 61 layers. Compressed Sparse Attention (CSA) shrinks the sequence by a factor of four before running a sparse top-k selection, while Heavily Compressed Attention (HCA) compresses entries by a factor of 128 and runs densely. By alternating these layers and storing most key-value entries in FP8 (an 8-bit floating-point format that trades some numerical precision for a smaller memory footprint and faster calculations) rather than the standard bfloat16 (a 16-bit format with a dynamic range comparable to 32-bit floats but reduced precision), the model avoids the memory wall that typically crashes agents halfway through complex tasks.
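The compounding effect of the two compression factors and the narrower number format can be sketched with simple arithmetic. The example below is a rough illustration, not DeepSeek's accounting: it assumes the 61 layers strictly alternate CSA, HCA, CSA, and so on (starting with CSA), that every cached entry is stored in FP8, and that the baseline is an uncompressed bfloat16 cache of the same width. The function name `compressed_cache_fraction` is hypothetical, and the result is not expected to reproduce the published percentages.

```python
def compressed_cache_fraction(layers=61, csa_factor=4, hca_factor=128,
                              compressed_bytes=1, baseline_bytes=2):
    """Fraction of an uncompressed cache that remains after compression.

    ASSUMPTIONS: strict CSA/HCA alternation starting with CSA, and all
    compressed entries stored at `compressed_bytes` per value (FP8 = 1).
    """
    csa_layers = (layers + 1) // 2   # 31 of 61 layers under this assumption
    hca_layers = layers // 2         # 30 of 61 layers under this assumption

    # Cached entries per input token, summed over all layers.
    entries_per_token = csa_layers / csa_factor + hca_layers / hca_factor

    # The baseline keeps one full entry per token in every layer.
    return entries_per_token * compressed_bytes / (layers * baseline_bytes)


fraction = compressed_cache_fraction()
print(f"cache size relative to uncompressed bf16: {fraction:.1%}")
```

Even this crude model shows why alternating the two layer types matters: the lightly compressed CSA layers dominate what remains, while the HCA layers contribute almost nothing to the total.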

The resulting metrics set a new operational baseline. At a million tokens, the 1.6-trillion-parameter V4-Pro requires just 27 percent of the single-token inference compute of its predecessor, V3.2, and only 10 percent of its KV cache memory (the key-value cache is where a model stores representations of previous tokens so it does not have to recalculate them at every step). A smaller 284-billion-parameter Flash variant pushes the cache footprint down to 7 percent. In applied agent benchmarks, the Pro model resolved 80.6 percent of SWE-bench Verified software engineering tasks, placing it within a fraction of a point of Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro.
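One way to read those cache ratios is as serving capacity: how many million-token sessions fit in a fixed memory budget. The sketch below assumes a hypothetical 100 GiB V3.2 cache per session and a hypothetical 640 GiB budget reserved for caches on one node. Neither number comes from the reporting; only the 10 percent and 7 percent ratios do, and the Flash ratio is assumed to use the same V3.2 baseline.

```python
def sessions_per_budget(budget_gib, baseline_cache_gib, cache_percent):
    """Whole number of full-context sessions that fit in a cache budget."""
    per_session_gib = baseline_cache_gib * cache_percent / 100
    return int(budget_gib // per_session_gib)


BASELINE_GIB = 100.0  # ASSUMPTION: V3.2 cache for one million-token session
BUDGET_GIB = 640.0    # ASSUMPTION: memory set aside for KV caches on one node

capacity = {
    name: sessions_per_budget(BUDGET_GIB, BASELINE_GIB, percent)
    for name, percent in [("V3.2", 100), ("V4-Pro", 10), ("Flash", 7)]
}

for name, count in capacity.items():
    print(f"{name}: {count} concurrent million-token sessions")
```

The absolute counts depend entirely on the assumed sizes, but the ratio between them does not: a cache at one tenth the size means roughly ten times as many long-running agents per node.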

The winners are developers building multi-step autonomous workflows (terminal sessions, deep codebase refactors, and prolonged browser automation), who can now maintain coherent reasoning chains across dozens of tool calls without blowing past memory budgets. The losers are hardware providers relying on the assumption that scaling context windows would linearly scale the demand for high-bandwidth memory, as algorithmic compression begins to decouple sequence length from requirements for VRAM, the high-bandwidth memory on a GPU that holds a model's weights and active calculations during training or inference.

What V4 forecloses is the era where agentic reasoning had to be aggressively truncated or summarized to keep a model from stalling. What it opens is a pathway to truly continuous execution, where an agent's memory of its own past actions and tool outputs remains perfectly intact across days of autonomous operation, bounded only by the compute required to initiate the next step.

filed by A. Hollis Verne · drawn from 1 source · April 24, 2026