The Chaos Engine: How Chunked Prefill Unmasked the Non-Linear Nature of Text Diffusion

Running diffusiongemma-26B-A4B — a block-diffusion MoE — on a 128 GB M1 Ultra. A story that starts with a 523 GB memory request from a model that fits in 26 GB, and ends with a discovery about why diffusion language models don’t behave like the transformers we’re used to.

26 GB

model weights on disk

523 GB

single allocation it asked for

44 GB

peak RAM after the fix, full context

262,144

token context, confirmed working

What I was trying to do

Google’s diffusiongemma is a different kind of language model. A normal LLM writes the way you read — left to right, one token at a time. A diffusion language model instead starts with a block of static and sharpens it over a few dozen passes, like a photo developing, until text emerges. It’s a genuinely new architecture, and I wanted to answer a simple question: can my Mac Studio run it at its full advertised context window — 262,144 tokens, roughly two novels of input?

The model is 26 GB on disk. The machine has 128 GB of memory. On paper, this should be comfortable.

A 26 GB model asked for half a terabyte

It wasn’t. Past about 32K tokens of input, memory exploded; at 128K tokens the model didn’t run out of memory so much as ask for an absurd amount of it — a single 523 GB allocation, four times the size of the entire machine, from a model that fits in 26 GB.

The short version of why: before a model can answer, it has to “read” your prompt (this stage is called prefill). Part of that reading involves a table that lets every token look at every other token — and that table grows with the square of the prompt length. Double the prompt, quadruple the table. At 128K tokens, the table alone is half a terabyte. The irony is that the layers responsible are the ones specifically designed to be memory-cheap — a fast path in the math library silently turns off for them, and nothing warns you.

Bar chart: chunked prefill at 128K uses 34 GB and at full 256K uses 44 GB, while single-shot prefill OOMs at both sizes — The memory cliff. Single-shot prefill blows past the machine; chunked prefill runs the full context in 44 GB.

The fix was already built in

The cure turned out to need zero code changes: instead of reading the whole prompt in one gulp, read it in pages — a built-in option called chunked prefill (prefill_step_size=1024). The giant table never forms, because no single pass is big enough to need it.

The same 128K prompt that demanded 523 GB now runs in 34 GB. The full 262K context — 99.8% of the model’s architectural maximum — runs in 44 GB, on a desktop Mac, offline. To verify it actually uses all that context (not just survives it), I buried a unique fact at the middle of a quarter-million-token document and asked for it at the end. The model retrieved it, every time, at every length I tested.

The surprise: the fix changes what the model says

Here’s where it gets interesting. Chunked prefill is supposed to be invisible — the same math, tiled differently, like reading a book in chapters instead of one sitting. On a normal LLM it produces identical output. I verified the model is fully deterministic: run the same prompt twice the same way, get the same answer, bit for bit.

But run it chunked vs. unchunked — or chunked with a different page size — and the model gives a different answer every time. Not a worse answer; each one was fluent, correct, on-topic. Just different.

The reason is the diffusion process itself. Tiling the reading changes the order of floating-point arithmetic, which shifts the model’s internal numbers by about one part in ten thousand — normally harmless rounding noise. A regular LLM shrugs that off. But diffusion generates by repeatedly re-sharpening its draft, deciding each pass which words it’s confident enough to commit. That feedback loop amplifies the rounding noise: a microscopic nudge flips which word locks in first, which changes the next pass, which cascades into a visibly different completion. A butterfly effect, in text. That’s the chaos engine.

The practical consequence: on diffusion models, performance knobs you’d treat as “free” can quietly change the output. If you benchmark one, compare output quality — never exact text.

The speed knob that does nothing (until it ruins everything)

Diffusion models are supposedly slow because of all those sharpening passes — up to 48 per block here. So the obvious speedup is to cap them lower. I tested it: cutting the cap from 48 to 8 changed speed by less than 4% and the output not at all. The model already stops early the moment it’s confident — for easy answers it converges in a handful of passes, and the cap never comes into play.

But push the cap below what a task genuinely needs and the failure is total. On a creative-writing task, 16 passes produced fluent prose; 8 produced broken repetition; 4 produced word salad — at double and quadruple the speed. The speedups only exist where the output breaks.

Two tables comparing denoising step caps: needle retrieval is identical from 48 down to 8 steps, creative generation degrades sharply below 16 — The step cap is a ceiling, not a cost. Easy tasks never hit it; hard tasks fall off a cliff below their floor.

What actually governs speed is prompt length — every sharpening pass re-reads the whole context, so generation slows as context grows. If you want faster output, the lever is a shorter prompt, not fewer passes.

What to take away

New-architecture models often “don’t fit” for boring reasons. The 523 GB demand wasn’t the model’s size — it was one code path quietly falling off a fast path. One built-in parameter fixed it.
A quarter-million-token context now runs on a desktop Mac — 44 GB of RAM, retrieval intact, fully offline. You can even serve it over a standard OpenAI-style API today.
Diffusion LMs are numerically chaotic. Optimizations that are invisible on normal LLMs change a diffusion model’s output. Different isn’t worse — but it means “did the text change?” is the wrong regression test.
Speed knobs can be quality dials in disguise. The denoising cap does nothing on confident output and destroys difficult output. Measure quality at every setting, not just throughput.

TL;DR

Following the obvious serving path, a 26 GB model tried to allocate a 523 GB single tensor and crashed.
The culprit was not the KV cache, not quantization, and not the attention mask. It was the sliding-window layers materializing a dense [N, N] attention-score matrix during single-shot prefill, which silently knocks MLX’s fused/flash attention kernel off its fast path.
The fix needed zero code changes: chunked prefill (prefill_step_size). 128K context went from 523 GB OOM crash → 34 GB peak.
The sweet spot is prefill_step_size = 1024–2048 (fastest, lowest memory). Avoid 8192.
The twist: chunking is supposed to be a pure memory-tiling trick that produces identical output. For this diffusion model it does not. Every chunk size produced a different (but equally coherent) completion — because the iterative denoising sampler is a sensitive, near-chaotic map that amplifies the tiny floating-point differences between chunked and full prefill. Autoregressive greedy decoding absorbs that noise; diffusion denoising amplifies it. That is the chaos engine.

The setup


Machine	Apple M1 Ultra, 48 GPU cores, 128 GB unified memory, macOS (Darwin 25.5), Metal 4
Metal limit	Max single GPU buffer = 86,586,540,032 bytes (~86.6 GB) — this number matters later
Model	`mlx-community/diffusiongemma-26B-A4B-it-8bit`
Architecture	`DiffusionGemmaForBlockDiffusion` — block-diffusion MoE, multimodal (gemma4 vision tower)
Shape	30 layers (25 sliding-window `window=1024`, 5 full-attention), 128 experts top-8 (~4B active of 26B), 16 attn heads, `max_position_embeddings = 262144`
Diffusion	up to 48 denoising steps per block, `canvas_length = 256`, entropy-bound sampler
Stack	`mlx 0.31.2`, `mlx-vlm 0.6.3` (ahead of PyPI 0.6.2 — a dev build), `vllm-mlx 0.3.0`, `transformers 5.8.0.dev0`

Two facts to hold onto: the model weights are 26 GB on disk, and the machine has 128 GB. Nothing about this setup should ever ask for hundreds of gigabytes.

Act 1 — The official path blows up

The community “supported” way to serve this is rapid-mlx serve (which is just vllm_mlx.cli). The starting command looked reasonable:

rapid-mlx serve ./models/diffusiongemma-mlx-4bit \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 32768 \
  --enable-chunked-prefill

Three problems before a single token was generated:

Wrong weights. There was no 4-bit copy on the host — only the 8-bit one in the HF cache.
Two invented flags. --ctx-size and --enable-chunked-prefill don’t exist. The real knobs are --chunked-prefill-tokens N and --cache-memory-percent.
The server can’t actually generate. With corrected flags it loads the model, then every decode step dies in a tight loop:
```
ERROR: Error in MLLM process loop:
BatchRotatingKVCache._temporal_order() takes 1 positional argument but 2 were given
```
A signature-mismatch bug in vllm_mlx’s continuous-batching scheduler against this model’s rotating KV cache. It loads the weights and produces nothing.

So I dropped to the reference path — mlx_vlm.generate — which works. And immediately hit a wall.

Act 2 — The quadratic wall

Single-shot prefill, measuring token speed and peak memory as context grows:

Context	Prefill tok/s	Generation tok/s	Peak RAM
256	224	16.8	30.6 GB
20,725	696	14.3	31.2 GB
32,425	302	6.0	72.2 GB
48,000	—	—	❌ hard crash
127,825	—	—	❌ 523 GB request

48K died with an uncatchable C++ abort:

libc++abi: terminating due to uncaught exception of type std::runtime_error:
[METAL] Command buffer execution failed: Insufficient Memory
(kIOGPUCommandBufferCallbackErrorOutOfMemory)

And 128K printed the number that started the whole investigation:

RuntimeError: [metal::malloc] Attempting to allocate 522855380000 bytes
which is greater than the maximum allowed buffer size of 86586540032 bytes.

522,855,380,000 bytes = 523 GB. From a 26 GB model. On a 128 GB machine. The allocation is bigger than the machine, bigger even than Metal’s 86.6 GB per-buffer ceiling. Where does half a terabyte come from?

The number factors exactly:

127825 (tokens)² × 16 (heads) × 2 (bytes, bf16) = 522,855,380,000

That is the shape of a dense [batch, heads, N, N] attention-score tensor. Something was materializing the full N×N attention matrix instead of streaming it.

Act 3 — The misdiagnosis, then the probe

First hypothesis (wrong): the model isn’t using flash attention; it builds an explicit attention mask that forces a dense softmax(QKᵀ). I even patched the mask builder to force the “fast path” for the trivial all-ones mask.

Result: no change. 32K still peaked at exactly 72.2 GB; 128K still asked for exactly 523 GB. When a patch changes nothing, the hypothesis is wrong. Stop guessing — instrument.

I added a one-shot probe to mlx_lm/models/base.py’s scaled_dot_product_attention to print, per call, which branch runs and what mask type it gets. On a small 2K context that doesn’t OOM:

[SDPA] quantized=False  q_len=2001  k_len=2001  heads=16  mask=arr(2001, 2001):bool  cache=RotatingKVCache
[SDPA] quantized=False  q_len=2001  k_len=2001  heads=16  mask=str:causal             cache=KVCache

There it is. Two attention regimes:

Global / full-attention layers (5): mask="causal" — a string. MLX’s mx.fast.scaled_dot_product_attention recognizes the causal string and uses its fused/flash kernel. Memory O(n). Fine.
Sliding-window layers (25): mask=arr(N, N):bool — an explicit boolean array. An arbitrary array mask defeats the flash kernel; SDPA falls back to materializing the full [B, 16, N, N] scores. Memory O(n²).

Why does the sliding layer get an array? Because create_attention_mask returns a create_causal_mask(N, window) array whenever N > window (here window = 1024), and only returns the "causal" string otherwise:

# mlx_lm/models/base.py
if return_array or (window_size and N > window_size):
    return create_causal_mask(N, window_size=window_size)   # explicit array  → no flash
return "causal"                                              # string          → flash

What it is not:

Not quantization — the probe says quantized=False. The 8-bit weights don’t route attention through the quantized (materializing) path here.
Not the input attention_mask — the diffusion harness already drops the trivial mask: decoder_attention_mask = attention_mask if has_padding else None. My earlier patch was redundant with code that already existed.

The irony writes itself: the sliding-window layers — the ones designed to be cheap at long context — are exactly the ones that blow up during prefill, because the window forces an explicit mask, and the explicit mask forces dense scores.

Act 4 — The fix: chunked prefill (no code changes)

If the disease is “a single forward pass over N tokens builds an [N, N] score tensor,” the cure is “don’t do a single forward pass over N tokens.” Process the prompt in chunks along the query dimension. The score tensor per chunk becomes [step, N] instead of [N, N].

And the mechanism is even nicer than just slicing: with step ≤ window (1024), every chunk has N ≤ 1024, so create_attention_mask returns the "causal" string even for the sliding layers → flash everywhere → the RotatingKVCache caps stored keys at the window. The dense [N, N] matrix never forms.

This is a built-in feature — prefill_step_size (CLI: --prefill-step-size). No patching:

generate(model, processor, prompt,
         prefill_step_size=1024,    # the entire fix
         max_tokens=..., temperature=0.0)

The result, on the exact 128K context that demanded 523 GB:

	Single-shot prefill	Chunked (`step=512`)
128K context	❌ 523 GB OOM	✅ 34.0 GB peak, 426 tok/s prefill, coherent output

From a hard crash to 34 GB. The O(n²) compute is still there (chunking is sequential, so wall-clock prefill is minutes at 128K+), but the O(n²) memory is gone.

Bar chart: chunked prefill at 128K uses 34 GB and at full 256K uses 44 GB, while single-shot prefill OOMs at 128K (measured 523 GB) and 256K (projected 2.2 TB) — The memory cliff, to scale against the machine's 128 GB ceiling. Chunking tiles the query dimension so the score tensor never materializes at full size.

Act 5 — The sweet spot, and the chaos

To pick a step size I ran a sweep at 16K context with greedy, deterministic decoding, comparing both speed and the exact output text against a non-chunked full-prefill ground truth.

step	peak	prefill tok/s	matches full-prefill output?
full (no chunk)	42.7 GB	489	— (reference)
full (rerun)	42.7 GB	491	✅ identical (determinism confirmed)
512	30.5 GB	651	❌
1024	30.6 GB	713	❌
2048	30.8 GB	722	❌
4096	32.3 GB	685	❌
8192	36.1 GB	615	❌

Two findings.

The mundane one — the sweet spot. Throughput peaks at 1024–2048 and falls off on both sides (512 = too many tiny chunks; 8192 = the step × N score tensor grows and slows down, plus higher memory). Memory is trivial until 8192. Use prefill_step_size = 1024–2048.

The interesting one — the chaos. A full prefill rerun reproduces bit-for-bit (full == full#2 → determinism is real). Yet no chunk size matches the full-prefill output — not even step=512, which sits comfortably inside the window. And the chunked outputs differ from each other:

full:      "... Mitochondria are the only organelles in"
step1024:  "... Ribosomes  are the only organelles in"
step4096:  "... Most eukaryotic cells contain a"

Chunked and full prefill are algebraically identical — but they accumulate the KV cache in a different floating-point reduction order. That produces differences on the order of 1e-4 in the logits. In a normal autoregressive model with greedy decoding, a 1e-4 nudge essentially never flips an argmax; the output is stable.

But this is a diffusion model. Generation isn’t a left-to-right argmax — it’s an iterative denoising loop (up to 48 steps per block) that decides which tokens to commit based on entropy thresholds. That loop is a sensitive dependence on initial conditions machine: a 1e-4 perturbation flips which token gets unmasked first, which changes the next denoising step, which cascades. Same prompt, same seed, same model — a different valid completion, purely because the prefill was tiled differently.

This is the headline. Chunked prefill didn't just fix memory; it acted as a perturbation probe that revealed the non-linear, chaotic character of text diffusion. Autoregressive transformers are (numerically) a stable map. Diffusion LMs are a chaotic one.

“Different” here does not mean “worse” — every output was equally fluent and on-topic — but it does mean you cannot treat prefill_step_size as a free, output-preserving speed knob the way you can on an AR model. The knob perturbs the sample.

Act 6 — Confirming the full 262,144-token context

The original goal: actually use the model’s full 256K context window on this machine. With chunked prefill the memory math is benign — at step, the global-layer score tensor is step × N × 16 × 2 bytes:

step=1024, N=262,144 → ~8.6 GB
step=2048, N=262,144 → ~17 GB

plus 26 GB weights and ~6 GB KV cache → comfortably under 128 GB.

To verify the model still uses the context (not just “doesn’t crash”), I ran a needle-in-a-haystack test: plant a unique fact — The Aurora Station access code is 84915 — at 50% depth in a document of fixed-width telemetry log lines, then ask for the code at the end. Retrieval counts only if 84915 appears in the greedy completion. I swept context length and step ∈ {1024, 2048}:

step	Context	Peak RAM	Prefill tok/s	Gen tok/s	Retrieved?
1024	230,523	42.2 GB	305.9	5.8	YES
2048	230,523	50.2 GB	299.9	5.6	YES
1024	248,935	43.3 GB	286.6	5.5	YES
2048	248,935	51.9 GB	284.2	5.4	YES
1024	261,555	44.0 GB	262.0	5.2	YES
2048	261,555	53.0 GB	279.7	5.1	YES

The needle survives all the way to 261,555 tokens — 99.8% of the model’s 262,144 architectural cap — at both chunk sizes. Chunked prefill does not break long-range retrieval. The single most important number in this whole investigation is in that last block: the full 256K context runs in 44.0 GB of RAM. To put that in scale, the same N²×16×2 dense-score formula that exactly predicted the 523 GB allocation at 128K projects to a ~2.2 TB score tensor at 262,144 tokens (262144² × 16 × 2 = 2,199,023,255,552 B) — single-shot prefill never gets close to running this context on any machine. (That 2.2 TB is a projection from the formula, not a measured allocation; single-shot OOMs long before, at 128K, where it did ask for a measured 523 GB — see Act 4 for the apples-to-apples 128K comparison: 523 GB OOM → 34 GB chunked.) Two more results jump out:

step=1024 strictly dominates step=2048 at long context. Identical retrieval, identical speed, but step=2048 costs +8–9 GB peak RAM (the score tensor scales linearly with step). Beyond ~230K, larger chunks buy you nothing but memory pressure. The earlier sweep’s “1024–2048 tie” breaks in 1024’s favor once the context is huge.
Peak memory barely moves with context length — 42 GB at 230K, 43 GB at 249K — because the score tensor depends on step, not N. This is the whole point of chunking: decouple memory from context length.

A note on the cap: hitting exactly 262,144 tokens took a calibration loop. Naïve token-count estimates undershot by ~5%; I re-tokenized and adjusted entry counts until the prompt landed at 261,555 — 99.8% of the cap, with room for the 24 generated tokens. As predicted, the full-cap row tracks the 249K row almost exactly (44.0 vs 43.3 GB peak), because at this scale memory is governed by step, not N.

Act 7 — The denoising tax that wasn’t

Generation is slow — ~5 tok/s at full context, ~8 tok/s at 128K. The obvious culprit: this is a diffusion model with max_denoising_steps=48, so surely each block of output costs 48 forward passes, a ~48× tax over an autoregressive model’s one-pass-per-token decode. The obvious fix: turn the knob down.

I swept max_denoising_steps ∈ {48, 32, 24, 16, 8} at a fixed 128K context, holding prefill_step=1024 so only the denoising count varied:

max_denoising_steps	Gen tok/s	Retrieved?	Output
48	7.86	YES	`84915`
32	7.80	YES	`84915`
24	7.64	YES	`84915`
16	7.95	YES	`84915`
8	7.89	YES	`84915`

Cutting the cap 6× did nothing. Gen speed moved <4% (noise), and the output was byte-identical at every setting. The “48× tax” hypothesis is simply wrong — and the source says why.

This model ships confidence_threshold: 0.005 and stability_threshold: 1 in its generation_config.json, which activates mlx_vlm’s adaptive stop. The denoising loop (diffusion.py:866) is bounded by max_denoising_steps, but it breaks early (:1020) the instant _diffusion_stable_and_confident is true — the canvas hasn’t changed for stability_threshold iterations and mean token entropy has fallen below confidence_threshold. For a confident, low-entropy generation — retrieving a clear fact, answering with a short number — the sampler converges in a handful of steps, far below 8. So max_denoising_steps is a ceiling that never binds on easy generations. It’s not the realized step count; it’s a safety cap for the hard cases.

The real cost of generation is the per-pass attention over the KV cache, which is why gen throughput tracks context length, not step count:

Context	Gen tok/s
~16K	~10.8
~128K	7.9
~261K	5.2

Bar chart: generation throughput falls from 10.8 tok/s at 16K context to 7.9 at 128K and 5.2 at 261K — Gen speed tracks context length, not denoising steps — each denoising pass attends over the whole context.

The lesson is the inverse of the intuition: you can’t make confident generation faster by cutting denoising steps, because the model already stopped early. But that raises the obvious question — when does the knob bite?

Where the cap does matter: the entropy-convergence floor

The needle task is low-entropy: a clear fact, a short number. To see the cap actually bind, I ran a high-entropy task — open-ended creative generation, 128 tokens, temperature=0.8, tiny context — and swept the same steps (adding 4):

max_denoising_steps	Gen tok/s	Quality
48	23.1	coherent, vivid
32	23.3	coherent
24	24.9	coherent
16	26.3	coherent
8	47.2	degraded — repetition, broken grammar (`of of of`, `It It a a a`)
4	98.0	word salad — total collapse

Sample at 48 steps: “…a colossal, translucent jellyfish drifted, defying gravity. Its tentacles glowed with a bioluminescent violet, humming a frequency that vibrated in Elias’s teeth…” — fluent and imaginative.

Sample at 4 steps: “…the fog was a a wool wool,, the the… the the the beam the… wasn’tt a a a a a a,,,,,,,,,,,” — the denoiser never resolved the canvas.

Side-by-side tables: low-entropy needle retrieval identical at all step caps; high-entropy creative generation degrades below a 16-step floor — The same knob, two regimes: inert above the task's convergence floor, a quality cliff below it.

This is the exact mirror of the needle sweep, and the two together give one mechanism: the entropy-convergence floor. Every generation needs some number of denoising iterations to resolve its canvas; that number scales with the entropy of what’s being generated.

Low-entropy (needle): floor is below 8 steps → caps of 8–48 all land past it → identical output, the cap never binds.
High-entropy (creative): floor is ~16 steps → caps of 16–48 all converge cleanly (similar speed and quality), but caps of 8 and 4 clip below the floor → denoising is starved → speed leaps (2×, then 4×) and quality falls off a cliff.

So max_denoising_steps is inert until you push it under the task’s entropy floor — and past that line, every bit of speed is paid for directly in coherence. There is no universal “safe low value”: harder generation has a higher floor. And note where the big speed-ups live — 47 and 98 tok/s — precisely at the settings where the output broke. On confident generation the knob does nothing; on hard generation it’s a quality dial wearing a speed knob’s clothing.

If you want faster generation at long context, the lever is the context itself (attention cost per pass), not the denoising schedule.

Lessons

A number that factors is a number that confesses. 523 GB = N² × heads × 2 told us it was a dense score tensor before reading a line of source. Always factor the allocation.
When a patch changes nothing, the hypothesis is wrong. Stop theorizing and instrument. A six-line print in the SDPA dispatch settled in one run what three rounds of source-reading couldn’t.
Not all “servers” are equal — test the specific one. rapid-mlx serve (vllm_mlx) had invalid flags in the wild and a scheduler bug (_temporal_order) that made generation impossible. But mlx_vlm.server — a different codebase — runs the same model fine over an OpenAI/Anthropic-compatible API (verified: peak 29.8 GB, correct output). Don’t generalize one server’s failure to “no serving support.” Caveat: mlx_vlm.server serializes concurrent requests (2 concurrent ≈ 2× wall time) — it serves, but it doesn’t parallelize diffusion on a single GPU.
Flash attention is fragile to mask representation. A "causal" string flashes; an equivalent boolean array does not. Sliding-window layers tripped this precisely because the window forces an explicit mask.
Chunked prefill fixes memory, not compute. O(n²) FLOPs remain; long-context prefill is minutes, not seconds. Plan for it.
Diffusion LMs are numerically chaotic. The same speed/memory optimization that is output-neutral on an AR transformer will change the generated text on a diffusion model. Benchmark quality, not just throughput — and don’t use exact-match as your quality metric, because even valid recomputations won’t match.
max_denoising_steps is a ceiling, not a cost. With entropy-bound early stopping active (this model’s default), confident generations converge in a few steps and ignore the cap. Don’t assume “diffusion = N forward passes per token”; measure the realized step count before treating it as a speed knob. The real gen cost here is per-pass attention over the KV cache, which scales with context length, not with the denoising cap.

Terminal mock showing mlx_vlm.server launched with prefill-step-size 1024 and a successful OpenAI-style chat completion call — The serving path that works today: `mlx_vlm.server`, OpenAI/Anthropic-compatible, honors chunked prefill. One stream at a time.

Recommended configuration

from mlx_vlm import load, generate

model, processor = load("mlx-community/diffusiongemma-26B-A4B-it-8bit")
out = generate(
    model, processor, prompt,
    prefill_step_size=1024,     # 1024 wins at long ctx: same speed as 2048, ~8-9 GB less peak RAM
    max_denoising_steps=48,     # ceiling only; entropy-bound stop converges early on confident output
    temperature=0.0,
)
# Avoid single-shot prefill above ~32K (O(n²) score-tensor OOM).
# Avoid step >= 8192 (bigger score tensor, slower, more memory).
# Expect long-context prefill to be compute-bound (minutes), not memory-bound.
# Don't cut max_denoising_steps for speed on confident generations — it already early-stops below the cap.

Every number in this post comes from instrumented runs on the machine in the setup table — greedy decoding unless stated, peak RSS measured per run.