Audio A/B by type — same prompt, same seed, RTX 5090

Each pair is the same seeded generation; only the attention kernel differs. Play both and listen. Generated 2026-06-17.

Speed: SageAttention stayed ~12% faster across all five types (~50 s → ~43 s, warm runs).
Audio verdict: PRESERVED on single-voice speech, action SFX, and ambient (spectral match 0.97–0.99, loudness within about 0.3 dB). ALTERED on dialogue: B is +3.6 dB louder and the multi-voice content diverges (spectral match 0.774). Music is a high-variance regime: at this seed the current pipeline (A) came out near-silent, so treat that row as inconclusive, not as a SageAttention regression.
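The summary quotes loudness in dB while each row reports a B/A ratio. The sketch below shows one way to convert between the two, assuming the ratio is an amplitude (RMS) ratio; that assumption is consistent with the dialogue row, where 1.514× corresponds to about +3.6 dB. The helper name `ratio_to_db` is hypothetical and not part of the report's tooling.

```python
import math


def ratio_to_db(amplitude_ratio: float) -> float:
    """Convert an amplitude (RMS) ratio B/A to decibels.

    Assumption: the report's "loudness B/A" is an amplitude ratio,
    so the conversion is 20 * log10(ratio), not 10 * log10(ratio).
    """
    return 20.0 * math.log10(amplitude_ratio)


# Per-row "loudness B/A" values copied from this report.
rows = {
    "talking head": 1.004,
    "dialogue": 1.514,
    "action": 1.038,
    "music": 107.381,
    "ambient": 0.97,
}

for name, ratio in rows.items():
    print(f"{name:>12}: {ratio_to_db(ratio):+.1f} dB")
```

Under this assumption the three PRESERVED rows land within roughly ±0.3 dB, while the music row works out to about +40 dB, which is what a near-silent A clip would produce.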

Talking head — single voice

PRESERVED

One person speaking a scripted line to camera (speech + lip-sync).

A — current (no acceleration)

spectrogram A

B — SageAttention

spectrogram B

spectral match (mel-cosine): 0.976 · timbre dist (MFCC): 15.0 · loudness B/A: 1.004×
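For readers unfamiliar with the "spectral match (mel-cosine)" figure, here is a minimal sketch of a cosine similarity between two spectrograms. It assumes the metric is a plain cosine over flattened mel spectrograms of equal shape; the report does not state its exact recipe (windowing, number of mel bands, log scaling), and the toy numbers below are made up for illustration.

```python
import math
from typing import List, Sequence


def flatten(spec: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a frames x bands spectrogram into one vector."""
    return [value for frame in spec for value in frame]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("spectrograms must have the same number of bins")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero clip matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-frame x 3-band "mel spectrograms" (illustrative values only).
spec_a = [[1.0, 2.0, 0.5], [0.8, 1.9, 0.4]]
spec_b = [[1.1, 2.0, 0.5], [0.7, 2.0, 0.5]]

score = cosine_similarity(flatten(spec_a), flatten(spec_b))
print(f"mel-cosine (toy data): {score:.3f}")
```

Cosine similarity ignores overall scale, so a clip that is only louder or quieter still scores near 1.0. That is why the report lists the loudness ratio as a separate number.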

Dialogue — two voices

ALTERED

Back-and-forth man/woman conversation with cafe ambience.

A — current (no acceleration)

spectrogram A

B — SageAttention

spectrogram B

spectral match (mel-cosine): 0.774 · timbre dist (MFCC): 30.1 · loudness B/A: 1.514×

Action — SFX

PRESERVED

Car chase: engine, tyre screech, rain. No voices.

A — current (no acceleration)

spectrogram A

B — SageAttention

spectrogram B

spectral match (mel-cosine): 0.994 · timbre dist (MFCC): 6.8 · loudness B/A: 1.038×

Music

DIVERGENT

Fingerpicked acoustic guitar with street ambience.

A — current (no acceleration)

spectrogram A

B — SageAttention

spectrogram B

spectral match (mel-cosine): 0.313 · timbre dist (MFCC): 322.4 · loudness B/A: 107.381×
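A loudness ratio above 100× usually means one side is close to silence, in which case ratio and similarity metrics stop being informative. The sketch below shows one possible guard that labels such a pair inconclusive before any verdict is drawn. The -60 dBFS floor and the function names are hypothetical choices for illustration, and samples are assumed to be floats in [-1, 1].

```python
import math
from typing import Sequence

# Hypothetical silence floor; not a value taken from the report.
SILENCE_FLOOR_DBFS = -60.0


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale (1.0); -inf for digital silence."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def comparison_status(
    a: Sequence[float],
    b: Sequence[float],
    floor_dbfs: float = SILENCE_FLOOR_DBFS,
) -> str:
    """Return "inconclusive" if either clip is below the silence floor."""
    if rms_dbfs(a) < floor_dbfs or rms_dbfs(b) < floor_dbfs:
        return "inconclusive"
    return "comparable"


# Toy clips: A is near-silent, B has normal level (illustrative values only).
clip_a = [0.00001] * 100
clip_b = [0.5] * 100
print(comparison_status(clip_a, clip_b))
```

A guard like this keeps a degenerate reference clip from being read as a regression in the accelerated path, which matches how the summary treats the music row.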

Ambient — soundscape

PRESERVED

Forest waterfall, birdsong, breeze. No voices.

A — current (no acceleration)

spectrogram A

B — SageAttention

spectrogram B

spectral match (mel-cosine): 0.994 · timbre dist (MFCC): 4.9 · loudness B/A: 0.97×