Existing speech tokenizers distill features from self-supervised learning (SSL) models (e.g., HuBERT) into a semantic quantizer to suppress acoustics and capture content structure,
but their high frame rates produce token sequences far longer than the corresponding text, which hinders integration with language models (LMs).
Attempts to reduce the token rate via uniform average pooling tend to over-smooth content-bearing regions and dilute structural information, limiting LM alignment.
To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation.
Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder.
This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that align better with LMs at reduced frame rates.
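The resynthesis-based objective described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the LM-SPT implementation: `frozen_encoder` is a hypothetical stand-in for the frozen, LM-aligned speech encoder, the waveforms are short lists of floats, and mean squared error is an assumed choice for the "discrepancy", which the text does not specify.

```python
def frozen_encoder(waveform, hop=4):
    """Hypothetical stand-in for a frozen, LM-aligned speech encoder.

    Maps a waveform (list of floats) to one feature per `hop` samples:
    the mean absolute amplitude of that chunk. It has no trainable state.
    """
    feats = []
    for start in range(0, len(waveform), hop):
        chunk = waveform[start:start + hop]
        feats.append(sum(abs(x) for x in chunk) / len(chunk))
    return feats


def resynthesis_distillation_loss(original, resynthesized, encoder):
    """Mean squared discrepancy between encoder features of two waveforms.

    MSE is an assumption made for this sketch.
    """
    f_orig = encoder(original)
    f_resyn = encoder(resynthesized)
    assert len(f_orig) == len(f_resyn), "waveforms must have equal length"
    return sum((a - b) ** 2 for a, b in zip(f_orig, f_resyn)) / len(f_orig)


# Toy waveforms: the second stands in for speech decoded from semantic
# tokens only. No frame-level teacher target appears anywhere in the loss.
original = [0.0, 0.5, -0.5, 0.0, 1.0, -1.0, 1.0, -1.0]
resynthesized = [0.0, 0.5, -0.5, 0.0, 0.5, -0.5, 0.5, -0.5]
loss = resynthesis_distillation_loss(original, resynthesized, frozen_encoder)
```

Because both inputs to the loss are waveforms, the comparison happens at the encoder's own frame rate, so the token rate of the tokenizer never has to match the frame rate of the supervising features.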
(a) Feature-level distillation (SpeechTokenizer, Mimi) matches the semantic pathway to SSL model features using frame-wise supervision (with temporal downsampling when operating at lower frame rates).
(b) Feature-level reconstruction distillation (DualCodec) reconstructs SSL model features through a semantic bottleneck, still relying on rigid frame-wise alignment.
(c) Semantic speech-resynthesis distillation (LM-SPT) distills semantics by resynthesizing speech from semantic tokens only and minimizing the discrepancy between LM-aligned encoder features of the original and resynthesized waveforms.
LM-SPT adopts a dual-encoder architecture, where separate encoders are trained for the semantic and acoustic quantizers.
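For contrast, the frame-wise supervision with temporal downsampling in (a) can be illustrated with a toy example. Everything here is an illustrative assumption rather than the configuration of SpeechTokenizer or Mimi: the 1-D teacher features, the pooling factor of 4, the MSE distance, and the helper names `average_pool` and `framewise_mse`.

```python
def average_pool(frames, factor):
    """Uniformly average-pool a list of feature vectors along time.

    frames: list of equal-length lists of floats (one vector per frame).
    factor: number of consecutive frames merged into one pooled frame.
    A trailing partial window is averaged over the frames it contains.
    """
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def framewise_mse(student, teacher):
    """Mean squared error between two equally long feature sequences."""
    assert len(student) == len(teacher), "frame-wise loss needs aligned lengths"
    total, count = 0.0, 0
    for s, t in zip(student, teacher):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical teacher features: a short content burst surrounded by silence.
teacher = [[0.0], [0.0], [0.0], [1.0], [1.0], [0.0], [0.0], [0.0]]
pooled_teacher = average_pool(teacher, factor=4)  # 8 frames -> 2 frames
student = [[0.2], [0.3]]                          # hypothetical student outputs
loss = framewise_mse(student, pooled_teacher)
```

In this toy case the burst straddles the two pooling windows, so each pooled target is 0.25 and the sharp peak of 1.0 is no longer visible to the student. This is a small-scale picture of the over-smoothing that uniform pooling can cause in content-bearing regions.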
Audio Demonstrations
Semantic-only reconstruction of LM-SPT uses the main decoder only (not the auxiliary semantic decoder).
Reconstruction
Transcript | Ground Truth | Mimi | DualCodec | LM-SPT
Why it's Goliath as usual, they both cried peering in.
They worry me terribly, and besides I'd like to see what this lovely furniture looks like without such quantities of dust all over it. Good scheme, Cyn.
Then she suddenly remarked.
Reconstruction (Semantic Only)
Transcript | Ground Truth | Mimi | DualCodec | LM-SPT
Why it's Goliath as usual, they both cried peering in.
They worry me terribly, and besides I'd like to see what this lovely furniture looks like without such quantities of dust all over it. Good scheme, Cyn.
Then she suddenly remarked.
Zero-shot Text-to-Speech
Reference Speaker | Transcript | CosyVoice2 | Mimi | DualCodec | LM-SPT
There cannot be a doubt he received you kindly for in fact you returned without his permission.
It's such a crush at the yearly meeting at arch street and then there's the row of sleek looking young men who line the curbstone and stare at us as we come out.
But if we have now ceased to advance why do we yet leave that sail loose which at the first shock of the tempest may capsize us in a moment.
Why it's goliath as usual they both cried peering in.
I was well satisfied with my cabin which was located in the stern and opened into the officers mess.
There's one and there's another the dudley and the flint.
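Reconstruction
Transcript | Ground Truth | Mimi | DualCodec | LM-SPT
아 근데 진짜 감튀 먹고 있는데 데려온 거 심했다 (Ah, but honestly, dragging someone along while they were eating fries was too much.)
아닐껄 그뭐지 몰라 뭔가 근데 그 남자애가 우리 욕 했을 것 같애 (Probably not. What was it, I don't know, but somehow I think that boy was badmouthing us.)
더 재밌게 즐길 수 있었을 텐데 (We could have enjoyed it more.)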
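Reconstruction (Semantic Only)
Transcript | Ground Truth | Mimi | DualCodec | LM-SPT
아 근데 진짜 감튀 먹고 있는데 데려온 거 심했다 (Ah, but honestly, dragging someone along while they were eating fries was too much.)
아닐껄 그뭐지 몰라 뭔가 근데 그 남자애가 우리 욕 했을 것 같애 (Probably not. What was it, I don't know, but somehow I think that boy was badmouthing us.)
더 재밌게 즐길 수 있었을 텐데 (We could have enjoyed it more.)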
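Zero-shot Text-to-Speech
Reference Speaker | Transcript | CosyVoice2 | Mimi | DualCodec | LM-SPT
아 그래도 그 정도면 되게 좋지 않나 커버도 깔끔하고 (Ah, but still, isn't that pretty good? The cover is neat, too.)
조금 더 다이나믹한 삶을 살게된 것 같애 그걸 하면서 (I feel like I've come to live a slightly more dynamic life by doing that.)
아 근데 그게 있어 나는 내가 먹은 만큼 어차피 어떻게 보면 조금 뛰는 그런 스타일이라서 (Ah, but here's the thing: I'm the type who, in a way, runs a bit in proportion to how much I've eaten anyway.)
근데 내가 한참동안 있었거든 근데 이제 (But I was there for quite a while, but then)
또 이제 내가 오월드를 하고 나서 뚜레주르라는 또 알바를 했었어 내가 근데 (And then, after I did O-World, I also had a part-time job at a place called Tous les Jours, but)
키가 되게 작아도 못 탄다고 했잖아 근데 그 애가 키가 되게 작은 거야 근데 엄마랑 같이 왔어 근데 (You know how it's said that you can't ride if you're really short? Well, that kid was really short, but came with their mom, but)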