Commit 4f676f0

Merge branch 'main' of github.com:Stability-AI/stable-codec into main

2 parents: 06cb796 + cb253af

File tree: 1 file changed

README.md: +13, -6 lines
@@ -12,14 +12,14 @@ Model weights: https://huggingface.co/stabilityai/stable-codec-speech-16k
 
 Note that whilst this code is MIT licensed, the model weights are covered by the [Stability AI Community License](https://huggingface.co/stabilityai/stable-codec-speech-16k/blob/main/LICENSE.md)
 
-## Additional training
+## Variants
+The model is currently available in two variants:
+- `stable-codec-speech-16k-base` provides the weights corresponding to the results in our [publication](https://arxiv.org/abs/2411.19842), for reproducibility.
+- `stable-codec-speech-16k` is an improved finetune with boosted latent semantics. It should be used in the vast majority of use cases.
 
-In addition to the training described in the paper, the released weights have also undergone 500k steps of finetuning with force-aligned data from LibriLight and the English portion of Multilingual LibriSpeech. This was performed by using a CTC head to regress the force-aligned tags from pre-bottleneck latents. We found that this additional training significantly boosted the applicability of the codec tokens to downstream tasks like TTS, at a small cost to reconstruction metrics.
+### Additional Training
 
-| Model | SI-SDR | Mel Dis | STFT Dis | PESQ | STOI |
-|---------------------------|-------:|--------:|---------:|-----:|-----:|
-| base | 4.73 | 0.86 | 1.26 | 3.09 | 0.92 |
-| CTC finetune | 3.58 | 0.90 | 1.30 | 3.01 | 0.90 |
+In addition to the training described in the paper, the weights for `stable-codec-speech-16k` have undergone 500k steps of finetuning with force-aligned data from LibriLight and the English portion of Multilingual LibriSpeech. This was performed by using a CTC head to regress the force-aligned phoneme tags from pre-bottleneck latents. We found that this additional training significantly boosted the applicability of the codec tokens to downstream tasks like TTS, at a small cost to objective reconstruction metrics.
 
 ## Install
 

@@ -169,3 +169,10 @@ and in the training dataset configuration.
 ]
 }
 ```
+## Objective Metrics
+
+| Model | SI-SDR | Mel Distance | STFT Distance | PESQ | STOI |
+|--------------------------------|-------:|-------------:|--------------:|-----:|-----:|
+| `stable-codec-speech-16k-base` | 4.73 | 0.86 | 1.26 | 3.09 | 0.92 |
+| `stable-codec-speech-16k` | 3.58 | 0.90 | 1.30 | 3.01 | 0.90 |
+
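Both variants named in the table above are published on Hugging Face. As a rough, non-authoritative sketch of fetching a checkpoint with `huggingface_hub` (the `-base` repo id and the `model.safetensors` filename are assumptions, not confirmed by this commit):

```python
# Hypothetical checkpoint download; only stabilityai/stable-codec-speech-16k
# appears in this commit, the other repo id and the filename are assumptions.
from huggingface_hub import hf_hub_download

# Use the finetuned variant unless you need to reproduce the paper's results.
repo_id = "stabilityai/stable-codec-speech-16k"          # improved finetune
# repo_id = "stabilityai/stable-codec-speech-16k-base"   # assumed repo id for the paper weights

ckpt_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors")  # filename assumed
print(ckpt_path)
```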

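For readers curious how the CTC finetuning described in the diff might look in code, here is a minimal, hedged sketch (assuming PyTorch, an arbitrary latent width, and an arbitrary phoneme vocabulary; none of the names below come from this repository): a small head predicts per-frame phoneme log-probabilities from pre-bottleneck latents and is trained with a CTC loss against force-aligned phoneme sequences.

```python
# Minimal sketch of an auxiliary CTC objective over pre-bottleneck latents.
# Dimensions, the phoneme vocabulary size, and all names are illustrative
# assumptions; this is not the repository's actual finetuning code.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512      # assumed pre-bottleneck latent width
NUM_PHONEMES = 72     # assumed phoneme inventory; index NUM_PHONEMES is the CTC blank


class CTCPhonemeHead(nn.Module):
    """Projects latent frames to phoneme log-probabilities for CTC."""

    def __init__(self, latent_dim: int, num_phonemes: int) -> None:
        super().__init__()
        self.proj = nn.Linear(latent_dim, num_phonemes + 1)  # +1 for the blank symbol

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, latent_dim) -> log-probs: (frames, batch, classes)
        return F.log_softmax(self.proj(latents), dim=-1).transpose(0, 1)


def ctc_aux_loss(head: CTCPhonemeHead,
                 latents: torch.Tensor,
                 phonemes: torch.Tensor,
                 latent_lengths: torch.Tensor,
                 phoneme_lengths: torch.Tensor) -> torch.Tensor:
    """CTC loss between per-frame predictions and force-aligned phoneme ids."""
    log_probs = head(latents)  # (frames, batch, classes), as required by ctc_loss
    return F.ctc_loss(log_probs, phonemes, latent_lengths, phoneme_lengths,
                      blank=NUM_PHONEMES, zero_infinity=True)


if __name__ == "__main__":
    # Random tensors stand in for encoder latents and aligned phoneme labels.
    head = CTCPhonemeHead(LATENT_DIM, NUM_PHONEMES)
    latents = torch.randn(4, 200, LATENT_DIM)             # (batch, frames, dim)
    phonemes = torch.randint(0, NUM_PHONEMES, (4, 50))    # padded phoneme ids
    loss = ctc_aux_loss(head, latents, phonemes,
                        latent_lengths=torch.full((4,), 200, dtype=torch.long),
                        phoneme_lengths=torch.full((4,), 50, dtype=torch.long))
    loss.backward()  # in practice, presumably combined with the codec's own losses
```

In an actual finetune this auxiliary loss would be weighted against the codec's reconstruction objectives, which is consistent with the small cost to reconstruction metrics reported in the table above.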