Does this model support diarization?
Greetings,
The description suggests the model does a really good job of converting speech into punctuated text, but it doesn't appear to include speaker identification.
Thanks muchly,
-- Morgan
I am also interested in this. Does anyone know how one would "implement" diarization if this model doesn't natively support it? By that I mean running a separate model that does support it, or using some sort of pipeline to feed the output of this model into another. Any advice would be much appreciated.
Greetings @forcepushenjoyer,
From what I've been able to read so far (I only started exploring a few days ago), the best option appears to be a diarization model that chunks the audio and returns annotated segments with speaker labels (1, 2, 3, etc.), after which you run an ASR model on each chunk. A sketch of that pipeline is below.
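Here's a minimal sketch of that two-stage pipeline. Neither model is named in this thread, so treat the choices as assumptions: `pyannote/speaker-diarization-3.1` for diarization and openai-whisper for ASR, with `meeting.wav` and the HF token as placeholders.

```python
# Sketch: diarize first, then transcribe each speaker-labeled chunk.
# ASSUMPTIONS: pyannote.audio 3.x and openai-whisper; file/token are placeholders.
import whisper
from pyannote.audio import Pipeline

SAMPLE_RATE = 16_000  # whisper.load_audio resamples everything to 16 kHz mono

# 1. Diarization: get (start, end, speaker) segments for the recording.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # gated model: accept its terms on HF first
)
diarization = diarizer("meeting.wav")

# 2. ASR: transcribe each diarized chunk separately.
asr = whisper.load_model("base")
audio = whisper.load_audio("meeting.wav")  # float32 numpy array @ 16 kHz

for turn, _, speaker in diarization.itertracks(yield_label=True):
    chunk = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    text = asr.transcribe(chunk, fp16=False)["text"].strip()
    print(f"[{turn.start:7.2f}-{turn.end:7.2f}] {speaker}: {text}")
```

One caveat with this design: very short turns can transcribe poorly, so some pipelines instead transcribe the whole file once and align word timestamps to the diarization segments.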
This makes some sense to me, but I have to admit the error rates for chunk-level speaker identification seem uncomfortably high. I imagine a model custom-trained on known voices would do better, so if (e.g.) you wanted to regularly identify players in a D&D campaign, you could probably collect a few voice samples per player and do it that way. But in less controlled circumstances, the confusion (wrong speaker), false-alarm, and missed-dialogue error rates are surprisingly high in the models I've seen so far.
-- Morgan
The current model doesn't support diarization. For a dedicated diarization model, see: https://huggingface.co/nvidia/diar_sortformer_4spk-v1
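For reference, usage along these lines is shown on that model card (verify against the card itself; this assumes `nemo_toolkit` is installed and the audio path is a placeholder):

```python
# Sketch based on the nvidia/diar_sortformer_4spk-v1 model card usage.
# ASSUMPTION: requires NVIDIA NeMo (nemo_toolkit); path below is a placeholder.
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the pretrained Sortformer diarization model and switch to inference mode.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
diar_model.eval()

# Returns speaker-labeled segments (the model handles up to 4 speakers);
# see the model card for the exact output format.
predicted_segments = diar_model.diarize(
    audio="/path/to/multispeaker_audio.wav", batch_size=1
)
print(predicted_segments)
```

Those segments could then feed the diarize-then-transcribe pipeline sketched earlier in the thread.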