Does this model support diarization?
Greetings,
The description suggests the model does a really good job of converting speech into punctuated text, but it doesn't appear to include speaker identification.
Thanks muchly,
-- Morgan
I am also interested in this. Does anyone know how one would "implement" diarization if this model doesn't natively support it? By that I mean running a separate model that does support it, or using some sort of pipeline to feed the output of this model into another. Any advice would be much appreciated.
Greetings @forcepushenjoyer,
From what I've been able to read so far (I only started exploring a few days ago), the best option appears to be a diarization model that chunks the audio and returns annotated segments with speaker labels (1, 2, 3, etc.), after which you run an ASR model on each chunk. A sketch of that pipeline is below.
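Here's a minimal sketch of that two-stage pipeline. Neither model is named in this thread, so treat the choices as assumptions: `pyannote/speaker-diarization-3.1` for diarization and openai-whisper for ASR, with `meeting.wav` and the HF token as placeholders.

```python
# Sketch: diarize first, then transcribe each speaker-labeled chunk.
# ASSUMPTIONS: pyannote.audio 3.x and openai-whisper; file/token are placeholders.
import whisper
from pyannote.audio import Pipeline

SAMPLE_RATE = 16_000  # whisper.load_audio resamples everything to 16 kHz mono

# 1. Diarization: get (start, end, speaker) segments for the recording.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # gated model: accept its terms on HF first
)
diarization = diarizer("meeting.wav")

# 2. ASR: transcribe each diarized chunk separately.
asr = whisper.load_model("base")
audio = whisper.load_audio("meeting.wav")  # float32 numpy array @ 16 kHz

for turn, _, speaker in diarization.itertracks(yield_label=True):
    chunk = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    text = asr.transcribe(chunk, fp16=False)["text"].strip()
    print(f"[{turn.start:7.2f}-{turn.end:7.2f}] {speaker}: {text}")
```

One caveat with this design: very short turns can transcribe poorly, so some pipelines instead transcribe the whole file once and align word timestamps to the diarization segments.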
This makes some sense to me, but I have to admit the error rates for chunk-level speaker identification seem uncomfortably high. I imagine a model custom-trained on known voices would do better, so if (e.g.) you wanted to regularly identify players in a D&D campaign, you could probably collect a few voice samples per player and do it that way. But in less controlled circumstances, the confusion (wrong speaker), false-alarm, and missed-dialogue error rates are surprisingly high in the models I've seen so far.
-- Morgan
The current model doesn't support diarization. For a dedicated diarization model, see: https://huggingface.co/nvidia/diar_sortformer_4spk-v1
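For reference, usage along these lines is shown on that model card (verify against the card itself; this assumes `nemo_toolkit` is installed and the audio path is a placeholder):

```python
# Sketch based on the nvidia/diar_sortformer_4spk-v1 model card usage.
# ASSUMPTION: requires NVIDIA NeMo (nemo_toolkit); path below is a placeholder.
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the pretrained Sortformer diarization model and switch to inference mode.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
diar_model.eval()

# Returns speaker-labeled segments (the model handles up to 4 speakers);
# see the model card for the exact output format.
predicted_segments = diar_model.diarize(
    audio="/path/to/multispeaker_audio.wav", batch_size=1
)
print(predicted_segments)
```

Those segments could then feed the diarize-then-transcribe pipeline sketched earlier in the thread.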