Hey guys, I tested the /stream endpoint with WebSockets and got noticeably worse results than with regular transcription. My guess is that the endpoint is splitting the audio at arbitrary points, which breaks the model's context for the word or phrase at each boundary.
It seems to me that one way to make this work could be to detect silence, either on the client or the server, and slice the audio there, so each chunk is bounded by the user's pauses.
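For what it's worth, here's a rough sketch of what I mean on the client side: a naive energy-based silence detector that closes a chunk after a run of quiet frames. All the names, frame sizes, and thresholds are made up for illustration, and a real client would probably use a proper VAD instead:

```python
def rms(frame):
    """Root-mean-square energy of a frame of float samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def split_on_silence(samples, frame_len=160, threshold=0.01, min_silent_frames=3):
    """Split `samples` into chunks at runs of low-energy (silent) frames.

    Each returned chunk ends at a user pause, so no word should be cut
    mid-utterance. Thresholds here are illustrative, not tuned values.
    """
    chunks, current, silent_run = [], [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        silent_run = silent_run + 1 if rms(frame) < threshold else 0
        current.extend(frame)
        # Close the chunk once enough consecutive silent frames have passed,
        # but only if it actually contains some speech.
        if silent_run >= min_silent_frames and any(abs(s) >= threshold for s in current):
            chunks.append(current)
            current, silent_run = [], 0
    # Flush a trailing chunk only if it contains speech, not pure silence.
    if current and any(abs(s) >= threshold for s in current):
        chunks.append(current)
    return chunks
```

Each chunk would then be sent over the WebSocket as its own transcription unit, so the model always sees a complete phrase.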
Just wanted to know what you think, have a good day :)