The speech-to-text API provides two endpoints, transcriptions and translations, based on our state-of-the-art open-source large-v2 Whisper model. They can be used to:
Transcribe audio into whatever language the audio is in.
Translate and transcribe the audio into English.
File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
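Since an oversized or unsupported file will presumably be rejected by the API, it can be worth checking both limits locally before uploading. The following is a minimal sketch, not part of the API itself: the helper name check_audio_file is hypothetical, and it assumes that "25 MB" means 25 * 1024 * 1024 bytes and that the file type can be judged from the file extension.

```python
import os

# Limits taken from the text above. Treating "25 MB" as 25 * 1024 * 1024 bytes
# is an assumption; the exact byte limit enforced by the API may differ.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_audio_file(path):
    """Return a list of problems found with the file (empty if none).

    Hypothetical helper: it only looks at the extension and the size on disk,
    not at the actual audio encoding.
    """
    problems = []
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_UPLOAD_BYTES}-byte limit")
    return problems
```

A file that passes returns an empty list. Longer recordings that exceed the limit would need to be split or compressed before upload.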
Here is an example in Python:
import requests |
The output is:
1 |
Notice that in the code above we set response_format to “srt”, which includes timestamps.
The transcript output format can also be any of the following: json, text, srt, verbose_json, or vtt.
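To make the choice of output format concrete, here is a hedged sketch of a transcription request built with requests. It rests on several assumptions that are not stated in the text above: the endpoint URL, the model name "whisper-1", and the bearer-token Authorization header. The function names build_form_fields and transcribe are hypothetical.

```python
# Assumed endpoint URL; check it against the current API reference.
API_URL = "https://api.openai.com/v1/audio/transcriptions"
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_form_fields(response_format="json", model="whisper-1"):
    """Validate the output format and return the non-file form fields.

    The model name "whisper-1" is an assumption, not taken from the text.
    """
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    return {"model": model, "response_format": response_format}


def transcribe(path, api_key, response_format="srt"):
    """Upload an audio file and return the transcript (hypothetical helper)."""
    # Imported here so that build_form_fields works without the dependency.
    import requests

    data = build_form_fields(response_format)
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as audio:
        response = requests.post(
            API_URL, headers=headers, data=data, files={"file": audio}, timeout=120
        )
    response.raise_for_status()
    # json and verbose_json are expected to come back as JSON documents;
    # text, srt, and vtt as plain text.
    if response_format in {"json", "verbose_json"}:
        return response.json()
    return response.text
```

Calling transcribe("audio.mp3", api_key, response_format="vtt") would then request WebVTT captions instead of SRT, assuming the endpoint behaves as described.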