Transcribing Like a Boss, For No Cost
One question I’ve been asked a few times over the past year is whether I know of a good tool for transcribing text from a video or audio file. AWS offers Amazon Transcribe for this, but it only has a monthly free limit, after which it starts charging. There is now a fantastic free option in the form of OpenAI’s Whisper.
With more and more audio and video content being published online, the ability to transcribe it quickly and accurately is becoming increasingly important. OpenAI's Whisper offers a powerful solution to this problem.
Whisper is a deep learning-based model trained on large amounts of data to produce high-quality text transcriptions from audio. It has been specifically designed to transcribe speech in various settings, including noisy environments, and to handle multiple speakers and accents. The model has been trained on a wide range of data, including publicly available audio content, which makes it well suited to the kind of material encountered in OSINT work.
Whisper can process large amounts of audio quickly and accurately and, for now at least, for free. Another advantage of using Whisper for OSINT is its ability to handle multiple languages and accents. This makes it possible to transcribe audio content from a wide variety of sources, whatever language is spoken.
I installed Whisper on my Windows host system using the command:
pip install -U openai-whisper
You can view the code for the project here: https://github.com/openai/whisper
Whisper also requires the audio and video processing tool ffmpeg to be installed on your system: https://ffmpeg.org/
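Before running anything, it can save some head-scratching to confirm that both prerequisites are actually reachable. The following is a minimal sketch using only the Python standard library; the function name missing_prerequisites is my own and not part of Whisper.

```python
import importlib.util
import shutil


def missing_prerequisites(tools=("ffmpeg",), modules=("whisper",)):
    """Return the names of required tools and modules that cannot be found.

    `tools` are executables looked up on PATH; `modules` are top-level
    Python module names looked up in the current environment.
    """
    missing = [tool for tool in tools if shutil.which(tool) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("ffmpeg and whisper were both found.")
```

If ffmpeg shows up as missing even though it is installed, the usual culprit is that its folder has not been added to the PATH environment variable.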
Once installed, I tested it on a video from my personal trainer Ben Canning. By default, Whisper uses the first 30 seconds of audio to determine what language to use. Here it correctly detects English and starts transcribing the audio.
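I ran Whisper from the command line, but the same package can also be called from Python. Here is a minimal sketch of that route, assuming the "base" model is good enough for the job; transcribe_file is a hypothetical helper name, while whisper.load_model and model.transcribe are the calls documented in the project's README.

```python
def transcribe_file(media_path, model=None, model_name="base", **options):
    """Transcribe one audio or video file and return (language, text).

    `model` may be any object with a Whisper-style `transcribe` method.
    When it is omitted, a Whisper model is loaded by name, which downloads
    the model weights on first use.
    """
    if model is None:
        import whisper  # installed by `pip install -U openai-whisper`

        model = whisper.load_model(model_name)
    # Extra keyword arguments are passed straight through to Whisper.
    result = model.transcribe(str(media_path), **options)
    return result["language"], result["text"].strip()
```

Calling transcribe_file("workout.mp4") (a made-up file name) would return the detected language code along with the full transcript as a single string.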
Once Whisper was finished processing the video, it generated multiple text files with the transcription. Some have just the text; others contain the text along with the timestamps, similar to the view produced in the terminal window.
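If you go through the Python API instead, the result of a transcription includes a list of segments, each with "start", "end" and "text" entries. The sketch below turns those into timestamped lines similar to the terminal view. The helper names are my own, and the timestamp layout is an approximation rather than Whisper's exact formatting.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments):
    """Build one '[start --> end] text' line per transcription segment."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]
```

Having the timestamps next to the text makes it much easier to jump back to the exact point in a recording where something was said.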
Whisper handled the video with zero issues, so I decided to try one in a language other than English and with lower-quality audio. I picked one of the videos of Juan Joya Borja, AKA “Spanish Laughing Guy”.
Here, Whisper incorrectly identifies the language as Galician, which is understandable considering Galician's similarity to Spanish and Portuguese.
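To see how close a call like this was, you can ask the model for its language probabilities directly. The sketch below follows the language detection pattern from the Whisper README and assumes the "base" model, which uses the default spectrogram settings (as far as I know, some of the newer large models need an extra n_mels argument). The names rank_languages and detect_language_candidates are hypothetical.

```python
def rank_languages(probs, top_n=3):
    """Sort a {language_code: probability} mapping, most likely first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def detect_language_candidates(media_path, model_name="base", top_n=3):
    """Return the most likely languages for the first 30 seconds of audio."""
    import whisper  # installed by `pip install -U openai-whisper`

    model = whisper.load_model(model_name)
    # Load the audio and pad or trim it to the 30-second window Whisper expects.
    audio = whisper.pad_or_trim(whisper.load_audio(str(media_path)))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    return rank_languages(probs, top_n)
```

If Spanish turns up just behind Galician in the ranking, that is a strong hint that setting the language by hand is the right move.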
We can force Whisper to use a specific language with the “--language” option (for example, “--language Spanish”).
As you can see here, Whisper is capable of handling a large number of languages.
As if all this wasn’t enough, Whisper can also translate the audio into English for you instead of transcribing it in the original language.
For most uses, I would stick to having Whisper transcribe and then use a dedicated translation engine like Google Translate or DeepL to translate, but I can see use cases where having the translation taken care of in “real-time” would be advantageous.
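Both the forced language and the translation mode are just command-line options, so they are easy to script. Here is a small sketch that builds and runs the whisper command from Python. The flags --task, --language, --model and --output_dir are real Whisper CLI options, while the helper names and the file name in the usage note are made up for illustration.

```python
import subprocess


def build_whisper_command(media_path, language=None, task="transcribe",
                          model=None, output_dir=None):
    """Build the argument list for the `whisper` command-line tool."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    cmd = ["whisper", str(media_path), "--task", task]
    if language:
        cmd += ["--language", language]  # e.g. "Spanish" or "es"
    if model:
        cmd += ["--model", model]
    if output_dir:
        cmd += ["--output_dir", str(output_dir)]
    return cmd


def run_whisper(media_path, **kwargs):
    """Run Whisper on one file; raises CalledProcessError if it fails."""
    return subprocess.run(build_whisper_command(media_path, **kwargs), check=True)
```

For example, run_whisper("laughing_guy.mp4", language="Spanish", task="translate") would skip language detection and write an English translation instead of a Spanish transcript.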
This capability has countless uses, including transcribing audio interviews or statements and transcribing or translating video and audio content posted online. Having this level of capability available for free makes Whisper an extremely handy addition to the OSINT practitioner’s toolbox.