Whisper: AI Audio Transcription by OpenAI


Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

Introduction

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Using such a large and diverse dataset leads to greater robustness against accents, background noise, and technical language. Additionally, it allows for transcription in multiple languages as well as translation of those languages into English.

Website

Paper

GitHub

Model card

Installation

To install this tool, it's best to create a new Anaconda environment.

```python
!conda create -n whisper
```

We activate the environment

```python
!conda activate whisper
```

We install all the necessary packages

```python
!conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
```

Next, we install Whisper

```python
!pip install git+https://github.com/openai/whisper.git
```

And we install ffmpeg, which Whisper uses to decode audio files

```python
!sudo apt update && sudo apt install ffmpeg
```

Usage

We import whisper

```python
import whisper
```

We select the model. The larger the model, the better it will transcribe, but the more memory it will need and the slower it will run.

```python
# model = "tiny"
# model = "base"
# model = "small"
# model = "medium"
model = "large"
model = whisper.load_model(model)
```
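To give a sense of the trade-off, here is a small sketch listing the approximate parameter counts of each model size, as reported in the Whisper model card; treat the numbers as rough figures rather than exact values.

```python
# Approximate parameter counts per Whisper model size
# (figures from the OpenAI Whisper model card; rough values).
model_params = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large": 1550e6,
}

# Larger models transcribe better but need more VRAM and run slower.
for name, params in model_params.items():
    print(f"{name:>6}: ~{params / 1e6:.0f}M parameters")
```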

We load the audio from this old Micro Machines ad (from 1987)

```python
audio_path = "MicroMachines.mp3"
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)
```
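Whisper works on fixed 30-second windows of 16 kHz audio, so `pad_or_trim` clips longer audio and zero-pads shorter audio to exactly 480,000 samples. As a rough illustration of what it does (not the library's actual implementation), here is a minimal sketch with NumPy:

```python
import numpy as np

SAMPLE_RATE = 16_000               # Whisper expects 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # 30-second window = 480,000 samples

def pad_or_trim_sketch(audio: np.ndarray, length: int = CHUNK_SAMPLES) -> np.ndarray:
    """Rough reimplementation of whisper.pad_or_trim, for illustration only."""
    if len(audio) > length:
        return audio[:length]                            # trim long audio
    return np.pad(audio, (0, length - len(audio)))       # zero-pad short audio

short = pad_or_trim_sketch(np.ones(10))
long = pad_or_trim_sketch(np.ones(CHUNK_SAMPLES + 5))
print(short.shape, long.shape)  # both (480000,)
```

Note that this also means anything beyond the first 30 seconds of the file is discarded by this low-level pipeline.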
We compute the log-Mel spectrogram of the audio and move it to the model's device

```python
mel = whisper.log_mel_spectrogram(audio).to(model.device)
```
We detect the language of the audio

```python
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```

Output:

```
Detected language: en
```
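`detect_language` returns a dictionary mapping each language code to its probability, and `max(probs, key=probs.get)` simply picks the key with the highest value. With hypothetical probabilities (made up here for illustration), the idiom works like this:

```python
# Hypothetical probabilities, shaped like those returned by
# model.detect_language: language code -> probability.
probs = {"en": 0.92, "es": 0.05, "fr": 0.02, "de": 0.01}

# max with key=probs.get returns the key whose value is largest.
detected = max(probs, key=probs.get)
print(detected)  # en
```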
We decode the audio with the default decoding options

```python
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
```
And we inspect the transcription

```python
result.text
```

Output:

```
"This is the Micro Machine Man presenting the most midget miniature motorcade of micro machines. Each one has dramatic details, terrific trim, precision paint jobs, plus incredible micro machine pocket play sets. There's a police station, fire station, restaurant, service station, and more. Perfect pocket portables to take any place. And there are many miniature play sets to play with and each one comes with its own special edition micro machine vehicle and fun fantastic features that miraculously move. Raise the boat lift at the airport, marina, man the gun turret at the army base, clean your car at the car wash, raise the toll bridge. And these play sets fit together to form a micro machine world. Micro machine pocket play sets so tremendously tiny, so perfectly precise, so dazzlingly detailed, you'll want to pocket them all. Micro machines and micro machine pocket play sets sold separately from Galoob. The smaller they are, the better they are."
```
