Easily Transcribe Podcasts with Whisper.cpp

If you've ever needed to transcribe a podcast, lecture, or some other audio recording, it turns out to be surprisingly easy with the extremely impressive whisper.cpp project. This high-performance C/C++ port of OpenAI's Whisper runs on all sorts of hardware -- including my M1 Mac Mini. Let's walk through transcribing an episode of the Alter Everything podcast from start to finish.

Obtain Audio File(s)

First, let's grab the audio from YouTube as a WAV file using the youtube-dl utility. whisper.cpp expects WAV input, so we explicitly request `wav` as the audio format rather than relying on youtube-dl's default.

 $ youtube-dl \
    --extract-audio \
    --audio-format wav \
    --output podcast.wav \
    "https://www.youtube.com/watch?v=CoUN690wSYQ"

This file has a 44.1 kHz sample rate, but whisper.cpp expects 16 kHz, so let's convert it with ffmpeg.

 $ file podcast.wav
podcast.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

 $ ffmpeg -i podcast.wav -ar 16000 podcast-16khz.wav

 $ file podcast-16khz.wav
podcast-16khz.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 16000 Hz
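
For reference, the whisper.cpp README suggests a slightly stricter conversion that also downmixes to mono and forces 16-bit PCM; the stereo file above transcribed fine, but this form is probably the safer default:

 $ ffmpeg -i podcast.wav -ar 16000 -ac 1 -c:a pcm_s16le podcast-16khz.wav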

# NOTE: it looks like it's possible to handle this conversion as a post-processing
# step via a flag to the `youtube-dl` command -- I will explore this further next time...
# youtube-dl --extract-audio --audio-quality 0 --audio-format mp3 --postprocessor-args "-ar 44100" %dl%
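
In the meantime, here's an untested sketch of that one-step approach: `--postprocessor-args` hands its arguments to the ffmpeg post-processor, so something like the following should produce the 16 kHz WAV directly:

 $ youtube-dl \
    --extract-audio \
    --audio-format wav \
    --postprocessor-args "-ar 16000" \
    --output podcast-16khz.wav \
    "https://www.youtube.com/watch?v=CoUN690wSYQ"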

Build whisper.cpp & Transcribe Audio

Next, let's clone the latest whisper.cpp, download the English base.en model, and build the main example.

# Clone the `whisper.cpp` repository
 $ git clone --depth 1 git@github.com:ggerganov/whisper.cpp && cd whisper.cpp

# Download the English Whisper model in `ggml` format
 $ bash ./models/download-ggml-model.sh base.en

# Build the main example
 $ make
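
Note that `base.en` sits toward the smaller, faster end of the Whisper model family; the same script can fetch larger English models (e.g. `small.en` or `medium.en`) if you're willing to trade transcription time for accuracy:

 $ bash ./models/download-ggml-model.sh small.en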

And finally, let's transcribe that podcast!

 $ ./main \
    -m ~/workspace/whisper.cpp/models/ggml-base.en.bin \
    -f ~/Downloads/podcast-16khz.wav \
    --output-vtt \
    --output-file out

# whisper_print_timings:     load time =   114.71 ms
# whisper_print_timings:     fallbacks =   0 p /   0 h
# whisper_print_timings:      mel time =   692.20 ms
# whisper_print_timings:   sample time = 22278.10 ms / 27893 runs (    0.80 ms per run)
# whisper_print_timings:   encode time = 10000.75 ms /    55 runs (  181.83 ms per run)
# whisper_print_timings:   decode time =   331.77 ms /    54 runs (    6.14 ms per run)
# whisper_print_timings:   batchd time = 45236.73 ms / 27566 runs (    1.64 ms per run)
# whisper_print_timings:   prompt time =  1921.90 ms / 11832 runs (    0.16 ms per run)
# whisper_print_timings:    total time = 80709.54 ms

A full podcast transcribed in ~80 seconds on an M1 Mac Mini -- not too bad! Here's the beginning of the resulting transcript:

# out.vtt

00:00:00.000 --> 00:00:06.480
 >> Hi everyone. We recently launched a short engagement feedback survey for the Alter Everything

00:00:06.480 --> 00:00:11.360
 podcast. Click the link in the episode description wherever you're listening to let us know what

00:00:11.360 --> 00:00:16.320
 you think and help us improve our show.

00:00:16.320 --> 00:00:21.200
 Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Megan

00:00:21.200 --> 00:00:26.440
 Dibble and today I'm talking with Nick Schrock, CTO and founder of Dagster Labs. We discussed

00:00:26.440 --> 00:00:31.560
 data engineering trends, challenges in the field, why he started his company, and what

00:00:31.560 --> 00:00:38.960
 makes him excited about the future of data engineering. Let's get started.

00:00:38.960 --> 00:00:42.720
 >> Hi, Nick. It's great to have you on our show today. Thanks for being here.

00:00:42.720 --> 00:00:43.920
 >> Thanks for having me.

00:00:43.920 --> 00:00:48.280
 >> Yeah. Could you start off by giving an introduction to yourself for our listeners?

00:00:48.280 --> 00:00:52.920
 >> Sure. My name is Nick Schrock. I'm the CTO and founder of Dagster Labs. There's the

00:00:52.920 --> 00:00:59.520
 company behind Dagster, which is a data orchestration framework. Prior to doing this, I was an engineer

00:00:59.520 --> 00:01:05.960
 at Facebook from 2009, 2017. While I was there, I found a team called product infrastructure

00:01:05.960 --> 00:01:09.800
 whose goal was to make our application developers more efficient and productive, and a bunch

00:01:09.800 --> 00:01:13.840
 of open source work came out of that actually, one of which was React, which I had nothing

00:01:13.840 --> 00:01:18.040
 to do with, but actually the CEO of Dagster Labs co-created and I personally co-created

00:01:18.040 --> 00:01:22.640
 GraphQL. So as I like to say, Pete and I were present at the creation of the full hipster

00:01:22.640 --> 00:01:28.680
 stack. I moved on to Facebook in 2017, figuring out what to do next, and this data engineering

00:01:28.680 --> 00:01:32.960
 and data orchestration problem really got me hooked actually quite soon after I left,

00:01:32.960 --> 00:01:36.280
 and the rest is history. I'm sure we'll get into that more.