How to Set Up Free, Self‑Hosted Automatic Meeting Transcription in 60 Minutes (Docker + Whisper + Speaker Labels)
A practical, beginner-friendly guide to running Whisper in Docker with a simple web UI, adding speaker labels, and turning raw recordings into searchable meeting transcripts—without sending audio to third-party cloud services.
Use Docker Compose to run a Whisper container (and optionally a web UI), mounting a local ./data folder for inputs and outputs. Drop an audio file into ./data/in, run transcription (or upload via the UI), and collect outputs like TXT/JSON/SRT/VTT in ./data/out.
“Free” means no per-minute API fees—you pay only in compute (CPU/GPU time) and storage. You’ll still need disk space for models, audio files, and generated outputs.
A GPU is not required. CPU-only works, but it’s slower, especially with larger models. A 60-minute meeting can take ~30 minutes to several hours on CPU depending on the model size, while an NVIDIA GPU with CUDA is the fastest setup.
Tiny/base are fastest but less accurate, while small/medium are a good balance for meetings. Large gives the best accuracy but is slower and more GPU-hungry; this guide recommends starting with small or medium for most meeting use cases.
Whisper doesn’t reliably identify speakers by itself, so you add a diarization step (commonly with pyannote) that outputs speaker time windows. Then you align those windows with Whisper’s timestamped segments to produce a speaker-labeled transcript.
Some diarization models are gated and require accepting a license or using an access token. If you need “no external accounts,” choose an alternative diarization model that can be downloaded and run locally without gated access.
Diarization is probabilistic and struggles with cross-talk, noisy rooms, and similar voices. Improve results with cleaner audio, better mic placement, separate tracks when possible, and post-processing like merging very short turns (<1s) to reduce label flicker.
When the UI works but outputs aren’t saved, the cause is usually a Docker volume or output-path issue. Ensure your volume mounts are correct and confirm the containers are writing to the expected directory (e.g., /data/out).
A simple pattern is ./data/in for audio and ./data/out for outputs like .md, .json, and .rttm. Use standardized filenames such as date + meeting name to make archives easier to browse and search.
Self-hosted meeting transcription is having a moment—and for good reason. Teams want **accurate transcripts**, **speaker labels**, and **searchable archives** without uploading sensitive audio to third-party services.
In this guide, you’ll set up a **free, self-hosted automatic meeting transcription** workflow in about **60 minutes** using:
- **Docker** (easy deployment)
- **Whisper** (speech-to-text)
- **Speaker labels** (diarization so you can see who said what)
- An optional **web UI** (so non-technical teammates can upload audio)
You’ll end with a repeatable pipeline you can run on a laptop, workstation, or a small server.
---
What you’ll build (and what “free” really means)
**“Free”** here means: no per-minute API fees. You’ll pay only in **compute** (your CPU/GPU time) and **storage**.
Architecture overview
1. Upload/collect a meeting recording (MP3/WAV/M4A)
2. Run **Whisper** to create a transcript
3. Run **speaker diarization** to identify speaker segments
4. Merge diarization + transcript → **speaker-labeled transcript**
5. Export to TXT/Markdown/SRT/VTT (and optionally index it)
---
Prerequisites (5–10 minutes)
Hardware recommendations
- **Fastest setup:** NVIDIA GPU with CUDA (Whisper runs much faster)
- **CPU-only:** works fine, just slower (especially with larger models)
Rule of thumb: a 60-minute meeting on CPU can take anywhere from ~30 minutes to several hours depending on model size.
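To replace the rule of thumb with a number for your own machine, one option is to time a short clip and extrapolate. The sketch below assumes processing time scales roughly linearly with audio length, which is only an approximation; the function names and the sample measurement (a 5-minute clip taking 150 seconds) are hypothetical.

```python
def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimate_minutes(meeting_minutes: float, rtf: float) -> float:
    """Extrapolate total processing time for a meeting from a measured factor."""
    return meeting_minutes * rtf


# Hypothetical measurement: a 5-minute clip took 150 seconds to transcribe.
rtf = realtime_factor(processing_seconds=150, audio_seconds=300)
print(f"real-time factor: {rtf:.2f}")
print(f"60-minute meeting: ~{estimate_minutes(60, rtf):.0f} minutes")
```

Re-measure whenever you change model size or hardware, since both shift the factor considerably.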
Software
- Docker + Docker Compose
- ~10–20 GB free disk space (models + audio + outputs)
---
Step 1: Choose a Whisper “container + UI” approach (10 minutes)
You have two common routes:
Option A: CLI-first (simpler, more flexible)
You run transcription from the terminal and store outputs on disk. Best if you’re automating or integrating with other tools.
Option B: Web UI (easier for teams)
A lightweight web interface lets you upload audio files and download transcripts without learning the CLI. Many community guides take this route because it lowers the barrier for non-technical teammates.
If your goal is “drop a file, get a transcript,” start with a UI.
---
Step 2: Spin up Whisper in Docker (15 minutes)
Below is a practical Docker Compose baseline you can adapt. It assumes you’ll run:
- a **Whisper transcription service**
- a **simple UI** (optional)
> Note: The Whisper ecosystem has multiple images (some bundle faster-whisper, some include a UI). The exact image name varies by project, but the pattern is consistent: mount a folder for inputs/outputs, expose a port for UI, and optionally enable GPU.
Example `docker-compose.yml` (template)
```yaml
services:
  whisper:
    image: your-whisper-image:latest   # placeholder: use the image you chose
    container_name: whisper
    volumes:
      - ./data:/data
    environment:
      - WHISPER_MODEL=medium
    # GPU support (NVIDIA). Remove the deploy block if CPU-only.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  whisper-ui:
    image: your-whisper-ui-image:latest   # placeholder: optional UI image
    container_name: whisper-ui
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data
    depends_on:
      - whisper
```
Start the stack
```bash
docker compose up -d
```
Test a transcription
1. Put an audio file in `./data/in/meeting.m4a`
2. Run the container’s transcription command (varies by image), or upload via UI
3. Confirm you get outputs like:
- `meeting.txt`
- `meeting.json` (timestamps)
- `meeting.srt` / `meeting.vtt`
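If you want to script the confirmation step rather than eyeball the folder, a small check like the following can help. It is a sketch that assumes outputs land in `./data/out` and share the input file's stem; your image may use different names or formats, and `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(out_dir, stem, extensions=("txt", "json", "srt", "vtt")):
    """Return the expected output filenames that do not exist in out_dir."""
    out = Path(out_dir)
    return [
        f"{stem}.{ext}"
        for ext in extensions
        if not (out / f"{stem}.{ext}").is_file()
    ]


if __name__ == "__main__":
    # Assumed layout: ./data/out/meeting.txt, meeting.json, and so on.
    missing = missing_outputs("./data/out", "meeting")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all expected outputs are present")
```

An empty result means every expected file was written; anything listed usually points at a volume mount or output-path setting.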
#### Picking a model size
- **tiny/base**: fastest, lower accuracy
- **small/medium**: good balance for meetings
- **large**: best accuracy, slower and more GPU-hungry
For most meeting use cases, **small or medium** is a practical starting point.
---
Step 3: Add speaker labels (diarization) (20 minutes)
Whisper transcribes text, but it doesn’t reliably identify speakers out of the box. For **speaker labels**, you typically add a diarization step, then align it with the transcript.
How diarization works (in plain terms)
A diarization model analyzes the audio and outputs time ranges like:
- Speaker 1: 00:00–00:12
- Speaker 2: 00:12–00:22
- Speaker 1: 00:22–00:40
You then match transcript words/sentences to these time windows.
A practical approach: Pyannote + alignment
A common self-hosted pattern is:
1. Run diarization (e.g., pyannote)
2. Generate a word-level or segment-level timestamped transcript (Whisper can do segments; forced alignment can improve this)
3. Assign each segment to the most overlapping speaker window
> Important: some diarization models require accepting a license or using an access token. If you need strictly “no external accounts,” choose an alternative diarization model that can be downloaded and run locally without gated access.
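The "most overlapping speaker window" rule can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the method of any particular library: segments and speaker turns are represented as dictionaries with `start` and `end` in seconds, and `assign_speaker` is a hypothetical helper name.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(segment, turns, default="Unknown"):
    """Pick the speaker whose window overlaps the transcript segment the most."""
    best_speaker, best_overlap = default, 0.0
    for turn in turns:
        shared = overlap(segment["start"], segment["end"], turn["start"], turn["end"])
        if shared > best_overlap:
            best_speaker, best_overlap = turn["speaker"], shared
    return best_speaker


# Speaker windows from the example above, expressed in seconds.
turns = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 12.0},
    {"speaker": "Speaker 2", "start": 12.0, "end": 22.0},
    {"speaker": "Speaker 1", "start": 22.0, "end": 40.0},
]

# A segment that straddles a speaker change: 2 s with Speaker 1, 3 s with Speaker 2.
segment = {"start": 10.0, "end": 15.0, "text": "Sounds good."}
print(assign_speaker(segment, turns))
```

Segments that straddle a speaker change are where this rule is weakest; word-level timestamps reduce the problem because each word is short enough to fall almost entirely inside one window.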
Example: diarization container (concept)
Add another service that reads from `./data/in` and writes RTTM (speaker segments) to `./data/out`.
```yaml
  diarizer:
    image: your-diarization-image:latest
    container_name: diarizer
    volumes:
      - ./data:/data
```
The output you’re looking for is typically an **RTTM file** (standard diarization format) or a JSON with speaker segments.
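If your diarizer emits RTTM, reading it takes little code. In the RTTM layout, each `SPEAKER` line is whitespace-separated, with the turn onset in the fourth field, the duration in the fifth, and the speaker name in the eighth. The sketch below relies on that layout; the sample lines and the `parse_rttm` name are made up for illustration.

```python
def parse_rttm(text):
    """Parse RTTM text into a list of speaker turns sorted by start time."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and any non-SPEAKER record types.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append({"speaker": fields[7], "start": onset, "end": onset + duration})
    return sorted(turns, key=lambda turn: turn["start"])


# Illustrative RTTM content; real files come from your diarization container.
sample = """\
SPEAKER meeting 1 0.00 12.00 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER meeting 1 12.00 10.00 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER meeting 1 22.00 18.00 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

for turn in parse_rttm(sample):
    print(turn)
```

Diarizers typically emit generic names such as `SPEAKER_00`; mapping them to "Speaker 1" or to real names is a separate, manual step.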
---
Step 4: Merge transcript + diarization into a speaker-labeled transcript (10 minutes)
At this point you have:
- Whisper segments with timestamps
- Diarization speaker windows
You need a small merge step. You can do this in Python, or use existing community scripts.
What “good” looks like
Output example:
```text
[00:00] Speaker 1: Quick agenda today—review Q1 pipeline and next steps.
[00:14] Speaker 2: Sounds good. I’ll start with the numbers.
```
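One way to produce output in this shape is sketched below. It assumes each transcript segment already carries a `speaker` label alongside `start` (seconds) and `text`, and it joins consecutive segments from the same speaker into one line. The function names and sample segments are hypothetical.

```python
def format_timestamp(seconds):
    """Render seconds as [MM:SS]; minutes run past 59 for long meetings."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def render_transcript(labeled_segments):
    """Build speaker-labeled lines, merging consecutive segments by one speaker."""
    lines = []
    previous_speaker = None
    for seg in labeled_segments:
        text = seg["text"].strip()
        if lines and seg["speaker"] == previous_speaker:
            lines[-1] += " " + text  # same speaker keeps talking
        else:
            lines.append(f"{format_timestamp(seg['start'])} {seg['speaker']}: {text}")
            previous_speaker = seg["speaker"]
    return "\n".join(lines)


segments = [
    {"start": 0.0, "speaker": "Speaker 1", "text": " Quick agenda today."},
    {"start": 6.3, "speaker": "Speaker 1", "text": " Review Q1 pipeline and next steps."},
    {"start": 14.2, "speaker": "Speaker 2", "text": " Sounds good."},
]
print(render_transcript(segments))
```

Writing the result to a `.md` or `.txt` file in `./data/out` keeps it next to the raw JSON and RTTM for later reprocessing.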
Tips to improve speaker labeling quality
- **Use clean audio**: reduce echo, and if you can’t record separate tracks, record from one consistent source
- **Prefer separate tracks** (if you can record per-participant audio)
- **Avoid tiny segments**: merge very short diarization turns (<1s) to reduce label flicker
- **Set expectations**: diarization is probabilistic—cross-talk and similar voices will confuse it
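The "avoid tiny segments" tip can be applied as a post-processing pass over the diarization turns. The sketch below uses one possible policy, chosen here as an assumption: a turn shorter than `min_duration` is absorbed into the turn before it, and adjacent turns by the same speaker are then joined. `smooth_turns` is a hypothetical helper.

```python
def smooth_turns(turns, min_duration=1.0):
    """Absorb very short turns into the previous one and join same-speaker neighbors."""
    smoothed = []
    for turn in sorted(turns, key=lambda t: t["start"]):
        turn = dict(turn)  # copy so the caller's data is left untouched
        is_short = (turn["end"] - turn["start"]) < min_duration
        same_speaker = bool(smoothed) and smoothed[-1]["speaker"] == turn["speaker"]
        if smoothed and (is_short or same_speaker):
            # Extend the previous turn instead of starting a new one.
            smoothed[-1]["end"] = max(smoothed[-1]["end"], turn["end"])
        else:
            smoothed.append(turn)
    return smoothed


raw_turns = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 10.0},
    {"speaker": "Speaker 2", "start": 10.0, "end": 10.4},  # 0.4 s flicker
    {"speaker": "Speaker 1", "start": 10.4, "end": 20.0},
    {"speaker": "Speaker 2", "start": 20.0, "end": 30.0},
]
for turn in smooth_turns(raw_turns):
    print(turn)
```

A short turn at the very start has no predecessor and is kept as is. Genuine one-word interjections ("yes", "right") are also lost under this policy, so tune the threshold against a recording you know well.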
---
Step 5: Make it usable: folders, naming, and search (5 minutes)
A self-hosted system is only as useful as its retrieval.
Suggested folder structure
```
./data/
  in/
    2026-02-07-client-call.m4a
  out/
    2026-02-07-client-call.md
    2026-02-07-client-call.json
    2026-02-07-client-call.rttm
```
Quick win: standardized filenames
Include date + meeting name. It makes transcript archives far easier to browse.
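A tiny helper can enforce the naming convention so nobody has to remember it. This sketch assumes the `YYYY-MM-DD-meeting-name` pattern used in the folder example; `transcript_stem` is a hypothetical name.

```python
import re
from datetime import date


def transcript_stem(meeting_date, meeting_name):
    """Build a filename stem like 2026-02-07-client-call from a date and a title."""
    # Lowercase, then collapse every run of non-alphanumeric characters to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", meeting_name.lower()).strip("-")
    return f"{meeting_date.isoformat()}-{slug}"


stem = transcript_stem(date(2026, 2, 7), "Client Call")
print(f"./data/in/{stem}.m4a")
print(f"./data/out/{stem}.md")
```

Because ISO dates sort lexicographically, a plain directory listing then doubles as a chronological index.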
Optional: generate meeting notes automatically
Once you have transcripts, you can generate summaries, decisions, and action items locally using an LLM you host—or you can use a dedicated meeting notes tool.
If you want a workflow that’s *less engineering* and more “get me accurate notes and highlights,” tools like [PRODUCT_LINK]MeetGeek meeting summaries[/PRODUCT_LINK] can complement a self-hosted setup—especially when teams want consistent action items and shareable recaps.
---
Common pitfalls (and how to avoid them)
1) “Whisper is slow on my machine”
- Use **faster-whisper** builds if available
- Drop from **medium → small**
- Enable GPU or run batch jobs overnight
2) “Speaker labels are wrong”
- Diarization struggles with cross-talk and noisy rooms
- Consider better mic placement or separate tracks
- Post-process: merge micro-segments, set minimum turn length
3) “My UI works but outputs aren’t saved”
- Ensure your Docker **volume mounts** are correct
- Confirm containers write to `/data/out` (or your chosen output path)
4) “We need meeting search and sharing”
Self-hosting gives you control, but you’ll still need:
- indexing
- permissions
- sharing links
- consistent summaries
Some teams keep transcription self-hosted for privacy and use a tool that offers [PRODUCT_LINK]an automated meeting archive with highlights[/PRODUCT_LINK] for sharing and discovery.
---
A 60-minute checklist
If you’re time-boxing this:
- **Minute 0–10:** install Docker, create `./data` folders
- **Minute 10–25:** run Whisper container (or UI), test transcript output
- **Minute 25–45:** run diarization, confirm RTTM/segments output
- **Minute 45–55:** merge diarization + transcript into speaker-labeled text
- **Minute 55–60:** export to Markdown/SRT and standardize filenames
---
Conclusion
Self-hosting automatic meeting transcription with **Docker + Whisper + speaker labels** is very achievable in about an hour—and it gives you strong control over privacy, cost, and customization.
The key is to treat it as a pipeline: **transcribe → diarize → merge → export → organize**. Once that’s in place, you can iterate on accuracy (model choice), speed (GPU/faster-whisper), and usability (a web UI, indexing, and automated recaps).
If your team eventually needs polished deliverables—action items, decisions, timestamps, and easy sharing—consider pairing your setup with [PRODUCT_LINK]MeetGeek for client-ready notes[/PRODUCT_LINK] while keeping control over how and where recordings are handled.