A practical, beginner-friendly guide to running Whisper in Docker with a simple web UI, adding speaker labels, and turning raw recordings into searchable meeting transcripts—without sending audio to third-party cloud services.

How to Set Up Free, Self‑Hosted Automatic Meeting Transcription in 60 Minutes (Docker + Whisper + Speaker Labels)

Self-hosted meeting transcription is having a moment—and for good reason. Teams want **accurate transcripts**, **speaker labels**, and **searchable archives** without uploading sensitive audio to third-party services.

In this guide, you’ll set up a **free, self-hosted automatic meeting transcription** workflow in about **60 minutes** using:

- **Docker** (easy deployment)

- **Whisper** (speech-to-text)

- **Speaker labels** (diarization so you can see who said what)

- An optional **web UI** (so non-technical teammates can upload audio)

You’ll end with a repeatable pipeline you can run on a laptop, workstation, or a small server.

---

What you’ll build (and what “free” really means)

**“Free”** here means: no per-minute API fees. You’ll pay only in **compute** (your CPU/GPU time) and **storage**.

Architecture overview

1. Upload/collect a meeting recording (MP3/WAV/M4A)

2. Run **Whisper** to create a transcript

3. Run **speaker diarization** to identify speaker segments

4. Merge diarization + transcript → **speaker-labeled transcript**

5. Export to TXT/Markdown/SRT/VTT (and optionally index it)
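Step 5 is mostly string formatting. As a minimal sketch, assuming each speaker-labeled segment is a dict with `start`, `end`, `speaker`, and `text` keys (names chosen here for illustration, not taken from any particular tool), an SRT export could look like this:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(cues)
```

VTT is close to the same shape, with a `WEBVTT` header line and a period instead of a comma before the milliseconds.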

---

Prerequisites (5–10 minutes)

Hardware recommendations

- **Fastest setup:** NVIDIA GPU with CUDA (Whisper runs much faster)

- **CPU-only:** works fine, just slower (especially with larger models)

Rule of thumb: a 60-minute meeting on CPU can take anywhere from ~30 minutes to several hours, depending on model size and how fast your processor is.
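One way to reason about that range is a real-time factor (RTF): processing time divided by audio length. The sketch below is only arithmetic. The two RTF values are hypothetical, picked to reproduce the range above, so measure your own machine on a short clip before relying on them.

```python
def estimate_minutes(audio_minutes: float, real_time_factor: float) -> float:
    """Estimated processing time = audio length x real-time factor."""
    return audio_minutes * real_time_factor


# Hypothetical RTFs, not benchmarks: 0.5 means "twice as fast as real time".
for label, rtf in [("smaller model", 0.5), ("larger model", 3.0)]:
    minutes = estimate_minutes(60, rtf)
    print(f"{label}: ~{minutes:.0f} min for a 60-minute meeting")
```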

Software

- Docker + Docker Compose

- ~10–20 GB free disk space (models + audio + outputs)

---

Step 1: Choose a Whisper “container + UI” approach (10 minutes)

You have two common routes:

Option A: CLI-first (simpler, more flexible)

You run transcription from the terminal and store outputs on disk. Best if you’re automating or integrating with other tools.

Option B: Web UI (easier for teams)

A lightweight web interface lets you upload audio files and download transcripts without learning the CLI. Many community guides take this route because it lowers the barrier for non-technical teammates.

If your goal is “drop a file, get a transcript,” start with a UI.

---

Step 2: Spin up Whisper in Docker (15 minutes)

Below is a practical Docker Compose baseline you can adapt. It assumes you’ll run:

- a **Whisper transcription service**

- a **simple UI** (optional)

> Note: The Whisper ecosystem has multiple images (some bundle faster-whisper, some include a UI). The exact image name varies by project, but the pattern is consistent: mount a folder for inputs/outputs, expose a port for UI, and optionally enable GPU.

Example `docker-compose.yml` (template)

```yaml
services:
  whisper:
    image: your-whisper-image:latest
    container_name: whisper
    volumes:
      - ./data:/data
    environment:
      - WHISPER_MODEL=medium
    # GPU support (NVIDIA). Remove the deploy block if CPU-only.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  whisper-ui:
    image: your-whisper-ui-image:latest
    container_name: whisper-ui
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data
    depends_on:
      - whisper
```

Start the stack

```bash
docker compose up -d
```

Test a transcription

1. Put an audio file in `./data/in/meeting.m4a`

2. Run the container’s transcription command (varies by image), or upload via UI

3. Confirm you get outputs like:

   - `meeting.txt`

   - `meeting.json` (timestamps)

   - `meeting.srt` / `meeting.vtt`
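If you want to script that last check, a small helper can report which expected files are missing. This is a sketch under assumptions: it presumes outputs share the recording's base name and land in a single output folder such as `./data/out`, which depends on the image you chose.

```python
from pathlib import Path


def missing_outputs(
    recording: Path,
    out_dir: Path,
    extensions: tuple[str, ...] = (".txt", ".json", ".srt"),
) -> list[str]:
    """Return the expected output filenames that do not exist in out_dir."""
    stem = recording.stem  # "meeting" for "meeting.m4a"
    return [
        f"{stem}{ext}"
        for ext in extensions
        if not (out_dir / f"{stem}{ext}").exists()
    ]
```

For example, `missing_outputs(Path("data/in/meeting.m4a"), Path("data/out"))` returns an empty list once all three files are present.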

#### Picking a model size

- **tiny/base**: fastest, lower accuracy

- **small/medium**: good balance for meetings

- **large**: best accuracy, slower and more GPU-hungry

For most meeting use cases, **small or medium** is a practical starting point.

---

Step 3: Add speaker labels (diarization) (20 minutes)

Whisper converts speech to text, but it has no built-in notion of who is speaking. For **speaker labels**, you typically add a separate diarization step, then align its output with the transcript.

How diarization works (in plain terms)

A diarization model analyzes the audio and outputs time ranges like:

- Speaker 1: 00:00–00:12

- Speaker 2: 00:12–00:22

- Speaker 1: 00:22–00:40

You then match transcript words/sentences to these time windows.

A practical approach: Pyannote + alignment

A common self-hosted pattern is:

1. Run diarization (e.g., pyannote)

2. Generate a word-level or segment-level timestamped transcript (Whisper can do segments; forced alignment can improve this)

3. Assign each segment to the most overlapping speaker window
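Step 3 of this pattern can be sketched in plain Python. The dict keys (`start`, `end`, `speaker`, `text`) are assumptions made for illustration, so adapt them to whatever your transcription and diarization tools emit. Each segment gets the label of the single speaker window it overlaps most, and `"Unknown"` if it overlaps none.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Copy each transcript segment and add the best-overlapping speaker label."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Segment-level assignment is the simplest choice. If one Whisper segment spans a speaker change, word-level timestamps give a cleaner split at the cost of more processing.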

> Important: some diarization models require accepting a license or using an access token. If you need strictly “no external accounts,” choose an alternative diarization model that can be downloaded and run locally without gated access.

Example: diarization container (concept)

Add another service that reads from `./data/in` and writes RTTM (speaker segments) to `./data/out`.

```yaml
  # Add under the existing `services:` key
  diarizer:
    image: your-diarization-image:latest
    container_name: diarizer
    volumes:
      - ./data:/data
```

The output you’re looking for is typically an **RTTM file** (standard diarization format) or a JSON with speaker segments.
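RTTM is a plain-text format with one space-separated line per speaker turn. In a `SPEAKER` line, the fourth field is the turn onset in seconds, the fifth is its duration, and the eighth is the speaker name. A minimal parser, assuming well-formed input and using illustrative dict keys, might look like this:

```python
def parse_rttm(text: str) -> list[dict]:
    """Parse RTTM text into speaker turns sorted by start time."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any non-SPEAKER record types.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(
            {"start": start, "end": start + duration, "speaker": fields[7]}
        )
    return sorted(turns, key=lambda turn: turn["start"])


# Illustrative sample; real speaker names depend on the diarization tool.
sample = (
    "SPEAKER meeting 1 0.00 12.00 <NA> <NA> SPEAKER_00 <NA> <NA>\n"
    "SPEAKER meeting 1 12.00 10.00 <NA> <NA> SPEAKER_01 <NA> <NA>\n"
)
print(parse_rttm(sample))
```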

---

Step 4: Merge transcript + diarization into a speaker-labeled transcript (10 minutes)

At this point you have:

- Whisper segments with timestamps

- Diarization speaker windows

You need a small merge step. You can do this in Python, or use existing community scripts.

What “good” looks like

Output example:

```text
[00:00] Speaker 1: Quick agenda today—review Q1 pipeline and next steps.
[00:14] Speaker 2: Sounds good. I’ll start with the numbers.
```
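A sketch of a renderer that produces lines in this shape follows. It assumes labeled segments are dicts with `start`, `speaker`, and `text` keys (hypothetical names), and it folds consecutive segments from the same speaker into one line so the transcript reads as turns rather than fragments.

```python
def clock(seconds: float) -> str:
    """Format seconds as MM:SS (minutes keep counting past 59 for long meetings)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_transcript(labeled_segments: list[dict]) -> str:
    """Render one '[MM:SS] Speaker: text' line per uninterrupted speaker turn."""
    lines = []
    previous_speaker = None
    for seg in labeled_segments:
        text = seg["text"].strip()
        if seg["speaker"] == previous_speaker:
            lines[-1] += " " + text  # same speaker keeps talking
        else:
            lines.append(f"[{clock(seg['start'])}] {seg['speaker']}: {text}")
            previous_speaker = seg["speaker"]
    return "\n".join(lines)
```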

Tips to improve speaker labeling quality

- **Use clean audio**: reduce echo; use a single source when possible

- **Prefer separate tracks** (if you can record per-participant audio)

- **Avoid tiny segments**: merge very short diarization turns (<1s) to reduce label flicker

- **Set expectations**: diarization is probabilistic—cross-talk and similar voices will confuse it
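The "avoid tiny segments" tip can be applied as a post-processing pass over the diarization turns before merging. The sketch below is one simple policy among several, not a standard algorithm: a turn shorter than `min_duration` is absorbed into the turn before it, and adjacent turns from the same speaker are joined. The dict keys are illustrative.

```python
def smooth_turns(turns: list[dict], min_duration: float = 1.0) -> list[dict]:
    """Absorb very short turns into the previous turn and join same-speaker neighbours."""
    smoothed = []
    for turn in sorted(turns, key=lambda t: t["start"]):
        current = dict(turn)  # copy so the caller's data is not modified
        if smoothed:
            previous = smoothed[-1]
            too_short = current["end"] - current["start"] < min_duration
            if too_short or current["speaker"] == previous["speaker"]:
                # Extend the previous turn instead of starting a new one.
                previous["end"] = max(previous["end"], current["end"])
                continue
        # The very first turn is always kept, even if it is short.
        smoothed.append(current)
    return smoothed
```

A one-second threshold matches the tip above. Raising it reduces flicker further but can swallow genuine short replies such as "yes" or "okay".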

---

Step 5: Make it usable: folders, naming, and search (5 minutes)

A self-hosted system is only as useful as its retrieval.

Suggested folder structure

```
./data/
  in/
    2026-02-07-client-call.m4a
  out/
    2026-02-07-client-call.md
    2026-02-07-client-call.json
    2026-02-07-client-call.rttm
```

Quick win: standardized filenames

Include date + meeting name. It makes transcript archives far easier to browse.
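A small helper keeps the naming consistent when files come from different people. This sketch assumes the `YYYY-MM-DD-meeting-name` pattern shown in the folder structure above. The function name is made up for illustration.

```python
import re
from datetime import date


def transcript_stem(meeting_date: date, title: str) -> str:
    """Build a filename stem like '2026-02-07-client-call' (no extension)."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{meeting_date.isoformat()}-{slug}"


print(transcript_stem(date(2026, 2, 7), "Client Call"))  # 2026-02-07-client-call
```

ISO dates at the front mean an alphabetical listing is also a chronological one.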

Optional: generate meeting notes automatically

Once you have transcripts, you can generate summaries, decisions, and action items locally using an LLM you host—or you can use a dedicated meeting notes tool.

If you want a workflow that’s *less engineering* and more “get me accurate notes and highlights,” tools like [PRODUCT_LINK]MeetGeek meeting summaries[/PRODUCT_LINK] can complement a self-hosted setup—especially when teams want consistent action items and shareable recaps.

---

Common pitfalls (and how to avoid them)

1) “Whisper is slow on my machine”

- Use **faster-whisper** builds if available

- Drop from **medium → small**

- Enable GPU or run batch jobs overnight

2) “Speaker labels are wrong”

- Diarization struggles with cross-talk and noisy rooms

- Consider better mic placement or separate tracks

- Post-process: merge micro-segments, set minimum turn length

3) “My UI works but outputs aren’t saved”

- Ensure your Docker **volume mounts** are correct

- Confirm containers write to `/data/out` (or your chosen output path)

4) “We need meeting search and sharing”

Self-hosted gives you control, but you’ll still need:

- indexing

- permissions

- sharing links

- consistent summaries
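For the indexing item alone, a plain keyword scan over the Markdown transcripts is often enough to start with. The sketch below assumes transcripts are UTF-8 `.md` files in one output folder. It does nothing for permissions or sharing, and a large archive would want a real search index instead.

```python
from pathlib import Path


def search_transcripts(out_dir: Path, query: str) -> list[tuple[str, int, str]]:
    """Return (filename, line number, line) for every case-insensitive match."""
    needle = query.lower()
    hits = []
    for path in sorted(out_dir.glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```

Calling `search_transcripts(Path("data/out"), "pipeline")` lists every transcript line that mentions the word, with the file and line number for each hit.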

Some teams keep transcription self-hosted for privacy, and use a tool like [PRODUCT_LINK]an automated meeting archive with highlights[/PRODUCT_LINK] for sharing and discovery.

---

A 60-minute checklist

If you’re time-boxing this:

- **Minute 0–10:** install Docker, create `./data` folders

- **Minute 10–25:** run Whisper container (or UI), test transcript output

- **Minute 25–45:** run diarization, confirm RTTM/segments output

- **Minute 45–55:** merge diarization + transcript into speaker-labeled text

- **Minute 55–60:** export to Markdown/SRT and standardize filenames

---

Conclusion

Self-hosting automatic meeting transcription with **Docker + Whisper + speaker labels** is very achievable in about an hour—and it gives you strong control over privacy, cost, and customization.

The key is to treat it as a pipeline: **transcribe → diarize → merge → export → organize**. Once that’s in place, you can iterate on accuracy (model choice), speed (GPU/faster-whisper), and usability (a web UI, indexing, and automated recaps).

If your team eventually needs polished deliverables—action items, decisions, timestamps, and easy sharing—consider pairing your setup with [PRODUCT_LINK]MeetGeek for client-ready notes[/PRODUCT_LINK] while keeping control over how and where recordings are handled.
