Transcribing and summarizing long meeting recordings with AI requires three layers working together: a speech-to-text model (Whisper is the current benchmark), a summarization framework to handle documents that exceed any model’s context window, and optionally a RAG pipeline for asking specific questions of the content. The full stack runs locally on a decent GPU, respects data residency requirements, and costs under $30 in cloud compute for extensive testing.

Why standard AI tools break on anything longer than an hour

The experiment started with a practical problem: six hours of video recordings from a client discovery session with an energy infrastructure company. The goal was straightforward – extract a structured summary without watching the whole thing again.

Two blockers appeared immediately. First, sending six hours of sensitive client discussions through a commercial API wasn’t viable. The client hadn’t consented to external data processing, and for companies in healthcare or logistics, where fireup.pro does most of its work, that’s often a hard constraint, not a preference. Second, many recordings simply don’t have a transcription layer. Google’s transcription tools lacked Polish language support. Most cloud services hit limits at 90 minutes or impose per-minute pricing that scales uncomfortably for long sessions.

So I built something local.

Step 1: Transcription with whisper and diarization

If you’re researching AI transcription tools, one name comes up constantly: Whisper, OpenAI’s speech recognition model. Most competitive alternatives in 2025–2026 are built on top of it understanding the original saves significant evaluation time.

For throughput at scale, the implementation worth knowing is Insanely Fast Whisper, an optimized fork that dramatically reduces processing time. Per the project’s own benchmarks, a 150-minute audio file processes in roughly five minutes on a capable GPU. That makes local transcription practical rather than theoretical.

Speaker identification, called diarization, is where the real value emerges. With diarization enabled, you specify how many speakers were present and receive a transcript broken down by individual voice. For a six-hour client session, this means you can ask questions specifically about what the client’s CTO said, or filter the summary to a single stakeholder’s concerns. The six-hour Axe Gas recording was transcribed this way, with the process taking around 20–30 minutes end-to-end.

One hardware note: running Whisper locally via OLLAMA is real work for a consumer GPU. Fan speeds hit maximum, graphics card temperatures push toward 70°C, and RAM usage can exceed 30GB under sustained load. It works — but set expectations accordingly.

Step 2: Three summarization methods and when each makes sense

A six-hour meeting transcript is too large for any current LLM to process in a single pass. LangChain addresses this with three summarization strategies, each suited to different document sizes and quality requirements.

MethodHow it worksBest forKey limitation
StuffLoads the full document into context at onceShort documents, quick testsBreaks above ~3,000 tokens locally
MapReduceChunks the document, summarizes each independently, combines resultsLong documents, parallel processingMay lose cross-chunk logical connections
RefineProcesses chunks sequentially; each summary becomes context for the nextComplex documents with narrative flowSlower, but produces more coherent output

Stuff is the simplest approach and works well for 20–30 minute session notes. Push it to a full transcript and you hit the context ceiling fast. On local OLLAMA hardware, going above 3,000 tokens meant unreliable outputs and system strain.

MapReduce solves the scale problem by splitting the document into chunks, by character count, semantic boundaries, or paragraph breaks, summarizing each in parallel, then reducing everything to a final output. The risk is that the combining step doesn’t detect contradictions. A claim made early in a meeting might be reversed later; MapReduce doesn’t guarantee the final summary reflects the correction.

Refine is the most sophisticated option for meeting recordings. Each chunk’s summary is passed as context to the processing of the next chunk, creating a chain where the model can identify and correct earlier statements as new information arrives. I tested this against a fairy tale narrative (Thumbelina, of all things) to see whether the beginning and ending would be logically connected across multiple chunks – they were, though chunk size and overlap tuning mattered.

The overlap setting solves a practical problem: when you cut a transcript at a hard character boundary, you risk splitting mid-sentence. An overlap of ~200 characters means each chunk begins with the tail of the previous one, giving the model enough context to understand what was being discussed before the cut.

Step 3: RAG – asking questions of your recordings

Summarization handles „give me the overview.” RAG (Retrieval-Augmented Generation) handles „what exactly did the client say about X?”

Instead of feeding the full transcript to a language model, RAG converts it into vector embeddings – numerical representations capturing semantic meaning and stores them in a vector database. When you submit a query, a retrieval model identifies the most relevant passages, and the language model generates an answer grounded only in those passages. The source can be traced back to specific chunks, which means you can cite exactly where an answer came from.

For the Axe Gas recordings, this enabled questions like:

  • „What are the main limitations of the client’s current system?” → The model returned specifics: too many process steps, vehicle registration numbers that change frequently, data integration challenges, certificate management complexity.
  • „What functions does integration with Fluxys cover?” → Slot reservation, data exchange, certificate handling, integration with Electronic Data Platform (EDP), LNG process management.
  • „What is the MVP scope and timeline mentioned during the session?” → 1–2 month delivery target, specific feature priorities, integration-first approach.

None of this required re-listening to six hours of audio. The answers cited actual statements from the session.

For embeddings, I used NOMIC Embed Text locally and Amazon Titan Embed Text v2 via AWS Bedrock for production-quality results. One detail that matters in multilingual environments: embedding models are often language-specific. Using an English-only embedding model on a Polish transcript causes silent retrieval failures the semantic search finds nothing relevant, and answers are meaningless. Multi-language models are a hard requirement when your recordings switch between languages or include non-English content.

For the conversational layer, a contextual rephrasing prompt significantly improved output quality. The approach: a first prompt reformulates the user’s question using conversation history before passing it to the retrieval step. This means a follow-up like „what about their integration constraints?” works correctly the model already knows from context that „their” refers to Axe Gas, and that integration was the previous topic.

What this actually costs

Running extensive tests processing the full six-hour Axe Gas recording, experimenting with chunk sizes, running dozens of RAG queries came to roughly $27 on AWS Bedrock over the testing period.

For reference: Claude 3.5 via Bedrock is priced at fractions of a cent per 1,000 tokens. A 150-minute transcript processed through aggressive chunking and multiple query cycles runs to a few dollars at most. Titan Embeddings is priced separately per hour of use and is effectively negligible for testing volumes.

Local OLLAMA testing costs nothing beyond electricity and hardware wear. The tradeoff is throughput and quality. Llama 3.1 (the version tested most extensively) produced serviceable summaries but occasionally returned inconsistent outputs across repeated runs. For client-facing deliverables, the cloud models via Bedrock were meaningfully more reliable.

From prototype to something useful

The pipeline described here isn’t a finished product it’s a research spike that clarified what’s possible and what still requires work. A few directions worth pursuing:

Slack integration: LangChain supports Slack as a document source, which means historical channel conversations could feed the same RAG pipeline. We tested a Slack bot setup where the bot responded to document questions based on a configurable system prompt. The concept works; the data governance questions require more thought before deploying this in a client context.

Multi-source knowledge bases: The same vector database approach works across document types PDFs, meeting recordings, Confluence pages, spec documents. An organization with dozens of discovery sessions, project specs, and process documents in one queryable knowledge base is a qualitatively different tool than a search index.

Prompt engineering for specialized roles: The difference between a generic summary and a useful one is often in the system prompt. Prompts instructing the model to focus on stakeholder concerns, MVP scope, integration constraints, or open questions produced significantly more actionable outputs than default summarization prompts.

This kind of tooling sits at the edge of what we’d build for a client versus what we’d build for internal use first. If you’re thinking through the same distinction, the project-or-process question is worth reading before committing resources to infrastructure.


Fireup.pro is a software house of ~65 people building digital products for healthcare, fintech, and logistics clients across the DACH region. If you’re evaluating AI tooling for your development or operations workflows, let’s talk.