Audio annotation is the work of adding structured, machine-readable labels to recorded sound so a model can learn from it. At its simplest that means writing down the words a person said. In practice it covers far more: marking exactly when each word starts and stops, naming who is speaking, tagging the cough and the door slam in the background, and noting whether a sentence is a question, a complaint, or a command. The raw audio file carries none of that. A person, or a person working alongside a model, has to put it there.
If you are about to label a few thousand hours of speech, the words you choose for those labels, and the rules you write for applying them, decide what your model can and cannot learn. This guide walks through the main types of audio annotation, the quality tiers buyers actually pay for, and how transcription and labels get checked before anyone calls a dataset finished.
What audio annotation actually produces
The output is rarely a single text file. It is a set of aligned layers sitting on the same waveform. One layer holds the transcript. Another holds time boundaries. Another holds speaker identities. Others hold whatever the task needs: event tags, intent labels, sentiment scores. A model trains on the combination, and the value of the dataset comes from how consistent those layers stay across thousands of files labeled by dozens of different people.
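As a rough illustration of those aligned layers, here is a minimal sketch of how one annotated file might be represented. The class names, field names, and the `validate` check are hypothetical, chosen for this example rather than taken from any standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float   # seconds from the start of the file
    end: float     # seconds from the start of the file
    speaker: str   # e.g. "Speaker 1"
    text: str      # transcript for this span
    tags: dict = field(default_factory=dict)  # e.g. {"intent": "book_flight"}


@dataclass
class AnnotatedFile:
    audio_path: str
    segments: list = field(default_factory=list)
    events: list = field(default_factory=list)  # (start, end, label) tuples

    def validate(self):
        """Return a list of human-readable problems; empty means none found."""
        problems = []
        for i, seg in enumerate(self.segments):
            if seg.end <= seg.start:
                problems.append(f"segment {i}: end is not after start")
            if not seg.text.strip():
                problems.append(f"segment {i}: empty transcript")
        return problems
```

Keeping every layer keyed to the same start and end times is what lets a consistency check like `validate` run across thousands of files without caring who labeled them.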
That consistency is the whole game. Two annotators who transcribe "yeah, no, I mean" differently, or who disagree on where a speaker turn begins, quietly teach a model that the same sound maps to two different answers. Good audio annotation is less about heroic individual effort and more about a guideline that strips out ambiguity before the work starts.
The main types of audio annotation
Verbatim and non-verbatim transcription
Transcription is the foundation of most speech labeling. The first decision is how literal to be. Verbatim transcription captures everything that was said: filler words, false starts, repetitions, stutters, and self-corrections. If someone says "I, uh, I think we should, we should wait," a verbatim transcript keeps all of it. This is usually what you want for training automatic speech recognition, because the model has to handle real, messy speech rather than a cleaned-up version of it.
Non-verbatim, or clean, transcription drops the disfluencies and tidies the sentence into something readable. That output suits subtitles, meeting notes, and search, but it teaches an ASR model the wrong target if you use it as ground truth. A third style, sometimes called true verbatim, also captures non-lexical sounds like laughter and audible breaths inline. Picking the right style up front matters more than people expect, because converting between styles later is slow and error-prone.
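To make the difference between the styles concrete, here is a toy sketch that turns a verbatim line into a cleaner one. The filler list and the repeat-collapsing rule are assumptions made for this example, not a production normalizer, and the function deliberately drops punctuation.

```python
import re

# Hypothetical filler list; a real guideline would define its own per language.
FILLERS = {"uh", "um", "er", "erm", "hmm"}


def strip_disfluencies(verbatim, max_repeat_len=3):
    """Remove filler words and collapse immediately repeated word runs."""
    words = [w for w in re.findall(r"[A-Za-z']+", verbatim)
             if w.lower() not in FILLERS]
    out = []
    for word in words:
        out.append(word)
        # If the last n words repeat the n words before them, drop the repeat.
        for n in range(max_repeat_len, 0, -1):
            if len(out) >= 2 * n:
                tail = [w.lower() for w in out[-n:]]
                before = [w.lower() for w in out[-2 * n:-n]]
                if tail == before:
                    del out[-n:]
                    break
    return " ".join(out)


print(strip_disfluencies("I, uh, I think we should, we should wait"))
# prints: I think we should wait
```

The same rule would also collapse a legitimate "no no" or "that that", which is a small demonstration of why automatic conversion between styles needs a human check.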
Timestamps and segmentation
Time alignment connects text to sound. At a coarse level, segmentation splits a recording into utterances or speaker turns and gives each one a start and end time. At a fine level, word-level or even phoneme-level timestamps mark exactly where each token lands in the waveform. Forced alignment tools generate much of this automatically, but they drift in overlapping speech, heavy accents, and noisy rooms, which is exactly where a human pass earns its cost.
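The link between the fine and coarse levels can be sketched in a few lines: given word-level timestamps, one simple heuristic groups words into utterances wherever the silence between them exceeds a threshold. The function name and the 0.5 second default are assumptions for illustration; real segmentation rules come from the project guideline.

```python
def split_on_pauses(words, max_gap=0.5):
    """Group (start, end, token) tuples, sorted by start, into utterances.

    A new utterance begins when the gap since the previous word's end
    exceeds max_gap seconds. Returns (start, end, text) tuples.
    """
    utterances = []
    current = []
    for word in words:
        if current and word[0] - current[-1][1] > max_gap:
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return [(u[0][0], u[-1][1], " ".join(w[2] for w in u)) for u in utterances]


words = [(0.0, 0.3, "book"), (0.35, 0.5, "a"), (0.55, 1.0, "flight"),
         (2.0, 2.4, "thanks")]
print(split_on_pauses(words))
# prints: [(0.0, 1.0, 'book a flight'), (2.0, 2.4, 'thanks')]
```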
Speaker labels and diarization
Diarization answers who spoke when. Annotators assign each segment to a consistent speaker label (Speaker 1, Speaker 2, and so on) across the whole file, and sometimes attach attributes like gender, approximate age band, or role. This gets hard in real conversations where people interrupt and talk over each other, and it is one of the areas where machine output most needs human correction before you can trust it.
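One way to surface the hard spots for human review is to list where segments from different speakers overlap in time. The sketch below assumes segments are plain `(start, end, speaker)` tuples; the function name is hypothetical.

```python
def find_overlaps(segments):
    """Return (start, end, speaker_a, speaker_b) for each cross-speaker overlap.

    segments: iterable of (start, end, speaker) tuples, times in seconds.
    """
    ordered = sorted(segments)
    overlaps = []
    for i, (start_a, end_a, spk_a) in enumerate(ordered):
        for start_b, end_b, spk_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later segment can overlap
            if spk_a != spk_b:
                overlaps.append((start_b, min(end_a, end_b), spk_a, spk_b))
    return overlaps


segments = [(0.0, 2.0, "Speaker 1"), (1.5, 3.0, "Speaker 2"),
            (3.0, 4.0, "Speaker 1")]
print(find_overlaps(segments))
# prints: [(1.5, 2.0, 'Speaker 1', 'Speaker 2')]
```

Regions flagged this way are natural candidates to route to a human pass first.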
Events and non-speech sounds
Speech rarely arrives clean. Annotators tag the sounds around it: background music, traffic, a baby crying, keyboard clicks, a phone ringing, silence. For in-car and smart-home voice work, this is not noise to be ignored. The events are often the point, because the model has to wake, stay quiet, or respond correctly while a vehicle is moving or a television plays in the next room.
Intent, sentiment, and emotion tags
Higher-level labels describe meaning rather than sound. Intent tagging marks what a speaker is trying to do: book a flight, cancel an order, ask for the time. Sentiment and emotion labels capture tone: frustrated, calm, urgent, happy. These layers power voice assistants and conversational agents, and they are the most subjective to apply, which makes a tight rubric and strong agreement checks non-negotiable.
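A common statistic for an agreement check on categorical labels like these is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The sketch below is a minimal implementation for two annotators; the example labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random with
    # their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["calm", "calm", "urgent", "urgent"]
annotator_2 = ["calm", "urgent", "urgent", "urgent"]
print(cohens_kappa(annotator_1, annotator_2))
# prints: 0.5
```

Here the annotators agree on three of four items (0.75), but chance alone would give 0.5, so kappa is 0.5. Which kappa value counts as acceptable is a project decision, not something this sketch sets.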
Quality tiers, and why they exist
Not every project needs the same precision. Paying for word-perfect, phoneme-aligned, multi-pass labels when you only need a rough training signal burns money. Most providers, us included, work in tiers that trade accuracy against cost and turnaround.
- A lighter tier is a single annotator pass with automated alignment and a spot-check, fine for large-volume ASR pretraining where some noise is tolerable.
- A standard tier adds a second human review of every file and corrected timestamps.
- A premium tier layers on independent QA, strict verbatim rules, validated speaker labels, and event tagging, which is what you want for evaluation sets, TTS, or anything safety-related.
The honest move is to match the tier to the job. An evaluation set you will judge a model against deserves more scrutiny than the hundredth hour of training data. We ask buyers what the data is for before we quote, because the answer changes the workflow.
Human versus machine-assisted workflows
Almost no serious pipeline is fully manual or fully automatic anymore. The real question is how much of the first draft a model writes and how much a person fixes. A common pattern: run speech recognition and diarization to produce a draft transcript with rough timestamps, then route it to annotators who correct errors, fix boundaries, sort out overlapping speakers, and add the layers a model cannot reliably produce on its own, like intent and emotion.
This machine-assisted approach is faster and cheaper than typing from scratch, and it works well for clean, common-language audio. It falls apart fast in the conditions buyers most often struggle to source: rare languages and dialects, strong accents, children, elderly speakers, and recordings made in real-world noise. There the machine draft can be wrong enough that correcting it takes longer than starting over, so experienced annotators know when to throw the draft away and transcribe by ear. Sourcing those harder voices in the first place is its own problem, tied closely to how speech data gets collected before any labeling begins.
How QA actually works
Quality assurance on labeled audio is not one final review. It runs at several points. Before any production work, a small pilot batch goes through the full pipeline so the guidelines can be tested against real audio and tightened where annotators disagree. During production, files get checked in two main ways: a second annotator reviews the work blind, and a separate QA reviewer audits a sample against the spec.
The measurement that holds it together is inter-annotator agreement. You give the same audio to more than one person and look at how often they land on the same answer. Low agreement usually means the guideline is ambiguous, not that the people are careless, and the fix is a clearer rule rather than more pressure. For transcription, teams also track word error rate against a trusted reference, and for timestamps they set a boundary tolerance: how many milliseconds of slack is acceptable before an alignment counts as wrong.
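Both transcription checks mentioned above are simple to compute. The sketch below shows word error rate as word-level edit distance divided by reference length, plus a boundary tolerance test. The function names and the 50 ms default are assumptions for illustration; real tolerances come from the project spec, and real WER pipelines normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def boundary_within_tolerance(ref_time, hyp_time, tolerance_ms=50):
    """True if two boundary times (in seconds) differ by at most tolerance_ms."""
    return abs(ref_time - hyp_time) * 1000 <= tolerance_ms


print(word_error_rate("book a flight", "book flight"))  # one deletion in three words
print(boundary_within_tolerance(1.20, 1.23))            # 30 ms apart
print(boundary_within_tolerance(1.20, 1.30))            # 100 ms apart
```

With these inputs the WER is one third, the first boundary passes, and the second fails.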
Good QA samples by speaker, language, and recording condition rather than at random, so a problem that only appears in noisy recordings or one dialect cannot hide inside a clean overall average. The deliverable is a dataset where the labels mean the same thing in file one and file ten thousand, which is the only version of labeled audio worth training on.
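Sampling by stratum rather than at random can be sketched as follows. The metadata keys (`language`, `condition`) and the `per_stratum` parameter are hypothetical; a real pipeline would use whatever fields its collection metadata carries, and could add speaker as a further key.

```python
import random
from collections import defaultdict


def stratified_sample(files, keys=("language", "condition"), per_stratum=2, seed=0):
    """Pick up to per_stratum files from every combination of the given keys.

    files: list of dicts holding at least the metadata fields named in keys.
    """
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for item in files:
        strata[tuple(item[k] for k in keys)].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


files = [{"id": i, "language": "sv", "condition": "clean"} for i in range(10)]
files.append({"id": 10, "language": "sv", "condition": "noisy"})
picked = stratified_sample(files, per_stratum=1)
print(sorted(f["condition"] for f in picked))
# prints: ['clean', 'noisy']
```

A purely random draw of two files from this set would usually miss the single noisy recording; the stratified draw always includes it.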
Where annotation sits in the pipeline
Annotation does not stand alone. It sits between collection and training, and both shape it. If you are still deciding how much labeled audio you need, the volume drives the workflow and the tier, so it is worth settling early; our note on how much speech data a model needs is a good place to start. If you are buying finished data rather than building it, the same quality questions apply, and you should ask any vendor exactly how their transcription and annotation get checked before money changes hands.
At Spirelight we run transcription and annotation as a single, quality-controlled service across 50+ languages and dialects, with real depth in Nordic and European languages and in hard-to-source speakers and noisy conditions. You can license ready-made labeled speech datasets or commission a custom collection annotated to your spec. If you have audio that needs labeling, or a model that needs better ground truth than it has now, tell us what you are building and we will scope the annotation workflow with you.