Audio and Speech Annotation and Labeling Services

Spirelight provides audio and speech annotation services for AI teams building ASR, TTS, and voice models. We transcribe, label, and time-align raw recordings into clean, structured training data, then run human QA on every batch, so your models learn from labels that are accurate and consistent.

What we annotate

Every label a speech model needs, from transcript to emotion.

Our audio data annotation covers the full range of labels used to train and evaluate speech systems. You set the schema; we apply it consistently across every file, in every language you need.

Verbatim transcription

Human or ASR-assisted transcripts of what was actually said, with your rules for casing, numbers, disfluencies, and domain terminology applied throughout.

Speaker labels

Diarization and speaker IDs that separate voices in dialogue and multi-party audio, so overlapping speech and turn changes are attributed to the right speaker.

Event and noise tags

Non-speech events labeled in place: background noise, music, laughter, silence, and channel conditions your model needs to recognize or ignore.

Word-level timestamps

Tokens time-aligned down to the word, for forced alignment, segmentation, and any training data that needs precise start and end times.

Intent and emotion

Utterance-level intent, sentiment, and emotion labels for voice assistants, conversational AI, and affect models, applied against a defined tag set.

Custom label schemas

Have a labeling guideline of your own? We map our speech data annotation to your schema and metadata fields rather than forcing you onto ours.

By model type

Annotation for ASR, TTS, and voice AI.

Different models need different labels. We tune the audio annotation to the system you are training rather than delivering one generic transcript for everything.

ASR

For speech recognition, accuracy of the transcript is everything. We deliver verbatim text with careful handling of accents, overlaps, and domain terms, plus timestamps and speaker turns, so your acoustic and language models train on trustworthy ground truth.

TTS

For text-to-speech, prosody and cleanliness matter most. We provide precise text normalization, phrase and pause marking, and word-level alignment on studio-grade audio, so your TTS voices learn natural rhythm and pronunciation.

Voice AI and assistants

For wake-word, intent, and conversational systems, we add intent, slot, and emotion labels on top of the transcript, along with event tags for the noisy, real-world conditions these products run in.

Need the recordings collected first? Pair annotation with our speech and audio data collection services to run collection and labeling as one project.

Process and quality

Quality built into the pipeline, not bolted on at the end.

Audio data labeling is only useful if it is consistent. We run measurable quality controls on every batch so your dataset holds one standard from the first file to the last.

Inter-annotator agreement

Annotators label shared items and we measure agreement between them. Disagreements are resolved and the guideline is tightened before they can spread across the dataset.
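
As a rough illustration of how agreement on shared items can be quantified, the sketch below computes Cohen's kappa for two annotators. The metric choice is an assumption made for this example (this page does not name one), and the emotion tags and annotator lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical emotion tags from two annotators on six shared utterances.
annotator_1 = ["neutral", "happy", "angry", "neutral", "happy", "neutral"]
annotator_2 = ["neutral", "happy", "neutral", "neutral", "happy", "angry"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.455
```

A value near 1 indicates agreement well above chance, while a value near 0 suggests the guideline leaves too much to interpretation for that label.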

Review passes

High-stakes and flagged segments go through a second human review pass. Fast-turnaround work runs single-pass; you choose the depth per project.

QA sampling

Every batch is statistically sampled and checked against the spec for transcript accuracy, timing, and metadata completeness before it ships to you.
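
One way to picture a sampling check on transcript accuracy is to draw a reproducible random sample from a batch and score each delivered transcript against a reviewer's reference using word error rate. This is a minimal sketch under assumptions: the sample size, the seed, the item IDs, and the use of word error rate as the accuracy measure are all illustrative and are not taken from this page.

```python
import random


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev holds edit distances for the previous reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def sample_batch(item_ids, sample_size, seed):
    """Draw a reproducible random sample of item IDs for review."""
    rng = random.Random(seed)
    return rng.sample(item_ids, min(sample_size, len(item_ids)))


# Hypothetical batch: item ID -> (reviewer reference, delivered transcript).
batch = {
    "utt_001": ("turn the lights off", "turn the lights off"),
    "utt_002": ("book a table for two", "book table for two"),
    "utt_003": ("play some jazz", "play some jazz"),
}
for item_id in sample_batch(sorted(batch), sample_size=2, seed=7):
    reference, delivered = batch[item_id]
    print(item_id, word_error_rate(reference, delivered))
```

Fixing the seed makes the sample repeatable, so a disputed result can be re-checked on exactly the same items.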

Languages and dialects

Native-speaker annotators across 70+ languages.

Accurate speech data annotation needs people who actually speak the language and know its dialects, slang, and code-switching. Our annotators are native speakers matched to the language and region of the audio, from major world languages to lower-resource and regional varieties.

Working across many locales, or want to see what is already available? Browse our ready-made speech datasets to check coverage before commissioning custom labeling.

Formats and delivery

Delivered in the format your pipeline already reads.

We agree on the schema up front and deliver labels in the structure your training code expects, with the audio, metadata, and manifests packaged together.

Label formats

JSON, JSONL, and CSV for transcripts and metadata, plus SRT and VTT for time-aligned captions. Custom manifest schemas supported on request.
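
To show how these formats relate, here is a small sketch that converts segment records stored as JSONL into SRT cues. The field names (`speaker`, `start`, `end`, `text`) are hypothetical stand-ins for whatever schema a project agrees on, not a fixed Spirelight format.

```python
import json


def format_timestamp(seconds, separator):
    """Render seconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def jsonl_to_srt(jsonl_text):
    """Convert one-segment-per-line JSONL into SRT cue text."""
    cues = []
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    for index, line in enumerate(lines, start=1):
        seg = json.loads(line)
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        cues.append(f"{index}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(cues)


# Two hypothetical segments, one JSON object per line.
jsonl_text = "\n".join([
    json.dumps({"speaker": "spk_1", "start": 0.0, "end": 1.25, "text": "Hello there."}),
    json.dumps({"speaker": "spk_2", "start": 1.5, "end": 3.04, "text": "Hi, how are you?"}),
])
srt = jsonl_to_srt(jsonl_text)
print(srt)
```

The same records could be rendered as WebVTT by passing `"."` as the separator and prepending a `WEBVTT` header line. Caption formats carry only timing and text, so fields such as the speaker ID stay in the JSONL.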

Audio specs

WAV, FLAC, and other formats at the sample rate, bit depth, and channel layout you specify, from 8 kHz telephony to 48 kHz studio audio.

Handoff

Delivery to your cloud bucket or via API, in scheduled batches with checksums and consent linkage, so each utterance is traceable end to end.
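
On the receiving side, checksums let you confirm that every file in a batch arrived intact. The sketch below assumes SHA-256 digests and a simple mapping of relative path to hex digest; both the algorithm and the manifest layout are assumptions for illustration, since neither is specified on this page.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Return relative paths whose file is missing or whose hash differs."""
    failures = []
    for rel_path, expected in manifest.items():
        file_path = Path(root) / rel_path
        if not file_path.is_file() or sha256_of(file_path) != expected:
            failures.append(rel_path)
    return failures


# Demo with a temporary directory: one file present, one listed but missing.
with tempfile.TemporaryDirectory() as root:
    audio = Path(root) / "utt_001.wav"
    audio.write_bytes(b"placeholder audio bytes")
    manifest = {"utt_001.wav": sha256_of(audio), "utt_002.wav": "0" * 64}
    failures = verify_manifest(root, manifest)
print(failures)  # ['utt_002.wav']
```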

Why Spirelight

A speech specialist, not a generic labeling vendor.

General-purpose labeling shops treat audio as one more task type. We work only with speech and voice, and it shows in the quality of the labels.

Consent-verified data

When we collect and annotate, every recording carries documented consent and a clear licensing basis, linked to the utterance, so your training data is compliant by design.

Native speakers

Annotators are native speakers of the language they label, not generalists working from a phrasebook, so accents, dialects, and slang are transcribed correctly.

Speech specialists

Our reviewers know ASR and TTS requirements first-hand and label to them, which means fewer downstream surprises when the data hits your training pipeline.

See the full picture of how we work across collection, transcription, and QA on our services overview.

Request a quote

Tell us about your audio and we will scope it.

Send us your label requirements, languages, audio hours, and target turnaround. We will review a sample and return a fixed per-hour quote for your audio annotation project, usually within two business days.

Audio annotation FAQ

What is audio annotation?

Audio annotation is the process of adding structured labels to sound recordings so a machine learning model can learn from them. For speech, that usually means transcribing what was said, marking who said it, adding word-level timestamps, and tagging events such as background noise, laughter, or other non-speech sounds. The output is a clean, structured dataset paired with the original audio.

How is pricing structured?

Most audio data labeling projects are priced per audio hour, with the rate set by how detailed the labels are. Plain transcription costs less than word-level timestamping, speaker separation, or emotion and intent tags. Language, domain complexity, and turnaround also affect the rate. Send us a short spec and a sample, and we will return a fixed per-hour quote.

What is the typical turnaround?

A pilot batch of a few hours of audio can be delivered within days. Larger projects run in scheduled batches so you receive labeled data continuously rather than waiting for one final handoff. We agree on throughput and batch cadence during scoping and hold to them.

How do you ensure accuracy and QA?

Every project runs against a written labeling guideline. We measure inter-annotator agreement on shared items, run review passes on flagged and high-stakes segments, and sample each batch statistically before it ships. Issues are caught and fixed during production rather than discovered after delivery.

How do you handle consent and GDPR?

Audio we collect is captured from consented contributors with a documented licensing and usage basis, and consent is linked to each utterance. When we annotate audio you provide, we process it under a data processing agreement and handle it in line with GDPR. We can work inside your storage environment when required.

How is this different from a ready-made dataset?

Annotation services label your own recordings, or audio we collect to your spec, exactly to your schema. A ready-made dataset is already collected and labeled and can be licensed off the shelf today. If your requirements match an existing catalog dataset, licensing one is faster and cheaper.

Ready to start? Request a quote or explore speech and audio data collection.