Spirelight provides audio and speech annotation services for AI teams building ASR, TTS, and voice models. We transcribe, label, and time-align raw recordings into clean, structured training data, then run human QA on every batch, so your models learn from labels that are accurate, consistent, and ready to train on.
Our audio data annotation covers the full range of labels used to train and evaluate speech systems. You set the schema; we apply it consistently across every file, in every language you need.
Human or ASR-assisted transcripts of what was actually said, with your rules for casing, numbers, disfluencies, and domain terminology applied throughout.
Diarization and speaker IDs that separate voices in dialogue and multi-party audio, so overlapping and turn-taking speech is attributed correctly.
Non-speech events labeled in place: background noise, music, laughter, silence, and channel conditions your model needs to recognize or ignore.
Tokens time-aligned down to the word, for forced alignment, segmentation, and any training data that needs precise start and end times.
Utterance-level intent, sentiment, and emotion labels for voice assistants, conversational AI, and affect models, applied against a defined tag set.
Have a labeling guideline of your own? We map our speech data annotation to your schema and metadata fields rather than forcing you onto ours.
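As an illustration only, the label types above could sit together in one record per segment. The sketch below is a minimal Python example; every field name (`speaker`, `words`, `events`, `intent`, `emotion`) and the `validate` helper are assumptions made for this example, not Spirelight's actual delivery schema, which is agreed per project.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the file
    end: float


@dataclass
class Segment:
    speaker: str                                      # diarization label (hypothetical naming)
    transcript: str
    words: List[Word] = field(default_factory=list)   # word-level timestamps
    events: List[str] = field(default_factory=list)   # non-speech tags
    intent: Optional[str] = None
    emotion: Optional[str] = None


def validate(seg: Segment) -> List[str]:
    """Return a list of problems; an empty list means the segment is consistent."""
    problems = []
    prev_end = 0.0
    for w in seg.words:
        if w.end < w.start:
            problems.append(f"word {w.text!r} ends before it starts")
        if w.start < prev_end:
            problems.append(f"word {w.text!r} overlaps the previous word")
        prev_end = max(prev_end, w.end)
    if seg.words and " ".join(w.text for w in seg.words) != seg.transcript:
        problems.append("word tokens do not match the transcript")
    return problems


segment = Segment(
    speaker="spk_0",
    transcript="turn off the lights",
    words=[Word("turn", 0.12, 0.31), Word("off", 0.31, 0.48),
           Word("the", 0.48, 0.55), Word("lights", 0.55, 0.97)],
    events=["background_music"],
    intent="lights_off",
    emotion="neutral",
)

# One JSON object per line is the usual JSONL convention.
line = json.dumps(asdict(segment))
```

A check like `validate` is the kind of schema test a training pipeline might run on receipt; the rules shown (ordered, non-overlapping words that match the transcript) are one plausible choice, and a real schema would define its own.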
Different models need different labels. We tune the audio annotation to the system you are training rather than delivering one generic transcript for everything.
For speech recognition, accuracy of the transcript is everything. We deliver verbatim text, careful handling of accents, overlaps, and domain terms, plus timestamps and speaker turns so your acoustic and language models train on trustworthy ground truth.
For text-to-speech, prosody and cleanliness matter most. We provide precise text normalization, phrase and pause marking, and word-level alignment on studio-grade audio, so your TTS voices learn natural rhythm and pronunciation.
For wake-word, intent, and conversational systems, we add intent, slot, and emotion labels on top of the transcript, along with event tags for the noisy, real-world conditions these products run in.
Need the recordings collected first? Pair annotation with our speech and audio data collection services to run collection and labeling as one project.
Audio data labeling is only useful if it is consistent. We run measurable quality controls on every batch so your dataset holds one standard from the first file to the last.
Annotators label shared items and we measure agreement between them. Disagreements are resolved and the guideline is tightened before they can spread across the dataset.
High-stakes and flagged segments go through a second human review pass. Fast-turnaround work runs single-pass; you choose the depth per project.
Every batch is statistically sampled and checked against the spec for transcript accuracy, timing, and metadata completeness before it ships to you.
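Agreement between two annotators on shared items is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic illustration of that statistic; the text above does not say which agreement metric is used, so treat the choice of kappa, the function name, and the sample labels as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical event tags from two annotators on six shared clips.
annotator_a = ["speech", "music", "speech", "noise", "speech", "music"]
annotator_b = ["speech", "music", "noise", "noise", "speech", "speech"]

kappa = cohens_kappa(annotator_a, annotator_b)
# Items to send for adjudication: positions where the labels differ.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
```

Here the annotators agree on four of six clips, and the disagreeing positions (`disagreements`) are the ones that would go to adjudication and feed back into the guideline.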
Accurate speech data annotation needs people who actually speak the language and know its dialects, slang, and code-switching. Our annotators are native speakers matched to the language and region of the audio, from major world languages to lower-resource and regional varieties.
Working across many locales, or want to see what is already available? Browse our ready-made speech datasets to check coverage before commissioning custom labeling.
We agree the schema up front and deliver labels in the structure your training code expects, with the audio, metadata, and manifests packaged together.
JSON, JSONL, and CSV for transcripts and metadata, plus SRT and VTT for time-aligned captions. Custom manifest schemas supported on request.
WAV, FLAC, and other formats at the sample rate, bit depth, and channel layout you specify, from 8 kHz telephony to 48 kHz studio audio.
Delivery to your cloud bucket or via API, in scheduled batches with checksums and consent linkage, so each utterance is traceable end to end.
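On the receiving side, a delivery with checksums and consent linkage can be verified before it enters a training pipeline. The following is a minimal sketch that assumes a JSONL manifest sitting next to the audio files, with hypothetical fields `audio`, `sha256`, and `consent_id`; the real manifest layout is whatever is agreed for the project.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list:
    """Return (line_number, problem) pairs; an empty list means the batch checks out."""
    problems = []
    root = manifest_path.parent
    with manifest_path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            audio = root / entry["audio"]          # hypothetical field name
            if not audio.is_file():
                problems.append((line_no, "missing audio file"))
            elif sha256_of(audio) != entry["sha256"]:
                problems.append((line_no, "checksum mismatch"))
            if not entry.get("consent_id"):        # hypothetical field name
                problems.append((line_no, "no consent reference"))
    return problems
```

Calling `verify_manifest(Path("batch_001/manifest.jsonl"))` (a made-up path) on a downloaded batch would list any file that is missing, altered in transit, or lacking a consent reference.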
General-purpose labeling shops treat audio as one more task type. We only work with speech and voice, and it shows in the quality of the labels.
When we both collect and annotate the audio, every recording carries documented consent and a clear licensing basis, linked to the utterance, so your training data is compliant by design.
Annotators are native speakers of the language they label, not generalists working from a phrasebook, so accents, dialects, and slang are transcribed correctly.
Our reviewers know ASR and TTS requirements first-hand and label to them, which means fewer downstream surprises when the data hits your training pipeline.
See the full picture of how we work across collection, transcription, and QA on our services overview.
Send us your label requirements, languages, audio hours, and target turnaround. We will review a sample and return a fixed per-hour quote for your audio annotation project, usually within two business days.
Audio annotation is the process of adding structured labels to sound recordings so a machine learning model can learn from them. For speech that usually means transcribing what was said, marking who said it, adding word-level timestamps, and tagging events such as background noise, laughter, or non-speech sounds. The output is a clean, structured dataset paired with the original audio.
Most audio data labeling projects are priced per audio hour, with the rate set by how detailed the labels are. Plain transcription costs less than word-level timestamping, speaker separation, or emotion and intent tags. Language, domain complexity, and turnaround also affect the rate. Send us a short spec and a sample and we return a fixed per-hour quote.
A pilot batch of a few hours can be delivered within days. Larger projects run in scheduled batches so you receive labeled data continuously rather than waiting for one final handoff. We agree the throughput and batch cadence during scoping and hold to it.
Every project runs against a written labeling guideline. We measure inter-annotator agreement on shared items, run review passes on flagged and high-stakes segments, and sample each batch statistically before it ships. Issues are caught and fixed during production rather than discovered after delivery.
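Transcript accuracy on a sampled batch is often expressed as word error rate (WER): the word-level edit distance between a reviewed reference and the delivered transcript, divided by the reference length. The sketch below shows that standard calculation; it is a generic illustration, and the transcripts and function name are assumptions for the example, not a description of the checks actually run.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sampled item: one substitution ("off" -> "of") and one deletion ("kitchen").
wer = word_error_rate("turn off the kitchen lights", "turn of the lights")
```

Two errors against a five-word reference give a WER of 0.4 for this item. In practice both strings would first be normalized by the same casing and number rules as the labeling guideline, so that formatting differences are not counted as errors.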
Audio we collect is captured from consented contributors with a documented licensing and usage basis, and consent is linked to each utterance. When we annotate audio you provide, we process it under a data processing agreement and handle it in line with GDPR. We can work inside your storage environment when required.
Annotation services label your own recordings, or audio we collect to your spec, exactly to your schema. A ready-made dataset is already collected and labeled and can be licensed off the shelf today. If your requirements match an existing catalog dataset, licensing it is usually faster and cheaper than commissioning custom labeling.
Ready to start? Request a quote or explore speech and audio data collection.