How to get Danish voice training data for AI models

If you're looking for Danish voice training data to improve a speech recognition model, a voice AI product, a wake word system, or a Danish ASR pipeline, you'll quickly notice that options are more limited than for English, Spanish, or German. Danish is a niche language market, and off-the-shelf datasets rarely cover the dialects, domains, or audio quality you actually need.

In this post, I'll walk through the three main ways to get Danish speech data: using open-source datasets, working with a speech data collection company, or recording your own dataset from scratch.

Option 1: Use open-source Danish speech datasets

Open-source data is the natural starting point. It's free, it's available now, and it can be useful for building an early prototype, testing a Danish ASR pipeline, or doing initial fine-tuning before committing to a custom collection project.

One of the strongest open-source options is CoRal, a Danish conversational and read-aloud speech dataset built to improve Danish speech technology. CoRal includes speech samples across dialects, accents, genders, and age groups, and it's designed for ASR training and evaluation. Spire Light has contributed to the CoRal project, and it's a solid starting point if you need general-purpose Danish speech data.

Beyond CoRal, there are a few other public datasets worth knowing about. Mozilla Common Voice includes a Danish subset, though it's still relatively small. The March 2026 release contains 16.19 hours of recorded speech, with 12.99 validated hours from around 300 speakers. The NST Danish ASR Database is an older resource originally created for dictation and speech recognition, recorded at 16 kHz with metadata and transcriptions in CSV/JSON formats. And FT Speech is a large Danish Parliament corpus with over 1,800 hours of transcribed speech, but because it's parliamentary in nature, it won't match the patterns you'd see in customer support calls, voice assistants, or casual conversation.

The main problem with open-source datasets is that they weren't built for your specific use case. Recording quality, transcription standards, speaker demographics, file formats, and domain coverage all vary. Some are read-aloud, some are conversational, some are formal parliamentary speech. None of them will contain the exact wake words, product names, regional accents, or acoustic conditions that your model struggles with.

Open-source Danish speech data is a good place to start. It's rarely enough to get you to production quality on its own.

Why Danish voice data is difficult to get right

Once you move past prototyping, things get harder. You might need your model to handle domain-specific language like healthcare terminology, banking conversations, insurance claims, legal dictation, or public sector communication. Or you might need to improve wake word activation, keyword spotting, speaker diarization, or transcription accuracy for Danish names, addresses, and compound words.

Danish is a tricky language for speech models. Pronunciation varies a lot between speakers, soft consonants blur together in fast speech, and regional dialects introduce patterns that a model trained on standard Copenhagen Danish simply won't recognize. Even though Denmark is a small country, the pronunciation differences between Jutland, Funen, Zealand, and Bornholm are big enough to trip up a model that hasn't been exposed to them.

[Figure: Map of Danish dialect regions. For Danish voice training data, regional coverage matters because pronunciation, rhythm, and word usage can vary across the country.]

This is where custom data collection starts to make more sense than relying on generic datasets or synthetic data. With a custom project, you can have Danish speakers read specific wake words, roleplay industry conversations, pronounce difficult names and addresses, or record spontaneous dialogue, all in controlled acoustic conditions and with the dialects, age groups, and speaker profiles you actually need. You get to define the sample rate, file format, metadata structure, transcription standard, and quality control rules from the start.

Need Danish speech data without building the full pipeline yourself?

Spire Light can help you collect Danish voice training data with access to 2,000+ Danish recorders, 48 kHz 32-bit WAV recording, dialect checks, hardware checks, transcription, validation, metadata, and final dataset delivery.

Contact Spire Light about a Danish speech data project

Option 2: Work with a Danish speech data collection company

For most commercial projects, the fastest and most reliable path is working with a company that specializes in speech data collection. This is especially true if you need data at scale, or if the dataset has to meet strict technical and commercial requirements.

At Spire Light, we're based in Copenhagen and we help companies collect custom Danish speech datasets for ASR, TTS, voice assistants, wake word detection, dialogue systems, and domain-specific speech models. That can mean anything from read-aloud datasets and activation phrase collections to two-speaker dialogue recordings and industry-specific roleplay conversations across sales, support, healthcare, finance, and public sector use cases.

Our platform was built from the ground up for speech data collection, so every recording comes with structured metadata, consent tracking, automated audio quality checks, and reviewer workflows. For Danish projects, we do targeted recruitment across regions and dialects so the final dataset isn't just technically clean but also representative of the speakers your model needs to understand.

On the technical side, we deliver 48 kHz WAV audio (32-bit float where required) in mono, stereo, or dual-channel configurations depending on the project. Every dataset includes automated hardware checks before recording, dialect and region verification, speaker and prompt metadata, transcripts with timestamps, and manifests in CSV or JSON. We handle train/validation/test splits, batch delivery, and human review as part of the standard workflow.

If you need a production-grade Danish speech dataset rather than something for an internal experiment, this is usually the most efficient route.

Option 3: Create your own Danish voice dataset

You can also collect Danish voice training data yourself. This works well for small internal tests, early-stage experiments, or situations where you only need a limited number of recordings and have the bandwidth to manage the process end to end.

That said, doing it yourself means owning every part of the pipeline: recruitment, consent, recording tools, audio quality checks, transcription, metadata, validation, speaker payment, and final delivery structure. Here's roughly how the process looks.

Step 1: Define your use case

Before recruiting anyone, get clear on what the model is supposed to improve. Are you building Danish ASR for customer support calls? Wake word detection? Voice assistant commands? Conversational AI dialogue? TTS training data? Domain-specific or accent-specific recognition?

This matters more than people realize. The recording setup for wake words, where you need many short utterances from a large number of speakers, is completely different from the setup for long-form dialogue, where you might need 10- to 30-minute conversations between two people. Getting this wrong means collecting data you can't actually use.

Step 2: Set the technical audio specification

Get your technical requirements nailed down before anyone hits record. The choices you make here will affect model performance, storage costs, and how reusable the data is later on.

File format: WAV vs M4A and other lossy formats

For training data, WAV is the standard. It's uncompressed, meaning every sample is stored exactly as it was captured with no information thrown away. This matters because machine learning models rely on subtle acoustic differences that lossy compression can remove. An IBM study on audio compression and speech recognition found that MP3 caused a 10% degradation in word error rate relative to WAV/FLAC, and also hurt speaker diarization accuracy. Lossy formats like M4A (AAC) and MP3 work by permanently discarding audio data that's supposedly inaudible to humans, but research published in JMIR Biomedical Engineering showed that these compression artifacts can meaningfully alter acoustic features like pitch, jitter, and Mel-frequency cepstral coefficients (MFCCs), which are exactly the features speech models rely on.

AssemblyAI's guide on audio formats for speech-to-text puts it simply: for maximum transcription accuracy, always use a lossless format like WAV or FLAC. If storage is a concern, FLAC gives you lossless compression at roughly half the file size. Save lossy formats for deployment and streaming, not for training data collection.

Sample rate: 16 kHz vs 48 kHz

Most modern ASR models, including Wav2Vec2, MMS, and Whisper, are trained on and expect 16 kHz audio. The reason is practical: human speech sits mostly between 100 Hz and 8 kHz, and a 16 kHz sample rate (which captures frequencies up to 8 kHz according to the Nyquist theorem) preserves all the acoustic information that speech recognition actually uses. Going higher doesn't help the model. AmiVoice's technical documentation confirms that their engine internally downsamples everything above 16 kHz before processing, because speech recognition accuracy stays the same above that rate.

So why would you record at 48 kHz? Future-proofing. You can always downsample from 48 kHz to 16 kHz and end up with everything a native 16 kHz recording would have captured, but you can't go the other way: upsampling can't restore the frequencies above 8 kHz that were never recorded. If you collect at 16 kHz and later want to use the data for a TTS model, audio analysis, or a different pipeline that benefits from higher fidelity, you're stuck. Recording at 48 kHz costs more in storage (roughly 3x the file size per recording), but it gives you a master copy that works for any downstream use case. For most Danish speech data projects, our recommendation is to record at 48 kHz and downsample to 16 kHz at training time.
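The storage trade-off is simple arithmetic for uncompressed PCM audio. Here's a minimal sketch, assuming mono recordings and ignoring the small WAV header; the function name wav_bytes is ours, purely for illustration:

```python
def wav_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> int:
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    bytes_per_sample = bit_depth // 8
    return int(sample_rate_hz * bytes_per_sample * channels * seconds)


ONE_HOUR = 3600  # seconds

size_16k = wav_bytes(16_000, 16, 1, ONE_HOUR)  # 16 kHz, 16-bit, mono
size_48k = wav_bytes(48_000, 16, 1, ONE_HOUR)  # 48 kHz, 16-bit, mono

print(f"16 kHz: {size_16k / 1e6:.1f} MB per hour")
print(f"48 kHz: {size_48k / 1e6:.1f} MB per hour")
print(f"ratio:  {size_48k / size_16k:.1f}x")
```

At 16-bit mono that works out to about 115 MB per hour at 16 kHz and about 346 MB per hour at 48 kHz. Multiply by your target number of hours to budget storage for the master copies.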

Bit depth: 16-bit vs 32-bit float

Bit depth determines the dynamic range of your recordings. A 16-bit recording gives you about 96 dB of dynamic range, which means you need to set your recording levels carefully. If the gain is too low, quiet speech sits close to the noise floor. If it's too high, loud sounds clip and distort. With crowdsourced recordings where you can't control each person's microphone setup, this becomes a real problem.

32-bit float recording changes the game. As TASCAM explains in their guide to 32-bit float, the dynamic range jumps to over 1,500 dB, which is so large that the file format itself effectively can't clip. Audio that's recorded too quietly can be boosted in post-production without adding quantization noise, although any noise already picked up by the microphone and preamp gets boosted along with it. One caveat: this headroom only protects the signal after the analog-to-digital converter, so a phone or laptop microphone that overloads its own input will still distort. For speech data collection where you're dealing with many different speakers, devices, and environments, this is still a big practical advantage. Perfect Circuit's write-up on 32-bit float recording puts it well: for any scenario where you can't predict the dynamic range, 32-bit float is the safer choice.
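If you want to see where these dynamic range figures come from, the following sketch reproduces them. It assumes the usual textbook definitions: for integer PCM, the ratio between full scale and one quantization step; for 32-bit float, the ratio between the largest and smallest normal values. The function names are ours.

```python
import math


def pcm_dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of integer PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)


def float32_dynamic_range_db() -> float:
    """Ratio of the largest to the smallest normal float32 magnitude, in dB."""
    largest = (2 - 2 ** -23) * 2.0 ** 127
    smallest_normal = 2.0 ** -126
    return 20 * math.log10(largest / smallest_normal)


print(f"16-bit PCM:   {pcm_dynamic_range_db(16):.1f} dB")
print(f"24-bit PCM:   {pcm_dynamic_range_db(24):.1f} dB")
print(f"32-bit float: {float32_dynamic_range_db():.0f} dB")
```

These are properties of the number format, not of any particular microphone or recorder. Real-world dynamic range is limited by the analog hardware in front of the converter.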

The trade-off is file size. A 32-bit float WAV file is twice the size of a 16-bit file at the same sample rate. For a large-scale collection project, that adds up. But in practice, the cost of extra storage is usually much less than the cost of throwing away recordings that clipped or had an unusably low signal level.

Channel configuration: why dual-channel matters for dialogue

If you're collecting two-speaker dialogue (for conversational AI, customer support training, or diarization models), the channel setup matters a lot. In a single-channel (mono) recording, both speakers' voices are mixed together. This means you need speaker diarization algorithms to figure out who said what, and as AssemblyAI notes, multichannel transcription delivers higher accuracy because speakers are pre-separated at the recording level. Deepgram's documentation makes the same point: separating speakers into individual audio channels makes it much easier to focus on one speaker when reviewing or processing the audio.

With dual-channel recording, each speaker gets their own channel. This solves the diarization problem at the source, as HAI Data explains, because the voices are already stored in separate streams. Overlapping speech, which is one of the hardest problems in single-channel diarization, largely stops being an issue, as long as each microphone picks up little of the other speaker. Google Cloud's speech-to-text documentation also recommends transcribing separate channels individually, noting that isolated channels result in higher confidence levels in the transcription.

If your recordings are just single speakers reading prompts or commands, mono is fine. But for any dialogue data, dual-channel is worth the extra storage and setup effort.
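If your dialogue recordings arrive as a single stereo file with one speaker per channel, splitting them into per-speaker mono files is straightforward. Below is a minimal sketch using only Python's standard library. It assumes 16-bit integer PCM (the standard wave module does not read 32-bit float WAV files, so you'd need a third-party audio library for those), and the function name is ours.

```python
import array
import wave


def split_stereo_wav(stereo_path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a 16-bit stereo WAV file to its own mono file."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio with exactly two channels")
        frame_rate = src.getframerate()
        samples = array.array("h")  # signed 16-bit integers
        samples.frombytes(src.readframes(src.getnframes()))

    # Stereo PCM is interleaved: L0, R0, L1, R1, ...
    channels = ((left_path, samples[0::2]), (right_path, samples[1::2]))
    for path, channel in channels:
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(frame_rate)
            dst.writeframes(channel.tobytes())
```

Calling split_stereo_wav("call_0001.wav", "call_0001_agent.wav", "call_0001_customer.wav") would produce one file per speaker. The file names here are hypothetical; which speaker sits on which channel is something your recording setup has to define and record in the metadata.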

Beyond the raw audio spec, think about speaker IDs, timestamps or word-level alignment, consistent file naming conventions, metadata delivery in CSV or JSON, train/validation/test splits, minimum signal-to-noise ratio requirements, and rules around clipping, compression, or background noise. This is where standard meeting or conferencing software falls short. Those tools are built for communication, not for producing consistent, well-structured training data.
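Rules like these are easiest to enforce when they're checked automatically on every upload. The sketch below shows one way a basic check could look, assuming 16-bit PCM WAV input. The thresholds (0.1% clipped samples, a minimum peak of -30 dBFS) and all names are illustrative assumptions, not recommended values, and a real pipeline would add a signal-to-noise estimate, which needs noise-only segments and is left out here.

```python
import array
import math
import wave
from dataclasses import dataclass


@dataclass
class AudioReport:
    sample_rate_hz: int
    channels: int
    duration_s: float
    peak_dbfs: float
    clipped_fraction: float


def inspect_wav(path: str) -> AudioReport:
    """Summarise a 16-bit PCM WAV file for basic quality checks."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
        samples = array.array("h")
        samples.frombytes(wav.readframes(n_frames))

    full_scale = 32768
    peak = max((abs(s) for s in samples), default=0)
    # Samples sitting at the extreme values are treated as clipped.
    clipped = sum(1 for s in samples if s >= 32767 or s <= -32768)
    return AudioReport(
        sample_rate_hz=rate,
        channels=channels,
        duration_s=n_frames / rate,
        peak_dbfs=20 * math.log10(peak / full_scale) if peak else float("-inf"),
        clipped_fraction=clipped / len(samples) if samples else 0.0,
    )


def passes_spec(report: AudioReport, expected_rate_hz: int = 48_000,
                max_clipped: float = 0.001, min_peak_dbfs: float = -30.0) -> bool:
    """Apply illustrative acceptance rules to a report."""
    return (report.sample_rate_hz == expected_rate_hz
            and report.clipped_fraction <= max_clipped
            and report.peak_dbfs >= min_peak_dbfs)
```

Running a check like this at upload time means a speaker can be asked to re-record immediately, rather than discovering unusable files after the collection has closed.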

Step 3: Recruit Danish speakers

Speaker recruitment is one of the hardest parts of a Danish data collection project. You'll need to think about regional and dialect coverage, age and gender balance, native language, device type, recording environment, and whether urban versus rural speakers matter for your use case. If you're working on something domain-specific, the professional background of your speakers might matter too.

For general Danish ASR, you normally want broad coverage across regions. For domain-specific work, you're better off with fewer but more relevant speakers. For wake word models, you need many speakers saying the same phrases in different ways and environments.
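It helps to track coverage against explicit targets while recruitment is running, not after. Here's a small sketch of that idea. The speaker records, field names, and target numbers are made up for illustration; your own quotas will depend on the use case.

```python
from collections import Counter


def coverage_gaps(speakers, targets, field="region"):
    """Return how many more speakers are needed per category to meet each target."""
    counts = Counter(speaker[field] for speaker in speakers)
    return {
        name: needed - counts[name]
        for name, needed in targets.items()
        if counts[name] < needed
    }


# Hypothetical recruitment state and quotas.
speakers = [
    {"speaker_id": "spk_001", "region": "Jutland"},
    {"speaker_id": "spk_002", "region": "Jutland"},
    {"speaker_id": "spk_003", "region": "Zealand"},
    {"speaker_id": "spk_004", "region": "Funen"},
]
targets = {"Jutland": 2, "Funen": 1, "Zealand": 2, "Bornholm": 1}

print(coverage_gaps(speakers, targets))
```

With the sample data, the result shows one more speaker needed from Zealand and one from Bornholm. The same function works for age range or gender by passing a different field.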

Denmark is a small market and payment expectations are higher than in many other countries. You'll also need to make sure participants understand exactly what they're recording, how their data will be used, and what rights they're giving up. Don't underestimate the effort here.

Step 4: Record the audio

Record against the specification you set in Step 2, and capture metadata as you go. A good Danish voice dataset isn't just audio files. Each recording should come with metadata: speaker ID, age range, gender, dialect or region, native language, device type, recording date, task type, prompt ID, transcript, validation status, and audio quality scores. If you're collecting dialogue data, each speaker should ideally be recorded on a separate channel or as separate files, which makes diarization, transcription, and downstream training a lot easier.
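One way to keep that metadata consistent is to define the record structure once and generate every manifest from it. The sketch below is an assumption about how such a manifest could look, not a standard: the field names, value formats, and example row are ours, and audio quality scores are omitted for brevity.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class RecordingMeta:
    audio_path: str
    speaker_id: str
    age_range: str
    gender: str
    region: str
    native_language: str
    device_type: str
    recording_date: str  # ISO 8601 date
    task_type: str
    prompt_id: str
    transcript: str
    validation_status: str


def to_csv(rows):
    """Serialise records to CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(RecordingMeta)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def to_jsonl(rows):
    """Serialise records to JSON Lines, keeping Danish characters readable."""
    return "\n".join(json.dumps(asdict(row), ensure_ascii=False) for row in rows)


# Hypothetical example record.
rows = [
    RecordingMeta(
        audio_path="audio/spk_004/prompt_0012.wav",
        speaker_id="spk_004",
        age_range="30-39",
        gender="female",
        region="Funen",
        native_language="da",
        device_type="laptop",
        recording_date="2026-01-15",
        task_type="read_aloud",
        prompt_id="prompt_0012",
        transcript="Jeg vil gerne bestille en tid hos lægen.",
        validation_status="approved",
    )
]

print(to_csv(rows))
print(to_jsonl(rows))
```

Passing ensure_ascii=False keeps æ, ø, and å as readable characters in the JSON output instead of escape sequences, which makes manual review of transcripts much easier.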

Step 5: Transcribe and validate

After recording comes transcription, and the approach you choose here has real implications for your model's performance.

AI transcription vs human transcription vs hybrid

Fully automated ASR has gotten good. Independent benchmarks in 2026 show that leading commercial engines achieve between 4% and 8% word error rate (WER) on clean audio, which translates to roughly 92-96% accuracy. But that's on clean, well-recorded English audio. For Danish, the numbers are worse. Speechly's analysis of Whisper across languages found that Danish showed noticeable accuracy drops compared to top-performing languages like English, Italian, and German, with the medium model required to reach reasonable performance. And recent research on Whisper encoder pruning specifically flagged Danish as a lower-resource language where fine-tuning techniques that work well for English and Dutch are less effective.
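Since WER comes up in every benchmark, it's worth knowing how it's computed: the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. Here's a minimal sketch that splits on whitespace and applies no text normalization (real evaluations usually normalize casing and punctuation first). The Danish example sentence is ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "jeg vil gerne bestille en tid hos lægen"
hypothesis = "jeg vil gerne bestille en tid hos læge"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")
print(f"Word accuracy: {1 - wer:.1%}")
```

One wrong word out of eight gives a WER of 12.5%. That also shows why short utterances such as wake words and commands produce volatile WER numbers: a single error moves the score a lot.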

Professional human transcribers, by contrast, typically achieve 99%+ accuracy, though this comes at a much higher cost. NovaScribe's 2026 comparison found that AI transcription costs roughly 26 to 150 times less than human transcription and delivers results in minutes rather than hours.

For training data, a hybrid approach is usually the best bet. Use machine transcription as a first pass to get a rough transcript quickly, then have human reviewers correct errors, especially around Danish-specific challenges like compound words, soft consonants, regional pronunciation variants, and proper nouns. This gives you the speed of automation with the accuracy your model training actually requires. As PlainScribe's 2026 benchmark report puts it, going from 95% to 99% accuracy with human review costs roughly 10x more per hour of audio, so the decision comes down to how much accuracy your specific use case demands.

You'll also need to decide on a transcription standard early: clean verbatim versus full verbatim, punctuation and casing rules, how to handle false starts, filler words, overlapping speech, and unintelligible segments, and whether to include word-level timestamps. Tools like Label Studio or other speech labeling platforms can help with the annotation workflow, but you're still responsible for managing file delegation, reviewer instructions, time tracking, quality control, and payment. Danish labor costs add up quickly if the workflow isn't well automated.

Which option should you choose?

If you're just testing a model or running early experiments, start with open-source datasets like CoRal, Common Voice, NST, or FT Speech. They'll get you moving without any upfront cost.

If you need a small internal dataset, collecting it yourself is viable. Just go in with realistic expectations about the effort involved in managing audio quality, consent, transcription, and metadata.

If you need a large, clean, domain-specific Danish voice dataset for a production system, working with a dedicated speech data provider will almost always save you time and produce better results. You get more control over speaker profiles, prompts, file formats, transcription quality, and the overall delivery process.

Final checklist before starting a Danish voice data project

Before you start collecting Danish voice training data, make sure you've thought through the basics: what model you're training, what type of speech you need (read-aloud, commands, wake words, monologues, or dialogue), how many hours or utterances are required, which dialects and regions matter, and what metadata fields your pipeline expects.

On the technical side, decide whether to record at 48 kHz for archival flexibility or 16 kHz if you're certain your pipeline won't change. Choose WAV or FLAC over lossy formats for any data going into training. Consider 32-bit float if your speakers will be recording on varied devices without professional gain staging. Use dual-channel recording for any dialogue data, and plan your transcription approach (AI-first with human review is usually the sweet spot for cost and accuracy).

Getting these decisions right at the start saves a lot of rework later.

Collect Danish speech data with Spire Light

Spire Light can help with recruitment, recording, transcription, validation, metadata, and delivery, so you don't have to build the full speech data pipeline yourself.

Our Danish speech collection platform supports 48 kHz 32-bit WAV recordings, automated hardware checks, dialect and region checks, and access to 2,000+ Danish recorders.