A wake word dataset is the audio that teaches a device to wake up when it hears its name and to stay asleep the rest of the time. That second half is the hard part. The model lives on-device, runs constantly on a tiny power and compute budget, and has to pick one short phrase out of everything else a microphone catches in a kitchen, a car, or a noisy office. How well your keyword spotting works depends almost entirely on the shape of the data you train it on.
This guide covers what a wake word dataset actually needs: many positive samples across speakers, accents, distances, and noise; hard negatives and confusable phrases; far-field and device variation; and the false-accept versus false-reject tradeoff that quietly drives every collection decision. It is written for the engineer scoping a collection before any audio is recorded.
What a wake word dataset is teaching the model
Keyword spotting is a narrow, lopsided classification problem. The model hears a continuous stream and decides, many times per second, whether the last fraction of a second contained the wake phrase. Almost everything it ever hears is a negative. The positive class is rare, short, and spoken a thousand different ways. So the model is not really learning what the wake word sounds like in the abstract. It is learning to draw a boundary between your phrase and the enormous, messy space of everything that is not your phrase, including the things that sound a lot like it.
That framing changes how you collect. You are not gathering a few clean recordings of someone saying the trigger. You are gathering enough variation in the positive class that the model generalises to speakers it never heard, and enough realistic negatives that it learns where the boundary belongs. Get the balance wrong and you ship a device that either ignores its owner or wakes up at the television.
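The streaming decision described above can be sketched as a sliding window over the incoming samples. This is a minimal illustration rather than any particular product's pipeline: the function name `sliding_windows`, the one-second window, and the 100 ms hop are assumptions chosen for the example, and a real detector would hand each window to a trained model instead of just yielding it.

```python
from typing import Iterator, Sequence, Tuple


def sliding_windows(
    samples: Sequence[float],
    sample_rate: int,
    window_s: float = 1.0,  # assumed window length, not taken from the article
    hop_s: float = 0.1,     # assumed hop: roughly ten decisions per second
) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (end_time_in_seconds, window) pairs over a buffer of samples."""
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must each cover at least one sample")
    # Each step looks back over the most recent `window` samples.
    for start in range(0, len(samples) - window + 1, hop):
        end = start + window
        yield end / sample_rate, samples[start:end]


# Three seconds of silence at 16 kHz stands in for a microphone stream.
stream = [0.0] * (3 * 16000)
windows = list(sliding_windows(stream, sample_rate=16000))
print(len(windows), "decisions over 3 s of audio")
```

In normal use nearly every one of those windows is a negative, which is why the negative side of the dataset carries so much weight.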
Positives: cover the people, not just the phrase
The positive side of a wake word dataset has to span the full range of who will say the phrase and how. Different speakers, ages, and genders. A wide spread of accents and dialects, because a trigger tuned on a narrow accent band will frustrate everyone outside it. The same phrase spoken fast and slow, flat and stressed, mid-sentence and on its own, half-whispered from the sofa and shouted from another room.
Distance matters as much as voice. A phrase captured at arm's length is acoustically different from the same phrase six metres away, where reverberation smears the consonants and the signal drops into the noise floor. If the product is meant to wake from across a room, the dataset has to contain across-the-room positives, not close-mic recordings with noise added afterward. Synthetic reverb helps for augmentation, but it does not fully replace audio recorded at real distances in real rooms.
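As a rough illustration of the synthetic-reverb augmentation mentioned above, the sketch below convolves a dry signal with a made-up room impulse response: a direct path followed by noise whose amplitude decays by 60 dB over a chosen RT60. The helper names, the RT60 value, and the noise-based impulse response are all assumptions for the example. Measured impulse responses, and audio recorded at real distances, will behave differently.

```python
import random
from typing import List, Sequence


def synthetic_rir(sample_rate: int, rt60_s: float, length_s: float, seed: int = 0) -> List[float]:
    """Toy impulse response: unit direct path plus exponentially decaying noise."""
    n = int(sample_rate * length_s)
    if n < 1:
        raise ValueError("impulse response must contain at least one sample")
    rng = random.Random(seed)
    rir = []
    for i in range(n):
        t = i / sample_rate
        # Amplitude falls by a factor of 1000 (60 dB) after rt60_s seconds.
        envelope = 10.0 ** (-3.0 * t / rt60_s)
        rir.append(0.3 * rng.gauss(0.0, 1.0) * envelope)
    rir[0] = 1.0  # direct sound arrives first at full level
    return rir


def convolve(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Direct-form linear convolution; fine for short examples, slow for real audio."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


dry = [1.0, 0.5, -0.25]  # stand-in for a close-mic recording
rir = synthetic_rir(sample_rate=8000, rt60_s=0.4, length_s=0.05)
wet = convolve(dry, rir)
print(len(dry), "dry samples ->", len(wet), "reverberant samples")
```

For real clips an FFT-based convolution would be the practical choice. The direct loop is used here only to keep the example dependency-free.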
How many positives you need is a fair question, and the honest answer is that the number scales with how much variation you are trying to cover. A single-language consumer trigger needs a substantial pool of distinct speakers, each contributing several takes across conditions, rather than a handful of people repeating the phrase hundreds of times. We dig into sizing logic in our guide on how much speech data you need, and the same principle applies here: speaker diversity beats raw repetition.
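One way to check diversity rather than raw volume is to count distinct speakers per condition instead of counting clips. The sketch below assumes a hypothetical manifest of dictionaries with `speaker_id`, `accent`, and `distance` fields. Those field names, the example values, and the minimum-speaker threshold are illustrative and not a recommended schema or target.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple


def speakers_per_cell(
    manifest: Iterable[Mapping[str, str]],
    keys: Sequence[str] = ("accent", "distance"),
) -> Dict[Tuple[str, ...], int]:
    """Count distinct speakers in each combination of the given condition keys."""
    cells = defaultdict(set)
    for record in manifest:
        cell = tuple(record[k] for k in keys)
        cells[cell].add(record["speaker_id"])
    return {cell: len(speakers) for cell, speakers in cells.items()}


def thin_cells(
    manifest: Iterable[Mapping[str, str]],
    min_speakers: int,
    keys: Sequence[str] = ("accent", "distance"),
) -> List[Tuple[str, ...]]:
    """Return the condition cells covered by fewer than `min_speakers` people."""
    counts = speakers_per_cell(manifest, keys)
    return sorted(cell for cell, n in counts.items() if n < min_speakers)


# Hypothetical manifest: s01 repeats the phrase, but still counts once per cell.
manifest = [
    {"speaker_id": "s01", "accent": "scottish", "distance": "near"},
    {"speaker_id": "s01", "accent": "scottish", "distance": "near"},
    {"speaker_id": "s02", "accent": "scottish", "distance": "near"},
    {"speaker_id": "s01", "accent": "scottish", "distance": "far"},
    {"speaker_id": "s03", "accent": "indian", "distance": "far"},
]
print(thin_cells(manifest, min_speakers=2))
```

Repeated takes from one person add clips but not speakers, so a cell can look well stocked by clip count and still be thin.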
Negatives and confusable phrases: where reliability is won
Easy negatives are cheap. Hours of unrelated speech, music, and household noise teach the model that most sound is not the trigger, and you want plenty of it. The recordings that actually move false-accept rates are the hard negatives: phrases that rhyme with or partly overlap the wake word. If the trigger is two syllables, you want the near-misses that share one of them, the words that start the same way, and the phrases a person might say in normal conversation that brush up against the boundary.
This is the work that separates a hotword dataset built for a demo from one built to live in someone's home. A model that has only ever seen the true phrase and random noise will happily fire on anything in the neighbourhood of the phrase. Feeding it deliberate confusables, labelled as negatives, is how you teach the decision boundary to sit tight around the phrase rather than leave a generous margin. Unscripted conversational audio is a good source of natural near-misses, the brushes against the boundary that scripted prompts rarely produce.
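A simple way to shortlist candidate confusables before recording is to rank phrases by edit distance to the wake phrase. The sketch below uses character-level Levenshtein distance as a crude stand-in for acoustic similarity. The wake phrase "hey nova", the candidate list, and the distance cutoff are all hypothetical, and comparing phoneme sequences from a pronunciation lexicon would likely be a better proxy than spelling.

```python
from typing import Iterable, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (strings work too)."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(
                prev[j] + 1,         # delete a token from a
                curr[j - 1] + 1,     # insert a token into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def rank_confusables(
    wake_phrase: str,
    candidates: Iterable[str],
    max_distance: int,
) -> List[Tuple[int, str]]:
    """Return (distance, phrase) pairs within max_distance, closest first."""
    scored = [(edit_distance(wake_phrase, c), c) for c in candidates if c != wake_phrase]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical wake phrase and candidate negatives.
candidates = ["hey nora", "hey over", "they know", "play music"]
for distance, phrase in rank_confusables("hey nova", candidates, max_distance=3):
    print(distance, phrase)
```

The phrases that survive the cutoff are the ones worth scripting as labelled negatives. Unrelated phrases fall out and can come from the cheaper pool of easy negatives.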
Far-field and device variation
Wake word detection training data has to match the hardware it will run on, or come close. A microphone array in a smart speaker, a single mic in a phone, the far-field setup in a car cabin: each colours the audio differently, and a model trained only on studio-clean voice trigger data degrades the moment it meets a cheap mic with aggressive noise suppression. Where you can, capture through devices representative of the target, or at least through a range of microphones rather than one good one.
The in-car case is its own discipline. Road noise, the blower, music, and several passengers all compete with the trigger, and the speaker is often not facing the mic. If you are building for the cabin, the data design overlaps heavily with the wider problem of automotive voice data, where far-field capture and engine noise are the default rather than the exception. The same far-field discipline that helps a car also helps a speaker on a kitchen counter next to a running tap.
The false-accept versus false-reject tradeoff
Every wake word model sits on a threshold, and that threshold forces a choice. Lower it and the device wakes more readily, catching quiet or distant speech but also firing when it should not, a false accept. Raise it and the false accepts drop, but the device starts ignoring legitimate wakes, a false reject. For a given model you cannot minimise both at once. You can only move the operating point along the curve, and the shape of that curve is set by your data.
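The tradeoff can be made concrete by sweeping a threshold over model scores for known positives and known negatives. The sketch below is a minimal illustration: the score lists are invented, the function name `operating_points` is hypothetical, and it assumes one score per clip where a higher score means "more likely the wake phrase".

```python
from typing import List, Sequence, Tuple


def operating_points(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, false_accept_rate, false_reject_rate) for each threshold."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    points = []
    for t in thresholds:
        false_accepts = sum(1 for s in neg_scores if s >= t)  # negatives that fire
        false_rejects = sum(1 for s in pos_scores if s < t)   # positives that are missed
        points.append((t, false_accepts / len(neg_scores), false_rejects / len(pos_scores)))
    return points


# Invented scores, for illustration only.
pos_scores = [0.9, 0.8, 0.6, 0.4]
neg_scores = [0.1, 0.2, 0.5, 0.7]
for t, far, frr in operating_points(pos_scores, neg_scores, [0.3, 0.55, 0.75]):
    print(f"threshold={t:.2f}  false-accept rate={far:.2f}  false-reject rate={frr:.2f}")
```

Moving the threshold only trades one error for the other. Better data is what shifts the whole set of points.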
This is why the collection brief should start from the product's tolerance, not from a generic target. A privacy-sensitive device that must almost never wake on its own needs a wealth of hard negatives so the model can hold a high threshold without missing real users. A hands-free safety command, where a missed wake is the worse failure, needs broad, noisy positives so it still triggers when someone is stressed or far away. Decide which error you fear before you record, because it sets the ratio of positives to negatives, how many confusables you chase, and how much far-field audio you fund.
One practical consequence: evaluate on data that looks like deployment, not like training. False-accept rate is usually measured against long hours of realistic background audio with no trigger present, and false-reject rate against held-out positives in the conditions you actually care about. If your test negatives are too clean, the field numbers will be worse than the lab numbers every time. Tight labelling underpins all of this, which is where careful audio annotation earns its keep.
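The two measurements described above might be computed along these lines. This is a sketch under stated assumptions: the negative stream is a list of per-frame scores from trigger-free background audio, the positives are one score per held-out clip, and a two-second refractory period (an arbitrary choice here) stops one burst of high scores from counting as several false accepts. All names and numbers are illustrative.

```python
from typing import Sequence


def count_triggers(scores: Sequence[float], threshold: float, refractory_frames: int) -> int:
    """Count threshold crossings, ignoring frames inside the refractory period."""
    triggers = 0
    next_allowed = 0
    for i, score in enumerate(scores):
        if score >= threshold and i >= next_allowed:
            triggers += 1
            next_allowed = i + refractory_frames
    return triggers


def false_accepts_per_hour(
    background_scores: Sequence[float],
    threshold: float,
    hop_s: float,
    refractory_s: float = 2.0,  # assumed debounce period
) -> float:
    """False accepts per hour over background audio that contains no trigger."""
    if not background_scores:
        raise ValueError("need at least one frame of background audio")
    refractory_frames = max(1, int(round(refractory_s / hop_s)))
    hours = len(background_scores) * hop_s / 3600.0
    return count_triggers(background_scores, threshold, refractory_frames) / hours


def false_reject_rate(positive_clip_scores: Sequence[float], threshold: float) -> float:
    """Fraction of held-out positive clips that fail to reach the threshold."""
    if not positive_clip_scores:
        raise ValueError("need at least one held-out positive clip")
    missed = sum(1 for s in positive_clip_scores if s < threshold)
    return missed / len(positive_clip_scores)


# One hour of invented background scores at a 100 ms hop, with two spurious bursts.
background = [0.05] * 36000
background[100] = 0.90
background[101] = 0.92  # same burst as frame 100, so it is not counted again
background[5000] = 0.95
held_out_positives = [0.9, 0.8, 0.3, 0.7]

fa_per_hour = false_accepts_per_hour(background, threshold=0.5, hop_s=0.1)
frr = false_reject_rate(held_out_positives, threshold=0.5)
print(f"false accepts per hour: {fa_per_hour:.1f}, false-reject rate: {frr:.2f}")
```

Reporting false accepts per hour of background audio, rather than per clip, keeps the number comparable across test sets of different lengths.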
Where keyword spotting sits next to ASR
It helps to be clear about the boundary between a wake word system and full speech recognition. Keyword spotting is a small, always-on gatekeeper: it does one job cheaply and locally. Once it fires, the heavier ASR stack usually takes over to interpret the command, often in the cloud, with very different data needs. If you are building the full pipeline, the command-understanding half draws on the wider practice described in our guide to ASR training data, where vocabulary coverage and transcript accuracy dominate rather than the positive-versus-negative balance that defines keyword spotting data.
Keeping the two stages distinct in your planning keeps the wake word dataset focused. The trigger model does not need a broad vocabulary or perfect transcripts. It needs a tightly scoped phrase, captured across every speaker, distance, device, and noise condition you can reach, and surrounded by the negatives that hold its threshold steady.
If you are scoping a wake word collection and want positives, hard negatives, and far-field audio assembled to your product's error tolerance, browse the ready-made and custom options on our datasets page to see what a deployment-ready collection looks like.