Browse every Spirelight dataset in one place. Filter by region and language, request free samples to evaluate audio and transcripts, and license with confidence.
New to sourcing speech data? Start with our guides on speech data and AI training data.
Gujarati · India
500 hours of Gujarati spontaneous conversations with metadata and transcripts.
Norwegian · Norway
500 hours of Norwegian spontaneous conversations with metadata and transcripts.
Dutch (Western) · Netherlands
1000 hours of Dutch (Western) spontaneous conversations with metadata and transcripts.
Mandarin Chinese · Taiwan
1500 hours of Mandarin Chinese spontaneous conversations with metadata and transcripts.
Marathi · India
500 hours of Marathi spontaneous conversations with metadata and transcripts.
Indonesian · Indonesia
2000 hours of Indonesian spontaneous conversations with metadata and transcripts.
Malayalam · India
500 hours of Malayalam spontaneous conversations with metadata and transcripts.
Romanian · Romania
1000 hours of Romanian spontaneous conversations with metadata and transcripts.
Slovak · Slovakia
1000 hours of Slovak spontaneous conversations with metadata and transcripts.
English (Western) · United States
1500 hours of English (Western) spontaneous conversations with metadata and transcripts.
Urdu · Pakistan
1000 hours of Urdu spontaneous conversations with metadata and transcripts.
English (African) · South Africa
2000 hours of English (African) spontaneous conversations with metadata and transcripts.
English (Asian) · Singapore
2000 hours of English (Asian) spontaneous conversations with metadata and transcripts.
Persian (Farsi) · Iran
500 hours of Persian (Farsi) spontaneous conversations with metadata and transcripts.
Finnish · Finland
500 hours of Finnish spontaneous conversations with metadata and transcripts.
Portuguese (African) · Angola
500 hours of Portuguese (African) spontaneous conversations with metadata and transcripts.
German · Germany
1000 hours of German spontaneous conversations with metadata and transcripts.
Kannada · India
500 hours of Kannada spontaneous conversations with metadata and transcripts.
Croatian · Croatia
1500 hours of Croatian spontaneous conversations with metadata and transcripts.
Greek · Greece
1000 hours of Greek spontaneous conversations with metadata and transcripts.
Hungarian · Hungary
1000 hours of Hungarian spontaneous conversations with metadata and transcripts.
Hindi · India
2000 hours of Hindi spontaneous conversations with metadata and transcripts.
Dutch (LatAm) · Suriname
1000 hours of Dutch (LatAm) spontaneous conversations with metadata and transcripts.
Thai · Thailand
2000 hours of Thai spontaneous conversations with metadata and transcripts.
Russian · Russia
1000 hours of Russian spontaneous conversations with metadata and transcripts.
Turkish · Turkey
2000 hours of Turkish spontaneous conversations with metadata and transcripts.
Malay · Malaysia
2000 hours of Malay spontaneous conversations with metadata and transcripts.
Spanish (Western) · Spain
1000 hours of Spanish (Western) spontaneous conversations with metadata and transcripts.
Portuguese (Western) · Portugal
1000 hours of Portuguese (Western) spontaneous conversations with metadata and transcripts.
Portuguese (LatAm) · Brazil
2000 hours of Portuguese (LatAm) spontaneous conversations with metadata and transcripts.
Spanish (LatAm) · Mexico
2000 hours of Spanish (LatAm) spontaneous conversations with metadata and transcripts.
Hausa · Nigeria
2000 hours of Hausa spontaneous conversations with metadata and transcripts.
Ukrainian · Ukraine
1500 hours of Ukrainian spontaneous conversations with metadata and transcripts.
Swedish · Sweden
500 hours of Swedish spontaneous conversations with metadata and transcripts.
French (Western) · France
1000 hours of French (Western) spontaneous conversations with metadata and transcripts.
Vietnamese · Vietnam
2000 hours of Vietnamese spontaneous conversations with metadata and transcripts.
Tagalog · Philippines
2000 hours of Tagalog spontaneous conversations with metadata and transcripts.
Bengali · Bangladesh
2000 hours of Bengali spontaneous conversations with metadata and transcripts.
French (African) · DR Congo
2000 hours of French (African) spontaneous conversations with metadata and transcripts.
Arabic MSA (Modern) · Saudi Arabia
2000 hours of Arabic MSA (Modern) spontaneous conversations with metadata and transcripts.
Bulgarian · Bulgaria
1500 hours of Bulgarian spontaneous conversations with metadata and transcripts.
Hebrew · Israel
500 hours of Hebrew spontaneous conversations with metadata and transcripts.
Korean · South Korea
500 hours of Korean spontaneous conversations with metadata and transcripts.
Czech · Czechia
1000 hours of Czech spontaneous conversations with metadata and transcripts.
Yoruba · Nigeria
2000 hours of Yoruba spontaneous conversations with metadata and transcripts.
Polish · Poland
1000 hours of Polish spontaneous conversations with metadata and transcripts.
Tamil · India
500 hours of Tamil spontaneous conversations with metadata and transcripts.
Swahili · Kenya
2000 hours of Swahili spontaneous conversations with metadata and transcripts.
Punjabi · India
500 hours of Punjabi spontaneous conversations with metadata and transcripts.
Telugu · India
500 hours of Telugu spontaneous conversations with metadata and transcripts.
Catalan · Spain
500 hours of Catalan spontaneous conversations with metadata and transcripts.
Serbian · Serbia
1500 hours of Serbian spontaneous conversations with metadata and transcripts.
Italian · Italy
1000 hours of Italian spontaneous conversations with metadata and transcripts.
Japanese · Japan
500 hours of Japanese spontaneous conversations with metadata and transcripts.
Arabic (Levantine) · Lebanon
500 hours of Arabic (Levantine) spontaneous conversations with metadata and transcripts.
Arabic (Gulf) · Saudi Arabia
500 hours of Arabic (Gulf) spontaneous conversations with metadata and transcripts.
Arabic (Egyptian) · Egypt
2000 hours of Arabic (Egyptian) spontaneous conversations with metadata and transcripts.
Arabic (Darija) · Morocco
1500 hours of Arabic (Darija) spontaneous conversations with metadata and transcripts.
Amharic · Ethiopia
2000 hours of Amharic spontaneous conversations with metadata and transcripts.
Danish · Denmark
500 hours of Danish spontaneous conversations with metadata and transcripts.
We build custom speech datasets to spec: your language, dialect, recording conditions, and volume. Tell us what you need and we will scope it and send pricing.
The price shown on each dataset page is for the Spirelight Standard License: a non-exclusive commercial license to use the dataset for training, evaluating, and shipping speech and language models. The dataset is also licensed to other customers, and your trained models and their outputs remain yours. For exclusive licenses, restricted redistribution, or any custom terms, pricing is set per project. Book a call to discuss.
Exclusive licenses, where the dataset is licensed only to you, are available and negotiated per project. Book a call and tell us which dataset and what window of exclusivity you need, and we will come back with terms and pricing.
Custom datasets are often possible. Send us the language, dialect, recording conditions, speaker mix, hours, and intended use. We coordinate recording with our contributor network, transcribe, verify, and deliver in the format you need. Book a call for a timeline and a quote.
The sample bundle is a representative slice of the full dataset: same speakers where applicable, same recording conditions, same transcript style. You can validate audio quality, transcription accuracy, and speaker variety before committing.
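One way to put a number on transcription accuracy when reviewing a sample bundle is word error rate (WER): transcribe a few sample clips carefully yourself and compare your reference against the supplied transcript. The sketch below is a minimal illustration, not Spirelight tooling. The function name is hypothetical, and it assumes words are separated by whitespace, so it does not suit languages written without spaces (for example Mandarin Chinese, Japanese, or Thai) without a tokenizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # word missing from the hypothesis
                curr[j - 1] + 1,             # extra word in the hypothesis
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    mine = "the cat sat on the mat"
    supplied = "the cat sat on a mat"
    print(f"WER: {word_error_rate(mine, supplied):.3f}")
```

Punctuation and casing conventions differ between transcript styles, so normalize both sides the same way before comparing; otherwise the score reflects formatting rather than accuracy.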
Active datasets are listed above. For languages, dialects, or domains that are not in the catalogue, we build to spec. Talk to us about your requirements and we will scope a custom recording.
Each dataset lists its default formats. On request we can deliver alternate sample rates (16 kHz, 44.1 kHz, 48 kHz), MP3 or FLAC, mono or stereo audio, and transcripts as JSON, SRT, VTT, or CSV.
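If you receive transcripts in one format and need another for your own tooling, the conversion is usually small. The sketch below turns timed transcript segments into SRT text. It is an illustration under stated assumptions: the segment fields `start`, `end` (seconds), and `text` are hypothetical and are not taken from Spirelight's delivery schema, so check the actual files before adapting it.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines.

    Each segment is assumed (hypothetically) to carry 'start' and 'end'
    in seconds and the transcribed 'text'.
    """
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}"
        )
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    example = [
        {"start": 0.0, "end": 1.25, "text": "Hello, how are you?"},
        {"start": 1.5, "end": 3.0, "text": "Fine, thanks."},
    ]
    print(segments_to_srt(example))
```

VTT output differs mainly in the leading `WEBVTT` header line and in using a period instead of a comma before the milliseconds.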
Click "Request samples" on the dataset page to receive download links and the price by email. To finalize, reply with details of your team and intended use, and we will send the Spirelight Standard License and the invoice. For custom terms, exclusivity, or volume pricing, book a call instead.
Have a question that is not on this list? Book a call and tell us what you are building.