Datasets

Find the speech data you need

Browse every Spirelight dataset in one place. Filter by region and language, request free samples to evaluate audio and transcripts, and license with confidence.

New to sourcing speech data? Start with our guides on speech data and AI training data.

Showing 60 of 60 datasets

Conversational

Gujarati Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Gujarati · India

500 hours of Gujarati spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Norwegian Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Norwegian · Norway

500 hours of Norwegian spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Dutch (Western) Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Dutch (Western) · Netherlands

1000 hours of Dutch (Western) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $95/hr
Commercial license · View dataset
Conversational

Mandarin Chinese Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Mandarin Chinese · Taiwan

1500 hours of Mandarin Chinese spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $70/hr
Commercial license · View dataset
Conversational

Marathi Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Marathi · India

500 hours of Marathi spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Indonesian Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Indonesian · Indonesia

2000 hours of Indonesian spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Malayalam Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Malayalam · India

500 hours of Malayalam spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Romanian Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Romanian · Romania

1000 hours of Romanian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Slovak Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Slovak · Slovakia

1000 hours of Slovak spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $90/hr
Commercial license · View dataset
Conversational

English (Western) Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

English (Western) · United States

1500 hours of English (Western) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $90/hr
Commercial license · View dataset
Conversational

Urdu Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Urdu · Pakistan

1000 hours of Urdu spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $65/hr
Commercial license · View dataset
Conversational

English (African) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

English (African) · South Africa

2000 hours of English (African) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $60/hr
Commercial license · View dataset
Conversational

English (Asian) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

English (Asian) · Singapore

2000 hours of English (Asian) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $60/hr
Commercial license · View dataset
Conversational

Persian (Farsi) Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Persian (Farsi) · Iran

500 hours of Persian (Farsi) spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Finnish Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Finnish · Finland

500 hours of Finnish spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Portuguese (African) Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Portuguese (African) · Angola

500 hours of Portuguese (African) spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

German Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

German · Germany

1000 hours of German spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Kannada Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Kannada · India

500 hours of Kannada spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Croatian Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Croatian · Croatia

1500 hours of Croatian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $75/hr
Commercial license · View dataset
Conversational

Greek Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Greek · Greece

1000 hours of Greek spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Hungarian Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Hungarian · Hungary

1000 hours of Hungarian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $90/hr
Commercial license · View dataset
Conversational

Hindi Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Hindi · India

2000 hours of Hindi spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Dutch (LatAm) Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Dutch (LatAm) · Suriname

1000 hours of Dutch (LatAm) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $65/hr
Commercial license · View dataset
Conversational

Thai Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Thai · Thailand

2000 hours of Thai spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $75/hr
Commercial license · View dataset
Conversational

Russian Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Russian · Russia

1000 hours of Russian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Turkish Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Turkish · Turkey

2000 hours of Turkish spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $75/hr
Commercial license · View dataset
Conversational

Malay Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Malay · Malaysia

2000 hours of Malay spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Spanish (Western) Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Spanish (Western) · Spain

1000 hours of Spanish (Western) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Portuguese (Western) Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Portuguese (Western) · Portugal

1000 hours of Portuguese (Western) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Portuguese (LatAm) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Portuguese (LatAm) · Brazil

2000 hours of Portuguese (LatAm) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Spanish (LatAm) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Spanish (LatAm) · Mexico

2000 hours of Spanish (LatAm) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Hausa Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Hausa · Nigeria

2000 hours of Hausa spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Ukrainian Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Ukrainian · Ukraine

1500 hours of Ukrainian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $85/hr
Commercial license · View dataset
Conversational

Swedish Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Swedish · Sweden

500 hours of Swedish spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

French (Western) Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

French (Western) · France

1000 hours of French (Western) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $95/hr
Commercial license · View dataset
Conversational

Vietnamese Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Vietnamese · Vietnam

2000 hours of Vietnamese spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Tagalog Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Tagalog · Philippines

2000 hours of Tagalog spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Bengali Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Bengali · Bangladesh

2000 hours of Bengali spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

French (African) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

French (African) · DR Congo

2000 hours of French (African) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Arabic MSA (Modern) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Arabic MSA (Modern) · Saudi Arabia

2000 hours of Arabic MSA (Modern) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $90/hr
Commercial license · View dataset
Conversational

Bulgarian Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Bulgarian · Bulgaria

1500 hours of Bulgarian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $85/hr
Commercial license · View dataset
Conversational

Hebrew Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Hebrew · Israel

500 hours of Hebrew spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Korean Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Korean · South Korea

500 hours of Korean spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Czech Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Czech · Czechia

1000 hours of Czech spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $90/hr
Commercial license · View dataset
Conversational

Yoruba Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Yoruba · Nigeria

2000 hours of Yoruba spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Polish Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Polish · Poland

1000 hours of Polish spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $85/hr
Commercial license · View dataset
Conversational

Tamil Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Tamil · India

500 hours of Tamil spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $75/hr
Commercial license · View dataset
Conversational

Swahili Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Swahili · Kenya

2000 hours of Swahili spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Punjabi Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Punjabi · India

500 hours of Punjabi spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Telugu Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Telugu · India

500 hours of Telugu spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Catalan Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Catalan · Spain

500 hours of Catalan spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $75/hr
Commercial license · View dataset
Conversational

Serbian Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Serbian · Serbia

1500 hours of Serbian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $75/hr
Commercial license · View dataset
Conversational

Italian Language Training Dataset - 1000H Spontaneous Pair Conversational Audio and Video

Italian · Italy

1000 hours of Italian spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,000 hrs/mo · Speakers: 100 · From $75/hr
Commercial license · View dataset
Conversational

Japanese Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Japanese · Japan

500 hours of Japanese spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
Conversational

Arabic (Levantine) Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Arabic (Levantine) · Lebanon

500 hours of Arabic (Levantine) spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $65/hr
Commercial license · View dataset
Conversational

Arabic (Gulf) Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Arabic (Gulf) · Saudi Arabia

500 hours of Arabic (Gulf) spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $90/hr
Commercial license · View dataset
Conversational

Arabic (Egyptian) Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Arabic (Egyptian) · Egypt

2000 hours of Arabic (Egyptian) spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Arabic (Darija) Language Training Dataset - 1500H Spontaneous Pair Conversational Audio and Video

Arabic (Darija) · Morocco

1500 hours of Arabic (Darija) spontaneous conversations with metadata and transcripts.

Monthly capacity: 1,500 hrs/mo · Speakers: 150 · From $65/hr
Commercial license · View dataset
Conversational

Amharic Language Training Dataset - 2000H Spontaneous Pair Conversational Audio and Video

Amharic · Ethiopia

2000 hours of Amharic spontaneous conversations with metadata and transcripts.

Monthly capacity: 2,000 hrs/mo · Speakers: 200 · From $65/hr
Commercial license · View dataset
Conversational

Danish Language Training Dataset - 500H Spontaneous Pair Conversational Audio and Video

Danish · Denmark

500 hours of Danish spontaneous conversations with metadata and transcripts.

Monthly capacity: 500 hrs/mo · Speakers: 50 · From $95/hr
Commercial license · View dataset
FAQ

Licensing, custom datasets, and how to buy

What does a Spirelight dataset license cover?

The price shown on each dataset page is for the Spirelight Standard License: a non-exclusive commercial license to use the dataset for training, evaluating, and shipping speech and language models. Because the license is non-exclusive, the same dataset may also be licensed to other customers; your trained models and their outputs remain yours. For exclusive licenses, restricted redistribution, or any custom terms, pricing is set per project. Book a call to discuss.

Can I get an exclusive license?

Yes. Exclusive licenses, where the dataset is licensed only to you, are negotiated per project. Book a call and tell us which dataset and what window of exclusivity you need, and we will come back with terms and pricing.

Do you build custom datasets?

Often. Send us the language, dialect, recording conditions, speaker mix, hours, and intended use. We coordinate recording with our contributor network, transcribe, verify, and deliver in the format you need. Book a call for a timeline and a quote.

How do the samples relate to the full dataset?

The sample bundle is a representative slice of the full dataset: same speakers where applicable, same recording conditions, same transcript style. You can validate audio quality, transcription accuracy, and speaker variety before committing.
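One way to check transcription accuracy on a sample is to transcribe a few clips yourself and compare your version with the supplied transcript using word error rate (WER). The sketch below is only an illustration under that assumption: the function name and the example strings are hypothetical, not part of any Spirelight tooling, and real transcripts would usually need normalization (casing, punctuation) before comparison.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word missing from hypothesis
                curr[j - 1] + 1,                  # extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: your own transcription vs. the supplied transcript.
my_reference = "please call me back tomorrow"
supplied_transcript = "please call back tomorrow morning"
print(f"WER: {word_error_rate(my_reference, supplied_transcript):.2f}")
```

A low WER on a handful of clips is only a spot check; it says nothing about clips you did not review.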

What languages do you cover?

Active datasets are listed above. For languages, dialects, or domains that are not in the catalogue, we build to spec. Talk to us about your requirements and we will scope a custom recording.

What audio and transcript formats do you ship?

Each dataset lists its default formats. On request we can deliver alternate sample rates (16 kHz, 44.1 kHz, 48 kHz), MP3 or FLAC, mono or stereo audio, and transcripts as JSON, SRT, VTT, or CSV.

How do I license a dataset?

Click "Request samples" on the dataset page to receive download links and the price by email. To finalize, reply with your team and intended use; we send the Spirelight Standard License and the invoice. For custom terms, exclusivity, or volume pricing, book a call instead.

Have a question that is not on this list? Book a call and tell us what you are building.