Voice data collection
This guide walks you through preparing and recording voice data so your AI Agent can understand spoken messages (speech-to-text) and respond out loud in the target language (text-to-speech).
Do you already have voice data?
Option A – You already have recordings
If you have existing audio in the target language, provide:
Audio files (a minimum of 50 hours is preferred for ASR and TTS)
Matching transcripts – every audio file must be paired with a transcript that matches the spoken words exactly
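If you want to check an existing dataset against these two requirements before sending it, a minimal sketch is shown here. It assumes the recordings are .wav files and that each transcript is a .txt file with the same base name in the same folder; that layout and the function name audit_dataset are illustrative assumptions, not requirements of this guide.

```python
import wave
from pathlib import Path


def audit_dataset(folder):
    """Return (total_hours, audio files that lack a usable transcript).

    Assumed layout (illustrative only): 001.wav sits next to 001.txt.
    """
    total_seconds = 0.0
    missing = []
    for audio in sorted(Path(folder).glob("*.wav")):
        # Duration of a WAV file = number of frames / frames per second.
        with wave.open(str(audio), "rb") as w:
            total_seconds += w.getnframes() / w.getframerate()
        transcript = audio.with_suffix(".txt")
        if not transcript.exists() or not transcript.read_text(encoding="utf-8").strip():
            missing.append(audio.name)
    return total_seconds / 3600, missing
```

Calling audit_dataset("my_recordings") returns the total duration in hours and the names of any audio files without a non-empty transcript. It cannot tell whether a transcript matches the audio word for word; that still needs a human check.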
Option B – You need to create recordings
Follow steps 1 through 6 below.
Step 1 – Prepare your sentence list
Create a list of sentences in the target language for recording.
The sentence list should:
Contain 10,000 to 15,000 sentences
Cover a wide variety of vocabulary, including domain-specific terms, product names, and proper nouns relevant to your use case
Be written in the target language only
If your source material is in English, translate the sentences into the target language before preparing the list.
Output format:
001 | How do I register? | [Sentence in target language]
002 | You can register by visiting our website. | [Sentence in target language]
003 | The office is open from Monday to Friday. | [Sentence in target language]
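Before recording, it can help to check the list automatically. The sketch below takes the list as (sentence ID, sentence text) pairs and reports sentence counts outside the 10,000 to 15,000 range, duplicate IDs, and empty sentences. The function name check_sentence_list and the way the pairs are loaded are assumptions for illustration.

```python
def check_sentence_list(rows, min_count=10_000, max_count=15_000):
    """Return a list of problems found in (sentence_id, text) pairs."""
    problems = []
    if not min_count <= len(rows) <= max_count:
        problems.append(
            f"expected {min_count}-{max_count} sentences, found {len(rows)}"
        )
    seen = set()
    for sentence_id, text in rows:
        # Every sentence needs its own ID so recordings can be matched to it.
        if sentence_id in seen:
            problems.append(f"duplicate ID: {sentence_id}")
        seen.add(sentence_id)
        if not text.strip():
            problems.append(f"empty sentence for ID: {sentence_id}")
    return problems
```

An empty result means none of these three problems was found. It does not measure vocabulary coverage, which still needs a manual review.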
Step 2 – Set up your recording equipment
Audio quality is critical. Poor recordings significantly reduce AI voice performance.
Minimum hardware: entry-level laptop with a USB port
Audio: noise-cancelling headset with a high-quality microphone
Recommended headset: Sennheiser EPOS or equivalent
A built-in laptop microphone is not acceptable.
Before recording, test your setup. Record a short clip – about 30 seconds – and check that it meets the following quality standards:
Audio is clear and free from background noise
No unnatural pauses or long silences at the start or mid-sentence
No echo or reverb (a hollow, bouncing sound)
Volume is consistent throughout – not too soft, not too loud, and no distortion
No handling noise from the microphone or headset cable
Speech is clear and at a natural pace
No interruptions such as typing sounds, notifications, or sudden noises
Do not begin a full recording session until you are satisfied with your test clip.
Suggested tools for checking audio quality:
snr.audio – upload a short clip to get a reading on how clean your audio is relative to background noise
mic-tests.github.io/background-noise-analyzer – test your microphone in real time before starting a session
Audacity (free desktop app) – record and visually inspect your audio for noise, inconsistencies, and levels
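Some of the checks above can also be roughed out in a few lines of code. The sketch below reads a 16-bit PCM .wav test clip and reports its peak level, the length of any silence at the start, and the number of samples at or near full scale (a sign of distortion). The function name inspect_clip and both thresholds are illustrative assumptions, not values defined by this guide, and the script does not replace listening to the clip.

```python
import array
import sys
import wave


def inspect_clip(path, silence_threshold=0.01, clip_level=0.99):
    """Report peak level, leading silence and clipping for a 16-bit PCM WAV.

    Both thresholds are illustrative defaults (fractions of full scale).
    """
    with wave.open(str(path), "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = w.getframerate()
        channels = w.getnchannels()
        samples = array.array("h", w.readframes(w.getnframes()))
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()
    levels = [abs(s) / 32768 for s in samples]
    # Index of the first sample louder than the silence threshold.
    first_loud = next(
        (i for i, v in enumerate(levels) if v > silence_threshold), len(levels)
    )
    return {
        "peak": max(levels, default=0.0),
        "leading_silence_s": first_loud // channels / rate,
        "clipped_samples": sum(v >= clip_level for v in levels),
    }
```

A long leading silence or any clipped samples suggests re-recording the test clip before starting a full session.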
Recording environment:
Use a quiet room with soft surfaces – carpets, curtains, and cushions absorb echo
Turn off fans, air conditioning, and other constant noise sources
Close windows and doors
Keep the microphone at the same distance from your mouth throughout every session
Do not record in large, empty rooms – they produce echo
Use a pop filter if available to reduce harsh sounds on letters like "p" and "b"
Use the same room, same device, and same setup across all sessions
Step 3 – Record for text-to-speech (TTS)
TTS recordings give your AI Agent its voice. The goal is consistency.
All TTS recordings must come from a single speaker.
Same speaker throughout all sessions – do not switch between speakers
Consistent tone and pitch across every recording
Clear pronunciation – every word articulated distinctly
Natural pace – not too fast, not artificially slow
Natural pauses at punctuation – commas, full stops, and question marks should each be followed by a brief pause
Wide vocabulary – the sentence list should cover every word and term the AI Agent may need to say
Step 4 – Record for speech-to-text (ASR)
ASR recordings train your AI Agent to understand real users speaking naturally. The goal is diversity.
Multiple speakers – include male and female voices, different ages
Different accents and dialects – reflect the actual user base where possible
Varied intonations – record questions, statements, and words with different emphasis
Natural speaking styles – speakers should use their natural pace and volume
For ASR, a single sentence can have more than one recording. Having multiple speakers record the same sentence widens the range of users your AI Agent can understand.
TTS recordings can be reprocessed for ASR training. ASR recordings cannot be used for TTS.
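To see how well the ASR recordings cover different speakers, you can tally which speakers recorded each sentence. The sketch below works on (sentence ID, speaker ID) pairs; how you collect those pairs (from file names or a spreadsheet) depends on your own setup, and both function names are hypothetical.

```python
from collections import defaultdict


def speaker_coverage(recordings):
    """Map each sentence ID to the set of speakers who recorded it."""
    coverage = defaultdict(set)
    for sentence_id, speaker_id in recordings:
        coverage[sentence_id].add(speaker_id)
    return dict(coverage)


def single_speaker_sentences(recordings):
    """Return sentence IDs recorded by fewer than two distinct speakers."""
    return sorted(
        sentence_id
        for sentence_id, speakers in speaker_coverage(recordings).items()
        if len(speakers) < 2
    )
```

Sentences returned by single_speaker_sentences are candidates for additional recordings by other speakers.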
Step 5 – Name, save, and organise your files
Each recording must be saved as a separate audio file. File names must match the sentence ID exactly, and files must be organised into separate TTS and ASR folders for delivery.
Format: .mp3 or .wav
TTS – single language:
TTS – multiple languages (add the language code):
ASR – multiple speakers (add the speaker number):
ASR – multiple languages and speakers:
Folder structure for delivery:
Include the sentence list spreadsheet in each folder, with every sentence ID mapped to its corresponding text.
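A short script can confirm that every file name parses and that no sentence ID is left without a recording. The sketch below is built on a placeholder naming pattern: sentence ID, then an optional language code, then an optional speaker number, then .wav or .mp3. That pattern, the "spk" prefix, and the function name check_folder are assumptions made only for illustration; replace the pattern with the naming convention you were actually given.

```python
import re
from pathlib import Path

# HYPOTHETICAL pattern, for illustration only:
#   <sentence ID>[_<language code>][_spk<speaker number>].wav or .mp3
# Adjust it to match the naming convention you were given.
NAME_PATTERN = re.compile(
    r"^(?P<id>\d+)(?:_(?P<lang>[a-z]{2,3}))?(?:_spk(?P<speaker>\d+))?\.(?:wav|mp3)$"
)


def check_folder(folder, sentence_ids):
    """Return (badly named audio files, sentence IDs with no recording)."""
    bad_names = []
    recorded = set()
    for f in sorted(Path(folder).iterdir()):
        if f.suffix.lower() not in {".wav", ".mp3"}:
            continue  # skip non-audio files such as the sentence list spreadsheet
        match = NAME_PATTERN.match(f.name)
        if match:
            recorded.add(match.group("id"))
        else:
            bad_names.append(f.name)
    missing = sorted(set(sentence_ids) - recorded)
    return bad_names, missing
```

Run it once on the TTS folder and once on the ASR folder, passing the IDs from your sentence list. Two empty lists mean every audio file name parsed and every sentence ID has at least one recording.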
Step 6 – Final checklist before submission
Sentence list
✅ 10,000 to 15,000 sentences in the target language
✅ Wide vocabulary coverage including domain-specific terms and proper nouns
✅ Every sentence has a unique ID number
TTS recordings
✅ All recordings made by a single, consistent speaker
✅ Tone and pitch consistent across all sessions
✅ Every sentence ID has exactly one matching audio file
✅ Audio is clear – no background noise, no echo, no distortion
✅ Files named correctly and saved in the correct format
✅ Sentence list spreadsheet included in the TTS folder
ASR recordings
✅ Multiple speakers recorded
✅ Varied accents, ages, and intonations represented
✅ Every audio file is mapped to its sentence ID
✅ Audio is clear
✅ Files named correctly
✅ Sentence list spreadsheet included in the ASR folder
Common mistakes to avoid
Using a built-in laptop microphone – audio quality is too low for training
Background noise in recordings – degrades model performance significantly
Long pauses at the start of a recording or mid-sentence – audio will not align with the transcript correctly
Switching speakers mid-TTS session – produces an inconsistent AI Agent voice
Using only one speaker for ASR – limits the range of users the AI Agent can understand
Incorrect or inconsistent file naming – files cannot be matched to their transcripts
Missing files – incomplete datasets delay or prevent training