Voice data collection

This guide walks you through preparing and recording voice data so your AI Agent can understand spoken messages (speech-to-text) and respond out loud in the target language (text-to-speech).

Do you already have voice data?

Option A – You already have recordings

If you have existing audio in the target language, provide:

  • Audio files – at least 50 hours is preferred, for both ASR (automatic speech recognition, i.e. speech-to-text) and TTS (text-to-speech)

  • Matching transcripts – every audio file must be paired with a transcript that matches the spoken words exactly
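
To see how close an existing collection is to the 50-hour mark, you can add up the length of the recordings. The following is a minimal sketch, not an official tool: it assumes the audio is in .wav format (the Python standard library cannot read .mp3), and the function name total_hours and the folder name in the usage note are illustrative.

```python
import wave
from pathlib import Path


def total_hours(folder):
    """Sum the duration of every .wav file under `folder`, in hours."""
    total_seconds = 0.0
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as clip:
            # Duration of one clip = number of frames / frames per second.
            total_seconds += clip.getnframes() / clip.getframerate()
    return total_seconds / 3600
```

For example, total_hours("recordings") returns the combined length of all .wav files in a hypothetical "recordings" folder and its subfolders, which you can compare against 50.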

Option B – You need to create recordings

Follow steps 1 through 6 below.


Step 1 – Prepare your sentence list

Create a list of sentences in the target language for recording.

The sentence list should:

  • Contain 10,000 to 15,000 sentences

  • Cover a wide variety of vocabulary, including domain-specific terms, product names, and proper nouns relevant to your use case

  • Be written in the target language only

If your source material is in English, translate the sentences into the target language before preparing the list.

Output format:

ID  | English                                   | Target Language
001 | How do I register?                        | [Sentence in target language]
002 | You can register by visiting our website. | [Sentence in target language]
003 | The office is open from Monday to Friday. | [Sentence in target language]
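
Before recording, it can help to check the sentence list automatically. The sketch below assumes the list has been exported as CSV with the three columns shown above (ID, English, Target Language) and a header row; the CSV export and the function names read_rows and check_sentence_list are assumptions made for illustration.

```python
import csv
import io

MIN_SENTENCES = 10_000
MAX_SENTENCES = 15_000


def read_rows(csv_text):
    """Parse CSV text into (id, english, target) tuples, skipping the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header row: ID, English, Target Language
    return [tuple(row) for row in reader]


def check_sentence_list(rows, min_count=MIN_SENTENCES, max_count=MAX_SENTENCES):
    """Return a list of problems: duplicate IDs, empty targets, wrong size."""
    problems = []
    seen = set()
    for sentence_id, _english, target in rows:
        if sentence_id in seen:
            problems.append(f"duplicate ID: {sentence_id}")
        seen.add(sentence_id)
        if not target.strip():
            problems.append(f"missing target-language sentence: {sentence_id}")
    if not min_count <= len(rows) <= max_count:
        problems.append(f"{len(rows)} sentences; expected {min_count} to {max_count}")
    return problems
```

An empty result means the list has unique IDs, a target-language sentence on every row, and a sentence count within the 10,000 to 15,000 range.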


Step 2 – Set up your recording equipment

Audio quality is critical. Poor recordings significantly reduce AI voice performance.

Requirement         | Specification
Minimum hardware    | Entry-level laptop with USB port
Audio               | Noise-cancelling headset with high-quality microphone
Recommended headset | Sennheiser EPOS or equivalent

A built-in laptop microphone is not acceptable.

Before recording, test your setup. Record a short clip – about 30 seconds – and check that it meets the following quality standards:

  • Audio is clear and free from background noise

  • No unnatural pauses or long silences at the start or mid-sentence

  • No echo or reverb (a hollow, bouncing sound)

  • Volume is consistent throughout – not too soft, not too loud, and no distortion

  • No handling noise from the microphone or headset cable

  • Speech is clear and at a natural pace

  • No interruptions such as typing sounds, notifications, or sudden noises

Do not begin a full recording session until you are satisfied with your test clip.
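
Some of the checks above can be roughly automated, although listening to the clip is still necessary for echo, handling noise, and pace. The sketch below is one possible approach, assuming the test clip is saved as a 16-bit PCM .wav file. The function name inspect_clip and the silence threshold of 1% of full scale are illustrative assumptions, not requirements from this guide.

```python
import array
import sys
import wave


def inspect_clip(path, silence_threshold=0.01):
    """Report peak level, clipping and leading silence for a 16-bit PCM WAV clip."""
    with wave.open(str(path), "rb") as clip:
        if clip.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = clip.getframerate()
        channels = clip.getnchannels()
        samples = array.array("h", clip.readframes(clip.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("clip contains no audio")
    full_scale = 32768
    peak = max(abs(s) for s in samples) / full_scale
    # Index of the first sample louder than the (assumed) silence threshold.
    first_loud = next(
        (i for i, s in enumerate(samples) if abs(s) > silence_threshold * full_scale),
        len(samples),
    )
    return {
        "peak": peak,  # 1.0 means the loudest sample hit full scale
        "clipped": peak >= 32767 / full_scale,  # likely distortion
        "leading_silence_s": (first_loud // channels) / rate,
    }
```

A "clipped" value of True suggests the recording is too loud and distorted, a very low "peak" suggests it is too soft, and a large "leading_silence_s" points to a long pause at the start.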

Suggested tools for checking audio quality:

Recording environment:

  • Use a quiet room with soft surfaces – carpets, curtains, and cushions absorb echo

  • Turn off fans, air conditioning, and other constant noise sources

  • Close windows and doors

  • Keep the microphone at the same distance from your mouth throughout every session

  • Do not record in large, empty rooms – they produce echo

  • Use a pop filter if available to reduce harsh sounds on letters like "p" and "b"

  • Use the same room, same device, and same setup across all sessions


Step 3 – Record for text-to-speech (TTS)

TTS recordings give your AI Agent its voice. The goal is consistency.

All TTS recordings must come from a single speaker.

  • Same speaker throughout all sessions – do not switch between speakers

  • Consistent tone and pitch across every recording

  • Clear pronunciation – every word articulated distinctly

  • Natural pace – not too fast, not artificially slow

  • Natural pauses at punctuation – commas, full stops, and question marks should each have a brief pause

  • Wide vocabulary – the sentence list should cover every word and term the AI Agent may need to say


Step 4 – Record for speech-to-text (ASR)

ASR recordings train your AI Agent to understand real users speaking naturally. The goal is diversity.

  • Multiple speakers – include male and female voices, different ages

  • Different accents and dialects – reflect the actual user base where possible

  • Varied intonations – record questions, statements, and words with different emphasis

  • Natural speaking styles – speakers should use their natural pace and volume

For ASR, a single sentence can have more than one recording. Having several speakers record the same sentence widens the range of users your AI Agent can understand.

TTS recordings can be reprocessed for ASR training. ASR recordings cannot be used for TTS.
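
Because one ASR sentence can have several recordings, it can be useful to track which speakers have recorded which sentence. The sketch below assumes you keep a simple log of (sentence ID, speaker) pairs. The log format, the speaker labels, and the function names are hypothetical and not part of any required deliverable.

```python
from collections import defaultdict


def speakers_per_sentence(recordings):
    """Map each sentence ID to the set of speakers who recorded it."""
    coverage = defaultdict(set)
    for sentence_id, speaker in recordings:
        coverage[sentence_id].add(speaker)
    return dict(coverage)


def single_speaker_sentences(recordings):
    """List sentence IDs that only one speaker has recorded so far."""
    coverage = speakers_per_sentence(recordings)
    return sorted(sid for sid, speakers in coverage.items() if len(speakers) < 2)
```

Sentences returned by single_speaker_sentences are candidates for an additional recording by a different speaker if you want broader coverage.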


Step 5 – Name, save, and organise your files

Each recording must be saved as a separate audio file. File names must match the sentence ID exactly, and files must be organised into separate TTS and ASR folders for delivery.

Format: .mp3 or .wav

TTS – single language:

TTS – multiple languages (add the language code):

ASR – multiple speakers (add the speaker number):

ASR – multiple languages and speakers:

Folder structure for delivery:

Include the sentence list spreadsheet in each folder, with every sentence ID mapped to its corresponding text.
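
For the simplest case, where each file name is just the sentence ID plus the .mp3 or .wav extension, the match between files and IDs can be checked with a short script. This is a sketch under that assumption only; it does not cover names that carry a language code or speaker number, and the function name match_files_to_ids is illustrative.

```python
from collections import Counter
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav"}


def match_files_to_ids(sentence_ids, file_names):
    """Compare audio file names (ID + extension) against the sentence IDs."""
    # Count audio files per name stem, ignoring non-audio files.
    stems = Counter(
        Path(name).stem
        for name in file_names
        if Path(name).suffix.lower() in AUDIO_EXTENSIONS
    )
    ids = set(sentence_ids)
    return {
        "missing": sorted(ids - set(stems)),  # IDs with no audio file
        "duplicated": sorted(s for s, n in stems.items() if n > 1 and s in ids),
        "unexpected": sorted(set(stems) - ids),  # files with no matching ID
    }
```

For a TTS folder, all three lists should be empty: every sentence ID has exactly one audio file and no file is left without an ID.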


Step 6 – Final checklist before submission

Sentence list

✅ 10,000 to 15,000 sentences in the target language
✅ Wide vocabulary coverage including domain-specific terms and proper nouns
✅ Every sentence has a unique ID number

TTS recordings

✅ All recordings made by a single, consistent speaker
✅ Tone and pitch consistent across all sessions
✅ Every sentence ID has exactly one matching audio file
✅ Audio is clear – no background noise, no echo, no distortion
✅ Files named correctly and saved in the correct format
✅ Sentence list spreadsheet included in the TTS folder

ASR recordings

✅ Multiple speakers recorded
✅ Varied accents, ages, and intonations represented
✅ Every audio file is mapped to its sentence ID
✅ Audio is clear
✅ Files named correctly
✅ Sentence list spreadsheet included in the ASR folder


Common mistakes to avoid

Mistake                                                 | Why it matters
Using a built-in laptop microphone                      | Audio quality is too low for training
Background noise in recordings                          | Degrades model performance significantly
Long pauses at the start of a recording or mid-sentence | Audio will not align with the transcript correctly
Switching speakers mid-TTS session                      | Produces an inconsistent AI Agent voice
Using only one speaker for ASR                          | Limits the range of users the AI Agent can understand
Incorrect or inconsistent file naming                   | Files cannot be matched to their transcripts
Missing files                                           | Incomplete datasets delay or prevent training
