# Voice data collection

### Do you already have voice data?

#### Option A – You already have recordings

If you have existing audio in the target language, provide:

* Audio files (at least 50 hours preferred for ASR and TTS)
* Matching transcripts – every audio file must be paired with a transcript that matches the spoken words exactly
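
To see how close an existing collection is to the 50-hour guideline, you can total the length of your audio files. The following is a minimal sketch, assuming the recordings are uncompressed `.wav` files stored under one folder. The function names are illustrative and not part of any Proto tool, and `.mp3` files would need a third-party library to measure.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(folder):
    """Sum the durations of all .wav files under a folder, in hours."""
    seconds = sum(wav_duration_seconds(p) for p in Path(folder).rglob("*.wav"))
    return seconds / 3600
```

For example, `total_hours("recordings")` returns the combined length of every `.wav` file under a hypothetical `recordings` folder, which you can compare against the 50-hour guideline.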

#### Option B – You need to create recordings

Follow steps 1 through 6 below.

***

### Step 1 – Prepare your sentence list

Create a list of sentences in the target language for recording.

The sentence list should:

* Contain 10,000 to 15,000 sentences
* Cover a wide variety of vocabulary, including domain-specific terms, product names, and proper nouns relevant to your use case
* Be written in the target language only

If your source material is in English, translate the sentences into the target language before preparing the list, and keep each English source sentence next to its translation.

**Output format:**

| ID  | English                                   | Target Language                |
| --- | ----------------------------------------- | ------------------------------ |
| 001 | How do I register?                        | \[Sentence in target language] |
| 002 | You can register by visiting our website. | \[Sentence in target language] |
| 003 | The office is open from Monday to Friday. | \[Sentence in target language] |
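
Before recording, it can help to check the list automatically. The sketch below assumes the list has been loaded as `(ID, target sentence)` pairs, for example from a CSV export of the spreadsheet. The function name and the messages are illustrative only.

```python
from collections import Counter


def check_sentence_list(rows, min_count=10_000, max_count=15_000):
    """Check (sentence_id, target_text) pairs and return a list of problems."""
    rows = list(rows)
    problems = []

    # The list should contain 10,000 to 15,000 sentences.
    if not min_count <= len(rows) <= max_count:
        problems.append(
            f"expected {min_count}-{max_count} sentences, found {len(rows)}"
        )

    # Every sentence needs its own ID.
    counts = Counter(sentence_id for sentence_id, _ in rows)
    for sentence_id, n in counts.items():
        if n > 1:
            problems.append(f"duplicate ID {sentence_id} ({n} times)")

    # Every ID needs a sentence in the target language.
    for sentence_id, text in rows:
        if not text.strip():
            problems.append(f"empty sentence for ID {sentence_id}")

    return problems
```

An empty result means none of these three checks failed. It does not check vocabulary coverage, which still needs a manual review.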

***

### Step 2 – Set up your recording equipment

Audio quality is critical. Poor recordings significantly reduce AI voice performance.

| Requirement         | Specification                                         |
| ------------------- | ----------------------------------------------------- |
| Minimum hardware    | Entry-level laptop with USB port                      |
| Audio               | Noise-cancelling headset with high-quality microphone |
| Recommended headset | Sennheiser EPOS or equivalent                         |

> A built-in laptop microphone is not acceptable.

**Before recording, test your setup.** Record a short clip – about 30 seconds – and check that it meets the following quality standards:

* Audio is clear and free from background noise
* No unnatural pauses or long silences at the start or mid-sentence
* No echo or reverb (a hollow, bouncing sound)
* Volume is consistent throughout – not too soft, not too loud, and no distortion
* No handling noise from the microphone or headset cable
* Speech is clear and at a natural pace
* No interruptions such as typing sounds, notifications, or sudden noises

Do not begin a full recording session until you are satisfied with your test clip.
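
Two of the checks above, volume and silence at the start, can be roughly measured in code. This is a minimal sketch, assuming the test clip is saved as a 16-bit mono `.wav` file. The function name and the `silence_threshold` value are assumptions chosen for illustration, not official limits, and the script does not replace listening to the clip.

```python
import array
import sys
import wave


def inspect_clip(path, silence_threshold=0.02):
    """Return the peak level (0 to 1) and leading silence in seconds."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV files")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    # Scale each sample to the range 0..1.
    levels = [abs(s) / 32768 for s in samples]
    peak = max(levels, default=0.0)

    # Index of the first sample louder than the threshold.
    first_loud = next(
        (i for i, level in enumerate(levels) if level >= silence_threshold),
        len(levels),
    )
    return {"peak": peak, "leading_silence_s": first_loud / rate}
```

A peak very close to 1.0 suggests the recording may be distorting, and a large `leading_silence_s` value points to a long pause before speech begins.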

**Suggested tools for checking audio quality:**

* [**snr.audio**](http://snr.audio) – upload a short clip to get a reading on how clean your audio is relative to background noise
* [**mic-tests.github.io/background-noise-analyzer**](http://mic-tests.github.io/background-noise-analyzer) – test your microphone in real time before starting a session
* [**Audacity**](https://www.audacityteam.org/) (free desktop app) – record and visually inspect your audio for noise, inconsistencies, and levels

**Recording environment:**

* Use a quiet room with soft surfaces – carpets, curtains, and cushions absorb echo
* Turn off fans, air conditioning, and other constant noise sources
* Close windows and doors
* Keep the microphone at the same distance from your mouth throughout every session
* Do not record in large, empty rooms – they produce echo
* Use a pop filter if available to reduce harsh sounds on letters like "p" and "b"
* Use the same room, same device, and same setup across all sessions

***

### Step 3 – Record for text-to-speech (TTS)

TTS recordings give your AI Agent its voice. The goal is consistency.

**All TTS recordings must come from a single speaker.**

* Same speaker throughout all sessions – do not switch between speakers
* Consistent tone and pitch across every recording
* Clear pronunciation – every word articulated distinctly
* Natural pace – not too fast, not artificially slow
* Natural pauses at punctuation – commas, full stops, and question marks should each have a brief pause
* Wide vocabulary – the sentence list should cover every word and term the AI Agent may need to say

***

### Step 4 – Record for speech-to-text (ASR)

ASR recordings train your AI Agent to understand real users speaking naturally. The goal is diversity.

* Multiple speakers – include male and female voices, different ages
* Different accents and dialects – reflect the actual user base where possible
* Varied intonations – record questions, statements, and words with different emphasis
* Natural speaking styles – speakers should use their natural pace and volume

For ASR, a single sentence can have more than one recording. Multiple speakers recording the same sentence improves the range of users your AI Agent can understand.

> TTS recordings can be reprocessed for ASR training. ASR recordings cannot be used for TTS.

***

### Step 5 – Name, save, and organise your files

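Each recording must be saved as a separate audio file. Each file name must begin with the sentence ID exactly as it appears in the sentence list, followed by a language code or speaker number where needed. Files must be organised into separate TTS and ASR folders for delivery.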

**Format:** .mp3 or .wav

**TTS – single language:**

```
001.mp3
002.mp3
003.mp3
```

**TTS – multiple languages (add the language code):**

```
001_en.mp3
001_fr.mp3
```

**ASR – multiple speakers (add the speaker number):**

```
001_speaker1.mp3
001_speaker2.mp3
002_speaker1.mp3
```

**ASR – multiple languages and speakers:**

```
001_fr_speaker1.mp3
001_fr_speaker2.mp3
001_rw_speaker1.mp3
```
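
The naming patterns above can be checked with a regular expression. The sketch below is one possible reading of the convention: it assumes IDs are digits only, language codes are two or three lowercase letters, and speaker suffixes look like `speaker1`. The names `RecordingName` and `parse_recording_name` are illustrative.

```python
import re
from typing import NamedTuple, Optional


class RecordingName(NamedTuple):
    sentence_id: str
    language: Optional[str]
    speaker: Optional[int]
    extension: str


# ID, then optional language code, then optional speaker number.
_PATTERN = re.compile(
    r"^(?P<id>\d+)"
    r"(?:_(?P<lang>[a-z]{2,3}))?"
    r"(?:_speaker(?P<speaker>\d+))?"
    r"\.(?P<ext>mp3|wav)$"
)


def parse_recording_name(file_name):
    """Split a file name into its parts, or return None if it does not fit."""
    match = _PATTERN.match(file_name)
    if match is None:
        return None
    speaker = match["speaker"]
    return RecordingName(
        sentence_id=match["id"],
        language=match["lang"],
        speaker=int(speaker) if speaker else None,
        extension=match["ext"],
    )
```

Under these assumptions, `parse_recording_name("001_fr_speaker2.mp3")` yields the ID `001`, language `fr`, and speaker `2`, while a name such as `intro.mp3` returns `None` and should be renamed.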

**Folder structure for delivery:**

```
[Language name]/
├── TTS/
│   ├── 001_fr.mp3
│   ├── 002_fr.mp3
│   └── sentence_list.xlsx
└── ASR/
    ├── 001_fr_speaker1.mp3
    ├── 001_fr_speaker2.mp3
    ├── 002_fr_speaker1.mp3
    └── sentence_list.xlsx
```

Include the sentence list spreadsheet in each folder, with every sentence ID mapped to its corresponding text.
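
To confirm that a folder and its sentence list agree, you can compare the IDs in the file names with the IDs in the spreadsheet. This is a minimal sketch, assuming the sentence IDs have already been read from the spreadsheet into a list, and that the ID is the part of each file name before the first underscore. The function name and report keys are illustrative.

```python
from collections import Counter
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".wav"}


def check_audio_folder(folder, sentence_ids, one_per_id=True):
    """Compare audio files in a folder against the expected sentence IDs."""
    # Count audio files per sentence ID; other files are ignored.
    ids_found = Counter(
        p.stem.split("_")[0]
        for p in Path(folder).iterdir()
        if p.suffix.lower() in AUDIO_SUFFIXES
    )
    expected = set(sentence_ids)
    found = set(ids_found)
    return {
        # IDs in the sentence list with no recording.
        "missing": sorted(expected - found),
        # Recordings whose ID is not in the sentence list.
        "unknown": sorted(found - expected),
        # IDs with more than one recording (only relevant for TTS).
        "duplicated": sorted(i for i, n in ids_found.items() if n > 1)
        if one_per_id
        else [],
    }
```

Use the default `one_per_id=True` for the TTS folder, where each sentence has one recording, and `one_per_id=False` for the ASR folder, where several speakers may record the same sentence.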

***

### Step 6 – Final checklist before submission

**Sentence list**

✅ 10,000 to 15,000 sentences in the target language\
✅ Wide vocabulary coverage including domain-specific terms and proper nouns\
✅ Every sentence has a unique ID number

**TTS recordings**

✅ All recordings made by a single, consistent speaker\
✅ Tone and pitch consistent across all sessions\
✅ Every sentence ID has exactly one matching audio file\
✅ Audio is clear – no background noise, no echo, no distortion\
✅ Files named correctly and saved in the correct format\
✅ Sentence list spreadsheet included in the TTS folder

**ASR recordings**

✅ Multiple speakers recorded\
✅ Varied accents, ages, and intonations represented\
✅ Every audio file is mapped to its sentence ID\
✅ Audio is clear\
✅ Files named correctly\
✅ Sentence list spreadsheet included in the ASR folder

***

### Common mistakes to avoid

| Mistake                                                 | Why it matters                                        |
| ------------------------------------------------------- | ----------------------------------------------------- |
| Using a built-in laptop microphone                      | Audio quality is too low for training                 |
| Background noise in recordings                          | Degrades model performance significantly              |
| Long pauses at the start of a recording or mid-sentence | Audio will not align with the transcript correctly    |
| Switching speakers mid-TTS session                      | Produces an inconsistent AI Agent voice               |
| Using only one speaker for ASR                          | Limits the range of users the AI Agent can understand |
| Incorrect or inconsistent file naming                   | Files cannot be matched to their transcripts          |
| Missing files                                           | Incomplete datasets delay or prevent training         |

