# Text data preparation

### Step 1 – Gather your source content

Collect all existing material that covers the topics your AI Agent should be able to answer. This can include:

* FAQs
* Policies and procedures
* Customer support scripts
* Website content
* Product documentation

**Keep everything in English at this stage. Do not translate yet.**

Once collected, clean the content:

* Remove duplicate information
* Fix formatting issues
* Make sure answers are clear and complete
* Fill in any gaps – if a topic is missing that users are likely to ask about, add it now
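
The duplicate-removal step above can be partly automated. The following is a minimal sketch, assuming the collected material has already been split into short text snippets. The function names `normalise_text` and `remove_duplicates` are hypothetical and not part of any Proto tooling. It only catches snippets that match after normalisation, so reworded duplicates still need manual review.

```python
import re


def normalise_text(text: str) -> str:
    """Lowercase, collapse whitespace, and strip surrounding punctuation."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return collapsed.strip(".,;:!?")


def remove_duplicates(snippets: list[str]) -> list[str]:
    """Keep the first occurrence of each snippet, comparing normalised text."""
    seen: set[str] = set()
    unique: list[str] = []
    for snippet in snippets:
        key = normalise_text(snippet)
        if key and key not in seen:
            seen.add(key)
            unique.append(snippet.strip())
    return unique


source = [
    "Registration takes approximately five minutes.",
    "registration takes   approximately five minutes",
    "You will need a valid email address.",
]
cleaned = remove_duplicates(source)
print(cleaned)
```

Here the second snippet is dropped because it differs from the first only in case, spacing, and the final full stop.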

***

### Step 2 – Organise your content into topics

For each topic your AI Agent should handle, define three things:

* **Topic** – a short label for the subject (e.g. "Account registration")
* **Answer** – the exact response the AI Agent will give
* **Questions** – 10 to 15 different ways a user might ask about that topic

> **Why questions matter most:** The AI Agent learns to recognise what a user is asking by matching their message against these example questions. Without at least 10 varied questions per topic, the AI Agent cannot reliably route users to the correct response.

**Example** (only the first five questions are shown; a complete topic needs at least 10):

<table><thead><tr><th width="156.30078125">Field</th><th>Content</th></tr></thead><tbody><tr><td>Topic</td><td>Account registration</td></tr><tr><td>Answer</td><td>To register an account, visit our website and select Sign up in the top right corner. You will need your national ID number, a valid email address, and a phone number. Registration takes approximately five minutes.</td></tr><tr><td>Question 1</td><td>How do I sign up?</td></tr><tr><td>Question 2</td><td>How can I create an account?</td></tr><tr><td>Question 3</td><td>What do I need to register?</td></tr><tr><td>Question 4</td><td>I want to open an account</td></tr><tr><td>Question 5</td><td>Where do I go to register?</td></tr></tbody></table>
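
If you keep your topics in a script or notebook before moving them into a spreadsheet, the three fields can be modelled directly. This is an illustrative sketch only: the `Topic` class, its field names, and the `problems` method are assumptions made for this example, not a format the AI Agent requires. The 10-question minimum comes from this guide.

```python
from dataclasses import dataclass, field

MIN_QUESTIONS = 10  # minimum number of questions per topic stated in this guide


@dataclass
class Topic:
    label: str
    answer: str
    questions: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the topic passes."""
        issues = []
        if not self.answer.strip():
            issues.append(f"{self.label}: answer is empty")
        if len(self.questions) < MIN_QUESTIONS:
            missing = MIN_QUESTIONS - len(self.questions)
            issues.append(f"{self.label}: needs {missing} more question(s)")
        return issues


registration = Topic(
    label="Account registration",
    answer="To register an account, visit our website and select Sign up.",
    questions=[
        "How do I sign up?",
        "How can I create an account?",
        "What do I need to register?",
        "I want to open an account",
        "Where do I go to register?",
    ],
)
print(registration.problems())
# ['Account registration: needs 5 more question(s)']
```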

**Tips for writing good questions:**

* Write questions the way real users would – natural, not formal
* Vary the phrasing significantly rather than swapping a word or two. "How do I sign up?" and "How can I sign up?" are essentially the same question, so find genuinely different ways to ask
* Cover different levels of formality and the different ways people frame the same request
* You can use an AI tool to help generate question variations with a prompt such as: *"Generate 10–15 different ways a user might ask: \[your question here]."*
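
To spot questions that differ only by a minor word swap, a rough character-level comparison can help. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The function name `too_similar` and the 0.8 threshold are assumptions chosen for illustration, so tune the threshold on your own data. This measures surface similarity only and cannot tell whether two differently worded questions mean the same thing.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 0.8  # assumed cut-off, not a value from this guide


def too_similar(questions, threshold=SIMILARITY_THRESHOLD):
    """Return (question_a, question_b, ratio) for pairs at or above the threshold."""
    flagged = []
    for a, b in combinations(questions, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged


questions = [
    "How do I sign up?",
    "How can I sign up?",
    "I want to open an account",
]
for a, b, ratio in too_similar(questions):
    print(f"{ratio}: {a!r} vs {b!r}")
```

Any pair that is printed is a candidate for rewriting one of the two questions.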

**Work in batches of 10 topics at a time.** Review each batch before moving to the next.

***

### Step 3 – Check for duplicates

Before translating, review the full dataset to ensure no question appears under more than one topic. If the same question maps to two different topics, the AI Agent will not know which response to give.

**Example of a problem to fix:** "How do I check my balance?" should not appear under both "Account balance" and "Loan balance." Assign it to one topic only and remove it from the other.
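
A check like this can be scripted. The following is a minimal sketch, assuming the dataset is held as a dictionary that maps each topic label to its list of questions. The names `question_key` and `cross_topic_duplicates` are hypothetical. Questions are compared after lowercasing and trimming end punctuation, so it finds exact repeats but not paraphrases.

```python
from collections import defaultdict


def question_key(question: str) -> str:
    """Compare questions case-insensitively, ignoring extra spaces and end punctuation."""
    return " ".join(question.lower().split()).rstrip("?.! ")


def cross_topic_duplicates(dataset: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each question found under more than one topic to those topics."""
    topics_by_question = defaultdict(set)
    for topic, questions in dataset.items():
        for question in questions:
            topics_by_question[question_key(question)].add(topic)
    return {
        question: sorted(topics)
        for question, topics in topics_by_question.items()
        if len(topics) > 1
    }


dataset = {
    "Account balance": ["How do I check my balance?", "What is my account balance?"],
    "Loan balance": ["how do I check my balance", "How much do I still owe on my loan?"],
}
conflicts = cross_topic_duplicates(dataset)
print(conflicts)
# {'how do i check my balance': ['Account balance', 'Loan balance']}
```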

**Review checklist:**

✅ Every topic has a clear, accurate answer

✅ Every topic has at least 10 questions

✅ No question appears under more than one topic

✅ Questions are varied and natural – not just minor rephrasings of each other

✅ All topics a user is likely to ask about are covered

***

### Step 4 – Translate the dataset

Work through the dataset row by row, translating each topic, answer, and question one at a time.

**Instructions for the translator:**

1. Translate one sentence at a time – do not merge or combine sentences
2. Keep each translation in the same row as its English source, and do not change the ID numbers
3. Aim for natural phrasing – translate the meaning, not word for word
4. Do not remove or skip any questions
5. If a term has no direct equivalent in the target language, use the term your audience would naturally use – and flag it with a note for review
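
Before accepting a translated file, it can be worth confirming that no rows were skipped, added, or left blank. The sketch below assumes both the English source and the translation are available as lists of dictionaries. The keys `id`, `english`, and `translation`, and the function name `alignment_problems`, are assumptions for this example rather than a required format.

```python
def alignment_problems(source_rows, translated_rows):
    """Compare translated rows with the English source, matching on ID."""
    problems = []
    source_ids = [row["id"] for row in source_rows]
    translated = {row["id"]: row for row in translated_rows}

    # Every source row must have a non-empty translation.
    for row_id in source_ids:
        row = translated.get(row_id)
        if row is None:
            problems.append(f"{row_id}: row missing from translation")
        elif not row["translation"].strip():
            problems.append(f"{row_id}: translation is empty")

    # The translation must not introduce IDs that the source does not have.
    for row_id in translated:
        if row_id not in source_ids:
            problems.append(f"{row_id}: ID not present in source")
    return problems


source_rows = [
    {"id": "001", "english": "Account registration"},
    {"id": "002", "english": "To register an account, visit our website..."},
    {"id": "003", "english": "How do I sign up?"},
]
translated_rows = [
    {"id": "001", "translation": "[Translation]"},
    {"id": "003", "translation": ""},
]
print(alignment_problems(source_rows, translated_rows))
# ['002: row missing from translation', '003: translation is empty']
```

A matching set of IDs does not prove that a translator kept sentences separate, so a spot check by a reviewer is still needed.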

**Output format:**

| ID  | Type     | English                                      | \[Target language] |
| --- | -------- | -------------------------------------------- | ------------------ |
| 001 | Topic    | Account registration                         | \[Translation]     |
| 002 | Answer   | To register an account, visit our website... | \[Translation]     |
| 003 | Question | How do I sign up?                            | \[Translation]     |
| 004 | Question | How can I create an account?                 | \[Translation]     |
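
A sheet in this layout can be generated from structured topics rather than typed by hand. This is a sketch under stated assumptions: the input dictionary keys (`topic`, `answer`, `questions`), the function names, and the choice of CSV are all illustrative. It also assumes IDs are zero-padded and keep counting up across topics, which matches the sample rows above but is not specified in this guide.

```python
import csv
import io

COLUMNS = ["ID", "Type", "English", "Translation"]


def build_rows(topics):
    """Flatten topics into one row per item, with zero-padded sequential IDs."""
    rows = []
    for topic in topics:
        items = [("Topic", topic["topic"]), ("Answer", topic["answer"])]
        items += [("Question", question) for question in topic["questions"]]
        for row_type, text in items:
            rows.append({
                "ID": f"{len(rows) + 1:03d}",
                "Type": row_type,
                "English": text,
                "Translation": "",  # left blank for the translator to fill in
            })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


topics = [{
    "topic": "Account registration",
    "answer": "To register an account, visit our website...",
    "questions": ["How do I sign up?", "How can I create an account?"],
}]
rows = build_rows(topics)
print(to_csv(rows))
```

Rename the `Translation` column to the target language before sending the file to the translator.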

***

### Step 5 – Submit your dataset

Once translation is complete, submit the following to your Proto contact:

✅ Reference documents (English source material)

✅ Completed Q\&A dataset (English + translated)

✅ Glossary of key terms (optional, but recommended for specialised vocabulary)

***

### Common mistakes to avoid

| Mistake                                          | Why it matters                                                                              |
| ------------------------------------------------ | ------------------------------------------------------------------------------------------- |
| Translating before structuring                   | Raw translated content cannot be used to configure the AI Agent – structure must come first |
| Providing raw documents instead of Q\&A format   | The AI Agent cannot use unstructured content                                                |
| Fewer than 10 questions per topic                | The AI Agent will not reliably recognise that topic                                         |
| Questions that are too similar to each other     | Reduces the range of real user messages the AI Agent can match                              |
| Skipping or merging questions during translation | Breaks the dataset and causes rework                                                        |


