Text data preparation

This guide walks you through preparing the text content needed to add a new language to your AI Agent. Follow each step in order.

Step 1 – Gather your source content

Collect all existing material that covers the topics your AI Agent should be able to answer. This can include:

  • FAQs

  • Policies and procedures

  • Customer support scripts

  • Website content

  • Product documentation

Keep everything in English at this stage. Do not translate yet.

Once collected, clean the content:

  • Remove duplicate information

  • Fix formatting issues

  • Make sure answers are clear and complete

  • Fill in any gaps – if users are likely to ask about a topic that is not yet covered, add it now


Step 2 – Organise your content into topics

For each topic your AI Agent should handle, define three things:

  • Topic – a short label for the subject (e.g. "Account registration")

  • Answer – the exact response the AI Agent will give

  • Questions – 10 to 15 different ways a user might ask about that topic

Why questions matter most: The AI Agent learns to recognise what a user is asking by matching their message against these example questions. Without at least 10 varied questions per topic, the AI Agent cannot reliably route users to the correct response.

Example:

| Field | Content |
| --- | --- |
| Topic | Account registration |
| Answer | To register an account, visit our website and select Sign up in the top right corner. You will need your national ID number, a valid email address, and a phone number. Registration takes approximately five minutes. |
| Question 1 | How do I sign up? |
| Question 2 | How can I create an account? |
| Question 3 | What do I need to register? |
| Question 4 | I want to open an account |
| Question 5 | Where do I go to register? |
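If you keep your dataset in a script or notebook, the three fields can be represented as a small record. The sketch below is only an illustration: the `Topic` class, the `topic_problems` helper, and the constant name are hypothetical and not part of any Proto tooling. It checks that a topic has a label, an answer, and at least 10 questions, as this guide requires.

```python
from dataclasses import dataclass, field

MIN_QUESTIONS = 10  # minimum number of questions per topic stated in this guide


@dataclass
class Topic:
    """One row of the dataset: a label, one answer, and example questions."""
    name: str
    answer: str
    questions: list[str] = field(default_factory=list)


def topic_problems(topic: Topic) -> list[str]:
    """Return a list of human-readable problems; an empty list means the topic passes."""
    problems = []
    if not topic.name.strip():
        problems.append("missing topic label")
    if not topic.answer.strip():
        problems.append("missing answer")
    if len(topic.questions) < MIN_QUESTIONS:
        problems.append(
            f"only {len(topic.questions)} questions, need at least {MIN_QUESTIONS}"
        )
    return problems


# The example topic from the table above, which so far has five questions.
registration = Topic(
    name="Account registration",
    answer="To register an account, visit our website and select Sign up in the top right corner.",
    questions=[
        "How do I sign up?",
        "How can I create an account?",
        "What do I need to register?",
        "I want to open an account",
        "Where do I go to register?",
    ],
)

print(topic_problems(registration))
# → ['only 5 questions, need at least 10']
```

The example topic is flagged because the table shows only five of the required questions; a finished topic would return an empty list.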

Tips for writing good questions:

  • Write questions the way real users would – natural, not formal

  • Vary the phrasing significantly, not just minor word swaps. "How do I sign up?" and "How can I sign up?" are essentially the same question – find genuinely different ways to ask

  • Cover different levels of formality and the different ways people frame the same request

  • You can use an AI tool to help generate question variations with a prompt such as: "Generate 10–15 different ways a user might ask: [your question here]."
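To spot questions that differ only by a minor word swap, a rough character-level comparison can help as a first pass. The sketch below uses Python's standard `difflib` module; the `too_similar` helper and the 0.8 threshold are assumptions chosen for illustration, not how the AI Agent itself matches messages. It will not catch questions that mean the same thing in very different words, so a manual review is still needed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(question: str) -> str:
    """Lower-case, drop question marks, and collapse whitespace before comparing."""
    return " ".join(question.lower().replace("?", "").split())


def too_similar(questions: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of questions whose character similarity is at or above the threshold."""
    flagged = []
    for first, second in combinations(questions, 2):
        ratio = SequenceMatcher(None, normalise(first), normalise(second)).ratio()
        if ratio >= threshold:
            flagged.append((first, second))
    return flagged


questions = [
    "How do I sign up?",
    "How can I sign up?",
    "What do I need to register?",
]

for first, second in too_similar(questions):
    print(f"Too similar: {first!r} / {second!r}")
```

With these three questions, only the "How do I sign up?" / "How can I sign up?" pair is reported; replace one of the two with a genuinely different phrasing.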

Work in batches of 10 topics at a time. Review each batch before moving to the next.


Step 3 – Check for duplicates

Before translating, review the full dataset to ensure no question appears under more than one topic. If the same question maps to two different topics, the AI Agent will not know which response to give.

Example of a problem to fix: "How do I check my balance?" should not appear under both "Account balance" and "Loan balance." Assign it to one topic only and remove it from the other.
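For a larger dataset, this check can be scripted. The following is a minimal sketch that assumes the dataset is held as a dictionary mapping each topic label to its list of questions; that layout and the `cross_topic_duplicates` helper are hypothetical, and the second question under each topic is made up for illustration. Questions are compared after lower-casing and stripping trailing punctuation, so only exact repeats are caught.

```python
from collections import defaultdict


def cross_topic_duplicates(dataset: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each question that appears under more than one topic to those topics."""
    topics_by_question = defaultdict(set)
    for topic, questions in dataset.items():
        for question in questions:
            # Normalise so "How do I check my balance?" and "how do i check my balance" match.
            key = " ".join(question.lower().split()).rstrip("?!. ")
            topics_by_question[key].add(topic)
    return {
        question: sorted(topics)
        for question, topics in topics_by_question.items()
        if len(topics) > 1
    }


dataset = {
    "Account balance": ["How do I check my balance?", "What is my account balance?"],
    "Loan balance": ["How do I check my balance?", "How much do I still owe on my loan?"],
}

print(cross_topic_duplicates(dataset))
# → {'how do i check my balance': ['Account balance', 'Loan balance']}
```

Each reported question should be kept under one topic only and removed from the others.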

Review checklist:

✅ Every topic has a clear, accurate answer

✅ Every topic has at least 10 questions

✅ No question appears under more than one topic

✅ Questions are varied and natural – not just minor rephrasings of each other

✅ All topics a user is likely to ask about are covered


Step 4 – Translate the dataset

Work through the dataset row by row, translating each topic, answer, and question one at a time into the target-language column.

Instructions for the translator:

  1. Translate one sentence at a time – do not merge or combine sentences

  2. Keep every translation on the same row, and under the same ID number, as its English source

  3. Aim for natural phrasing – translate the meaning, not word for word

  4. Do not remove or skip any questions

  5. If a term has no direct equivalent in the target language, use the term your audience would naturally use – and flag it with a note for review

Output format:

| ID | Type | English | [Target language] |
| --- | --- | --- | --- |
| 001 | Topic | Account registration | [Translation] |
| 002 | Answer | To register an account, visit our website... | [Translation] |
| 003 | Question | How do I sign up? | [Translation] |
| 004 | Question | How can I create an account? | [Translation] |
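One way to keep rows and IDs aligned is to generate the sheet from the English dataset and then check the translator's copy against it. The sketch below assumes each row is a dictionary with the keys `ID`, `Type`, `English`, and `Translation`; the key names and both helper functions are hypothetical, and `Translation` stands in for the target-language column.

```python
def build_rows(topic: str, answer: str, questions: list[str], start_id: int = 1) -> list[dict]:
    """Create one row per item, with zero-padded sequential IDs and an empty translation."""
    entries = [("Topic", topic), ("Answer", answer)]
    entries += [("Question", question) for question in questions]
    return [
        {"ID": f"{start_id + offset:03d}", "Type": kind, "English": text, "Translation": ""}
        for offset, (kind, text) in enumerate(entries)
    ]


def alignment_problems(source_rows: list[dict], translated_rows: list[dict]) -> list[str]:
    """Report rows that were skipped, altered, or left untranslated."""
    problems = []
    translated_by_id = {row["ID"]: row for row in translated_rows}
    for row in source_rows:
        match = translated_by_id.get(row["ID"])
        if match is None:
            problems.append(f"{row['ID']}: row missing from translated sheet")
        elif match["English"] != row["English"]:
            problems.append(f"{row['ID']}: English text no longer matches the source")
        elif not match["Translation"].strip():
            problems.append(f"{row['ID']}: translation is empty")
    return problems


source = build_rows(
    topic="Account registration",
    answer="To register an account, visit our website...",
    questions=["How do I sign up?", "How can I create an account?"],
)

# Simulate a returned sheet in which the translator skipped row 004.
returned = [dict(row, Translation="[Translation]") for row in source if row["ID"] != "004"]

print(alignment_problems(source, returned))
# → ['004: row missing from translated sheet']
```

An empty list means every English row came back with the same ID, unchanged source text, and a non-empty translation.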


Step 5 – Submit your dataset

Once translation is complete, submit the following to your Proto contact:

✅ Reference documents (English source material)

✅ Completed Q&A dataset (English + translated)

✅ Glossary of key terms (optional, but recommended for specialised vocabulary)


Common mistakes to avoid

| Mistake | Why it matters |
| --- | --- |
| Translating before structuring | Raw translated content cannot be used to configure the AI Agent – structure must come first |
| Providing raw documents instead of Q&A format | The AI Agent cannot use unstructured content |
| Fewer than 10 questions per topic | The AI Agent will not reliably recognise that topic |
| Questions that are too similar to each other | Reduces the range of real user messages the AI Agent can match |
| Skipping or merging questions during translation | Breaks the dataset and causes rework |
