Question Answering data preparation | Distil labs

Before training can start, you need to upload all the necessary data for your training job. Here, we will focus on question answering, where the model extracts or generates precise answers from text based on specific queries. We will use an example of extracting information from invoices to demonstrate the data format.

Job description

The job description describes the work you expect the model to perform. You can think of it as an LLM prompt that helps the model solve your task.

The expected format is a JSON file with the following fields:

task_description — describes what the model should do and how it should format its outputs
input_description — describes what the input data looks like

Optionally, you can include llm_as_a_judge_instructions, which contains the instructions given to the LLM judge when it evaluates answers.

job_description.json

{
    "task_description": "Extract the requested information from the provided invoice text. Return only the specific value asked for, without additional explanation. If the information is not found, respond with 'Not found'.",
    "input_description": "Invoice text containing details such as invoice number, date, vendor name, line items, subtotal, tax, and total amount. The question will ask for a specific piece of information from the invoice.",
    "llm_as_a_judge_instructions": "Compare the predicted answer to the reference answer for the given question. Output 'good' if the prediction matches the reference value or is semantically equivalent, otherwise output 'bad'."
}
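The field list above can be checked before upload with a few lines of Python. This is a minimal sketch, not part of the Distil labs tooling: the helper name `validate_job_description` and the shortened example strings are assumptions made for illustration, and the only rule it enforces is that the two required fields are non-empty strings.

```python
import json

# Required fields as listed in this section.
REQUIRED_FIELDS = ("task_description", "input_description")


def validate_job_description(job: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields are present."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = job.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    return problems


# Hypothetical, shortened job description used only for this sketch.
job = {
    "task_description": "Extract the requested information from the provided invoice text.",
    "input_description": "Invoice text followed by a question about one value.",
}

problems = validate_job_description(job)
# Serialize with the same 4-space indentation as the example file above.
serialized = json.dumps(job, indent=4)
```

Writing `serialized` to job_description.json only when `problems` is empty keeps an incomplete file from being uploaded.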

Train/test data

We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. Larger and more diverse datasets are better, but for the training stage a few dozen examples are enough (we’ll generate many more based on the examples you provide).

The expected format is CSV or JSONL with the following columns:

Column	Description
`question`	The input text or question the model must process
`answer`	The expected output or answer

JSONL format

train.jsonl

{"question": "Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. What is the total amount?", "answer": "$540"}
{"question": "Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. What is the invoice number?", "answer": "1234"}
{"question": "Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. Who is the vendor?", "answer": "Acme Corp"}
{"question": "Invoice #5678 from Global Services dated 2024-02-20. Items: Consulting 8hrs at $150/hr. Subtotal: $1200. Tax: $0. Total: $1200. What is the invoice date?", "answer": "2024-02-20"}
{"question": "Invoice #5678 from Global Services dated 2024-02-20. Items: Consulting 8hrs at $150/hr. Subtotal: $1200. Tax: $0. Total: $1200. What is the tax amount?", "answer": "$0"}
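Because each JSONL line must be a standalone JSON object with both columns, a quick local check can catch malformed rows before upload. The sketch below is illustrative only: `check_jsonl_lines` is a hypothetical helper, and skipping blank lines is an assumption rather than documented platform behavior.

```python
import json

# Columns expected in train/test data, per the table above.
REQUIRED_COLUMNS = ("question", "answer")


def check_jsonl_lines(lines):
    """Yield (line_number, message) for every row that breaks the expected shape."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(row, dict):
            yield number, "row is not a JSON object"
            continue
        for column in REQUIRED_COLUMNS:
            value = row.get(column)
            if not isinstance(value, str) or not value.strip():
                yield number, f"missing or empty column: {column}"


# The second row is deliberately missing its answer.
sample = [
    '{"question": "Invoice #1234 from Acme Corp. Total: $540. What is the total amount?", "answer": "$540"}',
    '{"question": "Invoice #1234 from Acme Corp. Total: $540. Who is the vendor?"}',
]
errors = list(check_jsonl_lines(sample))
```

The same function accepts an open file handle, since iterating over a text file yields one line at a time.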

CSV format

question	answer
Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. What is the total amount?	$540
Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. What is the invoice number?	1234
Invoice #1234 from Acme Corp dated 2024-01-15. Items: Widget x10 at $50 each. Subtotal: $500. Tax: $40. Total: $540. Who is the vendor?	Acme Corp
Invoice #5678 from Global Services dated 2024-02-20. Items: Consulting 8hrs at $150/hr. Subtotal: $1200. Tax: $0. Total: $1200. What is the invoice date?	2024-02-20
Invoice #5678 from Global Services dated 2024-02-20. Items: Consulting 8hrs at $150/hr. Subtotal: $1200. Tax: $0. Total: $1200. What is the tax amount?	$0
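Invoice text often contains commas, which a hand-written CSV would need to quote. One way to avoid quoting mistakes is to let Python's standard `csv` module write the file. The following is a sketch under assumptions: `rows_to_csv` is a hypothetical helper, and the example row is made up for illustration from an invoice that lists two line items.

```python
import csv
import io

# Illustrative row; the question contains a comma, so it must be quoted in CSV.
rows = [
    {
        "question": (
            "Invoice #3456 from Office Supplies Co dated 2024-03-15. "
            "Items: Paper 10 reams at $8 each, Pens box x5 at $12 each. "
            "Subtotal: $140. Tax: $11. Total: $151. What is the subtotal?"
        ),
        "answer": "$140",
    },
]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a question,answer header."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
# Round-trip check: reading the text back should give the original rows.
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

To produce a file instead of a string, pass a handle opened with `open("train.csv", "w", newline="", encoding="utf-8")` to `csv.DictWriter` in the same way.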

Unstructured dataset (optional)

The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific synthetic training data. For question answering, provide realistic samples that represent the types of inputs your model will encounter.

The expected format is CSV or JSONL with a single column (context):

JSONL format

unstructured.jsonl

{"context": "Invoice #9012 from Tech Solutions Inc dated 2024-03-10. Items: Software License x1 at $299. Subtotal: $299. Tax: $24. Total: $323."}
{"context": "Invoice #3456 from Office Supplies Co dated 2024-03-15. Items: Paper 10 reams at $8 each, Pens box x5 at $12 each. Subtotal: $140. Tax: $11. Total: $151."}
{"context": "Invoice #7890 from Cloud Hosting Ltd dated 2024-04-01. Items: Monthly hosting at $99, Domain renewal at $15. Subtotal: $114. Tax: $0. Total: $114."}
{"context": "Invoice #2468 from Marketing Agency dated 2024-04-12. Items: Campaign management 20hrs at $75/hr. Subtotal: $1500. Tax: $120. Total: $1620."}

CSV format

context
Invoice #9012 from Tech Solutions Inc dated 2024-03-10. Items: Software License x1 at $299. Subtotal: $299. Tax: $24. Total: $323.
Invoice #3456 from Office Supplies Co dated 2024-03-15. Items: Paper 10 reams at $8 each, Pens box x5 at $12 each. Subtotal: $140. Tax: $11. Total: $151.
Invoice #7890 from Cloud Hosting Ltd dated 2024-04-01. Items: Monthly hosting at $99, Domain renewal at $15. Subtotal: $114. Tax: $0. Total: $114.
Invoice #2468 from Marketing Agency dated 2024-04-12. Items: Campaign management 20hrs at $75/hr. Subtotal: $1500. Tax: $120. Total: $1620.
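If the raw invoice samples start out as plain strings, they can be wrapped into the single-column JSONL shape with a short script. This is a sketch only: `contexts_to_jsonl` is a hypothetical helper, and collapsing whitespace and dropping empty or duplicate samples are assumptions about what makes a tidy dataset, not platform requirements.

```python
import json


def contexts_to_jsonl(texts):
    """Return JSONL text with one {"context": ...} object per unique, non-empty sample."""
    seen = set()
    lines = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace and newlines
        if not cleaned or cleaned in seen:
            continue  # assumption: empty and duplicate samples are dropped
        seen.add(cleaned)
        lines.append(json.dumps({"context": cleaned}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Illustrative inputs: the first and third differ only in whitespace.
raw_samples = [
    "Invoice #9012 from Tech Solutions Inc dated 2024-03-10.  Total: $323.",
    "",
    "Invoice #9012 from Tech Solutions Inc dated 2024-03-10. Total: $323.",
]
jsonl_text = contexts_to_jsonl(raw_samples)
```

Using `json.dumps` for each row means quotes and backslashes inside the invoice text are escaped correctly.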

Configuration file

The configuration file specifies the task type and training parameters.

The expected format is YAML:

config.yaml

base:
  task: question-answering

For additional configuration options, see Configuration file.