Question Answering data preparation
Before training can start, you need to upload all the necessary data for your training job. Here, we will focus on question answering, where the model extracts or generates precise answers from text based on specific queries. We will use an example of extracting information from invoices to demonstrate the data format.
Job description
The job description describes the work you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task.
The expected format is a JSON file with the following fields:
task_description: describes what the model should do and how it should format its outputs
input_description: describes what the input data looks like
Optionally, you can include llm_as_a_judge_instructions, which holds the instructions given to the LLM judge when it evaluates answers.
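As a rough illustration of the fields listed above, the following Python sketch builds a job description for the invoice example and saves it as JSON. The three field names come from this page; the wording of each value, the file name `job_description.json`, and the helper `write_job_description` are assumptions made for the example, not part of the platform.

```python
import json
import tempfile
from pathlib import Path

# Field names are taken from the text above; the values are illustrative
# placeholders for an invoice question answering task.
job_description = {
    "task_description": (
        "Answer the question using only the invoice text. "
        "Reply with the exact value and nothing else."
    ),
    "input_description": (
        "Plain text extracted from an invoice, followed by a question about it."
    ),
    # Optional field: guidance for the LLM that judges the answers.
    "llm_as_a_judge_instructions": (
        "Mark the answer correct if it matches the reference value, "
        "ignoring whitespace and currency formatting."
    ),
}

REQUIRED_FIELDS = ("task_description", "input_description")


def write_job_description(description: dict, path: Path) -> None:
    """Check that the required fields are present, then save as JSON.

    Hypothetical helper for this sketch, not a platform API.
    """
    missing = [field for field in REQUIRED_FIELDS if not description.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    path.write_text(json.dumps(description, indent=2), encoding="utf-8")


# Hypothetical file name, written to a temporary directory for the demo.
out_path = Path(tempfile.mkdtemp()) / "job_description.json"
write_job_description(job_description, out_path)
```

Checking the required fields before saving catches an incomplete description locally, before anything is uploaded.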
Train/test data
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The larger and more diverse the datasets, the better, but the training stage needs only a few dozen examples (we’ll generate many more from the examples you provide).
The expected format is CSV or JSONL with the following columns:
JSONL format
CSV format
Unstructured dataset (optional)
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific synthetic training data. For question answering, provide realistic samples that represent the types of inputs your model will encounter.
The expected format is CSV or JSONL with a single column (context):
JSONL format
CSV format
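The single-column layout described above can be produced with a short Python sketch like the one below, which writes the same samples to both JSONL and CSV. The column name `context` comes from this page; the invoice snippets, the file names, and the helpers `write_jsonl` and `write_csv` are made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up invoice snippets; replace with real samples from your domain.
contexts = [
    "Invoice #1001\nVendor: Acme Corp\nTotal due: $250.00",
    "Invoice #1002\nVendor: Globex, Inc.\nTotal due: $1,980.50",
]


def write_jsonl(rows: list[str], path: Path) -> None:
    """Write one JSON object per line, each with a single `context` key."""
    with path.open("w", encoding="utf-8") as f:
        for text in rows:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"context": text}) + "\n")


def write_csv(rows: list[str], path: Path) -> None:
    """Write a CSV file with a `context` header and one sample per row."""
    # newline="" lets the csv module handle line endings and quoted newlines.
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["context"])
        writer.writeheader()
        for text in rows:
            writer.writerow({"context": text})


# Hypothetical file names, written to a temporary directory for the demo.
out_dir = Path(tempfile.mkdtemp())
jsonl_path = out_dir / "unstructured.jsonl"
csv_path = out_dir / "unstructured.csv"
write_jsonl(contexts, jsonl_path)
write_csv(contexts, csv_path)
```

Using the `json` and `csv` modules rather than string concatenation keeps multi-line samples and samples containing commas intact in both formats.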
Configuration file
The configuration file specifies the task type and training parameters.
The expected format is YAML:
For additional configuration options, see Configuration file →
