Open Book QA with synthetic contexts
Before training can start, you need to upload all the ingredients the training job requires. Here, we focus on open-book question answering, where the answer to every question is contained in an accompanying reading passage, but those passages are not available at training time. As a running example, we will answer questions about Roman Empire articles that we do not currently have on hand. To train a model for this purpose, we will need the following:
Job description
Describes the work you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task. In practice, for a question-answering problem, we expect a single component, task_description, which describes the main task.
The expected format is a JSON blob, and for Roman Empire QA, we should have the following:
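A minimal sketch of such a blob (the exact wording of the task description is only illustrative; adapt it to your own task):

```json
{
  "task_description": "You will be given an excerpt from an article about the Roman Empire and a question. Answer the question using only the information in the excerpt."
}
```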
Test/train data
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The larger and more diverse the datasets, the better, but for the training stage only a few dozen examples are needed (we will generate many more based on the examples you provide).
The expected format is CSV or JSON-lines with the following columns:
- question is the question the model must answer.
- context holds the information needed to answer the question.
- answer is the answer to the question (based on the context).
The data for open-book Wikipedia question answering should look like this:
JSONL format
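A few illustrative rows (the passages and answers below are hypothetical examples, not real training data):

```jsonl
{"question": "In which year was Octavian granted the title Augustus?", "context": "In 27 BC the Roman Senate granted Octavian the title Augustus, marking the conventional beginning of the Roman Empire.", "answer": "27 BC"}
{"question": "Which sea did the Romans call Mare Nostrum?", "context": "At its height the empire encircled the Mediterranean Sea, which the Romans called Mare Nostrum, 'our sea'.", "answer": "The Mediterranean Sea"}
```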
CSV format
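The same illustrative rows as CSV (fields containing commas are quoted):

```csv
question,context,answer
"In which year was Octavian granted the title Augustus?","In 27 BC the Roman Senate granted Octavian the title Augustus, marking the conventional beginning of the Roman Empire.","27 BC"
"Which sea did the Romans call Mare Nostrum?","At its height the empire encircled the Mediterranean Sea, which the Romans called Mare Nostrum, 'our sea'.","The Mediterranean Sea"
```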
Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. For open-book QA, we need to provide realistic samples that could be used as contexts for question answering.
The expected format is CSV or JSON-lines with a single column (context). For the Wikipedia QA task, it should look like this:
JSONL format
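For instance (the passages below are illustrative placeholders, not excerpts from actual articles):

```jsonl
{"context": "The Roman Empire reached its greatest territorial extent under Trajan, who ruled from AD 98 to 117 and annexed Dacia and, briefly, Mesopotamia."}
{"context": "Roman roads, built primarily for the movement of legions, eventually spanned tens of thousands of kilometres and connected the provinces to Rome."}
```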
CSV format
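And the same illustrative passages as CSV:

```csv
context
"The Roman Empire reached its greatest territorial extent under Trajan, who ruled from AD 98 to 117 and annexed Dacia and, briefly, Mesopotamia."
"Roman roads, built primarily for the movement of legions, eventually spanned tens of thousands of kilometres and connected the provinces to Rome."
```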
