Closed Book QA
Before training can start, you need to upload all the necessary ingredients for the training job. Here, we will focus on closed book question answering, where a model uses its internal knowledge to answer questions. As a running example, we will train a model to answer questions about some arbitrary context; for this purpose, we will need the following:
Job description
Describes the work you expect the model to perform; you can think of it as an LLM prompt that helps the model solve your task. For a question answering problem, we expect a single component, task_description, which describes the main task. For closed book QA this can be very simple, and it can include additional useful context if you find that is needed.
The expected format is a JSON blob; for our example, it should look like the following:
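A minimal sketch of such a job description, assuming a hypothetical task of answering questions about famous landmarks:

```json
{
  "task_description": "You are answering questions about famous landmarks. Use your internal knowledge and reply with a short, factual answer."
}
```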
Test/train data
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The bigger and more diverse the datasets, the better, but for the training stage we need only a few dozen examples (we'll generate many more based on the examples you provide).
The expected format is CSV or JSON lines with the following columns:
- question: the question the model must answer.
- answer: the answer to the question.
The data for both the train and test datasets should look like this:
JSONL format
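An illustrative JSONL snippet, with invented landmark question-answer pairs standing in for your own data:

```json
{"question": "How tall is the Eiffel Tower?", "answer": "330 metres"}
{"question": "In which city is the Louvre located?", "answer": "Paris"}
```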
CSV format
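An illustrative CSV snippet, with invented landmark examples in place of real data; the first row is the header:

```csv
question,answer
How tall is the Eiffel Tower?,330 metres
In which city is the Louvre located?,Paris
```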
Unstructured dataset
Unstructured data is crucial for the closed book question-answering task. The task aims to embed new knowledge into an SLM, and the unstructured data is the means of providing that knowledge. In practice, we generate question-answer pairs in the same style as the train dataset, but based on the unstructured contexts.
The expected format is CSV or JSON lines with a single column (context). For the example task discussed, it should look like this:
JSONL format
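An illustrative JSONL context file, with invented passages about landmarks standing in for your own documents:

```json
{"context": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall and was completed in 1889."}
{"context": "The Louvre, located in Paris, is the world's most-visited art museum and a historic landmark."}
```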
CSV format
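The equivalent single-column CSV, again with an invented passage; note that contexts containing commas must be quoted:

```csv
context
"The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall and was completed in 1889."
```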

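If it helps, both unstructured-data variants can be produced with a few lines of Python using only the standard library (the file names and example passages here are illustrative):

```python
import csv
import json

# Hypothetical raw documents to embed as new knowledge.
contexts = [
    "The Eiffel Tower is 330 metres tall and was completed in 1889.",
    "The Louvre, located in Paris, is the world's most-visited art museum.",
]

# JSONL variant: one {"context": ...} object per line.
with open("contexts.jsonl", "w") as f:
    for context in contexts:
        f.write(json.dumps({"context": context}) + "\n")

# CSV variant: a single "context" column; the csv module quotes
# fields containing commas automatically.
with open("contexts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["context"])
    writer.writeheader()
    writer.writerows({"context": c} for c in contexts)
```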