Closed Book QA | Distil Labs

Before training can start, you need to upload all the ingredients the training job requires. Here, we focus on closed book question answering, where the model uses its internal knowledge to answer questions. As a running example, we will answer questions about some arbitrary text passages. To train a model for this purpose, we need the following:

Job description

Describes the work you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task. For a question answering problem, we expect a single component, task_description, which describes the main task. For closed book QA, this can be very simple, and it can include additional context if you find that necessary.

Optionally, you can include llm_as_a_judge_instructions, which holds the instructions given to the LLM judge when it evaluates answers. The judge has access to the question, the context (if applicable), the reference answer, and the predicted answer. It is instructed to output a binary verdict (good/bad).

The expected format is a JSON blob, and for the example, we should have the following:

{
  "task_description": "The task is to answer the question using your internal knowledge",
  "llm_as_a_judge_instructions": "Evaluate whether the predicted answer correctly answers the question based on the reference answer. Output 'good' if the predicted answer is semantically equivalent to the reference answer or conveys the same key information, otherwise output 'bad'"
}
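As a minimal sketch, the job description above could be assembled and serialized with Python's standard library. The helper name `build_job_description` is hypothetical and not part of any Distil Labs API; only the two JSON keys come from the description above.

```python
import json


def build_job_description(task_description, judge_instructions=None):
    """Return a dict with the job description keys described above.

    Hypothetical helper: only the key names come from the documentation.
    """
    if not task_description.strip():
        raise ValueError("task_description must be a non-empty string")
    job = {"task_description": task_description}
    # The judge instructions are optional, so only add them when provided.
    if judge_instructions is not None:
        job["llm_as_a_judge_instructions"] = judge_instructions
    return job


job = build_job_description(
    "The task is to answer the question using your internal knowledge",
    "Evaluate whether the predicted answer correctly answers the question "
    "based on the reference answer. Output 'good' if the predicted answer is "
    "semantically equivalent to the reference answer or conveys the same key "
    "information, otherwise output 'bad'",
)

# Serialize to the JSON blob that would be uploaded.
job_json = json.dumps(job, indent=2)
print(job_json)
```

Building the blob with `json.dumps` rather than by hand avoids quoting mistakes, for example when the instructions themselves contain quotation marks.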

Test/train data

We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The larger and more diverse the datasets, the better, but for the training stage we need only a few dozen examples (we’ll generate many more based on the examples you provide).

The expected format is CSV or JSON Lines (JSONL) with the following columns:

question is the question the model must answer.
answer is the answer to the question.

The data for both the train and test datasets should look like this:

JSONL format

{"question": "Where did Sands and Chopin take shelter after the locals in Majorca became inhospitable upon discovering they were unmarried?","answer": "a former Carthusian monastery"}
{"question":"When did the Computer Emergency Readiness Team, a division of the Department of Homeland Security, investigate 79 hacking incidents at energy companies?","answer":"2014"}
{"question":"How is the final quarter of the Premier League's television rights revenue distributed?","answer":"the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this."}

CSV format

question	answer
Where did Sands and Chopin take shelter after the locals in Majorca became inhospitable upon discovering they were unmarried?	a former Carthusian monastery
When did the Computer Emergency Readiness Team, a division of the Department of Homeland Security, investigate 79 hacking incidents at energy companies?	2014
How is the final quarter of the Premier League’s television rights revenue distributed?	the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this.
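The sketch below shows one way to produce both formats from the same question-answer pairs using only the Python standard library. The helper names (`validate_rows`, `to_jsonl`, `to_csv`) are illustrative assumptions, not part of the platform; only the `question` and `answer` column names come from the description above.

```python
import csv
import io
import json

COLUMNS = ("question", "answer")


def validate_rows(rows):
    """Raise ValueError if any row lacks a non-empty question or answer."""
    for index, row in enumerate(rows):
        missing = [c for c in COLUMNS if not str(row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"row {index} is missing: {', '.join(missing)}")


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    validate_rows(rows)
    return "".join(json.dumps({c: row[c] for c in COLUMNS}) + "\n" for row in rows)


def to_csv(rows):
    """Serialize rows as CSV with a header row."""
    validate_rows(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


examples = [
    {
        "question": "When did the Computer Emergency Readiness Team, a division "
        "of the Department of Homeland Security, investigate 79 hacking "
        "incidents at energy companies?",
        "answer": "2014",
    },
    {
        "question": "How is the final quarter of the Premier League's "
        "television rights revenue distributed?",
        "answer": "the final quarter is paid out as facilities fees for games "
        "that are shown on television, with the top clubs generally receiving "
        "the largest shares of this.",
    },
]

jsonl_text = to_jsonl(examples)
csv_text = to_csv(examples)
```

Both questions and the second answer contain commas, so the `csv` module wraps those fields in double quotes automatically. Writing CSV by hand with simple string joins would split such fields into extra columns.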

Unstructured dataset

The unstructured data is crucial in the closed book question-answering task. This task aims to embed new knowledge into an SLM, and the unstructured data is the means of providing that knowledge. In practice, we generate question-answer pairs in a style similar to the examples in the train dataset, but based on the unstructured contexts.

The expected format is CSV or JSON Lines (JSONL) with a single column (context). For the example task discussed, it should look like this:

JSONL format

{"context": "In June 1837 Chopin visited London incognito in the company of the piano manufacturer Camille Pleyel where he played at a musical soir\u00e9e at the house of English piano maker James Broadwood. On his return to Paris, his association with Sand began in earnest, and by the end of June 1838 they had become lovers. Sand, who was six years older than the composer, and who had had a series of lovers, wrote at this time: \"I must say I was confused and amazed at the effect this little creature had on me ... I have still not recovered from my astonishment, and if I were a proud person I should be feeling humiliated at having been carried away ...\" The two spent a miserable winter on Majorca (8 November 1838 to 13 February 1839), where, together with Sand's two children, they had journeyed in the hope of improving the health of Chopin and that of Sand's 15-year-old son Maurice, and also to escape the threats of Sand's former lover F\u00e9licien Mallefille. After discovering that the couple were not married, the deeply traditional Catholic people of Majorca became inhospitable, making accommodation difficult to find. This compelled the group to take lodgings in a former Carthusian monastery in Valldemossa, which gave little shelter from the cold winter weather."}
{"context": "The Premier League sells its television rights on a collective basis. This is in contrast to some other European Leagues, including La Liga, in which each club sells its rights individually, leading to a much higher share of the total income going to the top few clubs. The money is divided into three parts: half is divided equally between the clubs; one quarter is awarded on a merit basis based on final league position, the top club getting twenty times as much as the bottom club, and equal steps all the way down the table; the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this. The income from overseas rights is divided equally between the twenty clubs."}
{"context": "Computers control functions at many utilities, including coordination of telecommunications, the power grid, nuclear power plants, and valve opening and closing in water and gas networks. The Internet is a potential attack vector for such machines if connected, but the Stuxnet worm demonstrated that even equipment controlled by computers not connected to the Internet can be vulnerable to physical damage caused by malicious commands sent to industrial equipment (in that case uranium enrichment centrifuges) which are infected via removable media. In 2014, the Computer Emergency Readiness Team, a division of the Department of Homeland Security, investigated 79 hacking incidents at energy companies."}

CSV format

context
In June 1837 Chopin visited London incognito in the company of the piano manufacturer Camille Pleyel where he played at a musical soirée at the house of English piano maker James Broadwood. On his return to Paris, his association with Sand began in earnest, and by the end of June 1838 they had become lovers. Sand, who was six years older than the composer, and who had had a series of lovers, wrote at this time: “I must say I was confused and amazed at the effect this little creature had on me … I have still not recovered from my astonishment, and if I were a proud person I should be feeling humiliated at having been carried away …” The two spent a miserable winter on Majorca (8 November 1838 to 13 February 1839), where, together with Sand’s two children, they had journeyed in the hope of improving the health of Chopin and that of Sand’s 15-year-old son Maurice, and also to escape the threats of Sand’s former lover Félicien Mallefille. After discovering that the couple were not married, the deeply traditional Catholic people of Majorca became inhospitable, making accommodation difficult to find. This compelled the group to take lodgings in a former Carthusian monastery in Valldemossa, which gave little shelter from the cold winter weather.
The Premier League sells its television rights on a collective basis. This is in contrast to some other European Leagues, including La Liga, in which each club sells its rights individually, leading to a much higher share of the total income going to the top few clubs. The money is divided into three parts: half is divided equally between the clubs; one quarter is awarded on a merit basis based on final league position, the top club getting twenty times as much as the bottom club, and equal steps all the way down the table; the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this. The income from overseas rights is divided equally between the twenty clubs.
Computers control functions at many utilities, including coordination of telecommunications, the power grid, nuclear power plants, and valve opening and closing in water and gas networks. The Internet is a potential attack vector for such machines if connected, but the Stuxnet worm demonstrated that even equipment controlled by computers not connected to the Internet can be vulnerable to physical damage caused by malicious commands sent to industrial equipment (in that case uranium enrichment centrifuges) which are infected via removable media. In 2014, the Computer Emergency Readiness Team, a division of the Department of Homeland Security, investigated 79 hacking incidents at energy companies.
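If the source material starts out as one long plain-text document, a small script can turn it into the single-column JSONL format. The sketch below assumes passages are separated by blank lines, which may not hold for your data; `paragraphs_to_context_jsonl` is a hypothetical helper name, and only the `context` column name comes from the description above.

```python
import json


def paragraphs_to_context_jsonl(text):
    """Split text on blank lines and emit one {"context": ...} object per line.

    Assumption: blank lines separate passages. Whitespace inside each
    passage (including single line breaks) is collapsed to single spaces.
    """
    paragraphs = [" ".join(chunk.split()) for chunk in text.split("\n\n")]
    lines = [json.dumps({"context": p}) for p in paragraphs if p]
    return "".join(line + "\n" for line in lines)


raw_text = (
    "He played at a musical soirée at the house of James Broadwood.\n"
    "\n"
    "The Premier League sells its television rights\n"
    "on a collective basis.\n"
)

contexts_jsonl = paragraphs_to_context_jsonl(raw_text)
print(contexts_jsonl)
```

By default `json.dumps` escapes non-ASCII characters, so `soirée` is written as `soir\u00e9e`, matching the JSONL example above. Any JSON parser decodes the escape back to the original character, so no information is lost. Those escapes are JSON syntax only and would be read literally in a CSV file.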