Open Book QA with synthetic contexts
Before training can start, you need to upload all the ingredients the training job requires. Here, we focus on open-book question answering, where the answer to every question is contained in an accompanying reading passage, but those passages are not available at training time. As a running example, we will answer questions about Roman Empire articles that we do not currently have on hand. To train a model for this purpose, we will need the following:
Job description
Describes the work you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task. In practice, for a question-answering problem, we expect a single component, task_description, which describes the main task.
The expected format is a JSON blob, and for Roman Empire QA, we should have the following:
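A minimal sketch of such a blob (the exact wording of the task description is only illustrative; adapt it to your own task):

```json
{
  "task_description": "You will be given an excerpt from an article about the Roman Empire and a question. Answer the question using only the information in the excerpt."
}
```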
Test/train data
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The larger and more diverse the datasets, the better, but for the training stage only a few dozen examples are needed (we will generate many more based on the examples you provide).
The expected format is CSV or JSON-lines with the following columns:
- question is the question the model must answer.
- context holds the information needed to answer the question.
- answer is the answer to the question (based on the context).
The data for open-book Wikipedia question answering should look like this:
JSONL format
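A few illustrative rows (the passages and answers below are hypothetical examples, not real training data):

```jsonl
{"question": "In which year was Octavian granted the title Augustus?", "context": "In 27 BC the Roman Senate granted Octavian the title Augustus, marking the conventional beginning of the Roman Empire.", "answer": "27 BC"}
{"question": "Which sea did the Romans call Mare Nostrum?", "context": "At its height the empire encircled the Mediterranean Sea, which the Romans called Mare Nostrum, 'our sea'.", "answer": "The Mediterranean Sea"}
```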
CSV format
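The same illustrative rows as CSV (fields containing commas are quoted):

```csv
question,context,answer
"In which year was Octavian granted the title Augustus?","In 27 BC the Roman Senate granted Octavian the title Augustus, marking the conventional beginning of the Roman Empire.","27 BC"
"Which sea did the Romans call Mare Nostrum?","At its height the empire encircled the Mediterranean Sea, which the Romans called Mare Nostrum, 'our sea'.","The Mediterranean Sea"
```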
Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. For open-book QA, we need to provide realistic samples that could be used as contexts for question answering.
The expected format is CSV or JSON-lines with a single column (context). For the Wikipedia QA task, it should look like this:
JSONL format
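For instance (the passages below are illustrative placeholders, not excerpts from actual articles):

```jsonl
{"context": "The Roman Empire reached its greatest territorial extent under Trajan, who ruled from AD 98 to 117 and annexed Dacia and, briefly, Mesopotamia."}
{"context": "Roman roads, built primarily for the movement of legions, eventually spanned tens of thousands of kilometres and connected the provinces to Rome."}
```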
CSV format
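And the same illustrative passages as CSV:

```csv
context
"The Roman Empire reached its greatest territorial extent under Trajan, who ruled from AD 98 to 117 and annexed Dacia and, briefly, Mesopotamia."
"Roman roads, built primarily for the movement of legions, eventually spanned tens of thousands of kilometres and connected the provinces to Rome."
```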
