Information Extraction

Before training can start, you need to upload all the inputs the training job requires. Here we focus on information extraction, where the model extracts structured information from unstructured text. We use PII redaction as the running example; to train a model for this task, you will need the following:

Job description

Describes the work you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task. The expected format is a JSON object; for information extraction, it should contain the following two fields:

{
  "task_description": "Produce a redacted version of customer/support texts that removes sensitive personal data while preserving operational signals (order IDs, last-4 of cards, clinician names). You must return:\n1. `redacted_text` field with minimal substitutions (e.g., [PERSON], [CARD_LAST4:####])\n2. `entities` array that lists every redacted token, each with {value: original value verbatim, replacement_token: what the value was replaced with}.\n\n\nRedact (replace in text):\n- PERSON (customer/patient names) → [PERSON]\n- EMAIL → [EMAIL]\n- PHONE (any international format) → [PHONE]\n- ADDRESS (street + number, or full postal lines) → [ADDRESS]\n- SSN / National ID / MRN → [SSN] / [ID] / [MRN]\n- CREDIT_CARD (full 13–19 digit number, with spaces/hyphens) → [CARD_LAST4:####] (keep last-4 only)\n- IBAN / Bank account → [IBAN_LAST4:####] (keep last-4 only)\n\nKeep (do not redact):\n- Clinician/doctor names when clearly marked by a medical title (e.g., Dr., MD, DO, RN).\n- Card last-4 when referenced as last-4 only (“ending 9021”, “**** 9021”).\n- Operational IDs: order/ticket/invoice numbers, device serials, case IDs.\n- Non-personal org info: company names, product names, team names.\n\n",
  "context_description": "Short, semi-structured support emails/tickets/chats (1–5 sentences) from customers or clinic staff:\n\nOften include a greeting, a brief problem description, and a signature line.\n\nMay contain one or more PII/PHI items (names, emails, phones, addresses, SSN/MRN), payment data (full card or “ending ####”), or operational IDs (order/ticket/invoice).\n\nMultilingual flavor (EN with occasional DE/PL/FR details); phone numbers and addresses in varied formats.\n\nClinician mentions appear as Dr. <LastName> and must be kept.\n\nOutputs should include the redacted_text and an entities list with {value, replacement_token}; include last4 for cards/IBANs when applicable.\n"
}
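Because both fields are long multi-line strings, it can be easier to generate the JSON than to escape it by hand. The following is a minimal sketch, not an official tool: the helper name `render_job_description` and the shortened example texts are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = ("task_description", "context_description")


def render_job_description(task_description: str, context_description: str) -> str:
    """Serialize the two job-description fields into JSON text (hypothetical helper)."""
    job = {
        "task_description": task_description,
        "context_description": context_description,
    }
    for key in REQUIRED_KEYS:
        if not job[key].strip():
            raise ValueError(f"{key} must not be empty")
    # json.dumps escapes line breaks as \n; ensure_ascii=False keeps
    # characters such as the arrow readable in the output file.
    return json.dumps(job, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Shortened, illustrative texts; substitute your full descriptions.
    print(
        render_job_description(
            "Redact (replace in text):\n- EMAIL → [EMAIL]\n- PHONE → [PHONE]",
            "Short support emails, tickets, and chats (1-5 sentences).",
        )
    )
```

Generating the file this way also guarantees it parses as JSON, which a hand-edited blob with a stray trailing comma would not.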

Test/train data

We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. Larger and more diverse datasets are better, but the training stage needs only a few dozen examples (we’ll generate many more based on the examples you provide).
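If you start from a single pool of labeled examples, you need to divide it into the two datasets yourself. A minimal sketch of a reproducible split follows; the function name, the 20% test fraction, and the fixed seed are assumptions, not requirements stated here.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    shuffled = list(examples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example on each side of the split.
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping the seed fixed means the same examples land in the test set on every run, so evaluation results stay comparable between training jobs.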

The expected format is CSV or JSON Lines (JSONL) with the following columns:

  1. question is the question the model must answer.
  2. context holds information needed to answer the question.
  3. answer is the answer to the question (based on the context).

For the PII redaction task, the data should look like this:

JSONL format

{"question": "Redact provided text according to the task description and return redacted elements.", "context": "Refund via IBAN DK50 0040 0440 1162 43 today.", "answer": "{\"redacted_text\": \"Refund via [IBAN_LAST4:6243] today.\", \"entities\": [{\"replacement_token\": \"[IBAN_LAST4:6243]\", \"value\": \"DK50 0040 0440 1162 43\"}]}"}
{"question": "Redact provided text according to the task description and return redacted elements.", "context": "Contact me at maria.lu@gmail.com or 020 7946 0958. Thanks, Maria Lu.", "answer": "{\"redacted_text\": \"Contact me at [EMAIL] or [PHONE]. Thanks, [PERSON].\", \"entities\": [{\"replacement_token\": \"[EMAIL]\", \"value\": \"maria.lu@gmail.com\"}, {\"replacement_token\": \"[PHONE]\", \"value\": \"020 7946 0958\"}, {\"replacement_token\": \"[PERSON]\", \"value\": \"Maria Lu\"}]}"}
{"question": "Redact provided text according to the task description and return redacted elements.", "context": "IBAN NO93 8601 1117 947 for the refund.", "answer": "{\"redacted_text\": \"[IBAN_LAST4:7947] for the refund.\", \"entities\": [{\"replacement_token\": \"[IBAN_LAST4:7947]\", \"value\": \"NO93 8601 1117 947\"}]}"}
{"question": "Redact provided text according to the task description and return redacted elements.", "context": "Reach me at sophie.mueller@firma.de; cheers, Sophie M\u00fcller.", "answer": "{\"redacted_text\": \"Reach me at [EMAIL]; cheers, [PERSON].\", \"entities\": [{\"replacement_token\": \"[EMAIL]\", \"value\": \"sophie.mueller@firma.de\"}, {\"replacement_token\": \"[PERSON]\", \"value\": \"Sophie M\\u00fcller\"}]}"}
{"question": "Redact provided text according to the task description and return redacted elements.", "context": "Banking IBAN CH93 0076 2011 6238 5295 7 now active.", "answer": "{\"redacted_text\": \"Banking [IBAN_LAST4:2957] now active.\", \"entities\": [{\"replacement_token\": \"[IBAN_LAST4:2957]\", \"value\": \"CH93 0076 2011 6238 5295 7\"}]}"}
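Hand-written examples are easy to get subtly wrong, for instance a last-4 token that does not match the original number. The sketch below builds a record and runs a few sanity checks before serializing to JSONL. It assumes the `answer` column holds a JSON-encoded string and that the last-4 rule follows the task description; the helper names `make_record`, `check_record`, and `to_jsonl` are hypothetical.

```python
import json
import re

QUESTION = (
    "Redact provided text according to the task description "
    "and return redacted elements."
)


def make_record(context, redacted_text, entities):
    """Build one row; the answer is stored as a JSON-encoded string."""
    answer = json.dumps(
        {"redacted_text": redacted_text, "entities": entities}, ensure_ascii=False
    )
    return {"question": QUESTION, "context": context, "answer": answer}


def check_record(record):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    answer = json.loads(record["answer"])
    for entity in answer["entities"]:
        value, token = entity["value"], entity["replacement_token"]
        if value not in record["context"]:
            problems.append(f"value not found verbatim in context: {value!r}")
        if token not in answer["redacted_text"]:
            problems.append(f"token missing from redacted_text: {token!r}")
        match = re.fullmatch(r"\[(?:CARD|IBAN)_LAST4:(\w{4})\]", token)
        if match:
            # Compare against the last four characters, ignoring spaces and hyphens.
            expected = re.sub(r"[\s-]", "", value)[-4:]
            if match.group(1) != expected:
                problems.append(f"last-4 mismatch: {token!r} vs {expected!r}")
    return problems


def to_jsonl(records):
    """One JSON object per line, with no separators between lines."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

For example, `check_record(make_record("Refund via IBAN DK50 0040 0440 1162 43 today.", "Refund via [IBAN_LAST4:6243] today.", [{"replacement_token": "[IBAN_LAST4:6243]", "value": "DK50 0040 0440 1162 43"}]))` returns an empty list, while a token with transposed digits would be reported as a last-4 mismatch.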

CSV format

question,context,answer
Redact provided text according to the task description and return redacted elements.,Refund via IBAN DK50 0040 0440 1162 43 today.,"{""redacted_text"":""Refund via [IBAN_LAST4:6243] today."",""entities"":[{""replacement_token"":""[IBAN_LAST4:6243]"",""value"":""DK50 0040 0440 1162 43""}]}"
Redact provided text according to the task description and return redacted elements.,"Contact me at maria.lu@gmail.com or 020 7946 0958. Thanks, Maria Lu.","{""redacted_text"":""Contact me at [EMAIL] or [PHONE]. Thanks, [PERSON]."",""entities"":[{""replacement_token"":""[EMAIL]"",""value"":""maria.lu@gmail.com""},{""replacement_token"":""[PHONE]"",""value"":""020 7946 0958""},{""replacement_token"":""[PERSON]"",""value"":""Maria Lu""}]}"
Redact provided text according to the task description and return redacted elements.,IBAN NO93 8601 1117 947 for the refund.,"{""redacted_text"":""[IBAN_LAST4:7947] for the refund."",""entities"":[{""replacement_token"":""[IBAN_LAST4:7947]"",""value"":""NO93 8601 1117 947""}]}"
Redact provided text according to the task description and return redacted elements.,"Reach me at sophie.mueller@firma.de; cheers, Sophie Müller.","{""redacted_text"":""Reach me at [EMAIL]; cheers, [PERSON]."",""entities"":[{""replacement_token"":""[EMAIL]"",""value"":""sophie.mueller@firma.de""},{""replacement_token"":""[PERSON]"",""value"":""Sophie Müller""}]}"
Redact provided text according to the task description and return redacted elements.,Banking IBAN CH93 0076 2011 6238 5295 7 now active.,"{""redacted_text"":""Banking [IBAN_LAST4:2957] now active."",""entities"":[{""replacement_token"":""[IBAN_LAST4:2957]"",""value"":""CH93 0076 2011 6238 5295 7""}]}"
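The answer column contains commas and double quotes, so a CSV file needs proper quoting to keep each row in three columns. A minimal sketch using Python's standard `csv` module, which handles the quoting automatically, is shown here; the function names are hypothetical, and each record is assumed to be a dict with the three column names as keys.

```python
import csv
import io

COLUMNS = ["question", "context", "answer"]


def records_to_csv(records):
    """Serialize records to CSV text; csv quotes fields that contain commas or quotes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(text):
    """Parse CSV text back into a list of dicts (useful for a round-trip check)."""
    return list(csv.DictReader(io.StringIO(text)))
```

Reading the file back with `csv_to_records` and comparing it with the original records is a quick way to confirm that no field was split on an embedded comma.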

Unstructured dataset

The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. For information extraction, provide realistic samples of the kind of text that could serve as the context to extract from.

The expected format is CSV or JSON Lines (JSONL) with a single column (context). For the PII redaction task, it should look like this:

JSONL format

{"context": "There is a fee for a transfer, please explain that to me."}
{"context": "I received a fee I should not have."}
{"context": "Hey there, I just went through my most account statement and I notice the same charge so I would like one of the charges to be reversed and my money to be put back in the account."}
{"context": "I would like to dispute a direct debit transaction"}
{"context": "If I found an error in my account for a transaction I didn't make, how long to I have to dispute it?"}
{"context": "Please tell me why my transaction was declined"}
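Raw text collected from tickets or chat logs often contains blank entries and duplicates. The sketch below normalizes whitespace, drops empty and repeated samples, and writes one `{"context": ...}` object per line. The cleaning rules and helper names are assumptions for illustration; the format itself only requires the single `context` column.

```python
import json


def clean_contexts(raw_texts):
    """Collapse whitespace, then drop empty and duplicate samples (order preserved)."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


def contexts_to_jsonl(contexts):
    """Serialize each sample as a single-column JSON object, one per line."""
    return "".join(
        json.dumps({"context": c}, ensure_ascii=False) + "\n" for c in contexts
    )
```

Collapsing whitespace also removes embedded line breaks, which keeps each sample on a single line in the output.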

CSV format

context
"There is a fee for a transfer, please explain that to me."
I received a fee I should not have.
"Hey there, I just went through my most account statement and I notice the same charge so I would like one of the charges to be reversed and my money to be put back in the account."
I would like to dispute a direct debit transaction
"If I found an error in my account for a transaction I didn't make, how long to I have to dispute it?"
Please tell me why my transaction was declined