Classification data preparation | Distil Labs

Before the training can start, you need to upload all the necessary ingredients to start the training job. For this example, we will focus on classifying customer service requests into categories to streamline the support workflows in an imaginary banking system. To train a model for this purpose, we will need the following:

Task description

Describes the task you expect the model to perform; you can think of it as an LLM prompt that would help the model solve your task. In practice for a classification problem, we expect two components:

task_description field that describes the main task
classes_description field, which provides names and descriptions for all classes. In practice, it is a map from class names to their descriptions.

The expected format is a JSON blob, and for classifying banking service requests, we should have the following:

1 {
2   "task_description": "Classify the bank customer service requests into one of the provided classes",
3 
4   "classes_description": {
5     "balance_not_updated_after_bank_transfer": "Requests about a completed bank transfer not yet reflected in the account balance. The funds have been debited but not credited, indicating a delay in processing the outgoing transfer.",
6 
7     "balance_not_updated_after_cheque_or_cash_deposit": "Requests regarding a recent cheque or cash deposit not showing up in the available account balance. The customer's ledger balance does not reflect the deposit after some time has passed.",
8 
9     "card_payment_fee_charged": "Requests questioning an unexpected or additional fee charged for making a payment or purchase with a debit or credit card. The customer seeks clarification on the reason for the fee.",
10 
11     "cash_withdrawal_charge": "Requests related to being charged a fee for withdrawing cash from an ATM. The customer wants to know the reason, the exact fee amount, and if it can be waived.",
12 
13     "declined_cash_withdrawal": "Requests about attempting to withdraw cash from an ATM but having the transaction declined. The customer has tried multiple ATMs but still faces the same issue with their card.",
14 
15     "direct_debit_payment_not_recognised": "Requests regarding an unauthorized direct debit payment charged to the account. The customer claims they did not set it up and wants the bank to investigate its validity.",
16  }
17 }

Test/Train data

We need a testing dataset that we can use to evaluate the performance of the fine-tuned model on your task. For the training stage, we need only a few dozen examples to fine-tune your model. Of course, the more diverse and bigger the dataset, the better.

The expected format is CSV or JSON-lines with (question,answer) columns. The banking classification task should look like this:

JSON format

1 {"question": "Why is there a fee for getting cash?", "answer": "cash_withdrawal_charge"}
2 {"question": "I was declined when I tried to take out cash!", "answer": "declined_cash_withdrawal"}
3 {"question": "I deposited some money, but the balance has not changed.", "answer": "balance_not_updated_after_cheque_or_cash_deposit"}
4 {"question": "It has been a couple of hours but I do not see my balance updated, can you help?", "answer": "balance_not_updated_after_bank_transfer"}
5 {"question": "There is a payment showing on my app that I didn`t do. Will you please cancel this payment and refund my money ?", "answer": "direct_debit_payment_not_recognised"}
6 {"question": "How do I know which payments I make will have additional fees? Where can I find this information online?", "answer": "card_payment_fee_charged"}
7 {"question": "how come i was declined", "answer": "declined_cash_withdrawal"}
8 {"question": "I deposited a cheque and its been days and I still haven`t received the cash!!", "answer": "balance_not_updated_after_cheque_or_cash_deposit"}
9 {"question": "You promised no fees but now I`ve got one. What the hell!?", "answer": "card_payment_fee_charged"}
10 {"question": "What is the amount of time transfers usually take from the UK? I had just completed a transfer and nothing seems to be showing up so I need to be sure things are alright.", "answer": "balance_not_updated_after_bank_transfer"}
11 {"question": "Why is there a direct debit to my account? I didn`t do that.", "answer": "direct_debit_payment_not_recognised"}
12 {"question": "Do cash withdrawals cost anything?", "answer": "cash_withdrawal_charge"}

CSV format

question	answer
Why is there a fee for getting cash?	cash_withdrawal_charge
I was declined when I tried to take out cash!	declined_cash_withdrawal
I deposited some money, but the balance has not changed.	balance_not_updated_after_cheque_or_cash_deposit
It has been a couple of hours but I do not see my balance updated, can you help?	balance_not_updated_after_bank_transfer
There is a payment showing on my app that I didn’t do. Will you please cancel this payment and refund my money ?	direct_debit_payment_not_recognised
How do I know which payments I make will have additional fees? Where can I find this information online?	card_payment_fee_charged
how come i was declined	declined_cash_withdrawal
I deposited a cheque and its been days and I still haven’t received the cash!!	balance_not_updated_after_cheque_or_cash_deposit
You promised no fees but now I’ve got one. What the hell!?	card_payment_fee_charged
What is the amount of time transfers usually take from the UK? I had just completed a transfer and nothing seems to be showing up so I need to be sure things are alright.	balance_not_updated_after_bank_transfer
Why is there a direct debit to my account? I didn’t do that.	direct_debit_payment_not_recognised
Do cash withdrawals cost anything?	cash_withdrawal_charge

Unstructured dataset

The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. It can be documentation, unlabelled examples, or even industry literature that contains such information.

For our banking problem, we will use unlabelled customer requests as context for generating new examples.

The expected format is CSV or JSON lines with a single column (context). For the banking classification task it should look like this:

JSON format

1 {"context":"Canceling my order is what I need to do right now."}
2 {"context":"I swear that there are 2 payments on the app that I didn't make.  Could my card me stolen?  Please advise what I should do."}
3 {"context":"To many charges on my card how do I go about fixing that?"}
4 {"context":"I got less cash than what I specified at the ATM."}
5 {"context":"Why is my last cheque deposit taking so long?"}
6 {"context":"A payment shows up on the app that I never made."}
7 {"context":"My cash withdrawal was short."}
8 {"context":"I would like to receive a refund for something I bought."}
9 {"context":"I saw a fee on my receipt from withdrawing money while shopping earlier.  Is there supposed to be a fee for this type of transaction?"}
10 {"context":"I normally don't use ATMs, but I was in a rush today and had to withdraw some cash. The ATM gave me the wrong amount of money and now my app is not showing the same amount as it should. What do I do?"}

CSV format

context
Canceling my order is what I need to do right now.
I swear that there are 2 payments on the app that I didn’t make. Could my card me stolen? Please advise what I should do.
To many charges on my card how do I go about fixing that?
I got less cash than what I specified at the ATM.
Why is my last cheque deposit taking so long?
A payment shows up on the app that I never made.
My cash withdrawal was short.
I would like to receive a refund for something I bought.
I saw a fee on my receipt from withdrawing money while shopping earlier. Is there supposed to be a fee for this type of transaction?
I normally don’t use ATMs, but I was in a rush today and had to withdraw some cash. The ATM gave me the wrong amount of money and now my app is not showing the same amount as it should. What do I do?