Classification data preparation
Before training can start, you need to upload all the necessary ingredients for the training job. In this example, we focus on classifying customer service requests into categories to streamline support workflows in an imaginary banking system. To train a model for this purpose, we need the following:
Task description
The task description describes the task you expect the model to perform; you can think of it as an LLM prompt that helps the model solve your task. In practice, for a classification problem, we expect two components:
task_description: a field that describes the main task.
classes_description: a field that provides names and descriptions for all classes. In practice, it is a map from class names to their descriptions.
The expected format is a JSON blob, and for classifying banking service requests, we should have the following:
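As a minimal sketch of that structure, the snippet below builds a job description with the two fields named above and serializes it to JSON. The class names, the description texts, and the variable names are hypothetical and chosen only for illustration; they are not taken from the original example.

```python
import json

# Hypothetical task and class descriptions for the banking example.
# Only the two field names come from the text above; the values are illustrative.
job_description = {
    "task_description": (
        "Classify a customer service request sent to a bank "
        "into one of the categories described below."
    ),
    "classes_description": {
        "card_lost_or_stolen": "The customer reports a lost or stolen card.",
        "transfer_failed": "The customer reports a transfer that did not go through.",
        "account_balance": "The customer asks about their current balance.",
    },
}

# Serialize to a JSON blob that can be saved or uploaded.
job_description_json = json.dumps(job_description, indent=2)
print(job_description_json)
```

Keeping classes_description as a plain mapping makes it easy to check that every label used in the labelled data has a matching description.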
Test/Train data
We need a test dataset to evaluate the performance of the fine-tuned model on your task. For the training stage, a few dozen examples are enough to fine-tune your model, although a larger and more diverse dataset generally gives better results.
The expected format is CSV or JSON-lines with (question, answer) columns. The banking classification task should look like this:
JSON format
CSV format
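A minimal sketch that produces both formats from the same labelled examples is shown here. The example requests and labels are hypothetical, and the sketch assumes that the answer column holds the class name; the helper functions are illustrative and not part of any platform API.

```python
import csv
import io
import json

# Hypothetical labelled examples. Assumption: "answer" holds the class name.
examples = [
    {"question": "I lost my debit card yesterday, can you block it?",
     "answer": "card_lost_or_stolen"},
    {"question": "My transfer to another account did not go through.",
     "answer": "transfer_failed"},
]


def to_jsonl(rows):
    """Serialize rows as JSON lines: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def to_csv(rows, fieldnames):
    """Serialize rows as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


train_jsonl = to_jsonl(examples)
train_csv = to_csv(examples, ["question", "answer"])
print(train_jsonl)
print(train_csv)
```

Using the csv module rather than joining strings by hand matters here because customer requests often contain commas and quotes, which must be escaped in CSV.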
Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. It can be documentation, unlabelled examples, or even industry literature relevant to your domain.
For our banking problem, we will use unlabelled customer requests as context for generating new examples.
The expected format is CSV or JSON lines with a single column (context). For the banking classification task, it should look like this:
JSON format
CSV format
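As a sketch of the single-column layout, the snippet below wraps raw, unlabelled requests in a context field and writes them in both formats. The request texts and variable names are hypothetical; only the context column name comes from the text above.

```python
import csv
import io
import json

# Hypothetical unlabelled customer requests used as context.
raw_requests = [
    "Hi, I noticed a charge on my statement that I do not recognise.",
    "How do I raise the daily limit on my card?",
]

# Each record has exactly one field: "context".
rows = [{"context": text} for text in raw_requests]

# JSON lines: one JSON object per line.
unstructured_jsonl = "".join(json.dumps(row) + "\n" for row in rows)

# CSV: a header row followed by one request per row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["context"])
writer.writeheader()
writer.writerows(rows)
unstructured_csv = buffer.getvalue()

print(unstructured_jsonl)
print(unstructured_csv)
```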