Classification data preparation
Before training can start, you need to upload all the ingredients the training job requires. In this example, we focus on classifying customer service requests into categories to streamline support workflows in an imaginary banking system. To train a model for this purpose, you will need the following:
Job description
Describes the work you expect the model to perform; think of it as an LLM prompt that helps the model solve your task. For a classification problem, we expect two components:

- `task_description` — describes the main task
- `classes_description` — provides names and descriptions for all classes; it is a map from class names to their descriptions
The expected format is a JSON file:
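A minimal sketch of such a file for the banking example; the two keys come from the components above, while the class names and wording are illustrative assumptions:

```json
{
  "task_description": "Classify incoming customer service requests from a banking system into support categories.",
  "classes_description": {
    "card_issues": "Problems with debit or credit cards: blocked, lost, or declined cards.",
    "transfers": "Questions about sending or receiving money transfers.",
    "account_access": "Login problems, password resets, and locked accounts.",
    "other": "Any request that does not fit the categories above."
  }
}
```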
Train/test data
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The bigger and more diverse the datasets, the better, but the training stage needs only a few dozen examples (we'll generate many more based on the examples you provide).
The expected format is CSV or JSONL with the following columns:
JSONL format
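A sketch of a JSONL training file, one JSON object per line; the column names `text` and `label` are assumptions for a typical classification dataset, and the examples are invented for the banking scenario:

```json
{"text": "My card was declined at an ATM this morning.", "label": "card_issues"}
{"text": "I can't log in to the mobile app after changing my phone.", "label": "account_access"}
{"text": "How long does an international transfer usually take?", "label": "transfers"}
```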
CSV format
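The same invented examples as a CSV file, with a header row naming the assumed `text` and `label` columns:

```csv
text,label
"My card was declined at an ATM this morning.",card_issues
"I can't log in to the mobile app after changing my phone.",account_access
"How long does an international transfer usually take?",transfers
```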
Unstructured dataset (optional)
The unstructured dataset guides the teacher model in generating diverse, domain-specific data. It can be documentation, unlabelled examples, or even industry literature that contains relevant domain knowledge.
For our banking problem, we will use unlabelled customer requests as context for generating new examples.
The expected format is CSV or JSONL with a single column (context):
JSONL format
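A sketch of a JSONL file with the single `context` column from the description above; the unlabelled customer requests are invented for illustration:

```json
{"context": "Hello, I tried to send a wire transfer yesterday and it still shows as pending."}
{"context": "The app keeps logging me out every time I open the statements page."}
```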
CSV format
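The same invented requests as a CSV file with a single `context` column:

```csv
context
"Hello, I tried to send a wire transfer yesterday and it still shows as pending."
"The app keeps logging me out every time I open the statements page."
```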
Configuration file
The configuration file specifies the task type and training parameters.
The expected format is YAML:
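A minimal sketch of such a file; the key names below (`task_type`, `training`, and the parameters under it) are illustrative assumptions — consult the configuration reference linked below for the exact schema:

```yaml
# Hypothetical keys; the real schema is defined in the configuration reference
task_type: classification   # the kind of task this job trains for
training:
  num_epochs: 3             # example training parameters
  learning_rate: 1.0e-5
```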
For additional configuration options, see Configuration file →
