Schedulebot-nlu-engine

Model Description

This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a distilbert-base-uncased backbone and is capable of performing two tasks simultaneously:

Intent Classification: Identifying the user's primary goal (e.g., schedule, cancel).
Named Entity Recognition (NER): Extracting custom, domain-specific entities (e.g., appointment_type).

Its distinguishing feature is the pair of custom classification heads: each is a small MLP rather than a single linear layer, a design intended to improve performance on nuanced inputs.

Model Architecture

The model uses a standard distilbert-base-uncased model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.

Base Model: distilbert-base-uncased
Classifier Heads: Each head is a Multi-Layer Perceptron (MLP) with the following structure, which allows for more complex feature interpretation:
1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
2. A GELU activation function.
3. A Dropout layer with a rate of 0.3 for regularization.
4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
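The four layers listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `MLPHead`, the use of the first token's hidden state for intent classification, and the label counts (9 intents, 7 BIO tags) are assumptions inferred from the rest of this card.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical head: Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, num_labels, hidden_size=768, intermediate_size=384, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, features):
        return self.net(features)


# Random tensor standing in for the encoder output: (batch, sequence length, hidden size)
hidden_states = torch.randn(2, 16, 768)

intent_head = MLPHead(num_labels=9)  # 9 intents listed in this card
ner_head = MLPHead(num_labels=7)     # assumed: "O" plus B-/I- tags for 3 entity types

# Assumption: intent is predicted from the first token, NER from every token
intent_logits = intent_head(hidden_states[:, 0])
ner_logits = ner_head(hidden_states)
```

Under these assumptions, `intent_logits` has one row of scores per utterance, while `ner_logits` has one row of scores per token.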

Intended Use

This model is intended to be the core NLU component of a conversational AI system for managing appointments.

For instructions on how to use the model, see the dedicated how_to_use.md file.

Training Data

The model was trained on the HASD (Hybrid Appointment Scheduling Dataset), a custom dataset built specifically for this task.

Source: The dataset is a hybrid of real-world conversational examples from clinc/clinc_oos (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
Balancing: To combat class imbalance, intents sourced from clinc/clinc_oos were down-sampled to a maximum of 150 examples each.
Augmentation: To increase data diversity for complex intents (schedule, reschedule, etc.), Contextual Word Replacement was used. A distilbert-base-uncased model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
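The balancing step described above can be sketched as a per-intent cap. This is a hedged illustration only: the function name `downsample_per_intent`, the `"intent"` field name, and the use of seeded random sampling are assumptions, since the card does not say how the 150 examples were selected.

```python
import random
from collections import defaultdict


def downsample_per_intent(examples, max_per_intent=150, seed=0):
    """Keep at most `max_per_intent` examples for each intent (hypothetical helper)."""
    rng = random.Random(seed)

    # Group examples by their intent label
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)

    balanced = []
    for group in by_intent.values():
        if len(group) > max_per_intent:
            # Assumption: the retained subset is drawn uniformly at random
            group = rng.sample(group, max_per_intent)
        balanced.extend(group)
    return balanced


# Toy data: one over-represented intent and one small intent
toy = [{"text": f"hello {i}", "intent": "greeting"} for i in range(400)]
toy += [{"text": f"goodbye {i}", "intent": "bye"} for i in range(20)]
balanced = downsample_per_intent(toy)
```

Intents that already have 150 examples or fewer pass through unchanged, so only the over-represented classes shrink.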

The dataset is available here.

Intents

The model is trained to recognize the following intents: schedule, reschedule, cancel, query_avail, greeting, positive_reply, negative_reply, bye, oos (out-of-scope).

Entities

The model is trained to recognize the following custom named entities: practitioner_name, appointment_type, appointment_id.
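The NER results in this card are reported per B- and I- tag, which suggests a BIO tagging scheme over the three entity types. A minimal sketch of how such a label set could be built is shown here; the helper name `build_bio_labels` and the ordering of the labels (and therefore the ids) are assumptions, not the mapping stored in the model's configuration.

```python
ENTITY_TYPES = ["appointment_id", "appointment_type", "practitioner_name"]


def build_bio_labels(entity_types):
    """Return "O" followed by a B- and an I- tag for each entity type (hypothetical ordering)."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}
```

With three entity types this yields seven token-level classes, matching the seven rows of the NER results table.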

Training Procedure

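The model was trained using a two-stage fine-tuning strategy to ensure stability and performance: the new heads are trained first on top of a frozen backbone, and the whole model is fine-tuned afterwards.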

Stage 1: Training the Classifier Heads

The distilbert-base-uncased base model was entirely frozen.
Only the randomly initialized MLP heads for intent and NER classification were trained.
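Freezing the backbone amounts to disabling gradients for its parameters, which can be sketched as follows. The attribute names (`backbone`, `intent_head`, `ner_head`) and the tiny stand-in model are hypothetical; a plain linear layer replaces DistilBERT so the snippet runs without downloading weights.

```python
from torch import nn


class TinyMultiTaskModel(nn.Module):
    """Stand-in for the real model; attribute names are assumptions."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(768, 768)   # placeholder for the DistilBERT encoder
        self.intent_head = nn.Linear(768, 9)  # placeholder for the intent MLP head
        self.ner_head = nn.Linear(768, 7)     # placeholder for the NER MLP head


def set_backbone_trainable(model, trainable):
    """Enable or disable gradient updates for every backbone parameter."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable


model = TinyMultiTaskModel()

# Stage 1: freeze the backbone so only the heads receive gradient updates
set_backbone_trainable(model, False)
trainable_stage1 = [name for name, p in model.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the backbone before fine-tuning the whole model
set_backbone_trainable(model, True)
trainable_stage2 = [name for name, p in model.named_parameters() if p.requires_grad]
```

Listing the trainable parameter names before each stage is a cheap way to confirm that the intended parts of the model are being updated.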

Setup:

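from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Define a data collator to handle padding for token classification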
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,               # Training epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,                 # Learning Rate
    weight_decay=1e-5,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)]
)

Stage 2: Fine-Tuning

The DistilBERT backbone was entirely unfrozen.
The whole model was then trained with a very low learning rate (1e-6), so that the backbone can adapt to the new data while preserving its general-purpose pretrained knowledge.

Setup:

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,               # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,                 # Learning Rate
    weight_decay=1e-3,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
)

Evaluation

The model was evaluated on a held-out test set, and its performance was measured for both tasks.

Intent Classification Performance

Intent	Precision	Recall	F1-Score	Support
bye	0.9500	0.8261	0.8837	23
cancel	0.9211	0.8434	0.8805	83
greeting	0.9545	0.9545	0.9545	22
negative_reply	0.9091	0.9091	0.9091	22
oos	1.0000	0.8696	0.9302	23
positive_reply	0.7407	0.9091	0.8163	22
query_avail	0.9620	0.9383	0.9500	81
reschedule	0.8506	0.8916	0.8706	83
schedule	0.8488	0.9125	0.8795	80
---	---	---	---	----
Accuracy			0.8952	439
Macro Avg	0.9041	0.8949	0.8972	439
Weighted Avg	0.8998	0.8952	0.8960	439
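As a worked check of how the two averaged rows relate to the per-intent rows, the snippet below recomputes the macro and weighted F1 from the F1-Score and Support columns of the table above. The variable names are illustrative; the numbers are copied from the table.

```python
# (F1-score, support) per intent, copied from the table above
per_intent = {
    "bye": (0.8837, 23),
    "cancel": (0.8805, 83),
    "greeting": (0.9545, 22),
    "negative_reply": (0.9091, 22),
    "oos": (0.9302, 23),
    "positive_reply": (0.8163, 22),
    "query_avail": (0.9500, 81),
    "reschedule": (0.8706, 83),
    "schedule": (0.8795, 80),
}

total_support = sum(support for _, support in per_intent.values())

# Macro average: every intent counts equally, regardless of how many examples it has
macro_f1 = sum(f1 for f1, _ in per_intent.values()) / len(per_intent)

# Weighted average: each intent's F1 is weighted by its share of the test set
weighted_f1 = sum(f1 * support for f1, support in per_intent.values()) / total_support

print(f"support={total_support} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

The macro average treats the small classes (about 22 examples each) on par with the large scheduling intents, whereas the weighted average is dominated by the four intents with roughly 80 examples each.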

NER (Token Classification) Performance

Entity	Precision	Recall	F1-Score	Support
B-appointment_id	1.0000	1.0000	1.0000	61
B-appointment_type	0.8646	0.7477	0.8019	111
B-practitioner_name	0.9161	0.9467	0.9311	150
I-appointment_id	0.9667	0.9667	0.9667	210
I-appointment_type	0.8182	0.7368	0.7754	171
I-practitioner_name	0.9540	0.8941	0.9231	255
O	0.9782	0.9892	0.9837	3813
---	---	---	---	----
Accuracy			0.9673	4771
Macro Avg	0.9283	0.8973	0.9117	4771
Weighted Avg	0.9664	0.9673	0.9666	4771

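On this test set, the model reaches 0.8952 accuracy (macro F1 0.8972) on intent classification and 0.9673 token-level accuracy (macro F1 0.9117) on NER. appointment_id is recognized almost perfectly, while appointment_type is the weakest entity, with F1 scores of 0.8019 for B- tags and 0.7754 for I- tags.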

Limitations and Bias

The model's performance is highly dependent on the quality and scope of the HASD dataset. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
The model inherits any biases present in the distilbert-base-uncased model and the clinc/clinc_oos dataset.
