# schedulebot-nlu-engine

## Model Description

This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a `distilbert-base-uncased` backbone and is capable of performing two tasks simultaneously:
- **Intent Classification:** Identifying the user's primary goal (e.g., `schedule`, `cancel`).
- **Named Entity Recognition (NER):** Extracting custom, domain-specific entities (e.g., `appointment_type`).
This model stands out for its custom classification heads, which use a deeper, multi-layer architecture to improve performance on nuanced tasks.
## Model Architecture

The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.
- **Base Model:** `distilbert-base-uncased`
- **Classifier Heads:** Each head is a Multi-Layer Perceptron (MLP) with the following structure (sketched in code below) to allow for more complex feature interpretation:
  - A `Linear` layer projecting the transformer's output dimension (768) to an intermediate size (384).
  - A `GELU` activation function.
  - A `Dropout` layer with a rate of 0.3 for regularization.
  - A final `Linear` layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
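A minimal PyTorch sketch of one such head, for illustration only (the class name and argument defaults are assumptions, not the repository's exact implementation):

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """One classification head: Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, hidden_size: int = 768, intermediate_size: int = 384,
                 num_labels: int = 9, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The intent head would consume the first ([CLS]-style) token vector,
        # while the NER head is applied at every token position.
        return self.net(hidden_states)
```

The dropout between the two linear layers regularizes the randomly initialized heads during Stage 1 of training, when they are the only trainable parameters.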
## Intended Use
This model is intended to be the core NLU component of a conversational AI system for managing appointments.
For instructions on how to use the model, see the dedicated `how_to_use.md` file.
## Training Data
The model was trained on the HASD (Hybrid Appointment Scheduling Dataset), a custom dataset built specifically for this task.
- **Source:** The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
- **Balancing:** To combat class imbalance, intents sourced from `clinc/clinc_oos` were down-sampled to a maximum of 150 examples each.
- **Augmentation:** To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), Contextual Word Replacement was used: a `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms (a sketch of this technique follows below).
The dataset is available here.
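The card does not name the augmentation tooling; one common way to implement Contextual Word Replacement is the `nlpaug` library (an assumption here), which uses a masked language model to propose in-context substitutes:

```python
# Illustrative sketch only: nlpaug is an assumed tool choice, and the
# stopword list is hypothetical (placeholder words must survive augmentation).
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="distilbert-base-uncased",     # the same backbone named above
    action="substitute",                      # replace words rather than insert new ones
    stopwords=["reschedule", "appointment"],  # hypothetical protected words
)

template = "I would like to reschedule my appointment with the dentist"
print(aug.augment(template))  # returns a list of augmented variants
```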
### Intents
The model is trained to recognize the following intents:
`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).
### Entities
The model is trained to recognize the following custom named entities:
`practitioner_name`, `appointment_type`, `appointment_id`.
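For orientation, the two label inventories as Python lists; the NER tags follow the BIO scheme visible in the evaluation tables, and the authoritative index-to-label order lives in the model's `config.json` (the ordering below is illustrative):

```python
intent_labels = [
    "schedule", "reschedule", "cancel", "query_avail", "greeting",
    "positive_reply", "negative_reply", "bye", "oos",
]

# BIO scheme: B- marks an entity's first token, I- a continuation, O everything else.
ner_labels = [
    "O",
    "B-practitioner_name", "I-practitioner_name",
    "B-appointment_type", "I-appointment_type",
    "B-appointment_id", "I-appointment_id",
]

id2label = dict(enumerate(intent_labels))  # illustrative; see config.json for the real mapping
```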
## Training Procedure
The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.
### Stage 1: Training the Classifier Heads
- The `distilbert-base-uncased` base model was entirely frozen (a sketch of the freezing step follows this list).
- Only the randomly initialized MLP heads for intent and NER classification were trained.
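A minimal sketch of the freezing step, assuming the backbone is exposed as `model.distilbert` (the standard attribute name in `transformers` DistilBERT models):

```python
# Stage 1: freeze every backbone parameter so that only the randomly
# initialized MLP heads receive gradient updates.
for param in model.distilbert.parameters():
    param.requires_grad = False
```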
Setup:
```python
from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Define a data collator to handle padding for token classification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,  # Training epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,  # Learning rate
    weight_decay=1e-5,  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,  # e.g. "andreaceto/schedulebot-nlu-engine"
    hub_strategy="end",
    hub_token=hf_token,  # Hugging Face access token
    report_to="tensorboard",  # TensorBoard to monitor training
)

# Create the Trainer (`model`, `tokenizer`, `processed_datasets` and
# `compute_metrics` are defined earlier in the training script)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
```
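Stage 1 then runs until early stopping (patience of 10 evaluation epochs on the validation loss) halts training:

```python
trainer.train()
```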
### Stage 2: Fine-Tuning
- The DistilBERT backbone was entirely unfrozen (see the sketch after this list).
- Using a very low learning rate (1e-6) lets the model adapt further to the new data while preserving its powerful, general-purpose language knowledge.
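The mirror image of the Stage 1 sketch, under the same `model.distilbert` naming assumption:

```python
# Stage 2: unfreeze the backbone so the whole network is fine-tuned
# at a very low learning rate (1e-6, per the setup below).
for param in model.distilbert.parameters():
    param.requires_grad = True
```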
Setup:
```python
# Define Training Arguments (imports as in Stage 1)
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,  # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,  # Learning rate
    weight_decay=1e-3,  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard",  # TensorBoard to monitor training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
```
## Evaluation
The model was evaluated on a held-out test set, and its performance was measured for both tasks.
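Assuming the processed dataset exposes a `test` split (a hypothetical key name), the held-out evaluation can be run through the trainer:

```python
# Hypothetical: assumes processed_datasets contains a "test" split.
test_metrics = trainer.evaluate(eval_dataset=processed_datasets["test"])
print(test_metrics)
```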
### Intent Classification Performance
| Intent | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| bye | 0.9500 | 0.8261 | 0.8837 | 23 |
| cancel | 0.9211 | 0.8434 | 0.8805 | 83 |
| greeting | 0.9545 | 0.9545 | 0.9545 | 22 |
| negative_reply | 0.9091 | 0.9091 | 0.9091 | 22 |
| oos | 1.0000 | 0.8696 | 0.9302 | 23 |
| positive_reply | 0.7407 | 0.9091 | 0.8163 | 22 |
| query_avail | 0.9620 | 0.9383 | 0.9500 | 81 |
| reschedule | 0.8506 | 0.8916 | 0.8706 | 83 |
| schedule | 0.8488 | 0.9125 | 0.8795 | 80 |
| Accuracy | | | 0.8952 | 439 |
| Macro Avg | 0.9041 | 0.8949 | 0.8972 | 439 |
| Weighted Avg | 0.8998 | 0.8952 | 0.8960 | 439 |
### NER (Token Classification) Performance
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| B-appointment_id | 1.0000 | 1.0000 | 1.0000 | 61 |
| B-appointment_type | 0.8646 | 0.7477 | 0.8019 | 111 |
| B-practitioner_name | 0.9161 | 0.9467 | 0.9311 | 150 |
| I-appointment_id | 0.9667 | 0.9667 | 0.9667 | 210 |
| I-appointment_type | 0.8182 | 0.7368 | 0.7754 | 171 |
| I-practitioner_name | 0.9540 | 0.8941 | 0.9231 | 255 |
| O | 0.9782 | 0.9892 | 0.9837 | 3813 |
| Accuracy | | | 0.9673 | 4771 |
| Macro Avg | 0.9283 | 0.8973 | 0.9117 | 4771 |
| Weighted Avg | 0.9664 | 0.9673 | 0.9666 | 4771 |
On this dataset, the model achieves strong results on both tasks: a weighted F1 of about 0.97 on NER and about 0.90 on intent classification.
## Limitations and Bias
- The model's performance is highly dependent on the quality and scope of the HASD dataset. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.