Schedulebot-nlu-engine

Model Description

This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a distilbert-base-uncased backbone and is capable of performing two tasks simultaneously:

  • Intent Classification: Identifying the user's primary goal (e.g., schedule, cancel).
  • Named Entity Recognition (NER): Extracting custom, domain-specific entities (e.g., appointment_type).

Instead of a single linear layer per task, the model uses custom MLP classification heads (described under Model Architecture), which are intended to improve performance on nuanced inputs.

Model Architecture

The model uses a standard distilbert-base-uncased model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.

  • Base Model: distilbert-base-uncased
  • Classifier Heads: Each head is a Multi-Layer Perceptron (MLP) with the following structure:
    1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
    2. A GELU activation function.
    3. A Dropout layer with a rate of 0.3 for regularization.
    4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
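
The head structure listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the author's implementation: the class name `MLPHead`, the random tensor standing in for DistilBERT's output, and the choice to predict the intent from the first token are all assumptions. The label counts (9 intents, 7 NER tags) come from the intent list and the NER evaluation table in this card.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, num_labels, hidden_size=768, intermediate_size=384, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, hidden_states):
        # Works on (batch, hidden) and on (batch, seq_len, hidden) inputs,
        # because nn.Linear acts on the last dimension.
        return self.net(hidden_states)


# Stand-in for the backbone's last hidden state: (batch, seq_len, hidden_size).
hidden = torch.randn(2, 16, 768)

intent_head = MLPHead(num_labels=9)  # 9 intents
ner_head = MLPHead(num_labels=7)     # O plus B-/I- tags for 3 entity types

# Assumption: the intent is read from the first token, NER from every token.
intent_logits = intent_head(hidden[:, 0])
ner_logits = ner_head(hidden)
```

Both heads share the same structure and differ only in the size of the final projection.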

Intended Use

This model is intended to be the core NLU component of a conversational AI system for managing appointments.

For usage instructions, see how_to_use.md.

Training Data

The model was trained on the HASD (Hybrid Appointment Scheduling Dataset), a custom dataset built specifically for this task.

  • Source: The dataset is a hybrid of real-world conversational examples from clinc/clinc_oos (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
  • Balancing: To combat class imbalance, intents sourced from clinc/clinc_oos were down-sampled to a maximum of 150 examples each.
  • Augmentation: To increase data diversity for complex intents (schedule, reschedule, etc.), Contextual Word Replacement was used. A distilbert-base-uncased model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
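
The balancing and augmentation steps above can be sketched as follows. This is a hedged illustration under several assumptions: the example format (dicts with an `intent` key), the `{placeholder}` template syntax, and the `propose` callable are all hypothetical. In the real pipeline `propose` would wrap a masked-language model such as distilbert-base-uncased; here it is a stub so the sketch runs without downloading anything.

```python
import random
import re
from collections import defaultdict

# Assumed template syntax: slots look like {appointment_type}.
PLACEHOLDER = re.compile(r"\{\w+\}")


def downsample_per_intent(examples, max_per_intent=150, seed=0):
    """Keep at most `max_per_intent` randomly chosen examples per intent."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)
    balanced = []
    for items in by_intent.values():
        if len(items) > max_per_intent:
            items = rng.sample(items, max_per_intent)
        balanced.extend(items)
    return balanced


def augment_template(template, propose, rng):
    """Replace one non-placeholder word with an alternative from `propose`."""
    words = template.split()
    candidates = [i for i, w in enumerate(words) if not PLACEHOLDER.search(w)]
    if not candidates:
        return template
    i = rng.choice(candidates)
    # The masked sentence is what a fill-mask model would receive as input.
    masked = " ".join("[MASK]" if j == i else w for j, w in enumerate(words))
    words[i] = propose(masked, words[i])
    return " ".join(words)


# Stub proposer for illustration only: it just upper-cases the original word.
augmented = augment_template(
    "book a {appointment_type} with {practitioner_name}",
    propose=lambda masked, original: original.upper(),
    rng=random.Random(0),
)
```

Words that contain a placeholder are never selected for replacement, so the entity slots survive augmentation and can still be filled in afterwards.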

The dataset is available here.

Intents

The model is trained to recognize the following intents: schedule, reschedule, cancel, query_avail, greeting, positive_reply, negative_reply, bye, oos (out-of-scope).

Entities

The model is trained to recognize the following custom named entities: practitioner_name, appointment_type, appointment_id.
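
As an illustration, the label sets for the two tasks might be built as shown below. The BIO tag names match the rows of the NER evaluation table in this card, but the ordering of the labels, the id mappings, and the helper `bio_tags` are assumptions made for this sketch and may differ from the released model configuration.

```python
INTENTS = [
    "schedule", "reschedule", "cancel", "query_avail", "greeting",
    "positive_reply", "negative_reply", "bye", "oos",
]
ENTITIES = ["practitioner_name", "appointment_type", "appointment_id"]

# BIO scheme: "O" for tokens outside any entity, then B-/I- tags per entity.
NER_LABELS = ["O"] + [f"{prefix}-{entity}" for entity in ENTITIES for prefix in ("B", "I")]

intent2id = {label: i for i, label in enumerate(INTENTS)}
ner_label2id = {label: i for i, label in enumerate(NER_LABELS)}


def bio_tags(tokens, spans):
    """Tag tokens given (start, end, entity) spans, with `end` exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, entity in spans:
        tags[start] = f"B-{entity}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity}"
    return tags


tokens = ["book", "a", "dental", "cleaning", "with", "dr", "smith"]
tags = bio_tags(tokens, [(2, 4, "appointment_type"), (5, 7, "practitioner_name")])
```

With three entity types, the BIO scheme yields seven token labels, which is the number of classes the NER head has to predict.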

Training Procedure

The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.

Stage 1: Training the Classifier Heads

  • The distilbert-base-uncased base model was entirely frozen.
  • Only the randomly initialized MLP heads for intent and NER classification were trained.
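
Freezing the backbone typically comes down to toggling `requires_grad` on its parameters. The sketch below uses a toy module in place of the real model: the class `ToyMultiTaskModel`, its attribute names, and the helper functions are hypothetical, and the single linear layer standing in for DistilBERT is only there so the example runs without downloading weights.

```python
from torch import nn


class ToyMultiTaskModel(nn.Module):
    """Tiny stand-in: one 'backbone' layer plus two task heads."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(768, 768)  # placeholder for DistilBERT
        self.intent_head = nn.Linear(768, 9)
        self.ner_head = nn.Linear(768, 7)


def set_backbone_trainable(model, trainable):
    """Freeze (False) or unfreeze (True) every backbone parameter."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable


def trainable_names(model):
    return [name for name, p in model.named_parameters() if p.requires_grad]


model = ToyMultiTaskModel()

set_backbone_trainable(model, False)   # Stage 1: only the heads learn
stage1_params = trainable_names(model)

set_backbone_trainable(model, True)    # Stage 2: everything learns
stage2_params = trainable_names(model)
```

Parameters with `requires_grad=False` receive no gradients, so the optimizer leaves them unchanged during Stage 1.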

Setup:

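from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Note: model, tokenizer, processed_datasets, compute_metrics, hub_model_id and
# hf_token are assumed to be defined earlier in the training script.

# Define a data collator to handle padding for token classification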
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,               # Training epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,                 # Learning Rate
    weight_decay=1e-5,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)]
)
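
The effect of the `early_stopping_patience` setting can be illustrated with a simplified, pure-Python model of patience-based early stopping on validation loss. This is a sketch of the general idea, not the library's code: the function name and the loss values are made up for illustration, and it assumes the default improvement threshold of zero.

```python
def early_stopping_epoch(eval_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.65]

stop_with_patience_3 = early_stopping_epoch(losses, patience=3)    # stops at epoch 6
stop_with_patience_10 = early_stopping_epoch(losses, patience=10)  # never triggers here
```

With `load_best_model_at_end=True`, the checkpoint with the lowest validation loss is restored after stopping, so epochs trained after the best one do not end up in the final model.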

Stage 2: Fine-Tuning

  • The DistilBERT backbone was entirely unfrozen.
  • A very low learning rate (1e-6) was used so that the backbone can adapt to the new data while preserving its general-purpose pretrained knowledge.

Setup:

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,               # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,                 # Learning Rate
    weight_decay=1e-3,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Select the best checkpoint by validation loss
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
)

Evaluation

The model was evaluated on a held-out test set, and its performance was measured for both tasks.

Intent Classification Performance

Intent            Precision   Recall   F1-Score   Support
--------------------------------------------------------
bye               0.9500      0.8261   0.8837     23
cancel            0.9211      0.8434   0.8805     83
greeting          0.9545      0.9545   0.9545     22
negative_reply    0.9091      0.9091   0.9091     22
oos               1.0000      0.8696   0.9302     23
positive_reply    0.7407      0.9091   0.8163     22
query_avail       0.9620      0.9383   0.9500     81
reschedule        0.8506      0.8916   0.8706     83
schedule          0.8488      0.9125   0.8795     80
--------------------------------------------------------
Accuracy                               0.8952     439
Macro Avg         0.9041      0.8949   0.8972     439
Weighted Avg      0.8998      0.8952   0.8960     439
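
To make the difference between the macro and weighted averages concrete, the sketch below recomputes both F1 averages from the per-intent rows of the table above. The F1 scores and support counts are copied from that table; the function names are illustrative only.

```python
# (F1-score, support) per intent, copied from the intent classification table.
intent_f1 = {
    "bye": (0.8837, 23),
    "cancel": (0.8805, 83),
    "greeting": (0.9545, 22),
    "negative_reply": (0.9091, 22),
    "oos": (0.9302, 23),
    "positive_reply": (0.8163, 22),
    "query_avail": (0.9500, 81),
    "reschedule": (0.8706, 83),
    "schedule": (0.8795, 80),
}


def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(value for value, _ in scores.values()) / len(scores)


def weighted_average(scores):
    """Mean weighted by support: frequent classes count more."""
    total = sum(support for _, support in scores.values())
    return sum(value * support for value, support in scores.values()) / total


macro_f1 = macro_average(intent_f1)
weighted_f1 = weighted_average(intent_f1)
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
# prints "macro F1 = 0.8972, weighted F1 = 0.8960"
```

Both values agree with the Macro Avg and Weighted Avg rows reported in the table. The two averages are close here because the down-sampled intents keep the class sizes within roughly a factor of four of each other.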

NER (Token Classification) Performance

Entity                 Precision   Recall   F1-Score   Support
-------------------------------------------------------------
B-appointment_id       1.0000      1.0000   1.0000     61
B-appointment_type     0.8646      0.7477   0.8019     111
B-practitioner_name    0.9161      0.9467   0.9311     150
I-appointment_id       0.9667      0.9667   0.9667     210
I-appointment_type     0.8182      0.7368   0.7754     171
I-practitioner_name    0.9540      0.8941   0.9231     255
O                      0.9782      0.9892   0.9837     3813
-------------------------------------------------------------
Accuracy                                    0.9673     4771
Macro Avg              0.9283      0.8973   0.9117     4771
Weighted Avg           0.9664      0.9673   0.9666     4771

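On this test set, the model reaches 0.9673 token-level accuracy and a macro F1 of 0.9117 on the NER task, and 0.8952 accuracy and a macro F1 of 0.8972 on intent classification. The NER accuracy is dominated by the O class (3813 of 4771 tokens). The weakest classes are the appointment_type entity (F1 of 0.8019 for B- tags and 0.7754 for I- tags) and the positive_reply intent (precision of 0.7407).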

Limitations and Bias

  • The model's performance is highly dependent on the quality and scope of the HASD dataset. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
  • The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
  • The model inherits any biases present in the distilbert-base-uncased model and the clinc/clinc_oos dataset.