
MITI 4.2 Not Coded Classifier

Model Description

This binary classifier automates the first step of the Motivational Interviewing Treatment Integrity (MITI) 4.2 coding process: determining whether a therapist utterance should be coded or excluded from analysis.

What is MITI 4.2?

The Motivational Interviewing Treatment Integrity (MITI) 4.2 is a behavioral coding system used to assess fidelity to Motivational Interviewing (MI). In MITI coding:

  1. First step: Coders review each therapist utterance and decide if it should be coded or not coded

    • Coded utterances: Substantive therapist statements that warrant behavioral coding
    • Not coded utterances: Minimal facilitative responses (e.g., "mm-hmm", "okay", simple acknowledgments)
  2. Second step: Coded utterances are then classified into behavior codes (Giving Information, Persuade, Question, Simple Reflection, Complex Reflection, etc.)

This model automates the first step, providing a foundation for full MITI 4.2 automation.

Model Architecture

  • Base Model: Qwen/Qwen3-Embedding-0.6B
  • Task Type: Binary sequence classification
  • Input: Therapy session context + last therapist utterance
  • Output: Binary prediction (coded=1, not_coded=0)
  • Max Sequence Length: 3000 tokens
  • Training Framework: HuggingFace Transformers with Flash Attention 2
  • Precision: bfloat16 for efficient inference

Training Data

The model was trained on a multilabel classifier dataset with annotations from two coders:

Annotator Profiles

  • AJ (Expert Annotator): MI and MITI 4.2 trained expert with extensive coding experience
  • SJ (Beginner Annotator): Psychologist who has reached beginner-level proficiency, with an ICC (Intra-Class Correlation) > 0.86 against the expert annotator AJ

Data Splits

  • Training Set: 80% of data (stratified by label)
  • Validation Set: 10% of data (stratified by label)
  • Test Set: 10% of data (stratified by label)
  • Total Test Examples: 3,776 utterances
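
For reference, an 80/10/10 stratified split of this kind can be produced with scikit-learn; the toy sketch below only illustrates the idea (the DataFrame contents are placeholders, not the actual training data):

from sklearn.model_selection import train_test_split
import pandas as pd

# Toy stand-in for the dataset: one row per therapist utterance with a binary label (1=coded, 0=not_coded)
df = pd.DataFrame({
    "text": [f"utterance {i}" for i in range(100)],
    "label": [1] * 80 + [0] * 20,
})

# 80% train, then split the remaining 20% evenly into validation and test, stratified by label
train_df, holdout_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, stratify=holdout_df["label"], random_state=42)

print(len(train_df), len(val_df), len(test_df))  # 80 10 10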

Class Distribution

The dataset exhibits class imbalance (more coded than not coded utterances), which is addressed through:

  • Balanced class weighting in the loss function
  • Stratified train/val/test splits
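
For illustration, balanced class weights can be derived from the label counts and passed to the loss as in the sketch below (an assumed setup; the exact weighting used during training is not reproduced here):

import numpy as np
import torch

# Toy stand-in for the training labels (not_coded=0, coded=1); the real set is larger and imbalanced
train_labels = np.array([1] * 90 + [0] * 10)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c), so the rarer class gets the larger weight
counts = np.bincount(train_labels, minlength=2)
class_weights = torch.tensor(counts.sum() / (2 * counts), dtype=torch.float)

# Weighted cross-entropy loss over the two classes
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)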

Training Configuration

- Model: Qwen/Qwen3-Embedding-0.6B
- Batch Size: 12
- Learning Rate: 6e-5
- Epochs: 20 (with early stopping)
- Warmup Ratio: 0.1
- Weight Decay: 0.01
- LR Scheduler: Cosine
- Optimization: Weighted Cross-Entropy Loss
- Early Stopping: Patience=3, Threshold=0.001
- Metric for Model Selection: F1 Macro
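
For orientation, these hyperparameters map roughly onto HuggingFace TrainingArguments as sketched below (an illustrative mapping, not the released training script; argument names assume a recent transformers version):

from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="qwen_nc_classifier",
    per_device_train_batch_size=12,
    learning_rate=6e-5,
    num_train_epochs=20,
    warmup_ratio=0.1,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    eval_strategy="epoch",               # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",    # assumes compute_metrics reports an "f1_macro" value
    bf16=True,
)

# Early stopping: patience=3 evaluations, minimum improvement threshold=0.001
early_stopping = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)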

Performance Metrics

Test Set Results

Metric               Score
Overall Accuracy     97.99%
F1 Score (Macro)     94.79%
Precision (Macro)    96.51%
Recall (Macro)       93.23%

Per-Class Performance

Coded Class (Positive Class, Label=1)

  • F1 Score: 98.87%
  • Precision: 98.37%
  • Recall: 99.37%

Not Coded Class (Negative Class, Label=0)

  • F1 Score: 90.71%
  • Precision: 94.64%
  • Recall: 87.09%

Confusion Matrix (Test Set)

                Predicted
              Not_Coded  Coded
Actual
Not_Coded        371      55
Coded             21    3329

Interpretation:

  • True Negatives (Not Coded → Not Coded): 371 (87.09%)
  • False Positives (Not Coded → Coded): 55 (12.91%)
  • False Negatives (Coded → Not Coded): 21 (0.63%)
  • True Positives (Coded → Coded): 3329 (99.37%)

The model shows excellent performance in identifying coded utterances (99.37% recall), with lower recall for not coded utterances (87.09%). This bias toward coding is conservative and appropriate for clinical applications, as it is preferable to over-code than to miss substantive therapeutic statements.
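
These per-class figures follow directly from the confusion matrix counts above; a quick arithmetic check:

# Per-class metrics recomputed from the confusion matrix counts
tn, fp, fn, tp = 371, 55, 21, 3329

recall_coded = tp / (tp + fn)               # 3329 / 3350 ≈ 99.37%
precision_coded = tp / (tp + fp)            # 3329 / 3384 ≈ 98.37%
recall_not_coded = tn / (tn + fp)           # 371 / 426   ≈ 87.09%
precision_not_coded = tn / (tn + fn)        # 371 / 392   ≈ 94.64%
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 3700 / 3776 ≈ 97.99%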

Calibration and Optimal Thresholds

The model has been calibrated to find optimal decision thresholds for different use cases. While the default threshold of 0.5 works well, alternative thresholds can optimize specific performance metrics:

Threshold Recommendations

Based on validation set analysis (see calibration_results.json for full details):

Use Case        Threshold   F1 Macro   Recommendation
Default (0.5)   0.5000      94.65%     Currently used by the model
Max F1          0.2773      94.68%     Optimizes overall F1 score; marginal improvement over default
Balanced P/R    0.9805      94.33%     Equalizes precision and recall

Use Case Specific Recommendations

  • Training & Education: Use high recall threshold (lower threshold values) to catch all potentially codeable utterances, ensuring learners don't miss substantive statements
  • Research: Use Max F1 threshold (0.2773) or default (0.5) for optimal overall performance and balanced metrics
  • Quality Assurance: Use high precision threshold (higher threshold values) to minimize false positives and reduce manual review burden
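
One simple way to choose such thresholds is to sweep candidate values on held-out probabilities, as in the toy sketch below (val_probs and val_labels are placeholders; in practice they would come from the validation set):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Placeholder validation data: P(coded) per utterance and the corresponding gold labels
val_probs = np.array([0.05, 0.15, 0.30, 0.40, 0.70, 0.85, 0.92, 0.98])
val_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Lower thresholds favor recall of the coded class; higher thresholds favor precision
for t in np.arange(0.1, 1.0, 0.1):
    preds = (val_probs >= t).astype(int)
    p = precision_score(val_labels, preds, zero_division=0)
    r = recall_score(val_labels, preds, zero_division=0)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")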

How to Use Custom Thresholds

The model outputs probabilities for both classes. To use a custom threshold:

import torch

# Get probabilities (assumes `model` and `inputs` are prepared as in the Usage section below)
outputs = model(**inputs)
probabilities = torch.softmax(outputs.logits, dim=1)[0]
prob_coded = probabilities[1].item()

# Apply a custom threshold
threshold = 0.2773  # Max F1 threshold from calibration
prediction = "coded" if prob_coded >= threshold else "not_coded"

Note: The calibration analysis includes ROC curves, Precision-Recall curves, and per-annotator threshold analysis. See calibration_results.json for comprehensive metrics at various threshold values.

Using Annotator Information for Inference

During inference, you can specify which annotator style to emulate by including annotator information in the input:

Annotator Emulation

# Emulate expert annotator (AJ)
text = """Task: Decide if the last therapist utterance should be coded or not.
Annotated by: AJ
[Your context and utterance here]"""

# Emulate beginner annotator (SJ)
text = """Task: Decide if the last therapist utterance should be coded or not.
Annotated by: SJ
[Your context and utterance here]"""

Why this matters:

  • AJ emulation: Provides expert-level coding decisions aligned with extensive MI/MITI training
  • SJ emulation: Provides beginner-level proficiency coding with ICC > 0.86 correlation with expert
  • This allows users to choose the level of coding rigor appropriate for their use case (e.g., research vs. training)
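
For convenience, the annotator tag and the rest of the prompt can be assembled with a small helper like the hypothetical build_prompt below, which mirrors the format used in the Usage section:

def build_prompt(context: str, utterance: str, annotator: str = "AJ") -> str:
    """Assemble the classifier input in the format shown under Usage.

    annotator: "AJ" emulates the expert coder, "SJ" the beginner-proficiency coder.
    """
    return (
        "Task: Decide if the last therapist utterance should be coded or not.\n"
        f"Annotated by: {annotator}\n"
        "Context:\n"
        f"{context}\n\n"
        f"Last Therapist Utterance: {utterance}"
    )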

Usage

Basic Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/qwen_nc_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model.eval()

# Prepare input
context = """Patient: I've been trying to quit smoking but it's really hard.
Therapist: Tell me more about what makes it difficult.
Patient: Well, I smoke when I'm stressed at work.
Therapist: """

utterance = "Mm-hmm."

# Format with annotator info (emulate expert)
text = f"""Task: Decide if the last therapist utterance should be coded or not.
Annotated by: AJ
Context:
{context}

Last Therapist Utterance: {utterance}"""

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=3000)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1)[0]

# Interpret results
label_map = {0: "not_coded", 1: "coded"}
prediction = label_map[predicted_class]
confidence = probabilities[predicted_class].item()

print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.2%}")
print(f"Not Coded Probability: {probabilities[0]:.2%}")
print(f"Coded Probability: {probabilities[1]:.2%}")

Batch Inference

See demo_inference.py for a complete batch inference example with multiple utterances.
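
demo_inference.py is the reference implementation; the sketch below is only an illustration of how batched inference might look (padded tokenization, a single forward pass, prompts formatted as in Basic Inference):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Lekhansh/qwen_nc_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

if tokenizer.pad_token is None:  # ensure padding is possible for batching
    tokenizer.pad_token = tokenizer.eos_token

# Prompts formatted as in the Basic Inference example (toy placeholders)
texts = [
    "Task: Decide if the last therapist utterance should be coded or not.\nAnnotated by: AJ\nContext:\nPatient: I want to cut down on drinking.\n\nLast Therapist Utterance: Okay.",
    "Task: Decide if the last therapist utterance should be coded or not.\nAnnotated by: AJ\nContext:\nPatient: I want to cut down on drinking.\n\nLast Therapist Utterance: What would cutting down look like for you?",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=3000)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=1)

label_map = {0: "not_coded", 1: "coded"}
for text, p in zip(texts, probs):
    print(f"{label_map[int(p.argmax())]}  (coded probability: {p[1].item():.2%})")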

Limitations and Considerations

  1. Training Data Scope: Model trained on specific therapy session formats; performance may vary with different conversational structures
  2. Context Dependency: Model relies on conversational context; single utterances without context may yield less reliable predictions
  3. Class Imbalance Effects: Higher recall for coded class (99.37%) vs not coded class (87.09%) reflects training data distribution
  4. Annotator Variance: While ICC > 0.86 indicates good agreement, some coding decisions remain subjective
  5. Domain Specificity: Optimized for Motivational Interviewing; may not generalize to other therapeutic modalities

Clinical Applications

This model can be used for:

  • Training and Education: Providing immediate feedback to MI learners on utterance coding
  • Quality Assurance: Automated pre-screening of therapy sessions before manual MITI coding
  • Research: Large-scale analysis of MI fidelity across multiple sessions and practitioners
  • Supervision: Assisting supervisors in reviewing trainee sessions efficiently

Citation

If you use this model in your research, please cite:

@software{miti_not_coded_classifier,
  title={MITI 4.2 Not Coded Classifier},
  author={Lekhansh},
  year={2026},
  url={https://huggingface.co/Lekhansh/qwen_nc_classifier}
}

License

Contact

For questions, issues, or collaboration opportunities, please contact drlekhansh@gmail.com.

Acknowledgments

  • MITI 4.2 coding system developed by the Motivational Interviewing Network of Trainers (MINT)
  • Base model: Qwen/Qwen3-Embedding-0.6B by Alibaba Cloud
  • Annotators: AJ (expert) and SJ (beginner proficiency, ICC > 0.86)