---
language:
- pt
license: cc-by-nc-nd-4.0
colorTo: blue
sdk: docker
app_port: 8501
tags:
- text-classification
- multilabel-classification
- portuguese
- administrative-documents
- stacking
- ensemble-learning
- bert
- tfidf
library_name: scikit-learn
base_model:
- neuralmind/bert-base-portuguese-cased
---

# CouncilTopics-PT: A multi-label classifier for Portuguese municipal meeting topics

## Model Description

**CouncilTopics-PT** is a multilabel classification system that assigns topics from 22 categories to text from Portuguese municipal council meeting minutes. The model is a stacked ensemble: 12 base learners, built on both lexical and contextual representations, feed their predictions to a meta-learner that combines them to improve generalization and robustness to linguistic variation in meeting minutes.

**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/PT-AdminDocs-Classifier)

### Key Features

- 🧠 **Meta-Learning**: Base-model predictions are combined using stacked generalization
- 📚 **12 Base Models**: 3 feature sets × 4 algorithms for robust predictions
- 🇵🇹 **Portuguese Optimized**: Built for Portuguese-language text
- 🏢 **22 Categories**: Covers the main topic areas of municipal administration
- 🎯 **Dynamic Thresholds**: A separately optimized decision threshold for each category

## Model Details

- **Architecture**: Stacking with Meta-Learning
- **Base Models**: 12 classifiers (Logistic Regression with C=1.0 and C=0.5, Random Forest, Gradient Boosting, each trained on 3 feature sets)
- **Feature Engineering**: TF-IDF + BERTimbau embeddings + Statistical features
- **Meta-Learner**: Combines the base-model predictions (stacked generalization)
- **Categories**: 22 Portuguese administrative topic labels
- **Training Method**: Cross-validation stacking with dynamic threshold optimization

## How It Works

CouncilTopics-PT operates in four stages:

1. **Feature Extraction**: Three complementary feature sets
   - TF-IDF vectorization (word and character n-grams)
   - BERTimbau embeddings from `neuralmind/bert-base-portuguese-cased`
   - Statistical text features

2. **Base Model Ensemble**: 12 classifiers trained on different feature combinations
   - Logistic Regression (two variants: C=1.0 and C=0.5)
   - Random Forest
   - Gradient Boosting

3. **Meta-Learning**: A meta-learner combines the base-model predictions using stacking

4. **Dynamic Thresholds**: Per-category optimized decision thresholds produce the multilabel output
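
The stacking step in stages 2 and 3 can be illustrated with a minimal scikit-learn sketch. It is not the training script for this model: the data is synthetic, only two base learners stand in for the twelve, and the choice of `OneVsRestClassifier`, five folds, and a logistic-regression meta-learner are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real features and labels (hypothetical data).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# Two base learners standing in for the 12 described above.
base_learners = {
    "LogReg_C1": OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000)),
    "LogReg_C05": OneVsRestClassifier(LogisticRegression(C=0.5, max_iter=1000)),
}

# Out-of-fold probabilities: every row is scored by a model that never saw it,
# so the meta-learner is not trained on leaked predictions.
oof = [
    cross_val_predict(model, X, Y, cv=5, method="predict_proba")
    for model in base_learners.values()
]
meta_features = np.hstack(oof)  # shape: (n_samples, n_classes * n_models)

# The meta-learner is fitted on the stacked out-of-fold probabilities.
meta_learner = OneVsRestClassifier(LogisticRegression(max_iter=1000))
meta_learner.fit(meta_features, Y)
meta_proba = meta_learner.predict_proba(meta_features)
```

At inference time the base learners, refitted on the full training set, produce the probabilities that the meta-learner consumes.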

## Usage

### Quick Start with Python

```python
import joblib
import numpy as np
import torch
from scipy.sparse import csr_matrix, hstack
from transformers import AutoModel, AutoTokenizer

# Load the model components
tfidf_vectorizer = joblib.load("int_stacking_tfidf_vectorizer.joblib")
meta_learner = joblib.load("int_stacking_meta_learner.joblib")
mlb_encoder = joblib.load("int_stacking_mlb_encoder.joblib")
base_models = joblib.load("int_stacking_base_models.joblib")
optimal_thresholds = np.load("int_stacking_optimal_thresholds.npy")

# Prepare text
text = """CONTRATO DE PRESTAÇÃO DE SERVIÇOS
Entre a Administração Pública Municipal e a empresa contratada,
fica estabelecido o presente contrato para prestação de serviços
de manutenção e conservação de vias públicas."""

# Extract TF-IDF features
tfidf_features = tfidf_vectorizer.transform([text])

# Extract BERTimbau embeddings (768 dimensions).
# ASSUMPTION: mean pooling over tokens, truncated to 512 tokens. Adjust this
# to match the pooling that was used when the base models were trained.
bert_name = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
bert = AutoModel.from_pretrained(bert_name).eval()
with torch.no_grad():
    enc = tokenizer([text], truncation=True, max_length=512, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state  # (1, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)
    bert_features = ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Each base model must receive the feature set it was trained on.
# ASSUMPTION: the combined set is TF-IDF columns followed by BERT columns.
features = {
    "TF-IDF": tfidf_features,
    "BERT": bert_features,
    "TF-IDF+BERT": hstack([tfidf_features, csr_matrix(bert_features)]).tocsr(),
}

# Generate base model predictions
base_predictions = np.zeros((1, len(mlb_encoder.classes_), 12))
model_idx = 0

for feat_name in ["TF-IDF", "BERT", "TF-IDF+BERT"]:
    for algo_name in ["LogReg_C1", "LogReg_C05", "GradBoost", "RandomForest"]:
        model_key = f"{feat_name}_{algo_name}"
        if model_key in base_models:
            model = base_models[model_key]
            pred = model.predict_proba(features[feat_name])
            base_predictions[0, :, model_idx] = pred[0]
            model_idx += 1

# Meta-learner prediction
meta_features = base_predictions.reshape(1, -1)
meta_pred = meta_learner.predict_proba(meta_features)[0]

# Apply dynamic thresholds (one per category)
predicted_labels = []
for i, (prob, threshold) in enumerate(zip(meta_pred, optimal_thresholds)):
    if prob > threshold:
        predicted_labels.append({
            "label": mlb_encoder.classes_[i],
            "probability": float(prob),
            "confidence": "high" if prob > 0.7 else "medium" if prob > 0.4 else "low",
        })

# Sort by probability
predicted_labels.sort(key=lambda x: x["probability"], reverse=True)
print("Predicted categories:", predicted_labels)
```

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |
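
As a small illustration of how category names map to the binary indicator vectors used in multilabel classification, the sketch below uses scikit-learn's `MultiLabelBinarizer` on a handful of the names from the table. It assumes the shipped encoder stores the Portuguese names; the subset of six categories and the two example documents are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# A subset of the 22 categories, chosen only for illustration.
categories = ["Ambiente", "Cultura", "Desporto", "Habitação", "Obras Públicas", "Saúde"]

# Fixing `classes` keeps the column order stable and explicit.
mlb = MultiLabelBinarizer(classes=categories)

# Two hypothetical documents: the first has two topics, the second has one.
Y = mlb.fit_transform([["Ambiente", "Obras Públicas"], ["Cultura"]])

# Map indicator rows back to label tuples.
decoded = mlb.inverse_transform(Y)
```

Each row of `Y` has one column per category, so a document with several topics simply has several ones in its row.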

## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.5486** | Macro-averaged F1 score |
| **F1-micro** | **0.7379** | Micro-averaged F1 score |
| **F1-weighted** | **0.742** | Weighted-averaged F1 score |
| **Accuracy** | **0.4259** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0426** | Label-wise error rate |
| **Average Precision (macro)** | **0.608** | Macro-averaged AP |
| **Average Precision (micro)** | **0.785** | Micro-averaged AP |
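
For readers who want to compute the same kinds of metrics on their own predictions, the following sketch shows the corresponding scikit-learn calls on a tiny hand-made example. The labels, scores, and the single 0.5 threshold are hypothetical and unrelated to the results reported above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    hamming_loss,
)

# Hypothetical ground truth and predicted probabilities: 4 documents, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array(
    [[0.9, 0.2, 0.8], [0.1, 0.7, 0.6], [0.8, 0.4, 0.3], [0.2, 0.1, 0.9]]
)
# A single 0.5 threshold for simplicity; the model itself uses one per category.
y_pred = (y_score > 0.5).astype(int)

metrics = {
    "F1-macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    "F1-micro": f1_score(y_true, y_pred, average="micro", zero_division=0),
    "F1-weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    # On indicator matrices, accuracy_score is subset accuracy (exact match).
    "Accuracy": accuracy_score(y_true, y_pred),
    "Hamming Loss": hamming_loss(y_true, y_pred),
    # Average precision is computed from the scores, not the thresholded labels.
    "AP-macro": average_precision_score(y_true, y_score, average="macro"),
    "AP-micro": average_precision_score(y_true, y_score, average="micro"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Subset accuracy only counts a document as correct when every label matches, which is why it is typically much lower than the F1 scores in multilabel settings.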

## Technical Architecture

### Base Model Ensemble
- **Feature Set 1**: TF-IDF (word + character n-grams)
- **Feature Set 2**: BERTimbau embeddings (768 dimensions)
- **Feature Set 3**: Combined TF-IDF + BERT features
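
A minimal sketch of how the three feature sets could be assembled is shown below. The n-gram ranges, the `char_wb` analyzer, and the column order of the combined matrix are assumptions, and a random matrix stands in for the 768-dimensional BERTimbau embeddings so the example runs without downloading the model.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical example sentences in the style of meeting minutes.
docs = [
    "Aprovação do orçamento municipal para o próximo ano.",
    "Requalificação da estrada municipal e obras públicas.",
    "Apoio a associações culturais e desportivas do concelho.",
]

# Feature Set 1: word and character n-gram TF-IDF, concatenated.
# The n-gram ranges here are illustrative, not the trained configuration.
tfidf = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])
X_tfidf = tfidf.fit_transform(docs)  # sparse matrix

# Feature Set 2: placeholder for BERTimbau embeddings (768 dimensions).
rng = np.random.default_rng(0)
X_bert = rng.normal(size=(len(docs), 768))

# Feature Set 3: TF-IDF columns followed by embedding columns.
X_combined = hstack([X_tfidf, csr_matrix(X_bert)]).tocsr()
```

Keeping the combined matrix sparse avoids densifying the TF-IDF block, which is usually far wider than the embedding block.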

### Algorithms per Feature Set
1. **Logistic Regression** (C=1.0)
2. **Logistic Regression** (C=0.5)
3. **Gradient Boosting Classifier**
4. **Random Forest Classifier**
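
The 3 × 4 grid of base models can be enumerated as in the sketch below. The key format follows the names used in the Quick Start code; wrapping each estimator in `OneVsRestClassifier` and every hyperparameter other than `C` are assumptions made for illustration.

```python
from itertools import product

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

feature_sets = ["TF-IDF", "BERT", "TF-IDF+BERT"]

# Factories, so every feature set gets its own fresh, unfitted estimator.
algorithms = {
    "LogReg_C1": lambda: LogisticRegression(C=1.0, max_iter=1000),
    "LogReg_C05": lambda: LogisticRegression(C=0.5, max_iter=1000),
    "GradBoost": lambda: GradientBoostingClassifier(random_state=0),
    "RandomForest": lambda: RandomForestClassifier(random_state=0),
}

# One multilabel classifier per (feature set, algorithm) pair.
base_models = {
    f"{feat}_{algo}": OneVsRestClassifier(make())
    for feat, (algo, make) in product(feature_sets, algorithms.items())
}

print(len(base_models), "base models:", sorted(base_models))
```

Each model would then be fitted on the feature matrix named in its key.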

### Meta-Learning Strategy
- **Cross-validation stacking** for robust meta-features
- **Weighted combination**: the final score is 70% meta-learner output + 30% simple ensemble
- **Dynamic threshold optimization** per category using differential evolution
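
The weighted combination and the per-category threshold search can be sketched as follows. This is an illustration under stated assumptions, not the training code: the data is randomly generated, the "simple ensemble" is taken to be the mean of the base-model probabilities, and per-category F1 is assumed as the objective for SciPy's `differential_evolution`.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical validation data: 300 documents, 3 categories.
y_true = rng.integers(0, 2, size=(300, 3))
# Hypothetical probabilities, loosely correlated with the labels.
meta_proba = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, y_true.shape), 0, 1)
base_proba = np.clip(
    0.5 * y_true + rng.normal(0.25, 0.25, (12, *y_true.shape)), 0, 1
)

# 70% meta-learner + 30% simple ensemble (assumed: mean over the 12 base models).
blended = 0.7 * meta_proba + 0.3 * base_proba.mean(axis=0)


def neg_f1(threshold, y, p):
    # differential_evolution minimizes, so return the negative F1.
    return -f1_score(y, (p > threshold[0]).astype(int), zero_division=0)


# One independent 1-D search per category.
thresholds = np.array([
    differential_evolution(
        neg_f1, bounds=[(0.05, 0.95)], args=(y_true[:, k], blended[:, k]), seed=0
    ).x[0]
    for k in range(y_true.shape[1])
])

predictions = (blended > thresholds).astype(int)
```

Differential evolution is a reasonable fit here because F1 is a step function of the threshold, so gradient-based optimizers have nothing to follow. In practice the thresholds should be tuned on held-out data rather than on the data used for the final evaluation.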

## Training Data

The model was trained on a curated dataset of Portuguese municipal council meeting minutes, including:
- Administrative contracts and agreements
- Environmental reports and assessments
- Traffic regulations and urban planning documents
- Public health and safety communications
- Cultural and educational program descriptions

## Limitations

- **Language Specificity**: Optimized for Portuguese administrative language
- **Domain Focus**: Best performance on municipal documents
- **Computational Requirements**: Requires significant memory for all model components
- **Threshold Sensitivity**: Performance depends on carefully tuned per-category thresholds
- **Class Imbalance**: Some categories may have lower precision due to limited training examples

## License

This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).