metadata
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- reasoning
- mathematics
- programming
- creative-writing
- chain-of-thought
- interpretability
- fairness
- security
- deployment
- sustainability
- monitoring
- plugin
Brello Thinking
Model Description
Brello Thinking is a language model developed by Epic Systems as part of the Brello AI family. Built on the Tencent Hunyuan base model, it specializes in deep reasoning, mathematical problem-solving, coding, and creative thinking, with enhanced chain-of-thought capabilities.
Key Features
Advanced Reasoning: Enhanced chain-of-thought with both fast and deep thinking modes
Mathematical Excellence: Strong at math and symbolic computation
Programming Prowess: Strong coding abilities across Python, JS, C++, SQL, and more
Long Context Understanding: Handles contexts of up to 256K tokens, including long documents and codebases
Creative Problem Solving: Generates new solutions and approaches
Multi-language Support: Fluent in English and Chinese, with robust cross-lingual transfer
1. Executive Summary
Brello Thinking v1.1.0 (2025-08-07) is a 1.8B-parameter causal language model engineered for complex reasoning, mathematics, and creative tasks. It combines ultra-long context, dual “fast”/“deep” thinking modes, and a plugin SDK for live tool integration. It is designed for safe, sustainable, and fair production deployments.
Highlights in this Release
Mixed-precision quantization (BF16 & INT8)
Plugin SDK (JSON-RPC, HMAC auth, dynamic tool routing)
Monitoring (Prometheus, Grafana, carbon tracking)
Sustainability Dashboard (gCO₂eq/token metrics, CodeCarbon SDK)
2. Model Architecture
| Component | Specification |
| --- | --- |
| Base Model | Tencent Hunyuan / EpicBrelloV1ForCausalLM |
| Parameters | 1.8B (BF16/INT8 quantization; LoRA adapters optional) |
| Context Window | 256,000 tokens (rotary cache, sliding window, eviction logic) |
| Attention | Grouped-Query + Multi-Head FlashAttention (16 heads, 4 KV heads) |
| Feed-Forward | Two-stage (SiLU → Linear → SiLU) with RMSNorm, hidden size 6144 |
| Depth | 32 transformer blocks + 4 “Safety Adapter” blocks |
| Adapters | LoRA for math, code, creative, and domain fine-tuning (10–18M params each) |
| Inference Modes | Autoregressive sampling (top-k, top-p), beam search, contrastive decoding |
| Sharding | ZeRO-3 / tensor-parallel / model-parallel combinations |
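To illustrate why grouped-query attention matters at a 256,000-token context, the following minimal sketch estimates the key/value cache size for one sequence. It assumes a head dimension of 128 (d_model 2048 divided by 16 heads, per the Hyperparameters section), BF16 storage at 2 bytes per value, and that only the 32 transformer blocks hold a cache; the function name and these simplifications are illustrative, not taken from the Brello implementation.

```python
def kv_cache_bytes(seq_len, num_layers=32, kv_heads=4, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
gqa = kv_cache_bytes(256_000)               # 4 KV heads (grouped-query attention)
mha = kv_cache_bytes(256_000, kv_heads=16)  # 16 KV heads (plain multi-head attention)
print(f"GQA: {gqa / GIB:.3f} GiB, MHA: {mha / GIB:.3f} GiB")
```

Under these assumptions, sharing each KV head across four query heads cuts the cache to a quarter of what full multi-head attention would need. Sliding-window eviction would reduce it further, which this sketch does not model.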
3. Training & Tuning
3.1 Pretraining Corpus
Web General: 400B tokens (CommonCrawl, CC-100, curated news)
Science/Technical: 50B tokens (arXiv, PubMed, patents)
Code: 20B tokens (public GitHub, CodeSearchNet, MBPP)
Multilingual: 30B tokens (Chinese, Spanish, German, Arabic)
Augmentations: 15% span corruption, zh–en back-translation, dynamic masking
3.2 Optimization
Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
LR Schedule: Linear warmup (10K steps), then cosine decay (500K steps)
Batch: 2M tokens/step, gradient accumulation ×8
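The warmup-then-cosine schedule listed above can be sketched as a function of the training step. The peak learning rate is not stated in this card, so `peak_lr=3e-4` is a placeholder; the sketch also assumes the 500K decay steps begin after the 10K warmup steps and that the rate decays to zero.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=10_000, decay_steps=500_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, capped at 1 after the schedule ends.
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 260_000, 510_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate reaches its peak at step 10,000, falls to half the peak midway through the decay phase, and stays at `min_lr` once the schedule is exhausted.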
3.3 Instruction/RLHF Tuning
Instruction Pairs: 1.2M human-annotated QA/reasoning examples
Reward Model: Dual human-preference ranking (5K raters, Elo)
Algorithm: PPO with a KL penalty (target KL=0.1) and reward clipping
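As a rough illustration of how a KL penalty and reward clipping typically combine in PPO-based RLHF, the sketch below shapes a reward-model score and adjusts the penalty coefficient toward a target KL. The adaptive controller follows the form commonly used in open-source RLHF code; whether Brello's training used this exact rule is an assumption, and the clip range, horizon, and batch size are placeholders.

```python
def shaped_reward(reward, logp_policy, logp_ref, beta, clip=5.0):
    """Clip the reward-model score, then subtract a per-sample KL penalty."""
    clipped = max(-clip, min(clip, reward))
    # Single-sample KL estimate: log-probability ratio of policy to reference model.
    kl_estimate = logp_policy - logp_ref
    return clipped - beta * kl_estimate


def update_beta(beta, observed_kl, target_kl=0.1, horizon=10_000, batch_size=64):
    """Raise beta when the observed KL exceeds the target, lower it otherwise."""
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return beta * (1.0 + error * batch_size / horizon)


beta = 0.1
print(shaped_reward(10.0, logp_policy=-1.0, logp_ref=-1.5, beta=beta))
print(update_beta(beta, observed_kl=0.2))
```

Clipping keeps outlier reward scores from dominating an update, while the KL term discourages the policy from drifting far from the reference model.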
4. Specialized Modules
| Adapter Name | Data Source | Params (M) | Use Case |
| --- | --- | --- | --- |
| math-adapter | GSM8K, MATH, AIME datasets | 12 | Math proof, step-by-step logic |
| code-adapter | MBPP, MultiPL-E, GitHub repos | 18 | Coding, debugging, codegen |
| creative-adapter | Gutenberg, story corpora | 10 | Narrative, dialogue, ideation |
5. Plugin & Tooling SDK
Interface: JSON-RPC (Unix socket or REST), HMAC-SHA256 auth
Plugins:
DB connectors: PostgreSQL, MySQL, Snowflake
HTTP client: retry/backoff
Vector DB: FAISS, Pinecone
Tool Call Example
Model emits: {"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}}
Host executes the plugin and returns: {"tool_result": {"forecast": "Sunny, 32°C"}}
Model resumes reasoning with the tool result in context.
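The tool-call exchange above can be sketched on the host side using only the Python standard library. Everything beyond the two JSON messages is hypothetical: the `sign` and `handle_tool_call` helpers, the plugin table, the canonical JSON encoding used for signing, and the demo key are illustrative choices, not the Brello SDK's actual interface.

```python
import hashlib
import hmac
import json


def sign(payload: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def weather_fetch(location: str) -> dict:
    # Stand-in plugin that returns the canned forecast from the example above.
    return {"forecast": "Sunny, 32°C"}


PLUGINS = {"weather_fetch": weather_fetch}  # hypothetical routing table


def handle_tool_call(message: dict, signature: str, key: bytes) -> dict:
    """Verify the signature, route the call to a plugin, and wrap the result."""
    if not hmac.compare_digest(sign(message, key), signature):
        raise PermissionError("HMAC signature mismatch")
    call = message["tool_call"]
    return {"tool_result": PLUGINS[call["name"]](**call["args"])}


key = b"demo-key"  # placeholder; a real deployment would load a secret
msg = {"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}}
result = handle_tool_call(msg, sign(msg, key), key)
print(json.dumps(result, ensure_ascii=False))
```

Using `hmac.compare_digest` rather than `==` avoids leaking timing information during signature comparison.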
6. Inference, Monitoring & Scaling
6.1 Endpoint Performance
| Mode | Batch | Seq Len | Throughput (tok/s) | Latency (p50) |
| --- | --- | --- | --- | --- |
| Fast-Think | 8 | 4,096 | 250,000 | 15 ms |
| Deep-Think | 1 | 256,000 | 18,000 | 120 ms |
| INT8 Quant | 16 | 2,048 | 320,000 | 12 ms |
6.2 Observability
Prometheus Metrics:
brello_inference_latency_seconds
brello_generated_tokens_total
brello_cache_evictions_total
Grafana:
Token latency histograms, CO₂ per generation
7. Sustainability & Carbon Tracking
Data Center PUE: 1.2
Carbon Emission: ~0.0008 gCO₂eq/token (tracked with CodeCarbon)
Offset: Epic Systems funds VER 2.0 credits
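For a sense of scale, the per-token figure above can be turned into per-request and per-million-token estimates. This is plain arithmetic on the quoted ~0.0008 gCO₂eq/token; whether that figure already includes the PUE of 1.2 is not stated, so the sketch applies no further multiplier.

```python
G_CO2EQ_PER_TOKEN = 0.0008  # approximate figure quoted in this section


def emissions_grams(num_tokens, grams_per_token=G_CO2EQ_PER_TOKEN):
    """Estimated emissions in grams CO2eq for a given number of generated tokens."""
    return num_tokens * grams_per_token


print(f"{emissions_grams(512):.4f} g")                # one 512-token response
print(f"{emissions_grams(1_000_000) / 1000:.1f} kg")  # one million tokens
```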
8. Robustness, Safety & Fairness
Adapters: Real-time adversarial input filtering, personal data redaction, toxicity classifier (fine-tuned BERT-tox)
Bias Audits:
Toxicity variation < 1.8% across 12 demographic axes
Gender parity ±2%
Dialect coverage 98% (EN & ZH)
9. Interpretability
Chain-of-Thought logs: Token-level reasoning trace
Integrated Gradients: Span attribution
Attention Rollouts: Layer-wise visualization (custom plugin)
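To make the Integrated Gradients entry concrete, here is a minimal, framework-free sketch of the technique on a toy function. It is not Brello's attribution tooling: the toy function, its analytic gradient, and the all-zeros baseline are illustrative assumptions. A real setup would differentiate the model's output with respect to input embeddings.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG attributions with a midpoint Riemann sum along the straight path."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


WEIGHTS = [1.0, 2.0, 0.5]  # toy "model": f(v) = sum(w_i * v_i^2)


def f(v):
    return sum(w * vi * vi for w, vi in zip(WEIGHTS, v))


def grad_f(v):
    return [2.0 * w * vi for w, vi in zip(WEIGHTS, v)]


x = [1.0, -2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, sum(attributions), f(x) - f(baseline))
```

The attributions sum to the difference between the function value at the input and at the baseline, which is the completeness property that makes the method useful for span-level attribution.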
10. Hyperparameters
| Parameter | Value |
| --- | --- |
| num_layers | 32 |
| d_model | 2048 |
| d_hidden | 6144 |
| num_heads | 16 |
| kv_heads | 4 |
| rotary_pct | 0.25 |
| lr_warmup_steps | 10,000 |
| weight_decay | 0.01 |
| batch_size | 2M tokens/step |
| dropout_rate | 0.1 |
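As a sanity check on the hyperparameters above, the following sketch estimates the non-embedding parameter count. It rests on several assumptions not confirmed by this card: bias-free projections, a gated feed-forward block with three d_model × d_hidden matrices, two RMSNorm weight vectors per layer, and no contribution from the embedding table or the Safety Adapter blocks.

```python
def non_embedding_params(num_layers=32, d_model=2048, d_hidden=6144,
                         num_heads=16, kv_heads=4, ffn_matrices=3):
    """Rough parameter count for the transformer blocks, excluding embeddings."""
    head_dim = d_model // num_heads
    # Query and output projections are d_model x d_model; key and value
    # projections are narrower because only kv_heads heads are stored.
    attn = 2 * d_model * d_model + 2 * d_model * kv_heads * head_dim
    ffn = ffn_matrices * d_model * d_hidden  # assumed gated feed-forward layout
    norms = 2 * d_model                      # two RMSNorm weight vectors per layer
    return num_layers * (attn + ffn + norms)


total = non_embedding_params()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the transformer blocks account for roughly 1.54B parameters, so the remainder of the stated 1.8B would have to come from components this sketch leaves out, such as the embedding table.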
11. Evaluation & Error Analysis
Benchmarks: GSM8K, MBPP, BBH, LongBench, MATH
Analysis: Math/logic confusion matrix, hallucination drift cluster analysis
12. Roadmap
| Version | Highlights | ETA |
| --- | --- | --- |
| v1.1.0 | Plugins, carbon tracking, INT8 quantization | Released |
| v1.2.0 | Vision-language, adapter expansion | Nov 2025 |
| v1.3.0 | Audio, multilingual tuning | Feb 2026 |
| v2.0 | Federated RAG, continuous learning | Q4 2026 |
13. Licensing & Compliance
License: Proprietary, Epic Systems
Privacy: GDPR, CCPA compliant
Certifications: ISO 27001, SOC 2 Type II, HIPAA (BAA on request)
Restrictions: No redistribution or large-scale rehosting
14. Usage Example
import os
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from brello_sdk import BrelloPluginManager
from codecarbon import EmissionsTracker
from prometheus_client import CollectorRegistry, Counter, Histogram


def setup_model(
    model_id: str = "BrelloES/brello-thinking",
    use_bf16: bool = True,
    load_int8: bool = True,
):
    """Load the tokenizer and base model, then attach the math and code LoRA adapters."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16 if use_bf16 else torch.float32,
        load_in_8bit=load_int8,  # requires the bitsandbytes package
    )
    # Wrap the model once, then load the second adapter into the same PeftModel.
    # "math" stays active; call model.set_adapter("code") to switch.
    model = PeftModel.from_pretrained(model, "adapters/math-adapter", adapter_name="math")
    model.load_adapter("adapters/code-adapter", adapter_name="code")
    return tokenizer, model


def setup_plugins():
    """Register the weather and database plugins with the plugin manager."""
    pm = BrelloPluginManager()
    pm.register(
        name="weather_fetch",
        path="/opt/brello/plugins/weather_plugin.so",
        auth_key=os.getenv("WEATHER_PLUGIN_KEY", "CHANGE_ME"),
    )
    pm.register(
        name="db_query",
        path="/opt/brello/plugins/db_query_plugin.so",
        auth_key=os.getenv("DB_PLUGIN_KEY", "CHANGE_ME"),
    )
    return pm


def setup_metrics():
    """Create a registry holding the latency histogram and the token counter."""
    registry = CollectorRegistry()
    latency = Histogram(
        "brello_inference_latency_seconds",
        "Inference latency (seconds) per request",
        registry=registry,
        buckets=(0.01, 0.05, 0.1, 0.2, 0.5, 1.0),
    )
    tokens = Counter(
        "brello_generated_tokens_total",
        "Total number of tokens generated by Brello",
        registry=registry,
    )
    return registry, latency, tokens


def generate_response(tokenizer, model, plugin_mgr, metrics, messages, mode: str = "deep"):
    """Generate one reply and return (text, emissions in kg CO2eq)."""
    latency, tokens = metrics
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",  # generate() needs tensors, not a list of token ids
        enable_thinking=(mode == "deep"),  # forwarded to the chat template
    ).to(model.device)
    os.makedirs("carbon_logs", exist_ok=True)
    tracker = EmissionsTracker(project_name="brello_inference", output_dir="carbon_logs")
    tracker.start()
    start = time.perf_counter()
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,  # top_p and temperature only take effect when sampling
        top_p=0.9,
        temperature=0.6,
        # Not a standard generate() argument; assumes Brello's model code accepts it.
        plugin_manager=plugin_mgr,
        return_dict_in_generate=True,
        output_scores=True,
    )
    latency.observe(time.perf_counter() - start)
    emissions_kg = tracker.stop()
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = outputs.sequences[0][inputs["input_ids"].shape[1]:]
    tokens.inc(len(new_tokens))
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return text, emissions_kg


def main():
    tokenizer, model = setup_model()
    plugin_mgr = setup_plugins()
    _registry, latency, tokens = setup_metrics()
    metrics = (latency, tokens)
    messages = [
        {"role": "system", "content": "You are Brello Thinking in Deep-Think mode."},
        {"role": "user", "content": "Explain why prime factorization is unique."},
    ]
    response, co2 = generate_response(tokenizer, model, plugin_mgr, metrics, messages, mode="deep")
    print("=== Deep-Think Output ===\n", response)
    print(f"CO₂ Emitted: {co2:.6f} kg")
    messages[0]["content"] = "You are Brello Thinking in Fast-Think mode."
    response_fast, co2_fast = generate_response(tokenizer, model, plugin_mgr, metrics, messages, mode="fast")
    print("\n=== Fast-Think Output ===\n", response_fast)
    print(f"CO₂ Emitted: {co2_fast:.6f} kg")


if __name__ == "__main__":
    main()
Creator: Epic Systems
Engineer: Rehan Temkar
Model: Brello Thinking v1.1.0
Brello Thinking - Advanced AI Reasoning by Epic Systems