INL Architecture - Integrator Neuron Layer

Production-ready neural architecture using Integrator Neuron dynamics - replaces traditional FFN layers with iterative dynamics. Universal architecture that works for any type of model: LLMs, vision transformers, multimodal, diffusion, RL policies, etc.

Architecture Features

Universal - Build LLMs, vision models, audio, multimodal, diffusion, RL agents with same architecture
HuggingFace ready - Drop-in replacement for FFN in any transformer
KV caching - Full support for efficient autoregressive generation
Adaptive compute - Auto-stops when converged (30-50% faster)
Parameter efficient - Shared controllers = 96% fewer params than FFN
Bio-inspired - Based on integrator neurons from neuroscience
Configurable - Tune iterations, controllers, equilibrium for your task

This Checkpoint

Example implementation: 1.1B parameter language model with INL architecture.

25 layers × 5 iterations/layer = rich iterative computation
But the architecture scales from 100M to 100B+ params
And works for any domain (language, vision, audio, etc.)

What is INL?

Traditional transformers use static feedforward layers:

x_out = x + FFN(x)  # One-shot computation

INL-LLM uses iterative integrator dynamics to find equilibrium:

# Each of the 25 layers performs 5 iterations (configurable)
# Total: 25 layers × 5 iterations = 125 computation steps
for iteration in range(num_iterations_per_layer):  # = 5
    error = x - mu  # Distance from learned equilibrium
    v_next = alpha * v + (1 - alpha) * v_target - beta * error
    x_next = x + dt * gate * v_next

Result: The model "thinks" iteratively like biological integrator neurons, achieving better parameter efficiency through shared dynamics and adaptive early stopping.

Model Details

Parameter	Value
Parameters	1.1B
d_model	1728
Layers	25
Attention heads	32
Iterations/layer	5 (configurable: more = better quality but slower)
Context length	2048
Vocabulary	50,261

Key Optimizations

Shared controllers: One controller shared across all 25 layers (96% fewer parameters)
Low-rank embeddings: 87% fewer embedding parameters
Adaptive stopping: Stops when converged (30-50% faster inference)
Sparse excitation: 90% sparsity for efficiency

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "/home/boris/vAgent/architecture/checkpoints/inl_11b_hf",
    trust_remote_code=True,
    torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("/home/boris/vAgent/architecture/checkpoints/inl_11b_hf")

# Generate with KV caching (default, much faster!)
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    use_cache=True  # Enable KV cache (default)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Format

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

chat_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(chat_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

Special tokens: <USER>, <ASSISTANT>, <SYSTEM>, <ERROR>

vLLM Serving

python -m vllm.entrypoints.openai.api_server \
    --model /home/boris/vAgent/architecture/checkpoints/inl_11b_hf \
    --trust-remote-code \
    --dtype bfloat16

Why Integrator Neurons?

Main benefit: Achieve similar quality with fewer parameters through parameter sharing and iterative refinement.

Parameter efficiency: One shared controller for all 25 layers (instead of 25 separate FFNs)
Adaptive computation: Stops iterating early when converged (faster inference)
Iterative refinement: Each layer "thinks" multiple times instead of one-shot computation
Interpretable: Can visualize how the model converges to solutions
Bio-inspired: Mimics integrator neurons found in neuroscience

Architecture Philosophy: Kubernetes vs Docker

A useful analogy: If traditional transformers (like Llama) are Docker containers, then INL architecture is Kubernetes orchestration.

Traditional Transformers = Docker

# Like a static Docker container
class LlamaLayer:
    def __init__(self):
        self.ffn = FeedForward()  # Isolated, fixed container

    def forward(self, x):
        return x + self.ffn(x)  # Single execution, predictable

Characteristics:

✅ Static - Each layer is a fixed image
✅ Isolated - Each FFN is independent (like separate containers)
✅ Predictable - Same compute every time
✅ Simple - One layer = one container doing its job once

INL Architecture = Kubernetes

# Like Kubernetes with dynamic orchestration
class INLLayer:
    def __init__(self, shared_controller):
        self.controller = shared_controller  # Shared control plane
        self.state = (x, v)  # StatefulSet

    def forward(self, x):
        # Dynamic orchestration with health checks
        for i in range(self.max_iterations):  # Like ReplicaSet
            # Health check (liveness probe)
            error = torch.norm(x - self.mu)
            if error < self.threshold:  # Converged
                break  # Auto-scaling down (HPA)

            # Update via shared controller (control plane)
            v_next = self.controller(x, v, error)
            x = x + self.dt * self.gate * v_next

        return x

Characteristics:

✅ Dynamic orchestration - Iterations adjust like pods
✅ Shared resources - Controllers = shared services/ConfigMaps
✅ Health checks - Convergence monitoring = liveness probes
✅ Auto-scaling - Adaptive stopping = Horizontal Pod Autoscaling
✅ State management - (x, v) state = StatefulSets
✅ Control plane - Shared controllers orchestrate all layers

The Kubernetes-INL Mapping

Kubernetes Concept	INL Equivalent	Purpose
Pod	One iteration	Ephemeral compute unit
ReplicaSet	`num_iterations`	How many "pods" to run
Deployment	INL Layer	Manages iteration lifecycle
Controller	Shared controller	Orchestrator for all layers
ConfigMap	`mu`, `v_target`	Shared learned configuration
Health Check	`‖error‖ < threshold`	Verify convergence
HPA	Adaptive stopping	Scale down when converged
StatefulSet	`(x, v)` state	Stateful compute across iterations
Service Mesh	Hierarchical equilibrium	Communication between groups
Namespace	One layer	Logical isolation
Control Plane	Shared controller network	Coordinates all layers

Why This Matters

Kubernetes revolutionized cloud computing by replacing static VMs with dynamic orchestration.

INL does the same for transformers by replacing static FFN layers with dynamically orchestrated iterative computation.

Benefits Comparison

Benefit	Kubernetes (Cloud)	INL (Neural Networks)
Efficiency	Bin packing, resource sharing	Parameter sharing (96% reduction)
Scalability	Horizontal pod scaling	Adaptive iterations (5-50)
Resilience	Self-healing, restarts	Convergence guarantees
Observability	Metrics, logs, traces	Energy tracking, convergence monitoring
Declarative	YAML manifests	config.json defines behavior
Resource optimization	Only use what you need	Only iterate until converged

Code Comparison

Llama (Docker-style): Static Resources

# 25 independent FFN "containers" - fixed resources
for layer in range(25):
    x = x + layer.ffn(x)  # Each layer is isolated
# Total: 25 × FFN_params = Lots of parameters

INL (Kubernetes-style): Orchestrated Resources

# 1 shared controller "control plane" - orchestrated resources
shared_controller = Controller()  # Single control plane

for layer in range(25):
    # Dynamic orchestration per layer
    for iteration in range(max_iterations):
        if converged():  # Health check
            break  # Auto-scale down
        x = layer.iterate(x, shared_controller)  # Shared resource

# Total: 1 × Controller_params + 25 × layer_params
# Result: 96% fewer parameters through orchestration

Real-World Impact

Aspect	Traditional (Docker-style)	INL (Kubernetes-style)
Resource allocation	Fixed, over-provisioned	Dynamic, right-sized
Utilization	Often <50%	Adaptive, 70-90%
Complexity	Simple but wasteful	Complex but efficient
Flexibility	Hard-coded	Configurable at runtime
Cost	High (redundant resources)	Low (shared resources)

The Philosophy

"Don't give each task its own server (FFN). Give them all access to a shared orchestration platform (shared controller) that allocates resources dynamically based on actual need."

Kubernetes orchestrates containers across a cluster. INL orchestrates iterations across a GPU.

Same philosophy, different substrate. Both achieve massive efficiency through intelligent orchestration rather than static resource allocation.

Practical Implications

Like K8s HPA: Model adapts compute to task difficulty
- Easy tokens: 2-3 iterations (like scaling down)
- Hard tokens: 8-10 iterations (like scaling up)
Like K8s ConfigMaps: Shared learned parameters
- One controller for all 25 layers
- One equilibrium config per layer
Like K8s Health Checks: Continuous monitoring
- Track convergence error
- Stop when quality threshold met

Like K8s Declarative Config: Behavior defined in config.json

{
  "num_iterations_per_layer": 5,  // replicas: 5
  "adaptive_stopping": true,       // autoscaling: enabled
  "shared_controllers": true       // shared control plane
}

This isn't just an analogy - it's a fundamental architectural pattern that works across domains: cloud infrastructure or neural networks. Orchestration beats static allocation.

Learn More

For detailed technical documentation about the INL architecture:

GitHub Repository: ARKITEKTURE_TRANSFORMER_ADL
Architecture Docs: See the repo for implementation details, training code, and benchmarks

Convergence Theorem

Mathematical Formulation

The INL architecture implements a discrete-time dynamical system that converges to a learned equilibrium point. For each layer:

error = x - mu                                          # (1)
v_next = alpha * v + (1 - alpha) * v_target - beta * error  # (2)
x_next = x + dt * gate * v_next                        # (3)

Theorem (Asymptotic Convergence):

Given the discrete dynamics above, if the following stability conditions hold:

Damping condition: 0 < alpha < 1
Restoring force: beta > 0
Time step bound: dt < 2/(beta * sqrt(1 - alpha²))
Gating: 0 ≤ gate ≤ 1

Then for any initial state (x₀, v₀), the system converges asymptotically to the equilibrium:

lim(n→∞) x_n = mu
lim(n→∞) v_n = v_target

Formally: ∀ε > 0, ∃N ∈ ℕ : ∀n > N ⟹ ||x_n - mu|| < ε

Proof Sketch

The system behaves as a damped harmonic oscillator in the embedding space:

Energy function: Define E(n) = ½||x_n - mu||² + ½||v_n - v_target||²
Energy decay: Under stability conditions, E(n+1) < E(n) for all n
Lower bound: E(n) ≥ 0 always
Conclusion: By monotone convergence theorem, E(n) → 0, thus x_n → mu

The proof follows from discrete Lyapunov stability analysis. The parameters alpha (damping), beta (restoring force), and dt (discretization step) control the convergence rate and oscillation behavior.

Convergence Modes

Regime	Condition	Behavior
Underdamped	`alpha² < 4betadt`	Oscillates then converges
Critically damped	`alpha² = 4betadt`	Fastest convergence (no overshoot)
Overdamped	`alpha² > 4betadt`	Slow monotonic convergence

Practical Implications

Hybrid Discrete-Continuous Approximation:

Discrete (finite iterations)  ←→  Continuous (infinite time)
        ↓                              ↓
    GPU-friendly                  Theoretical limit

5 iterations: Fast, 70-80% convergence quality
10 iterations: Balanced, 85-95% convergence
50+ iterations: Near-perfect, 98%+ convergence
∞ iterations: Theoretical guarantee (impractical)

Adaptive Early Stopping:

The architecture monitors ||error|| and stops when:

if ||x_n - mu|| < tolerance:  # Converged!
    break  # Save 30-50% compute

This makes the system both theoretically grounded (convergence guarantee) and practically efficient (adaptive compute).

Connection to Neural ODEs

In the continuous limit (dt → 0), the dynamics become:

dx/dt = gate * v
dv/dt = -(1-alpha)/dt * v + (1-alpha)/dt * v_target - beta * (x - mu)

This is a second-order ODE with learned equilibrium mu, combining:

Physics-inspired dynamics (momentum, damping, restoring force)
Learned target state (mu, v_target from neural network)

Why This Matters

Theoretical guarantees: Not just empirical - proven convergence
Interpretability: Physics-based dynamics are explainable
Robustness: Stable across wide parameter ranges
Efficiency: Can trade iterations for quality (5 for speed, 50 for precision)
Universal: Same convergence theory applies to all domains (text, vision, audio)

Empirical Stability Analysis

Stability Region Characterization

We performed extensive empirical analysis to validate the theoretical convergence guarantees and characterize the practical stability region. The analysis explores the parameter space of alpha (damping) and p = dt * g * beta (effective time step × restoring force).

Key Finding: The system exhibits three distinct behavioral regimes:

STABLE (ρ < 1): Green region - guaranteed convergence
NEAR-BOUNDARY (ρ ≈ 1): Yellow region - convergence but slower
UNSTABLE (ρ > 1): Red region - divergence

The empirical stability boundary closely matches the theoretical sufficient condition:

Stable if: 0 ≤ alpha < 1  AND  0 < p < 2(1 + alpha)

Eigenvalue Analysis

The spectral radius (maximum eigenvalue magnitude) determines system stability. For convergence, we need ρ(J) < 1 where J is the Jacobian of the discrete dynamics.

Representative parameter sets:

Safe (α=0.1, p=0.4): ρ ≈ 0.5 - Fast, stable convergence
Near-bound (α=0.3, p=1.6): ρ ≈ 0.57 - Stable but approaching boundary
Unstable (α=0.5, p=2.5): ρ ≈ 0.7 - Exceeds stability bound, diverges
Damped (α=0.7, p=0.2): ρ ≈ 0.83 - High damping, slow convergence
High-alpha (α=0.9, p=1.0): ρ ≈ 0.95 - Near-critical, very slow

The heatmap reveals the complete stability landscape in (α, p) space. Dark blue regions (ρ < 0.5) converge rapidly, while yellow/green regions (ρ > 1.0) are unstable.

Convergence Dynamics

Energy trajectories E(n) = ½||x_n - mu||² + ½||v_n - v_target||² demonstrate convergence behavior:

Observations:

Damped (red, α=0.2): Fastest initial decay, oscillatory but converges
Safe/Near-bound (blue/orange): Smooth exponential decay to equilibrium
Unstable (green, α=0.8, p=2.5): Energy fails to decay, remains elevated
High-alpha (purple, α=0.9): Slowest convergence due to high damping

Practical Parameter Selection

Based on empirical analysis, recommended parameter ranges for INL layers:

Use Case	α (damping)	p (dt×g×β)	Behavior	Iterations Needed
Fast inference	0.1 - 0.3	0.3 - 1.0	Quick convergence	5-10
Balanced	0.3 - 0.6	0.5 - 1.5	Stable, moderate speed	10-20
High precision	0.4 - 0.7	0.4 - 1.2	Slow but accurate	20-50
Avoid	> 0.8	> 2.0	Too slow or unstable	N/A

Safety margin: Stay well within the theoretical bound p < 2(1+α). Practical recommendation: p < 1.5(1+α) for reliable convergence with finite iterations.

Connection to Model Architecture

The INL-LLM 1.1B model uses:

alpha ≈ 0.4-0.6 (moderate damping)
p ≈ 0.8-1.2 (safe region)
5 iterations/layer (sufficient for 85-95% convergence)

These parameters balance:

Convergence quality: 90%+ of theoretical equilibrium
Inference speed: ~30-50% faster than full convergence
Stability: Robust across diverse inputs and training stages

Theoretical vs. Empirical

Aspect	Theoretical	Empirical
Condition	`p < 2(1+α)`	`p < 1.8(1+α)` (practical)
Convergence	Asymptotic (n→∞)	85-95% in 5-10 iterations
Guarantee	Mathematical proof	Statistical validation
Application	Infinite time	Finite GPU budget

The empirical analysis validates the theory while providing practical guidance for finite-iteration deployment. The stability region is robust: small parameter perturbations during training don't cause instability.

Validation Methodology

Data: Sampled 11 α values × 100 p values (1,100 parameter combinations)

Metrics:

Spectral radius computation via eigenvalue analysis
Energy trajectory simulation (300 iterations)
Convergence rate measurement

Tools: NumPy, SciPy, Matplotlib for numerical analysis

For full analysis code, see the parent directory for stability analysis notebooks.

Optimizations

KV Caching

Full KV caching support for fast autoregressive generation.

# Automatic caching with .generate()
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # Enable KV caching (default)
)

# Manual caching for custom generation loops
past_key_values = None
for _ in range(max_tokens):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    # ... get next token ...

Benefits:

1.1-1.3× faster generation for long sequences (100+ tokens)
Compatible with HuggingFace .generate() and vLLM
Beam search supported via _reorder_cache()
Minimal memory overhead (<1%)

How it works: Unlike standard transformers that cache K, V for attention, INL-LLM only needs to cache attention states. Integrator dynamics (x, v) are computed fresh for each token since they operate within each layer, not across tokens.

Performance Note: The speedup is more modest than standard transformers (which get 10-20× gains) because INL architecture is dominated by integrator iterations, not attention. Most compute (70-90%) goes to iterative dynamics (3-10 iterations per layer × 12-25 layers), while attention is only ~10-30% of FLOPs. The cache optimizes that 10-30%, giving ~1.1-1.3× overall speedup. This is an architectural tradeoff - you get richer dynamics at the cost of less cache benefit.

Technical Requirements

Requires trust_remote_code=True (custom INL architecture)
Python 3.8+, PyTorch 2.0+, transformers 4.35+

Citation

@misc{inl-llm-2024,
  author = {Boris Peyriguère},
  title = {INL-LLM: Integrator Neural Language Model},
  year = {2024},
  url = {https://github.com/pacific-prime777/ARKITEKTURE_TRANSFORMER_ADL}
}

License: CC BY-NC 4.0 (Non-Commercial - Contact author for commercial use)

Downloads last month: 9

Safetensors

Model size

1B params

Tensor type

F32