Commercial Software • In Development

controla - Local Inference That Learns

Routing that compounds with every deployment.

What's Wrong With Inference Routing Today

Every local inference stack has a routing problem and almost none of them know it.

Stateless routing - sending each request to a backend based on round-robin, availability, or a static config file - ignores everything the system has learned about its own hardware and workload. A backend that handled a reasoning task well this morning gets the same assignment probability at midnight under a different load profile. The system has no memory. Every request starts from zero.

You pick a model. It doesn't fit the task. You pick a backend. It doesn't match your hardware under current load. You add a second modality - speech to text, image generation, embeddings - and you add a second system. None of these systems is aware of the others, and none is aware of the shared GPU budget.

Static configuration files do not adapt. A backend that performed well yesterday may be degraded today - VRAM usage shifted, a thermal event slowed throughput - but your router keeps sending traffic to it anyway. The routing decision in month twelve is identical to the routing decision on day one, regardless of what the actual performance data shows.

Why Statefulness Matters

A stateless router treats every request as independent. A stateful router accumulates context. The difference compounds over time.

controla's scoring engine is stateless and deterministic - same inputs, same score, every time. But above the scoring layer sits a versioned RoutingPolicy that carries learned performance observations: for each (backend, task_type, complexity) tuple, it holds exponentially weighted moving averages (EWMAs) of observed latency, failure rate, and throughput. These weights persist across restarts via Redis, so a process restart does not erase what the system has learned.
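As a rough illustration of the per-context averages described above, here is a minimal sketch of an EWMA update. It is not controla's implementation: the names EwmaStats and update_policy, the backend name used in the example, and the smoothing factor alpha = 0.2 are all assumptions, and Redis persistence is left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical key shape: (backend, task_type, complexity).
ContextKey = Tuple[str, str, str]


@dataclass(frozen=True)
class EwmaStats:
    latency_ms: float
    failure_rate: float
    throughput_tps: float


def ewma(previous: float, observation: float, alpha: float) -> float:
    """Blend one new observation into a running average."""
    return alpha * observation + (1.0 - alpha) * previous


def update_policy(
    weights: Dict[ContextKey, EwmaStats],
    key: ContextKey,
    latency_ms: float,
    failed: bool,
    throughput_tps: float,
    alpha: float = 0.2,  # assumed smoothing factor, not taken from the source
) -> EwmaStats:
    """Fold one completed request into the aggregates for its context."""
    failure = 1.0 if failed else 0.0
    current = weights.get(key)
    if current is None:
        # The first observation for a context seeds the averages directly.
        updated = EwmaStats(latency_ms, failure, throughput_tps)
    else:
        updated = EwmaStats(
            latency_ms=ewma(current.latency_ms, latency_ms, alpha),
            failure_rate=ewma(current.failure_rate, failure, alpha),
            throughput_tps=ewma(current.throughput_tps, throughput_tps, alpha),
        )
    weights[key] = updated
    return updated


# Example: two requests observed for the same (hypothetical) context.
policy: Dict[ContextKey, EwmaStats] = {}
context = ("llama_cpp", "reasoning", "high")
update_policy(policy, context, latency_ms=100.0, failed=False, throughput_tps=50.0)
update_policy(policy, context, latency_ms=200.0, failed=True, throughput_tps=30.0)
```

A larger alpha reacts faster to a degraded backend but is noisier; a smaller one is steadier but slower to notice change.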

The operational consequence: a controla deployment that has been running for three months is measurably more accurate in its routing decisions than it was on day one - on the same hardware, without any configuration changes. Two controla instances on different hardware converge to different optimal policies, each correct for its own deployment. This is deployment-specific knowledge that no static configuration can replicate.

Without statefulness, you repeat the same blind dispatch mistakes indefinitely. With it, each deployment builds an increasingly precise model of its own hardware and workload. The router improves as it runs.

The System

controla is a local inference OS - 19 backends across 7 modalities (text generation, STT, TTS, image generation, embeddings, vision, reasoning) under a single OpenAI-compatible API. Every request passes through a sequential pipeline:

Analysis: Task type classified across 10 categories and 3 complexity levels. Capabilities required - multimodal content, tool calls, structured output - extracted before dispatch.
Scoring: Every VRAM-safe candidate backend evaluated across 6 dimensions: capability, performance, resource state, current load, reliability, and context. VRAM-aware: −15 if model cannot fit, +1.5 if already loaded. Scoring is deterministic - learning lives above it.
Scheduling: Redis-backed priority queue with per-user fairness windows, deadline-aware dispatch via x-latency-budget header, and starvation prevention. High-priority requests are never blocked behind batch work.
Execution: Top-ranked backend receives the request. For high-complexity reasoning, the ExecutionPlanner decomposes into typed inference step chains dispatched separately.
Feedback: Every completed request generates a performance record: latency, queue delay, backend used, task type, failure outcome. The weight learner updates per-context EWMA aggregates. These persist to Redis.
Policy: Candidate policy updates are replay-validated against historical traffic before promotion. If the candidate would have degraded p95 latency or increased failure rate over the prior window, it is rejected. Policy promotion is safe by construction.
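The promotion gate in the final stage can be sketched roughly as follows. This is an illustration only: ReplayOutcome and should_promote are hypothetical names, the nearest-rank p95 is one of several reasonable definitions, and the sketch assumes the replay step has already produced estimated outcomes for both the current and the candidate policy over the same window of traffic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReplayOutcome:
    """Estimated result of routing one historical request under a policy."""
    latency_ms: float
    failed: bool


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; expects a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def failure_rate(outcomes: Sequence[ReplayOutcome]) -> float:
    return sum(1 for o in outcomes if o.failed) / len(outcomes)


def should_promote(
    baseline: Sequence[ReplayOutcome],
    candidate: Sequence[ReplayOutcome],
) -> bool:
    """Reject a candidate that worsens p95 latency or failure rate."""
    if not baseline or not candidate:
        return False  # no evidence means no promotion
    latency_ok = (
        p95([o.latency_ms for o in candidate])
        <= p95([o.latency_ms for o in baseline])
    )
    failures_ok = failure_rate(candidate) <= failure_rate(baseline)
    return latency_ok and failures_ok
```

Requiring both checks to pass means a candidate that trades a lower failure rate for worse tail latency, or the reverse, is still rejected.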

What Building It Revealed

The hardest part wasn't the learning loop - it was making the scoring layer provably deterministic so that the learning layer could be trusted. If the scoring engine has any non-determinism, the replay validation is meaningless: you can't know whether a policy improvement comes from better weights or from stochastic scoring variation.
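To make the determinism requirement concrete, here is a minimal sketch of a pure scoring function with deterministic ranking. Only the −15 and +1.5 adjustments come from the description above. The Candidate fields, the plain sum over six dimension scores, the rule that an already-loaded model skips the fit check, and the name-based tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

VRAM_MISFIT_PENALTY = -15.0  # stated in the pipeline description
ALREADY_LOADED_BONUS = 1.5   # stated in the pipeline description


@dataclass(frozen=True)
class Candidate:
    name: str
    # Six per-dimension scores (capability, performance, resource state,
    # load, reliability, context); their scale here is hypothetical.
    dimension_scores: Tuple[float, ...]
    model_vram_mb: int
    free_vram_mb: int
    already_loaded: bool


def score(c: Candidate) -> float:
    """Pure function: no clock, no randomness, no hidden state."""
    total = sum(c.dimension_scores)
    if c.already_loaded:
        # Assumption: a loaded model already occupies VRAM, so no fit check.
        total += ALREADY_LOADED_BONUS
    elif c.model_vram_mb > c.free_vram_mb:
        total += VRAM_MISFIT_PENALTY
    return total


def rank(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Highest score first; ties broken by name so the order is stable."""
    return sorted(candidates, key=lambda c: (-score(c), c.name))
```

Because score reads nothing except its argument, replaying the same historical inputs yields the same ranking, which is what makes it possible to attribute a change in replay results to the weights alone.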

The VRAM accounting problem is more complex than it appears. A backend that fits at idle may fail under concurrent load as VRAM usage climbs. Getting VRAM-aware routing right required real-time monitoring of per-backend VRAM headroom, not just static capacity checks at startup.

The multi-modal problem compounds the scheduling problem. When text generation, STT, and image generation compete for GPU memory, you need a scheduling layer that treats VRAM as a shared resource with explicit allocation tracking - not just a check at dispatch time. The first design had a subtle race condition where two requests could both pass the VRAM check and then both fail to load.
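The race described above is a check-then-act problem: the fit check and the allocation have to be a single atomic step. A minimal single-process sketch, using a hypothetical VramLedger class that is not controla's actual design, might look like this:

```python
import threading


class VramLedger:
    """Tracks reserved VRAM so that check and reserve happen atomically."""

    def __init__(self, total_mb: int) -> None:
        self._total_mb = total_mb
        self._reserved_mb = 0
        self._lock = threading.Lock()

    def try_reserve(self, mb: int) -> bool:
        """Reserve mb if it fits; return False without side effects if not."""
        if mb < 0:
            raise ValueError("mb must be non-negative")
        with self._lock:
            # Check and update under one lock: two callers can never both
            # observe the same free headroom and both proceed to load.
            if self._reserved_mb + mb > self._total_mb:
                return False
            self._reserved_mb += mb
            return True

    def release(self, mb: int) -> None:
        """Return a reservation once the model is unloaded."""
        with self._lock:
            self._reserved_mb = max(0, self._reserved_mb - mb)

    @property
    def free_mb(self) -> int:
        with self._lock:
            return self._total_mb - self._reserved_mb


# Example: only a request that successfully reserves may load its model.
ledger = VramLedger(total_mb=24_000)
if ledger.try_reserve(14_000):
    try:
        pass  # load the model and run inference here
    finally:
        ledger.release(14_000)
```

A threading.Lock only covers one process. With several router processes, the same check-and-reserve step would need to be atomic in whatever shared store holds the ledger.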

The ε-greedy exploration had to be designed very carefully. Exploration that sends traffic to untested backends is necessary for learning - but in a production environment, even exploration traffic must be capability-matched and VRAM-safe. Unconstrained exploration would mostly teach the router that incapable or memory-starved backends fail, and it would pay for that lesson with failed production requests.
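One way to picture the constraint is to filter before exploring, so the random choice is only ever made among backends that could legitimately serve the request. The sketch below is an assumption-laden illustration: Backend, eligible, and choose are hypothetical names, and the learned scores are passed in as a plain dictionary.

```python
import random
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    capabilities: FrozenSet[str]
    model_vram_mb: int
    free_vram_mb: int


def eligible(backends: Sequence[Backend], required: FrozenSet[str]) -> List[Backend]:
    """Keep only capability-matched backends whose model fits in VRAM."""
    return [
        b for b in backends
        if required <= b.capabilities and b.model_vram_mb <= b.free_vram_mb
    ]


def choose(
    backends: Sequence[Backend],
    required: FrozenSet[str],
    scores: Dict[str, float],
    epsilon: float,
    rng: random.Random,
) -> Optional[Backend]:
    """Epsilon-greedy selection restricted to the eligible pool."""
    pool = eligible(backends, required)
    if not pool:
        return None  # nothing safe to route to, even for exploration
    if rng.random() < epsilon:
        return rng.choice(pool)  # explore, but only among safe candidates
    # Exploit: best learned score, with the name as a deterministic tie-break.
    return max(pool, key=lambda b: (scores.get(b.name, 0.0), b.name))
```

Passing the random generator in explicitly keeps the exploration step reproducible in tests and separate from the deterministic scoring layer.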

Where It Is Now

controla has 329 passing tests across the core routing, scoring, scheduling, and learning components. The system is in active development.

controla is not open source. It will be released under a commercial license. Redistribution and embedding in commercial products are covered under a separate license tier.

The setup wizard, hardware-adaptive configuration matrix, and deployment documentation are being completed in parallel with the core system.

If you're building on local inference at scale and the routing problem is real for you - contact me directly.