Routing that compounds with every deployment.
Every local inference stack has a routing problem and almost none of them know it.
Stateless routing - sending each request to a backend based on round-robin, availability, or a static config file - ignores everything the system has learned about its own hardware and workload. A backend that handled a reasoning task well this morning gets the same assignment probability at midnight under a different load profile. The system has no memory. Every request starts from zero.
You pick a model. It doesn't fit the task. You pick a backend. It doesn't match your hardware under current load. You add a second modality - speech to text, image generation, embeddings - and you add a second system. None of them aware of each other, none of them aware of the shared GPU budget.
Static configuration files do not adapt. A backend that performed well yesterday may be degraded today - VRAM usage shifted, a thermal event slowed throughput - but your router keeps sending traffic to it anyway. The routing decision in month twelve is identical to the routing decision on day one, regardless of what the actual performance data shows.
A stateless router treats every request as independent. A stateful router accumulates context. The difference compounds over time.
controla's scoring engine is stateless and deterministic - same inputs, same score, every time. But above the scoring layer sits a versioned RoutingPolicy that carries learned performance observations: per (backend, task_type, complexity) tuple, an exponentially weighted moving average of observed latency, failure rate, and throughput. These weights persist across restarts via Redis. A process restart does not erase what the system has learned.
The operational consequence: a controla deployment that has been running for three months is measurably more accurate in its routing decisions than it was on day one - on the same hardware, without any configuration changes. Two controla instances on different hardware converge to different optimal policies. Both correct for their respective deployments. This is deployment-specific knowledge that no static configuration can replicate.
Without statefulness, you repeat the same blind dispatch mistakes indefinitely. With it, each deployment builds an increasingly precise model of its own hardware and workload. The router improves as it runs.
controla is a local inference OS - 19 backends across 7 modalities (text generation, STT, TTS, image generation, embeddings, vision, reasoning) under a single OpenAI-compatible API. Every request passes through a sequential pipeline:
The hardest part wasn't the learning loop - it was making the scoring layer provably deterministic so that the learning layer could be trusted. If the scoring engine has any non-determinism, the replay validation is meaningless: you can't know whether a policy improvement comes from better weights or from stochastic scoring variation.
The VRAM accounting problem is more complex than it appears. A backend that fits at idle may fail under concurrent load as VRAM usage climbs. Getting VRAM-aware routing right required real-time monitoring of per-backend VRAM headroom, not just static capacity checks at startup.
The multi-modal problem compounds the scheduling problem. When text generation, STT, and image generation compete for GPU memory, you need a scheduling layer that treats VRAM as a shared resource with explicit allocation tracking - not just a check at dispatch time. The first design had a subtle race condition where two requests could both pass the VRAM check and then both fail to load.
The ε-greedy exploration had to be designed very carefully. Exploration that sends traffic to untested backends is necessary for learning - but in a production environment, even exploration traffic must be capability-matched and VRAM-safe. Unconstrained exploration would learn things you don't want to learn, at the cost of actual production requests.
controla has 329 passing tests across the core routing, scoring, scheduling, and learning components. The system is in active development.
It is not open-sourced. It will be released under a commercial license. Redistribution and embedding in commercial products are covered under a separate license tier.
The setup wizard, hardware-adaptive configuration matrix, and deployment documentation are being completed in parallel with the core system.
If you're building on local inference at scale and the routing problem is real for you - contact me directly.