concept
Streaming as Runtime
The idea that LLM streaming is not just a UX nicety (tokens appearing faster) but a runtime control channel: partial model output becomes typed semantic events that other participants observe and act on while inference is still happening. The framing line from mozaik:
“Token streaming improves the interface. Reactive streaming changes the runtime.”
Two Uses of Streaming
| | Token streaming (UI) | Reactive streaming (runtime) |
|---|---|---|
| Audience | The end user’s screen | Other agents/tools/observers |
| Purpose | Perceived latency | Control, coordination, safety |
| Unit | Raw tokens/chunks | Typed semantic events |
| Covered by | streaming-chat-architecture | this page |
These are orthogonal — a system can do both. kulify projects already do the first (SSE / ReadableStream → UI in master-bot, vorma, fajb-next); this concept is about the second.
Semantic Events
Streamed output is wrapped as typed event objects with a consistent shape across internal (own model) and external (peer) sources, so any participant can handle a partial response uniformly. Agents distinguish origin via onInternalEvent() / onExternalEvent() (and the typed onReasoning / onFunctionCall / onExternalReasoning … family — see reactive-agents).
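A minimal TypeScript sketch of the idea, assuming a discriminated-union event shape. `SemanticEvent`, `EventKind`, `Participant`, `dispatch`, and `Recorder` are hypothetical names for illustration, not mozaik's API; only the `onInternalEvent` / `onExternalEvent` method names come from this page, and their signatures here are assumed.

```typescript
// Hypothetical event shape: identical for the agent's own model output
// ("internal") and for output streamed in from a peer ("external").
type EventKind = "text" | "reasoning" | "functionCall";

interface SemanticEvent {
  kind: EventKind;
  origin: "internal" | "external"; // relative to the participant receiving it
  source: string; // id of the agent whose model produced the chunk
  payload: string; // partial content: a text delta, reasoning delta, or call JSON
}

// Method names mirror this page; the signatures are assumptions.
interface Participant {
  id: string;
  onInternalEvent(e: SemanticEvent): void;
  onExternalEvent(e: SemanticEvent): void;
}

// Deliver one partial chunk to every participant: internal for its producer,
// external for everyone else, with the same shape in both cases.
function dispatch(
  participants: Participant[],
  source: string,
  kind: EventKind,
  payload: string,
): void {
  for (const p of participants) {
    const origin = p.id === source ? "internal" : "external";
    const event: SemanticEvent = { kind, origin, source, payload };
    if (origin === "internal") p.onInternalEvent(event);
    else p.onExternalEvent(event);
  }
}

// A participant that just records what it observed, for demonstration.
class Recorder implements Participant {
  id: string;
  readonly seen: string[] = [];
  constructor(id: string) {
    this.id = id;
  }
  onInternalEvent(e: SemanticEvent): void {
    this.seen.push(`internal:${e.kind}:${e.payload}`);
  }
  onExternalEvent(e: SemanticEvent): void {
    this.seen.push(`external:${e.kind}:${e.payload}`);
  }
}

const planner = new Recorder("planner");
const reviewer = new Recorder("reviewer");
dispatch([planner, reviewer], "planner", "reasoning", "check the balance first");
dispatch([planner, reviewer], "planner", "functionCall", '{"name":"getBalance"}');
```

The producing agent receives its own chunks as internal events while a peer such as a reviewer receives the same shape as external events, which is what lets either one handle a partial response with the same code.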
Five Capabilities It Unlocks
- Inference interception — a participant (e.g. a safety reviewer) watches the stream and intervenes mid-generation when it sees an unsafe or wrong direction.
- Abort and redirect — cancel generation the moment enough information exists, instead of paying for the full completion. Direct token-cost win.
- Live agent handoff — delegate to a specialist mid-stream as context emerges, rather than after a full turn.
- Observability — real-time visibility into reasoning, function calls, and tool results during execution, not just in post-hoc logs.
- Multi-agent collaboration — concurrent participants reacting to one another instead of a rigid sequential pipeline (see agent-swarms).
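The capabilities above can be sketched against a simulated stream, assuming watchers are plain functions that see each chunk before the next one is pulled and return a directive. `Directive`, `Watcher`, `runStream`, `fakeModel`, and both watchers are hypothetical; a real system would cancel an actual model request rather than stop a local generator.

```typescript
// A watcher sees each partial chunk as it streams and may steer the run.
type Directive =
  | { action: "continue" }
  | { action: "abort"; reason: string }
  | { action: "handoff"; to: string };

type Watcher = (chunk: string, soFar: string) => Directive;

interface RunResult {
  output: string; // text produced before the run stopped
  chunksConsumed: number; // proxy for the token cost actually paid
  stoppedBy?: Directive; // the directive that ended the run early, if any
  log: string[]; // observability: every chunk in the order it was seen
}

// Simulated model stream that counts how many chunks it actually generated.
function* fakeModel(chunks: string[], stats: { generated: number }): Generator<string> {
  for (const c of chunks) {
    stats.generated++;
    yield c;
  }
}

// Show each chunk to every watcher before pulling the next one.
// The first non-"continue" directive ends the run; returning from the loop
// closes the generator, so no further chunks are generated.
function runStream(stream: Iterable<string>, watchers: Watcher[]): RunResult {
  let output = "";
  let chunksConsumed = 0;
  const log: string[] = [];
  for (const chunk of stream) {
    chunksConsumed++;
    output += chunk;
    log.push(chunk);
    for (const watch of watchers) {
      const directive = watch(chunk, output);
      if (directive.action !== "continue") {
        return { output, chunksConsumed, stoppedBy: directive, log };
      }
    }
  }
  return { output, chunksConsumed, log };
}

// Hypothetical safety reviewer: intercepts once a risky claim appears.
const safetyReviewer: Watcher = (_chunk, soFar) =>
  soFar.includes("guaranteed returns")
    ? { action: "abort", reason: "unsafe financial claim" }
    : { action: "continue" };

// Hypothetical router: hands off to a specialist once the topic is clear.
const router: Watcher = (_chunk, soFar) =>
  soFar.includes("tax") ? { action: "handoff", to: "tax-specialist" } : { action: "continue" };

const stats1 = { generated: 0 };
const result = runStream(
  fakeModel(["This fund ", "offers ", "guaranteed returns ", "of 20% ", "every year."], stats1),
  [safetyReviewer, router],
);

const stats2 = { generated: 0 };
const handoff = runStream(
  fakeModel(["Your question ", "is about tax ", "treatment of ", "dividends."], stats2),
  [safetyReviewer, router],
);
```

Because `runStream` returns from inside the `for...of` loop, the generator is closed and never produces the remaining chunks: `stats1.generated` stays at 3 of 5, which is the abort-and-redirect cost saving in miniature. The `log` field stands in for live observability, and the `handoff` directive for mid-stream delegation.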
A consequence: policy/safety enforcement becomes an emergent property of who is subscribed to the stream, not an external wrapper bolted around the agent.
Design Tension: Back-Pressure
If listeners can act on a live stream, what happens when one of them is slow? mozaik delivers events synchronously to all subscribers, so a slow listener never blocks producers: a deliberate producer-priority choice. That is one answer to back-pressure; the trade-off is that a slow observer may miss its window to intercept. Any “streaming as runtime” system has to pick a stance here.
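One way to make a back-pressure stance concrete is a producer-priority bus, sketched below under the assumption of a bounded per-subscriber mailbox that drops the oldest event when full. This is a hypothetical illustration (`Mailbox`, `ProducerPriorityBus`, and the tick simulation are invented here), not how mozaik implements delivery.

```typescript
// Bounded per-subscriber buffer. When full, the oldest event is dropped
// (and counted) instead of making the producer wait.
class Mailbox<T> {
  private queue: T[] = [];
  private readonly capacity: number;
  missed = 0;
  constructor(capacity: number) {
    this.capacity = capacity;
  }
  push(item: T): void {
    if (this.queue.length >= this.capacity) {
      this.queue.shift(); // drop oldest rather than block the producer
      this.missed++;
    }
    this.queue.push(item);
  }
  // The subscriber drains at its own pace.
  take(n: number): T[] {
    return this.queue.splice(0, n);
  }
}

class ProducerPriorityBus<T> {
  private readonly mailboxes: Mailbox<T>[] = [];
  subscribe(capacity: number): Mailbox<T> {
    const box = new Mailbox<T>(capacity);
    this.mailboxes.push(box);
    return box;
  }
  // Never waits on a subscriber: the producer's pace is not set by any listener.
  publish(event: T): void {
    for (const box of this.mailboxes) box.push(event);
  }
}

const bus = new ProducerPriorityBus<number>();
const fastLogger = bus.subscribe(4); // keeps up with the stream
const slowReviewer = bus.subscribe(2); // small buffer, drains rarely
const fastSeen: number[] = [];
const slowSeen: number[] = [];

for (let tick = 0; tick < 10; tick++) {
  bus.publish(tick); // the producer emits one event per tick, no matter what
  fastSeen.push(...fastLogger.take(1)); // drains one event every tick
  if (tick % 4 === 3) slowSeen.push(...slowReviewer.take(1)); // drains every 4th tick
}
```

The logger keeps up and sees all ten events, while the slow reviewer sees only events 2 and 6 and misses six others; if the unsafe content had been in event 4, it could never have intercepted it. The opposite stance, a bounded queue that makes `publish` wait when a mailbox is full, would guarantee the reviewer sees everything at the cost of stalling generation.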
Relevance to Kulify
The abort-and-redirect and live-observer capabilities map cleanly onto the investorchat supervisor design and the broader ai-agent-architectures multi-agent direction: a cheap watcher that aborts a wrong-headed generation early saves tokens directly, and mid-stream handoff suits multi-domain queries.
Related
- reactive-agents — agents that react to these events
- agent-swarms — many participants sharing the stream
- streaming-chat-architecture — the UI-streaming counterpart (orthogonal)
- back-pressure — the slow-listener trade-off
- mozaik — implementation