May 30, 2026

Streaming LLM Responses From a Remix Action

A small pattern that made my AI surfaces feel a lot less like waiting at a loading screen.

In this post

The pattern at a glance
What changes for the user
The small caveats
Final thoughts

When I first wired an LLM call into a Remix action, the request would sit there while the user stared at a spinner, and then all six seconds' worth of generated text would arrive at once. Functionally fine. Felt awful.

Switching to streaming was one of the smaller changes I've made, and it meaningfully changed how the product feels.

The pattern at a glance

A Remix action can return a Response whose body is a ReadableStream. On the client, you POST to the route with fetch and read chunks off response.body as they arrive, appending each one to state. The LLM SDKs I use (OpenAI, Claude, OpenRouter) all expose streaming responses, so the action mostly comes down to three steps: open the model stream, pipe its text into the Response body, and close the stream when the model is done.
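Here's a minimal sketch of the server side, assuming a resource route (a route module that exports an `action` but no default component, so Remix sends the returned Response straight through). The file path and `streamCompletion` are hypothetical: `streamCompletion` stands in for your SDK's streaming call and just yields words so the sketch runs without an API key. The OpenAI and Anthropic TypeScript SDKs, for example, can both be consumed with `for await`, but their event shapes differ, so in practice you'd map each event to its text delta.

```typescript
// app/routes/api.generate.ts: a hypothetical resource route. It exports an
// action and no default component, so Remix returns the Response untouched.

// Hypothetical stand-in for a real SDK's streaming call. It yields text
// deltas with a short delay so the sketch runs on its own.
export async function* streamCompletion(
  prompt: string,
  signal: AbortSignal,
): AsyncGenerator<string, void, unknown> {
  for (const word of `You asked: ${prompt}`.split(" ")) {
    if (signal.aborted) return; // a real SDK would take the signal directly
    await new Promise((resolve) => setTimeout(resolve, 10));
    yield word + " ";
  }
}

// The args type is inlined so this runs standalone; in an app you'd use
// ActionFunctionArgs from your Remix runtime package instead.
export async function action({ request }: { request: Request }) {
  const form = await request.formData();
  const prompt = String(form.get("prompt") ?? "");
  const deltas = streamCompletion(prompt, request.signal);
  const encoder = new TextEncoder();

  const body = new ReadableStream<Uint8Array>({
    // Pull one delta each time the consumer is ready for more data.
    async pull(controller) {
      const result = await deltas.next();
      if (result.done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(result.value));
      }
    },
    // The reader went away: stop the generator (and the model call behind it).
    async cancel() {
      await deltas.return(undefined);
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

Using `pull` rather than a loop in `start` means the model is only read as fast as the client consumes, and a cancelled stream never gets written to again.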

The pieces that matter:

An action that returns a streamed Response instead of a resolved value
A fetch on the client that reads chunks from response.body and updates state as they land
A small "thinking" UI that gives way as soon as the first token shows up
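The client half can be a plain function that reads `response.body` with a reader and reports the text so far after every chunk. This is a framework-agnostic sketch; `readTextStream` is a hypothetical helper name, not a Remix API.

```typescript
// Reads a streamed text Response and reports the accumulated text after
// every chunk, so the caller can push it straight into component state.
export async function readTextStream(
  response: Response,
  onText: (textSoFar: string) => void,
): Promise<string> {
  if (!response.ok || !response.body) {
    throw new Error(`Streaming request failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let text = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true holds back a multi-byte character split across two chunks
    text += decoder.decode(value, { stream: true });
    onText(text);
  }
  text += decoder.decode(); // flush anything still buffered in the decoder
  return text;
}
```

In a React component, one way to wire this up: keep `text` in state, render the thinking indicator while `text === ""`, and from an event handler call `readTextStream(await fetch("/api/generate", { method: "POST", body: formData, signal }), setText)`, where `/api/generate` is a hypothetical route and `signal` comes from an AbortController you abort on unmount. The indicator gives way on the first chunk, because that's the first time `text` becomes non-empty.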

Nothing exotic. Just letting the model talk while it's thinking, instead of waiting for the full thought.

What changes for the user

The first token usually arrives within a few hundred milliseconds. That changes the entire perception of the interaction — the system stops feeling like a black box and starts feeling like a collaborator who's actually working.

It also gives me somewhere to surface intermediate state: tool calls being made, sources being read, structured fields being filled in. Those used to be invisible. Now they're part of the experience.
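One way to carry that intermediate state over a single response is newline-delimited JSON: each event is one `JSON.stringify`'d object followed by `\n`. That's my assumption for illustration, not a format the SDKs or Remix prescribe, and the `StreamEvent` shapes are hypothetical. It works because `JSON.stringify` escapes newlines inside strings, so a raw `\n` only ever ends an event.

```typescript
// Hypothetical event shapes for prose, tool calls, and sources being read.
export type StreamEvent =
  | { type: "text"; delta: string }
  | { type: "tool_call"; name: string }
  | { type: "source"; url: string };

const encoder = new TextEncoder();

// Server side: one JSON object per line.
export function encodeEvent(event: StreamEvent): Uint8Array {
  return encoder.encode(JSON.stringify(event) + "\n");
}

// Client side: a chunk can end mid-line, so keep the unfinished tail
// around until the rest of it arrives.
export async function readEvents(
  body: ReadableStream<Uint8Array>,
  onEvent: (event: StreamEvent) => void,
): Promise<void> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let pending = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    pending += decoder.decode(value, { stream: true });
    const lines = pending.split("\n");
    pending = lines.pop() ?? ""; // last piece is incomplete (or empty)
    for (const line of lines) {
      if (line.trim() !== "") onEvent(JSON.parse(line) as StreamEvent);
    }
  }
  pending += decoder.decode();
  if (pending.trim() !== "") onEvent(JSON.parse(pending) as StreamEvent);
}
```

The UI can then switch on `event.type`: append `text` deltas to the message, and render `tool_call` and `source` events as status lines alongside it.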

The small caveats

A few things to watch:

Errors mid-stream are awkward. Once the first chunk is out, the 200 status and headers have already been sent, so you can't fall back to an ordinary error response. I handle errors as terminal chunks the client can detect and render in place.
Structured outputs need buffering. If the model is returning JSON, the user shouldn't see half-typed JSON. I stream the prose and buffer the structured parts until they parse.
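Cancellation matters. If the user navigates away, the action should hear about it (Remix exposes an AbortSignal on request.signal) and cancel the model call; otherwise you're paying for tokens nobody is reading. A client-side navigation doesn't close an in-flight fetch by itself, so the client should also abort it, for example when the component unmounts.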
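For the first and last caveats, here's a sketch of a stream wrapper that turns a mid-stream failure into a final chunk and stops the model generator when the reader cancels. The reserved `ERROR_MARKER` prefix is a hypothetical convention of mine (it assumes the model's prose never contains a NUL character), and both function names are illustrative. In the action you'd still pass `request.signal` to the SDK call itself, so a disconnect also aborts the upstream request.

```typescript
// Hypothetical reserved prefix for the terminal error chunk.
export const ERROR_MARKER = "\u0000error:";

// Server side: wrap the model's text deltas in a ReadableStream. A failure
// after streaming has started can't change the HTTP status anymore, so it
// is sent in-band as the last chunk instead.
export function toStreamWithTerminalError(
  deltas: AsyncGenerator<string, void, unknown>,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      let result: IteratorResult<string, void>;
      try {
        result = await deltas.next();
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        controller.enqueue(encoder.encode(ERROR_MARKER + message));
        controller.close();
        return;
      }
      if (result.done) controller.close();
      else controller.enqueue(encoder.encode(result.value));
    },
    // Called when the consumer cancels (e.g. the client disconnected).
    // Returning the generator runs its finally blocks, where cleanup lives.
    async cancel() {
      await deltas.return(undefined);
    },
  });
}

// Client side: split the accumulated text into prose and an optional error.
export function splitTerminalError(text: string): { prose: string; error: string | null } {
  const at = text.indexOf(ERROR_MARKER);
  if (at === -1) return { prose: text, error: null };
  return { prose: text.slice(0, at), error: text.slice(at + ERROR_MARKER.length) };
}
```

Because the client checks the accumulated text rather than individual chunks, it doesn't matter if the marker itself gets split across two reads. The prose that already arrived stays on screen, with the error rendered after it.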
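For the structured-output caveat, one simple approach is to append the structured deltas to a buffer and only hand a value to the UI once `JSON.parse` succeeds. `StructuredBuffer` is a hypothetical helper, and the sketch assumes the structured part is a single JSON object or array.

```typescript
// Accumulates streamed JSON text and only surfaces a value once the whole
// document parses, so the user never sees half-typed JSON.
export class StructuredBuffer<T extends object> {
  private raw = "";

  push(delta: string): T | undefined {
    this.raw += delta;
    try {
      const parsed: unknown = JSON.parse(this.raw);
      // A bare number like "12" would parse before "123" arrives, so only
      // objects and arrays (which need a closing bracket) count as complete.
      return typeof parsed === "object" && parsed !== null ? (parsed as T) : undefined;
    } catch {
      return undefined; // still incomplete: keep buffering, render nothing yet
    }
  }
}
```

Re-parsing the whole buffer on every delta is quadratic, which is fine for the small payloads a UI usually renders; for large documents an incremental JSON parser would be the better fit.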

Final thoughts

Streaming is the first change I'd make in any LLM-powered surface. It's not really about performance — the total time is roughly the same. It's about respecting the user's attention by letting them see something useful as soon as it exists.
