Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output

Dev.to AI

I wanted to add reply suggestions to a voice roleplay chat: the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit. I ended up going with the unglamorous move of embedding inline markers in the response text and stripping them out before display. The path to that decision was interesting enough to write up.
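To make "inline markers" concrete, here is a minimal sketch of the stripping step. The `[[suggest: ...]]` syntax and the `split_reply` helper are assumptions made up for illustration, not necessarily the format the final implementation uses.

```python
import re

# Hypothetical marker syntax, chosen only for illustration: [[suggest: text]]
SUGGESTION_MARKER = re.compile(r"\[\[suggest:\s*(.*?)\s*\]\]")


def split_reply(raw: str) -> tuple[str, list[str]]:
    """Separate the visible reply from any inline suggestion markers."""
    # Collect the text inside each marker, skipping empty ones.
    suggestions = [s for s in SUGGESTION_MARKER.findall(raw) if s]
    # Remove the markers so only the spoken/displayed reply remains.
    visible = SUGGESTION_MARKER.sub("", raw).strip()
    return visible, suggestions


raw = (
    "Welcome to the tavern, traveler. "
    "[[suggest: Order a drink]] [[suggest: Ask about rumors]] [[suggest: Leave]]"
)
visible, suggestions = split_reply(raw)
print(visible)
print(suggestions)
```

The point of this shape is that the model's output stays plain text, so the request itself needs no structured-output settings; the client does the parsing.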