Saad Sultan
March 2026·10 min

Building AI Features for Production

Shipping an AI feature that works in a demo is one thing. Keeping it consistent, bounded, and safe in production is another.

AI · OpenAI · Production · Backend

Introduction

Shipping an AI feature that works in a demo is one thing. Shipping one that behaves consistently in production, stays within budget, and does not leak or confuse user data is another. This post focuses on three areas that bite teams again and again: temperature (determinism), conversation memory (scope and lifecycle), and timeouts (failure modes). Getting these right early saves a lot of debugging and user complaints.

Temperature: When You Need Determinism

LLM APIs let you set a temperature parameter. Higher values make outputs more varied; at zero (or very low values) the model picks the most likely token at each step, so the same input produces the same output as reliably as the provider allows. For product features that feed into code or structured data (e.g. extracting job requirements from chat, filling a form, or driving a filter), non-determinism is a problem: one run might produce valid JSON, and the next might add a field your parser does not expect or change the wording in a way that breaks your logic. So for any path where the model output is consumed by your application (not just shown to the user), set temperature to zero.
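As a concrete sketch, here is what a deterministic extraction path can look like with an OpenAI-style chat-completions request. The model name, prompt, and JSON schema are illustrative, not a recommendation; the point is `temperature=0` on the machine-consumed path, plus validation before the output touches application logic:

```python
import json


def build_extraction_params(user_text: str) -> dict:
    """Request params for a structured-extraction call.

    temperature=0 because this output feeds our parser, not a human.
    The model name and schema below are placeholders.
    """
    return {
        "model": "gpt-4o-mini",  # assumed model name
        "temperature": 0,        # deterministic path: output is consumed by code
        "messages": [
            {
                "role": "system",
                "content": (
                    'Extract job requirements as JSON with keys "skills" '
                    '(list of strings) and "years_experience" (integer). '
                    "Return JSON only, no prose."
                ),
            },
            {"role": "user", "content": user_text},
        ],
    }


def parse_requirements(raw: str) -> dict:
    """Validate the model output before the application uses it."""
    data = json.loads(raw)
    if not isinstance(data.get("skills"), list):
        raise ValueError("missing or malformed 'skills'")
    if not isinstance(data.get("years_experience"), int):
        raise ValueError("missing or malformed 'years_experience'")
    return data
```

Even at temperature zero, validate: determinism reduces variation, but the parser is still your last line of defense.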

That does not mean every call should be temperature zero. For creative or conversational replies, a higher temperature can be better. The rule of thumb: if the output goes into your code or your database, make it deterministic. If the output is only for the user to read, you can allow more variation.

Memory: Scope and Lifecycle

Conversation memory (e.g. a buffer of recent messages sent to the model) is essential for multi-turn chat. It is also a common source of bugs when the buffer is shared or long-lived in the wrong way. In a multi-tenant or multi-user product, memory must be keyed by (tenant, user, conversation). When the user starts a "new chat," you must create a new buffer for that conversation and never attach another user’s or another conversation’s history. If your framework or library keeps memory in process memory, ensure the key is unique and that you create a new buffer when the user explicitly starts a new thread.
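One minimal way to enforce that keying for in-process memory looks like the sketch below. The class name and the 20-message cap are illustrative, and a real multi-process deployment would likely back this with an external store (e.g. Redis) rather than a dict:

```python
from collections import deque


class ConversationStore:
    """Per-conversation message buffers keyed by (tenant, user, conversation)."""

    def __init__(self, max_turns: int = 20):
        self.max_turns = max_turns
        self._buffers: dict[tuple[str, str, str], deque] = {}

    def get(self, tenant: str, user: str, conversation: str) -> deque:
        key = (tenant, user, conversation)
        # A new conversation id always gets a fresh buffer, so history
        # from another user or another thread can never leak in.
        if key not in self._buffers:
            self._buffers[key] = deque(maxlen=self.max_turns)
        return self._buffers[key]

    def clear(self, tenant: str, user: str, conversation: str) -> None:
        """Drop a buffer, e.g. when the session expires or the user logs out."""
        self._buffers.pop((tenant, user, conversation), None)
```

Because the conversation id is part of the key, "new chat" is just a new id: the old buffer is untouched and the new one starts empty.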

Also decide when memory is cleared. On browser refresh? When the user logs out? When the session expires? If you do not clear it, you risk leaking old context into new conversations or holding references forever. Document the lifecycle and test that "new chat" and "refresh" do what you expect.

Timeouts: Fail Open or Fail Closed

LLM API calls can be slow or unavailable. Your product needs a policy: how long do you wait, and what do you show or do when the call fails or times out? If you do not set timeouts, a stuck request can hang the user or block a thread. If you set them too low, you may cut off valid responses. A practical approach is to set a timeout (e.g. 30 seconds) on the client or server call, and to handle the error by showing a clear message and optionally retrying once. Avoid failing silently or showing a generic "something went wrong" with no way to retry.
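The failure policy can be sketched as a small wrapper around whatever client call you use. The transport timeout itself (e.g. 30 seconds) is set on the HTTP client; `fn` below stands in for that call, and the exception types are illustrative, since each SDK raises its own:

```python
def call_with_retry(fn, retries: int = 1):
    """Run the LLM call `fn`; on timeout or connection failure, retry up
    to `retries` times, then return None so the caller can show a clear
    error message instead of hanging or failing silently."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == retries:
                return None  # caller surfaces "try again" UI
    return None  # unreachable; keeps type checkers happy
```

Returning `None` (or a sentinel) forces the caller to decide what the user sees, rather than letting the exception propagate into a generic error page.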

If the AI feature is critical (e.g. the only way to complete a step), consider a fallback: e.g. "AI is temporarily unavailable; you can still fill the form manually." That way the user is not blocked and you have a path to degrade gracefully.
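The fallback decision itself can be a small, testable branch rather than logic scattered through the UI. Function and field names here are illustrative:

```python
def suggest_or_manual(ai_result):
    """Route to the AI suggestion when available, otherwise degrade to
    manual entry with an explicit notice so the user is never blocked."""
    if ai_result is None:
        return {
            "mode": "manual",
            "notice": "AI is temporarily unavailable; you can still "
                      "fill the form manually.",
        }
    return {"mode": "ai", "suggestion": ai_result}
```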

Summary

  • Temperature: Use zero (or near-zero) when the model output is consumed by your code or stored; use higher values only when the output is purely for display.
  • Memory: Key conversation buffers by (tenant, user, conversation) and create a new buffer when the user starts a new chat; define and implement a clear lifecycle for when memory is cleared.
  • Timeouts: Set timeouts on LLM calls, handle errors explicitly, and provide a fallback or retry so the product does not hang or fail in a confusing way.

These three levers are not the only things that matter (prompt design, cost, and safety matter too), but they are the ones that most often cause production issues when overlooked. Get them right from the start and you will spend less time firefighting and more time iterating on the feature itself.


Have thoughts on this post or questions? Get in touch.