
Caching AI API Responses in Next.js: What to Store, What to Skip

LLM calls are slow and expensive. Next.js gives you three caching layers — here's how to apply each one without serving stale or wrong output.

May 18, 2026 · 6 min read

An uncached LLM call costs real money and takes real time: anywhere from 500 ms for a small model to several seconds for a complex prompt. In a server-rendered Next.js app, that latency sits directly on the critical path, because the page does not respond until the call returns. That makes caching one of the highest-leverage performance moves in an AI-native product.

The tricky part is that not all AI responses are the same. Some are deterministic given a fixed prompt. Others are user-specific and should never be shared. Applying a blanket cache-everything or cache-nothing policy will either serve wrong output or gain nothing. Here is how to think through each layer.

Layer 1: Next.js fetch caching

When you call a REST endpoint (like the OpenAI Chat Completions API) using the global fetch inside a Server Component, Next.js can store the result in its Data Cache. The default depends on your version: in Next.js 14, fetch results were cached by default and reused across requests until you explicitly revalidated; from Next.js 15 onward, fetch results are not cached unless you opt in with cache: "force-cache" or a next.revalidate value. A cached result is fine for stable content (a product description, a pre-generated FAQ) but catastrophic for dynamic prompts that depend on user input, so do not rely on the default either way.

The fix is explicit: pass the cache option that matches your intent.

// Never cache — each call is unique to the request
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  cache: "no-store",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  body: JSON.stringify({ model: "gpt-4o-mini", messages }),
});

// Cache with time-based revalidation — good for summaries refreshed hourly
// (endpointUrl is a placeholder for a fixed-prompt endpoint you control)
const summaryRes = await fetch(endpointUrl, {
  next: { revalidate: 3600 },
});

Rule of thumb: if the prompt contains anything that varies per user or per request, use cache: "no-store". If the prompt is fixed and the result is not sensitive, set a revalidate window that matches your tolerance for staleness.
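The rule of thumb above can be written down as a small helper. This is a minimal sketch, not part of Next.js: the PromptTraits type, its field names, and the fetchCacheOptions function are all hypothetical names chosen for illustration.

```typescript
// Hypothetical helper: the type and function names are illustrative, not a Next.js API.
type PromptTraits = {
  variesPerRequest: boolean;   // prompt includes user input, session data, etc.
  sensitive: boolean;          // output must not be shared across users
  maxStalenessSeconds: number; // how old a shared result is allowed to be
};

type FetchCacheOptions =
  | { cache: "no-store" }
  | { next: { revalidate: number } };

function fetchCacheOptions(traits: PromptTraits): FetchCacheOptions {
  // Anything per-user, per-request, or sensitive is never shared.
  if (traits.variesPerRequest || traits.sensitive) {
    return { cache: "no-store" };
  }
  // Fixed prompt, non-sensitive result: cache with a staleness window.
  return { next: { revalidate: traits.maxStalenessSeconds } };
}
```

Inside a Next.js app you would spread the result into the fetch init object next to method, headers, and body. Keeping the decision in one function makes the policy easy to review and hard to get wrong call by call.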

Layer 2: unstable_cache for non-fetch calls

The Vercel AI SDK's generateText and streamText do not hand you the underlying fetch call, so you cannot set cache or next.revalidate on it per request. That makes the fetch-level options above the wrong place to control caching for these calls. Use unstable_cache to wrap them instead:

import { unstable_cache } from "next/cache";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const getCachedSummary = unstable_cache(
  async (articleSlug: string) => {
    const { text } = await generateText({
      model: openai("gpt-4o-mini"),
      prompt: `Summarize the article at slug: ${articleSlug}`,
    });
    return text;
  },
  ["article-summary"],
  { revalidate: 86400, tags: ["summaries"] }
);

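// In your Server Component (params is a Promise from Next.js 15 onward; on 14, read params.slug directly):
const { slug } = await params;
const summary = await getCachedSummary(slug);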

The cache key is derived from the key parts array (here ["article-summary"]) together with the serialized function arguments, so each unique articleSlug gets its own entry. The tags option lets you purge related entries together when content changes: call revalidateTag("summaries") from a webhook or CMS hook.
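To make the keying and tagging behavior concrete, here is a toy in-memory model. It is a conceptual sketch only and not how Next.js implements unstable_cache: the TaggedCache class and fakeSummarize function are hypothetical, and the synchronous fakeSummarize stands in for the real (asynchronous) LLM call.

```typescript
// Conceptual model only: this is NOT the Next.js implementation of unstable_cache.
type Entry = { value: string; tags: string[] };

class TaggedCache {
  private entries = new Map<string, Entry>();

  // Key = fixed key parts + serialized arguments, so each argument set gets its own entry.
  private keyFor(keyParts: string[], args: unknown[]): string {
    return JSON.stringify([keyParts, args]);
  }

  wrap<A extends unknown[]>(
    fn: (...args: A) => string,
    keyParts: string[],
    tags: string[],
  ): (...args: A) => string {
    return (...args: A) => {
      const key = this.keyFor(keyParts, args);
      const hit = this.entries.get(key);
      if (hit) return hit.value; // cache hit: skip the expensive call
      const value = fn(...args);
      this.entries.set(key, { value, tags });
      return value;
    };
  }

  // Drop every entry that carries the given tag.
  revalidateTag(tag: string): void {
    this.entries.forEach((entry, key) => {
      if (entry.tags.includes(tag)) this.entries.delete(key);
    });
  }
}

// Stand-in for the LLM call; counts invocations so cache hits are visible.
let llmCalls = 0;
const fakeSummarize = (slug: string): string => {
  llmCalls += 1;
  return `summary of ${slug}`;
};

const cache = new TaggedCache();
const getSummary = cache.wrap(fakeSummarize, ["article-summary"], ["summaries"]);
```

Calling getSummary twice with the same slug invokes fakeSummarize once, a different slug creates a second entry, and cache.revalidateTag("summaries") forces the next call for any slug to recompute.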

Layer 3: Full Route Cache

If a route has no dynamic data at all — every LLM call is wrapped in a cache with a fixed key and the page has no per-user content — Next.js can cache the entire rendered HTML at build time or after the first request. This is the Full Route Cache, and it gives you static-site-level performance for AI-generated content.

To enable it, ensure the route does not opt into dynamic rendering. Calling cookies() or headers(), or reading the searchParams prop, makes the route render dynamically and bypasses the Full Route Cache. If you need those values, keep the reads in leaf components behind Suspense boundaries so the outer shell stays as cacheable as your Next.js version and rendering mode allow.

What never to cache

User-personalized completions. Any response that incorporates user history, preferences, or private data must stay behind cache: "no-store". Serving another user's cached response is a data leak.
Real-time or tool-call responses. If the model calls external tools or APIs whose results change over time (stock prices, live availability), caching the composite response will serve incorrect data.
Streamed responses. streamText starts flushing tokens before the call completes — there is nothing to cache until the stream finishes. Handle streaming separately from caching; cache the completed output if you need it.
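One way to "cache the completed output" of a stream is to pass chunks through to the client while buffering them, then store the full text only after the stream ends normally. The sketch below assumes your LLM client gives you an AsyncIterable of string chunks; teeAndCollect and its onComplete callback are hypothetical names, and where you store the result (a Map, Redis, a database) is up to you.

```typescript
// Hypothetical helper: forwards each chunk to the consumer while buffering it,
// then hands the completed text to a callback once the source stream ends.
async function* teeAndCollect(
  stream: AsyncIterable<string>,
  onComplete: (fullText: string) => void | Promise<void>,
): AsyncGenerator<string> {
  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk;
    yield chunk; // the consumer sees tokens as soon as they arrive
  }
  // Reached only when the source finished without throwing and the consumer
  // read to the end, so partial or failed outputs are never stored.
  await onComplete(buffer);
}
```

Because the callback runs only after a clean finish, an aborted request or a provider error leaves the cache untouched, and the next request simply streams again.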

Tying it together

Think of caching AI responses in three buckets: shared-static (fixed prompt, cache indefinitely or with a long TTL), shared-dynamic (fixed prompt, short TTL or tag-based revalidation), and private (per-user, never cache at the Next.js layer — offload to a session store if you need persistence). Most AI-native apps have all three; the mistake is applying one policy to all of them.

Once you have caching in place, the combination with Suspense streaming from last week's post becomes powerful: fast-cached content paints immediately, slow uncached calls stream in behind a skeleton. Users see useful output in the first paint rather than waiting for every data dependency to resolve.