Why Your Local AI Gives Empty Responses (and How to Fix It)
Local AI returning blank or empty responses? Learn the real causes — context window limits, low token generation caps, and how to fix them in Ollama.
You fire up your self-hosted local AI — maybe an Ollama model running on your homelab server or a Docker container with llama.cpp — and you ask it a question. The spinner ticks. The prompt gets processed. And then... nothing. A blank response. An empty string. Silence.
If you've been running local LLMs in your homelab for any length of time, you've likely hit this wall. The model loaded fine. The API returned a 200 OK. But the actual generated content came back empty. This article dives into the two most common causes of empty responses from local AI models, backed by official documentation and real community findings, and gives you the exact steps to fix them.
After digging through Ollama's official docs, GitHub issues, and community reports, the root causes consistently point to two related but distinct parameters:
1. `num_predict` (Max Tokens): Reported as defaulting to 128 tokens in Ollama, though the value can differ between releases, so check the Modelfile documentation for your version. This is the maximum number of tokens the model is allowed to generate. If the model needs more tokens than the cap allows to form a complete answer, it stops early and may return an empty or truncated response.
2. `num_ctx` (Context Window): Defaults to 2048 or 4096 tokens depending on your setup. Per Ollama's official docs on context length, systems with less than 24 GiB of VRAM default to a 4K context window. When your prompt exceeds the window, Ollama silently truncates it, so the model loses part of its context and can produce empty output.
Here's the chain of events:
The `num_predict` cap of 128 tokens stops generation prematurely.

As documented in Ollama issue #2204, there is no built-in warning when the context is exceeded. The model simply stops producing useful output.
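Because nothing warns you, a client-side estimate can flag prompts that are likely to overflow before they are sent. The following is a minimal sketch, assuming a rough heuristic of about four characters per token (real counts depend on the model's tokenizer, so treat the result as a hint rather than a measurement). `estimate_tokens` and `fits_in_context` are hypothetical helper names, not part of Ollama.

```python
import math

# Assumption: roughly 4 characters per token for English text.
# Real tokenizers vary by model, so this is only a coarse estimate.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a very rough token estimate for the given text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(prompt: str, num_ctx: int, num_predict: int) -> bool:
    """Return True if the estimated prompt plus the generation budget fits in num_ctx."""
    return estimate_tokens(prompt) + num_predict <= num_ctx


if __name__ == "__main__":
    long_prompt = "Explain how DNS works. " * 500
    # A False result suggests raising num_ctx or shortening the prompt.
    print(fits_in_context(long_prompt, num_ctx=4096, num_predict=2048))
```

The check adds `num_predict` to the prompt estimate because the prompt and the generated tokens share the same context window.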
Before fixing anything, confirm you are hitting these limits.
Ollama provides a debug mode that reveals the allocated context length and whether model layers are being offloaded to CPU. Start the server with debug logging enabled:

```bash
OLLAMA_DEBUG=1 ollama serve
```

Then, in another terminal, run your model and inspect the server logs. Look for lines that report context allocation and layer offloading. The official context length documentation has a section titled "Check allocated context length and model offloading"; the logs show how much context was actually allocated.
If you are using the Ollama API, your request payload likely includes parameters like num_predict or max_tokens. Check what you are sending:
```json
{
  "model": "llama3.2",
  "prompt": "Explain how DNS works",
  "options": {
    "num_predict": 128
  }
}
```

If you are using a UI like Open WebUI, check the per-chat or per-user settings for max tokens. Community reports on GitHub discussions confirm that these UI settings sometimes fail to pass through to Ollama correctly.
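To see exactly what a script sends, it can help to build the request body in one place and print it before posting. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default local address and uses the `/api/generate` endpoint with streaming disabled; `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_predict: int, num_ctx: int) -> dict:
    """Build a non-streaming /api/generate request body with explicit limits."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict, "num_ctx": num_ctx},
    }


def generate(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> dict:
    """POST the payload to Ollama and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With a server running, `generate(build_payload("llama3.2", "Explain how DNS works", 2048, 8192))` returns a dictionary whose `response` field holds the generated text. Printing the payload with `json.dumps(payload, indent=2)` first shows whether the limits you intended are actually in the request.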
The single most common reason for empty responses is that `num_predict` (which maps to `max_tokens` in OpenAI-compatible APIs) defaults to 128 tokens in Ollama. This is documented in OneUptime's Ollama API guide and confirmed across multiple community threads.
Create a custom Modelfile to override the defaults:
```
FROM llama3.2
PARAMETER num_predict 2048
```

Then create and run the custom model:

```bash
ollama create my-llama3.2 -f ./Modelfile
ollama run my-llama3.2
```

When making API requests, pass `num_predict` in the `options` field:
```json
{
  "model": "llama3.2",
  "prompt": "Explain how DNS works in detail",
  "options": {
    "num_predict": 2048
  }
}
```

Note: There is a known issue tracked in Ollama issue #11892 where the `num_predict` parameter is sometimes ignored depending on how the request is structured. If you set it and it still doesn't work, try the Modelfile approach instead.

The context window (`num_ctx`) determines how many tokens the model can "see" at once. Ollama sets the default based on VRAM:
| VRAM | Default Context Length |
|------|----------------------|
| < 24 GiB | 4K tokens |
| 24–48 GiB | 32K tokens |
These defaults are confirmed in Ollama's official context length documentation. Many community models support much larger contexts — for example, Llama 3.1 supports up to 128K tokens.
Set OLLAMA_CONTEXT_LENGTH before starting the Ollama server:
```bash
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```

This is the recommended method in the official docs. Note that some community reports indicate this environment variable may not always take effect. If it doesn't work for you, set the values in a Modelfile instead:

```
FROM llama3.2
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
```

Then rebuild:

```bash
ollama create my-llama3.2 -f ./Modelfile
```

As noted in the Drifting Ruby blog post on Ollama context windows, increasing the context window also increases VRAM usage. Monitor your GPU memory with `nvidia-smi` or similar tools.
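To get a feel for why a larger window costs VRAM, the key-value cache can be estimated with the usual transformer formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below is a rough estimate only. The architecture numbers are illustrative assumptions, not values read from Ollama, and the result ignores model weights and runtime overhead.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    num_ctx: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV cache size: keys and values for every layer and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_element


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 1024**3


if __name__ == "__main__":
    # Hypothetical small model: 28 layers, 8 KV heads, head dimension 128.
    for ctx in (4096, 32768):
        size = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, num_ctx=ctx)
        print(f"num_ctx={ctx}: about {to_gib(size):.2f} GiB of KV cache")
```

The cache grows linearly with `num_ctx`, so under these assumptions going from 4096 to 32768 tokens multiplies the cache size by eight.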
If you are using the Ollama desktop application, open Settings and adjust the context length slider. The official docs state: "Change the slider in the Ollama app under settings to your desired context length."
Ollama does not warn you when your prompt exceeds the context window. According to the Autodidacts blog, which documented issues with glm-ocr producing empty output due to default context size: "Since Ollama silently truncates context, it's hard to know what's the right value to use."
To diagnose truncation:
1. Start Ollama in debug mode.
2. Send your prompt.
3. Watch the server logs for any indication of context overflow.
If the logs don't clearly indicate truncation, the safest approach is to increase the context window significantly (for example, to 32K or 64K) and test again.
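The reply itself can also offer clues. The sketch below inspects a non-streaming `/api/generate` reply using the `response`, `done_reason`, `prompt_eval_count`, and `eval_count` fields. The 90% threshold for "nearly full" is an arbitrary assumption, and `prompt_eval_count` may be missing or low when the prompt was cached, so treat the hints as heuristics. `diagnose` is a hypothetical helper name.

```python
def diagnose(reply: dict, num_ctx: int, num_predict: int, near_full: float = 0.9) -> list[str]:
    """Return human-readable hints about why a reply may be empty or cut off."""
    hints = []
    text = reply.get("response", "")
    prompt_tokens = reply.get("prompt_eval_count", 0)
    generated = reply.get("eval_count", 0)

    if not text.strip():
        hints.append("response is empty")

    # done_reason "length" means generation ended because the token cap was reached.
    hit_cap = num_predict > 0 and generated >= num_predict
    if reply.get("done_reason") == "length" or hit_cap:
        hints.append("generation hit the num_predict cap")

    # Heuristic: a prompt that nearly fills the window may have been truncated to fit.
    if prompt_tokens >= near_full * num_ctx:
        hints.append("prompt fills most of num_ctx and may have been truncated")

    return hints
```

An empty list means none of the heuristics fired, which points toward other causes such as the offloading issue discussed next.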
When you increase the context size, the key-value cache grows and leaves less VRAM for model layers. If your GPU runs out of memory, Ollama offloads layers to CPU, which drastically slows down inference and can cause blank responses.
The official docs recommend: "For best performance, use the maximum context length for a model, and avoid offloading the model to CPU."
Check your setup:
```bash
ollama run my-model
```

Look for "offloaded to CPU" in the logs. If you see this, reduce your context window or use a smaller quantized model.
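Another way to check is `ollama ps`, which lists loaded models along with a PROCESSOR column. The sketch below interprets that column, assuming its values look like `100% GPU`, `100% CPU`, or `48%/52% CPU/GPU`. The exact format may differ between Ollama versions, so adjust the patterns if yours prints something else. `gpu_percent` is a hypothetical helper name.

```python
import re


def gpu_percent(processor: str) -> int:
    """Return the share of the model on the GPU (0-100) from a PROCESSOR value."""
    value = processor.strip()

    # Split load, for example "48%/52% CPU/GPU": the second number is the GPU share.
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return int(split.group(2))

    # Single device, for example "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        percent = int(single.group(1))
        return percent if single.group(2) == "GPU" else 100 - percent

    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")
```

Anything below 100 means at least part of the model is running on the CPU, which is the situation the advice above is meant to avoid.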
A concrete example is the glm-ocr case documented by the Autodidacts blog: glm-ocr was producing empty output when run with Ollama's default context size of 4096 tokens. The fix was straightforward: increase `num_ctx` to at least 10,000 tokens.

```bash
ollama run glm-ocr
```

Then, inside the model's REPL:

```
/set parameter num_ctx 10240
```

Or pass the prompt directly on the command line. This command does not set `num_ctx` itself, so the larger context must already be configured, for example through a Modelfile:

```bash
ollama run glm-ocr "Text Recognition: ./image.jpg"
```

After increasing the context window, the model produced valid, complete responses.
Here are the recommended baseline settings for a self-hosted Ollama instance in a homelab environment:
```
FROM llama3.2
# Increase max generation tokens from default 128
PARAMETER num_predict 4096
# Increase context window from default 4096
PARAMETER num_ctx 32768
# Optional: set temperature for more predictable output
PARAMETER temperature 0.7
```

The same settings as API request options:

```json
{
  "model": "my-custom-model",
  "prompt": "Your prompt here",
  "options": {
    "num_predict": 4096,
    "num_ctx": 32768,
    "temperature": 0.7
  }
}
```

Then check two things:

1. Is `num_predict` set to at least 512–4096? (The default of 128 is too low.)
2. Is `num_ctx` large enough to fit your full prompt plus system message? (The default of 2048–4096 is often too small.)

| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| Completely empty response | num_predict too low (default 128) | Increase to 2048+ in Modelfile or API options |
| Response stops mid-sentence | num_predict reached before answer finished | Increase num_predict |
| Blank after long prompt | Context window exceeded, prompt truncated | Increase num_ctx |
| Model worked before, now empty | VRAM exhausted from larger context | Reduce context or use smaller quantized model |
| UI shows settings but ignored | UI not passing params correctly | Configure via Modelfile instead |
Empty responses from local AI models are almost never a model quality issue — they are a configuration issue. The defaults that Ollama ships with (128 tokens max generation, 2048–4096 context window) are conservative, designed to run on as many machines as possible. But in a homelab environment where you control the hardware, you have every reason to push these numbers higher.
By understanding the relationship between num_predict, num_ctx, and your available VRAM, you can eliminate blank responses entirely. Start by checking your current settings, then apply the fixes outlined above — a custom Modelfile with increased values, or explicit API parameters. Your local AI has plenty to say. You just need to give it enough room to say it.
For further reading, consult Ollama's official context length documentation and the Ollama Modelfile documentation to fine-tune your deployment.