Best Local AI Models for 8GB VRAM: Complete RTX 2070 Guide (2026)
Run local LLMs on your RTX 2070 with 8GB VRAM. Top models, Ollama and llama.cpp setup, quantization tips, and performance benchmarks.
Running large language models (LLMs) locally on an RTX 2070 with 8GB of VRAM is not only possible, it delivers surprisingly good performance. With the right quantization techniques and inference engines, 7B to 8B parameter models run at interactive speeds (community benchmarks report 40+ tokens per second in the best cases), making local AI a practical reality for homelab enthusiasts and developers who value privacy and offline capability.
This guide covers the best models, the tools you need (Ollama and llama.cpp), quantization strategies, and a step-by-step setup process verified against real benchmarks.
An RTX 2070 with 8GB VRAM sits in the "sweet spot" for running smaller quantized models. While 24GB cards like the RTX 3090 can run larger models, the 2070 handles 7B–8B parameter models at Q4_K_M quantization entirely in GPU memory, so no layers have to fall back to the much slower CPU path.
According to benchmarks from the local LLM community, an 8GB GPU like the RTX 2070 can deliver 40+ tokens per second with a 7B model at Q4_K_M quantization. For reference, the RTX 3070 (also 8GB) achieves 54–58 tokens per second with Qwen3.5-9B at Q4_K_M, so the RTX 2070 will be slightly slower but still very usable.
Before diving into specific models, it is essential to understand how quantization affects VRAM usage.
A 7B parameter model at FP16 (half precision, the usual unquantized format) requires approximately 14GB of VRAM just for the weights (7 billion parameters at 2 bytes each), which exceeds the 8GB available on an RTX 2070. Quantization reduces the precision of the model weights, shrinking their memory footprint at a modest cost to output quality.
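The weight-size arithmetic above can be sketched as a small shell function. This is a rough estimate only: the helper name `weights_gb` is hypothetical, the 4.9 bits-per-weight figure for Q4_K_M is an assumed average rather than a number from this guide, and the result covers weights alone, not the KV cache or runtime overhead.

```shell
#!/bin/sh
# Estimate the memory needed for model weights alone.
# usage: weights_gb <params_in_billions> <bits_per_weight>
weights_gb() {
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 16     # FP16: prints 14.0
weights_gb 7 8.5    # Q8_0 stores about 8.5 bits per weight: prints 7.4
weights_gb 8 4.9    # Q4_K_M, assumed ~4.9 bits per weight on average: prints 4.9
```

Treat the output as a lower bound on VRAM use; the actual GGUF file size on disk is the more reliable number when one is available.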
| Quantization | Approx. Size (7B model) | Quality Impact |
|--------------|------------------------|----------------|
| Q8_0 | ~8 GB | Minimal loss |
| Q4_K_M | ~4.9 GB | Good balance |
| Q3_K_M | ~3.5 GB | More aggressive |
| Q2_K | ~2.7 GB | Noticeable loss |
Q4_K_M is the recommended quantization for 8GB VRAM cards. It provides the best balance of model intelligence and memory footprint. A 7B model in Q4_K_M uses roughly 4.9 GB of VRAM, leaving headroom for the KV cache and context window.
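To get a feel for how much of that headroom the KV cache consumes, here is a minimal sketch. It assumes an FP16 cache (2 bytes per element) and uses what I understand to be the Llama 3.1 8B architecture values (32 layers, 8 KV heads, head dimension 128); verify these against the model card before relying on them. The function name `kv_cache_mib` is hypothetical.

```shell
#!/bin/sh
# Estimate KV cache size in MiB.
# usage: kv_cache_mib <layers> <kv_heads> <head_dim> <context_tokens> [bytes_per_element]
kv_cache_mib() {
    # The leading 2 accounts for storing both keys and values.
    awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="${5:-2}" \
        'BEGIN { printf "%d\n", 2 * l * h * d * c * b / (1024 * 1024) }'
}

kv_cache_mib 32 8 128 4096   # assumed Llama 3.1 8B shape, 4096 tokens: prints 512
kv_cache_mib 32 8 128 2048   # halving the context halves the cache: prints 256
```

Because the cache grows linearly with context length, cutting the context size is usually the first thing to try when a model almost fits.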
Based on benchmarks and community testing, these are the top models for an RTX 2070 with 8GB VRAM:
Qwen3.5-9B at Q4_K_M leads the pack for sub-10B models and tops the Artificial Analysis Intelligence Index in that size class. Benchmarks on an RTX 3070 (8GB) show 54–58 tokens per second with a 32K context window fitting in just 6.96 GB of VRAM. For an RTX 2070, expect slightly lower speeds (likely 35–45 tok/s) but still excellent performance.
Meta's Llama 3.1 8B is the go-to model for most local LLM applications. At Q4_K_M quantization it uses roughly 4.9 GB of VRAM, leaving plenty of room for a 4096 context window. On an 8GB GPU this model delivers 15–30 tokens per second depending on the specific GPU.
Qwen 2.5 7B is a strong all-around performer. In Q4_K_M it occupies approximately 4–5 GB of VRAM and runs comfortably on 8GB cards with fast generation speeds.
DeepSeek R1 8B is excellent for coding and reasoning tasks. Its Q4_K_M variant uses about 5 GB of VRAM and delivers strong results on 8GB hardware. This model is often recommended for agentic coding workloads on the r/LocalLLaMA subreddit.
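Google's Gemma 3 does not ship in an 8B size; its 4B variant fits comfortably in 8GB VRAM when quantized to Q4_K_M, while the 12B variant is a tight fit that will likely need partial CPU offload. It performs strongly on instruction-following and general Q&A tasks.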
Mistral 7B remains a solid, lightweight option. It uses approximately 4–5 GB of VRAM in Q4_K_M and runs efficiently on older GPUs like the RTX 2070.
Ollama is the easiest way to run local LLMs. It handles model downloads, quantization, and serving through an OpenAI-compatible API.
Ollama supports Linux, macOS, and Windows. On Linux, the installation is straightforward:
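curl -fsSL https://ollama.com/install.sh | sh
After installation, confirm that the NVIDIA driver can see your GPU:
nvidia-smi
Once a model is loaded, ollama ps shows whether it is running on the GPU or the CPU. (ollama list only lists the models you have downloaded; it does not report GPU status.)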
To pull and run a recommended model for 8GB VRAM:
ollama pull llama3.1:8b-instruct-q4_K_M
Then run it:
ollama run llama3.1:8b-instruct-q4_K_M
Ollama automatically offloads layers to the GPU. You can monitor VRAM usage with nvidia-smi to ensure the model fits within 8GB.
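As a sketch of that monitoring step, the helper below computes free VRAM from one line of nvidia-smi CSV output. The function name `vram_free_mib` is hypothetical, and the sample line is made up for illustration; on a real machine, feed it the output of the nvidia-smi command shown in the comment.

```shell
#!/bin/sh
# Compute free VRAM (MiB) from a "used, total" CSV line.
# usage: vram_free_mib "<used_mib>, <total_mib>"
vram_free_mib() {
    echo "$1" | awk -F', *' '{ print $2 - $1 }'
}

# On a machine with the NVIDIA driver installed:
#   vram_free_mib "$(nvidia-smi --query-gpu=memory.used,memory.total \
#       --format=csv,noheader,nounits | head -n 1)"
vram_free_mib "5120, 8192"   # illustrative sample line: prints 3072
```

Running this while a model is loaded shows how much room is left for a longer context window.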
If you encounter out-of-memory errors, reduce the context size:
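ollama run llama3.1:8b-instruct-q4_K_M
ollama run has no context-size flag; instead, type the following at the interactive prompt:
/set parameter num_ctx 2048
This reduces the KV cache size, leaving more VRAM for the model weights.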
Ollama serves its REST API on port 11434 by default (an OpenAI-compatible endpoint is also available under /v1):
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b-instruct-q4_K_M",
"prompt": "Explain what a transformer model is in simple terms.",
"stream": false
}'
This allows integration with tools like Open WebUI, LangChain, or custom applications.
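When calling the API from a script, the context size can be set per request through the `options.num_ctx` field of the request body. The sketch below builds such a payload; `generate_payload` is a hypothetical helper, and it assumes the prompt contains no double quotes or backslashes because it does no JSON escaping.

```shell
#!/bin/sh
# Build a request body for Ollama's /api/generate with a custom context size.
# usage: generate_payload <model> <prompt> <num_ctx>
# Assumption: the prompt has no double quotes or backslashes (no escaping is done).
generate_payload() {
    printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"num_ctx": %d}}\n' \
        "$1" "$2" "$3"
}

# To send it to a running Ollama server:
#   generate_payload "llama3.1:8b-instruct-q4_K_M" "Hello" 2048 |
#       curl -s http://localhost:11434/api/generate -d @-
generate_payload "llama3.1:8b-instruct-q4_K_M" "Hello" 2048
```

For prompts with arbitrary text, a JSON-aware tool such as jq is a safer way to build the body.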
llama.cpp is a more advanced, highly optimized inference engine. It offers fine-grained control over GPU offloading and quantization.
Since you have an NVIDIA RTX 2070, build llama.cpp with CUDA support for maximum performance:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
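cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j
Recent llama.cpp releases use the GGML_CUDA option; the older LLAMA_CUDA name is deprecated. The compiled binaries are written to the bin/ subdirectory of the build directory.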
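A CUDA build needs git, cmake, and the CUDA toolkit's nvcc compiler on your PATH. As a small preflight sketch (the helper name `missing_tools` is hypothetical), you could check for them before starting:

```shell
#!/bin/sh
# Report which of the given tools are not on PATH.
# usage: missing_tools <tool>...
# Prints one "missing: <tool>" line per absent tool; returns non-zero if any are absent.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Before a CUDA build:
#   missing_tools git cmake nvcc || echo "install the missing tools first"
```

If nvcc is reported missing, install the CUDA toolkit from NVIDIA before running cmake.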
Download a GGUF quantized model from Hugging Face. For example, Mistral 7B Q4_K_M:
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
Then run it from the build directory with full GPU offloading:
./bin/llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8081
The -ngl 999 flag tells llama.cpp to offload all layers to the GPU. If you run out of VRAM, reduce this number (e.g., -ngl 24) to offload only some layers while the rest run on the CPU. Note that --host 0.0.0.0 makes the server reachable from other machines on your network; use 127.0.0.1 to keep it local.
llama.cpp includes a built-in benchmark tool:
./bin/llama-bench -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 999
This reports prompt processing speed and generation speed, both in tokens per second, for your specific hardware.
If you want to run a 14B model like Qwen 2.5 14B at Q4_K_M (which requires roughly 8.1–8.4 GB), you can split layers between GPU and CPU. With llama.cpp:
./bin/llama-server -m qwen-2.5-14b-instruct-q4_K_M.gguf -ngl 24 -c 2048
Experiment with the -ngl value to find the maximum number of layers that fit without causing memory errors. This approach trades some speed for the ability to run larger, more capable models.
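A rough starting point for -ngl can be estimated by dividing the VRAM you are willing to give the weights by the average size of one layer. The sketch below makes several assumptions: all layers are the same size, the model has 48 layers (which I believe is the Qwen 2.5 14B count; check the model card), and 6.5 GB of the 8 GB is budgeted for weights. The helper name `layers_that_fit` is hypothetical.

```shell
#!/bin/sh
# Estimate how many layers fit in a VRAM budget.
# usage: layers_that_fit <model_size_gb> <total_layers> <vram_budget_gb>
layers_that_fit() {
    awk -v s="$1" -v n="$2" -v v="$3" 'BEGIN {
        per_layer = s / n          # assume every layer is the same size
        fit = int(v / per_layer)   # whole layers that fit in the budget
        if (fit > n) fit = n       # cannot offload more layers than exist
        print fit
    }'
}

layers_that_fit 8.4 48 6.5   # 14B model, assumed 48 layers: prints 37
layers_that_fit 4.9 32 6.5   # 8B model fits entirely: prints 32
```

Treat the result as an upper bound to start from, then lower -ngl step by step if llama.cpp still reports out-of-memory errors.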
Based on benchmarks from similar 8GB GPUs (RTX 3060 Ti, RTX 3070, RTX 4060), here are realistic performance ranges for an RTX 2070:
| Model | Quantization | VRAM Used | Est. Tokens/sec |
|-------|-------------|-----------|-----------------|
| Qwen3.5-9B | Q4_K_M | ~6.96 GB | 35–45 tok/s |
| Llama 3.1 8B | Q4_K_M | ~4.9 GB | 15–30 tok/s |
| Qwen 2.5 7B | Q4_K_M | ~4.5 GB | 20–35 tok/s |
| DeepSeek R1 8B | Q4_K_M | ~5 GB | 15–25 tok/s |
| Mistral 7B | Q4_K_M | ~4.5 GB | 20–35 tok/s |
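Note: These figures are estimates, not measurements. The RTX 2070 has fewer CUDA cores than the RTX 3070 or RTX 4060 and an older architecture (Turing), although its memory bandwidth is comparable to the RTX 3070's and higher than the RTX 4060's. Expect roughly 60–70% of the speeds reported on those newer cards.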
If you see CUDA out-of-memory errors:
- Reduce the context size (-c 2048 in llama.cpp, or num_ctx 2048 in Ollama)
- Lower -ngl in llama.cpp to offload fewer layers
If token generation feels slow:
- Check that the model is actually running on the GPU (nvidia-smi)
- Set the batch size with -b 512 if using llama.cpp
If model downloads via Ollama fail:
- Retry ollama pull with the full model tag name
The RTX 2070 with 8GB VRAM is a capable card for running local AI models in 2026. With the right quantization (stick to Q4_K_M) and a good inference engine (Ollama for simplicity, llama.cpp for fine-grained control), you can run 7B–9B parameter models at interactive speeds entirely on your GPU.
Qwen3.5-9B at Q4_K_M is currently the best overall model for 8GB VRAM, offering the highest token throughput and strong benchmark scores. Llama 3.1 8B Instruct remains the reliable default for general-purpose use.
Self-hosting AI on your RTX 2070 gives you complete control over your data, eliminates API costs, and works offline. With the benchmarks and setup steps in this guide, you can start running local LLMs today without upgrading your hardware.
For further optimization, explore the official documentation for Ollama and llama.cpp, and experiment with different quantizations to find the best trade-off between quality and speed for your specific workloads.