MLOps & AI Engineering

How to Run 64k+ Context Models with Less Memory in Ollama 0.1.5

2026-01-24

Running large language models with extended context lengths often leads to memory bottlenecks, but Ollama 0.1.5 introduces groundbreaking optimizations to tackle this challenge. This guide explores how to leverage Ollama’s latest features—like quantized key-value (K/V) caching and Flash Attention—to run 64k+ context models efficiently on hardware with limited memory. Whether you’re a developer or researcher, these techniques will help you maximize performance without upgrading your setup.

Understanding the memory challenge in long context models

Long-context models require significant memory to store intermediate results during inference, particularly in the K/V cache. For example, a 64k context window needs roughly 16x the cache memory of Ollama's default 4096-token setting. This creates two core issues:

  • Rapid memory growth: The K/V cache grows linearly with context length, while the attention computation itself grows quadratically, so both memory and compute balloon at long contexts (see the sizing sketch below the figure)
  • GPU VRAM limitations: Most consumer GPUs struggle with contexts exceeding 14k tokens due to VRAM constraints
[Figure: Memory usage comparison between standard and quantized K/V caching in Ollama 0.1.5]
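
To see where that memory goes, here is a rough back-of-the-envelope sizing sketch in Python. The layer count, K/V head count, and head dimension are typical values for an 8B Llama-style model and are assumptions for illustration, not figures reported by Ollama:

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Keys plus values, for every layer, head, and cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * per_token

for ctx in (4_096, 65_536):
    fp16 = kv_cache_bytes(ctx)                   # 16-bit cache
    q8 = kv_cache_bytes(ctx, bytes_per_value=1)  # ~8-bit cache (real q8_0 adds a little overhead)
    print(f"{ctx:>6} tokens: fp16 ~ {fp16 / 2**30:.2f} GiB, 8-bit ~ {q8 / 2**30:.2f} GiB")

At 64k tokens the full-precision cache alone lands in the same ballpark as the quantized weights of an 8B model, which is why lowering cache precision pays off so quickly.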

Enabling memory optimizations in Ollama 0.1.5

Ollama 0.1.5 introduces two critical features for memory efficiency:

  • Quantized K/V caching: Reduces memory footprint by storing cached keys and values in lower-precision formats
  • Flash Attention: Optimizes attention computation to minimize redundant memory access

To activate these features, use the following environment variables:

# Enable quantized K/V caching (q8_0 halves the cache; q4_0 shrinks it further)
export OLLAMA_KV_CACHE_TYPE=q8_0

# Activate Flash Attention (required for K/V cache quantization to take effect)
export OLLAMA_FLASH_ATTENTION=1

These settings work best when combined with the new ollama launch command, which intelligently allocates memory resources based on available hardware.
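
If you drive the server from a script rather than a shell, the same setup is simply a matter of exporting those variables into the environment of a standard ollama serve process and waiting for its HTTP API to answer on the default port. A minimal sketch:

import os
import subprocess
import time
import urllib.request

env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"   # same settings as the export lines above
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"

# Start the Ollama server with the memory optimizations enabled.
server = subprocess.Popen(["ollama", "serve"], env=env)

# Poll the API root until the server is ready to accept requests.
for _ in range(30):
    try:
        urllib.request.urlopen("http://localhost:11434", timeout=1)
        break
    except OSError:
        time.sleep(1)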

Configuring context length for optimal performance

By default, Ollama limits context length to 4096 tokens to prevent memory overflows. For long context workloads:

  1. Adjust the context length parameter (num_ctx) in your model configuration
  2. Monitor loaded-model memory usage with ollama ps
  3. Scale context length incrementally (e.g., 8k → 16k → 32k → 64k); see the probing sketch after the example below

Example configuration for a 32k context model:

ollama launch --context-length 32768 llama3:8b
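
Step 3 above can also be scripted against Ollama's HTTP API, which accepts a per-request num_ctx option. The sketch below assumes a local server on the default port and an already-pulled llama3:8b; it simply checks that each context size still produces a response before moving up:

import json
import urllib.request

def generate(prompt, num_ctx, model="llama3:8b"):
    """Send one non-streaming /api/generate request with an explicit context size."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Scale the context window incrementally, as suggested in step 3.
for num_ctx in (8_192, 16_384, 32_768, 65_536):
    try:
        generate("Summarize this document: ...", num_ctx)
        print(f"num_ctx={num_ctx}: ok")
    except Exception as exc:   # memory failures typically surface as HTTP errors
        print(f"num_ctx={num_ctx}: failed ({exc})")
        break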

Memory requirements vary by model size:

Model size | Minimum RAM | Recommended RAM
7B         | 8 GB        | 16 GB
13B        | 16 GB       | 32 GB
70B        | 64 GB       | 128 GB
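
As a rough cross-check of the table, weight memory scales with parameter count and precision; the bytes-per-parameter multipliers below are approximations for illustration, and the K/V cache from the earlier sketch comes on top:

def weight_memory_gib(params_billion, bytes_per_param):
    """Approximate memory for the model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 2**30

for name, params in (("7B", 7), ("13B", 13), ("70B", 70)):
    fp16 = weight_memory_gib(params, 2.0)   # unquantized 16-bit weights
    q4 = weight_memory_gib(params, 0.6)     # roughly 4-bit quantized weights
    print(f"{name}: fp16 ~ {fp16:.0f} GiB, 4-bit ~ {q4:.0f} GiB (+ K/V cache)")

The recommended figures leave headroom for the cache and the rest of the system, which is why a quantized 70B model is realistic at 64GB while a full-precision one is not.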

Advanced memory management techniques

For power users, combine Ollama’s optimizations with these strategies:

  • Dynamic batch sizing: Adjust batch size based on current memory availability
  • Model layer offloading: Keep less critical layers in system memory while prioritizing VRAM for computation
  • Context window partitioning: Split long documents into overlapping segments for processing (see the sketch after this list)
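
Context window partitioning can be done entirely client-side before any request is sent. This is a minimal sketch that approximates tokens with a whitespace split (a real implementation would count tokens with the model's tokenizer); long_report.txt is a placeholder path:

def partition(text, window_tokens=4_096, overlap_tokens=256):
    """Split text into overlapping segments that each fit the context window."""
    words = text.split()                       # crude stand-in for real tokens
    step = window_tokens - overlap_tokens
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + window_tokens]))
        if start + window_tokens >= len(words):
            break
    return segments

with open("long_report.txt") as f:             # placeholder input document
    chunks = partition(f.read())
print(f"{len(chunks)} overlapping segments")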

Use the new model scheduler in Ollama 0.1.5 to automate memory allocation:

ollama config set memory_strategy auto

This feature dynamically balances memory between the K/V cache and model weights based on workload demands.
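
Conceptually, this kind of balancing is a budgeting problem: reserve room for the K/V cache first, then fit as much of the model into VRAM as remains. The heuristic below is purely illustrative and is not Ollama's actual scheduler; the layer and cache sizes are made-up example numbers:

def plan_gpu_layers(vram_gib, n_layers, layer_gib, kv_cache_gib, reserve_gib=0.5):
    """Reserve space for the K/V cache, then offload as many layers as still fit."""
    budget = vram_gib - kv_cache_gib - reserve_gib
    return max(0, min(n_layers, int(budget // layer_gib)))

# Example: a 12 GiB card vs. an 8 GiB card, 32 layers of ~0.15 GiB each, 4 GiB cache.
print(plan_gpu_layers(12, 32, 0.15, 4.0))   # -> 32 (everything fits on the GPU)
print(plan_gpu_layers(8, 32, 0.15, 4.0))    # -> 23 (the rest stays in system memory)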

Conclusion

Ollama 0.1.5’s memory optimizations enable unprecedented efficiency in running long context models. By combining quantized K/V caching, Flash Attention, and intelligent memory scheduling, developers can now process 64k+ token contexts with half the memory previously required. Implement these techniques to:

  • Reduce memory usage by up to 60% for long context workloads
  • Process documents exceeding 50k tokens on mid-range GPUs
  • Maintain high inference speeds while minimizing memory swapping

As AI workloads continue growing in complexity, Ollama’s latest optimizations position it as a leading framework for efficient, large-context processing without hardware upgrades.
