Overview
Ollama enables you to run large language models locally on your own hardware. Perfect for development, testing, privacy-sensitive applications, and offline use. Access Llama, Mistral, Gemma, and many more models without any API costs.
Base URL: Your local Ollama server (default: http://localhost:11434)
Supported Features
✅ Chat Completions
✅ Streaming
✅ Embeddings
✅ Vision (multimodal models)
✅ Custom Models
✅ Model Library (100+ models)
❌ Function Calling (limited support)
❌ Image Generation
Prerequisites
Install Ollama
macOS
# Download from ollama.com or use:
brew install ollama

# Start Ollama
ollama serve

Linux
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama
ollama serve

Windows
# Download installer from ollama.com
# Or use Windows Subsystem for Linux (WSL)

Docker
# Run Ollama in Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Pull a Model
# Pull Llama 3.1 (4.7GB)
ollama pull llama3.1
# Pull Mistral (4.1GB)
ollama pull mistral
# Pull a vision model
ollama pull llava
# List downloaded models
ollama list
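Before wiring Ollama into Portkey, it can help to confirm that the server is reachable and the model has finished downloading. A minimal sketch, assuming the requests package is installed, using Ollama's GET /api/tags endpoint, which lists locally downloaded models:

import requests

# Ask the local Ollama server which models it has (default port 11434)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

models = [m["name"] for m in resp.json().get("models", [])]
print("Downloaded models:", models)

if not any(name.startswith("llama3.1") for name in models):
    print("llama3.1 not found - run: ollama pull llama3.1")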
Quick Start
Chat Completions
from portkey_ai import Portkey

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain the benefits of running models locally"}
    ]
)

print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Popular Models
Meta Llama

| Model | Size | Memory | Description |
|---|---|---|---|
| llama3.3 | 43GB | 32GB | Latest Llama 3.3 70B |
| llama3.1 | 4.7GB | 8GB | Llama 3.1 8B |
| llama3.1:70b | 40GB | 48GB | Llama 3.1 70B |
| llama3.1:405b | 231GB | 256GB+ | Largest Llama |
| llama2 | 3.8GB | 8GB | Llama 2 7B |
Mistral & Mixtral
| Model | Size | Memory | Description |
|---|---|---|---|
| mistral | 4.1GB | 8GB | Mistral 7B |
| mistral-large | 40GB | 48GB | Mistral Large |
| mixtral | 26GB | 32GB | Mixtral 8x7B MoE |
Google Gemma
| Model | Size | Memory | Description |
|---|---|---|---|
| gemma2 | 5.4GB | 8GB | Gemma 2 9B |
| gemma2:27b | 16GB | 20GB | Gemma 2 27B |
| gemma | 5.0GB | 8GB | Gemma 7B |
Vision Models
| Model | Size | Memory | Description |
|---|---|---|---|
| llava | 4.7GB | 8GB | Llama with vision |
| llava:34b | 20GB | 24GB | Larger vision model |
| bakllava | 4.7GB | 8GB | Alternative vision |
Specialized Models
| Model | Size | Purpose |
|---|---|---|
| codellama | 3.8GB | Code generation |
| phi3 | 2.3GB | Microsoft's small model |
| qwen2.5 | 4.7GB | Multilingual |
| deepseek-coder | 3.8GB | Advanced coding |
| nous-hermes2 | 4.1GB | General purpose |
Ollama excels at:
Privacy - Data never leaves your machine
Zero cost - No API fees
Offline use - Works without internet
Fast iteration - No network latency
Customization - Create and modify models
Configuration Options
client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server URL
)
Remote Ollama Server
# Connect to Ollama on another machine
client = Portkey(
    provider="ollama",
    custom_host="http://192.168.1.100:11434"
)
Docker Container
# If Ollama is in Docker
client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)
Advanced Features
System Messages
response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python programming expert."
        },
        {
            "role": "user",
            "content": "How do I read a file?"
        }
    ]
)
Vision (Multimodal)
response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
Local image:
import base64

with open("local_image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
            }
        ]
    }]
)
Embeddings
response = client.embeddings.create(
    model="llama3.1",
    input="Local embeddings with Ollama"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")
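Local embeddings are typically used for similarity search. A small sketch comparing two embeddings with cosine similarity; the helper below is plain Python and the example texts are illustrative:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

emb1 = client.embeddings.create(model="llama3.1", input="Ollama runs models locally").data[0].embedding
emb2 = client.embeddings.create(model="llama3.1", input="Local inference keeps data on your machine").data[0].embedding

print(f"Similarity: {cosine_similarity(emb1, emb2):.3f}")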
Temperature Control
# Deterministic
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# Creative
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.0
)
Model Management
List Models
# List all downloaded models
ollama list
Pull Models
# Download a model
ollama pull llama3.1
# Pull specific size
ollama pull llama3.1:70b
# Pull with tag
ollama pull mistral:7b-instruct-v0.2-q4_0
Remove Models
# Free up space
ollama rm llama2
Run Interactive
# Chat with model in terminal
ollama run llama3.1
# Exit with /bye
Custom Models
Create a Custom Model
Create a Modelfile:
FROM llama3.1
# Set custom parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9
# Set custom system message
SYSTEM You are a helpful Python coding assistant. Always provide working code examples.
Create the model:
ollama create python-expert -f Modelfile
Use your custom model:
response = client.chat.completions.create(
    model="python-expert",
    messages=[{"role": "user", "content": "Write a quicksort function"}]
)
Fallback Configuration
Use local Ollama first, fallback to cloud:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "ollama",
            "custom_host": "http://localhost:11434",
            "override_params": {"model": "llama3.1"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)
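With this config, requests go to the local llama3.1 first and only fall back to OpenAI if the Ollama target errors out (for example, when the local server is down). A quick check, assuming the config above; the model argument here is overridden per target by override_params:

response = client.chat.completions.create(
    model="llama3.1",  # each target's override_params sets the model actually used
    messages=[{"role": "user", "content": "Which backend answered this?"}]
)
print(response.choices[0].message.content)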
Best Practices
Choose appropriate model size - Match to your hardware
Use quantized models - Smaller, faster (q4_0, q5_1)
Monitor memory usage - Leave headroom for system
Keep models updated - ollama pull to update
Use GPU if available - Much faster inference
Warm up models - First request may be slow; see the warm-up sketch after this list
Batch similar requests - Amortize startup cost
Create custom models - Optimize for your use case
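A warm-up sketch for the practice noted above: one tiny request at startup loads the model into memory so that real traffic does not pay the load cost. The throwaway prompt and timing output are illustrative:

import time

def warm_up(client, model="llama3.1"):
    # A minimal request forces Ollama to load the model into memory
    start = time.time()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1
    )
    print(f"{model} warmed up in {time.time() - start:.1f}s")

warm_up(client)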
Hardware Requirements
Minimum Specs
CPU : Modern quad-core
RAM : 8GB (for 7B models)
Disk : 10GB free space
Recommended
CPU : 8+ cores
RAM : 16GB+ (for 13B models)
GPU : NVIDIA with 8GB+ VRAM (optional but recommended)
Disk : 50GB+ SSD
For Larger Models
70B models : 48GB+ RAM
405B models : 256GB+ RAM or multi-GPU setup
Use GPU
Ollama automatically uses GPU if available (NVIDIA, Apple Silicon).
Quantization Levels
| Suffix | Size | Quality | Speed |
|---|---|---|---|
| q4_0 | Smallest | Good | Fastest |
| q4_1 | Small | Better | Fast |
| q5_0 | Medium | Good | Medium |
| q5_1 | Medium | Better | Medium |
| q8_0 | Large | Best | Slow |
| (none) | Largest | Full precision | Slowest |
Example:
# Faster, smaller
ollama pull llama3.1:8b-instruct-q4_0

# Best quality
ollama pull llama3.1:8b-instruct-q8_0
Use Cases
Development & Testing
# Test locally before deploying to production
dev_client = Portkey(
provider = "ollama" ,
custom_host = "http://localhost:11434"
)
Privacy-Sensitive Applications
# Keep sensitive data on-premises
response = client.chat.completions.create(
model = "llama3.1" ,
messages = [{ "role" : "user" , "content" : "Analyze this private data..." }]
)
Offline Applications
# Works without internet
response = client.chat.completions.create(
model = "llama3.1" ,
messages = [{ "role" : "user" , "content" : "Help me while offline" }]
)
Cost Optimization
# Zero API costs
for query in large_batch:
response = client.chat.completions.create(
model = "llama3.1" ,
messages = [{ "role" : "user" , "content" : query}]
)
Pricing
Ollama is completely free!
No API costs
No rate limits
No usage tracking
Run unlimited requests
Only costs: Your hardware and electricity
Troubleshooting
Model Not Found
# Make sure model is pulled
ollama pull llama3.1
ollama list # Verify it's there
Out of Memory
# Use a smaller model or quantized version
ollama pull llama3.1:8b-instruct-q4_0

Slow Performance
# Check if the GPU is being used
ollama ps

# Use a smaller/quantized model
ollama pull mistral:7b-instruct-q4_0
Next Steps
Model Library - Browse 100+ available models
Fallback Routing - Fall back to the cloud when needed
Cost Optimization - Optimize AI costs
Privacy - Private AI deployments