
Overview

Ollama lets you run large language models locally on your own hardware, making it well suited to development, testing, privacy-sensitive applications, and offline use. Access Llama, Mistral, Gemma, and many more models without any API costs.

Base URL: your local Ollama server (default: http://localhost:11434)
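
Before routing requests through Portkey, it can help to confirm the local server is actually reachable. A minimal sketch, assuming the default port and that the requests library is installed (Ollama's GET /api/tags endpoint lists the models you have pulled):

import requests

# Ping the local Ollama server; /api/tags returns the locally pulled models
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is running. Local models:", models)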

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming
  • ✅ Embeddings
  • ✅ Vision (multimodal models)
  • ✅ Custom Models
  • ✅ Model Library (100+ models)
  • ⚠️ Function Calling (limited support)
  • ❌ Image Generation

Prerequisites

Install Ollama

# Download the installer from ollama.com, or install with Homebrew (macOS/Linux):
brew install ollama

# Start Ollama
ollama serve

Pull a Model

# Pull Llama 3.1 (4.7GB)
ollama pull llama3.1

# Pull Mistral (4.1GB)
ollama pull mistral

# Pull a vision model
ollama pull llava

# List downloaded models
ollama list

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain the benefits of running models locally"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Meta Llama

Model           Size     Memory    Description
llama3.3        43GB     32GB      Latest Llama 3.3 70B
llama3.1        4.7GB    8GB       Llama 3.1 8B
llama3.1:70b    40GB     48GB      Llama 3.1 70B
llama3.1:405b   231GB    256GB+    Largest Llama
llama2          3.8GB    8GB       Llama 2 7B

Mistral & Mixtral

Model           Size     Memory    Description
mistral         4.1GB    8GB       Mistral 7B
mistral-large   40GB     48GB      Mistral Large
mixtral         26GB     32GB      Mixtral 8x7B MoE

Google Gemma

Model        Size     Memory    Description
gemma2       5.4GB    8GB       Gemma 2 9B
gemma2:27b   16GB     20GB      Gemma 2 27B
gemma        5.0GB    8GB       Gemma 7B

Vision Models

Model       Size     Memory    Description
llava       4.7GB    8GB       Llama with vision
llava:34b   20GB     24GB      Larger vision model
bakllava    4.7GB    8GB       Alternative vision model

Specialized Models

Model            Size     Purpose
codellama        3.8GB    Code generation
phi3             2.3GB    Microsoft's small model
qwen2.5          4.7GB    Multilingual
deepseek-coder   3.8GB    Advanced coding
nous-hermes2     4.1GB    General purpose

Ollama excels at:
  • Privacy - Data never leaves your machine
  • Zero cost - No API fees
  • Offline use - Works without internet
  • Fast iteration - No network latency
  • Customization - Create and modify models

Configuration Options

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server URL
)

Remote Ollama Server

# Connect to Ollama on another machine
client = Portkey(
    provider="ollama",
    custom_host="http://192.168.1.100:11434"
)

Docker Container

# If Ollama is in Docker
client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Advanced Features

System Messages

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python programming expert."
        },
        {
            "role": "user",
            "content": "How do I read a file?"
        }
    ]
)

Vision (Multimodal)

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
Local image:
import base64

with open("local_image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
            }
        ]
    }]
)

Embeddings

response = client.embeddings.create(
    model="llama3.1",
    input="Local embeddings with Ollama"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")

Temperature Control

# Deterministic
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# Creative
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.0
)

Model Management

List Models

# List all downloaded models
ollama list

Pull Models

# Download a model
ollama pull llama3.1

# Pull specific size
ollama pull llama3.1:70b

# Pull with tag
ollama pull mistral:7b-instruct-v0.2-q4_0

Remove Models

# Free up space
ollama rm llama2

Run Interactive

# Chat with model in terminal
ollama run llama3.1

# Exit with /bye

Custom Models

Create a Custom Model

  1. Create a Modelfile:
FROM llama3.1

# Set custom parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9

# Set custom system message
SYSTEM You are a helpful Python coding assistant. Always provide working code examples.
  2. Create the model:
ollama create python-expert -f Modelfile
  3. Use your custom model:
response = client.chat.completions.create(
    model="python-expert",
    messages=[{"role": "user", "content": "Write a quicksort function"}]
)

Fallback Configuration

Use the local Ollama server first and fall back to a cloud provider when it fails:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "ollama",
            "custom_host": "http://localhost:11434",
            "override_params": {"model": "llama3.1"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)
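
With this config attached, requests go to the local Ollama target first; if it errors or is unreachable, the gateway retries against the OpenAI target. A usage sketch under that assumption (each target's override_params supplies the model actually used):

response = client.chat.completions.create(
    model="llama3.1",  # replaced per target by override_params
    messages=[{"role": "user", "content": "Summarize the benefits of local inference"}]
)
print(response.choices[0].message.content)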

Best Practices

  1. Choose appropriate model size - Match to your hardware
  2. Use quantized models - Smaller, faster (q4_0, q5_1)
  3. Monitor memory usage - Leave headroom for system
  4. Keep models updated - ollama pull to update
  5. Use GPU if available - Much faster inference
  6. Warm up models - First request may be slow (see the warm-up sketch after this list)
  7. Batch similar requests - Amortize startup cost
  8. Create custom models - Optimize for your use case
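
Because Ollama loads a model into memory on first use, a tiny throwaway request can absorb that startup cost before real traffic arrives. A minimal warm-up sketch (the helper name is illustrative; max_tokens just keeps the reply short):

def warm_up(model: str = "llama3.1") -> None:
    # A one-token request forces Ollama to load the model into memory
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1
    )

warm_up()  # subsequent requests skip the model-loading delay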

Hardware Requirements

Minimum Specs

  • CPU: Modern quad-core
  • RAM: 8GB (for 7B models)
  • Disk: 10GB free space

Recommended Specs

  • CPU: 8+ cores
  • RAM: 16GB+ (for 13B models)
  • GPU: NVIDIA with 8GB+ VRAM (optional but recommended)
  • Disk: 50GB+ SSD

For Larger Models

  • 70B models: 48GB+ RAM
  • 405B models: 256GB+ RAM or multi-GPU setup
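
As a rough rule of thumb (an approximation, not an official Ollama figure), a 4-bit-quantized model needs on the order of 0.5 to 0.6 GB of RAM per billion parameters, plus a few GB of overhead for the KV cache and runtime:

def approx_ram_gb(params_billion: float, gb_per_billion_params: float = 0.55, overhead_gb: float = 4.0) -> float:
    # Very rough estimate for a q4-quantized model; actual usage varies with
    # context length, quantization level, and runtime overhead
    return params_billion * gb_per_billion_params + overhead_gb

for size in (8, 70, 405):
    # Prints ballpark figures; the requirements above include extra headroom
    print(f"{size}B params: ~{approx_ram_gb(size):.0f} GB RAM")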

Performance Tips

Use GPU

Ollama automatically uses GPU if available (NVIDIA, Apple Silicon).

Quantization Levels

Suffix   Size       Quality   Speed
q4_0     Smallest   Good      Fastest
q4_1     Small      Better    Fast
q5_0     Medium     Good      Medium
q5_1     Medium     Better    Medium
q8_0     Large      Best      Slow
(none)   Largest    Perfect   Slowest

Example:
# Faster, smaller (quantization tags vary by model; check the tag list on ollama.com)
ollama pull llama3.1:q4_0

# Best quality
ollama pull llama3.1:q8_0

Use Cases

Development & Testing

# Test locally before deploying to production
dev_client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Privacy-Sensitive Applications

# Keep sensitive data on-premises
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Analyze this private data..."}]
)

Offline Applications

# Works without internet
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Help me while offline"}]
)

Cost Optimization

# Zero API costs
for query in large_batch:
    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": query}]
    )

Pricing

Ollama is completely free!
  • No API costs
  • No rate limits
  • No usage tracking
  • Run unlimited requests
Only costs: Your hardware and electricity

Troubleshooting

Model Not Found

# Make sure model is pulled
ollama pull llama3.1
ollama list  # Verify it's there

Out of Memory

# Use smaller model or quantized version
ollama pull llama3.1:q4_0

Slow Performance

# Check if GPU is being used
ollama ps

# Use smaller/quantized model
ollama pull mistral:7b-instruct-q4_0

Model Library

Browse 100+ available models

Fallback Routing

Fallback to cloud when needed

Cost Optimization

Optimize AI costs

Privacy

Private AI deployments