
Overview

Ollama lets you run large language models locally on your own hardware, making it well suited to development, testing, privacy-sensitive applications, and offline use. Access Llama, Mistral, Gemma, and many more models without any API costs.

Base URL: your local Ollama server (default: http://localhost:11434)
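
Before routing requests through Portkey, it can help to confirm the local server is actually reachable. A minimal sketch, assuming the default port and that the requests library is installed (Ollama's GET /api/tags endpoint lists the models you have pulled):

import requests

# Ping the local Ollama server; /api/tags returns the locally pulled models
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is running. Local models:", models)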

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming
  • ✅ Embeddings
  • ✅ Vision (multimodal models)
  • ✅ Custom Models
  • ✅ Model Library (100+ models)
  • ⚠️ Function Calling (limited support)
  • ❌ Image Generation

Prerequisites

Install Ollama

# Download the installer from ollama.com, or install with Homebrew (macOS/Linux):
brew install ollama

# Start Ollama
ollama serve

Pull a Model

# Pull Llama 3.1 (4.7GB)
ollama pull llama3.1

# Pull Mistral (4.1GB)
ollama pull mistral

# Pull a vision model
ollama pull llava

# List downloaded models
ollama list

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain the benefits of running models locally"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Meta Llama

Model           Size     Memory    Description
llama3.3        43GB     32GB      Latest Llama 3.3 70B
llama3.1        4.7GB    8GB       Llama 3.1 8B
llama3.1:70b    40GB     48GB      Llama 3.1 70B
llama3.1:405b   231GB    256GB+    Largest Llama
llama2          3.8GB    8GB       Llama 2 7B

Mistral & Mixtral

Model           Size     Memory    Description
mistral         4.1GB    8GB       Mistral 7B
mistral-large   40GB     48GB      Mistral Large
mixtral         26GB     32GB      Mixtral 8x7B MoE

Google Gemma

Model        Size     Memory    Description
gemma2       5.4GB    8GB       Gemma 2 9B
gemma2:27b   16GB     20GB      Gemma 2 27B
gemma        5.0GB    8GB       Gemma 7B

Vision Models

Model       Size     Memory    Description
llava       4.7GB    8GB       Llama with vision
llava:34b   20GB     24GB      Larger vision model
bakllava    4.7GB    8GB       Alternative vision model

Specialized Models

Model            Size     Purpose
codellama        3.8GB    Code generation
phi3             2.3GB    Microsoft's small model
qwen2.5          4.7GB    Multilingual
deepseek-coder   3.8GB    Advanced coding
nous-hermes2     4.1GB    General purpose

Ollama excels at:
  • Privacy - Data never leaves your machine
  • Zero cost - No API fees
  • Offline use - Works without internet
  • Fast iteration - No network latency
  • Customization - Create and modify models

Configuration Options

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server URL
)

Remote Ollama Server

# Connect to Ollama on another machine
client = Portkey(
    provider="ollama",
    custom_host="http://192.168.1.100:11434"
)

Docker Container

# If Ollama is in Docker
client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Advanced Features

System Messages

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python programming expert."
        },
        {
            "role": "user",
            "content": "How do I read a file?"
        }
    ]
)

Vision (Multimodal)

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
Local image:
import base64

with open("local_image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
            }
        ]
    }]
)

Embeddings

response = client.embeddings.create(
    model="llama3.1",
    input="Local embeddings with Ollama"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")

Temperature Control

# Deterministic
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# Creative
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.0
)

Model Management

List Models

# List all downloaded models
ollama list

Pull Models

# Download a model
ollama pull llama3.1

# Pull specific size
ollama pull llama3.1:70b

# Pull with tag
ollama pull mistral:7b-instruct-v0.2-q4_0

Remove Models

# Free up space
ollama rm llama2

Run Interactive

# Chat with model in terminal
ollama run llama3.1

# Exit with /bye

Custom Models

Create a Custom Model

  1. Create a Modelfile:
FROM llama3.1

# Set custom parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9

# Set custom system message
SYSTEM You are a helpful Python coding assistant. Always provide working code examples.
  2. Create the model:
ollama create python-expert -f Modelfile
  3. Use your custom model:
response = client.chat.completions.create(
    model="python-expert",
    messages=[{"role": "user", "content": "Write a quicksort function"}]
)

Fallback Configuration

Use the local Ollama server first and fall back to a cloud provider when it fails:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "ollama",
            "custom_host": "http://localhost:11434",
            "override_params": {"model": "llama3.1"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)
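
With this config attached, requests go to the local Ollama target first; if it errors or is unreachable, the gateway retries against the OpenAI target. A usage sketch under that assumption (each target's override_params supplies the model actually used):

response = client.chat.completions.create(
    model="llama3.1",  # replaced per target by override_params
    messages=[{"role": "user", "content": "Summarize the benefits of local inference"}]
)
print(response.choices[0].message.content)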

Best Practices

  1. Choose appropriate model size - Match to your hardware
  2. Use quantized models - Smaller, faster (q4_0, q5_1)
  3. Monitor memory usage - Leave headroom for system
  4. Keep models updated - ollama pull to update
  5. Use GPU if available - Much faster inference
  6. Warm up models - First request may be slow (see the warm-up sketch after this list)
  7. Batch similar requests - Amortize startup cost
  8. Create custom models - Optimize for your use case
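
Because Ollama loads a model into memory on first use, a tiny throwaway request can absorb that startup cost before real traffic arrives. A minimal warm-up sketch (the helper name is illustrative; max_tokens just keeps the reply short):

def warm_up(model: str = "llama3.1") -> None:
    # A one-token request forces Ollama to load the model into memory
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1
    )

warm_up()  # subsequent requests skip the model-loading delay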

Hardware Requirements

Minimum Specs

  • CPU: Modern quad-core
  • RAM: 8GB (for 7B models)
  • Disk: 10GB free space

Recommended Specs

  • CPU: 8+ cores
  • RAM: 16GB+ (for 13B models)
  • GPU: NVIDIA with 8GB+ VRAM (optional but recommended)
  • Disk: 50GB+ SSD

For Larger Models

  • 70B models: 48GB+ RAM
  • 405B models: 256GB+ RAM or multi-GPU setup
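
As a rough rule of thumb (an approximation, not an official Ollama figure), a 4-bit-quantized model needs on the order of 0.5 to 0.6 GB of RAM per billion parameters, plus a few GB of overhead for the KV cache and runtime:

def approx_ram_gb(params_billion: float, gb_per_billion_params: float = 0.55, overhead_gb: float = 4.0) -> float:
    # Very rough estimate for a q4-quantized model; actual usage varies with
    # context length, quantization level, and runtime overhead
    return params_billion * gb_per_billion_params + overhead_gb

for size in (8, 70, 405):
    # Prints ballpark figures; the requirements above include extra headroom
    print(f"{size}B params: ~{approx_ram_gb(size):.0f} GB RAM")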

Performance Tips

Use GPU

Ollama automatically uses GPU if available (NVIDIA, Apple Silicon).

Quantization Levels

Suffix   Size       Quality   Speed
q4_0     Smallest   Good      Fastest
q4_1     Small      Better    Fast
q5_0     Medium     Good      Medium
q5_1     Medium     Better    Medium
q8_0     Large      Best      Slow
(none)   Largest    Perfect   Slowest

Example:
# Faster, smaller (quantization tags vary by model; check the tag list on ollama.com)
ollama pull llama3.1:q4_0

# Best quality
ollama pull llama3.1:q8_0

Use Cases

Development & Testing

# Test locally before deploying to production
dev_client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Privacy-Sensitive Applications

# Keep sensitive data on-premises
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Analyze this private data..."}]
)

Offline Applications

# Works without internet
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Help me while offline"}]
)

Cost Optimization

# Zero API costs
for query in large_batch:
    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": query}]
    )

Pricing

Ollama is completely free!
  • No API costs
  • No rate limits
  • No usage tracking
  • Run unlimited requests
Only costs: Your hardware and electricity

Troubleshooting

Model Not Found

# Make sure model is pulled
ollama pull llama3.1
ollama list  # Verify it's there

Out of Memory

# Use smaller model or quantized version
ollama pull llama3.1:q4_0

Slow Performance

# Check if GPU is being used
ollama ps

# Use smaller/quantized model
ollama pull mistral:7b-instruct-q4_0

Model Library

Browse 100+ available models

Fallback Routing

Fallback to cloud when needed

Cost Optimization

Optimize AI costs

Privacy

Private AI deployments