Overview
DeepInfra provides access to 100+ open-source and proprietary AI models with cost-effective inference, serverless deployment, and pay-as-you-go pricing. Perfect for developers seeking affordable AI at scale.
Base URL: https://api.deepinfra.com/v1/openai
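This endpoint is OpenAI-compatible, so it can also be called directly with the standard OpenAI SDK. A minimal sketch, independent of the Portkey setup shown below:
from openai import OpenAI

# Point the standard OpenAI SDK at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="***"  # Your DeepInfra API key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)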
Supported Features
✅ Chat Completions
✅ Streaming
✅ Vision (select models)
✅ Function Calling (select models; sketched under Advanced Features below)
❌ Embeddings (via separate API)
❌ Image Generation (via separate API)
❌ Fine-tuning
Quick Start
Chat Completions
from portkey_ai import Portkey

client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Your DeepInfra API key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain DeepInfra's advantages"}
    ]
)

print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Popular Models
Meta Llama

| Model | Context | Price Tier | Description |
|-------|---------|------------|-------------|
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128K | Premium | Largest Llama |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128K | Mid | Balanced |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128K | Budget | Fast, cheap |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 128K | Premium | Vision |
Mistral & Mixtral
| Model | Context | Price Tier |
|-------|---------|------------|
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64K | Mid |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 32K | Budget |
| mistralai/Mistral-7B-Instruct-v0.3 | 32K | Budget |
Qwen
| Model | Context | Description |
|-------|---------|-------------|
| Qwen/Qwen2.5-72B-Instruct | 32K | Latest Qwen |
| Qwen/Qwen2.5-7B-Instruct | 32K | Efficient |
| Qwen/QwQ-32B-Preview | 32K | Reasoning |
Specialized Models
| Model | Type | Use Case |
|-------|------|----------|
| microsoft/WizardLM-2-8x22B | Code/Chat | Coding tasks |
| cognitivecomputations/dolphin-2.6-mixtral-8x7b | Chat | Uncensored |
| lizpreciatior/lzlv_70b_fp16_hf | Roleplay | Creative |
DeepInfra excels at:
Cost-effectiveness - Up to 10x cheaper than alternatives
Model variety - 100+ models available
Serverless - No infrastructure management
Pay-as-you-go - No minimum commitment
Fast deployment - Instant access to models
Configuration Options
client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Bearer token
)
| Header | Description | Required |
|--------|-------------|----------|
| Authorization | DeepInfra API key | Yes |
Advanced Features
System Messages
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant."
        },
        {
            "role": "user",
            "content": "Write a Python function to sort a list"
        }
    ]
)
Temperature and Sampling
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Generate creative ideas"}],
    temperature=0.9,        # Higher for creativity
    top_p=0.95,             # Nucleus sampling
    max_tokens=500,         # Limit response length
    frequency_penalty=0.5   # Reduce repetition
)
Vision Models
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
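Function Calling
On models that support tools, function calling follows the OpenAI tools format. A minimal sketch reusing the client from the Quick Start; get_weather here is a hypothetical tool, and tool support should be verified per model:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a real API
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }]
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)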
Multi-turn Conversations
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=conversation
)
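The chat API is stateless, so each request carries the full history. To continue the exchange, append the model's reply and the next user turn before calling again:
# Append the assistant's reply and the next user message, then call again.
conversation.append({
    "role": "assistant",
    "content": response.choices[0].message.content
})
conversation.append({"role": "user", "content": "How is it used in practice?"})

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=conversation
)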
Cost Optimization
Choose the Right Model
# For simple tasks - use 8B (cheapest)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Simple question"}]
)

# For complex tasks - use 70B (balanced)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Complex reasoning task"}]
)

# For the most complex tasks - use 405B (premium)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Very complex task"}]
)
Set Token Limits
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Brief answer please"}],
    max_tokens=100  # Control costs by limiting output
)
Fallback Configuration
Fallback to OpenAI if needed:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)
Load Balancing
Balance cost vs quality:
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
            "weight": 0.7  # 70% to the cheap model
        },
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"},
            "weight": 0.3  # 30% to the better model
        }
    ]
}

client = Portkey().with_options(config=config)
Error Handling
from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")
Best Practices
Start with smaller models - Test with 8B before using 70B
Set max_tokens - Control costs
Use streaming - Better UX
Cache responses - Reduce API calls
Monitor costs - DeepInfra provides a usage dashboard
Choose right model - Balance cost vs quality
Batch similar requests - More efficient
Handle rate limits - Implement backoff (see the sketch below)
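A minimal sketch of the backoff practice above, reusing the RateLimitError import from the Error Handling example; retry counts and delays are illustrative:
import time

from portkey_ai.exceptions import RateLimitError  # as in the Error Handling example

def create_with_backoff(client, max_retries=5, **kwargs):
    """Retry chat completions on rate limits with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... (illustrative delays)

response = create_with_backoff(
    client,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)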
Use Cases
Budget-Conscious Development
# Use cheap 8B model for development
dev_client = Portkey(
    provider="deepinfra",
    Authorization="***"
)

response = dev_client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Test query"}]
)
High-Volume Applications
# Cost-effective for large scale
for user_query in user_queries:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": user_query}],
        max_tokens=200  # Limit costs
    )
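When sequential calls become the bottleneck, requests can be issued concurrently. A sketch using a thread pool, with user_queries and client as above, assuming the client may be shared across threads (worth verifying for your SDK version):
from concurrent.futures import ThreadPoolExecutor

def ask(query):
    return client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": query}],
        max_tokens=200
    )

# Cap concurrency well below your rate-limit tier.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(ask, user_queries))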
A/B Testing Models
# Test different models cost-effectively
models_to_test = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "mistralai/Mixtral-8x7B-Instruct-v0.1"
]

for model in models_to_test:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": test_prompt}]
    )
    # Compare results
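To make the comparison concrete, one option is to record each model's latency and output side by side (a sketch; test_prompt is assumed to be defined as above):
import time

results = {}
for model in models_to_test:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": test_prompt}]
    )
    results[model] = {
        "latency_s": round(time.perf_counter() - start, 2),
        "output": response.choices[0].message.content
    }

# Print a compact summary for side-by-side review.
for model, result in results.items():
    print(model, result["latency_s"], result["output"][:80])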
Rate Limits
Generous free tier for testing
Pay-as-you-go with no minimums
Rate limits based on tier
Contact DeepInfra for enterprise needs
Pricing Advantages
DeepInfra typically offers:
50-90% cheaper than major providers
No minimum spend requirement
Free credits for new users
Transparent pricing per token
DeepInfra Pricing: View detailed pricing for all DeepInfra models
Getting Started
Sign up at DeepInfra
Get your API key
Start with free credits
Scale as needed
Together AI: Alternative open-models platform
Cost Optimization: Reduce AI costs
Load Balancing: Balance cost vs quality
Caching: Cache responses for cost savings