## Overview

Groq provides lightning-fast LLM inference on its custom Language Processing Unit (LPU) hardware, delivering speeds of 500+ tokens per second. It is well suited to applications that need ultra-low-latency responses from popular open-source models.

**Base URL:** `https://api.groq.com/openai/v1`
## Supported Features

- ✅ Chat Completions
- ✅ Streaming (extremely fast)
- ✅ Function Calling
- ✅ Vision (select models)
- ✅ JSON Mode
- ❌ Embeddings
- ❌ Image Generation
- ❌ Fine-tuning
## Quick Start

### Chat Completions

```python
from portkey_ai import Portkey

client = Portkey(
    provider="groq",
    Authorization="***"  # Your Groq API key
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Explain Groq's LPU technology"}
    ]
)

print(response.choices[0].message.content)
```
### Ultra-Fast Streaming

```python
import time

start = time.time()

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Count from 1 to 100"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

end = time.time()
print(f"\n\nCompleted in {end - start:.2f} seconds")
# Often completes in under 2 seconds!
```
## Available Models

### Meta Llama

| Model | Context | Speed | Description |
|-------|---------|-------|-------------|
| llama-3.3-70b-versatile | 128K | Ultra-fast | Latest Llama 3.3 |
| llama-3.1-70b-versatile | 128K | Ultra-fast | Llama 3.1 70B |
| llama-3.1-8b-instant | 128K | Instant | Fastest Llama |
| llama-3.2-90b-vision-preview | 128K | Fast | Vision-enabled |
| llama-3.2-11b-vision-preview | 128K | Very fast | Smaller vision model |
### Mixtral

| Model | Context | Speed | Description |
|-------|---------|-------|-------------|
| mixtral-8x7b-32768 | 32K | Ultra-fast | Efficient MoE |
### Google Gemma

| Model | Context | Speed | Description |
|-------|---------|-------|-------------|
| gemma2-9b-it | 8K | Very fast | Gemma 2 9B |
| gemma-7b-it | 8K | Very fast | Gemma 7B |
### Other Models

| Model | Context | Description |
|-------|---------|-------------|
| llama-guard-3-8b | 8K | Content moderation |
| llama3-groq-70b-8192-tool-use-preview | 8K | Tool use optimized |
Groq excels at:

- **Ultra-low latency**: 500+ tokens/second
- **Streaming speed**: Nearly instant response start
- **Consistent performance**: Predictable latency
- **Real-time applications**: Chat, assistants, games
- **High throughput**: Handles many concurrent requests
## Configuration Options

```python
client = Portkey(
    provider="groq",
    Authorization="***"  # Bearer token
)
```

| Header | Description | Required |
|--------|-------------|----------|
| `Authorization` | Groq API key | Yes |
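If you prefer not to hard-code the key, it can be read from the environment. A minimal sketch; the `GROQ_API_KEY` variable name is our own convention, not Portkey's:

```python
import os

from portkey_ai import Portkey

# Read the Groq key from an environment variable (the name is our choice,
# not a Portkey convention) so it never appears in source control.
client = Portkey(
    provider="groq",
    Authorization=os.environ["GROQ_API_KEY"],  # assumed env var name
)
```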
## Advanced Features

### Function Calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {
                        "type": "string",
                        "description": "Timezone name"
                    }
                },
                "required": ["timezone"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What time is it in Tokyo?"}],
    tools=tools
)
```
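When the model decides to call the tool, the response carries `tool_calls` instead of text. A minimal sketch of the full round trip, following the OpenAI-compatible message shapes; the `get_current_time` implementation here is a stand-in:

```python
import json
from datetime import datetime
from zoneinfo import ZoneInfo

def get_current_time(timezone: str) -> str:
    # Stand-in implementation of the tool declared above.
    return datetime.now(ZoneInfo(timezone)).isoformat()

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_current_time(**args)

    # Echo the assistant's tool call, append the tool result, and ask the
    # model to produce the final answer.
    followup = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "user", "content": "What time is it in Tokyo?"},
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }],
            },
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)
```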
### Vision (Multimodal)

```python
response = client.chat.completions.create(
    model="llama-3.2-90b-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)
```
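Local images can be sent as base64 data URLs, following the same OpenAI-compatible convention; a minimal sketch, assuming a local JPEG named `photo.jpg`:

```python
import base64

# Encode a local file as a data URL; JPEG is assumed here.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama-3.2-90b-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this photo"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
        ],
    }],
)
```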
### JSON Mode

The prompt should explicitly mention JSON when `response_format` is set to `json_object`:

```python
import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{
        "role": "user",
        "content": "List 5 programming languages with their release years as a JSON object"
    }],
    response_format={"type": "json_object"}
)

result = json.loads(response.choices[0].message.content)
print(result)
```
### Temperature Control

```python
# More deterministic (good for factual tasks)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# More creative
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=1.0
)
```
### Max Tokens Control

```python
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain quantum physics"}],
    max_tokens=500  # Limit response length
)
```
## Speed Comparison

```python
import time

from portkey_ai import Portkey

def benchmark_provider(provider, model, prompt):
    client = Portkey(provider=provider, Authorization="***")
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    end = time.time()
    return end - start

# Groq is typically 5-10x faster
groq_time = benchmark_provider("groq", "llama-3.3-70b-versatile", "Write a haiku")
print(f"Groq: {groq_time:.2f}s")
```
## Fallback Configuration

Use Groq first for speed, and fall back to other providers if it fails:

```python
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.3-70b-versatile"}
        },
        {
            "provider": "together-ai",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"}
        }
    ]
}

client = Portkey().with_options(config=config)
```
## Load Balancing

Balance traffic across Groq models:

```python
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.3-70b-versatile"},
            "weight": 0.7
        },
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.1-8b-instant"},
            "weight": 0.3
        }
    ]
}

client = Portkey().with_options(config=config)
```
## Error Handling

```python
from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
    # Groq has generous rate limits, but they do exist
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")
```
## Best Practices

- **Leverage speed**: Build real-time features
- **Use streaming**: Take advantage of the near-instant response start
- **Enable function calling**: Fast tool use
- **Use 8B models for simple tasks**: Instant responses
- **Use 70B models for complex tasks**: Still very fast
- **Handle rate limits**: The free tier has limits
- **Monitor latency**: Groq provides latency metrics
- **Cache when possible**: Even faster responses (see the sketch below)
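Caching is configured through a gateway config. A minimal sketch assuming Portkey's simple cache mode; the `max_age` value (in seconds) is illustrative:

```python
# Gateway config enabling Portkey's simple response cache; cache hits are
# served by the gateway without calling Groq at all.
config = {
    "cache": {"mode": "simple", "max_age": 3600},  # cache responses for 1 hour
    "provider": "groq",
    "api_key": "***",
    "override_params": {"model": "llama-3.3-70b-versatile"},
}

client = Portkey().with_options(config=config)
```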
## Use Cases

### Real-time Chat

```python
# Ultra-responsive chat experience; conversation_history holds the prior
# {role, content} turns (see the loop sketch below)
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=conversation_history,
    stream=True
)
```
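A minimal interactive loop built on this pattern, accumulating turns in `conversation_history`; the loop structure is illustrative:

```python
conversation_history = []

while True:
    user_input = input("You: ")
    conversation_history.append({"role": "user", "content": user_input})

    stream = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=conversation_history,
        stream=True,
    )

    # Print tokens as they arrive and keep the full reply for the next turn.
    reply = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            reply += delta
    print()
    conversation_history.append({"role": "assistant", "content": reply})
```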
### Code Completion

```python
# Near-instant code suggestions
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": f"Complete this code: {code_snippet}"}],
    max_tokens=200
)
```
### Gaming NPCs

```python
# Real-time NPC responses
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": f"NPC reaction to: {player_action}"}],
    temperature=0.8
)
```
## Rate Limits

**Free Tier:**

- 30 requests per minute (see the throttle sketch after this list)
- 14,400 requests per day
- Generous enough for development

**Paid Tiers:**

- Higher rate limits
- Priority access
- Contact Groq for details
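To stay under the free tier's 30 requests per minute, a simple client-side throttle helps; this `RateLimiter` is a sketch of our own, not an official Groq or Portkey utility:

```python
import time
from collections import deque

class RateLimiter:
    """Client-side throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = time.time()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] > self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            # Sleep until the oldest call falls outside the window.
            time.sleep(self.window - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.time())

limiter = RateLimiter(limit=30, window=60.0)
limiter.wait()  # call before each request
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello"}],
)
```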
## LPU Technology

Groq's Language Processing Unit (LPU) provides:

- **Deterministic performance**: Consistent latency
- **Low latency**: Under 1 second for most requests
- **High throughput**: 500+ tokens/second
- **Energy efficiency**: Lower power consumption
- **Scalability**: Handles large workloads
## Pricing

Groq offers very competitive pricing; see Groq's pricing page for detailed pricing of all Groq models.
## Getting Started

1. Sign up at the Groq Console
2. Get your API key
3. Start with the free tier
4. Experience the speed!
**See also:**

- **Together AI**: Alternative open models
- **Anyscale**: Another fast inference option
- **Streaming**: Optimize streaming responses
- **Real-time Apps**: Build real-time applications