Scaling Decisions: From Prototype to Production
A Principal TPM's analysis of scaling LLM-powered applications: architecture decisions, trade-offs, and interview-ready explanations for production systems.
Executive Summary
Building the AI Ingredient Safety Analyzer taught valuable lessons about scaling LLM-powered applications. This analysis covers the key architectural decisions, trade-offs, and interview-ready explanations for production scaling.
Current State: ~47-second response time, ~1 RPS throughput
Challenge: Scale to handle 10x traffic without degradation
The Scaling Challenge
Our Ingredient Analysis API processes requests requiring:
- Multiple LLM calls (Research → Analysis → Critic validation)
- Vector database queries (Qdrant)
- Real-time web search (Google Search grounding)
Each request involves 3+ LLM round-trips, making traditional scaling approaches insufficient.
Key Scaling Questions & Answers
Q1: How would you scale this API to handle 10x more traffic?
Three-Pronged Approach:
1. Response Caching with Redis/Memcached (see the caching sketch below)
- Cache ingredient research data (24-72 hour TTL)
- Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
- Expected improvement: 5x throughput for cached requests
2. API Key Load Balancing
- Pool multiple Gemini API keys
- Implement rate-aware key selection
- N keys = N× capacity (linear scaling)
3. Async Processing with Queue
- Move to job queue (Celery/Redis Queue)
- Return job ID immediately, poll for results
- Prevents timeout issues on slow requests
Trade-off: Caching introduces a stale-data risk. Mitigation: Invalidate the cache on safety data updates and use TTLs matched to data volatility.
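A minimal sketch of the caching prong, assuming redis-py; `run_full_analysis` is a placeholder for the three-agent pipeline, and the key layout is illustrative:

```python
import hashlib
import json

import redis  # assumes redis-py

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
REPORT_TTL_SECONDS = 6 * 3600  # full reports: 1-6 hour TTL, per the numbers above


def run_full_analysis(ingredient: str, profile: dict) -> dict:
    return {"ingredient": ingredient, "verdict": "stub"}  # stand-in for the agent pipeline


def cached_analysis(ingredient: str, profile: dict) -> dict:
    # Key the cache on a hash of ingredient + user profile, as described above.
    raw = json.dumps({"ingredient": ingredient, "profile": profile}, sort_keys=True)
    key = "report:" + hashlib.sha256(raw.encode()).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no LLM calls at all

    report = run_full_analysis(ingredient, profile)
    cache.set(key, json.dumps(report), ex=REPORT_TTL_SECONDS)
    return report
```

Research-only data would use the longer 24-72 hour TTL, and a safety data update would delete the affected keys directly rather than waiting for expiry.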
Q2: Why did you choose Qdrant over other vector databases?
| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |
Decision Rationale:
- Qdrant Cloud offers generous free tier (1GB)
- Excellent hybrid search (vector + payload filtering); see the filtered-search sketch below
- Can self-host later for cost optimization
- Simple REST API for debugging
Trade-off: Qdrant is less mature than Pinecone. Mitigation: Active development and good documentation offset this risk.
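To make the hybrid-search point concrete, a minimal sketch using the qdrant-client SDK; the cluster URL, collection name, payload field, and query vector are all hypothetical:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")

# One query combines vector similarity with payload filtering.
hits = client.search(
    collection_name="ingredient_safety_docs",   # hypothetical collection
    query_vector=[0.1] * 768,                    # stand-in for a real embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",                   # hypothetical payload field
                match=models.MatchValue(value="preservative"),
            )
        ]
    ),
    limit=5,
)
```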
Q3: How do you handle API rate limits from Gemini?
Current Approach: Single key - limited capacity
Scaled Approach: Rate-limited key pool
```python
import time


class RateLimitedKeyPool:
    """Pool of Gemini API keys with per-key requests-per-minute tracking."""

    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Drop request timestamps older than 1 minute
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted
```
Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?
Q4: Why use a multi-agent architecture instead of a single LLM call?
| Approach | Pros | Cons |
|---|---|---|
| Single-call | Faster, simpler | Less accurate, no self-correction |
| Multi-agent | Better accuracy, separation of concerns, self-correction | 3x LLM calls, higher latency |
Decision Rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
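A compressed sketch of the Research → Analysis → Critic loop; the agent functions are stubs standing in for the real LLM calls, and the retry behavior mirrors the escalation policy described in Q8:

```python
def research_agent(ingredient: str) -> dict:
    return {"ingredient": ingredient, "sources": []}   # stub: would call the LLM + web search

def analysis_agent(research: dict, profile: dict) -> dict:
    return {"ingredient": research["ingredient"], "risk": "low"}  # stub: would call the LLM

def critic_agent(report: dict, research: dict) -> bool:
    return True  # stub: would check claims against the research sources

def analyze_ingredient(ingredient: str, profile: dict, max_retries: int = 3) -> dict:
    research = research_agent(ingredient)              # LLM call 1
    for _ in range(max_retries):
        report = analysis_agent(research, profile)     # LLM call 2
        if critic_agent(report, research):             # LLM call 3
            return report                              # critic approved
    raise RuntimeError("Critic rejected the report; escalate to manual review")
```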
Q5: How do you ensure consistency between mobile and web clients?
Architecture Decisions:
1. Single REST API - Both clients call the same /analyze endpoint
2. Shared response schema - Pydantic models define the contract (sketched below)
3. API versioning - /api/v1/analyze allows future breaking changes
Trade-off: Single API means both clients get same data, even if one needs less. We accept slight over-fetching for consistency.
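A minimal sketch of the shared contract, assuming FastAPI; the field names are illustrative rather than the project's actual schema:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    ingredient: str
    profile: dict

class AnalyzeResponse(BaseModel):
    ingredient: str
    risk_level: str
    summary: str

@app.post("/api/v1/analyze", response_model=AnalyzeResponse)
async def analyze(req: AnalyzeRequest) -> AnalyzeResponse:
    # Both mobile and web clients call this one versioned endpoint.
    return AnalyzeResponse(ingredient=req.ingredient, risk_level="low", summary="stub")
```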
Q6: How would you add real-time updates for long-running requests?
| Approach | Pros | Cons | Use Case |
|---|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed | Simple UIs |
| WebSockets | Real-time, bidirectional | Complex, stateful | Chat apps |
| Server-Sent Events | Real-time, simple | One-way only | Progress updates |
| Webhooks | Decoupled | Requires client endpoint | B2B integrations |
Recommendation: Server-Sent Events (SSE) for progress updates - perfect fit for long-running LLM requests.
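A minimal SSE sketch using FastAPI's StreamingResponse; the endpoint path and stage names are illustrative:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def progress_events(job_id: str):
    # Stand-in for real pipeline progress: each stage would be emitted as it completes.
    for stage in ("research", "analysis", "critic", "done"):
        await asyncio.sleep(1)  # simulate work
        yield f"data: {json.dumps({'job_id': job_id, 'stage': stage})}\n\n"

@app.get("/api/v1/analyze/{job_id}/events")
async def stream_progress(job_id: str):
    return StreamingResponse(progress_events(job_id), media_type="text/event-stream")
```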
Q7: What's your testing strategy for LLM-based features?
Testing Pyramid for LLM Apps:
1. Unit tests - Mock LLM responses, test business logic
2. Integration tests - Test agent orchestration with fixtures
3. Contract tests - Verify LLM output schema compliance
4. Evaluation tests - Test accuracy on labeled datasets
5. Load tests - Verify performance under stress
Key Insight: LLM outputs are non-deterministic. Solutions:
- Use `temperature=0.1` for more consistent outputs
- Test for schema compliance, not exact text matching (see the test sketch below)
- Build evaluation datasets with expected categories
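A sketch of a contract-style test that checks schema compliance instead of exact wording; the model and parser are simplified stand-ins for the project's Pydantic schemas:

```python
import json

from pydantic import BaseModel

class IngredientReport(BaseModel):
    ingredient: str
    risk_level: str
    summary: str

def parse_report(raw: str) -> IngredientReport:
    return IngredientReport(**json.loads(raw))

def test_report_schema_compliance():
    # The "LLM output" is a fixture here; in real tests it would come from a mock.
    fake_llm_output = json.dumps(
        {"ingredient": "sodium benzoate", "risk_level": "low", "summary": "Generally safe."}
    )
    report = parse_report(fake_llm_output)
    assert report.risk_level in {"low", "moderate", "high"}
```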
Q8: How do you handle failures gracefully?
| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |
Design Principle: Every component needs a fallback strategy.
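A minimal sketch combining exponential backoff with key switching; call_llm and RateLimitError are placeholders for the real Gemini client call and its rate-limit exception:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the SDK's 429 error."""

def call_llm(prompt: str, api_key: str) -> str:
    return "stub response"  # placeholder for the real API call

def call_with_fallback(prompt: str, api_keys: list[str], max_attempts: int = 4) -> str:
    delay = 1.0
    key_index = 0
    for _ in range(max_attempts):
        try:
            return call_llm(prompt, api_keys[key_index])
        except RateLimitError:
            key_index = (key_index + 1) % len(api_keys)  # 429: switch to a backup key
        except TimeoutError:
            pass                                         # timeout: retry the same key
        time.sleep(delay + random.uniform(0, 0.5))       # exponential backoff with jitter
        delay *= 2
    raise RuntimeError("LLM call failed after all retries")
```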
Q9: What would you do differently if starting over?
1. Start with async from day one - Easier to add concurrency later
2. Implement caching earlier - Would have saved development API costs
3. Use structured outputs - Gemini's JSON mode for reliable parsing (see the sketch below)
4. Add observability first - LangSmith integration should be from start
5. Design for horizontal scaling - Stateless API from the beginning
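For the structured-outputs point, a sketch assuming the google-generativeai SDK's JSON response mode; the model name and prompt are illustrative:

```python
import json

import google.generativeai as genai  # assumes the google-generativeai SDK

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

response = model.generate_content(
    "Summarize safety concerns for sodium benzoate as JSON with keys "
    "'ingredient' and 'concerns'.",
    generation_config={"response_mime_type": "application/json"},  # JSON mode
)
data = json.loads(response.text)  # parses directly, no regex cleanup of markdown fences
```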
Q10: How do you balance cost vs performance?
Cost Breakdown Per Request:
- Gemini API: ~$0.01-0.05 (depending on tokens)
- Qdrant Cloud: Included in free tier
- Railway hosting: ~$5/month
- Google Search: Included in Gemini grounding
Optimization Strategies:
1. Cache common ingredients - 80% of requests hit the top 100 ingredients
2. Use smaller models for validation - Critic doesn't need full model
3. Batch embeddings - Reduce API calls for multiple ingredients
4. Set appropriate TTLs - Balance freshness vs cost (see the TTL sketch below)
Trade-off: Aggressive caching reduces costs but may serve stale safety data. Mitigation: 24-hour TTL with manual invalidation for critical updates.
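A sketch of tiering TTLs by data volatility with a manual invalidation hook for safety-critical updates; the categories and the redis-py usage are illustrative:

```python
import redis  # assumes redis-py

TTL_SECONDS = {
    "regulatory_alert": 1 * 3600,       # fast-moving safety data
    "analysis_report": 24 * 3600,       # default report cache
    "ingredient_research": 72 * 3600,   # slow-moving background research
}

def ttl_for(data_type: str) -> int:
    return TTL_SECONDS.get(data_type, 24 * 3600)

def invalidate_ingredient(cache: redis.Redis, ingredient: str) -> None:
    # Manual invalidation when a safety update lands, regardless of remaining TTL.
    for key in cache.scan_iter(match=f"*{ingredient}*"):
        cache.delete(key)
```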
Principal TPM Perspective
Scaling LLM applications requires balancing competing concerns:
| Trade-off | Option A | Option B | Our Decision |
|---|---|---|---|
| Latency vs Accuracy | Fast responses | Thorough validation | Accuracy (safety-critical) |
| Cost vs Freshness | Aggressive caching | Real-time data | Balanced TTLs |
| Simplicity vs Resilience | Simple architecture | Multiple fallbacks | Resilience |
| Speed vs Safety | Ship fast | Validate thoroughly | Safety |
The key is making intentional trade-offs based on specific requirements, then documenting the reasoning for future reference.
*This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.*