Scaling Decisions: From Prototype to Production
A Principal TPM's analysis of scaling LLM-powered applications: architecture decisions, trade-offs, and interview-ready explanations for production systems.
Executive Summary
Building the AI Ingredient Safety Analyzer taught valuable lessons about scaling LLM-powered applications. This analysis covers the key architectural decisions, trade-offs, and interview-ready explanations for production scaling.
Current State: ~47-second response time, ~1 RPS throughput
Challenge: Scale to handle 10x traffic without degradation
The Scaling Challenge
Our Ingredient Analysis API processes requests requiring:
- Multiple LLM calls (Research → Analysis → Critic validation)
- Vector database queries (Qdrant)
- Real-time web search (Google Search grounding)
Each request involves 3+ LLM round-trips, making traditional scaling approaches insufficient.
Key Scaling Questions & Answers
Q1: How would you scale this API to handle 10x more traffic?
Three-Pronged Approach:
1. Response Caching with Redis/Memcached (see the caching sketch below)
- Cache ingredient research data (24-72 hour TTL)
- Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
- Expected improvement: 5x throughput for cached requests
2. API Key Load Balancing
- Pool multiple Gemini API keys
- Implement rate-aware key selection
- N keys = N× capacity (linear scaling)
3. Async Processing with Queue
- Move to job queue (Celery/Redis Queue)
- Return job ID immediately, poll for results
- Prevents timeout issues on slow requests
Trade-off: Caching introduces a stale-data risk. Mitigation: Invalidate the cache on safety data updates and use TTLs matched to data volatility.
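A minimal sketch of the caching prong, assuming redis-py; `run_full_analysis` is a placeholder for the three-agent pipeline, and the key layout is illustrative:

```python
import hashlib
import json

import redis  # assumes redis-py

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
REPORT_TTL_SECONDS = 6 * 3600  # full reports: 1-6 hour TTL, per the numbers above


def run_full_analysis(ingredient: str, profile: dict) -> dict:
    return {"ingredient": ingredient, "verdict": "stub"}  # stand-in for the agent pipeline


def cached_analysis(ingredient: str, profile: dict) -> dict:
    # Key the cache on a hash of ingredient + user profile, as described above.
    raw = json.dumps({"ingredient": ingredient, "profile": profile}, sort_keys=True)
    key = "report:" + hashlib.sha256(raw.encode()).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no LLM calls at all

    report = run_full_analysis(ingredient, profile)
    cache.set(key, json.dumps(report), ex=REPORT_TTL_SECONDS)
    return report
```

Research-only data would use the longer 24-72 hour TTL, and a safety data update would delete the affected keys directly rather than waiting for expiry.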
Q2: Why did you choose Qdrant over other vector databases?
| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |
Decision Rationale:
- Qdrant Cloud offers generous free tier (1GB)
- Excellent hybrid search (vector + payload filtering); see the filtered-search sketch below
- Can self-host later for cost optimization
- Simple REST API for debugging
Trade-off: Qdrant is less mature than Pinecone. Mitigation: Active development and good documentation offset this risk.
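To make the hybrid-search point concrete, a minimal sketch using the qdrant-client SDK; the cluster URL, collection name, payload field, and query vector are all hypothetical:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")

# One query combines vector similarity with payload filtering.
hits = client.search(
    collection_name="ingredient_safety_docs",   # hypothetical collection
    query_vector=[0.1] * 768,                    # stand-in for a real embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",                   # hypothetical payload field
                match=models.MatchValue(value="preservative"),
            )
        ]
    ),
    limit=5,
)
```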
Q3: How do you handle API rate limits from Gemini?
Current Approach: Single key - limited capacity
Scaled Approach: Rate-limited key pool
```python
import time


class RateLimitedKeyPool:
    """Pool of Gemini API keys with per-key requests-per-minute tracking."""

    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Drop request timestamps older than 1 minute
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted
```
Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?
Q4: Why use a multi-agent architecture instead of a single LLM call?
| Approach | Pros | Cons |
|---|---|---|
| Single-call | Faster, simpler | Less accurate, no self-correction |
| Multi-agent | Better accuracy, separation of concerns, self-correction | 3x LLM calls, higher latency |
Decision Rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
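A compressed sketch of the Research → Analysis → Critic loop; the agent functions are stubs standing in for the real LLM calls, and the retry behavior mirrors the escalation policy described in Q8:

```python
def research_agent(ingredient: str) -> dict:
    return {"ingredient": ingredient, "sources": []}   # stub: would call the LLM + web search

def analysis_agent(research: dict, profile: dict) -> dict:
    return {"ingredient": research["ingredient"], "risk": "low"}  # stub: would call the LLM

def critic_agent(report: dict, research: dict) -> bool:
    return True  # stub: would check claims against the research sources

def analyze_ingredient(ingredient: str, profile: dict, max_retries: int = 3) -> dict:
    research = research_agent(ingredient)              # LLM call 1
    for _ in range(max_retries):
        report = analysis_agent(research, profile)     # LLM call 2
        if critic_agent(report, research):             # LLM call 3
            return report                              # critic approved
    raise RuntimeError("Critic rejected the report; escalate to manual review")
```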
Q5: How do you ensure consistency between mobile and web clients?
Architecture Decisions:
1. Single REST API - Both clients call the same /analyze endpoint
2. Shared response schema - Pydantic models define the contract (sketched below)
3. API versioning - /api/v1/analyze allows future breaking changes
Trade-off: Single API means both clients get same data, even if one needs less. We accept slight over-fetching for consistency.
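A minimal sketch of the shared contract, assuming FastAPI; the field names are illustrative rather than the project's actual schema:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    ingredient: str
    profile: dict

class AnalyzeResponse(BaseModel):
    ingredient: str
    risk_level: str
    summary: str

@app.post("/api/v1/analyze", response_model=AnalyzeResponse)
async def analyze(req: AnalyzeRequest) -> AnalyzeResponse:
    # Both mobile and web clients call this one versioned endpoint.
    return AnalyzeResponse(ingredient=req.ingredient, risk_level="low", summary="stub")
```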
Q6: How would you add real-time updates for long-running requests?
| Approach | Pros | Cons | Use Case |
|---|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed | Simple UIs |
| WebSockets | Real-time, bidirectional | Complex, stateful | Chat apps |
| Server-Sent Events | Real-time, simple | One-way only | Progress updates |
| Webhooks | Decoupled | Requires client endpoint | B2B integrations |
Recommendation: Server-Sent Events (SSE) for progress updates - perfect fit for long-running LLM requests.
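A minimal SSE sketch using FastAPI's StreamingResponse; the endpoint path and stage names are illustrative:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def progress_events(job_id: str):
    # Stand-in for real pipeline progress: each stage would be emitted as it completes.
    for stage in ("research", "analysis", "critic", "done"):
        await asyncio.sleep(1)  # simulate work
        yield f"data: {json.dumps({'job_id': job_id, 'stage': stage})}\n\n"

@app.get("/api/v1/analyze/{job_id}/events")
async def stream_progress(job_id: str):
    return StreamingResponse(progress_events(job_id), media_type="text/event-stream")
```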
Q7: What's your testing strategy for LLM-based features?
Testing Pyramid for LLM Apps:
1. Unit tests - Mock LLM responses, test business logic
2. Integration tests - Test agent orchestration with fixtures
3. Contract tests - Verify LLM output schema compliance
4. Evaluation tests - Test accuracy on labeled datasets
5. Load tests - Verify performance under stress
Key Insight: LLM outputs are non-deterministic. Solutions:
- Use `temperature=0.1` for more consistent outputs
- Test for schema compliance, not exact text matching (see the test sketch below)
- Build evaluation datasets with expected categories
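A sketch of a contract-style test that checks schema compliance instead of exact wording; the model and parser are simplified stand-ins for the project's Pydantic schemas:

```python
import json

from pydantic import BaseModel

class IngredientReport(BaseModel):
    ingredient: str
    risk_level: str
    summary: str

def parse_report(raw: str) -> IngredientReport:
    return IngredientReport(**json.loads(raw))

def test_report_schema_compliance():
    # The "LLM output" is a fixture here; in real tests it would come from a mock.
    fake_llm_output = json.dumps(
        {"ingredient": "sodium benzoate", "risk_level": "low", "summary": "Generally safe."}
    )
    report = parse_report(fake_llm_output)
    assert report.risk_level in {"low", "moderate", "high"}
```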
Q8: How do you handle failures gracefully?
| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |
Design Principle: Every component needs a fallback strategy.
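A minimal sketch combining exponential backoff with key switching; call_llm and RateLimitError are placeholders for the real Gemini client call and its rate-limit exception:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the SDK's 429 error."""

def call_llm(prompt: str, api_key: str) -> str:
    return "stub response"  # placeholder for the real API call

def call_with_fallback(prompt: str, api_keys: list[str], max_attempts: int = 4) -> str:
    delay = 1.0
    key_index = 0
    for _ in range(max_attempts):
        try:
            return call_llm(prompt, api_keys[key_index])
        except RateLimitError:
            key_index = (key_index + 1) % len(api_keys)  # 429: switch to a backup key
        except TimeoutError:
            pass                                         # timeout: retry the same key
        time.sleep(delay + random.uniform(0, 0.5))       # exponential backoff with jitter
        delay *= 2
    raise RuntimeError("LLM call failed after all retries")
```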
Q9: What would you do differently if starting over?
1. Start with async from day one - Easier to add concurrency later
2. Implement caching earlier - Would have saved development API costs
3. Use structured outputs - Gemini's JSON mode for reliable parsing (see the sketch below)
4. Add observability first - LangSmith integration should be from start
5. Design for horizontal scaling - Stateless API from the beginning
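For the structured-outputs point, a sketch assuming the google-generativeai SDK's JSON response mode; the model name and prompt are illustrative:

```python
import json

import google.generativeai as genai  # assumes the google-generativeai SDK

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

response = model.generate_content(
    "Summarize safety concerns for sodium benzoate as JSON with keys "
    "'ingredient' and 'concerns'.",
    generation_config={"response_mime_type": "application/json"},  # JSON mode
)
data = json.loads(response.text)  # parses directly, no regex cleanup of markdown fences
```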
Q10: How do you balance cost vs performance?
Cost Breakdown Per Request:
- Gemini API: ~$0.01-0.05 (depending on tokens)
- Qdrant Cloud: Included in free tier
- Railway hosting: ~$5/month
- Google Search: Included in Gemini grounding
Optimization Strategies:
1. Cache common ingredients - 80% of requests hit the top 100 ingredients
2. Use smaller models for validation - Critic doesn't need full model
3. Batch embeddings - Reduce API calls for multiple ingredients
4. Set appropriate TTLs - Balance freshness vs cost (see the TTL sketch below)
Trade-off: Aggressive caching reduces costs but may serve stale safety data. Mitigation: 24-hour TTL with manual invalidation for critical updates.
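A sketch of tiering TTLs by data volatility with a manual invalidation hook for safety-critical updates; the categories and the redis-py usage are illustrative:

```python
import redis  # assumes redis-py

TTL_SECONDS = {
    "regulatory_alert": 1 * 3600,       # fast-moving safety data
    "analysis_report": 24 * 3600,       # default report cache
    "ingredient_research": 72 * 3600,   # slow-moving background research
}

def ttl_for(data_type: str) -> int:
    return TTL_SECONDS.get(data_type, 24 * 3600)

def invalidate_ingredient(cache: redis.Redis, ingredient: str) -> None:
    # Manual invalidation when a safety update lands, regardless of remaining TTL.
    for key in cache.scan_iter(match=f"*{ingredient}*"):
        cache.delete(key)
```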
Principal TPM Perspective
Scaling LLM applications requires balancing competing concerns:
| Trade-off | Option A | Option B | Our Decision |
|---|---|---|---|
| Latency vs Accuracy | Fast responses | Thorough validation | Accuracy (safety-critical) |
| Cost vs Freshness | Aggressive caching | Real-time data | Balanced TTLs |
| Simplicity vs Resilience | Simple architecture | Multiple fallbacks | Resilience |
| Speed vs Safety | Ship fast | Validate thoroughly | Safety |
The key is making intentional trade-offs based on specific requirements, then documenting the reasoning for future reference.
*This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.*