Architecture

Scaling Decisions: From Prototype to Production

A Principal TPM analysis of scaling LLM-powered applications - architecture decisions, trade-offs, and interview-ready explanations for production systems.

Tags: architecture, scaling, interview-prep, ai, production

Executive Summary

Building the AI Ingredient Safety Analyzer taught valuable lessons about scaling LLM-powered applications. This analysis covers the key architectural decisions, trade-offs, and interview-ready explanations for production scaling.

Current State: ~47-second response time, ~1 RPS throughput

Challenge: Scale to handle 10x traffic without degradation


The Scaling Challenge

Our Ingredient Analysis API processes requests requiring:

  • Multiple LLM calls (Research → Analysis → Critic validation)
  • Vector database queries (Qdrant)
  • Real-time web search (Google Search grounding)

Each request involves 3+ sequential LLM round-trips, so adding application servers alone does little; the LLM calls dominate both latency and cost, and traditional scaling approaches are insufficient on their own.


Key Scaling Questions & Answers

Q1: How would you scale this API to handle 10x more traffic?

Three-Pronged Approach:

1. Response Caching (Redis/Memcached)

  • Cache ingredient research data (24-72 hour TTL)
  • Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
  • Expected improvement: 5x throughput for cached requests (see the caching sketch at the end of this answer)

2. API Key Load Balancing

  • Pool multiple Gemini API keys
  • Implement rate-aware key selection
  • N keys = N× capacity (linear scaling)

3. Async Processing with Queue

  • Move to job queue (Celery/Redis Queue)
  • Return job ID immediately, poll for results
  • Prevents timeout issues on slow requests

Trade-off: Caching introduces stale data risk. Mitigation: Implement cache invalidation on safety data updates, use appropriate TTLs based on data volatility.
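
A minimal sketch of the caching layer from point 1, assuming a Redis instance; `run_full_analysis` and the profile structure are hypothetical stand-ins for the real pipeline, not the project's actual code:

```python
import hashlib
import json

import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Full analysis reports: 1-6 hour TTL per the strategy above
ANALYSIS_TTL_SECONDS = 6 * 3600


def cache_key(ingredients: list[str], profile: dict) -> str:
    """Deterministic key built from the sorted ingredient list plus the user profile."""
    payload = json.dumps(
        {"ingredients": sorted(ingredients), "profile": profile}, sort_keys=True
    )
    return "analysis:" + hashlib.sha256(payload.encode()).hexdigest()


def get_or_run_analysis(ingredients: list[str], profile: dict) -> dict:
    key = cache_key(ingredients, profile)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: zero LLM calls
    report = run_full_analysis(ingredients, profile)  # hypothetical pipeline entry point
    cache.setex(key, ANALYSIS_TTL_SECONDS, json.dumps(report))
    return report
```

On a cache hit the request skips every LLM round-trip, which is where nearly all of the ~47 seconds goes.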


Q2: Why did you choose Qdrant over other vector databases?

| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |

Decision Rationale:

  • Qdrant Cloud offers generous free tier (1GB)
  • Excellent hybrid search (vector + payload filtering)
  • Can self-host later for cost optimization
  • Simple REST API for debugging

Trade-off: Qdrant is less mature than Pinecone. Mitigation: Active development and good documentation offset this risk.
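
For context, a rough sketch of the hybrid search pattern with the `qdrant-client` SDK - vector similarity plus payload filtering in a single query. The collection name and payload field below are illustrative, not the project's actual schema:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")


def search_ingredient_docs(query_vector: list[float], category: str, top_k: int = 5):
    """Vector similarity search restricted to documents whose payload matches a category."""
    return client.search(
        collection_name="ingredient_safety_docs",  # illustrative collection name
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="category", match=MatchValue(value=category))]
        ),
        limit=top_k,
    )
```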


Q3: How do you handle API rate limits from Gemini?

Current Approach: Single key - limited capacity

Scaled Approach: Rate-limited key pool

```python
import time


class RateLimitedKeyPool:
    """Rotate across multiple API keys, respecting a per-key requests-per-minute limit."""

    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Drop timestamps older than one minute from the sliding window
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted
```
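
Hypothetical usage, pairing the pool with a short wait-and-retry loop when every key is at its limit; `call_gemini` stands in for the real SDK wrapper:

```python
import time

pool = RateLimitedKeyPool(api_keys=["key-1", "key-2", "key-3"], rpm_limit=15)


def call_gemini_with_pool(prompt: str, max_wait_seconds: int = 30) -> str:
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        key = pool.get_available_key()
        if key is not None:
            return call_gemini(prompt, api_key=key)  # hypothetical wrapper around the SDK call
        time.sleep(1)  # every key is at its RPM limit; back off briefly
    raise RuntimeError("All keys exhausted; enqueue the request instead of blocking")
```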

Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?


Q4: Why use a multi-agent architecture instead of a single LLM call?

| Approach | Pros | Cons |
|---|---|---|
| Single-call | Faster, simpler | Less accurate, no self-correction |
| Multi-agent | Better accuracy, separation of concerns, self-correction | 3x LLM calls, higher latency |

Decision Rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
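
A simplified sketch of how the three-agent loop fits together; `research_agent`, `analysis_agent`, and `critic_agent` are placeholders for the real prompt chains:

```python
MAX_CRITIC_RETRIES = 3


def analyze_ingredients(ingredients: list[str], profile: dict) -> dict:
    """Run the Research -> Analysis -> Critic loop, retrying drafts the Critic rejects."""
    research = research_agent(ingredients)          # LLM call 1: gather evidence
    for _ in range(MAX_CRITIC_RETRIES):
        report = analysis_agent(research, profile)  # LLM call 2: draft the report
        verdict = critic_agent(report, research)    # LLM call 3: validate the draft
        if verdict["approved"]:
            return report
    # Escalate rather than serve an unvalidated safety report
    raise RuntimeError("Critic rejected the analysis after 3 attempts")
```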


Q5: How do you ensure consistency between mobile and web clients?

Architecture Decisions:

1. Single REST API - Both clients call the same /analyze endpoint

2. Shared response schema - Pydantic models define the contract (see the sketch at the end of this answer)

3. API versioning - /api/v1/analyze allows future breaking changes

Trade-off: A single API means both clients receive the same payload, even if one needs less. We accept slight over-fetching for consistency.
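
A minimal sketch of the shared contract from decision 2; the field names are illustrative, not the project's actual schema:

```python
from pydantic import BaseModel


class IngredientFinding(BaseModel):
    name: str
    risk_level: str  # e.g. "low" | "moderate" | "high"
    summary: str
    sources: list[str]


class AnalysisResponse(BaseModel):
    """Response body for POST /api/v1/analyze, deserialized identically by web and mobile."""
    request_id: str
    findings: list[IngredientFinding]
    overall_assessment: str
```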


Q6: How would you add real-time updates for long-running requests?

| Approach | Pros | Cons | Use Case |
|---|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed | Simple UIs |
| WebSockets | Real-time, bidirectional | Complex, stateful | Chat apps |
| Server-Sent Events | Real-time, simple | One-way only | Progress updates |
| Webhooks | Decoupled | Requires client endpoint | B2B integrations |

Recommendation: Server-Sent Events (SSE) for progress updates - perfect fit for long-running LLM requests.
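
A hedged sketch of what that could look like, assuming a FastAPI backend and an illustrative `/events` route:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def analysis_progress(job_id: str):
    """Yield SSE-formatted events as each pipeline stage completes."""
    for stage in ["research", "analysis", "critic", "done"]:  # illustrative stage names
        await asyncio.sleep(0)  # in reality, await the actual pipeline stage here
        yield f"data: {json.dumps({'job_id': job_id, 'stage': stage})}\n\n"


@app.get("/api/v1/analyze/{job_id}/events")
async def analyze_events(job_id: str):
    return StreamingResponse(analysis_progress(job_id), media_type="text/event-stream")
```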


Q7: What's your testing strategy for LLM-based features?

Testing Pyramid for LLM Apps:

1. Unit tests - Mock LLM responses, test business logic

2. Integration tests - Test agent orchestration with fixtures

3. Contract tests - Verify LLM output schema compliance

4. Evaluation tests - Test accuracy on labeled datasets

5. Load tests - Verify performance under stress

Key Insight: LLM outputs are non-deterministic. Solutions:

  • Use `temperature=0.1` for more consistent outputs
  • Test for schema compliance, not exact text matching
  • Build evaluation datasets with expected categories
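
As an example of level 3 (contract tests), the sketch below asserts schema compliance rather than exact text, assuming Pydantic v2 (`model_validate`); `CriticVerdict` is an illustrative output model, not the project's actual schema:

```python
import json

import pytest
from pydantic import BaseModel, ValidationError


class CriticVerdict(BaseModel):  # illustrative output schema for the Critic agent
    approved: bool
    issues: list[str]


@pytest.mark.parametrize("raw_output", [
    '{"approved": true, "issues": []}',
    '{"approved": false, "issues": ["missing citation for allergen claim"]}',
])
def test_critic_output_matches_schema(raw_output: str):
    """Assert schema compliance, not exact text: any output with the right shape passes."""
    try:
        CriticVerdict.model_validate(json.loads(raw_output))
    except (ValidationError, json.JSONDecodeError):
        pytest.fail("Critic output did not match the expected schema")
```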

Q8: How do you handle failures gracefully?

| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |

Design Principle: Every component needs a fallback strategy.
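
A small sketch of the backoff-plus-fallback pattern behind the first and third rows of the table; `fetch_from_qdrant` and `fetch_via_google_search` are placeholder names:

```python
import random
import time


def with_backoff(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())


def get_ingredient_context(ingredient: str) -> dict:
    try:
        return with_backoff(lambda: fetch_from_qdrant(ingredient))  # primary: vector DB
    except ConnectionError:
        return fetch_via_google_search(ingredient)                  # degraded fallback
```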


Q9: What would you do differently if starting over?

1. Start with async from day one - Easier to add concurrency later

2. Implement caching earlier - Would have saved development API costs

3. Use structured outputs - Gemini's JSON mode for reliable parsing

4. Add observability first - LangSmith integration should be from start

5. Design for horizontal scaling - Stateless API from the beginning


Q10: How do you balance cost vs performance?

Cost Breakdown Per Request:

  • Gemini API: ~$0.01-0.05 (depending on tokens)
  • Qdrant Cloud: Included in free tier
  • Railway hosting: ~$5/month
  • Google Search: Included in Gemini grounding

Optimization Strategies:

1. Cache common ingredients - ~80% of requests hit the top 100 ingredients

2. Use smaller models for validation - the Critic pass doesn't need the full model

3. Batch embeddings - Reduce API calls for multiple ingredients

4. Set appropriate TTLs - Balance freshness vs cost (see the sketch at the end of this answer)

Trade-off: Aggressive caching reduces costs but may serve stale safety data. Mitigation: 24-hour TTL with manual invalidation for critical updates.
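
One way to encode point 4, mapping data volatility to TTLs; the specific values and data-type names here are illustrative:

```python
# Illustrative TTLs (seconds), keyed by how volatile each data type is
CACHE_TTLS = {
    "ingredient_research": 72 * 3600,  # slow-moving reference data: 24-72 h
    "analysis_report": 6 * 3600,       # user-facing reports: 1-6 h
    "regulatory_alert": 15 * 60,       # safety-critical updates: keep short
}


def ttl_for(data_type: str) -> int:
    """Unknown data types fall back to a conservative, short TTL."""
    return CACHE_TTLS.get(data_type, 15 * 60)
```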


Principal TPM Perspective

Scaling LLM applications requires balancing competing concerns:

| Trade-off | One Extreme | Other Extreme | Our Decision |
|---|---|---|---|
| Latency vs Accuracy | Fast responses | Thorough validation | Accuracy (safety-critical) |
| Cost vs Freshness | Aggressive caching | Real-time data | Balanced TTLs |
| Simplicity vs Resilience | Simple architecture | Multiple fallbacks | Resilience |
| Speed vs Safety | Ship fast | Validate thoroughly | Safety |

The key is making intentional trade-offs based on specific requirements, then documenting the reasoning for future reference.


*This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.*