System Design

Fraud Detection: From Requirements to Architecture

A Principal TPM deep dive into the thinking process behind designing a real-time fraud detection platform - from constraints to architecture with sub-10ms latency requirements.

Tags: fraud-detection, architecture, system-design, interview-prep

Executive Summary

Building a fraud detection system requires methodical thinking through constraints, scope, data models, and failure modes. This post documents the derivation process - not what we built, but *why* we built it that way.

Key Principle: Start with constraints, not features. Latency budgets and business context shape every decision.


The Business Context

Problem Statement: E-commerce platform losing $2.4M annually to fraud with current 18% false positive rate and 2-3 second decision latency.

Success Metrics:

| Metric | Current | Target | Business Impact |
|---|---|---|---|
| False Positive Rate | 18% | <5% | Customer friction reduction |
| Decision Latency | 2-3s | <10ms | Reduced checkout abandonment |
| Fraud Loss | $2.4M/yr | <$1M/yr | Direct P&L impact |

1. Start with Constraints

Before any architecture, understand the hard boundaries:

| Constraint | Value | Implication |
|---|---|---|
| Latency | Sub-10ms at P99 | In-memory lookups only, no synchronous DB queries |
| Throughput | 150M auth/year (~5 RPS avg, 50+ peak) | Horizontal scaling required |
| Accuracy | Approval rate cannot drop below 90% | Safe mode must default to ALLOW |
| Compliance | Full audit trail | Evidence capture for disputes |
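
The throughput figures are worth a quick sanity check; the average falls out of simple division, and the 10x peak factor below is an assumption consistent with the "50+ peak" figure, not a stated requirement:

```python
# Sanity-check the throughput constraint: 150M authorizations per year.
auths_per_year = 150_000_000
seconds_per_year = 365 * 24 * 3600            # 31,536,000
avg_rps = auths_per_year / seconds_per_year   # ~4.76 RPS average
peak_rps = 10 * avg_rps                       # ~48 RPS, assuming a 10x peak factor
```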

Component-Level Latency Budget:

```
Total Budget: 10ms
├── Request parsing:           0.5ms
├── Feature extraction:        1ms
├── Redis velocity lookup:     2ms
├── Scoring (rules + ML):      3ms
├── Policy decision:           1ms
├── Evidence capture (async):  0ms (non-blocking)
├── Response:                  0.5ms
└── Buffer:                    2ms
```
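
A budget is only useful if it is enforced per stage. Here is a minimal sketch of per-stage timing; the stage names and the `timed_stage` helper are illustrative assumptions, and the print stands in for whatever metrics pipeline you actually have:

```python
import time

# Per-stage budgets in milliseconds, mirroring the breakdown above.
STAGE_BUDGET_MS = {
    "parse": 0.5,
    "features": 1.0,
    "velocity": 2.0,
    "scoring": 3.0,
    "policy": 1.0,
    "respond": 0.5,
}

def timed_stage(name: str, fn, *args, **kwargs):
    """Run one pipeline stage, flagging any breach of its budget line item."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > STAGE_BUDGET_MS[name]:
        # In production this would emit a latency metric and alert.
        print(f"budget breach: {name} took {elapsed_ms:.2f}ms")
    return result
```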

2. Derive the Data Model

The data model emerges from following the money:

Step 1: Trace the Nouns (Entities)

What can be fraudulent?

  • **Card**: The payment instrument itself
  • **Device**: The machine making the request
  • **IP Address**: Network origin
  • **User Account**: Customer identity
  • **Merchant**: Where money flows

Step 2: Trace the Arrows (Events)

What happens to money?

  • Authorization → Capture → Settlement
  • Refund → Chargeback → Dispute
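
To make the nouns and arrows concrete, here is a minimal sketch of how the entities and money-movement events might be typed; the field names are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):
    # The money lifecycle traced above.
    AUTHORIZATION = "authorization"
    CAPTURE = "capture"
    SETTLEMENT = "settlement"
    REFUND = "refund"
    CHARGEBACK = "chargeback"
    DISPUTE = "dispute"

@dataclass(frozen=True)
class PaymentEvent:
    """One money-movement event, keyed to every entity that can be fraudulent."""
    event_type: EventType
    card_id: str        # the payment instrument
    device_id: str      # the machine making the request
    ip_address: str     # network origin
    user_id: str        # customer identity
    merchant_id: str    # where money flows
    amount_cents: int
    timestamp_ms: int
```

Keying every event to all five entities is what makes the velocity signals in the next step cheap to compute.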

Step 3: Entity-Level Risk Signals

| Entity | Velocity Signals | Static Signals |
|---|---|---|
| Card | Auth count (1h, 24h), decline rate | BIN risk, card age |
| Device | Auth count, unique cards seen | Emulator, rooted, VPN |
| IP | Auth count, geographic spread | Datacenter, proxy, TOR |
| User | Account age, recent changes | Verified email, 2FA enabled |
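
To serve these velocity signals inside the 2ms Redis line item, fixed-window counters with TTLs are a common pattern. A minimal sketch, assuming the redis-py client and illustrative key names (the windows match the table above):

```python
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def bump_velocity(entity: str, entity_id: str, window_s: int) -> int:
    """Increment and return one entity's auth count for one time window."""
    key = f"vel:{entity}:{entity_id}:{window_s}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_s, nx=True)  # TTL set only on first increment (Redis 7+)
    count, _ = pipe.execute()
    return int(count)

# Example: card auth counts over the 1h and 24h windows from the table.
auth_1h = bump_velocity("card", "card_123", 3600)
auth_24h = bump_velocity("card", "card_123", 86400)
```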

3. Design Detection Logic

Key Insight: Separate scoring (ML/rules producing 0-1 scores) from deciding (policy converting scores to actions).

Why This Separation Matters

| Concern | Without Separation | With Separation |
|---|---|---|
| Model iteration | Requires policy review | Independent deployment |
| Threshold tuning | Code change + deploy | Config change in minutes |
| A/B testing | Complex branching | Route by policy version |
| Accountability | Unclear ownership | Scoring = DS, Deciding = Business |

Rule Priority Hierarchy

```
1. Hard Overrides (blocklists)   → BLOCK
2. Velocity Circuit Breakers     → BLOCK
3. ML Score Thresholds           → BLOCK / REVIEW / FRICTION
4. Contextual Rules              → Adjust score
5. Default                       → ALLOW
```
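
Putting the separation and the hierarchy together, here is a minimal sketch of the policy layer: the scorer (not shown) produces a 0-1 score with contextual rule adjustments already applied, and policy converts it to an action. The threshold values and helper signals are illustrative assumptions:

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    FRICTION = "friction"   # e.g., step-up authentication
    REVIEW = "review"
    BLOCK = "block"

# Policy lives in config, not code, so Fraud Ops can tune it in minutes.
POLICY = {"block_at": 0.90, "review_at": 0.75, "friction_at": 0.50}

def decide(score: float, blocklisted: bool, velocity_tripped: bool) -> Action:
    """Convert a 0-1 risk score plus hard signals into an action.

    `score` arrives with contextual rule adjustments (step 4) already applied
    by the scoring layer; this function owns only steps 1-3 and 5.
    """
    if blocklisted:                        # 1. Hard overrides
        return Action.BLOCK
    if velocity_tripped:                   # 2. Velocity circuit breakers
        return Action.BLOCK
    if score >= POLICY["block_at"]:        # 3. ML score thresholds
        return Action.BLOCK
    if score >= POLICY["review_at"]:
        return Action.REVIEW
    if score >= POLICY["friction_at"]:
        return Action.FRICTION
    return Action.ALLOW                    # 5. Default
```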

4. Plan for Failure

Design Principle: Design for *when* components fail, not *if*.

Failure Mode Matrix

| Component | Failure Mode | Detection | Recovery | Impact |
|---|---|---|---|---|
| Redis | Connection timeout | Health check | In-memory fallback | Degraded velocity |
| ML Model | Inference timeout | Request timeout | Rule-based backup | Reduced accuracy |
| PostgreSQL | Connection exhaustion | Pool metrics | Circuit breaker | No evidence capture |
| External API | Rate limit / timeout | 429/timeout | Skip enrichment | Missing signals |
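
As one concrete instance of the matrix, a sketch of the Redis row: client timeouts sized to the 2ms budget, plus a process-local cache as the in-memory fallback. The timeout values and cache shape are assumptions:

```python
import redis

# Socket timeouts sized so a slow Redis cannot blow the 10ms budget.
r = redis.Redis(host="localhost", port=6379, decode_responses=True,
                socket_timeout=0.002, socket_connect_timeout=0.002)

# Degraded mode: a process-local cache serving stale counts when Redis is down.
_local_counts: dict[str, int] = {}

def get_velocity(key: str) -> int:
    """Fetch a velocity counter, falling back to stale local data on failure."""
    try:
        value = int(r.get(key) or 0)
        _local_counts[key] = value         # keep the fallback warm
        return value
    except redis.RedisError:               # covers timeouts and connection loss
        return _local_counts.get(key, 0)   # degraded velocity, not an outage
```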

System-Wide Safe Mode

When multiple failures compound:

```
IF (redis_down AND model_timeout) OR (error_rate > 10%):
    ENTER safe_mode
    DEFAULT decision = ALLOW (revenue preservation)
    ALERT on-call immediately
    LOG everything for post-incident analysis
```
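
A sketch of how that trigger might be evaluated against a rolling window of recent requests; the 10% threshold comes from the rule above, while the window size is an assumption:

```python
from collections import deque

class SafeModeMonitor:
    """Tracks recent outcomes and reports when the compound-failure rule fires."""

    def __init__(self, window: int = 1000, error_threshold: float = 0.10):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True = request errored
        self.error_threshold = error_threshold

    def record(self, errored: bool) -> None:
        self.outcomes.append(errored)

    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def in_safe_mode(self, redis_down: bool, model_timeout: bool) -> bool:
        return (redis_down and model_timeout) or self.error_rate() > self.error_threshold
```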

5. Ownership Model

Speed of change dictates ownership boundaries:

| Change Type | Speed | Owner | Approval |
|---|---|---|---|
| Blocklist entry | Immediate | Fraud Ops | None |
| Velocity threshold | Minutes | Fraud Ops | Peer review |
| Policy rules | Hours | Risk Lead | Manager |
| ML model | Days | Data Science | Governance council |
| Schema change | Weeks | Engineering | Architecture review |

Interview Application

When asked "How would you design a fraud detection system?":

1. Start with constraints - Ask about latency, throughput, accuracy requirements

2. Derive data model - Follow the money, identify entities and events

3. Separate concerns - Scoring vs deciding, ownership boundaries

4. Plan for failure - Every component needs a fallback

5. Show trade-offs - Latency vs accuracy, cost vs coverage

The goal: Demonstrate systematic thinking, not feature listing.


*This post is part of the Fraud Detection capstone project. See the [Thinking Process documentation](/nebula/fraud-detection-thinking) for the complete derivation.*