Fraud Detection: From Requirements to Architecture
A Principal TPM's deep dive into the thinking process behind designing a real-time fraud detection platform - from business constraints to an architecture that meets sub-10ms latency requirements.
Executive Summary
Building a fraud detection system requires methodical thinking through constraints, scope, data models, and failure modes. This post documents the derivation process - not what we built, but *why* we built it that way.
Key Principle: Start with constraints, not features. Latency budgets and business context shape every decision.
The Business Context
Problem Statement: An e-commerce platform is losing $2.4M annually to fraud, with an 18% false positive rate and 2-3 second decision latency.
Success Metrics:
| Metric | Current | Target | Business Impact |
|---|---|---|---|
| False Positive Rate | 18% | <5% | Customer friction reduction |
| Decision Latency | 2-3s | <10ms | Checkout abandonment reduction |
| Fraud Loss | $2.4M/yr | <$1M/yr | Direct P&L impact |
1. Start with Constraints
Before any architecture, understand the hard boundaries:
| Constraint | Value | Implication |
|---|---|---|
| Latency | Sub-10ms at P99 | In-memory lookups only, no synchronous DB queries |
| Throughput | 150M auth/year (~5 RPS avg, 50+ RPS peak) | Horizontal scaling required |
| Accuracy | Cannot drop below 90% approval | Safe mode must default to ALLOW |
| Compliance | Full audit trail | Evidence capture for disputes |
Component-Level Latency Budget:
Total Budget: 10ms
├── Request parsing: 0.5ms
├── Feature extraction: 1ms
├── Redis velocity lookup: 2ms
├── Scoring (rules + ML): 3ms
├── Policy decision: 1ms
├── Evidence capture (async): 0ms (non-blocking)
├── Response: 0.5ms
└── Buffer: 2ms
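To make the budget concrete, here is a minimal sketch of the hot path, assuming an async Python service. Every stage function is a stub standing in for real work, and the stage registry is illustrative, not production code. The key mechanics: each stage runs under a hard timeout equal to its budget slice, and evidence capture is scheduled but never awaited.

```python
import asyncio

# Stubs standing in for real work (parsing, Redis lookups, model inference).
async def parse(ctx):    ctx["parsed"] = True;  return ctx
async def features(ctx): ctx["features"] = {};  return ctx
async def velocity(ctx): ctx["velocity"] = {};  return ctx
async def score(ctx):    ctx["score"] = 0.12;   return ctx
async def policy(ctx):   ctx["decision"] = "ALLOW" if ctx["score"] < 0.9 else "BLOCK"; return ctx

async def capture_evidence(ctx):
    pass  # durable write happens off the hot path

# Per-stage budgets in ms, mirroring the tree above; the 2ms buffer absorbs jitter.
STAGES = [(parse, 0.5), (features, 1.0), (velocity, 2.0), (score, 3.0), (policy, 1.0)]

async def decide(request: dict) -> str:
    ctx = dict(request)
    for stage, budget_ms in STAGES:
        # A stage that blows its slice raises TimeoutError instead of eating the budget.
        ctx = await asyncio.wait_for(stage(ctx), timeout=budget_ms / 1000)
    asyncio.create_task(capture_evidence(ctx))  # fire-and-forget: 0ms on the request path
    return ctx["decision"]

async def main():
    print(await decide({"card_id": "card_123"}))  # -> ALLOW
    await asyncio.sleep(0)  # demo only: let the evidence task run before the loop closes

asyncio.run(main())
```

2. Derive the Data Model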
The data model emerges from following the money:
Step 1: Trace the Nouns (Entities)
What can be fraudulent?
- **Card**: The payment instrument itself
- **Device**: The machine making the request
- **IP Address**: Network origin
- **User Account**: Customer identity
- **Merchant**: Where money flows
Step 2: Trace the Arrows (Events)
What happens to money?
- Authorization → Capture → Settlement
- Refund → Chargeback → Dispute
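These nouns and arrows translate almost directly into types. A minimal sketch, with the caveat that every field name here is an illustrative assumption, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AuthEvent:
    """One arrow in the money flow: a single authorization attempt."""
    event_id: str
    timestamp: datetime
    amount_cents: int
    card_id: str        # the nouns: every entity that can carry fraud risk
    device_id: str
    ip_address: str
    user_id: str
    merchant_id: str

@dataclass(frozen=True)
class ChargebackEvent:
    """Downstream arrows (capture, refund, chargeback) reference the auth they modify."""
    event_id: str
    auth_event_id: str  # ties the dispute back to the original authorization
    timestamp: datetime
    reason_code: str
```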
Step 3: Entity-Level Risk Signals
| Entity | Velocity Signals | Static Signals |
|---|---|---|
| Card | Auth count (1h, 24h), decline rate | BIN risk, card age |
| Device | Auth count, unique cards seen | Emulator, rooted, VPN |
| IP | Auth count, geographic spread | Datacenter, proxy, TOR |
| User | Account age, recent changes | Verified email, 2FA enabled |
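The velocity signals above are what must fit inside the 2ms Redis slice of the latency budget. One common shape, sketched here with redis-py and an assumed key layout, is fixed-window counters with a TTL:

```python
import time

import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis()

def bump_velocity(entity: str, entity_id: str, window_s: int = 3600) -> int:
    """Increment and read a fixed-window counter, e.g. card auth count in the last hour.

    Key layout is an assumption, not a standard: vel:{entity}:{id}:{bucket}.
    """
    bucket = int(time.time()) // window_s
    key = f"vel:{entity}:{entity_id}:{bucket}"
    pipe = r.pipeline()
    pipe.incr(key)                  # atomic bump
    pipe.expire(key, window_s * 2)  # keep the previous bucket around, then age out
    count, _ = pipe.execute()
    return count

# e.g. a card that has seen too many auths this hour trips the velocity circuit breaker:
# if bump_velocity("card", card_id) > 20: ...
```

A single pipelined round trip keeps the read-and-bump inside the 2ms slice; the same pattern works per card, device, IP, or user.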
3. Design Detection Logic
Key Insight: Separate scoring (ML/rules producing 0-1 scores) from deciding (policy converting scores to actions).
Why This Separation Matters
| Concern | Without Separation | With Separation |
|---|---|---|
| Model iteration | Requires policy review | Independent deployment |
| Threshold tuning | Code change + deploy | Config change in minutes |
| A/B testing | Complex branching | Route by policy version |
| Accountability | Unclear ownership | Scoring = DS, Deciding = Business |
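In code, the separation is small but consequential. A minimal sketch with illustrative thresholds and a stubbed scorer: scoring is a function owned by data science, deciding is a pure mapping over configuration owned by the business.

```python
POLICY_V2 = {            # business-owned config: change thresholds without a deploy
    "block_at": 0.90,
    "review_at": 0.70,
    "friction_at": 0.50,
}

def score(features: dict) -> float:
    """Data-science-owned: rules/ML producing a 0-1 risk score (stubbed here)."""
    return 0.65 if features.get("new_device") else 0.10

def decide(risk: float, policy: dict = POLICY_V2) -> str:
    """Business-owned: pure threshold mapping; A/B test by routing policy versions."""
    if risk >= policy["block_at"]:
        return "BLOCK"
    if risk >= policy["review_at"]:
        return "REVIEW"
    if risk >= policy["friction_at"]:
        return "FRICTION"
    return "ALLOW"

print(decide(score({"new_device": True})))  # -> FRICTION
```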
Rule Priority Hierarchy
1. Hard Overrides (blocklists) → BLOCK
2. Velocity Circuit Breakers → BLOCK
3. ML Score Thresholds → BLOCK/REVIEW/FRICTION
4. Contextual Rules → Adjust score
5. Default → ALLOW
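The hierarchy maps naturally onto an ordered list of (predicate, action) pairs where the first match wins. This sketch uses illustrative signal names and thresholds; contextual rules (step 4) would adjust the score upstream of this loop.

```python
BLOCKLIST = {"card_evil"}  # hard-override list, editable by Fraud Ops in real time

def evaluate(signals: dict) -> str:
    rules = [
        (lambda s: s.get("card_id") in BLOCKLIST,   "BLOCK"),   # 1. hard override
        (lambda s: s.get("auth_count_1h", 0) > 20,  "BLOCK"),   # 2. velocity breaker
        (lambda s: s.get("ml_score", 0.0) >= 0.90,  "BLOCK"),   # 3. ML score thresholds
        (lambda s: s.get("ml_score", 0.0) >= 0.70,  "REVIEW"),
        (lambda s: s.get("ml_score", 0.0) >= 0.50,  "FRICTION"),
    ]
    for predicate, action in rules:   # first match wins
        if predicate(signals):
            return action
    return "ALLOW"                    # 5. default

print(evaluate({"card_id": "card_evil"}))  # -> BLOCK
print(evaluate({"ml_score": 0.75}))        # -> REVIEW
```

4. Plan for Failure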
Design Principle: Build for *when* components fail, not *if*.
Failure Mode Matrix
| Component | Failure Mode | Detection | Recovery | Impact |
|---|---|---|---|---|
| Redis | Connection timeout | Health check | In-memory fallback | Degraded velocity |
| ML Model | Inference timeout | Request timeout | Rule-based backup | Reduced accuracy |
| PostgreSQL | Connection exhaustion | Pool metrics | Circuit breaker | No evidence capture |
| External API | Rate limit / timeout | 429/timeout | Skip enrichment | Missing signals |
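As one concrete instance of the Redis row, here is a sketch of the fallback pattern; the 2ms client timeout and the process-local Counter are assumptions. Detection is the raised error, recovery is the in-memory read, and the impact is degraded velocity.

```python
from collections import Counter

import redis  # assumes redis-py

r = redis.Redis(socket_timeout=0.002)  # 2ms cap, matching the Redis slice of the budget
local_counts: Counter = Counter()      # process-local fallback: degraded, not equivalent

def get_velocity(key: str) -> int:
    """Read a velocity counter; on any Redis failure, serve a degraded local answer."""
    try:
        return int(r.get(key) or 0)
    except redis.RedisError:
        # Matches the matrix row: detection = the raised error/timeout,
        # recovery = in-memory fallback, impact = no cross-instance visibility.
        return local_counts[key]
```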
System-Wide Safe Mode
When multiple failures compound:
IF (redis_down AND model_timeout) OR (error_rate > 10%):
    ENTER safe_mode
    DEFAULT decision = ALLOW (revenue preservation)
    ALERT on-call immediately
    LOG everything for post-incident analysis
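A runnable sketch of that gate, with the alerting and logging hooks stubbed out (the hook names and health-signal wiring are assumptions):

```python
def page_on_call(msg: str):                        # stub for the paging hook
    print("PAGE:", msg)

def log_for_postmortem(req: dict, health: dict):   # stub for durable incident logging
    print("LOG:", req, health)

def full_pipeline(req: dict) -> str:               # stand-in for the normal scoring + policy path
    return "ALLOW"

def should_enter_safe_mode(health: dict) -> bool:
    compound = health["redis_down"] and health["model_timeout"]
    return compound or health["error_rate"] > 0.10

def handle(request: dict, health: dict) -> str:
    if should_enter_safe_mode(health):
        page_on_call("fraud platform in safe mode")  # ALERT on-call immediately
        log_for_postmortem(request, health)          # LOG everything
        return "ALLOW"                               # revenue-preserving default
    return full_pipeline(request)

print(handle({"card_id": "card_123"},
             {"redis_down": True, "model_timeout": True, "error_rate": 0.02}))  # -> ALLOW
```

5. Ownership Model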
Speed of change dictates ownership boundaries:
| Change Type | Speed | Owner | Approval |
|---|---|---|---|
| Blocklist entry | Immediate | Fraud Ops | None |
| Velocity threshold | Minutes | Fraud Ops | Peer review |
| Policy rules | Hours | Risk Lead | Manager |
| ML model | Days | Data Science | Governance council |
| Schema change | Weeks | Engineering | Architecture review |
Interview Application
When asked "How would you design a fraud detection system?":
1. Start with constraints - Ask about latency, throughput, accuracy requirements
2. Derive data model - Follow the money, identify entities and events
3. Separate concerns - Scoring vs deciding, ownership boundaries
4. Plan for failure - Every component needs a fallback
5. Show trade-offs - Latency vs accuracy, cost vs coverage
The goal: Demonstrate systematic thinking, not feature listing.
*This post is part of the Fraud Detection capstone project. See the [Thinking Process documentation](/nebula/fraud-detection-thinking) for the complete derivation.*