Fraud Detection: From Requirements to Architecture
A Principal TPM's deep dive into the thinking process behind designing a real-time fraud detection platform - from business constraints to an architecture that meets sub-10ms latency requirements.
Executive Summary
Building a fraud detection system requires methodical thinking through constraints, scope, data models, and failure modes. This post documents the derivation process - not what we built, but *why* we built it that way.
Key Principle: Start with constraints, not features. Latency budgets and business context shape every decision.
The Business Context
Problem Statement: An e-commerce platform is losing $2.4M annually to fraud, with an 18% false positive rate and 2-3 second decision latency.
Success Metrics:
| Metric | Current | Target | Business Impact |
|---|---|---|---|
| False Positive Rate | 18% | <5% | Customer friction reduction |
| Decision Latency | 2-3s | <10ms | Checkout abandonment reduction |
| Fraud Loss | $2.4M/yr | <$1M/yr | Direct P&L impact |
1. Start with Constraints
Before any architecture, understand the hard boundaries:
| Constraint | Value | Implication |
|---|---|---|
| Latency | Sub-10ms at P99 | In-memory lookups only, no synchronous DB queries |
| Throughput | 150M auth/year (~5 RPS avg, 50+ RPS peak) | Horizontal scaling required |
| Accuracy | Cannot drop below 90% approval | Safe mode must default to ALLOW |
| Compliance | Full audit trail | Evidence capture for disputes |
Component-Level Latency Budget:
Total Budget: 10ms
├── Request parsing: 0.5ms
├── Feature extraction: 1ms
├── Redis velocity lookup: 2ms
├── Scoring (rules + ML): 3ms
├── Policy decision: 1ms
├── Evidence capture (async): 0ms (non-blocking)
├── Response: 0.5ms
└── Buffer: 2ms
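To make the budget concrete, here is a minimal sketch of the hot path, assuming an async Python service. Every stage function is a stub standing in for real work, and the stage registry is illustrative, not production code. The key mechanics: each stage runs under a hard timeout equal to its budget slice, and evidence capture is scheduled but never awaited.

```python
import asyncio

# Stubs standing in for real work (parsing, Redis lookups, model inference).
async def parse(ctx):    ctx["parsed"] = True;  return ctx
async def features(ctx): ctx["features"] = {};  return ctx
async def velocity(ctx): ctx["velocity"] = {};  return ctx
async def score(ctx):    ctx["score"] = 0.12;   return ctx
async def policy(ctx):   ctx["decision"] = "ALLOW" if ctx["score"] < 0.9 else "BLOCK"; return ctx

async def capture_evidence(ctx):
    pass  # durable write happens off the hot path

# Per-stage budgets in ms, mirroring the tree above; the 2ms buffer absorbs jitter.
STAGES = [(parse, 0.5), (features, 1.0), (velocity, 2.0), (score, 3.0), (policy, 1.0)]

async def decide(request: dict) -> str:
    ctx = dict(request)
    for stage, budget_ms in STAGES:
        # A stage that blows its slice raises TimeoutError instead of eating the budget.
        ctx = await asyncio.wait_for(stage(ctx), timeout=budget_ms / 1000)
    asyncio.create_task(capture_evidence(ctx))  # fire-and-forget: 0ms on the request path
    return ctx["decision"]

async def main():
    print(await decide({"card_id": "card_123"}))  # -> ALLOW
    await asyncio.sleep(0)  # demo only: let the evidence task run before the loop closes

asyncio.run(main())
```

2. Derive the Data Model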
The data model emerges from following the money:
Step 1: Trace the Nouns (Entities)
What can be fraudulent?
- **Card**: The payment instrument itself
- **Device**: The machine making the request
- **IP Address**: Network origin
- **User Account**: Customer identity
- **Merchant**: Where money flows
Step 2: Trace the Arrows (Events)
What happens to money?
- Authorization → Capture → Settlement
- Refund → Chargeback → Dispute
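These nouns and arrows translate almost directly into types. A minimal sketch, with the caveat that every field name here is an illustrative assumption, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AuthEvent:
    """One arrow in the money flow: a single authorization attempt."""
    event_id: str
    timestamp: datetime
    amount_cents: int
    card_id: str        # the nouns: every entity that can carry fraud risk
    device_id: str
    ip_address: str
    user_id: str
    merchant_id: str

@dataclass(frozen=True)
class ChargebackEvent:
    """Downstream arrows (capture, refund, chargeback) reference the auth they modify."""
    event_id: str
    auth_event_id: str  # ties the dispute back to the original authorization
    timestamp: datetime
    reason_code: str
```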
Step 3: Entity-Level Risk Signals
| Entity | Velocity Signals | Static Signals |
|---|---|---|
| Card | Auth count (1h, 24h), decline rate | BIN risk, card age |
| Device | Auth count, unique cards seen | Emulator, rooted, VPN |
| IP | Auth count, geographic spread | Datacenter, proxy, TOR |
| User | Account age, recent changes | Verified email, 2FA enabled |
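The velocity signals above are what must fit inside the 2ms Redis slice of the latency budget. One common shape, sketched here with redis-py and an assumed key layout, is fixed-window counters with a TTL:

```python
import time

import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis()

def bump_velocity(entity: str, entity_id: str, window_s: int = 3600) -> int:
    """Increment and read a fixed-window counter, e.g. card auth count in the last hour.

    Key layout is an assumption, not a standard: vel:{entity}:{id}:{bucket}.
    """
    bucket = int(time.time()) // window_s
    key = f"vel:{entity}:{entity_id}:{bucket}"
    pipe = r.pipeline()
    pipe.incr(key)                  # atomic bump
    pipe.expire(key, window_s * 2)  # keep the previous bucket around, then age out
    count, _ = pipe.execute()
    return count

# e.g. a card that has seen too many auths this hour trips the velocity circuit breaker:
# if bump_velocity("card", card_id) > 20: ...
```

A single pipelined round trip keeps the read-and-bump inside the 2ms slice; the same pattern works per card, device, IP, or user.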
3. Design Detection Logic
Key Insight: Separate scoring (ML/rules producing 0-1 scores) from deciding (policy converting scores to actions).
Why This Separation Matters
| Concern | Without Separation | With Separation |
|---|---|---|
| Model iteration | Requires policy review | Independent deployment |
| Threshold tuning | Code change + deploy | Config change in minutes |
| A/B testing | Complex branching | Route by policy version |
| Accountability | Unclear ownership | Scoring = DS, Deciding = Business |
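In code, the separation is small but consequential. A minimal sketch with illustrative thresholds and a stubbed scorer: scoring is a function owned by data science, deciding is a pure mapping over configuration owned by the business.

```python
POLICY_V2 = {            # business-owned config: change thresholds without a deploy
    "block_at": 0.90,
    "review_at": 0.70,
    "friction_at": 0.50,
}

def score(features: dict) -> float:
    """Data-science-owned: rules/ML producing a 0-1 risk score (stubbed here)."""
    return 0.65 if features.get("new_device") else 0.10

def decide(risk: float, policy: dict = POLICY_V2) -> str:
    """Business-owned: pure threshold mapping; A/B test by routing policy versions."""
    if risk >= policy["block_at"]:
        return "BLOCK"
    if risk >= policy["review_at"]:
        return "REVIEW"
    if risk >= policy["friction_at"]:
        return "FRICTION"
    return "ALLOW"

print(decide(score({"new_device": True})))  # -> FRICTION
```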
Rule Priority Hierarchy
1. Hard Overrides (blocklists) → BLOCK
2. Velocity Circuit Breakers → BLOCK
3. ML Score Thresholds → BLOCK/REVIEW/FRICTION
4. Contextual Rules → Adjust score
5. Default → ALLOW
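The hierarchy maps naturally onto an ordered list of (predicate, action) pairs where the first match wins. This sketch uses illustrative signal names and thresholds; contextual rules (step 4) would adjust the score upstream of this loop.

```python
BLOCKLIST = {"card_evil"}  # hard-override list, editable by Fraud Ops in real time

def evaluate(signals: dict) -> str:
    rules = [
        (lambda s: s.get("card_id") in BLOCKLIST,   "BLOCK"),   # 1. hard override
        (lambda s: s.get("auth_count_1h", 0) > 20,  "BLOCK"),   # 2. velocity breaker
        (lambda s: s.get("ml_score", 0.0) >= 0.90,  "BLOCK"),   # 3. ML score thresholds
        (lambda s: s.get("ml_score", 0.0) >= 0.70,  "REVIEW"),
        (lambda s: s.get("ml_score", 0.0) >= 0.50,  "FRICTION"),
    ]
    for predicate, action in rules:   # first match wins
        if predicate(signals):
            return action
    return "ALLOW"                    # 5. default

print(evaluate({"card_id": "card_evil"}))  # -> BLOCK
print(evaluate({"ml_score": 0.75}))        # -> REVIEW
```

4. Plan for Failure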
Design Principle: Build for *when* components fail, not *if*.
Failure Mode Matrix
| Component | Failure Mode | Detection | Recovery | Impact |
|---|---|---|---|---|
| Redis | Connection timeout | Health check | In-memory fallback | Degraded velocity |
| ML Model | Inference timeout | Request timeout | Rule-based backup | Reduced accuracy |
| PostgreSQL | Connection exhaustion | Pool metrics | Circuit breaker | No evidence capture |
| External API | Rate limit / timeout | 429/timeout | Skip enrichment | Missing signals |
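As one concrete instance of the Redis row, here is a sketch of the fallback pattern; the 2ms client timeout and the process-local Counter are assumptions. Detection is the raised error, recovery is the in-memory read, and the impact is degraded velocity.

```python
from collections import Counter

import redis  # assumes redis-py

r = redis.Redis(socket_timeout=0.002)  # 2ms cap, matching the Redis slice of the budget
local_counts: Counter = Counter()      # process-local fallback: degraded, not equivalent

def get_velocity(key: str) -> int:
    """Read a velocity counter; on any Redis failure, serve a degraded local answer."""
    try:
        return int(r.get(key) or 0)
    except redis.RedisError:
        # Matches the matrix row: detection = the raised error/timeout,
        # recovery = in-memory fallback, impact = no cross-instance visibility.
        return local_counts[key]
```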
System-Wide Safe Mode
When multiple failures compound:
IF (redis_down AND model_timeout) OR (error_rate > 10%):
    ENTER safe_mode
    DEFAULT decision = ALLOW (revenue preservation)
    ALERT on-call immediately
    LOG everything for post-incident analysis
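A runnable sketch of that gate, with the alerting and logging hooks stubbed out (the hook names and health-signal wiring are assumptions):

```python
def page_on_call(msg: str):                        # stub for the paging hook
    print("PAGE:", msg)

def log_for_postmortem(req: dict, health: dict):   # stub for durable incident logging
    print("LOG:", req, health)

def full_pipeline(req: dict) -> str:               # stand-in for the normal scoring + policy path
    return "ALLOW"

def should_enter_safe_mode(health: dict) -> bool:
    compound = health["redis_down"] and health["model_timeout"]
    return compound or health["error_rate"] > 0.10

def handle(request: dict, health: dict) -> str:
    if should_enter_safe_mode(health):
        page_on_call("fraud platform in safe mode")  # ALERT on-call immediately
        log_for_postmortem(request, health)          # LOG everything
        return "ALLOW"                               # revenue-preserving default
    return full_pipeline(request)

print(handle({"card_id": "card_123"},
             {"redis_down": True, "model_timeout": True, "error_rate": 0.02}))  # -> ALLOW
```

5. Ownership Model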
Speed of change dictates ownership boundaries:
| Change Type | Speed | Owner | Approval |
|---|---|---|---|
| Blocklist entry | Immediate | Fraud Ops | None |
| Velocity threshold | Minutes | Fraud Ops | Peer review |
| Policy rules | Hours | Risk Lead | Manager |
| ML model | Days | Data Science | Governance council |
| Schema change | Weeks | Engineering | Architecture review |
Interview Application
When asked "How would you design a fraud detection system?":
1. Start with constraints - Ask about latency, throughput, accuracy requirements
2. Derive data model - Follow the money, identify entities and events
3. Separate concerns - Scoring vs deciding, ownership boundaries
4. Plan for failure - Every component needs a fallback
5. Show trade-offs - Latency vs accuracy, cost vs coverage
The goal: Demonstrate systematic thinking, not feature listing.
*This post is part of the Fraud Detection capstone project. See the [Thinking Process documentation](/nebula/fraud-detection-thinking) for the complete derivation.*