Data Pipelines (complexity: complex)

Fraud Detection System

Architecture for real-time fraud detection with feature engineering, scoring, rules, and feedback loops that keep up with evolving attack patterns.

Fit

When this blueprint fits

And when to walk away from it

When to use this

Fraud losses are material to the business, manual review cannot keep up with volume, and you need millisecond decisions during checkout, account creation, or login. Fintech, marketplaces, and high-value e-commerce all need this layer.

When NOT to use this

If your fraud rate is low and a provider rule engine (Stripe Radar, Adyen RevenueProtect) handles it, do not build this in-house. The point of building is to capture patterns the provider does not see.

Architecture

System components

Key building blocks of this architecture, layered from infrastructure up.

Event Capture

Capture user, device, network, and transaction events with high cardinality and low latency. The richer the feature set, the better the model. Device fingerprinting, IP reputation, and behavioural signals (typing rhythm, mouse movement) all matter.

Kafka, Segment, FingerprintJS, Custom SDK

Feature Store

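Online and offline feature stores for model serving and training. Online for sub-100ms scoring, offline for batch training. Feature parity between the two is one of the hardest engineering problems in MLOps: a feature computed one way in training and another way at serving time skews scores without raising any error.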

Feast, Tecton, Redis, BigQuery
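
One way to reduce parity drift is to define each feature once as a pure function and call that same function from both the batch job and the serving path. The sketch below is a minimal illustration under stated assumptions: `Txn`, `txn_velocity`, the one-hour window, and the in-memory `store` are all hypothetical names and choices, not part of Feast or Tecton.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Txn:
    user_id: str
    amount: float
    ts: float  # event time, seconds since epoch


def txn_velocity(history: Iterable[Txn], now: float, window_s: float = 3600.0) -> dict:
    """Single feature definition shared by training and serving."""
    recent = [t for t in history if now - window_s < t.ts <= now]
    return {
        "txn_count_1h": len(recent),
        "txn_amount_1h": sum(t.amount for t in recent),
    }


def offline_rows(events: list[Txn]) -> list[dict]:
    """Batch path: point-in-time features for each historical event.

    Assumes `events` is sorted by timestamp, so events[:i] holds only
    what was known before event i (no leakage from the future).
    """
    rows = []
    for i, e in enumerate(events):
        prior = [p for p in events[:i] if p.user_id == e.user_id]
        rows.append(txn_velocity(prior, now=e.ts))
    return rows


def online_features(store: dict[str, list[Txn]], user_id: str, now: float) -> dict:
    """Serving path: same function, fed from the online store (a dict here)."""
    return txn_velocity(store.get(user_id, []), now=now)
```

A cheap parity check is to replay a sample of historical events through `online_features` and compare the result with the corresponding row from `offline_rows`; any mismatch points at a difference in the data feeding the function, not in the feature logic.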

Scoring Service

Real-time ML scoring of incoming events with low-latency inference and explainability outputs. XGBoost or LightGBM models served via ONNX cover most fraud workloads at single-digit millisecond latency.

XGBoost, LightGBM, ONNX, Triton
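
To make "explainability outputs" concrete, here is a minimal sketch of reason codes computed from per-feature contributions. It uses a linear score as a stand-in, where each contribution is the weight times the feature's distance from a baseline (for a linear model this matches SHAP values if features are treated as independent). A gradient-boosted model would need a dedicated SHAP implementation instead. The function name, weights, and baseline values are hypothetical.

```python
def reason_codes(
    weights: dict[str, float],
    baseline: dict[str, float],
    features: dict[str, float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k features ranked by absolute contribution to the score."""
    contribs = {
        name: w * (features[name] - baseline[name])
        for name, w in weights.items()
    }
    ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


# Illustrative values only; none of these come from a real model.
weights = {"txn_count_1h": 0.5, "account_age_days": -0.01, "ip_risk": 2.0}
baseline = {"txn_count_1h": 1.0, "account_age_days": 200.0, "ip_risk": 0.1}
event = {"txn_count_1h": 7.0, "account_age_days": 0.0, "ip_risk": 0.6}

top = reason_codes(weights, baseline, event, top_k=2)
```

Returning the ranked list alongside the score gives the reviewer something to act on: in this example the burst of transactions and the brand-new account outrank the IP signal.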

Rules Engine

Deterministic rules layered on top of ML scores for compliance, regulatory, and product-specific cases. Pure ML is too brittle for regulated products; a rules layer gives compliance teams direct control without code changes.

OPA, Custom DSL, Rule Sheets, Decision Tables
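
The layering can be sketched as a decision function that takes the ML score, applies every matching rule, and returns the most severe outcome along with its source. This is a minimal sketch under assumptions: the rule names, the thresholds, and the placeholder country code `"XX"` are hypothetical, and in a real deployment the rules would be loaded from OPA or a rule sheet so that compliance can change them without a code release.

```python
from dataclasses import dataclass
from typing import Callable

SEVERITY = {"allow": 0, "review": 1, "block": 2}


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[dict], bool]
    action: str  # "allow", "review" or "block"


# Hypothetical rules, hard-coded here only to keep the sketch self-contained.
RULES = [
    Rule("restricted_country", lambda e: e.get("country") == "XX", "block"),
    Rule(
        "large_first_purchase",
        lambda e: e.get("account_age_days", 0) < 1 and e.get("amount", 0) > 1000,
        "review",
    ),
]


def decide(event: dict, ml_score: float,
           review_at: float = 0.6, block_at: float = 0.9) -> tuple[str, str]:
    """Return (decision, reason). Rules can escalate the ML decision, never relax it."""
    if ml_score >= block_at:
        decision = "block"
    elif ml_score >= review_at:
        decision = "review"
    else:
        decision = "allow"
    reason = "ml_score"
    for rule in RULES:
        if rule.applies(event) and SEVERITY[rule.action] > SEVERITY[decision]:
            decision, reason = rule.action, f"rule:{rule.name}"
    return decision, reason
```

Returning the reason with the decision keeps the audit trail honest: a reviewer can see whether a block came from a hard requirement or from the model.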

Case Management

Queue suspicious cases for human review with prioritisation, evidence aggregation, and outcome tracking. Reviewers see all relevant signals in one screen and decide in under a minute. Their decisions feed back into model training.

Internal UI, Workflow Engine, Audit Trail
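
Prioritisation can be as simple as a heap keyed on expected loss. The sketch below assumes priority is score times transaction amount, which is one reasonable choice and not the only one. `ReviewQueue` and its methods are hypothetical names, and a production queue would live in a database or workflow engine, not in process memory.

```python
import heapq
import itertools
from typing import Optional


class ReviewQueue:
    """Highest expected loss first; ties go to the older case."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()  # tie-breaker: FIFO on equal priority
        self.outcomes: dict[str, bool] = {}

    def push(self, case_id: str, score: float, amount: float, evidence: dict) -> None:
        priority = score * amount  # assumption: expected loss drives urgency
        case = {"case_id": case_id, "score": score,
                "amount": amount, "evidence": evidence}
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), case))

    def pop(self) -> Optional[dict]:
        """Return the next case for review, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def record_outcome(self, case_id: str, is_fraud: bool) -> None:
        """Store the reviewer's decision so it can be used as a training label."""
        self.outcomes[case_id] = is_fraud
```

The `evidence` dict is where the aggregated signals for the single-screen view would travel with the case.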

Feedback Loop

Capture analyst decisions, chargeback outcomes, and customer appeals to retrain models. Labels are the bottleneck. Without a clean labelling pipeline, your model degrades silently as fraud patterns shift.

Label Store, Training Pipeline, Drift Detection
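
Part of a clean labelling pipeline is deciding which signal wins when sources disagree. The sketch below resolves one label per event using a source ranking and, within a source, the latest signal. The ranking itself (chargeback over appeal over analyst) is an assumption made for illustration and should be set by whoever owns fraud operations; `Signal` and `resolve_labels` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: a higher rank means a more trusted source.
SOURCE_RANK = {"analyst": 1, "appeal": 2, "chargeback": 3}


@dataclass(frozen=True)
class Signal:
    event_id: str
    source: str     # one of the SOURCE_RANK keys
    is_fraud: bool
    ts: float       # when the signal arrived


def resolve_labels(signals: list[Signal]) -> dict[str, bool]:
    """Pick one label per event: highest-ranked source, then most recent."""
    best: dict[str, Signal] = {}
    for s in signals:
        current = best.get(s.event_id)
        key = (SOURCE_RANK[s.source], s.ts)
        if current is None or key > (SOURCE_RANK[current.source], current.ts):
            best[s.event_id] = s
    return {event_id: s.is_fraud for event_id, s in best.items()}
```

Because chargebacks and appeals can arrive weeks after the transaction, a training job built on this would typically also wait for labels to mature before treating an event as confirmed legitimate.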

Adversarial Monitoring

Detect model drift, attack pattern shifts, and false-positive spikes in real time. Fraud is adversarial: the moment your model works, attackers adapt. Continuous monitoring is non-negotiable.

Drift Detection, Anomaly Detection, Dashboards
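
One common drift signal is the Population Stability Index (PSI), which compares the distribution of a feature or score in a recent window against a baseline. The sketch below is a minimal stdlib implementation with equal-width bins taken from the baseline; the bin count and the `eps` floor for empty bins are assumptions, and both inputs must be non-empty.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width on a constant baseline

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        n = len(values)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / n, eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [min(0.99, s + 0.3) for s in baseline_scores]
drift = psi(baseline_scores, shifted_scores)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the alert threshold is worth calibrating against your own traffic.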

Planning

Critical considerations

The things I have learned the hard way and would not skip on the next build.

Always combine ML and rules. Pure ML is too brittle for compliance and pure rules cannot keep up with novel attacks. The hybrid approach lets compliance teams control hard requirements while ML catches patterns rules cannot express.

Build for explainability so analysts trust the scores. Black-box models lose internal credibility fast. SHAP values or rule-equivalent explanations help reviewers act on the output.

Plan for adversarial drift. Fraud changes faster than your model. Continuous retraining, holdout monitoring, and a tight feedback loop are how you stay ahead.

Balance friction against revenue. Every false positive is a legitimate customer turned away. Tune thresholds with the business, not the data science team alone.
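
The friction-versus-revenue trade-off can be made explicit by scoring each candidate threshold on total cost: fraud that slips through plus legitimate customers declined. This is a minimal sketch; `fp_cost` (the cost of one wrongly declined customer) is an assumption that the business has to supply, and `pick_threshold` is a hypothetical name.

```python
from typing import Optional


def pick_threshold(
    scored: list[tuple[float, bool, float]],
    fp_cost: float,
    thresholds: Optional[list[float]] = None,
) -> float:
    """Pick the block threshold with the lowest total cost.

    scored: (model_score, is_fraud, amount) for labelled historical events.
    fp_cost: business-supplied cost of declining one legitimate customer.
    """
    if thresholds is None:
        thresholds = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00

    def cost(t: float) -> float:
        total = 0.0
        for score, is_fraud, amount in scored:
            blocked = score >= t
            if is_fraud and not blocked:
                total += amount      # missed fraud: lose the transaction amount
            elif not is_fraud and blocked:
                total += fp_cost     # false positive: lose the customer value
        return total

    # On ties, min() keeps the first (lowest) threshold in the list.
    return min(thresholds, key=cost)


# Illustrative data only.
history = [
    (0.95, True, 500.0),
    (0.30, True, 40.0),
    (0.90, False, 80.0),
    (0.20, False, 30.0),
]
cheap_fp = pick_threshold(history, fp_cost=10.0)
costly_fp = pick_threshold(history, fp_cost=100.0)
```

On this toy data, raising the cost of a false positive from 10 to 100 moves the chosen threshold from 0.25 to 0.95, which is the conversation to have with the business before anyone touches the model.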

Options

Alternative approaches

Where I would consider a different shape entirely, with the trade-offs spelled out.

Alternative 01

Sift or Forter for managed fraud platforms with their own consortium data. Cheaper than building, less control over the model.

Alternative 02

Stripe Radar for Stripe-native fraud screening if the only surface is checkout.

Alternative 03

Sardine for crypto and account-takeover fraud with strong device intelligence.

Alternative 04

Unit21 for case management and rules when ML is not yet justified but operations needs structure.

Implementation

Related playbooks

Step-by-step guides for the harder parts of this architecture.

Designing Event-Driven Systems

Event-driven architectures unlock real autonomy between services, and they expose a whole new category of bugs if you do not respect their constraints. This playbook is the design discipline I use: model events as facts, version schemas carefully, choose the right broker, build idempotent consumers, handle ordering and failure, and add the observability that makes async systems debuggable in production.

Production Monitoring & Observability

Observability is not three pillars on a slide; it is the difference between knowing why your system is misbehaving and guessing. This playbook is the monitoring stack I deploy on every production system: error tracking, structured logging, performance metrics, distributed tracing, and the dashboards and alerts that turn raw data into actionable signal without paging everyone at 3 AM.

In practice