Fraud Detection & Investigation

Toggle signals on a synthetic transaction. The risk score updates, the bars show what drove the decision, and a plain-English summary explains why.

The Premise

About 590,000 transactions. Somewhere in there, roughly 20,000 are fraudulent. The job is to find them.

A 3.5% fraud rate means a model that flags nothing would still be "right" 96.5% of the time. Accuracy, in this domain, is a liar.
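
To make that arithmetic concrete, here is a minimal sketch of the "flag nothing" baseline. The counts are derived from the figures quoted on this page (590,540 transactions at a 3.5% rate), so the fraud count is an approximation, and the function name is purely illustrative.

```python
# Counts derived from figures quoted on this page; n_fraud is approximate.
N_TOTAL = 590_540
FRAUD_RATE = 0.035

n_fraud = round(N_TOTAL * FRAUD_RATE)
n_legit = N_TOTAL - n_fraud


def evaluate_flag_nothing(n_fraud, n_legit):
    """Accuracy and recall for a baseline that labels everything legitimate."""
    true_positives = 0            # nothing is ever flagged
    true_negatives = n_legit      # every legitimate transaction is passed
    false_negatives = n_fraud     # every fraudulent transaction is missed
    accuracy = (true_positives + true_negatives) / (n_fraud + n_legit)
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, recall


accuracy, recall = evaluate_flag_nothing(n_fraud, n_legit)
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
# prints: accuracy = 96.5%, recall = 0.0%
```

High accuracy, zero fraud caught: recall (or a ranking metric such as AUC) is the number that exposes the problem.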

A transaction just came in

Tick the things that are true about it. Each thing you tick is a clue, and the risk score shows what the model would do with those clues.

Risk Score

58.0%

Verdict

REVIEW

What drove the score

Each bar is one signal's contribution. Longer bar = bigger push toward fraud.

The billing address is missing. +0.31

It was paid with a credit card. +0.22

This card has been used 8+ times in the last 24 hours. 0.00

The email address is from a domain less than 30 days old. 0.00

The device doesn't match this account's usual one. 0.00

The buyer's country doesn't match the billing country. 0.00
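
One way to read the bars is as an additive breakdown: a baseline plus one contribution per ticked signal. The sketch below is an assumption about how such a display could be computed, not a description of this page's scoring code. The signal keys and the `BASELINE` value are hypothetical; only the per-bar numbers come from the panel above. With an assumed baseline of 0.05, the two active signals happen to sum to the 58.0% shown.

```python
# Per-signal values are the ones shown on the bars; keys are hypothetical.
CONTRIBUTIONS = {
    "billing_address_missing": 0.31,
    "paid_with_credit_card": 0.22,
    "card_used_8_plus_times_24h": 0.00,
    "email_domain_under_30_days": 0.00,
    "device_mismatch": 0.00,
    "country_mismatch": 0.00,
}

BASELINE = 0.05  # assumption: the page does not state a baseline


def summarize(contributions, baseline=BASELINE):
    """Return the clamped additive score and the bars sorted longest first."""
    raw = baseline + sum(contributions.values())
    score = min(max(raw, 0.0), 1.0)  # keep the score inside [0, 1]
    bars = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return score, bars


score, bars = summarize(CONTRIBUTIONS)
print(f"risk score: {score:.1%}")
for name, value in bars:
    print(f"{name:32s} {value:+.2f}")
```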

What the data showed

Six charts from the EDA notebook.

n = 590,540

Total transactions: 590K (train set)

Fraud rate: 3.5% (~20K fraudulent)

XGBoost AUC: 0.924 (vs 0.847 LogReg)

Features: 394 (374 have missing data)

Email domain fraud rates

mail.com: nearly 1 in 5 transactions (19%) is fraud. Easy to spin up fake accounts.

Card type fraud rates

Credit cards carry 3x more fraud. Delayed payment means more time before detection.

Address missing as a fraud signal

A missing billing address goes with a roughly 5× higher fraud rate (11.78% vs 2.46%). Fraudsters skip traceable fields.

Transaction velocity and fraud

Cards used only once and cards used 50+ times are the riskiest. Stolen details get used once, or burned fast.

Top SHAP features driving fraud predictions

V258 (behavioral fingerprint) and C14 (card velocity count) dominate. Amount matters less than behavior.

Missing fields: fraud vs legitimate

Fraudsters fill in more fields trying to look normal (161 missing on average vs 197 for legitimate transactions). That effort is itself a signal, and the p-value for the difference is effectively 0.

Case Notes

What the model actually learned.

Transactions: 590K

XGBoost AUC: 0.924

Signals engineered: 40+

Credit-card uplift: +4.36 pp

No-billing-addr risk: 5×

Causal method: DoWhy + DiD

I engineered 40+ features from raw transaction signals (velocity, address completeness, email domain age, card type), then trained an XGBoost classifier (AUC 0.924, against 0.847 for logistic regression). For drivers that needed to be trustworthy, not just predictive, I used DoWhy and difference-in-differences to estimate causal effects rather than correlations.

The interpretability layer is the part I care about most: SHAP values surface why a single transaction was flagged, and an LLM prompt with a strict no-fabrication constraint turns those values into a sentence a non-technical reviewer can act on.
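
As a rough illustration of that last step, the sketch below turns a handful of per-feature attributions into a constrained prompt. Everything here is hypothetical: the function name, the feature names, the attribution values and the prompt wording are stand-ins, and the actual prompt used in the project is not shown on this page.

```python
def build_explanation_prompt(attributions, top_k=3):
    """Build an LLM prompt from the top-k attributions by absolute value.

    `attributions` maps a feature name to its signed contribution
    (for example a SHAP value); positive values push toward fraud.
    """
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = [f"- {name}: {value:+.2f}" for name, value in ranked[:top_k]]
    return (
        "You are explaining a fraud model's decision to a non-technical reviewer.\n"
        "Use ONLY the signals listed below. Do not mention any signal, number, "
        "or fact that is not in the list.\n"
        "Signals (positive values push toward fraud):\n"
        + "\n".join(lines)
        + "\nWrite one or two plain-English sentences."
    )


# Illustrative values only, not output from the real model.
example_attributions = {
    "billing_address_missing": 0.31,
    "paid_with_credit_card": 0.22,
    "transaction_amount": -0.04,
    "device_mismatch": 0.01,
}

prompt = build_explanation_prompt(example_attributions, top_k=3)
print(prompt)
```

Passing only the top-ranked signals into the prompt, and telling the model to use nothing else, is one simple way to narrow the room for fabrication. A stricter version would also check the generated sentence against the list.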

Field Notes

Excerpts from the investigator's notebook.

week_01 / eda

p. 02

week_01.ipynb

Missing data is a feature, not a problem

"374 of 394 columns have missing data. At first glance that looks messy - but missing data might be a good space to explore for signals."

p. 06

week_01.ipynb

The mail.com pattern

"mail.com runs a 19% fraud rate - nearly 1 in 5. gmail.com looks safer at 4.4%, but with 228K transactions it's actually one of the largest absolute contributors."

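The rate-versus-volume point can be checked with quick arithmetic. This is a sketch using the rounded figures quoted in the excerpt (228K transactions, 4.4%) and the rounded "~20K" fraud total, so the results are approximate.

```python
# Rounded figures quoted in the notebook excerpt and the stat cards.
GMAIL_TRANSACTIONS = 228_000
GMAIL_FRAUD_RATE = 0.044
TOTAL_FRAUD = 20_000  # the "~20K fraudulent" figure, rounded


def expected_fraud_count(n_transactions, fraud_rate):
    """Approximate number of fraudulent transactions at a given rate."""
    return round(n_transactions * fraud_rate)


gmail_fraud = expected_fraud_count(GMAIL_TRANSACTIONS, GMAIL_FRAUD_RATE)
share = gmail_fraud / TOTAL_FRAUD

print(f"gmail.com fraud count: ~{gmail_fraud:,}")
print(f"share of all fraud:    ~{share:.0%}")
```

On these rounded inputs, a domain with a "safe-looking" 4.4% rate accounts for roughly 10,000 fraudulent transactions, about half of the total.
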
p. 09

week_01.ipynb

Fraudsters try too hard

"Fraudulent transactions have fewer missing fields on average - 161 vs 197. The fraudsters are filling everything in, trying to look real. The effort itself is a signal."

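A per-transaction missing-field count is cheap to turn into a feature. The sketch below uses plain Python on a few made-up records. The column names and values are hypothetical, not rows from the dataset.

```python
import math


def count_missing(row):
    """Number of fields in a record that are None or NaN."""
    return sum(
        1
        for value in row.values()
        if value is None or (isinstance(value, float) and math.isnan(value))
    )


# Made-up records for illustration only.
rows = [
    {"amount": 59.0, "card_type": "credit", "addr": None, "email_domain": None},
    {"amount": 120.5, "card_type": "debit", "addr": 315.0, "email_domain": "gmail.com"},
    {"amount": float("nan"), "card_type": "credit", "addr": 204.0, "email_domain": "mail.com"},
]

missing_counts = [count_missing(row) for row in rows]
print(missing_counts)  # [2, 0, 1]
```

With the data in a pandas DataFrame, `df.isna().sum(axis=1)` gives the same per-row count in one line.
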
p. 13

week_01.ipynb

On accuracy as a trap

"Without scale_pos_weight, the model just predicts 'legitimate' for everything and gets 96.5% accuracy. Technically correct. In the real world, very unhelpful."

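A common starting value for `scale_pos_weight` is the ratio of negative to positive examples. The sketch below computes it from counts derived from the figures on this page. The exact value used in the notebook is not stated, so treat this as an illustration.

```python
def scale_pos_weight(n_negative, n_positive):
    """Ratio of negative to positive examples, a common class-weight heuristic."""
    if n_positive == 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


# Counts derived from 590,540 transactions at a 3.5% fraud rate (approximate).
n_total = 590_540
n_fraud = round(n_total * 0.035)
n_legit = n_total - n_fraud

weight = scale_pos_weight(n_legit, n_fraud)
print(f"scale_pos_weight ~ {weight:.1f}")  # ~27.6
```

The result would be passed as the `scale_pos_weight` parameter of `xgboost.XGBClassifier`, so that each missed fraud case costs the model roughly as much as 27 or 28 misclassified legitimate ones.
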
p. 33

week_01.ipynb

Correlation vs. causation

"Correlation told us where fraud lives. Causation told us whether we could do anything about it. Different questions, different answers."

p. 43

week_01.ipynb

Address as a signal

"Missing billing address: 11.78% fraud rate. Address provided: 2.46%. A 5× lift - fraudsters avoid leaving traceable information."

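The 5× figure is a rounded ratio of the two rates in the excerpt. A short sketch of the calculation, with illustrative function names:

```python
# Fraud rates quoted in the notebook excerpt, in percent.
RATE_ADDRESS_MISSING = 11.78
RATE_ADDRESS_PROVIDED = 2.46


def lift(rate_exposed, rate_baseline):
    """Relative lift: how many times higher the exposed rate is."""
    return rate_exposed / rate_baseline


def uplift_pp(rate_exposed, rate_baseline):
    """Absolute difference in percentage points."""
    return rate_exposed - rate_baseline


address_lift = lift(RATE_ADDRESS_MISSING, RATE_ADDRESS_PROVIDED)
address_uplift = uplift_pp(RATE_ADDRESS_MISSING, RATE_ADDRESS_PROVIDED)

print(f"lift:   {address_lift:.2f}x")
print(f"uplift: {address_uplift:+.2f} pp")
```

The ratio comes out near 4.8, which rounds to the 5× quoted above. The percentage-point difference (about 9.3 pp) is the same kind of quantity as the "+4.36 pp" credit-card uplift in the case notes.
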