A demonstration of fraud detection model based on analyzing user's spending patterns 🕵️‍♀️


Priya4607/fraud_detection_at_scale


FRAUD DETECTION AT SCALE


Anomaly detection models sift through massive amounts of historical data to identify patterns in how fraudsters typically behave. Here, we focus on users' spending patterns within a specific time window: behavior is flagged as potentially fraudulent when the rate at which transactions occur deviates from the usual pattern, for example an unusually high number of back-to-back transactions made with the same card within a short window. We do some feature engineering to select a set of features capturing these spending patterns, and use them to train an XGBoost model to pick up unusual behavior in financial transactions.
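As a minimal sketch of the kind of velocity feature described above (not the repository's actual feature-engineering code), the trailing 10-minute transaction count per card can be computed with pandas; column names follow the dataset schema below:

```python
import pandas as pd

def add_velocity_feature(df: pd.DataFrame, window: str = "10min") -> pd.DataFrame:
    """Append a count of each card's transactions in a trailing time window."""
    df = df.sort_values("ts").reset_index(drop=True)
    rolled = (
        df.set_index("ts")                 # time-based rolling needs a datetime index
          .groupby("cc_num")["amt"]
          .transform(lambda s: s.rolling(window).count())  # count includes the current txn
    )
    df["txn_count_10min"] = rolled.to_numpy()
    return df

# Tiny illustrative sample: two back-to-back card-1111 transactions.
demo = pd.DataFrame({
    "cc_num": [1111, 1111, 2222, 1111],
    "ts": pd.to_datetime([
        "2024-01-01 10:00:00",
        "2024-01-01 10:03:00",
        "2024-01-01 10:04:00",
        "2024-01-01 10:20:00",
    ]),
    "amt": [25.0, 30.0, 99.0, 12.5],
})
out = add_velocity_feature(demo)
# the two back-to-back card-1111 transactions push the count to 2
```

A high count relative to a card's usual rate is the kind of signal the model is trained on.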


DATA GENERATION

Since we only analyze the spending patterns of users' card transactions, we generate synthetic data based on this notebook. However, the simulated data differ in the number of records (over 6M historical transactions in this case), spread over a period of six months. The script simulates fake credit card transactions; the Dropbox link is included in case you choose to use the dataset directly.

To run the script:

python generate_transactions.py

A sample dataset consists of the following headers; some additional features such as shipping/billing address and ZIP code are omitted for simplicity.

| Feature | Description                            |
| ------- | -------------------------------------- |
| txn_id  | Transaction ID                         |
| cc_num  | Credit card number (10k unique numbers) |
| ts      | Transaction timestamp                  |
| amt     | Transaction amount                     |
| label   | 0 - genuine, 1 - fraud                 |
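The repository's `generate_transactions.py` is not reproduced here; the following is only a minimal stdlib sketch of simulating records with the schema above (card pool size, amount range, and fraud rate are illustrative assumptions):

```python
import random
from datetime import datetime, timedelta

def simulate_transactions(n_txns, n_cards=10_000, fraud_rate=0.005, seed=42):
    """Simulate fake card transactions with columns txn_id, cc_num, ts, amt, label."""
    rng = random.Random(seed)
    # Fixed pool of unique card numbers (16-digit range is illustrative).
    cards = [rng.randrange(4_000_000_000_000_000, 5_000_000_000_000_000)
             for _ in range(n_cards)]
    start = datetime(2024, 1, 1)
    rows = []
    for txn_id in range(n_txns):
        rows.append({
            "txn_id": txn_id,
            "cc_num": rng.choice(cards),
            # spread timestamps over roughly six months
            "ts": start + timedelta(seconds=rng.randrange(182 * 24 * 3600)),
            "amt": round(rng.uniform(1.0, 500.0), 2),
            "label": 1 if rng.random() < fraud_rate else 0,  # 0 genuine, 1 fraud
        })
    return rows

sample = simulate_transactions(1_000, n_cards=50)
```

A realistic generator (like the referenced notebook) would additionally make fraudulent transactions cluster in time on a card, so the velocity features carry signal.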

MODEL PERFORMANCE

This pipeline uses Apache Spark MLlib and an XGBoost (Extreme Gradient Boosting) model. Additionally, MLflow eases tracking of parameters and metrics, which helps in refining model performance. The pipeline also performs k-fold cross-validation to ensure the best fit, and uses the area under the ROC curve (AUC) to evaluate model performance.
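The evaluation loop can be sketched on a single machine as follows; this is not the repository's Spark notebook, and scikit-learn's `GradientBoostingClassifier` stands in for Spark + XGBoost, with synthetic imbalanced data standing in for the engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the engineered spending-pattern features
# (~3% positive class, mirroring the imbalanced fraud labels).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.97, 0.03], random_state=0
)

model = GradientBoostingClassifier(random_state=0)

# k-fold cross-validation scored by area under the ROC curve.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC over 5 folds: {scores.mean():.3f}")
```

AUC is a sensible choice here because it is insensitive to the class imbalance that plain accuracy would mask; in the actual pipeline, parameters and the per-fold metrics would also be logged to MLflow.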

To run the notebook, an analytics platform such as Databricks is suggested (the community edition was used here).

Databricks cluster specification:

  • Databricks runtime - 10.4 LTS ML
  • Apache Spark version - 3.2.1

Note: The model has been trained on an imbalanced dataset.

Results are tabulated below:

| Dataset | AUC   |
| ------- | ----- |
| Train   | 0.786 |
| Test    | 0.787 |

EXPERIMENTAL FEATURES

An attempt to store ML features in a centralized repository using Feast, an open-source feature store. This folder contains the feature store configuration files and a script to create online feature stores. cc_weekly and cc_latest are the two feature tables, holding respectively a weekly and a recent (the past 10 minutes, say) aggregation of the raw dataset generated in the previous step, grouped by credit card number.
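The two aggregations themselves can be sketched with pandas (the exact feature names are illustrative assumptions; in the repo these tables would be materialized into the Feast online stores):

```python
import pandas as pd

def build_feature_tables(df: pd.DataFrame, now: pd.Timestamp):
    """Aggregate raw transactions into cc_weekly and cc_latest, keyed by cc_num."""
    # cc_weekly: per-card stats over the trailing 7 days.
    weekly = df[df["ts"] >= now - pd.Timedelta(days=7)]
    cc_weekly = (
        weekly.groupby("cc_num")["amt"]
              .agg(weekly_txn_count="count", weekly_avg_amt="mean")
              .reset_index()
    )
    # cc_latest: per-card transaction count over the trailing 10 minutes.
    latest = df[df["ts"] >= now - pd.Timedelta(minutes=10)]
    cc_latest = (
        latest.groupby("cc_num")["amt"]
              .agg(recent_txn_count="count")
              .reset_index()
    )
    return cc_weekly, cc_latest

txns = pd.DataFrame({
    "cc_num": [1, 1, 1, 2],
    "ts": pd.to_datetime([
        "2024-01-08 11:55",  # within 10 minutes and the week
        "2024-01-05 09:00",  # within the week only
        "2023-12-01 09:00",  # older than a week, dropped from both tables
        "2024-01-08 11:59",  # card 2, within 10 minutes
    ]),
    "amt": [10.0, 20.0, 30.0, 5.0],
})
cc_weekly, cc_latest = build_feature_tables(txns, pd.Timestamp("2024-01-08 12:00"))
```

At serving time the online store would return the latest row for a given cc_num, so the model sees fresh velocity features without recomputing the aggregation per request.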

Fig: Intended architecture

TODO:

  1. Integrate Kafka to ingest streaming data

ATTRIBUTION

List of references used:

  • AWS samples demonstrating the implementation of streaming feature aggregation using Amazon SageMaker Feature Store.
  • MLflow documentation.
