FRAUD DETECTION AT SCALE

Anomaly detection models sift through large volumes of historical data to identify patterns in how fraudsters typically behave. Here, we focus on users' spending patterns within a specific window, in particular cases where the rate of transactions deviates from the usual pattern. Whether behavior is treated as fraudulent depends on the average number of back-to-back transactions made with the same card within a short window. We engineer a set of features relevant to users' spending patterns and use them to train an XGBoost model to pick out unusual behavior in financial transactions.
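As a rough illustration of the kind of spending-pattern feature described above, the sketch below counts, for each transaction, how many transactions the same card has made within a trailing window. The function name count_recent_txns and the 10-minute default are assumptions made for this example; the actual features used in the notebook may differ.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def count_recent_txns(transactions, window=timedelta(minutes=10)):
    """Return (cc_num, ts, count) per transaction, where count is the number
    of transactions on the same card within the trailing window, including
    the current one. `transactions` is an iterable of (cc_num, ts) pairs."""
    recent = defaultdict(deque)  # cc_num -> timestamps still inside the window
    counts = []
    # Process in time order so each deque only ever holds past timestamps.
    for cc_num, ts in sorted(transactions, key=lambda t: t[1]):
        q = recent[cc_num]
        # Drop timestamps that have fallen out of the trailing window.
        while q and ts - q[0] > window:
            q.popleft()
        q.append(ts)
        counts.append((cc_num, ts, len(q)))
    return counts


base = datetime(2022, 1, 1, 12, 0)
txns = [
    ("A", base),
    ("A", base + timedelta(minutes=1)),
    ("A", base + timedelta(minutes=2)),
    ("B", base + timedelta(minutes=1)),
    ("A", base + timedelta(minutes=30)),
]
for row in count_recent_txns(txns):
    print(row)
```

A card whose count climbs quickly inside the window is making back-to-back transactions, which is the signal the model is meant to pick up.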

DATA GENERATION

Since we only analyze the spending patterns of users' card transactions, we generate fake data based on this notebook. The simulated data differ from the notebook's in the number of records (over 6M historical transactions in this case), spread over a period of six months. The script simulates fake credit card transactions. A Dropbox link is included in case you prefer to use the dataset directly.

To run the script:

python generate_transactions.py

A sample dataset consists of the following columns. Some additional features, such as shipping/billing address and ZIP code, are omitted for simplicity.

Features	Description
txn_id	Transaction ID
cc_num	Credit card numbers (10k unique numbers)
ts	Transaction timestamp
amt	Transaction amount
label	0-genuine, 1-fraud
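To make the schema above concrete, here is a minimal sketch that produces records with the same five columns. It is not the logic of generate_transactions.py: the function name simulate_transactions, the amount range, the start date, and the fraud rate are all assumptions, and labels are assigned at random purely to fill the column.

```python
import random
import uuid
from datetime import datetime, timedelta


def simulate_transactions(n_txns, n_cards=100, start=datetime(2022, 1, 1),
                          days=180, fraud_rate=0.01, seed=0):
    """Generate toy rows with the columns txn_id, cc_num, ts, amt, label."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    # 16-digit strings standing in for card numbers (hypothetical format).
    cards = [str(rng.randrange(10**15, 10**16)) for _ in range(n_cards)]
    rows = []
    for _ in range(n_txns):
        rows.append({
            "txn_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "cc_num": rng.choice(cards),
            # Uniformly spread over roughly six months.
            "ts": start + timedelta(seconds=rng.randrange(days * 24 * 3600)),
            "amt": round(rng.uniform(1.0, 500.0), 2),
            # Placeholder labelling: random, not behavior-based.
            "label": 1 if rng.random() < fraud_rate else 0,
        })
    rows.sort(key=lambda r: r["ts"])
    return rows


for row in simulate_transactions(3):
    print(row)
```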

MODEL PERFORMANCE

This pipeline uses Apache Spark MLlib and an XGBoost (Extreme Gradient Boosting) model. MLflow tracks parameters and metrics, which helps in refining model performance. The pipeline also applies k-fold cross-validation to select the best fit, and uses the area under the ROC curve (AUC) to evaluate the performance of the model.
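For readers unfamiliar with k-fold cross-validation, the sketch below shows the splitting step in plain Python: the rows are shuffled once, divided into k folds, and each fold serves as the validation set exactly once. The helper k_fold_indices is hypothetical and written for illustration only; in a Spark pipeline this is normally handled by the library (for example MLlib's CrossValidator), not by hand.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, validation


for fold_no, (train, validation) in enumerate(k_fold_indices(10, k=5)):
    print(fold_no, sorted(train), sorted(validation))
```

The metric is computed on each validation fold and averaged, and the parameter setting with the best average is kept.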

To run the notebook, we suggest using an analytics platform such as Databricks (the Community Edition was used here).

Databricks cluster specification:

Databricks runtime - 10.4 LTS ML
Apache Spark version - 3.2.1

Note: The model was trained on an imbalanced dataset.

Results are tabulated below:

Dataset	Results (AUC)
Train	0.786
Test	0.787
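For reference, AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen genuine one. The sketch below computes it directly from that definition. The function roc_auc is a hypothetical helper for illustration, not the evaluator used in the notebook, and its pairwise loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive is scored higher; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.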

EXPERIMENTAL FEATURES

This is an attempt to store ML features in a centralized repository using Feast, an open-source feature store. This folder (experiments/feature_store) contains the feature store configuration files and a script to create online feature stores. cc_weekly and cc_latest are the two feature stores; they hold weekly and recent (say, the past 10 min) aggregations of the raw dataset generated in the previous step, grouped by credit card number.
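The two aggregations can be sketched in plain Python as follows. This only illustrates the grouping logic and does not use Feast; the helper aggregate_by_card and the output fields txn_count and avg_amt are assumptions, since the exact aggregates stored in cc_weekly and cc_latest are not listed here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_by_card(transactions, now, window):
    """Per-card transaction count and average amount over (now - window, now].
    `transactions` is an iterable of (cc_num, ts, amt) tuples."""
    totals = defaultdict(lambda: [0, 0.0])  # cc_num -> [count, amount sum]
    for cc_num, ts, amt in transactions:
        if now - window < ts <= now:
            totals[cc_num][0] += 1
            totals[cc_num][1] += amt
    return {
        cc: {"txn_count": n, "avg_amt": total / n}
        for cc, (n, total) in totals.items()
    }


now = datetime(2022, 1, 8, 12, 0)
txns = [
    ("A", now - timedelta(minutes=5), 20.0),
    ("A", now - timedelta(days=2), 40.0),
    ("B", now - timedelta(days=3), 10.0),
    ("A", now - timedelta(days=10), 99.0),  # outside both windows
]
cc_weekly = aggregate_by_card(txns, now, timedelta(days=7))
cc_latest = aggregate_by_card(txns, now, timedelta(minutes=10))
print(cc_weekly)
print(cc_latest)
```

Keeping a long and a short window side by side lets a model compare a card's recent activity against its own weekly baseline.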

Fig: Intended architecture

TODO:

Integrate Kafka to ingest streaming data

ATTRIBUTION

List of references used:

AWS samples, which demonstrate the implementation of streaming feature aggregation using Amazon SageMaker Feature Store.
MLflow documentation.
