Using an autoencoder for anomaly detection on categorical data

Question

Say a dataset has 0.5% of its features continuous and 99.5% categorical (binary) with ~2400 features in total. In this dataset, each observation is 1 of 2 classes - Fraud (1) or Not Fraud (0). Furthermore, there is a large class imbalance with only 2.6% of examples being Fraud, and the other ~97% of examples being Not Fraud.

Say we want to predict whether a given example is Fraud or Not Fraud, and we take an anomaly detection approach using autoencoders.

Given the mixed data types in the dataset, will an autoencoder trained only on the Not Fraud examples generally perform well at predicting Fraud examples? Is there any literature suggesting which architectures work best, or whether some preprocessing (scaling and PCA) should be performed beforehand? I ask because I feel an autoencoder may be hard to train with binary features.

Is there any chance that you could also train it on Fraud examples? They are quite an important part of the equation. – mapto, commented Jul 9, 2018 at 17:50

Andreas Look · Accepted Answer · 2018-07-10 12:50:25Z

In general, an autoencoder should perform well at detecting fraud examples. In theory, fraud examples should have a much higher reconstruction error, because the network has only learned to reconstruct Not Fraud examples. I agree with you that training the autoencoder on binary data can be quite challenging. I suggest taking a look at this blog post: https://blog.evjang.com/2016/11/tutorial-categorical-variational.html
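To make the reconstruction-error idea concrete, here is a minimal sketch of the scoring step only, not of the autoencoder itself. It assumes you already have reconstructions `x_hat` from some autoencoder whose outputs for the binary columns are probabilities in [0, 1] (for example from a sigmoid output layer). The function names, the choice of binary cross-entropy for binary columns and squared error for continuous ones, and the 0.99 quantile are all my own illustrative assumptions, not something taken from the question or the linked post.

```python
import math


def reconstruction_error(x, x_hat, binary_idx, eps=1e-7):
    """Mean per-feature error for one example (hypothetical helper).

    Binary columns use binary cross-entropy, all other columns are
    treated as continuous and use squared error.
    """
    binary = set(binary_idx)
    total = 0.0
    for j, (target, pred) in enumerate(zip(x, x_hat)):
        if j in binary:
            p = min(max(pred, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
        else:
            total += (target - pred) ** 2
    return total / len(x)


def threshold_from_normal(errors, quantile=0.99):
    """Cut-off taken as an empirical quantile of errors on Not Fraud rows.

    The quantile is an assumption to be tuned on validation data.
    """
    ordered = sorted(errors)
    k = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[k]


def flag_anomalies(errors, threshold):
    """Flag an example as a suspected Fraud if its error exceeds the cut-off."""
    return [e > threshold for e in errors]


# Toy example: column 0 is continuous (already scaled), columns 1-3 are binary.
x = [0.2, 1, 0, 1]
close_reconstruction = [0.25, 0.9, 0.1, 0.8]
poor_reconstruction = [0.9, 0.2, 0.7, 0.3]
err_close = reconstruction_error(x, close_reconstruction, binary_idx=[1, 2, 3])
err_poor = reconstruction_error(x, poor_reconstruction, binary_idx=[1, 2, 3])
```

In practice you would compute the threshold from held-out Not Fraud examples that the autoencoder did not see during training, and then check on a labelled validation set how many Fraud examples actually land above it. With ~2400 features, averaging over all of them can drown out a few badly reconstructed columns, so it may be worth comparing against other aggregations (such as the sum or the maximum per-feature error).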
