8
$\begingroup$

Say a dataset has ~2400 features in total, of which 0.5% are continuous and 99.5% are categorical (binary). Each observation belongs to one of two classes: Fraud (1) or Not Fraud (0). There is also a large class imbalance: only 2.6% of examples are Fraud, and the remaining ~97.4% are Not Fraud.

Say we want to predict whether a given example is Fraud or Not Fraud, and we take an anomaly detection approach using autoencoders.

Given the mixed data types in the dataset, will an autoencoder trained only on the Not Fraud examples generally perform well at detecting Fraud examples? Is there any literature suggesting which architectures work best, or whether some preprocessing (e.g. scaling and PCA) should be performed beforehand? I ask because I feel an autoencoder may be hard to train with binary features.

$\endgroup$
1
  • $\begingroup$Is there any chance that you could also train it on Fraud examples? They are quite an important part of the equation.$\endgroup$
    – mapto
    Commented Jul 9, 2018 at 17:50

1 Answer

10
$\begingroup$

In general, an autoencoder should perform well at detecting fraud examples. In theory, an autoencoder trained only on Not Fraud examples should reconstruct fraud examples with a much higher reconstruction error, so that error can serve as the anomaly score. As for training the autoencoder on binary data, I agree with you that it can be quite challenging. I suggest taking a look at this blog post: https://blog.evjang.com/2016/11/tutorial-categorical-variational.html
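To make the reconstruction-error idea concrete, here is a minimal sketch in plain NumPy. Everything in it is an assumption for illustration: the synthetic data (20 binary features, where Not Fraud rows are noisy copies of 4 hidden bits and Fraud rows are independent random bits), the tiny one-hidden-layer architecture, the learning rate, and the 99th-percentile threshold. The names `make_not_fraud`, `make_fraud`, and `TinyAutoencoder` are hypothetical. It only covers binary features, using a sigmoid output with binary cross-entropy as the reconstruction error.

```python
import numpy as np

def make_not_fraud(rng, n, n_latent=4, copies=5, flip=0.02):
    # Assumed structure: each hidden bit is copied 5 times, then ~2% of bits are flipped.
    z = rng.integers(0, 2, size=(n, n_latent))
    x = np.repeat(z, copies, axis=1).astype(float)
    noise = rng.random(x.shape) < flip
    return np.abs(x - noise)

def make_fraud(rng, n, n_features=20):
    # Assumed anomaly: independent random bits that ignore the structure above.
    return rng.integers(0, 2, size=(n, n_features)).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce_per_row(x, x_hat, eps=1e-7):
    # Mean binary cross-entropy per example, used as the anomaly score.
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)).mean(axis=1)

class TinyAutoencoder:
    def __init__(self, n_in, n_hidden, seed=0):
        r = np.random.default_rng(seed)
        self.W1 = r.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = r.normal(0, 0.1, (n_hidden, n_in))
        self.b2 = np.zeros(n_in)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)          # encoder
        return h, sigmoid(h @ self.W2 + self.b2)    # decoder with sigmoid outputs

    def fit(self, x, epochs=3000, lr=0.1):
        n = len(x)
        for _ in range(epochs):
            h, x_hat = self.forward(x)
            d_out = (x_hat - x) / n                 # gradient of BCE w.r.t. output logits
            d_h = (d_out @ self.W2.T) * (1 - h ** 2)
            # Full-batch gradient descent; gradients are computed before any update.
            self.W2 -= lr * (h.T @ d_out)
            self.b2 -= lr * d_out.sum(axis=0)
            self.W1 -= lr * (x.T @ d_h)
            self.b1 -= lr * d_h.sum(axis=0)
        return self

    def score(self, x):
        return bce_per_row(x, self.forward(x)[1])

rng = np.random.default_rng(0)
x_train = make_not_fraud(rng, 1000)      # train on Not Fraud only
x_normal = make_not_fraud(rng, 974)      # held-out Not Fraud
x_fraud = make_fraud(rng, 26)            # ~2.6% Fraud, as in the question

model = TinyAutoencoder(n_in=20, n_hidden=4).fit(x_train)
threshold = np.percentile(model.score(x_train), 99)   # assumed cut-off

normal_scores = model.score(x_normal)
fraud_scores = model.score(x_fraud)
print("mean score, Not Fraud:", normal_scores.mean())
print("mean score, Fraud:    ", fraud_scores.mean())
print("fraction of Fraud above threshold:", (fraud_scores > threshold).mean())
```

On real data the separation will depend on how much structure the Not Fraud examples actually have, so the threshold would need to be tuned on a validation set that includes some labelled Fraud examples.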

$\endgroup$
