1
$\begingroup$

I am working on an anomaly detection use case. I read about one technique that selects the threshold so that 5% of the validation data is marked as anomalous. How does this work in anomaly detection? There is also another technique that selects the threshold that maximizes the difference between TPR and FPR.

Which technique is more helpful when the model is trained in an unsupervised way and then compared with ground truth?

We can find the ideal threshold by plotting a ROC curve of the TP and FP rates, but is that a good technique to follow in an unsupervised scenario?

$\endgroup$

    1 Answer

    1
    $\begingroup$

    Unsupervised means that you don't have any labelled data. To know the true positive rate and the false positive rate you need labels. In the absence of labelled data, the ROC curve cannot be calculated.
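
If ground-truth labels do become available for a validation set, the "maximize TPR minus FPR" rule from the question (often called Youden's J statistic) can be sketched as below. This is a minimal illustration, not a definitive implementation: the function name `youden_threshold`, the toy scores, and the convention that a higher score means "more anomalous" are all assumptions made for the example. In practice a library routine such as `sklearn.metrics.roc_curve` would typically supply the TPR/FPR pairs.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR.

    Assumptions for this sketch:
      - scores: higher means more anomalous
      - labels: 1 for anomaly, 0 for normal (ground truth is required)
      - a point is flagged as an anomaly when score >= threshold
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one anomaly and one normal point")

    best_t, best_j = None, float("-inf")
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg  # TPR - FPR
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy, made-up data: four normal points and two anomalies.
scores = [0.10, 0.20, 0.25, 0.30, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(threshold, j)
```

Without the `labels` argument there is nothing to compute TPR or FPR from, which is why this rule only applies once some labelled data exists.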

    You may be talking about isolation forest, which assumes that some percentage of the data is anomalous; that percentage is a hyperparameter defined by the user. So you can choose 1% or 10%, depending on the business use case at hand.
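
The "mark a fixed percentage as anomalies" rule needs no labels: it just picks the score cut-off that the chosen fraction of validation points exceeds. The sketch below assumes each validation point already has an anomaly score where higher means more anomalous; the name `quantile_threshold`, the `anomaly_fraction` parameter, and the toy scores are hypothetical and only for illustration. In scikit-learn's `IsolationForest`, the `contamination` parameter plays a similar role.

```python
def quantile_threshold(val_scores, anomaly_fraction=0.05):
    """Pick the cut-off so that roughly `anomaly_fraction` of the
    validation scores are flagged (score >= threshold).

    Assumption for this sketch: higher score means more anomalous.
    With tied scores, slightly more points than requested may be flagged.
    """
    if not 0 < anomaly_fraction < 1:
        raise ValueError("anomaly_fraction must be between 0 and 1")
    ordered = sorted(val_scores, reverse=True)
    # Number of points to flag; always flag at least one.
    k = max(1, round(anomaly_fraction * len(ordered)))
    return ordered[k - 1]


# Toy, made-up validation scores: the integers 0..99.
val_scores = list(range(100))
threshold = quantile_threshold(val_scores, anomaly_fraction=0.05)
n_flagged = sum(1 for s in val_scores if s >= threshold)
print(threshold, n_flagged)
```

The fraction itself is not learned from the data; it encodes a prior belief about how rare anomalies are, so 5% is only appropriate if that matches the domain.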

    $\endgroup$
    6
    • $\begingroup$And what about selecting the threshold that marks 5% of the validation data as anomalies? How does it work?$\endgroup$
      – user12
      Commented Apr 20, 2022 at 6:36
    • $\begingroup$I think you are telling the model to consider 5% of the training data as anomalous, so the model will be trained to predict 5% of the data as anomalies.$\endgroup$ Commented Apr 20, 2022 at 6:43
    • $\begingroup$I am not telling the model anything; I am asking about the method of calculating the threshold that marks 5% of the validation data as anomalies. What is the purpose of this, and how does it make sense?$\endgroup$
      – user12
      Commented Apr 20, 2022 at 6:50
    • 1
      $\begingroup$You are telling your model to classify 5% of your data as anomalies. I am not sure where you heard 5%; it can be 1%, 2%, or any other value, depending on your domain understanding. The significance is just that the model will be trained to classify 5% of the data as anomalies.$\endgroup$ Commented Apr 20, 2022 at 6:55
    • 1
      $\begingroup$Usually people keep a validation set because the training and test data may be used during the process of tuning the model. The validation dataset acts as a litmus test for your model before it goes to production. But if you use the test data as well, it would not be wrong.$\endgroup$ Commented Apr 20, 2022 at 7:30
