In many biometrics identification papers the performance is measured by computing the Equal Error Rate (EER). When dealing with a verification problem, or any other binary classification problem, the definition is clear and makes sense:
Equal error rate or crossover error rate (EER or CER): the rate at which both acceptance and rejection errors are equal. The value of the EER can be easily obtained from the ROC curve. The EER is a quick way to compare the accuracy of devices with different ROC curves. In general, the device with the lowest EER is the most accurate.
In other words, $\mathrm{FAR} = \mathrm{FPR} = \frac{FP}{FP+TN}$ and $\mathrm{FRR} = \mathrm{FNR} = \frac{FN}{FN+TP}$, and the EER is obtained at the threshold where $\mathrm{FAR} = \mathrm{FRR}$. The threshold can be found from the ROC curve, e.g. with the sklearn package in Python.
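For concreteness, here is a minimal sketch of how the binary EER could be computed from the ROC curve with sklearn; the function name `compute_eer` and the "closest point where FAR ≈ FRR" rule are my own choices here, not a standard API:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(y_true, y_score):
    """Minimal sketch: binary EER from the ROC curve.
    y_true: binary labels (0/1), y_score: classifier scores for the positive class."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    fnr = 1 - tpr                                  # FRR = FNR
    idx = np.nanargmin(np.abs(fnr - fpr))          # point where FAR and FRR are closest
    eer = (fpr[idx] + fnr[idx]) / 2                # average the two (nearly equal) rates
    return eer, thresholds[idx]
```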
However, many papers use the EER metric also for the identification problem, where the biometric signature is compared to signatures in a pre-enrolled database and the output is the user from the database who is the most probable source of the biometric signature (a complicated phrasing for "identifying a person"). This is a non-binary, multiclass classification problem. I want to understand what the definition of EER is in such a case, since the generalization is not straightforward.
A common approach is to binarize the problem by considering $n$ binary "one vs. rest" problems, one per class. That is, for a given class there are two possible labels: POSITIVE (the sample was classified to this class) and NEGATIVE (the sample was classified to any other class). Then for each class we have a binary classification problem and can calculate FAR and FRR, and thus an EER. We can then take the average of the EER over all the classes, $$ \mathrm{EER} = \frac{1}{n}\sum_{i=1}^n \mathrm{EER}(i), $$ and this will be the EER score of the model/classifier.
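A sketch of this macro-averaged "one vs. rest" EER, assuming per-class scores (e.g. softmax outputs) of shape `(N, n_classes)` and integer ground-truth labels; `multiclass_eer` is a hypothetical helper of mine, not a library function:

```python
import numpy as np
from sklearn.metrics import roc_curve

def multiclass_eer(y_true, y_prob):
    """Sketch: macro-averaged one-vs-rest EER.
    y_true: integer labels of shape (N,)
    y_prob: per-class scores/probabilities of shape (N, n_classes)."""
    eers = []
    for c in range(y_prob.shape[1]):
        y_bin = (y_true == c).astype(int)      # class c = POSITIVE, rest = NEGATIVE
        if y_bin.sum() == 0:                   # skip classes absent from the data
            continue
        fpr, tpr, _ = roc_curve(y_bin, y_prob[:, c])
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))  # per-class EER threshold
        eers.append((fpr[idx] + fnr[idx]) / 2)
    return float(np.mean(eers))
```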
Open issues:
- This seems to be the simplest generalization of EER to the multiclass problem. However, there may be other possibilities that are more common in the biometrics community. Is my definition correct and the one used in research papers?
- For each binary classification one needs to determine a threshold, and different classes can have different thresholds. Moreover, if the model returns probabilities (e.g. from a softmax), one must also choose which quantity to threshold, e.g., the maximal probability.
- Implementation in TensorFlow for Python, as I want the EER metric to be used during training (mainly for validation). Since training will involve a very large dataset, an efficient TensorFlow implementation is desired.
- Would accuracy on the validation set be a better metric? Its key advantage is that it is already built into TensorFlow and is easily applied to multiclass classifiers.
The TowardsDataScience article confirms my strategy ("One vs. Rest"), but now I need an efficient TensorFlow implementation of it, so it can work as a validation metric with a very large training dataset; a possible starting point is sketched below.
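One possible starting point, assuming a tf.keras model and a validation set that fits through `model.predict`, is a callback that recomputes the macro-averaged EER at the end of each epoch. The class name `MacroEERCallback` and the log key `val_macro_eer` are my own; this is not a streaming, graph-mode metric, so it may still be too slow for very large validation sets:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_curve

class MacroEERCallback(tf.keras.callbacks.Callback):
    """Sketch: compute macro-averaged one-vs-rest EER on a validation set
    at the end of each epoch and record it in the training logs."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = np.asarray(y_val)           # integer labels, shape (N,)

    def on_epoch_end(self, epoch, logs=None):
        y_prob = self.model.predict(self.x_val, verbose=0)   # (N, n_classes)
        eers = []
        for c in range(y_prob.shape[1]):
            y_bin = (self.y_val == c).astype(int)
            if y_bin.sum() == 0:                 # class absent from validation set
                continue
            fpr, tpr, _ = roc_curve(y_bin, y_prob[:, c])
            fnr = 1 - tpr
            idx = np.nanargmin(np.abs(fnr - fpr))
            eers.append((fpr[idx] + fnr[idx]) / 2)
        if logs is not None and eers:
            logs["val_macro_eer"] = float(np.mean(eers))

# Usage sketch:
# model.fit(x_train, y_train, epochs=10,
#           callbacks=[MacroEERCallback(x_val, y_val)])
```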