
I have a medium-sized data set (7K rows) of patient age, sex, and pre-existing conditions. Age ranges from 0 to 101; sex is 1 for male, 2 for female, and -1 for diverse. All the pre-existing conditions are Boolean, and so is the outcome, death.

Regardless of how I scale the data (I tried lots of scalers), I always get a warning:

FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
This traces back to:
    ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'

If I take out the age and sex columns, the error goes away. There are definitely no text, missing, or weird values here.

If I look at my rescaled data, it looks as I would expect.

[df.describe() output of the rescaled data]

If I drastically simplify the data, it works.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

array = np.array([[42, 1, False, False, False, False, False, False, False, False, False, False, False],
                  [72, 1, False, False, True, False, False, False, False, False, False, True, False],
                  [77, 2, False, False, False, False, False, False, False, False, True, True, False],
                  [36, 1, False, False, False, False, False, False, False, False, False, False, False],
                  [42, 1, False, False, False, False, False, False, False, False, True, False, False],
                  [82, 1, False, False, False, True, False, False, False, False, False, True, False],
                  [71, 2, False, False, False, False, False, False, False, False, False, True, False],
                  [36, -1, False, False, False, False, False, False, False, False, True, False, False],
                  [52, 1, False, False, False, False, False, False, False, False, False, False, False],
                  [52, 1, False, False, False, False, False, False, False, True, False, True, True],
                  [77, 2, False, False, False, False, False, False, True, False, True, True, False],
                  [46, 1, False, False, False, False, False, False, False, False, False, False, False],
                  [45, 1, False, False, False, False, False, False, False, False, False, False, False],
                  [88, 1, False, False, False, False, False, True, False, False, False, True, True],
                  [79, 2, False, True, True, False, False, False, False, False, False, True, True],
                  [36, -1, True, False, False, False, False, False, False, False, False, False, False]])

X = array[:, 0:12]
Y = array[:, 12]

scaler = preprocessing.MinMaxScaler().fit(X)
rescaledX = scaler.transform(X)

kfold = KFold(n_splits=3, shuffle=True, random_state=7)  # k-fold split into training and test sets
model = LogisticRegression(solver='lbfgs')               # set up a logistic regression model
results = cross_val_score(model, rescaledX, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))
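One more check that helps narrow this down is asking scikit-learn directly what it thinks the label type is, both for the toy array above and for the real outcome column (assuming it is called 'death' in df):

from sklearn.utils.multiclass import type_of_target

# In the simplified example the booleans get upcast to integers when numpy
# builds the array, so the labels come out as a clean binary target:
print(Y.dtype, type_of_target(Y))                             # e.g. int64 'binary'

# The same check on the real outcome column ('death' is an assumed name)
# shows what scikit-learn actually sees; an object-dtype column of booleans
# is typically reported as 'unknown', which matches the warning above:
print(df['death'].dtype, type_of_target(df['death'].values))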

It would be awesome if someone has an idea of what might be wrong, or how to troubleshoot further.

  • The value error is saying your labels, i.e. y-values, are errant. Does it work without cross_val_score? If so, set error_score="raise" in cross_val_score so you get the full traceback; if not, you can get the traceback for the model itself without cross-validation. In either case, edit the full error traceback into the question, and say some more about your y-values.
    – Ben Reiniger
    Commented Sep 16, 2021 at 16:22
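
For reference, that suggestion is a single extra argument on the cross_val_score call from the question:

# error_score="raise" surfaces the underlying ValueError as a full traceback
# instead of swallowing it into a FitFailedWarning.
results = cross_val_score(model, rescaledX, Y, cv=kfold, error_score="raise")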

1 Answer


So, to me, what you have to do is:

  • Transform all your True/False values to 1/0, so they're numerical. Keep age as it is (or apply some normalisation, but that is not strictly necessary).

  • Absolutely change the way sex is handled. With 3 values in one numerical column you build in a big bias: since the encoding is numerical, distance matters. Here, the distance between "male" and "diverse" is 2, while the distance between "female" and "diverse" is 3. Given your problem, there is no logical reason for that ordering, and it will bias your model.

You should read this answer: https://datascience.stackexchange.com/a/79575/101580. In your case, one-hot encoding is good enough since you only have 3 values.
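
A rough sketch of both steps, assuming the columns are called 'age', 'sex', and 'death' (adapt the names to your DataFrame):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# df is the original patient DataFrame from the question; the column names
# 'age', 'sex' and 'death' are assumptions on my part.
condition_cols = [c for c in df.columns if c not in ('age', 'sex', 'death')]

X = df[['age', 'sex'] + condition_cols].copy()
X[condition_cols] = X[condition_cols].astype(int)    # True/False -> 1/0
y = df['death'].astype(int)                          # numeric labels for scikit-learn

preprocess = ColumnTransformer(
    [
        ('age', MinMaxScaler(), ['age']),                             # optional scaling of age
        ('sex', OneHotEncoder(handle_unknown='ignore'), ['sex']),     # one column per sex value, no artificial ordering
    ],
    remainder='passthrough',                                          # the 0/1 condition columns pass through unchanged
)

model = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(solver='lbfgs')),
])

kfold = KFold(n_splits=3, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Because the encoding and scaling live inside the pipeline, they are re-fit on each training fold during cross-validation, so nothing leaks from the test folds.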

  • Yeah, so the key was really to convert my booleans into 0s and 1s. I thought Python treated them as 0s and 1s anyway, and that MaxAbsScaler would leave them as 1s and 0s. OneHotEncoder might work for the sex issue, but I had iteration issues, I think with the age data, and iteration issues again when the data was not normalized. So I'm converting the booleans to 1s and 0s and using MaxAbsScaler - thanks.
    – DrWhat
    Commented Sep 17, 2021 at 11:01
