
I would like to ask for help with the following.

Given the following dataset, which I have split into train and test sets:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Loading data
df = pd.read_csv("https://raw.githubusercontent.com/karsarobert/Machine_Learning_2024/main/train.csv")

# Setting target variable and predictors
y = df['target_reg']
corr_col = ['arbevexp_2014', 'arbevexp_2015', 'arbevexp_2016',
            'arbevert_2014', 'arbevert_2015', 'arbevert_2016',
            'ranyag_2014', 'ranyag_2015', 'ranyag_2016', 'rszem_2016']
X = df[corr_col]

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Fitting StandardScaler on the training data only, then applying it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

I have tried many machine learning methods and algorithms. So far, the most accurate (MAE: 50799) has been a Random Forest Regressor tuned with Bayesian optimization, using the following hyperparameters:

{'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 49}

My question is, how can I find better hyperparameters? What search methods are there besides brute force? Is there a well-functioning genetic algorithm or TPE for this?

I have already tried linear regression, KNN, SVM, GridSearchCV/RandomizedSearchCV, SGB, CatBoost, Ridge regression, etc. I don't think a neural network is suitable for this problem because it overfits.


    1 Answer


    It seems as if you have tried the main tabular machine-learning model types, but I would suggest looking into Optuna. It is a model-agnostic hyperparameter optimization framework that, in my experience, works well and is considerably faster than brute-forcing combinations with grid search. Optuna behaves more like randomized search, but it makes educated guesses based on the trials it has already seen while still probing unexplored regions of the search space. (I would have posted this as a comment, but I don't have the required reputation.)
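    For example, a minimal Optuna sketch for the Random Forest setup in the question might look like the following. Note the assumptions: the original CSV is replaced with a synthetic regression problem (`make_regression`), and the search ranges are illustrative, not the poster's actual bounds. Optuna's default sampler is TPE, which covers the "TPE for this?" part of the question.

    ```python
    import optuna
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the original train.csv (10 predictors, one target)
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )

    def objective(trial):
        # Each trial proposes one hyperparameter combination; TPE biases
        # later proposals toward regions that produced low MAE so far.
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 20, 200),
            "max_depth": trial.suggest_int("max_depth", 3, 30),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        }
        model = RandomForestRegressor(random_state=42, **params)
        model.fit(X_train, y_train)
        return mean_absolute_error(y_test, model.predict(X_test))

    # direction="minimize" because lower MAE is better; TPESampler is the default
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    print(study.best_params, study.best_value)
    ```

    In practice you would evaluate each trial with cross-validation rather than a single held-out split, so the search does not overfit the test set.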

    • Thanks, that's a pretty good lib! However, a better MAE of 50772 came only after 607 steps (about 25 minutes). Would it be worth running Optuna for 2-3 hours in the hope of better results? – Commented May 24, 2024 at 22:22
    • Did you reach the better score with Optuna? If so, 600 steps (or "trials," as they call them) is a lot, and I think it would only get marginally better, if at all. Although the new score is better, it is still only a small improvement, which is why I would assume that further hyperparameter tuning might not lead to your desired results. Something I noticed: since you are doing a 60:40 train/test split, you might not have enough training samples if your dataset is small. – Guest, Commented May 25, 2024 at 19:52
    • More training samples? Okay, I will also try data expansion methods... – Commented May 25, 2024 at 21:03
    • Yes, data augmentation methods (noise and synthetic training data) were very useful for me. But the real solution was not that: I analyzed the correlation matrix and looked for new predictor variables... – Commented May 30, 2024 at 22:35
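    The correlation analysis the last comment describes can be sketched like this. The DataFrame here is synthetic (the original `train.csv` columns are not reproduced), and the column names are placeholders: the idea is simply to rank candidate predictors by absolute correlation with the target.

    ```python
    import numpy as np
    import pandas as pd

    # Synthetic stand-in frame; "target_reg" mirrors the question's target name
    rng = np.random.default_rng(42)
    df = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=["a", "b", "c", "d", "target_reg"])
    # Make one feature deliberately informative so the ranking is visible
    df["b"] = df["target_reg"] * 0.8 + rng.normal(scale=0.3, size=200)

    # Rank features by absolute Pearson correlation with the target
    corr = (df.corr()["target_reg"]
              .drop("target_reg")
              .abs()
              .sort_values(ascending=False))
    print(corr)  # strongly correlated columns are candidate predictors
    ```

    Keep in mind that Pearson correlation only captures linear relationships; for a Random Forest, a feature with a strong nonlinear relationship to the target can still be valuable even if its correlation is low.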
