
Many algorithms and methods in modern machine learning contain randomness, so running the same ML script several times can produce different outputs, and therefore different accuracy results. For example, a Random Forest run can produce an accuracy of 0.78, and then, when run again with no change in data, setup, or code, 0.79. This makes it impossible to perform controlled experiments when I am testing how changes in the input affect the output.

So, in order to perform perfectly controlled experiments and achieve the best model output, what is the full set of random parameters I should fix? I want the whole process to be completely deterministic.

PS: I am using the scikit-learn environment with additional algorithms such as XGBoost, CatBoost, and LightGBM.

I assume there are some parameters (random_state values) I should fix in NumPy, too.


    1 Answer


    Scikit-learn uses NumPy for pseudo-random number generation. So to fix the random state across various scikit-learn calls, call numpy.random.seed(12345) before using scikit-learn (or, more robustly, pass a random_state to each estimator). You would also want to record the random seed when you log the model, so you can reproduce the same run later.
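A minimal sketch of this, using a toy dataset; the dataset shape and seed values here are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Seed NumPy's global generator before any scikit-learn call.
np.random.seed(12345)

X, y = make_classification(n_samples=200, random_state=0)

# Passing random_state to the estimator itself is the more robust option:
# it pins that estimator's randomness regardless of global state.
clf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Same seed, same data -> identical fitted models and predictions.
assert (clf_a.predict(X) == clf_b.predict(X)).all()
```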

    If your code (or something you call) also uses Python's built-in random number generator, set random.seed as well.
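In practice it is convenient to seed both generators in one place; seed_everything below is a hypothetical helper name, not a library function:

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    # Hypothetical helper: seed both Python's and NumPy's global generators.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(12345)
a = (random.random(), np.random.rand())

seed_everything(12345)
b = (random.random(), np.random.rand())

# Re-seeding with the same value reproduces the same draws.
assert a == b
```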

    How you set the seed depends on the library. For example, I believe most XGBoost APIs expose a seed (or random_state) parameter instead. I'm not sure about CatBoost.

    You're also depending on the library exposing a way to seed every pseudo-random choice it makes, and some libraries may not do so completely. (Or, in the case of Spark, the result can conceivably even depend on the order of distributed execution, which is hard to control.)
