Deep Q Learning Algorithm for Simple Python Game makes player stuck

Question

I made a simple Python game. Basically, a paddle moves left and right catching particles. Some make you lose points while others make you gain points.

This is my first Deep Q Learning Project, so I probably messed something up, but here is what I have:

model = Sequential()
model.add(Dense(200, input_shape=(4,), activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(3, activation='linear'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

The four inputs are the X position of the player, the X and Y position of the particle (one at a time), and the type of particle. The output is left, right, or don't move.

Here is the learning algorithm:

def learning(num_episodes=500):
    y = 0.8
    eps = 0.5
    decay_factor = 0.9999
    for i in range(num_episodes):
        state = GAME.reset()
        GAME.done = False
        eps *= decay_factor
        done = False
        while not done:
            if np.random.random() < eps:  # exploration
                a = np.random.randint(0, 2)
            else:
                a = np.argmax(model.predict(state))
            new_state, reward, done = GAME.step(a)  # does that step
            # reward can be -20, -5, 1, and 5
            target = reward + y * np.max(model.predict(new_state))
            target_vec = model.predict(state)[0]
            target_vec[a] = target
            model.fit(state, target_vec.reshape(-1, 3), epochs=1, verbose=0)
            state = new_state

After training, this usually results in the paddle just going to the side and staying there. I am not sure if the NN architecture (units and hidden layers) is appropriate for given complexity. Also, is it possible that this is failing due to the rewards being very delayed? It can take 100+ frames to get to the food, so maybe this isn't registering well with the neural network.

I only started learning about reinforcement learning yesterday, so I would appreciate any advice!

Neil Slater · Accepted Answer · 2019-08-05 13:22:21Z

This is probably the biggest factor:

model.compile(loss='categorical_crossentropy', optimizer='adam')

You have set the loss function for a multiclass classifier. It is going to give some weird results when values, either predicted or target, are outside the range 0..1.

You should use this instead:

model.compile(loss='mean_squared_error', optimizer='adam')

because your Q network outputs the expected future return on each action. This could easily be outside of the range that 'categorical_crossentropy' is designed for.
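To make the point about the value range concrete, here is a minimal NumPy sketch comparing the two loss formulas on a Q-value-like target. It is a simplified stand-in and not the Keras implementation: the `cross_entropy` helper, its clipping constant, and the example numbers are all assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-7):
    # Textbook categorical cross-entropy: -sum(target * log(pred)).
    # Predictions are clipped into [eps, 1] so that the log is defined
    # (an assumption made for this sketch).
    pred = np.clip(pred, eps, 1.0)
    return float(-np.sum(target * np.log(pred)))

def mean_squared_error(target, pred):
    return float(np.mean((target - pred) ** 2))

# A Q-value-like target, using reward magnitudes mentioned in the question.
target = np.array([-20.0, 1.0, 5.0])
exact = target.copy()                    # a perfect prediction
wrong = np.array([-100.0, 50.0, 50.0])   # a badly wrong prediction

ce_exact = cross_entropy(target, exact)
ce_wrong = cross_entropy(target, wrong)
mse_exact = mean_squared_error(target, exact)
mse_wrong = mean_squared_error(target, wrong)

print(ce_exact, ce_wrong)    # the same value for both predictions
print(mse_exact, mse_wrong)  # zero only for the perfect prediction
```

Under these assumptions the cross-entropy cannot tell the perfect prediction from the wrong one, because everything outside 0..1 is clipped away, while the squared error is zero only when the prediction matches the target.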

In addition, you really need to look into experience replay. It is not an optional extra when using neural networks with Q learning - it is pretty much required for anything but the most trivial environments. It is very likely your agent will still fail to learn without experience replay, even if you correct all other problems with your code.
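As a rough illustration of what experience replay involves, the sketch below stores transitions in a fixed-size buffer and builds training targets for a random mini-batch. The names `ReplayBuffer`, `build_targets` and `epsilon_greedy`, the buffer capacity, and the choice to store states as flat length-4 vectors are all assumptions for this example, not part of the question's code.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop off first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=float),
                np.array(next_states), np.array(dones, dtype=bool))

    def __len__(self):
        return len(self.memory)


def build_targets(q_states, q_next_states, actions, rewards, dones, gamma):
    """Copy the current Q-values and overwrite only the taken action's entry."""
    targets = q_states.copy()
    # Terminal transitions get no bootstrapped future value.
    future = np.where(dones, 0.0, np.max(q_next_states, axis=1))
    targets[np.arange(len(actions)), actions] = rewards + gamma * future
    return targets


def epsilon_greedy(q_values, eps, n_actions=3):
    # random.randrange(3) can return 0, 1 or 2. By contrast,
    # np.random.randint(0, 2) excludes its upper bound and never returns 2.
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(np.argmax(q_values))
```

In the question's loop, this would take the place of the single-sample `model.fit` call: store each transition with `add`, then sample a batch, run `model.predict` on the sampled states and next states, and fit the network on the output of `build_targets`.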

> I am not sure if the NN architecture (units and hidden layers) is appropriate for given complexity.

It looks more complex than it needs to be, assuming that your 4 inputs represent paddle x position, particle x,y position and colour. I would suggest making the network simpler (maybe just 40 neurons per layer at a guess), to speed things up a little.

Check your input scaling. Neural networks like to train on inputs that have mean 0, standard deviation 1, and it is worth scaling them so that they fit roughly into -1..1 or similar. Your feature engineering is not shown in your code, so it might be an issue.
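A minimal sketch of such scaling, assuming the raw inputs are pixel coordinates and that the particle type is a two-valued flag. The screen dimensions, the `scale_state` name and the type encoding are hypothetical, since the question does not show them.

```python
import numpy as np

# Hypothetical game dimensions; replace with the real screen size.
SCREEN_WIDTH = 800.0
SCREEN_HEIGHT = 600.0


def scale_state(player_x, particle_x, particle_y, particle_type):
    """Map raw pixel coordinates into roughly -1..1 for the network."""
    return np.array([[
        2.0 * player_x / SCREEN_WIDTH - 1.0,
        2.0 * particle_x / SCREEN_WIDTH - 1.0,
        2.0 * particle_y / SCREEN_HEIGHT - 1.0,
        # Assumes two particle types: 1 = good, anything else = bad.
        1.0 if particle_type == 1 else -1.0,
    ]])


print(scale_state(400, 0, 600, 1))
```

The result has shape `(1, 4)`, which matches the `input_shape=(4,)` of the network when passed as a batch of one state.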

> Also, is it possible that this is failing due to the rewards being very delayed?

This can be a factor that makes learning harder.

> It can take 100+ frames to get to the food, so maybe this isn't registering well with the neural network.

A delay of 100 time steps between rewards is not much for DQN. It should be correctly predicting Q values - it will just take more episodes to learn to predict the best movement when the food is further away.
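How strongly a delayed reward shows up in the Q values also depends on the discount factor. As a rough illustration, the snippet below computes the weight that a reward 100 steps away receives, using the question's `y = 0.8` and a hypothetical alternative of 0.99 (the latter is not a value from the original post).

```python
def discounted_weight(gamma, steps):
    """Weight applied to a reward that arrives `steps` time steps in the future."""
    return gamma ** steps


weights = {gamma: discounted_weight(gamma, 100) for gamma in (0.8, 0.99)}
for gamma, weight in weights.items():
    print(f"gamma={gamma}: weight after 100 steps = {weight:.3g}")
```

With 0.8 the weight is on the order of 1e-10, while with 0.99 it is roughly a third, so a discount factor closer to 1 may be worth trying if rewards really are about 100 frames apart.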

How is experience replay important in deep Q-learning? The learning algorithm is constantly pushing the output values up, so I wouldn't expect that to be an issue. - David, Commented Aug 5, 2019 at 15:41
@David: Experience replay works around significant issues with function approximation in off-policy learning. See also the answer here ai.stackexchange.com/questions/13202/… and I am sure I have written some others. The short answer is that without experience replay the online data is also pushing in biased and correlated (non-i.i.d.) directions, and the lower quality due to this can swamp the useful signal. - Neil Slater, Commented Aug 5, 2019 at 15:47
Thank you for your answer! I implemented your suggestions and made an edit. If you have time, please check it out. - shurup, Commented Aug 5, 2019 at 18:35
