Multiple Models with Time Delay in Pong


The game of Pong has been tackled with reinforcement learning many times and in several ways, notably with policy gradients and Q-learning. An interesting problem in reinforcement learning is the delayed reward problem: the effect of an action on the reward may not be immediate, so the reward arrives some time after the action that caused it. For example, in Pong, there are a few frames in which the ball has already passed the paddle, so no action the paddle takes during those frames can affect the impending reward. Yet, because of reward propagation, these actions are assigned returns that suggest they are much more important than earlier actions. By assigning variable delays to rewards, the agent should be better able to associate actions with rewards and, in turn, to choose high-reward actions.
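
To make the effect of reward propagation concrete, the following is a minimal sketch rather than the method used in this work. It assumes standard discounted returns, a hypothetical ten-frame rally that ends with a lost point, and a hypothetical fixed `delay` that moves each reward a few frames earlier before discounting. The function names, the discount factor, and the fixed shift are all assumptions made for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Standard reward propagation: each frame receives the discounted
    sum of all rewards that follow it in the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def shifted_returns(rewards, delay, gamma=0.99):
    """Hypothetical variant (an assumption, not the source's method):
    move every nonzero reward `delay` frames earlier, then discount.
    Frames after the shifted reward receive no credit from it."""
    shifted = [0.0] * len(rewards)
    for t, r in enumerate(rewards):
        if r != 0.0:
            shifted[max(t - delay, 0)] += r
    return discounted_returns(shifted, gamma)


if __name__ == "__main__":
    # Hypothetical rally: nine uneventful frames, then the point is lost.
    rewards = [0.0] * 9 + [-1.0]

    # Standard propagation gives the largest magnitude to the final frames,
    # even if the ball had already passed the paddle by then.
    print(discounted_returns(rewards, gamma=0.9))

    # With an assumed delay of 3 frames, the last 3 frames get zero credit.
    print(shifted_returns(rewards, delay=3, gamma=0.9))
```

Under standard propagation, the frames closest to the reward always carry the most credit, which is the mismatch described above. A variable delay, as opposed to the fixed one assumed here, would let the size of the shift depend on the situation in the game.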