Aim : To develop a Reinforcement Learning algorithm, to make a Bot(Agent) learn its way to solve a maze. The Bot is naive to the maze and its environment.

Logic: The logic in its core involves, letting a Bot start from the Starting point (S), with the freedom to move in any direction. Its end goal is to reach the end point (G) through the most optimised path. Hence, at each step, the Bot can perform 4 actions — go Left, right, top or bottom. To add to the trouble, the maze also has Holes (H), which the Bot has to avoid.

If you are a fan of Big Bang Theory Tv Series, you would remember the episode where Sheldon ‘trains’ Penny to behave according to his liking, by rewarding her with chocolate (positive reinforcement) and punishing her if she doesn’t act accordingly (negative reinforcement). Similarly, when kids do something nice, we reward them and if they are mischevious, we punish them. This same logic drives our Bot to learn its way to solve the maze.

The Bot is given a reward if it successfully reaches and the episode ends. If the Bot falls in a hole (H), it gains no reward and the episode ends. At any other state of the maze, the bot can perform the above mentioned 4 actions.

The above logic explains Reinforcement Learning(RL). RL is an aspect of Machine learning where an agent learns to behave in an environment, by performing certain actions and observing the rewards/results which it gets from those actions. It’s important to note that while RL at its core aims to maximize rewards/gains, implementing a greedy approach, doesn’t always lead to successful learning.

Algorithm illustrated: In this Part 1, I take the “ Cross Entropy” algorithm to make the Bot learn to do the task. The Cross-Entropy approach has the following features:

  • Model-free: In this type of ML algorithm, the RL doesn’t create a model of the training steps. Rather, it says to the agent what to do at each step.
  • Policy-Based: A policy is the Bot’s (Agent) strategy to take the next steps. It’s basically represented as a probability distribution of the available 4 actions of the Bot. The action which has the maximum probability becomes the next step’s action for the Bot.
  • On-Policy approach: This means that the agent requires fresh data to be obtained from the maze environment. The algorithm doesn’t dwell on old historic data.

Challenge of using Cross Entropy in the given setting:

The approach gains a reward(+1) if the Bot reaches the endpoint, but doesn’t tell anything about whether the Bot chose the most optimized route or whether it was just wandering around and ended up at endpoint just by chance. The algorithm doesn’t take into account whether the Bot was quick and efficient or did it go round the maze five times and randomly entered G. A reward of 0001 (Bot reaching endpoint in 4 steps) and reward of 0000000000001(Bot reaching endpoint in 13 steps) are given the same preference.

Also, the distribution of rewards is not good. There is only +1 or the 0 reward state. There are no intermediate state rewards.


The RL algorithm is implemented in PyTorch framework for Deep Learning implementation. To simulate the maze, I have used FastAI FrozenLake Gym Environment.

Necessary imports

The FrozenLake Env is a 4*4 maze. This gives 16 possible unknown states for the Bot. The discreteoneHotWrapper function aims to create a One-Hot Encoding for each of these 16 states by inheriting the Observation wrapper from gym environment. The inherited observation space is a representation of a fixed range of non-negative numbers (spaces.Discrete) and is converted to One Hot Encoding using the spaces.Box (represents an n-dimensional Box)

One Hot encoding of observation states

The Deep learning network implemented here is a basic one, with two linear layers and a ReLU. The network takes a single observation from the environment as the input vector (one hot encoded) and gives a probability distribution upon a softmax, for the 4 possible actions of the Bot.

DL Network

Here, I define two classes. The EpisodeStep helper class represents every step that the Bot takes in an episode, keeping a track of the observation from the environment and the action that the Bot takes. The Episode helper class represents a single episode and stores the Episode Total reward and the collection of the EpisodeStep.

Helper Classes

The iterate_batches function generates batches of episodes. The batch list accumulates the list of episode instances. The env.reset function resets the environment and generates the first observation. The observation tensor is then passed through the neural network which generates a probability distribution for the actions. The act_prob_v is a tensor which tracks gradients. Hence, the is used to unpack the tensor into an array of 2-dim with batch dimension on axis 0. Hence, the first batch is accessed to obtain the one-dimensional vector for action probabilities.

Batch generation

To make sure that the total reward depends on the length of the episode and to make the agent learn to find the most optimized route, a Discount Factor is applied to the reward (Gamma of 0.9 or 0.95). This will ensure that the reward for shorter episodes will be higher than the rewards of longer episodes. The elite_batch is used to accumulate the best episodes based on whether the discounted reward for a particular episode is greater than top 70 percentile of the total rewards across batches. This ensures that the elite episodes are kept over several iterations, thus overcoming the challenge poor reward distribution associated with the problem statement (Only +1 or 0 and no intermediate rewards).

Filtering Batches

The following main function instantiates the functions defined above. To give the network more time to average more training samples, the learning rate is kept pretty low. The gym.envs.toy_text ‘makes’ and instantiates the FrozenLake environment. The Adam optimiser is chosen for gradient descent.

main function

The elite batch catches the last 500 most elite episodes to be fed again and again to the network . The batch size is kept high. The gradients are calculated on these best training samples.


To handle the shortcomings of cross-entropy approach on FrozenLake env, we can allow the elite episodes to be kept in training iterations for a longer duration. Keeping a low learning rate while maintaining a larger batch size, also helps. The discount rate of around .9 helps the agent to learn the optimal path.

The link to the code can be found here.

In the part 2, we will discover a more sophisticated approach that can be taken to solve the problem statement and advanced concepts of Reinforcement learning.