# 3 mins of Reinforcement Learning: Q-Learning

# Introduction

For this post, I will write my own Q-learning algorithm to solve “Frozen Lake”, a game from the OpenAI Gym. “Frozen Lake” is a text-based maze environment that the agent learns to navigate. The ice is slippery, so the agent doesn’t always move in the direction it intends. In the rendered map below, S is the start, F is frozen (safe) ice, H is a hole that ends the episode with no reward, and G is the goal, which ends the episode with a reward of 1.

```
import gym
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
env.render()
```

```
SFFF
FHFH
FFFH
HFFG
```

# Question

Use Q-learning to learn a policy for this problem.

# About Q-Learning

Q-learning is an off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state. It’s considered off-policy because the Q-function is updated toward the value of the greedy (best) action in the next state, even while the agent itself behaves differently, for example by taking random exploratory actions. In this way, Q-learning learns the value of the optimal policy regardless of how the agent explores. More specifically, Q-learning seeks to learn a policy that maximizes the total reward.

Here is the heart of the algorithm, the Q-value update rule. After taking action $a$ in state $s$ and observing reward $r$ and next state $s'$, the estimate is updated as

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right),$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
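
As a minimal sketch (the function and variable names here are my own, not from any library), the update is a single line of NumPy on a tabular $Q$:

```
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.8, gamma=0.95):
    """One tabular Q-learning update: move Q[s, a] toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next, :])  # greedy (off-policy) target
    Q[s, a] += alpha * (td_target - Q[s, a])      # step by the TD error
    return Q
```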

# Code

I will simulate $100{,}000$ episodes. Let $G^i$ be the total reward for episode $i$. In the end, I will plot the running average of the rewards, $\frac{1}{i}\sum_{j=1}^i G^j$, for $i=1$ to $100{,}000$. I will use a discount factor $\gamma < 1$ so that the agent prefers shorter paths to the goal.
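
One detail worth calling out before the code: exploration is handled with an $\epsilon$-greedy rule whose $\epsilon$ decays over episodes as $\epsilon_i = (1+i)^{-\beta}$ with $\beta = 0.3$, so early episodes are mostly random and later episodes mostly follow the learned Q-table. A quick sketch of the schedule (just to show the decay, not part of the training loop):

```
import numpy as np

beta = 0.3
episodes = np.array([0, 10, 100, 1000, 10000, 99999])
epsilon = (1.0 / (1.0 + episodes)) ** beta  # same schedule as in the loop below
print(dict(zip(episodes.tolist(), epsilon.round(3).tolist())))
# approximately: {0: 1.0, 10: 0.487, 100: 0.25, 1000: 0.126, 10000: 0.063, 99999: 0.032}
```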

```
## initialize Q(s,a) as a table of zeros: one row per state, one column per action
Q = np.zeros([env.observation_space.n, env.action_space.n])
alpha = 0.8            # learning rate
gamma = 0.95           # discount factor
number_episode = 100000
beta = 0.3             # controls how fast exploration decays
G = []                 # total reward of each episode
## loop over episodes
for episode in range(number_episode):
    s = env.reset()
    G_episode = 0
    done = False
    episode_length = 0
    epsilon = (1 / (1 + episode)) ** beta  # decaying exploration rate
    while episode_length <= 100:  # episode must finish within 100 steps
        if rnd.rand() < epsilon:  # epsilon-greedy action selection
            a = env.action_space.sample()  # explore: pick a random action
        else:
            a = np.argmax(Q[s, :])  # exploit: pick the best known action
        s_next, r, done, _ = env.step(a)
        ## Q-learning update
        Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s_next, :]) - Q[s, a])
        G_episode += r
        s = s_next
        episode_length += 1
        ## the current episode ends
        if done:
            break
    G.append(G_episode)
print(sum(G) / number_episode)
print(Q)
plt.plot(np.cumsum(G) / np.arange(1, number_episode + 1), '.')
```

```
0.2956
[[5.86741698e-02 2.53313656e-01 4.24949460e-02 5.97031665e-02]
[1.73984707e-03 1.08475750e-02 8.74301728e-03 4.87616436e-02]
[2.90335097e-02 8.45174905e-03 8.14131382e-03 1.14581024e-02]
[6.09104910e-04 1.03482075e-02 6.14886460e-03 1.06232220e-02]
[2.02138540e-01 2.35482473e-02 2.12596664e-02 3.77846690e-02]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[1.14021279e-02 3.46033458e-05 2.85163045e-04 3.67230129e-06]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[1.24465789e-03 1.22092997e-02 4.15748858e-02 4.16826481e-01]
[2.96607414e-02 1.09266566e-01 1.22155422e-02 3.67418410e-02]
[2.56646265e-03 7.47419830e-01 4.32382306e-03 9.90243563e-04]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[3.66591204e-01 1.01563327e-02 6.26778623e-01 4.58509757e-03]
[1.04841677e-01 4.05469411e-01 9.80925785e-01 2.18991000e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
```

*(Plot output: running average of episode reward over the 100,000 episodes.)*

# Result

In the graph, you can see that the running average reward converges to about 0.3 after 100,000 episodes. Since an episode yields a reward of 1 only when the agent reaches the goal, this means the agent succeeded in roughly 30% of episodes, averaged over the entire run, including the early episodes when it was acting mostly at random.
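
The running average understates the quality of the final policy, because it mixes in those early exploratory episodes. As a rough check (a sketch of my own, using the `Q` and `env` from the code above, not part of the original post), you can evaluate the learned greedy policy directly by running fresh episodes with no exploration:

```
## evaluate the greedy policy from the learned Q-table (no exploration)
successes = 0
eval_episodes = 1000
for _ in range(eval_episodes):
    s = env.reset()
    for _ in range(100):
        s, r, done, _ = env.step(np.argmax(Q[s, :]))  # always act greedily
        if done:
            successes += r  # reward is 1 only on reaching the goal
            break
print(successes / eval_episodes)  # fraction of episodes reaching the goal
```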