May 15th, 2017 - CREATED
Reinforcement Learning 3
Summary of Professor Sung Kim's (김성훈) YouTube video lectures on reinforcement learning: https://www.youtube.com/watch?v=w9GwqPx7LW8&feature=youtu.be
Q-Network
- Q-Table -> what happens when we try to apply it to the real world…
- e.g. a 100x100 maze, Breakout … ==> with a Q-table the number of possible states is far too large; infeasible…
- Instead of a Q-table, use a network: the action is no longer an input; the network takes the state S and outputs the Q-values for every action at once (see the short sketch below).
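Roughly, the difference can be sketched as follows. This is my own illustration, not code from the lecture; the sizes (16 states, 4 actions) match the FrozenLake lab further down.

```python
import numpy as np

n_states, n_actions = 16, 4

# Q-table: one cell per (state, action) pair; the table grows with the state space.
q_table = np.zeros((n_states, n_actions))
q_values_from_table = q_table[3]            # Q(s=3, ·) by direct indexing

# Q-network (here just a single linear layer W, as in the lab below):
# the state goes in as a one-hot vector and all action values come out at once.
W = np.random.uniform(0, 0.01, size=(n_states, n_actions))
s_one_hot = np.identity(n_states)[3:4]      # shape (1, 16)
q_values_from_net = s_one_hot @ W           # shape (1, 4), the "Ws" in the notes
```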
Q-Network training
- cost(W) = (Ws - y)^2
- y = r + γ max_a′ Q(s′) (a small numeric sketch of this target and cost follows below)
- Q̂ : prediction, Q* : optimal, θ : weights
- Ws = Q̂(s, a | θ) ≈ Q*(s, a)
- Algorithm 1: Deep Q-learning with Experience Replay (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
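To make the target and cost above concrete, here is a tiny numeric sketch (the numbers are made up purely for illustration):

```python
import numpy as np

gamma = 0.99
r = 0.0                                      # reward observed for taking a in s
q_s_prime = np.array([0.1, 0.4, 0.2, 0.0])   # network output Q(s', ·) for the next state
q_s_a = 0.25                                 # current prediction Ws for (s, a)

y = r + gamma * np.max(q_s_prime)            # target y = r + γ max_a' Q(s'), about 0.396
cost = (q_s_a - y) ** 2                      # cost(W) = (Ws - y)^2, about 0.0213
```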
- line 2: initialize the weights randomly.
- line 3: get the first state s1; ϕ(s1) preprocesses s1.
- line 5~: using ε (ε-greedy), either take a random action or take the action that has worked best from experience.
- Set y(j): in a terminal state, i.e. the goal, the target is just the reward r(j); in a non-terminal state it is r + γ max_a′ … then perform a gradient descent step…
- Converges to Q* using table lookup representation
- But diverges using neural networks due to: Correlations between samples, Non-stationary targets
- ==> Solution: DQN: go deep, experience replay, separate (target) networks (see the sketch below)
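The replay and separate-network fixes can be summarized in a compact sketch. This is my own illustration, not the lecture's code; `q` and `q_target` stand for a hypothetical small Q-network interface with `predict()`/`update()` methods.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99
replay_buffer = deque(maxlen=50000)  # replay memory D: (s, a, r, s1, done) tuples


def train_from_replay(q, q_target, batch_size=32):
    """One DQN-style update step (hypothetical q/q_target interface)."""
    # Experience replay: sampling random minibatches from D breaks the
    # correlation between consecutive samples.
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for s, a, r, s1, done in batch:
        target = q.predict(s)                # current estimates, shape (1, n_actions)
        if done:
            target[0, a] = r                 # terminal: y = r
        else:
            # Separate target network: bootstrapping from q_target, whose weights
            # are only synced periodically, reduces the non-stationary-target problem.
            target[0, a] = r + GAMMA * np.max(q_target.predict(s1))
        q.update(s, target)                  # one gradient-descent step on (Ws - y)^2
```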
Lab 4: Q-network
- If there are many states, expand the state with one-hot encoding (illustrated in the short example below).
- one-hot input: 16 values, output: 4 actions (left, right, up, down)
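For reference, the one-hot encoding used in the code turns each state into a 1x16 row vector, e.g. for state 3:

```python
import numpy as np

print(np.identity(16)[3:3 + 1])
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
```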
```python
import gym
import numpy as np
import matplotlib.pyplot as plt
import time
import tensorflow as tf

env = gym.make('FrozenLake-v0')
env.render()

one_hot_array = None  # filled in by init_array() below


def init_array(x):
    global one_hot_array
    one_hot_array = np.identity(x)


def get_one_hot(x):
    # Return state x as a 1 x n one-hot row vector
    return one_hot_array[x:x + 1]


def playgame():
    input_size = env.observation_space.n   # 16 states (one-hot)
    output_size = env.action_space.n       # 4 actions
    learning_rate = 0.1

    # X is fed as a 1x16 one-hot array
    X = tf.placeholder(shape=[1, input_size], dtype=tf.float32)
    # in: 16, out: 4, initialized with small values near 0
    W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    # Q prediction: X * W
    # cost(W) = (Ws - y)^2
    Qpred = tf.matmul(X, W)  # ==> Ws
    # Y label
    Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(Y - Qpred))  # cost(W)
    train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

    num_episodes = 2000
    dis = 0.99  # discount factor
    rList = []

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for i in range(num_episodes):
            s = env.reset()
            e = 1. / ((i / 50) + 10)  # decaying exploration rate
            rAll = 0
            done = False
            local_loss = []

            while not done:
                Qs = sess.run(Qpred, feed_dict={X: get_one_hot(s)})
                if np.random.rand(1) < e:  # decaying E-greedy
                    action = env.action_space.sample()
                else:
                    action = np.argmax(Qs)

                # Get new state and reward from environment
                s1, reward, done, _ = env.step(action)
                if done:
                    Qs[0, action] = reward
                else:
                    Qs1 = sess.run(Qpred, feed_dict={X: get_one_hot(s1)})  # <== just like Q-table
                    Qs[0, action] = reward + dis * np.max(Qs1)

                sess.run(train, feed_dict={X: get_one_hot(s), Y: Qs})
                rAll += reward
                s = s1

            rList.append(rAll)

    print("Success rate: %.3f" % (sum(rList) / num_episodes))
    plt.bar(range(len(rList)), rList, color="blue")
    plt.show()


init_array(16)
playgame()
```