
Reinforcement Learning 3

Summary of a reinforcement learning video lecture by Professor Sung Kim (김성훈). YouTube lecture: https://www.youtube.com/watch?v=w9GwqPx7LW8&feature=youtu.be



Q-Network

  • Q-Table –> how do we apply it to the real world?
  • e.g. a 100x100 maze, Breakout, ... ==> with a Q-table there are far too many states to enumerate. Infeasible...
  • Instead of a Q-table, use a network: rather than taking the action as an input, the network takes the state S and produces the Q-values for all actions at once (see the sketch below).
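The idea in a minimal sketch (plain numpy, hypothetical sizes): here the "network" is just a weight matrix W, so a single matrix multiply returns Q(s, a) for every action at once instead of a per-(s, a) table lookup.

import numpy as np

n_states, n_actions = 16, 4                              # hypothetical sizes (FrozenLake-like)
W = np.random.uniform(0, 0.01, (n_states, n_actions))    # this matrix is the whole "network"

def one_hot(s):
    return np.identity(n_states)[s:s + 1]                # 1 x n_states row vector for state s

def q_values(s):
    return one_hot(s) @ W                                # 1 x n_actions: Q(s, a) for every action

print(q_values(0))                                       # Q-values of state 0, one per action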


Q-Network training
  • cost(W) = (Ws - y)^2
  • y = r + γ max_a' Q(s', a')
  • Q̂ : prediction, Q* : optimal, θ : network weights
  • Ws = Q̂(s, a; θ) ≈ Q*(s, a)
  • Algorithm 1: Deep Q-learning with Experience Replay (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
      - line 2: initialize the weights randomly.
      - line 3: build the first state s1; ϕ(s1) preprocesses s1.
      - line 5~: with ε-greedy, either take a random action or take the best action learned so far.
      - Set y(j): in a terminal state (i.e., the goal) the target is just the reward r(j); in a non-terminal state it is r(j) + γ max_a' Q(ϕ(j+1), a'; θ). Then perform a gradient descent step on the squared error (see the loop sketch below).
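A rough sketch of that loop structure, assuming a hypothetical q_net(s) that returns the per-action Q-values and a hypothetical train_step(s, target) that takes one gradient step on the squared error; this is only an outline of Algorithm 1, not the paper's implementation (no replay buffer yet).

import numpy as np

def run_episode(env, q_net, train_step, eps, gamma=0.99):
    s = env.reset()
    done = False
    while not done:
        # line 5~: ε-greedy - explore with probability eps, otherwise act greedily
        if np.random.rand() < eps:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(q_net(s)))

        s_next, r, done, _ = env.step(a)

        # Set y(j): reward only at terminal states, bootstrapped target otherwise
        target = q_net(s).copy()
        target[a] = r if done else r + gamma * np.max(q_net(s_next))

        # gradient descent step on (y - Q(s, a; θ))^2
        train_step(s, target)
        s = s_next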


  • Converges to Q* using table lookup representation
  • But diverges using neural networks due to: Correlations between samples, Non-stationary targets
  • ==> solution: DQN = go deep, experience replay, and a separate target network (see the sketch below)
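A rough sketch of where those two fixes sit, again with hypothetical q_net / target_net callables and a train_step as above; it only shows the replay-buffer sampling and the separate, periodically synced target network, not a full DQN.

import random
from collections import deque

replay_buffer = deque(maxlen=50000)            # experience replay: store transitions here

def remember(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def train_from_replay(q_net, target_net, train_step, batch_size=32, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return
    # sample a random minibatch to break the correlation between consecutive samples
    for s, a, r, s_next, done in random.sample(replay_buffer, batch_size):
        target = q_net(s).copy()
        # bootstrap from the frozen target network, not from the network being trained
        target[a] = r if done else r + gamma * max(target_net(s_next))
        train_step(s, target)

# every C steps, copy weights so the targets stay stationary in between, e.g.:
#   target_net.set_weights(q_net.get_weights())   # hypothetical sync helper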


Lab 4: Q-network
  • When there are many discrete states, apply one-hot encoding to turn each state into an input vector.
  • one-hot input: 16 units, output: 4 actions (left, down, right, up)
import gym
import numpy as np
import matplotlib.pyplot as plt
import time
import tensorflow as tf
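# NOTE: this code uses the TensorFlow 1.x API (tf.placeholder / tf.Session);
# on TF 2.x it would need the tf.compat.v1 shims.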

env = gym.make('FrozenLake-v0')
env.render()

one_hot_array = None  # filled in by init_array()

def init_array(x):
    global one_hot_array
    one_hot_array = np.identity(x)  # x by x identity: row i is the one-hot vector for state i

def get_one_hot(x):
    return one_hot_array[x:x + 1]   # keep the 2-D shape (1, x) expected by the placeholder

def playgame():
    input_size = env.observation_space.n #16 one-hot
    output_size = env.action_space.n  #4 actions
    learning_rate = 0.1

    # X: the current state, fed as a 1x16 one-hot row vector
    X = tf.placeholder(shape=[1, input_size], dtype=tf.float32)
    # W: 16 in, 4 out - initialized with small random values
    W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    # Q prediction: X * W  ==> Ws
    # cost(W) = (Ws - y)^2
    Qpred = tf.matmul(X, W)
    # Y: the target label y
    Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32)

    loss = tf.reduce_sum(tf.square(Y-Qpred))   #cost(W)
    train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

    num_episodes = 2000
    dis = 0.99  # discount factor

    rList = []

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)

        for i in range(num_episodes):
            s = env.reset()
            e = 1. / ((i / 50) + 10)
            rAll = 0
            done = False
            local_loss = []

            while not done:
                Qs = sess.run(Qpred, feed_dict={X: get_one_hot(s)})
                if np.random.rand(1) < e:    # decaying E-greedy
                    action = env.action_space.sample()
                else:
                    action = np.argmax(Qs)

                # Get new state and reward from environment
                s1, reward, done, _ = env.step(action)
                if done:
                    # terminal state: the target is just the reward
                    Qs[0, action] = reward
                else:
                    # non-terminal: bootstrap from the network's own prediction, just like the Q-table update
                    Qs1 = sess.run(Qpred, feed_dict={X: get_one_hot(s1)})
                    Qs[0, action] = reward + dis * np.max(Qs1)

                sess.run(train, feed_dict={X: get_one_hot(s), Y: Qs})

                rAll += reward
                s = s1

            rList.append(rAll)

    print("Success rate: %.3f"%(sum(rList)/num_episodes)) 
    plt.bar(range(len(rList)), rList, color="blue")
    plt.show()

init_array(16)
playgame()

kchhero

If you do not act, there is no change.