
Reinforcement Learning

Summary of Professor Sung Kim's video lectures on reinforcement learning. YouTube lectures: https://hunkim.github.io/ml/



Nature of learning

  • Basic idea: learn from past experience, learn through interaction
  • Core elements: state, actor (agent), environment (see the interaction-loop sketch after this list)
  • DeepMind's collaboration with Google
  • Applications: robotics, resource allocation, inventory management, finance
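
A minimal sketch of this interaction loop, assuming the stock FrozenLake-v0 environment and the old gym step API (4-tuple return) used in the labs below; a random policy stands in for the agent:

import gym

# Agent-environment loop: observe the state, act, receive reward and next state.
env = gym.make('FrozenLake-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # random action in place of a learned policy
    state, reward, done, info = env.step(action)  # environment responds with reward + next state
env.close()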

Lab 1: OpenAI Gym Game 1, FrozenLake

  • Installation:
    $ sudo apt install cmake
    $ sudo -H pip install gym
    $ sudo -H pip install "gym[atari]"   # Atari game platform (quoted so the shell does not expand the brackets)
    
  • TensorFlow + Docker, CPU version:
    $ docker pull suker/tensorflow:CPU
    $ docker run -d --name tenf-CPU -p 8887:8888 -p 6005:6006 -v $PWD:/notebooks -e PASSWORD=nexell suker/tensorflow:CPU
    $ docker exec -d tenf-CPU /run_tensorboard.sh
    

    GPU version:

    $ docker pull suker/tensorflow:GPU
    $ nvidia-docker run -d --name tenf-GPU -p 8888:8888 -p 6006:6006 -v $PWD:/notebooks -e PASSWORD=nexell suker/tensorflow:GPU
    $ docker exec -d tenf-GPU /run_tensorboard.sh
    
  • Test gym: play FrozenLake manually with the arrow keys (script below)
import gym
from gym.envs.registration import register
import sys, tty, termios

class _Getch:
    """Read one keypress in raw mode so arrow-key escape sequences arrive intact."""
    def __call__(self):
        fd = sys.stdin.fileno()
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setraw(sys.stdin.fileno())
            ch = sys.stdin.read(3)  # arrow keys are 3-byte escape sequences
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
        return ch

inkey = _Getch()

# FrozenLake action indices
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

# Map terminal escape sequences to FrozenLake actions
arrow_keys = {
    '\x1b[A': UP,
    '\x1b[B': DOWN,
    '\x1b[C': RIGHT,
    '\x1b[D': LEFT}

def gameplay():
    # Register a deterministic 4x4 FrozenLake (is_slippery=False)
    register(
        id='FrozenLake-v3',
        entry_point='gym.envs.toy_text:FrozenLakeEnv',
        kwargs={'map_name': '4x4', 'is_slippery': False}
    )

    env = gym.make('FrozenLake-v3')
    env.render()

    while True:
        key = inkey()
        if key not in arrow_keys:
            print("Game aborted!")
            break

        action = arrow_keys[key]
        state, reward, done, info = env.step(action)
        env.render()
        print("State: ", state, "Action: ", action, "Reward: ", reward, "Info: ", info)

        if done:
            print("Finished with reward", reward)
            break

gameplay()
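
Note: run this script in a real terminal rather than inside a Jupyter notebook; _Getch switches stdin to raw mode to read the 3-byte arrow-key escape sequences, which requires a tty.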


Q-Learning

https://www.youtube.com/watch?v=Vd-gmo-qO5E&feature=youtu.be

  • state + action -> Q -> quality (expected reward): Q(state, action)

    Q(s1, LEFT) = 0,  Q(s1, RIGHT) = 0.5,  Q(s1, UP) = 0,  Q(s1, DOWN) = 0.3;  Max Q = max_a' Q(s, a')

  • Policy: pick the action with the maximum Q value.
  • How is Q trained?

    Assume (believe) that Q in the next state s' exists. The reward r received for taking action a in state s then gives:

        Q(s,a) = r + max_a' Q(s',a')

    For an episode s0, a0, r1, s1, a1, r2, ..., s(n-1), a(n-1), rn, sn (terminal state), the return is:

        R = r1 + r2 + r3 + ... + rn
        R(t) = r(t) + r(t+1) + r(t+2) + ... + rn = r(t) + R(t+1)
        R(t)* = r(t) + max R(t+1)    (optimal return)

  • Dummy Q-learning algorithm

    Initialize the table of all (s, a) entries to 0, then repeat forever: take an action -> receive r -> observe s' -> update the table. (The single-line core of this update is sketched below; Lab 2 turns it into a full program.)
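
A minimal sketch of the table and its update, assuming a numpy Q-table of 16 states x 4 actions as in the 4x4 FrozenLake used below (Lab 2 wraps exactly this update in an episode loop):

import numpy as np

n_states, n_actions = 16, 4            # 4x4 FrozenLake grid
Q = np.zeros([n_states, n_actions])    # initialize every (s, a) entry to 0

def update(s, a, r, s_next):
    # dummy Q-learning update: Q(s,a) = r + max_a' Q(s',a')
    Q[s, a] = r + np.max(Q[s_next, :])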


Lab 2: Q-learning (table)

import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
import random as pr

def rargmax(vector):
    """Argmax that breaks ties randomly.

    vector shape: [0 0 0 0]
    """
    m = np.amax(vector)                   # largest value in vector
    indices = np.nonzero(vector == m)[0]  # indices where vector == m, i.e. the max indices
    return pr.choice(indices)             # several indices may tie for the max,
                                          # so pick one of them at random

# gym package: see the source at
# /usr/local/lib/python3.5/dist-packages/gym/envs/registration.py
register(
        id='FrozenLake-v3',
        entry_point='gym.envs.toy_text:FrozenLakeEnv',
        kwargs={'map_name' : '4x4', 'is_slippery': False}
        )

env = gym.make('FrozenLake-v3')
env.render()

def playgame():
    """
    Q shape [[] [] [] []
             [] [] [] []
             [] [] [] []
             [] [] [] []]
    """
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    num_episodes = 2000

    rList = []
    for i in range(num_episodes):
        state = env.reset()
        rAll = 0
        done = False

        while not done:
            action = rargmax(Q[state, :])                  # greedy action (random tie-break)
            new_state, reward, done, _ = env.step(action)

            # dummy Q-learning update: Q(s,a) = r + max_a' Q(s',a')
            Q[state, action] = reward + np.max(Q[new_state, :])

            rAll += reward
            state = new_state

        rList.append(rAll)

    print("Success rate: %.3f"%(sum(rList)/num_episodes))
    print("Final Q-Table Values")
    print("LEFT DOWN RIGHT UP")
    print(Q)
    plt.bar(range(len(rList)), rList, color="blue")
    plt.show()

playgame()
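
A quick check of rargmax's tie-breaking on a hypothetical Q-row: indices 1 and 2 below are both maximal, so repeated calls return one of them at random. The same mechanism makes the all-zero initial table act like a random policy, letting the agent stumble onto the goal at least once.

print(rargmax(np.array([0, 1, 1, 0])))   # prints 1 or 2, chosen at random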

kchhero

If you do not act, there is no change.