May 11th, 2017 - CREATED
Reinforcement Learning 2
Summary of a reinforcement learning video lecture by Professor Sung Kim (김성훈). YouTube lecture: https://www.youtube.com/watch?v=MQ-3QScrFSI&feature=youtu.be
Q-learning: exploit & exploration and discounted reward
exploit: use the values you already know. exploration: try new paths. e.g., choosing a restaurant: on weekdays you go to places you know are good from experience; on weekends you try somewhere new.
method 1: E-greedy
e = 0.1
if rand < e:
    a = random          # with probability 10%, explore a new path
else:
    a = argmax(Q(s,a))  # with probability 90%, exploit the best-known path
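A minimal runnable sketch of this selection rule in Python (the Q-values and e are illustrative assumptions, not taken from the lecture):

import numpy as np

e = 0.1                                 # exploration probability
q_row = np.array([0.5, 0.6, 0.3, 0.2])  # Q(s, a) for one state, illustrative values

if np.random.rand() < e:
    a = np.random.randint(len(q_row))   # explore: random action
else:
    a = int(np.argmax(q_row))           # exploit: best-known action
print(a)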
method 2: decaying E-greedy
As the number of episodes grows, e shrinks, so the choice gradually puts more weight on experience.
for i in range(1000):
    e = 0.1 / (i+1)
    if random(1) < e:
        a = random          # explore a new path with probability e (which decays)
    else:
        a = argmax(Q(s,a))  # otherwise exploit the best-known path
method 3: add random noise
a = argmax(Q(s,a) + random_values)
e.g. a = argmax([0.5 0.6 0.3 0.2 0.5] + [0.1 0.2 0.7 0.3 0.1])
As the number of iterations grows, the noise shrinks, so the choice gradually puts more weight on experience. It is not completely random: the 2nd- or 3rd-best actions get most of the extra chances.
for i in range(1000):
    a = argmax(Q(s,a) + random_values / (i+1))
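A runnable sketch of the noise approach (the Q-values are the illustrative ones above); as i grows, the noise term shrinks and the choice converges to the plain argmax:

import numpy as np

q_row = np.array([0.5, 0.6, 0.3, 0.2, 0.5])         # Q(s, a) for one state, illustrative values

for i in range(5):
    noise = np.random.randn(len(q_row)) / (i + 1)   # decaying random noise
    a = int(np.argmax(q_row + noise))               # noisy greedy selection
    print(i, a)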
discounted future reward
- e.g., path finding on the FrozenLake map: how do we find the shortest route to the goal?
- idea: discounted reward  Q(s,a) ← r + γ·max Q(s',a')   (γ: gamma, the discount factor)
- future reward: R = r(1) + r(2) + … + r(n)
- future reward from time t: R(t) = r(t) + r(t+1) + … + r(n)
- discounted future reward: R(t) = r(t) + γ·r(t+1) + γ^2·r(t+2) + … + γ^(n-t)·r(n) = r(t) + γ·R(t+1)
- Q(s,a) = r + γ·max Q(s',a')
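A quick numeric check of the recursion R(t) = r(t) + γ·R(t+1), using a made-up reward sequence and γ = 0.9 (pure illustration, not from the lecture):

g = 0.9
rewards = [0, 0, 1]        # illustrative rewards r(t), r(t+1), r(t+2)

# direct sum: r(t) + g*r(t+1) + g^2*r(t+2)
direct = sum((g ** k) * r for k, r in enumerate(rewards))

# recursive form: R(t) = r(t) + g*R(t+1), evaluated from the end backwards
R = 0.0
for r in reversed(rewards):
    R = r + g * R

print(direct, R)           # both are (approximately) 0.81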
Lab 3: E&E + discounted reward
Only a few spots in the Lab 2 code need to change.
def playgame():
    # (snip)...
    num_episodes = 2000
    dis = 0.99                                        # discount factor
    rList = []
    for i in range(num_episodes):
        # (snip)...
        e = 1. / ((i / 100) + 1)                      # decaying E-greedy
        while not done:
            if np.random.rand(1) < e:                 # decaying E-greedy: explore
                action = env.action_space.sample()
            else:
                action = rargmax(Q[state, :])         # exploit
            # alternative: add random noise instead of E-greedy
            # action = rargmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i+1))
            new_state, reward, done, _ = env.step(action)
            Q[state, action] = reward + dis * np.max(Q[new_state, :])   # discounted reward
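The snippet above calls rargmax, the random tie-breaking argmax helper used in Lab 2; if it is not already defined in your Lab 2 code, a minimal version (assuming a 1-D vector of Q-values) looks like this:

import numpy as np

def rargmax(vector):
    # argmax that breaks ties randomly among all maximal entries
    m = np.amax(vector)
    indices = np.nonzero(vector == m)[0]
    return np.random.choice(indices)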
Output:
Success rate: 0.817
Final Q-Table Values
LEFT DOWN RIGHT UP
[[ 0.94148015 0.95099005 0.95099005 0.94148015]
[ 0.94148015 0. 0.96059601 0.95099005]
[ 0.95099005 0.970299 0.95099005 0.96059601]
[ 0.96059601 0. 0.95099005 0.95099005]
[ 0.95099005 0.96059601 0. 0.94148015]
[ 0. 0. 0. 0. ]
[ 0. 0.9801 0. 0.96059601]
[ 0. 0. 0. 0. ]
[ 0.96059601 0. 0.970299 0.95099005]
[ 0.96059601 0.9801 0.9801 0. ]
[ 0.970299 0.99 0. 0.970299 ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0.9801 0.99 0.970299 ]
[ 0.9801 0.99 1. 0.9801 ]
[ 0. 0. 0. 0. ]]
Q-learning on Nondeterministic Worlds
- Windy Frozen Lake: the agent does not always move in the direction it intends; the world is nondeterministic.
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': True}
)
env = gym.make('FrozenLake-v0')   # FrozenLake-v0 is slippery (nondeterministic) by default
Success rate: 0.037
Final Q-Table Values
LEFT DOWN RIGHT UP
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 7.35409581e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 9.41480149e-01]
[ 0.00000000e+00 6.71809337e-05 0.00000000e+00 7.35409581e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 7.28055486e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 8.95338254e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 9.90000000e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 9.90000000e-01 9.04382075e-01 0.00000000e+00]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.09686303e-03]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
(Up)
SFFF
FHFH
FFFH
HFFG
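The low success rate comes from the slippery dynamics themselves. A quick check (a sketch, assuming the standard gym FrozenLake-v0): issuing the same action from a freshly reset environment does not always lead to the same next state.

import gym

env = gym.make('FrozenLake-v0')   # slippery by default
RIGHT = 2
for _ in range(5):
    env.reset()
    new_state, reward, done, _ = env.step(RIGHT)   # always ask to move RIGHT
    print(new_state)                               # may differ from run to run due to slipping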
- Using the Q-update from the deterministic world, it is essentially impossible to get through the (slippery) Frozen Lake.
- solution: listen to Q (a little bit), update Q a little bit (learning rate)
- like our life mentors
- stochastic world, learning rate α = 0.1
- Q(s,a) ← (1-α)·Q(s,a) + α·[r + γ·max Q(s',a')]
- ==> the current estimate keeps weight 0.9; the new target gets weight 0.1
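For example (illustrative numbers only): with α = 0.1, an old estimate Q(s,a) = 0.5 and a new target r + γ·max Q(s',a') = 1.0 blend into 0.9·0.5 + 0.1·1.0 = 0.55, so each update only nudges Q toward the target instead of overwriting it.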
import gym
import numpy as np
import matplotlib.pyplot as plt

env = gym.make('FrozenLake-v0')
env.render()

def playgame():
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    num_episodes = 2000
    alpha = .85                     # learning rate
    dis = .99                       # discount factor
    rList = []
    for i in range(num_episodes):
        state = env.reset()
        rAll = 0
        done = False
        while not done:
            # explore via decaying random noise added to the Q-row
            action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i+1))
            new_state, reward, done, _ = env.step(action)
            # blend the old estimate with the new target (learning rate alpha)
            Q[state, action] = (1-alpha)*Q[state, action] + alpha*(reward + dis*np.max(Q[new_state, :]))
            rAll += reward
            state = new_state
            #print(Q[state,:])
        rList.append(rAll)
    print("Success rate: %.3f" % (sum(rList)/num_episodes))
    print("Final Q-Table Values")
    print("LEFT DOWN RIGHT UP")
    print(Q)
    plt.bar(range(len(rList)), rList, color="blue")
    plt.show()

playgame()
Output:
Success rate: 0.403
Final Q-Table Values
LEFT DOWN RIGHT UP
[[ 1.05910110e-02 4.93507661e-01 1.05277668e-02 1.06058317e-02]
[ 3.48785903e-03 1.39710249e-03 7.93530069e-04 4.14057172e-01]
[ 5.03871710e-03 1.06150772e-02 1.07011563e-02 2.62210304e-01]
[ 0.00000000e+00 8.84277858e-05 1.11683316e-03 1.60458150e-01]
[ 6.25166765e-01 2.49851733e-05 2.18491922e-04 3.31881813e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 3.84386875e-06 4.67931675e-06 1.80781494e-01 1.36884751e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 7.07752675e-03 6.65977519e-04 1.60899053e-03 7.17885723e-01]
[ 3.07134238e-03 7.86275603e-01 3.17231431e-03 1.23126694e-03]
[ 4.72413300e-01 9.57025582e-04 0.00000000e+00 8.98862864e-04]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 3.06741853e-03 9.67593165e-01 5.55177385e-04]
[ 0.00000000e+00 0.00000000e+00 9.97460246e-01 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]