[Reinforcement Learning / learn article] Policy Gradient (Two-armed Bandit)

JaykayChoi 2017. 2. 27. 23:51

Following Part 0 (Q-Learning Agents), this post covers Part 1: Two-Armed Bandit.

https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149#.16gx1aycu

A two-armed bandit is, literally, a bandit with two arms; in other words, a slot machine. It is amusing that a slot machine is described as a bandit.

The post uses this simple slot machine game to explain a method called Policy Gradient.

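Problems tackled with reinforcement learning generally have the following characteristics.

- Different actions yield different rewards.

- Rewards are delayed over time. Even when two approaches reach the same result, the one that takes longer receives a lower reward.

- The reward for an action can depend on the state of the environment.

These characteristics are what make reinforcement learning harder. The post uses a game that has only the first of them.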

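The rules are as follows: each arm of the slot machine pays out with a fixed probability. For example, one arm might give a reward of 1 with 10% probability while another gives a reward of 1 with 1% probability. When an arm does not pay out, the reward is -1. Given n such arms, the goal is to find out which arm yields the most reward.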
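
The rules above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the original post: the make_bandit helper and the seed are assumptions made here, and the probabilities are the 10% and 1% figures from the example in the text.

```python
import random

def make_bandit(win_probabilities, seed=None):
    """Return a pull(arm) function for a hypothetical n-armed bandit.

    win_probabilities[i] is the chance that arm i pays out.
    A payout is a reward of 1; anything else is a reward of -1.
    """
    rng = random.Random(seed)

    def pull(arm):
        return 1 if rng.random() < win_probabilities[arm] else -1

    return pull

# Two arms: 10% and 1% chance of a payout, as in the example above.
pull = make_bandit([0.10, 0.01], seed=0)
rewards = [pull(0) for _ in range(1000)]
average_reward = sum(rewards) / len(rewards)
```

With a 10% payout chance the expected reward per pull is 0.1 * 1 + 0.9 * (-1) = -0.8, so average_reward should land near that value.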

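With the approach used earlier for Q-Learning, we would choose an action (an arm), perform it (step) to obtain a state and a reward, feed those values into the Q-function, and train it to minimize the loss ∑(Q-target - Q)², from which a policy is derived. Methods like this, which learn value functions, are called value-based.

By contrast, the policy-based method used here puts the policy itself directly into the loss expression and trains in the direction that minimizes that loss.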
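
To make the contrast concrete, a value-based learner for this bandit might keep a running estimate of each arm's average reward and act greedily on it. The sketch below is an assumption for illustration (an incremental sample average, not the Q-network from Part 0), and the observed rewards are made up. The policy-based approach in this post skips the estimate and adjusts the action weights directly.

```python
def update_value_estimate(values, counts, arm, reward):
    """Incremental sample average: move the estimate toward the new reward."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

values = [0.0, 0.0]   # estimated average reward per arm
counts = [0, 0]       # number of pulls per arm

# Hypothetical observed (arm, reward) pairs.
for arm, reward in [(0, 1), (0, -1), (1, -1), (0, 1)]:
    update_value_estimate(values, counts, arm, reward)

# The derived policy: pick the arm with the highest estimated value.
greedy_arm = max(range(len(values)), key=lambda a: values[a])
```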

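The loss used here is computed as follows.

Loss = -log(π) * A

A is the advantage, a measure of how much better an action is than some baseline. Here the baseline is taken to be 0, so A is simply the reward.

π is the policy. As described above, the policy, here the array of weights, goes directly into the expression to compute the loss. The log term makes this a form of Monte-Carlo Policy Gradient, one of several policy gradient methods.

Minimizing this loss increases the weight of actions that yield a high reward (here, 1) and decreases the weight of actions that yield a negative reward.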
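
Why minimizing the loss raises the weight of rewarded actions can be checked directly. For a single weight w standing in for π, the derivative of -log(w) * A with respect to w is -A / w, so a gradient descent step changes w by +learning_rate * A / w: up when A is positive, down when A is negative. The following sketch verifies that derivative numerically; the function names and sample values are chosen here for illustration.

```python
import math

def loss(w, advantage):
    """Policy gradient loss for one chosen action with weight w."""
    return -math.log(w) * advantage

def analytic_gradient(w, advantage):
    """d(loss)/dw = -advantage / w."""
    return -advantage / w

def numeric_gradient(w, advantage, h=1e-6):
    """Central finite difference approximation of d(loss)/dw."""
    return (loss(w + h, advantage) - loss(w - h, advantage)) / (2 * h)

w, learning_rate = 1.0, 0.001
for advantage in (1.0, -1.0):
    grad = analytic_gradient(w, advantage)
    new_w = w - learning_rate * grad  # one gradient descent step
    print(advantage, grad, numeric_gradient(w, advantage), new_w)
```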

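# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# Under TensorFlow 2.x, tf.compat.v1 with disable_v2_behavior() exposes the same calls.
import tensorflow as tf
import numpy as np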
 
 
#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2,0,-0.2,-5]
num_bandits = len(bandits)
def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1
 
 
 
tf.reset_default_graph()
 
#These two lines established the feed-forward part of the network. This does the actual choosing.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)
 
#The next six lines establish the training procedure. We feed the reward and chosen action into the network
#to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
 
 
 
total_episodes = 1000  # Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits)  # Set scoreboard for bandits to 0.
e = 0.1  # Set the chance of taking a random action.
 
init = tf.global_variables_initializer()
 
# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
 
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
 
        reward = pullBandit(bandits[action])  # Get our reward from picking one of the bandits.
 
        # Update the network.
        _, resp, ww = sess.run([update, responsible_weight, weights],
                               feed_dict={reward_holder: [reward], action_holder: [action]})
 
        # Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i += 1
print("The agent thinks bandit " + str(np.argmax(ww) + 1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

First we define the slot machine. The code defines a machine with four arms, and pullBandit is written so that a lower value means a higher chance of reward, so arm 4 is the correct answer.

bandits = [0.2,0,-0.2,-5]
num_bandits = len(bandits)

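Next we define the pullBandit function. It draws a random value from a normal distribution with mean 0 and returns a reward of 1 if that value is greater than the arm's value passed in as the parameter, and -1 otherwise.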

def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1
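
Because the draw comes from a standard normal distribution, each arm's chance of a positive reward is P(Z > bandit) = 1 - Φ(bandit), where Φ is the standard normal CDF. The sketch below computes that probability with the standard library's math.erf; the helper name win_probability is introduced here and is not part of the original code.

```python
import math

def win_probability(threshold):
    """P(Z > threshold) for Z ~ N(0, 1), via the error function."""
    cdf = 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))
    return 1.0 - cdf

bandits = [0.2, 0, -0.2, -5]
for index, threshold in enumerate(bandits):
    p = win_probability(threshold)
    expected_reward = p * 1 + (1 - p) * (-1)
    print(f"arm {index + 1}: P(reward=1) = {p:.4f}, expected reward = {expected_reward:+.4f}")
```

The probabilities come out at roughly 42%, 50%, 58%, and just under 100%, which is why arm 4 is the best choice and why arms 1 to 3 are hard to tell apart from a few pulls.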

This is the feed-forward part of the network.

With the definitions below, weights has shape [4] and is initialized to [1, 1, 1, 1], and tf.argmax(weights,0) returns the index of the largest of the four weights as chosen_action. The second argument of argmax is the axis (dimension) along which the largest element is searched for.

weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)
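
One consequence of initializing every weight to 1 is that the first greedy choice is decided by tie-breaking. The plain-Python stand-in below returns the first index of the maximum, which is how NumPy's argmax is documented to break ties. tf.argmax may not guarantee which index wins a tie, so treat the tie behaviour shown here as an assumption rather than a description of TensorFlow.

```python
def argmax(values):
    """Index of the largest element; the first one wins on ties."""
    best_index = 0
    for index, value in enumerate(values):
        if value > values[best_index]:
            best_index = index
    return best_index

weights = [1.0, 1.0, 1.0, 1.0]
first_choice = argmax(weights)     # all four weights tie

weights[2] = 1.001                 # pretend arm 3 was rewarded once
later_choice = argmax(weights)     # arm 3 now has the largest weight
```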

Next is the code for training with backpropagation.

responsible_weight picks out, from weights, the weight for the chosen action, and loss is computed from its log as described above.

reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)

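Actions are chosen with the e-greedy method: with probability e a random arm is pulled, and otherwise the arm with the largest weight is pulled.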

        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
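
The same selection rule can be written as a standalone function in plain Python. The function name, the explicit rng argument, and the example weights are choices made for this sketch so that it runs without a TensorFlow session.

```python
import random

def choose_action(weights, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))                      # random arm
    return max(range(len(weights)), key=lambda a: weights[a])   # greedy arm

rng = random.Random(0)
weights = [1.0, 0.9, 1.2, 1.1]   # hypothetical weights; index 2 is the greedy arm
choices = [choose_action(weights, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = choices.count(2) / len(choices)
```

With e = 0.1 and four arms, the greedy arm is expected to be picked about 92.5% of the time: 90% from exploitation plus a quarter of the 10% exploration.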

The reward is obtained from the pullBandit function defined earlier.

reward = pullBandit(bandits[action])

This is the core part, where the network is updated. The weight is updated using the action chosen above and the reward received for it.

_, resp, ww = sess.run([update, responsible_weight, weights],
                               feed_dict={reward_holder: [reward], action_holder: [action]})
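
Putting the pieces together, the whole training loop can also be reproduced without TensorFlow, since the gradient of -log(w) * reward with respect to the chosen weight is simply -reward / w. The following is a minimal sketch under that assumption, not the original author's code: the train function, its parameters, the seed, and the use of random.gauss in place of np.random.randn are all introduced here.

```python
import random

def pull_bandit(threshold, rng):
    """Reward 1 if a standard normal draw exceeds the threshold, else -1."""
    return 1 if rng.gauss(0.0, 1.0) > threshold else -1

def train(bandits, episodes=1000, epsilon=0.1, learning_rate=0.001, seed=None):
    rng = random.Random(seed)
    weights = [1.0] * len(bandits)       # same initial weights as tf.ones
    total_reward = [0] * len(bandits)    # scoreboard per arm

    for _ in range(episodes):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = rng.randrange(len(bandits))
        else:
            action = max(range(len(bandits)), key=lambda a: weights[a])

        reward = pull_bandit(bandits[action], rng)

        # Gradient descent on loss = -log(w) * reward, where d(loss)/dw = -reward / w.
        gradient = -reward / weights[action]
        weights[action] -= learning_rate * gradient

        total_reward[action] += reward

    return weights, total_reward

weights, total_reward = train([0.2, 0, -0.2, -5], seed=0)
best_arm = max(range(len(weights)), key=lambda a: weights[a]) + 1
print("The agent thinks bandit", best_arm, "is the most promising.")
```

As with the TensorFlow version, 1000 episodes at this learning rate may not be enough for arm 4 to end up with the largest weight on every run, because arm 3 also has a positive reward on average.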
