[Reinforcement Learning / learn article] Policy Gradient (Contextual Bandits)

JaykayChoi 2017. 3. 7. 00:00

Continuing from the previous post, this post covers reinforcement learning with the concept of state added.

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c#.x98hikkmi

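The previous post listed the features that typically characterize reinforcement learning problems. This problem includes the first and third of them:

- Different actions yield different rewards.

- Rewards are delayed over time. Even when two approaches reach the same outcome, the one that takes longer receives a lower reward.

- The reward for an action can depend on the state of the environment.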

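To do this, the problem uses several slot machines instead of one. The state of the environment is which slot machine is currently in use, and the goal is for the agent to learn the best action for any of the slot machines, not just for one of them.

Each slot machine has a different reward probability for each arm, so the agent must learn to adjust its action to the state of the environment.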

# Requires TensorFlow 1.x: tf.contrib (including slim) was removed in TensorFlow 2.x.
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
 
 
class contextual_bandit():
    def __init__(self):
        self.state = 0
        # List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.
        self.bandits = np.array([[0.2, 0, -0.0, -5], [0.1, -5, 1, 0.25], [-5, 5, 5, 5]])
        self.num_bandits = self.bandits.shape[0]
        self.num_actions = self.bandits.shape[1]
 
    def getBandit(self):
        self.state = np.random.randint(0, len(self.bandits))  # Returns a random state for each episode.
        return self.state
 
    def pullArm(self, action):
        # Get a random number.
        bandit = self.bandits[self.state, action]
        result = np.random.randn(1)
        if result > bandit:
            # return a positive reward.
            return 1
        else:
            # return a negative reward.
            return -1
 
 
 
class agent():
    def __init__(self, lr, s_size,a_size):
        #These lines established the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
        state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
        output = slim.fully_connected(state_in_OH,a_size,\
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
        self.output = tf.reshape(output,[-1])
        self.chosen_action = tf.argmax(self.output,0)
 
        #The next six lines establish the training procedure. We feed the reward and chosen action into the network
        #to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output,self.action_holder,[1])
        self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        self.update = optimizer.minimize(self.loss)
 
 
tf.reset_default_graph()  # Clear the Tensorflow graph.
 
cBandit = contextual_bandit()  # Load the bandits.
myAgent = agent(lr=0.001, s_size=cBandit.num_bandits, a_size=cBandit.num_actions)  # Load the agent.
weights = tf.trainable_variables()[0]  # The weights we will evaluate to look into the network.
 
total_episodes = 10000  # Set total number of episodes to train agent on.
total_reward = np.zeros([cBandit.num_bandits, cBandit.num_actions])  # Set scoreboard for bandits to 0.
e = 0.1  # Set the chance of taking a random action.
 
init = tf.global_variables_initializer()
 
# No manual session cleanup is needed; the with-block below closes the session on exit.
 
# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        s = cBandit.getBandit()  # Get a state from the environment.
 
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(cBandit.num_actions)
        else:
            action = sess.run(myAgent.chosen_action, feed_dict={myAgent.state_in: [s]})
 
        reward = cBandit.pullArm(action)  # Get our reward for taking an action given a bandit.
 
        # Update the network.
        feed_dict = {myAgent.reward_holder: [reward], myAgent.action_holder: [action], myAgent.state_in: [s]}
        _, ww = sess.run([myAgent.update, weights], feed_dict=feed_dict)
 
        # Update our running tally of scores.
        total_reward[s, action] += reward
        if i % 500 == 0:
            print("Mean reward for each of the " + str(cBandit.num_bandits) + " bandits: " + str(
                np.mean(total_reward, axis=1)))
        i += 1
 
 
for a in range(cBandit.num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a]) + 1) + " for bandit " + str(
        a + 1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")

Define the contextual_bandit class.

class contextual_bandit():

The constructor defines three slot machines, each with four arms.

    def __init__(self):
        self.state = 0
        # List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.
        self.bandits = np.array([[0.2, 0, -0.0, -5], [0.1, -5, 1, 0.25], [-5, 5, 5, 5]])
        self.num_bandits = self.bandits.shape[0]
        self.num_actions = self.bandits.shape[1]

Next, a function that randomly selects one slot machine.

    def getBandit(self):
        self.state = np.random.randint(0, len(self.bandits))  # Returns a random state for each episode.
        return self.state

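This function pulls the arm given by the action parameter on the current slot machine (self.state) and returns the reward. A sample drawn from the standard normal distribution is compared with the value set for that arm: if the sample is larger, the reward is 1, otherwise it is -1. The lower an arm's value, the more likely it is to pay out. This is the key part of the problem: the reward for an action is not fixed but changes with the state of the environment. The outcome is not arbitrary, though, because each arm has a preset value that determines its reward probability. The agent starts without knowing these values, and learning is the process of discovering them.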

    def pullArm(self, action):
        # Get a random number.
        bandit = self.bandits[self.state, action]
        result = np.random.randn(1)
        if result > bandit:
            # return a positive reward.
            return 1
        else:
            # return a negative reward.
            return -1
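
To make the reward rule concrete, the following sketch (not part of the original post) computes the probability of a +1 reward for every arm. Since pullArm returns 1 when a standard normal sample exceeds the arm's value, that probability is 1 - Φ(value), where Φ is the standard normal CDF. The helper name win_probability is made up for this illustration.

```python
import math

import numpy as np

# Same arm values as in contextual_bandit.__init__.
bandits = np.array([[0.2, 0, -0.0, -5], [0.1, -5, 1, 0.25], [-5, 5, 5, 5]])


def win_probability(value):
    """P(standard normal sample > value) = 1 - Phi(value). Hypothetical helper."""
    return 0.5 * (1.0 - math.erf(value / math.sqrt(2.0)))


# Probability of a +1 reward for every (bandit, arm) pair.
probs = np.vectorize(win_probability)(bandits)

# The best arm of each bandit is the one with the lowest value.
best_arms = np.argmin(bandits, axis=1)

print(np.round(probs, 3))
print(best_arms + 1)  # 1-based arm numbers: [4 2 1]
```

An arm with value 0 pays out half the time, while an arm with value -5 pays out almost always and one with value 5 almost never.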

Define the agent (neural network) class.

class agent():

state_in is the state (which slot machine is in use), and state_in_OH is the one-hot encoding of state_in.

self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
state_in_OH = slim.one_hot_encoding(self.state_in,s_size)

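This part builds the network as a single fully connected layer with no bias, a sigmoid activation, and all weights initialized to one.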

output = slim.fully_connected(state_in_OH,a_size,\
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
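
The effect of this layer can be reproduced in plain NumPy. Because the input is one-hot and there is no bias, multiplying by the weight matrix simply picks out the row that belongs to the current slot machine, so each row holds the arm scores for one machine. This is a rough sketch rather than the slim implementation, and one_hot and forward are names made up for illustration.

```python
import numpy as np


def one_hot(index, size):
    """Return a 1 x size row vector with a 1 at `index` (illustrative helper)."""
    vec = np.zeros((1, size))
    vec[0, index] = 1.0
    return vec


def forward(state, weights):
    """One-hot input -> bias-free dense layer -> sigmoid, flattened to 1-D."""
    logits = one_hot(state, weights.shape[0]) @ weights
    return (1.0 / (1.0 + np.exp(-logits))).reshape(-1)


# 3 slot machines x 4 arms, initialized to ones as in the post.
W = np.ones((3, 4))
W[1] = [0.5, 2.0, 1.0, 1.0]  # made-up values: pretend machine 1 has learned something

out = forward(1, W)
print(out)             # sigmoid of row 1 of W
print(np.argmax(out))  # index of the highest-scoring arm
```

This row-per-machine layout is why the final evaluation can read the preferred arm for each machine straight from the rows of the weight matrix.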

This part selects an action from the network's output. Because the network applies a sigmoid, every output lies between 0 and 1, and the action is the index of the largest one.

self.output = tf.reshape(output,[-1])
self.chosen_action = tf.argmax(self.output,0)

Training works the same way as in the previous post.

self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
self.responsible_weight = tf.slice(self.output,self.action_holder,[1])
self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
self.update = optimizer.minimize(self.loss)
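
To see what this loss does, here is a hand-derived version of one update step in NumPy (an illustrative sketch, not the TensorFlow code path). For the chosen arm with weight w, the loss is -log(sigmoid(w)) * reward, and its derivative with respect to w is -(1 - sigmoid(w)) * reward. Gradient descent therefore raises the weight after a +1 reward and lowers it after a -1 reward. The function names are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss_fn(w, reward):
    """Loss for the chosen arm: -log(sigmoid(w)) * reward."""
    return -np.log(sigmoid(w)) * reward


def loss_grad(w, reward):
    """Analytic derivative of loss_fn with respect to w."""
    return -(1.0 - sigmoid(w)) * reward


def sgd_step(w, reward, lr):
    """One plain gradient-descent step on the chosen arm's weight."""
    return w - lr * loss_grad(w, reward)


w = 1.0     # initial weight (the post initializes all weights to one)
lr = 0.001  # same learning rate as the post

w_after_win = sgd_step(w, reward=1.0, lr=lr)
w_after_loss = sgd_step(w, reward=-1.0, lr=lr)
print(w_after_win, w_after_loss)

# Sanity check: compare the analytic gradient with a finite difference.
eps = 1e-6
numeric = (loss_fn(w + eps, 1.0) - loss_fn(w - eps, 1.0)) / (2 * eps)
print(numeric, loss_grad(w, 1.0))
```

Only the weight of the chosen arm for the current slot machine receives a non-zero gradient, since the sliced output depends on no other weight.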

Create the environment and the agent. tf.trainable_variables returns the trainable variables in the graph. Its first element here is the weight matrix of the layer.

cBandit = contextual_bandit()  # Load the bandits.
myAgent = agent(lr=0.001, s_size=cBandit.num_bandits, a_size=cBandit.num_actions)  # Load the agent.
weights = tf.trainable_variables()[0]  # The weights we will evaluate to look into the network.

After a slot machine is selected at random, an action (arm) is chosen for that machine. As before, the e-greedy method is used.

        s = cBandit.getBandit()  # Get a state from the environment.
 
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(cBandit.num_actions)
        else:
            action = sess.run(myAgent.chosen_action, feed_dict={myAgent.state_in: [s]})
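
The selection rule on its own can be sketched without TensorFlow. The function below is a hypothetical stand-alone helper, and the scores array stands in for the network's output with made-up values. With probability epsilon it explores a random arm, otherwise it takes the best-scoring one.

```python
import numpy as np


def epsilon_greedy(scores, epsilon, rng):
    """Pick a random arm with probability epsilon, else the best-scoring arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))
    return int(np.argmax(scores))


rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
scores = np.array([0.2, 0.9, 0.4, 0.1])  # made-up network outputs

picks = [epsilon_greedy(scores, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)  # roughly 1 - epsilon + epsilon / 4
```

The best arm is picked slightly more often than 1 - epsilon because a random draw can also land on it.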

Get the reward for the chosen action.

reward = cBandit.pullArm(action)  # Get our reward for taking an action given a bandit.

Update the weights to train the network.

feed_dict = {myAgent.reward_holder: [reward], myAgent.action_holder: [action], myAgent.state_in: [s]}
_, ww = sess.run([myAgent.update, weights], feed_dict=feed_dict)

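The result can be read directly from the learned weights: for each slot machine, take the index of the largest weight. The code compares it with the index of the smallest value in cBandit.bandits, which is the arm most likely to pay out.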

for a in range(cBandit.num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a]) + 1) + " for bandit " + str(
        a + 1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")
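
For readers without a TensorFlow 1.x installation, the whole experiment can be approximated in NumPy alone. The sketch below is an assumption-laden restatement, not the post's code: it replaces the slim layer with a plain weight matrix and applies the hand-derived gradient of -log(sigmoid(w)) * reward to the chosen weight. The hyperparameters are the ones used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Arm values copied from the post; a lower value means a higher win probability.
bandits = np.array([[0.2, 0, -0.0, -5], [0.1, -5, 1, 0.25], [-5, 5, 5, 5]])
num_bandits, num_actions = bandits.shape

weights = np.ones((num_bandits, num_actions))  # same initialization as the post
lr, epsilon, total_episodes = 0.001, 0.1, 10000


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


for _ in range(total_episodes):
    s = rng.integers(num_bandits)  # random state (slot machine)

    # Epsilon-greedy choice over the sigmoid outputs for this state.
    if rng.random() < epsilon:
        action = rng.integers(num_actions)
    else:
        action = np.argmax(sigmoid(weights[s]))

    # Same reward rule as pullArm.
    reward = 1.0 if rng.standard_normal() > bandits[s, action] else -1.0

    # Gradient descent on -log(sigmoid(w)) * reward for the chosen weight.
    w = weights[s, action]
    weights[s, action] = w + lr * (1.0 - sigmoid(w)) * reward

predicted = np.argmax(weights, axis=1)
for b in range(num_bandits):
    verdict = "right" if predicted[b] == np.argmin(bandits[b]) else "wrong"
    print(f"bandit {b + 1}: agent prefers arm {predicted[b] + 1} ({verdict})")
```

Results depend on the random seed. With a learning rate this small, the agent may not identify the best arm for every machine within 10000 episodes.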
 

License: Attribution, NonCommercial, NoDerivatives
