[Reinforcement Learning / learn article] Action-Selection Strategies for Exploration

JaykayChoi 2017. 5. 5. 12:34

This post introduces several methods for selecting actions. The e-greedy approach that we have generally used so far is one of them.

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf

1. Greedy Approach

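This method simply selects the action with the highest estimated value in the current state. Its drawback is that the choice relies on experience gathered before the agent has fully learned, so it can keep selecting an action that is not the best in the long run, and the other actions never get a chance to be learned. It can be implemented simply with the np.argmax function.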

#Use this for action selection.
#Q_out refers to activation from final layer of Q-Network.
Q_values = sess.run(Q_out, feed_dict={inputs:[state]})
action = np.argmax(Q_values)
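
The drawback described above can be illustrated with a toy example. The following is a minimal plain-Python sketch, not part of the original article: the two-armed bandit, its fixed rewards, and the names greedy_action and run_greedy_bandit are all hypothetical and chosen only for illustration.

```python
def greedy_action(q_values):
    """Return the index of the largest estimate (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_greedy_bandit(true_rewards, steps=100, lr=0.5):
    """Act purely greedily on a bandit with fixed (hypothetical) rewards."""
    q = [0.0] * len(true_rewards)      # value estimates, all start at zero
    counts = [0] * len(true_rewards)   # how often each arm was pulled
    for _ in range(steps):
        a = greedy_action(q)
        r = true_rewards[a]
        q[a] += lr * (r - q[a])        # move the estimate toward the reward
        counts[a] += 1
    return q, counts


# Arm 1 pays far more than arm 0, but greedy selection never tries it.
q, counts = run_greedy_bandit([0.1, 1.0])
print(counts)
```

Because the first pull raises the estimate of arm 0 above zero, the greedy rule keeps choosing arm 0 and the estimate for arm 1 is never updated.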

2. Random Approach

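This method simply always selects an action at random. Naturally, since the choice ignores the learned estimates, the agent's behavior gains nothing from learning.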

#Assuming we are using OpenAI gym environment.
action = env.action_space.sample()
 
#Otherwise:
#total_actions = ??
action = np.random.randint(0,total_actions)

3. e-Greedy Approach

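This is the method we have been using all along. Early on, actions are often chosen at random, and the share of random choices is reduced later as learning progresses. This addresses the drawback of the Greedy Approach.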

e = 0.1
if np.random.rand(1) < e:
    action = env.action_space.sample()
else:
    Q_dist = sess.run(Q_out,feed_dict={inputs:[state]})
    action = np.argmax(Q_dist)
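
The snippet above uses a fixed e, while the text describes reducing the random choices over time. The following is a minimal plain-Python sketch of a linear schedule, under the assumption that e starts at 1.0 and ends at 0.1; the function names, the default annealing_steps value, and the rng parameter are hypothetical.

```python
import random


def annealed_epsilon(step, start_e=1.0, end_e=0.1, annealing_steps=1000):
    """Linearly decay epsilon from start_e to end_e, then hold it at end_e."""
    fraction = min(step / annealing_steps, 1.0)
    return start_e + fraction * (end_e - start_e)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Early steps explore almost always; late steps mostly exploit.
rng = random.Random(0)
early = epsilon_greedy([0.2, 0.9, 0.1], annealed_epsilon(0), rng)
late = epsilon_greedy([0.2, 0.9, 0.1], annealed_epsilon(5000), rng)
```

With epsilon at 0 the function reduces to the Greedy Approach, and with epsilon at 1 it reduces to the Random Approach.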

4. Boltzmann Approach

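The Boltzmann Approach uses all of the value estimates that the network produces for the actions. The action itself is chosen at random, but an action with a higher estimate is chosen with a correspondingly higher probability.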

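To do this, the estimates for each action are first converted to values between 0 and 1 with softmax. As in the code below, the estimates are divided by a value that plays the role of e (here, Temp) so that, as with e-greedy, more randomness can be introduced early in training: a larger Temp flattens the probabilities toward uniform. The resulting softmax values are passed as the probabilities to np.random.choice to draw action_value, and the action is then selected with action = np.argmax(Q_probs[0] == action_value). Because the action_value returned by np.random.choice is the action's softmax value rather than its index, np.argmax(Q_probs[0] == action_value) is used to recover the index of the action.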

#Add this to network to compute Boltzmann probabilities.
Temp = tf.placeholder(shape=None,dtype=tf.float32)
Q_dist = slim.softmax(Q_out/Temp)
 
#Use this for action selection.
t = 0.5
Q_probs = sess.run(Q_dist,feed_dict={inputs:[state],Temp:t})
action_value = np.random.choice(Q_probs[0],p=Q_probs[0])
action = np.argmax(Q_probs[0] == action_value)
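
One caveat of the lookup above: if two actions have exactly the same probability, Q_probs[0] == action_value matches both and np.argmax returns the first of them. The following is a minimal plain-Python sketch of an alternative that samples the action index directly. It is an assumption-laden illustration rather than the article's method, and the names boltzmann_probs and boltzmann_action are hypothetical.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax of q_values / temperature, stabilised by subtracting the max."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index with probability given by the softmax."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# A high temperature gives near-uniform probabilities,
# a low temperature gives near-greedy probabilities.
hot = boltzmann_probs([1.0, 2.0], 100.0)
cold = boltzmann_probs([1.0, 2.0], 0.01)
action = boltzmann_action([1.0, 2.0], 0.5, random.Random(0))
```

With NumPy, the same idea could be written as np.random.choice(len(Q_probs[0]), p=Q_probs[0]), which returns an index instead of a probability value.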

5. Bayesian Approaches (w/ Dropout)

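This method uses the Dropout technique explained in an earlier post. Leaving dropout active when computing the Q-values makes each forward pass return a slightly different sample of the estimates, and the action is then chosen from that sample.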

#Add to network
keep_per = tf.placeholder(shape=None,dtype=tf.float32)
hidden = slim.dropout(hidden,keep_per)
 
 
keep = 0.5
Q_values = sess.run(Q_out,feed_dict={inputs:[state],keep_per:keep})
action = np.argmax(Q_values) #Or insert your favorite action-selection strategy with the sampled Q-values.
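
To make the idea of sampled Q-values concrete, here is a minimal plain-Python sketch of a single stochastic forward pass. The hidden activations, the weights, and the function names are hypothetical values invented for illustration; the dropout shown is the inverted form, where kept units are rescaled by 1/keep_prob, which is also what tf.nn.dropout does.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale kept units."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


def sampled_q_values(hidden, weights, keep_prob, rng):
    """One stochastic forward pass: dropout on hidden units, then a linear layer."""
    dropped = dropout(hidden, keep_prob, rng)
    return [sum(h * w for h, w in zip(dropped, row)) for row in weights]


rng = random.Random(0)
hidden = [0.5, -0.2, 0.8, 0.1]        # hypothetical hidden activations
weights = [[1.0, 0.5, -0.3, 0.2],     # hypothetical weights for action 0
           [0.2, -0.4, 0.9, 0.7]]     # hypothetical weights for action 1

# Each call draws a new dropout mask, so the Q-values differ between calls.
q_sample = sampled_q_values(hidden, weights, keep_prob=0.5, rng=rng)
action = max(range(len(q_sample)), key=lambda a: q_sample[a])
```

Setting keep_prob to 1.0 disables the sampling, so the forward pass becomes deterministic and the selection reduces to the Greedy Approach.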

The following source lets you switch between the five action-selection methods above.

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
import tensorflow.contrib.slim as slim
 
 
env = gym.make('CartPole-v0')
 
 
class experience_buffer():
    def __init__(self, buffer_size=10000):
        self.buffer = []
        self.buffer_size = buffer_size
 
    def add(self, experience):
        if len(self.buffer) + len(experience) >= self.buffer_size:
            self.buffer[0:(len(experience) + len(self.buffer)) - self.buffer_size] = []
        self.buffer.extend(experience)
 
    def sample(self, size):
        return np.reshape(np.array(random.sample(self.buffer, size)), [size, 5])
 
 
def updateTargetGraph(tfVars, tau):
    total_vars = len(tfVars)
    op_holder = []
    for idx, var in enumerate(tfVars[0:int(total_vars / 2)]):
        op_holder.append(tfVars[idx + int(total_vars / 2)].assign(
            (var.value() * tau) + ((1 - tau) * tfVars[idx + int(total_vars / 2)].value())))
    return op_holder
 
 
def updateTarget(op_holder, sess):
    for op in op_holder:
        sess.run(op)
 
 
 
class Q_Network():
    def __init__(self):
        # These lines establish the feed-forward part of the network used to choose actions
        self.inputs = tf.placeholder(shape=[None, 4], dtype=tf.float32)
        self.Temp = tf.placeholder(shape=None, dtype=tf.float32)
        self.keep_per = tf.placeholder(shape=None, dtype=tf.float32)
 
        hidden = slim.fully_connected(self.inputs, 64, activation_fn=tf.nn.tanh, biases_initializer=None)
        hidden = slim.dropout(hidden, self.keep_per)
        self.Q_out = slim.fully_connected(hidden, 2, activation_fn=None, biases_initializer=None)
 
        self.predict = tf.argmax(self.Q_out, 1)
        self.Q_dist = tf.nn.softmax(self.Q_out / self.Temp)
 
        # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
        self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
        self.actions_onehot = tf.one_hot(self.actions, 2, dtype=tf.float32)
 
        self.Q = tf.reduce_sum(tf.multiply(self.Q_out, self.actions_onehot), reduction_indices=1)
 
        self.nextQ = tf.placeholder(shape=[None], dtype=tf.float32)
        loss = tf.reduce_sum(tf.square(self.nextQ - self.Q))
        trainer = tf.train.GradientDescentOptimizer(learning_rate=0.0005)
        self.updateModel = trainer.minimize(loss)
 
 
 
# Set learning parameters
exploration = "e-greedy"  # Exploration method. Choose between: greedy, random, e-greedy, boltzmann, bayesian.
y = .99  # Discount factor.
num_episodes = 20000  # Total number of episodes to train network for.
tau = 0.001  # Amount to update target network at each step.
batch_size = 32  # Size of training batch
startE = 1  # Starting chance of random action
endE = 0.1  # Final chance of random action
anneling_steps = 200000  # How many steps of training to reduce startE to endE.
pre_train_steps = 50000  # Number of steps used before training updates begin.
 
 
 
tf.reset_default_graph()
 
q_net = Q_Network()
target_net = Q_Network()
 
init = tf.global_variables_initializer()
trainables = tf.trainable_variables()
targetOps = updateTargetGraph(trainables, tau)
myBuffer = experience_buffer()
 
# create lists to contain total rewards and steps per episode
jList = []
jMeans = []
rList = []
rMeans = []
 
 
 
with tf.Session() as sess:
    sess.run(init)
    updateTarget(targetOps, sess)
    e = startE
    stepDrop = (startE - endE) / anneling_steps
    total_steps = 0
 
    for i in range(num_episodes):
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        while j < 999:
            j += 1
            if exploration == "greedy":
                # Choose an action with the maximum expected value.
                a, allQ = sess.run([q_net.predict, q_net.Q_out], feed_dict={q_net.inputs: [s], q_net.keep_per: 1.0})
                a = a[0]
            if exploration == "random":
                # Choose an action randomly.
                a = env.action_space.sample()
            if exploration == "e-greedy":
                # Choose an action greedily (with e chance of random action) from the Q-network
                if np.random.rand(1) < e or total_steps < pre_train_steps:
                    a = env.action_space.sample()
                else:
                    a, allQ = sess.run([q_net.predict, q_net.Q_out], feed_dict={q_net.inputs: [s], q_net.keep_per: 1.0})
                    a = a[0]
            if exploration == "boltzmann":
                # Choose an action probabilistically, with weights relative to the Q-values.
                Q_d, allQ = sess.run([q_net.Q_dist, q_net.Q_out],
                                     feed_dict={q_net.inputs: [s], q_net.Temp: e, q_net.keep_per: 1.0})
                a = np.random.choice(Q_d[0], p=Q_d[0])
                a = np.argmax(Q_d[0] == a)
            if exploration == "bayesian":
                # Choose an action using a sample from a dropout approximation of a bayesian q-network.
                a, allQ = sess.run([q_net.predict, q_net.Q_out],
                                   feed_dict={q_net.inputs: [s], q_net.keep_per: (1 - e) + 0.1})
                a = a[0]
 
            # Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            myBuffer.add(np.reshape(np.array([s, a, r, s1, d]), [1, 5]))
 
            if e > endE and total_steps > pre_train_steps:
                e -= stepDrop
 
            if total_steps > pre_train_steps and total_steps % 5 == 0:
                # We use Double-DQN training algorithm
                trainBatch = myBuffer.sample(batch_size)
                Q1 = sess.run(q_net.predict, feed_dict={q_net.inputs: np.vstack(trainBatch[:, 3]), q_net.keep_per: 1.0})
                Q2 = sess.run(target_net.Q_out,
                              feed_dict={target_net.inputs: np.vstack(trainBatch[:, 3]), target_net.keep_per: 1.0})
                end_multiplier = -(trainBatch[:, 4] - 1)
                doubleQ = Q2[range(batch_size), Q1]
                targetQ = trainBatch[:, 2] + (y * doubleQ * end_multiplier)
                _ = sess.run(q_net.updateModel,
                             feed_dict={q_net.inputs: np.vstack(trainBatch[:, 0]), q_net.nextQ: targetQ,
                                        q_net.keep_per: 1.0, q_net.actions: trainBatch[:, 1]})
                updateTarget(targetOps, sess)
 
            rAll += r
            s = s1
            total_steps += 1
            if d == True:
                break
        jList.append(j)
        rList.append(rAll)
        if i % 100 == 0 and i != 0:
            r_mean = np.mean(rList[-100:])
            j_mean = np.mean(jList[-100:])
            if exploration == 'e-greedy':
                print("Mean Reward: " + str(r_mean) + " Total Steps: " + str(total_steps) + " e: " + str(e))
            if exploration == 'boltzmann':
                print("Mean Reward: " + str(r_mean) + " Total Steps: " + str(total_steps) + " t: " + str(e))
            if exploration == 'bayesian':
                print("Mean Reward: " + str(r_mean) + " Total Steps: " + str(total_steps) + " p: " + str(e))
            if exploration == 'random' or exploration == 'greedy':
                print("Mean Reward: " + str(r_mean) + " Total Steps: " + str(total_steps))
            rMeans.append(r_mean)
            jMeans.append(j_mean)
print("Percent of successful episodes: " + str(sum(rList) / num_episodes) + "%")
 
plt.plot(rMeans)
plt.plot(jMeans)
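
The training step in the listing above uses Double-DQN targets and a soft update of the target network. The following is a minimal plain-Python sketch of those two calculations on ordinary lists, assuming the same roles as in the listing (the main network selects the next action, the target network evaluates it). The function names and the sample numbers are hypothetical.

```python
def double_dqn_targets(rewards, dones, next_q_main, next_q_target, gamma=0.99):
    """Compute r + gamma * Q_target(s', argmax_a Q_main(s', a)) per transition."""
    targets = []
    for r, done, q_main, q_tgt in zip(rewards, dones, next_q_main, next_q_target):
        best = max(range(len(q_main)), key=lambda a: q_main[a])  # main net selects
        bootstrap = 0.0 if done else gamma * q_tgt[best]         # target net evaluates
        targets.append(r + bootstrap)
    return targets


def soft_update(main_weights, target_weights, tau=0.001):
    """Move each target weight a fraction tau of the way toward the main weight."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_weights, target_weights)]


# Two hypothetical transitions: the second one ends the episode.
targets = double_dqn_targets(
    rewards=[1.0, 1.0],
    dones=[False, True],
    next_q_main=[[0.2, 0.8], [0.5, 0.1]],
    next_q_target=[[1.0, 2.0], [3.0, 4.0]],
)
new_target = soft_update([1.0, 0.0], [0.0, 0.0], tau=0.5)
```

In the listing, the 0.0 bootstrap for finished episodes corresponds to end_multiplier, and soft_update corresponds to the assign operations built in updateTargetGraph.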
