python – Keras code for Q-learning on OpenAI Gym FrozenLake is not working properly
Published: 2020-12-20 13:19:29 | Category: Python | Source: compiled from the web
Maybe my question looks silly.
I am studying the Q-learning algorithm. To understand it better, I am trying to rewrite the TensorFlow code of this FrozenLake example in Keras.
My code:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
# Create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
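The per-step update inside the loop above can be isolated as a small NumPy sketch. The helper names `one_hot` and `q_target` and the sample Q-values are hypothetical, introduced only for illustration. Fixed arrays stand in for the `model.predict(...)` outputs so the arithmetic can be checked by hand.

```python
import numpy as np

N_STATES = 16   # FrozenLake 4x4 grid, matching input_shape=(16,) above
GAMMA = 0.99    # discount factor (named `y` in the code above)

def one_hot(state):
    """Return a (1, 16) one-hot row, equivalent to np.identity(16)[state:state+1]."""
    return np.identity(N_STATES)[state:state + 1]

def q_target(current_q, action, reward, next_q, gamma=GAMMA):
    """Copy the current Q-values and overwrite only the entry of the action taken."""
    target = current_q.copy()  # copy so the original prediction is left untouched
    target[0, action] = reward + gamma * np.max(next_q)
    return target

# Hypothetical Q-values standing in for model.predict(...) outputs.
current_q = np.array([[0.1, 0.2, 0.3, 0.4]])
next_q = np.array([[0.0, 0.5, 0.25, 0.1]])
target = q_target(current_q, action=1, reward=0.0, next_q=next_q)
```

Only the entry for the chosen action changes, so the squared-error loss is driven by that one action and the other three outputs contribute zero error.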
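When I run it, it does not work well: Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does better: Percent of succesful episodes: 0.352%
plt.plot(rList)
What am I doing wrong?

Solution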
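Besides setting use_bias=False, as @Maldus mentioned in the comments, another thing you can try is starting with a higher epsilon value (e.g. 0.5 or 0.75). One trick is to reduce epsilon only when the goal is reached, i.e. do not reduce it at the end of every episode. That way your player keeps exploring the map randomly until it starts to converge on a good route, and only then is it a good idea to reduce the epsilon parameter.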
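The epsilon suggestion above could be sketched as follows. This is a minimal illustration, not the answerer's actual code: the function name `update_epsilon`, the multiplicative `decay` factor, the `min_epsilon` floor and the sample episode outcomes are all assumptions.

```python
def update_epsilon(epsilon, reached_goal, decay=0.95, min_epsilon=0.01):
    """Shrink epsilon only after an episode that reached the goal (assumed schedule)."""
    if reached_goal:
        return max(min_epsilon, epsilon * decay)
    return epsilon  # failed episodes leave the exploration rate unchanged

epsilon = 0.75  # higher starting value, as suggested above
# Hypothetical episode outcomes: True means the goal was reached.
for reached_goal in [False, False, True, False, True]:
    epsilon = update_epsilon(epsilon, reached_goal)
```

In the question's training loop, `reached_goal` could plausibly be derived as `d and reward == 1`, since FrozenLake gives a reward of 1 only when the goal tile is reached.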
I actually implemented a similar model in Keras in a gist, using Convolutional layers instead of Dense layers. I managed to get it working in under 2000 episodes. It might be helpful to others :)
(Editor: 李大同)


