python – Something is wrong with my Keras Q-learning code for OpenAI Gym FrozenLake
Maybe my question will seem silly.
I'm studying the Q-learning algorithm. To understand it better, I tried to port the TensorFlow code of this FrozenLake example to Keras.

My code:

import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
# create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000

for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y * max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1. / ((i / 50) + 10)
            break

print("Percent of successful episodes: " + str(sum(rList)/num_episodes) + "%")

When I run it, it does not perform well: Percent of successful episodes: 0.052%

plt.plot(rList)

The original TensorFlow code does much better: Percent of successful episodes: 0.352%

plt.plot(rList)

What am I doing wrong?
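For context, both the TensorFlow original and the Keras port above feed the discrete FrozenLake state to the network as a one-hot row vector and build the training target from the update reward + y * max(Q(new_state)). A minimal standalone check of that encoding step (the helper name one_hot_state is mine, not from the code above):

import numpy as np

def one_hot_state(s, n_states=16):
    # np.identity(16)[s:s+1] keeps the batch dimension: shape (1, 16)
    return np.identity(n_states)[s:s+1]

x = one_hot_state(3)
print(x.shape)        # (1, 16)
print(x[0].argmax())  # 3 -- the single 1 sits at index 3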
Solution

Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start from a higher epsilon value (e.g. 0.5 or 0.75). One trick might be to only decrease epsilon when you actually reach the goal, i.e. don't decrease it at the end of every episode. That way your player can keep exploring the map at random until it starts to converge on a good route, and only then does it pay off to shrink the epsilon parameter.
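A minimal sketch of how those two changes could be wired into the code from the question: use_bias=False on both Dense layers, and an epsilon that starts higher and is only decayed once the goal is actually reached. The starting value of 0.5 follows the answer's suggestion, the decay formula is kept from the question, and everything else mirrors the asker's loop:

import gym
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

env = gym.make('FrozenLake-v0')

# Same network as in the question, but without bias terms (as suggested by @Maldus).
model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform',
                use_bias=False, input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform',
                use_bias=False))
model.compile(loss=custom_loss, optimizer='sgd')

y = .99
e = 0.5            # start with much more exploration than 0.1
rList = []
num_episodes = 2000

for i in range(num_episodes):
    s = env.reset()
    rAll = 0
    for j in range(99):
        Q = model.predict(np.identity(16)[s:s+1], batch_size=1)
        a = int(np.argmax(Q))
        if np.random.rand(1) < e:
            a = env.action_space.sample()
        s1, reward, d, _ = env.step(a)
        Q1 = model.predict(np.identity(16)[s1:s1+1], batch_size=1)
        Q[0, a] = reward + y * np.max(Q1)
        model.fit(np.identity(16)[s:s+1], Q, verbose=0, batch_size=1)
        rAll += reward
        s = s1
        if d:
            if reward > 0:            # goal reached: only now reduce exploration
                e = 1. / ((i / 50) + 10)
            break
    rList.append(rAll)

print("Percent of successful episodes: " + str(sum(rList) / num_episodes) + "%")

The idea is that with a larger epsilon the agent stumbles onto the goal by chance often enough to get a useful learning signal before exploration is reduced.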
I've actually implemented a similar model in Keras in a gist, using Convolutional layers instead of Dense layers, and managed to get it working in under 2000 episodes. It might be of help to someone :)
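The gist itself isn't reproduced here, but a hypothetical sketch of that idea could be to view the 16 one-hot entries as a 4x4 single-channel grid and let a small convolution replace the first Dense layer. The layer sizes, the linear output, the mse/adam choices and the encode helper below are my own guesses, not necessarily what the gist does:

import numpy as np
from keras.layers import Conv2D, Flatten, Dense
from keras.models import Sequential

# Hypothetical: treat the 16-dimensional one-hot state as a 4x4 board with one channel.
model = Sequential()
model.add(Conv2D(16, kernel_size=(2, 2), activation='relu', input_shape=(4, 4, 1)))
model.add(Flatten())
model.add(Dense(4, activation='linear'))   # one Q-value per action
model.compile(loss='mse', optimizer='adam')

def encode(s):
    # Hypothetical helper: state index -> (1, 4, 4, 1) tensor for the Conv2D input.
    return np.identity(16)[s:s+1].reshape(1, 4, 4, 1)

print(model.predict(encode(0), batch_size=1))  # four Q-values for state 0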