Artificial Intelligence: Computing the Optimal Policy (Policy Iteration and Value Iteration)
Published: 2019-06-14


 

1. Assignment Requirements

Topic: compute the optimal policy. Building on the example below, design a problem of your own (for example, finding the shortest path between two points, or adding obstacles to the grid and computing the shortest path around them), give the MDP model corresponding to that problem, and then compute the optimal policy using both the value iteration and policy iteration algorithms.

 

 

2. Approach

(1) Problem design (MDP description)

Design a 4x4 grid as the initial matrix; every cell is a state with an associated return. From any cell the agent can move up, down, left, or right; a move off the edge of the grid leaves the state unchanged, and every move yields a reward of -1. In the implementation below, cell (0, 0) serves as the terminal state and barriers are placed at cells (1, 2) and (2, 1).
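As a sketch, this grid world can be written as a standard MDP tuple (S, A, P, R); the undiscounted setting (gamma = 1) is an assumption consistent with the updates used in the code below.

$$
\begin{aligned}
S &= \{(i, j) \mid 0 \le i, j \le 3\},\\
A &= \{\text{up}, \text{down}, \text{left}, \text{right}\},\\
P(s' \mid s, a) &= 1 \ \text{if } s' = \mathrm{move}(s, a) \text{ is a valid cell, otherwise } s' = s,\\
R(s, a) &= -1 \ \text{for every move}, \qquad \gamma = 1.
\end{aligned}
$$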

(2) Implementation idea

Use two matrices, Matrix1 and Matrix2: Matrix1 holds the value estimates from the previous iteration; each sweep first writes the updated values into Matrix2 and then copies them back into Matrix1. Each cell's new value is obtained from the values reached by the up/down/left/right moves.
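A minimal sketch of one such synchronous sweep under the initial uniform random policy; for brevity it keeps only the terminal state at (0, 0) and omits the barriers (both are handled in the full code in section 3):

def sweep(m1):
    # m1 is Matrix1 (last iteration's values); a fresh Matrix2 receives
    # the new values, and the caller assigns it back to Matrix1
    m2 = [row[:] for row in m1]
    for i in range(4):
        for j in range(4):
            if (i, j) == (0, 0):                 # terminal state keeps value 0
                continue
            total = 0.0
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                ni, nj = i + di, j + dj
                if not (0 <= ni < 4 and 0 <= nj < 4):
                    ni, nj = i, j                # edge: the state stays put
                total += 0.25 * (-1 + m1[ni][nj])   # each move costs -1
            m2[i][j] = total
    return m2

Calling it repeatedly, m = sweep(m), drives the values toward the fixed point of the random policy.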

 

As for the Policy Iteration:

(1) Initialize Matrix1 (all zeros).

(2) Policy evaluation: update every state's value under the current policy pi until the largest change falls below a threshold theta:

$$V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s)\,\bigl[r(s, a) + V_k(s')\bigr], \qquad \Delta = \max_{s}\,\lvert V_{k+1}(s) - V_k(s) \rvert$$

where s' is the deterministic successor of s under action a and r(s, a) = -1.

(3) Policy improvement: make the policy greedy with respect to the values, and repeat (2)-(3) until the policy no longer changes; it has then converged.
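The following is a compact, self-contained sketch of this loop on a simplified 4x4 grid (terminal at (0, 0), no barriers, uniform random initial policy); these simplifications are assumptions made to keep the example short, and the full assignment code with barriers appears in section 3.

# A minimal sketch of policy iteration on a simplified 4x4 grid.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

def step(i, j, a):
    ni, nj = i + a[0], j + a[1]
    # a move off the edge leaves the state unchanged
    return (ni, nj) if 0 <= ni < 4 and 0 <= nj < 4 else (i, j)

def evaluate(V, policy, theta=1e-6):
    """Policy evaluation: sweep until the largest change is below theta."""
    while True:
        delta, newV = 0.0, [row[:] for row in V]
        for i in range(4):
            for j in range(4):
                if (i, j) == (0, 0):               # terminal keeps value 0
                    continue
                newV[i][j] = sum(p * (-1 + V[step(i, j, a)[0]][step(i, j, a)[1]])
                                 for a, p in zip(ACTIONS, policy[i][j]))
                delta = max(delta, abs(newV[i][j] - V[i][j]))
        V = newV
        if delta < theta:
            return V

def improve(V, policy):
    """Policy improvement: greedify w.r.t. V (ties split evenly)."""
    stable = True
    for i in range(4):
        for j in range(4):
            if (i, j) == (0, 0):
                continue
            returns = [-1 + V[step(i, j, a)[0]][step(i, j, a)[1]] for a in ACTIONS]
            best = max(returns)
            greedy = [1.0 if r == best else 0.0 for r in returns]
            greedy = [g / sum(greedy) for g in greedy]
            if greedy != policy[i][j]:
                policy[i][j], stable = greedy, False
    return policy, stable

V = [[0.0] * 4 for _ in range(4)]
policy = [[[0.25] * 4 for _ in range(4)] for _ in range(4)]
stable = False
while not stable:                                  # alternate (2) and (3)
    V = evaluate(V, policy)
    policy, stable = improve(V, policy)
print(V)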

 

As for the Value Iteration:

(1) Initialize Matrix1 (all zeros).

(2) Update every state's value with the Bellman optimality backup until the largest change falls below a threshold theta:

$$V_{k+1}(s) \leftarrow \max_{a}\,\bigl[r(s, a) + V_k(s')\bigr], \qquad \Delta = \max_{s}\,\lvert V_{k+1}(s) - V_k(s) \rvert$$

(3) Output the value function and obtain the optimal policy by acting greedily with respect to it.
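A minimal, self-contained sketch of value iteration on the same simplified 4x4 grid (terminal at (0, 0), no barriers); the assumptions are the same as in the policy iteration sketch above.

# A minimal sketch of value iteration on a simplified 4x4 grid.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

def step(i, j, a):
    ni, nj = i + a[0], j + a[1]
    return (ni, nj) if 0 <= ni < 4 and 0 <= nj < 4 else (i, j)

def value_iteration(theta=1e-6):
    V = [[0.0] * 4 for _ in range(4)]
    while True:
        delta, newV = 0.0, [row[:] for row in V]
        for i in range(4):
            for j in range(4):
                if (i, j) == (0, 0):               # terminal keeps value 0
                    continue
                # Bellman optimality backup: best one-step lookahead
                newV[i][j] = max(-1 + V[step(i, j, a)[0]][step(i, j, a)[1]]
                                 for a in ACTIONS)
                delta = max(delta, abs(newV[i][j] - V[i][j]))
        V = newV
        if delta < theta:
            return V                               # V[i][j] = -(steps to goal)

print(value_iteration())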

 

3. Results

Policy Iteration:

(converged value matrix screenshot not preserved in this copy; the policy iteration code below prints it)

Value Iteration:

(converged value matrix screenshot not preserved in this copy; the value iteration code below prints it)

Policy iteration implementation:

class State(object):
    def __init__(self, i, j):
        # start from the uniform random policy: each move has probability 0.25
        self.up = 0.25
        self.down = 0.25
        self.left = 0.25
        self.right = 0.25
        # a move off the edge leaves the state unchanged
        self.upState = (i - 1, j) if i > 0 else (i, j)
        self.downState = (i + 1, j) if i < 3 else (i, j)
        self.leftState = (i, j - 1) if j > 0 else (i, j)
        self.rightState = (i, j + 1) if j < 3 else (i, j)
        # barriers at positions (1, 2) and (2, 1): moves into them bounce back
        if i == 2 and j == 2:
            self.upState = (i, j)
            self.leftState = (i, j)
        if i == 3 and j == 1:
            self.upState = (i, j)
        if i == 2 and j == 0:
            self.rightState = (i, j)
        if i == 1 and j == 1:
            self.rightState = (i, j)
            self.downState = (i, j)
        if i == 0 and j == 2:
            self.downState = (i, j)
        if i == 1 and j == 3:
            self.leftState = (i, j)

    def updatePolicy(self, rewardMatrix):
        # make this state's policy greedy w.r.t. the current values;
        # return True if the policy did not change (i.e. it is stable)
        oldUp, oldDown, oldLeft, oldRight = self.up, self.down, self.left, self.right
        upValue = -1 + rewardMatrix[self.upState[0]][self.upState[1]]
        downValue = -1 + rewardMatrix[self.downState[0]][self.downState[1]]
        leftValue = -1 + rewardMatrix[self.leftState[0]][self.leftState[1]]
        rightValue = -1 + rewardMatrix[self.rightState[0]][self.rightState[1]]
        if upValue >= downValue and upValue >= leftValue and upValue >= rightValue:
            self.up, self.down, self.left, self.right = 1.0, 0, 0, 0
        elif downValue >= upValue and downValue >= leftValue and downValue >= rightValue:
            self.up, self.down, self.left, self.right = 0, 1.0, 0, 0
        elif leftValue >= upValue and leftValue >= downValue and leftValue >= rightValue:
            self.up, self.down, self.left, self.right = 0, 0, 1.0, 0
        else:
            self.up, self.down, self.left, self.right = 0, 0, 0, 1.0
        return (oldUp == self.up and oldDown == self.down
                and oldLeft == self.left and oldRight == self.right)

# expected one-step update of state (i, j) under its current policy
def updateReward(i, j):
    s = stateMatrix[i][j]
    tempMatrix[i][j] = -1 + s.up * rewardMatrix[s.upState[0]][s.upState[1]] \
                          + s.down * rewardMatrix[s.downState[0]][s.downState[1]] \
                          + s.left * rewardMatrix[s.leftState[0]][s.leftState[1]] \
                          + s.right * rewardMatrix[s.rightState[0]][s.rightState[1]]

# initialize the state matrix
stateMatrix = [[State(i, j) for j in range(4)] for i in range(4)]

# initialize the value matrix and the buffer written during each sweep
rewardMatrix = [[0] * 4 for _ in range(4)]
tempMatrix = [[0] * 4 for _ in range(4)]

thresh = 1e-4                       # convergence threshold for evaluation
SKIP = {(0, 0), (1, 2), (2, 1)}     # terminal state and the two barriers

# Policy iteration
stableFlag = False
while not stableFlag:
    # policy evaluation: sweep until the largest change is below thresh
    while True:
        delta = 0
        for i in range(4):
            for j in range(4):
                if (i, j) in SKIP:
                    continue
                v = tempMatrix[i][j]
                updateReward(i, j)
                delta = max(delta, abs(tempMatrix[i][j] - v))
        # copy (not alias) so the next sweep reads the previous values
        rewardMatrix = [row[:] for row in tempMatrix]
        if delta <= thresh:
            break
    # policy improvement: greedify every state's policy
    stableFlag = True
    for i in range(4):
        for j in range(4):
            if (i, j) in SKIP:
                continue
            stable = stateMatrix[i][j].updatePolicy(rewardMatrix)
            stableFlag = stableFlag and stable

# print the converged value matrix
for i in range(4):
    for j in range(4):
        print(rewardMatrix[i][j], end=" ")
    print()
Value iteration implementation:

# The Operation class stores one state's move probabilities and successor states
class Operation(object):
    def __init__(self, i, j):
        # each move has probability 0.25 (not used by the max backup below)
        self.up = 0.25
        self.down = 0.25
        self.left = 0.25
        self.right = 0.25
        # a move off the edge leaves the state unchanged
        self.upState = (i - 1, j) if i > 0 else (i, j)
        self.downState = (i + 1, j) if i < 3 else (i, j)
        self.leftState = (i, j - 1) if j > 0 else (i, j)
        self.rightState = (i, j + 1) if j < 3 else (i, j)
        # barriers at (1, 2) and (2, 1), mirroring the State class above
        # (the original version omitted these adjustments here)
        if i == 2 and j == 2:
            self.upState = (i, j)
            self.leftState = (i, j)
        if i == 3 and j == 1:
            self.upState = (i, j)
        if i == 2 and j == 0:
            self.rightState = (i, j)
        if i == 1 and j == 1:
            self.rightState = (i, j)
            self.downState = (i, j)
        if i == 0 and j == 2:
            self.downState = (i, j)
        if i == 1 and j == 3:
            self.leftState = (i, j)

# initialize the value matrix and the buffer written during each sweep
rewardMatrix = [[0] * 4 for _ in range(4)]
tempMatrix = [[0] * 4 for _ in range(4)]

# initialize the state matrix
stateMatrix = [[Operation(i, j) for j in range(4)] for i in range(4)]

# Bellman optimality backup for state (i, j)
def updateReward(i, j):
    s = stateMatrix[i][j]
    tempMatrix[i][j] = max(-1 + rewardMatrix[s.upState[0]][s.upState[1]],
                           -1 + rewardMatrix[s.downState[0]][s.downState[1]],
                           -1 + rewardMatrix[s.leftState[0]][s.leftState[1]],
                           -1 + rewardMatrix[s.rightState[0]][s.rightState[1]])

# Value iteration
thresh = 1e-4                       # convergence threshold
SKIP = {(0, 0), (1, 2), (2, 1)}     # terminal state and the two barriers

while True:
    delta = 0
    for i in range(4):
        for j in range(4):
            if (i, j) in SKIP:
                continue
            v = rewardMatrix[i][j]
            updateReward(i, j)
            delta = max(delta, abs(v - tempMatrix[i][j]))
    # copy (not alias) so the next sweep reads the previous values
    rewardMatrix = [row[:] for row in tempMatrix]
    if delta <= thresh:
        break

# print the converged value matrix
for i in range(4):
    for j in range(4):
        print(rewardMatrix[i][j], end=" ")
    print()
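To read the optimal policy off the converged values (step (3) above), a minimal sketch: pick, for each state, the move whose successor has the highest value. It reuses stateMatrix, rewardMatrix, and SKIP from the value iteration code above; best_move is a helper introduced here for illustration, not part of the original program.

# extract the greedy policy from the converged value matrix;
# best_move is a helper added for illustration, not in the original code
def best_move(i, j):
    s = stateMatrix[i][j]
    moves = {"up": s.upState, "down": s.downState,
             "left": s.leftState, "right": s.rightState}
    # the best move leads to the successor with the highest value
    return max(moves, key=lambda m: rewardMatrix[moves[m][0]][moves[m][1]])

for i in range(4):
    print([best_move(i, j) if (i, j) not in SKIP else "." for j in range(4)])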

  

Reposted from: https://www.cnblogs.com/sxh3014218159/p/6659738.html
