Reinforcement Learning Example: The Multi-Armed Bandit

忘是亡心i, 2022-11-17 10:19

# I. Game Background #

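A multi-armed bandit is a kind of slot machine; the one in this article has three levers. Each pull releases some coins. Different levers pay out different amounts, and the same lever does not always pay the same amount either. So, given N pulls in total, how do we collect as many coins as possible?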

# II. Policies #

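An obvious idea is to pull each lever a few times first, see which one pays the most, and then keep pulling that lever. This is a workable approach, but if chance leads us to pick a lever that is not actually the best one, the loss can be large. So is there a way to mostly pull the lever that currently looks best while still pulling the other levers with some probability? That is what a stochastic policy in reinforcement learning does. Three such policies are covered here: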

## 1. ε-greedy policy ##

$$
\pi(a \mid s) \leftarrow
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\epsilon}{|A(s)|} & \text{if } a \neq \arg\max_{a'} Q(s, a')
\end{cases}
$$

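In other words, most of the time we pull the lever whose average payout is currently the highest, but with probability ε (a small number) we pull a lever chosen uniformly at random, which may again be the greedy one. This matches the idea described above.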
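
To make the formula concrete, here is a minimal standalone sketch that computes the ε-greedy probabilities and samples an action. The function names `epsilon_greedy_probs` and `epsilon_greedy_action` are illustrative assumptions, not part of the article's class, and the Q-values are made up for the example.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the probability of choosing each action under epsilon-greedy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])  # first index wins ties
    probs = [epsilon / n] * n        # every action shares the exploration mass
    probs[greedy] += 1.0 - epsilon   # the greedy action gets the remainder
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample one action index (0-based) under epsilon-greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over all actions
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


# Hypothetical Q-values for three levers
probs = epsilon_greedy_probs([1.0, 2.0, 1.5], epsilon=0.1)
print(probs)
```

With ε = 0.1 and three levers, the greedy lever gets probability 1 - 0.1 + 0.1/3 (about 0.933) and each of the other two gets 0.1/3 (about 0.033).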

## 2. Boltzmann distribution ##

$$
p(a_i) = \frac{\exp\left(Q(a_i)/\tau\right)}{\sum_j \exp\left(Q(a_j)/\tau\right)}
$$

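Compared with the greedy policy, this is a "soft" treatment of every action: each lever is pulled with a probability that grows with its estimated value $Q(a_i)$, rather than the best lever taking almost all of the probability. The temperature $\tau$ controls how soft the choice is: a small $\tau$ approaches the greedy choice, and a large $\tau$ approaches a uniform random choice.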
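
The effect of the temperature can be seen in a small sketch. The helper `boltzmann_probs` is a hypothetical name, the Q-values are made up, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the resulting probabilities.

```python
import math


def boltzmann_probs(q_values, tau):
    """Softmax over Q-values with temperature tau (tau must be > 0)."""
    m = max(q_values)  # shift by the max so exp() cannot overflow
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Q-values for three levers
q = [1.0, 2.0, 1.5]
for tau in (0.1, 1.0, 10.0):
    print(tau, [round(p, 3) for p in boltzmann_probs(q, tau)])
```

At `tau=0.1` nearly all of the probability lands on the second lever, while at `tau=10.0` the three probabilities are close to one third each.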

## 3. UCB policy ##

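UCB (upper confidence bound) uses a somewhat more involved formula: it adds a confidence-interval bonus to each lever's average payout to drive exploration, and pulls the lever with the largest total. In the form used by the code in this article, where $t$ is the number of plays so far, $N(a)$ is the number of times lever $a$ has been pulled, and $c$ is a tunable coefficient:

$$
a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]
$$

The bonus shrinks as a lever is pulled more often, so rarely tried levers keep getting revisited. UCB is generally the best performer of the three.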
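
As a rough illustration of how the confidence bonus changes the choice, the sketch below scores three levers with `Q(a) + c * sqrt(ln(t) / N(a))`, the form that appears in this article's `choose_action` code. The function name `ucb_scores` and the snapshot values are assumptions made up for the example.

```python
import math


def ucb_scores(q_values, action_counts, total_plays, c):
    """Return Q(a) + c * sqrt(ln(t) / N(a)) per action; every N(a) must be > 0."""
    return [
        q + c * math.sqrt(math.log(total_plays) / n)
        for q, n in zip(q_values, action_counts)
    ]


# Hypothetical snapshot: lever 3 looks slightly worse than lever 2,
# but it has only been tried twice, so its estimate is very uncertain.
q = [1.0, 2.0, 1.8]
counts = [10, 88, 2]
scores = ucb_scores(q, counts, total_plays=sum(counts), c=0.5)
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, "-> pull lever", best + 1)
```

Here the plain greedy choice would be lever 2, but the large bonus on the barely explored lever 3 gives it the highest score, so UCB tries it again.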

# III. Code Implementation #

First, create a class for the game and define some basic quantities:

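    import numpy as np
    import matplotlib.pyplot as plt


    class KB_Game:
        def __init__(self, *args, **kwargs):
            # Available actions (lever numbers, 1-based)
            self.actions = [1, 2, 3]
            # Average reward observed so far for each lever
            self.q = np.array([0.0, 0.0, 0.0])
            # Number of times each lever has been chosen
            self.action_counts = np.array([0, 0, 0])
            # Total number of plays
            self.counts = 0
            # Cumulative reward so far
            self.current_cumulative_rewards = 0.0
            # History of play indices
            self.counts_history = []
            # History of cumulative rewards
            self.cumulative_rewards_history = []
            # Current action
            self.a = 1
            # Current reward
            self.reward = 0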

Next comes the number of coins returned by each action:

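        # Simulate the multi-armed bandit
        def step(self, a):
            r = 0
            # Each lever pays out coins drawn from a different normal distribution
            # (means 1, 2 and 1.5, standard deviation 1)
            if a == 1:
                r = np.random.normal(1, 1)
            elif a == 2:
                r = np.random.normal(2, 1)
            elif a == 3:
                r = np.random.normal(1.5, 1)
            return r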
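
Because the payouts in `step` are noisy, a handful of trial pulls can easily rank the levers wrongly. The sketch below is a rough, standalone estimate of how often "pull every lever k times, then trust the sample means" actually identifies lever 2 as the best one. It uses the standard library's `random.gauss` in place of `np.random.normal`; the helper names, the trial count, and the seed are assumptions chosen for illustration.

```python
import random

MEANS = [1.0, 2.0, 1.5]  # same means as step(); standard deviation 1 for every lever


def best_lever_found(pulls_per_lever, rng):
    """Pull every lever equally often; True if lever 2 has the top sample mean."""
    sample_means = [
        sum(rng.gauss(mu, 1.0) for _ in range(pulls_per_lever)) / pulls_per_lever
        for mu in MEANS
    ]
    return sample_means.index(max(sample_means)) == 1


def success_rate(pulls_per_lever, trials=2000, seed=0):
    """Fraction of independent trials in which the truly best lever is identified."""
    rng = random.Random(seed)
    hits = sum(best_lever_found(pulls_per_lever, rng) for _ in range(trials))
    return hits / trials


few = success_rate(3)     # only 3 trial pulls per lever
many = success_rate(100)  # 100 trial pulls per lever
print(few, many)
```

With only a few pulls per lever the best lever is missed in a sizeable share of trials, which is the risk that motivates policies that keep exploring.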

Next, choose a lever according to the selected policy:

        def choose_action(self, policy, **kwargs):
            action = 0
            if policy == 'e_greedy':
                # With probability epsilon explore uniformly; otherwise exploit
                if np.random.random() < kwargs['epsilon']:
                    action = np.random.randint(1, 4)
                else:
                    action = np.argmax(self.q) + 1
            if policy == 'ucb':
                c_ratio = kwargs['c_ratio']
                # Pull every lever once before applying the UCB formula
                if 0 in self.action_counts:
                    action = np.where(self.action_counts == 0)[0][0] + 1
                else:
                    # Average reward plus confidence bonus
                    value = self.q + c_ratio * np.sqrt(np.log(self.counts) / self.action_counts)
                    action = np.argmax(value) + 1
            if policy == 'boltzmann':
                tau = kwargs['temperature']
                # Softmax over the average rewards with temperature tau
                p = np.exp(self.q / tau) / (np.sum(np.exp(self.q / tau)))
                action = np.random.choice([1, 2, 3], p=p.ravel())
            return action

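Below is the interaction loop. The trailing keyword arguments carry the hyperparameters of the three policies: `epsilon` (ε) for ε-greedy, `temperature` ($\tau$) for Boltzmann, and `c_ratio` for UCB.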

        # The actual interaction loop
        def train(self, play_total, policy, **kwargs):
            # Per-lever history of the estimated average reward (local to this call)
            reward_1 = []
            reward_2 = []
            reward_3 = []
            for i in range(play_total):
                # Forward the policy's hyperparameter (epsilon, c_ratio or temperature)
                action = self.choose_action(policy, **kwargs)
                self.a = action
                self.reward = self.step(self.a)
                self.counts += 1
                idx = self.a - 1
                # Running average: fold the new reward into this lever's mean
                self.q[idx] = (self.q[idx] * self.action_counts[idx] + self.reward) / (
                    self.action_counts[idx] + 1)
                self.action_counts[idx] += 1
                reward_1.append(self.q[0])
                reward_2.append(self.q[1])
                reward_3.append(self.q[2])
                self.current_cumulative_rewards += self.reward
                self.counts_history.append(i)
                self.cumulative_rewards_history.append(self.current_cumulative_rewards)
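
The update of `self.q` in the loop above is a running average: `(q * n + r) / (n + 1)` can be rewritten as `q + (r - q) / (n + 1)`, so the estimate moves toward each new reward by a step of size `1 / (n + 1)` and no reward history needs to be stored. A small worked sketch, with the hypothetical helper `update_mean` and made-up rewards:

```python
def update_mean(q, n, reward):
    """Incremental form of the running average: q + (reward - q) / (n + 1)."""
    return q + (reward - q) / (n + 1)


rewards = [2.0, 1.0, 3.0, 2.0]  # made-up rewards for one lever
q, n = 0.0, 0
for r in rewards:
    q = update_mean(q, n, r)
    n += 1

print(q)  # → 2.0, the plain average of the four rewards
```

The intermediate estimates are 2.0, 1.5, 2.0 and 2.0, which are exactly the averages of the first one, two, three and four rewards.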

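That completes the main part. There are also plotting and reset functions, which are not listed here. Run the code to see the training results: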

    if __name__ == '__main__':
        np.random.seed(0)
        k_gamble = KB_Game()
        total = 2000
        k_gamble.train(play_total=total, policy='e_greedy', epsilon=0.05)
        k_gamble.plot(colors='r', policy='e_greedy', linestyle='-')
        k_gamble.reset()
        k_gamble.train(play_total=total, policy='boltzmann', temperature=1)
        k_gamble.plot(colors='g', policy='boltzmann', linestyle='--')
        k_gamble.reset()
        k_gamble.train(play_total=total, policy='ucb', c_ratio=0.5)
        k_gamble.plot(colors='b', policy='ucb', linestyle='-.')
        plt.show()

![Training results of the e_greedy, boltzmann and ucb policies][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxNjg1MjY1_size_16_color_FFFFFF_t_70]

The figure shows that the UCB policy achieved the best result in this run.

The source code is available at: [https://github.com/hustCYQ/RL/blob/main/KBgame.py][https_github.com_hustCYQ_RL_blob_main_KBgame.py]

[pi_a_s_ _leftarrow _begin_cases_ 1-_epsilon_frac_epsilon_A_s_ if a_argmax_a Q_s_a_ _frac_epsilon_A_s_ if a _neq argmax_a Q_s_a_ _end_cases]: https://latex.codecogs.com/gif.latex?%5Cpi%28a%7Cs%29%20%5Cleftarrow%20%5Cbegin%7Bcases%7D%201-%5Cepsilon&plus;%5Cfrac%7B%5Cepsilon%7D%7B%7CA%28s%29%7C%7D%20if%20a%3Dargmax_a%20Q%28s%2Ca%29%5C%5C%20%5Cfrac%7B%5Cepsilon%7D%7B%7CA%28s%29%7C%7D%20if%20a%20%5Cneq%20argmax_a%20Q%28s%2Ca%29%5C%5C%20%5Cend%7Bcases%7D
[p_a_i_frac_exp_Q_a_i_tau_sum_j exp_Q_a_j_tau]: https://latex.codecogs.com/gif.latex?p%28a_%7Bi%7D%29%3D%5Cfrac%7Bexp%28Q%28a_%7Bi%7D%29/%5Ctau%29%7D%7B%5Csum_j%20exp%28Q%28a_%7Bj%7D%29/%5Ctau%29%7D
[tau]: https://latex.codecogs.com/gif.latex?%5Ctau
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxNjg1MjY1_size_16_color_FFFFFF_t_70]: /images/20221022/200555bed83e406f834a9afffba62333.png
[https_github.com_hustCYQ_RL_blob_main_KBgame.py]: https://github.com/hustCYQ/RL/blob/main/KBgame.py