赞
踩
【原文作者及来源:Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.】
【此译文由COCO主要完成,对MarkDown编辑器正在熟悉过程中,因此,文章中相关公式存在问题,请见谅】
【原文】A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
【翻译】人工智能的长期目标是后天自主学习,并且在一些具有挑战性的领域中实现超人的算法。最近,AlphaGo成为第一个在围棋中击败人类世界冠军的程序。AlphaGo的树搜索使用深度神经网络来评估棋局和选定下棋位置。神经网络是利用对人类专业棋手的移动进行监督学习,同时通过自我博弈进行强化学习来进行训练的。在这里,我们引入了一种没有人类的数据、指导或超越游戏规则的领域知识的、基于强化学习的算法。AlphaGo成为了自己的老师:神经网络被训练用来预测AlphaGo自己的落子选择和胜负。这种神经网络提高了树搜索的强度,从而提高了落子选择的质量和在下一次迭代中的自我博弈能力。从零开始,我们的新程序“AlphaGo Zero”取得了“超人”的成绩,以100比0战胜了的此前公布的AlphaGo版本(指代和李世石对弈的AlphaGo)。【翻译】我们现在的程序AlphaGo Zero,与AlphaGo Fan和AlphaGo Lee存在以下几点的差异。首先,它完全由自我博弈强化学习进行训练,从刚开始的随机博弈开始,就没有任何监督或使用人类的数据。第二,它只使用棋盘上的黑白棋作为输入特征。第三,它使用单一的神经网络,而不是分离的策略网络和价值网络。最后,它使用了一个简化版搜索树,这个搜索树依靠单一的神经网络进行棋局评价和落子采样,不执行任何蒙特卡洛rollout。为了实现上述结果,我们引入一个新的强化学习算法,在训练过程中完成前向搜索,从而达到迅速的提高以及精确、稳定的学习过程。在搜索算法、训练过程和网络架构方面更多的技术差异在方法中进行了描述。
【原文】Reinforcement learning in AlphaGo Zero,Our new method uses a deep neural network with parameters θ. This neural network takes as an input the raw board representation S of the position and its history, and outputs both move probabilities and a value, . The vector of move probabilities p represents the probability of selecting each move a (including pass) . The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities(see Methods).
【翻译】我们在AlphaGo Zero的强化学习中,法使用一个参数为θ的深度神经网络。该神经网络将棋局和其历史的原始图作为输入,输出落子概率和价值 。落子概率向量p代表选择每个落子动作a(包括放弃行棋)的概率, 。价值v是标量评估,估计当前玩家在棋局状态为s时获胜的概率。这个神经网络将策略网络和价值网络合并成一个单一的体系结构。神经网络包括许多残差块、批量归一化和整流器非线性的卷积层。【翻译】AlphaGo Zero的神经网络是通过新的强化学习算法利用自我博弈训练出来的。在每一个棋局s,通过神经网络 的指导来执行蒙特卡洛搜索。MCTS搜索输出每次落子的概率分布π。经过搜索后的落子概率通常比神经网络 输出的落子概率p更强,因此MCTS被看作是一个强大的策略改进算法。带有搜索的自我博弈——采用改进的以MCTS为基础的策略来选择的每一次落子,然后用游戏的赢家z作为价值的样本——可以被看作是一个强有力的策略评估运算符。我们采用的强化学习算法的主要思想是在策略迭代过程中反复地利用这些搜索算子(文章research19 october2017 |卷550 |自然| 355);神经网络的参数被更新,使移动概率值 更紧密地与改进的搜索概率和自我博弈的赢家(π,z)相配;这些新的参数用于下一次的自我博弈迭代,以使搜索更强大。图1展示了自我博弈的训练流程。
【原文】a, The program plays a game s1, ..., sT against itself. In each position st, an MCTS αθ is executed (see Fig. 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at∼πt. The terminal position st is scored according to the rules of the game to compute the game winner z.
【翻译】a.这个程序进行自我博弈s1, ..., sT。在每个棋局st,执行一个使用最新的神经网络fθ 的MCTS αθ(见图2)。根据MCTS计算的搜索概率来选择落子,at∼πt。根据游戏规则在最终的棋局st记分,来计算比赛的胜出者z。
【原文】b, Neural network training in AlphaGo Zero. The neural network takes the raw board position st as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector pt, representing a probability distribution over moves, and a scalar value vt, representing the probability of the current player winning in position st. The neural network parameters θ are updated to maximize the similarity of the policy vector pt to the search probabilities πt, and to minimize the error between the predicted winner V t and the game winner z (see equation (1)). The new parameters are used in the next iteration of selfplay as in a.
【翻译】b,AlphaGo Zero中的神经网络训练。神经网络以原始棋盘状态st作为输入,通过参数为θ的多个卷积层,输出代表落子概率分布的向量pt,和一个表示当前玩家在棋局状态st处胜率的标量值vt。神经网络参数θ朝着使策略矢量pt与搜索概率πt相似度最大化的方向更新,同时最大限度地减少预测赢家vt和游戏赢家z之间的误差(见公式(1))。如a所示,在下一次迭代中使用新的参数。
【原文】The MCTS uses the neural network f θ to guide its simulations (see Fig. 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s, a) +U(s, a), where U(s, a) ∝P(s, a) / (1 +N(s, a)) (refs 12, 24), until a leaf node s′ is encountered. This leaf position is expanded and evaluated only once by the network to generate both prior probabilities and evaluation Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action value to the mean evaluation over these simulations, ,where s, a→s′ indicates that a simulation eventually reached s′after taking move a from position s.
【翻译】MCTS采用神经网络 来指导它的模拟(见图2)。搜索树中的每个边(s,a)存储先验概率p(s,a)、访问次数n(s,a)和一个动作价值Q(s,a)。每次模拟从根开始,反复选择落子,使置信上限Q(s,a)+ U(s,a)最大化,其中U(s,a)∝P(s,a)/(1 + N(s,a))(参考文献12, 24),直到遇到叶节点s′。叶子的位置被扩展,通过网络对该叶子的棋局进行扩展和评估,产生先验概率和价值 。在模拟中的每条边(s,a)被更新,访问数量N(s,a)增加,并且将其动作值更新为对这些模拟的平均评价, ,其中s,a→s’表示在从位置s移动a之后,模拟最终达到s’。
【原文】Figure 2 | MCTS in AlphaGo Zero.
【翻译】图二 AlphaGo Zero的MCTS搜索
【原文】a, Each simulation traverses the tree by selecting the edge with maximum action value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed).
【翻译】a,每个模拟通过选择具有最大动作值Q,加上取决于存储的先验概率p和该边的访问计数n的一个置信区间上限u(当遍历的时候递增)的边来对树进行遍历。
【原文】b, The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·),V(s)) =fθ(s); the vector of P values are stored in the outgoing edges from s.
【翻译】b、叶节点的扩展和对应棋局s的评价是由神经网络(P(S,·)、V(S))= Fθ(S)完成的;p值的向量存储在从s出发的外向边中。
【原文】c, Action value Q is updated to track the mean of all evaluations V in the subtree below that action.
【翻译】c,更新动作价值Q,来跟踪那个落子动作下面的子树中所有评价V的平均值。
【原文】d, Once the search is complete, search probabilities π are returned, proportional to N1/τ, where N is the visit count of each move from the root state and τ is a parameter controlling temperature.
【翻译】d,一旦搜索完成后,返回搜索概率π,与N1 /τ成正比,其中N是从根开始的每个落子的访问次数,τ是温度控制参数。
【原文】MCTS may be viewed as a selfplay algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π=αθ(s), proportional to the exponentiated visit count for each move, πa∝N(s, a)1/τ, where τ is a temperature parameter.
【翻译】蒙特卡洛可以看作是一个自我博弈算法,给出了神经网络参数θ和根的棋局状态s,计算搜索概率推荐的移动向量π=αθ(s),它与每次落子动作的访问计数的指数成正比,π∝N(S,A)1 /τ,其中τ是温度参数。
【原文】The neural network is trained by a selfplay reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights θ0. At each subsequent iteration i≥ 1, games of selfplay are generated (Fig. 1a). At each timestep t, an MCTS search is executed using the previous iteration of neural network θ−fi1 and a move is played by sampling the search probabilities πt. A game terminates at step T when both players pass, when the search value drops below a resignation threshold or when the game exceeds a maximum length; the game is then scored to give a final reward of rT∈ {−1,+1} (see Methods for details). The data for each timestep t is stored as , where zt=±rT is the game winner from the perspective of the current player at step t. In parallel (Fig. 1b), new network parameters θi are trained from data sampled uniformly among all timesteps of the last iteration(s) of selfplay. The neural network is adjusted to minimize the error between the predicted value v and the selfplay winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the meansquared error and crossentropy losses, respectively:
where c is a parameter controlling the level of L weight regularization (to prevent overfitting).
【翻译】神经网络通过自我强化学习进行训练,该强化学习算法使用MCTS计算每个落子动作。首先,神经网络初始化为随机权重θ0 。在随后的每次迭代i≥1时,产生了自我博弈(图1a)。在每一个时间步t,利用上一次迭代的神经网络 执行MCTS搜索 ,并且通过概率分布πt 进行采样来落子。当双方放弃行棋时,或者当搜索值低于阈值,或者当比赛超过最大长度时,比赛终止于步骤T;然后为比赛计分,给予奖励 rT∈ {−1,+1}(见方法细节)。每个时间步t的数据存储为 ,其中zt=±rT 是在步骤t从当前玩家的视角来看的赢家。并行地(图1b),新的网络参数θi 利用数据 进行训练,数据是从自我博弈的上一次迭代的所有时间步中均匀取样的。调整神经网络 ,使预测值v和自我博弈的赢家z之间的误差最小,并且最大限度地提高神经网络移动概率p与搜索概率π的相似度。具体来说,通过使用对均方误差和交叉熵损耗求和的损失函数l,利用梯度下降来调整参数θ:
其中,c是一个控制L2权重正则化水平的参数(防止过拟合)。
【原文】Empirical analysis of AlphaGo Zero training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behavior and continued without human intervention for approximately three days. Over the course of training, 4.9 million games of selfplay were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 minibatches of 2,048 positions. The neural network contained 20 residual blocks.
AlphaGo Zero训练的实验分析
【翻译】应用我们的强化学习流程来训练AlphaGo Zero。训练从完全随机的落子开始,在没有人工干预的情况下持续大约三天。在训练过程中, 每次MCTS使用1600次模拟,生成了490万场自我博弈,每次落子使用约0.4s的思考时间。使用大小为2048的700000个小批量更新参数。神经网络包含20个残差块。
【原文】Figure 3a shows the performance of AlphaGo Zero during selfplay reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting that have been suggested in previous literature Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs) 29, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).
【翻译】图3a显示了以训练时间为横轴,使用ELO评分规则时AlphaGo Zero在自我博弈强化学习期间的性能。在整个训练期间学习进展顺利,并没有遭受在相关文献中提及的振荡或灾难性的遗忘。令人惊讶的是,在仅训练36小时之后,AlphaGo Zero就超过了AlphaGo Lee的性能,因为AlphaGo Lee训练了几个月。训练72小时后,我们评估AlphaGo Zero,让他和在首尔打败过李世石的AlphaGo Lee使用2小时控制时间和比赛环境下进行比赛。AlphaGo Zero使用具有4个TPU的单机,而AlphaGo Lee则是分布在许多机器上,并且使用48个TPU。AlphaGo Zero以100比0击败AlphaGo Lee(参见扩展数据图1和补充资料)。
【原文】Figure 3 | Empirical evaluation of AlphaGo Zero.
【翻译】 图三 AlphaGo Zero的实证评价
【原文】a, Performance of selfplay reinforcement learning. The plot shows the performance of each MCTS player αθi from each iteration i of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 s of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS dataset, is also shown.
【翻译】a,自我博弈强化学习的表现。图中显示了AlphaGo Zero强化学习在每次迭代i中MCTS $\alpha_{\theta_{i}}$αθi 的表现。通过与不同玩家的比赛,来评估ELO评级。在比赛中每次落子的思考时间为0.4秒(见方法)。为了对比,我们也展示出了使用KGS数据,由人类经验数据进行监督学习训练的模型。
【原文】b, Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network θfi, at each iteration of selfplay i, in predicting human professional moves from the GoKifu dataset. The accuracy measures the probability to the human move. The accuracy of a neural network trained by supervised learning is also shown.
【翻译】b、对人类棋手落子的预测精度。该图显示了在每一次自我博弈迭代i中,神经网络 根据KGS数据集预测人类棋手落子的准确性。通过监督学习的神经网络的训练精度也显示在图中。
【原文】c, Meansquared error (MSE) of human professional game outcomes. The plot shows the MSE of the neural networkθfi, at each iteration of selfplay i, in predicting the outcome of human professional games from the GoKifu dataset. The MSE is between the actual outcome z∈ {− 1, +1} and the neural network value v, scaled by a factor of 14 to the range of 0–1. The MSE of a neural network trained by supervised learning is also shown.
【翻译】c,在人类职业比赛结果上的均方误差(MSE)。该图显示了在每一次自我博弈迭代i中,神经网络 从gokifu数据中预测人类职业比赛结果的MSE。MSE是在实际结果z∈ {− 1, +1} 和神经网络的价值v,按1/4的比例缩小到0 - 1的范围之间。图中还显示出经过监督学习训练的神经网络的MSE。
【原文】To assess the merits of selfplay reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved stateoftheart prediction accuracy compared to previous work12,30–33 (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the selflearned player performed much better overall, defeating the humantrained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.
【翻译】为了评估自我博弈强化学习相对于使用人类数据进行学习的优势,我们训练了第二个神经网络(使用相同的架构)来预测在KGS服务器数据上人类专业棋手的落子动作,取得了与以前的工作(12,30–33)相比更准确的预测精度(当前和以前的结果分别参见扩展数据表1 和2)。监督学习在一开始获得了非常好的性能,并且更好地预测了人类棋手的动作(图3)。但值得注意的是,虽然监督学习取得了较高的落子预测精度,但是总体而言,这个自学的棋手表现更好,在经过24小时的训练后击败了用人类数据进行训练的程序。这表明,AlphaGo Zero可以学习到完全与人类不同的技能。
【原文】To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as were used in AlphaGo Lee, or combined policy and value networks, as used in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)), using a fixed dataset of selfplay games generated by AlphaGo Zero after 72 h of selfplay training. Using a residual network was more accurate, achieved lower error and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.
【翻译】为了将结构和算法的贡献分离,我们将AlphaGo Zero使用的神经网络体系结构的性能与AlphaGo Lee使用的神经网络结构进行了比较(见图4)。我们创建了四个神经网络,就像在AlphaGo Lee中那样,使用独立的策略网络和价值网络;或者使用AlphaGo Lee使用的卷积网络架构或AlphaGo Zero使用的残差网络架构。训练网络时都最大限度地减少相同的损失函数(方程(1)),使用的数据集是AlphaGo Zero在72小时的自我博弈训练后产生的固定数据集。利用残差网络更准确,使AlphaGo 达到较低的错误率和性能的改进,达到了超过600Elo。将策略和价值合成一个单一的网络会轻微地降低落子预测精度,但同时降低了价值误差,并且使AlphaGo的性能提高大约600Elo。这是由于提高了计算效率,但更重要的是具有双重目的的网络成为支持多个案例的通用表示。
什麽是ELO?
ELO等级分制度是指由匈牙利裔美国物理学家 Arpad Elo创建的一个衡量各类对弈活动水平的评价方法,是当今对弈水平评估的公认的权威方法。
ELO怎麽产生的?
最早, ELO等级分制度是基于统计学的一个评估棋手水平的方法. 之后被广泛用于国际象棋、围棋、足球、篮球等运动。线上游戏英雄联盟、魔兽世界内的竞技对战系统也採用此分级制度. 现在不少Destiny网站也使用此统计系统.
【原文】Figure 4 | Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee
【翻译】 图4 AlphaGo Zero和AlphaGo Lee中神经网络结构的比较
【原文】Comparison of neural network architectures using either separate (sep) or combined policy and value (dual) networks, and using either convolutional (conv) or residual (res) networks. The combinations ‘dual–res’ and ‘sep–conv’ correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee, respectively. Each network was trained on a fixed dataset generated by a previous run of AlphaGo Zero.
【翻译】使用单独的(SEP)或联合的策略和价值(dual)网络的神经网络结构比较,以及使用卷积(conv)或残差(res)网络的比较。 ‘dual–res’和‘sep–conv’ 的组合分别与AlphaGo Zero 和 AlphaGo Lee中使用的神经网络结构相对应。每个网络在一个固定的数据集上进行训练,这个数据集是由AlphaGo Zero以前的运行产生的。
【原文】a, Each trained network was combined with AlphaGo Zero’s search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 s of thinking time per move.
b, Prediction accuracy on human professional moves (from the GoKifu dataset) for each network architecture.
c ,MSE of human professional game outcomes (from the GoKifu dataset) for each network architecture.
【翻译】a,每个训练过的网络与AlphaGo Zero的搜索相结合,来获得不同的程序。通过这些不同的程序之间的比赛来计算ELO评级。在比赛中,每次落子使用5秒的思考时间。
b, 每个网络架构对专业人类棋手的落子预测精度(使用gokifu数据集)。
c, 每个网络架构在人类专业职业比赛结果的MSE(使用gokifu数据集)。
【原文】Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its selfplay training process. This included not only fundamental elements of human Go knowledge, but also nonstandard strategies beyond the scope of traditional Go knowledge.
【翻译】AlphaGo Zero学习到的知识
AlphaGo Zero在自我博弈训练过程中发现了围棋的新境界。这不仅包括人类围棋知识的基本要素,而且还包括超出传统围棋知识范围之外的非标准策略。
【原文】Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast selfplay games played at different stages of training (see Supplementary Information). Tournament length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), lifeanddeath, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.
【翻译】图5显示了专业的定式(位于边角的序列上)被发现的时间(图5A和扩展的数据如图2所示);最终AlphaGo Zero使用了新的定式变种(图5B和扩展数据图3)。图5c显示了在不同的训练阶段进行的几次快速自我博弈的进行情况(参见补充信息)。在整个训练过程中定期进行的比赛长度在扩展数据图4和补充信息中显示。(在训练过程中一般游戏长度中都有一些间隔,这些间隔都显示在了额外数据4和补充信息中。)AlphaGo Zero迅速从“一块白板”走向成熟,对围棋概念有了深奥理解,包括布局(开放),手筋(战术),活和死,劫(重复的棋盘情况),官子(残局),提子比赛,森特(主动)(初始)、形态(成型)、影响和领土(占领),都能在第一时间迅速掌握。令人惊讶的是,shicho抓住了整个棋盘的序列——在人类学习围棋中比较早被人类掌握的围棋知识点,却在AlphaGo Zero训练比较晚的时候才掌握到。
【原文】Figure 5 | Go knowledge learned by AlphaGo Zero.
【翻译】 图5 AlphaGo Zero学习的围棋知识
【原文】a, Five human Joseki (common corner sequences) discovered during AlphaGo Zero training. The associated timestamps indicate the first time each sequence occurred (taking account of rotation and reflection) during selfplay training. Extended Data Figure 2 provides the frequency of occurence over training for each sequence.
【翻译】a,在AlphaGo Zero训练过程中的五个常见的角点序列。在自我博弈训练期间,相关的时间段显示了每个序列第一次形成的时间(考虑旋转和反射)。扩展数据图2提供了每个序列在训练中出现的频率。
【原文】b, Five joseki favoured at different stages of selfplay training. Each displayed corner sequence was played with the greatest frequency, among all corner sequences, during an iteration of selfplay training. The timestamp of that iteration is indicated on the timeline. At 10 h a weak corner move was preferred. At 47 h the 3–3 invasion was most frequently played. This joseki is also common in human professional play however AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 3 provides the frequency of occurence over time for all five sequences and the new variation.
【翻译】b,五定式在自我博弈训练的不同阶段被青睐的程度。在自我博弈训练的一次迭代中,在所有的角序列中,每一个显示的角序列都出现的频率最高。该迭代的时间戳在时间轴上表示。在10小时时,弱角移动是首选。在47小时时,3 - 3的入侵是最经常发生的。这个定式在人类职业比赛中也常见。不过AlphaGo Zero随后发现并偏向于这个新变化。扩展数据图3提供了所有五个序列和新变化随时间变化的频率。
【原文】c, The first 80 moves of three selfplay games that were played at different stages of training, using 1,600 simulations (around 0.4 s) per search. At 3 h, the game focuses greedily on capturing stones, much like a human beginner. At 19 h, the game exhibits the fundamentals of lifeanddeath, influence and territory.At 70 h, the game is remarkably balanced, involving multiple battles and a complicated ko fight, eventually resolving into a halfpoint win for white. See Supplementary Information for the full game.
【翻译】C,在不同训练阶段进行的三个自我博弈的前80步,每次搜索使用1600次模拟(大约0.4秒)。在3小时后,游戏专注于吃对方的棋子,就像人类初级棋手一样。在19小时时,游戏展现了死活、影响力和占领的基本方面,在70小时时,游戏非常平衡,包括多场战斗和复杂的劫战斗,最终白方以半目赢得胜利。有关完整游戏见补充信息。
【原文】We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGoMaster—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January2017 34 . In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUsand 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.
【翻译】我们通过内部比赛对AlphaGo Fan,AlphaGo Lee和几个以前的Go程序评估了全面训练的AlphaGo Zero。我们还让其对战现有最强的程序,AlphaGo Master——一个基于本文的算法和架构但利用人类数据和特征的算法(见方法)的程序,于2017年1月在网络游戏上击败了人类最强的职业选手60–0。在我们的评估中,所有程序都只允许使用5秒时间思考每次落子;AlphaGo Zero和AlphaGo Master每个在使用4个TPU的单一机器上进行;AlphaGo Fan和AlphaGo Lee分别分布在176个GPU和48个TPU上。我们还引入一个完全基于AlphaGo Zero原始神经网络的程序,该程序以最大的概率来选择落子。
【原文】Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.
【翻译】图6b显示了在Elo量表上每个程序的性能。没有使用任何前向搜索的原始神经网络,Elo评级为3,055。相比之下,AlphaGo Zero达到了5185的等级, AlphaGo Master达到了4858 等级,AlphaGo Lee达到了3739和AlphaGo Fan 达到了3144。
【原文】Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100game match with 2h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).
【翻译】最后,我们使用具有两小时控制时间的100场比赛对AlphaGo Zero和AlphaGo Master进行评估。AlphaGo Zero以89比11赢得了比赛(参见扩展数据图6和补充资料)。
【原文】Conclusion
Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without humanexamples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, comparedto training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin
【翻译】讨论
我们的研究结果证明,即便是在最具挑战性的领域中,单纯使用强化学习的方法也是完全可行的:没有人类实例或指导,没有基本规则之外的领域知识,训练达到超人的性能是完全可能的。此外,与通过人类棋手数据进行训练相比,单纯的强化学习方法只需要训练几个小时,并且可以取得更好的渐近性能。使用这种方法,AlphaGo Zero打败了AlphaGo 先前最强的版本,那个版本使用手工制作的特征,利用人类数据进行大幅度训练。
【原文】Evaluator. To ensure we always generate the best quality data, we evaluate each new neural network checkpoint against the current best network θ ∗ f before using it for data generation. The neural network θ f i is evaluated by the performance of an MCTS search α θ i that uses θ f i to evaluate leaf positions and prior probabilities (see Search algorithm). Each evaluation consists of 400 games, using an MCTS with 1,600 simulations to select each move, using an infinitesimal temperature τ→ 0 (that is, we deterministically select the move with maximum visit count, to give the strongest possible play). If the new player wins by a margin of > 55% (to avoid selecting on noise alone) then it becomes the best player α θ∗ , and is subsequently used for selfplay generation, and also becomes the baseline for subsequent comparisons.
【翻译】评估器。为了确保我们总是产生质量最好的数据,我们在使用神经网络生成数据前,会利用目前最好的网络 对每一个新的神经网络进行比较来做评估。MCTS搜索 通过 产生叶子节点棋局的评价和先验概率(见搜索算法),神经网络 的性能由 进行评估。每一个评价需要进行400次比赛,使用具有1600次模拟的MCTS来选择每一次落子,用无穷小的温度τ→0(也就是我们通过选择具有最大访问数的落子,来给出最可能的落子)。如果新的玩家获胜比率>55%(避免单独因为噪声进行选择),那它就成为了最好的 ,并随后用于进行自我博弈,也成为后来比较的基准。
【原文】Self-play. The best current player α θ ∗ , as selected by the evaluator, is used to generate data. In each iteration, α θ ∗ plays 25,000 games of selfplay, using 1,600 simulations of MCTS to select each move (this requires approximately 0.4 s per search). For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ→ 0. Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s 0 , specifically P(s, a) = (1 − ε)p a + εη a , where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves. In order to save computation, clearly lost games are resigned. The resignation threshold v resign is selected automatically to keep the fraction of false positives (games that could have been won if AlphaGo had not resigned) below 5%. To measure false positives, we disable resignation in 10% of selfplay games and play until termination.
【翻译】自我博弈。通过评估而被选择的目前最好的棋手 ,用于数据的生成。在每次迭代中, 进行25000场自我博弈,用1600次MCTS模拟来选择每次落子(每次搜索需要使用大约0.4秒选择落子)。在每场比赛的前30次落子中,温度τ设置为1:选择落子位置的概率与MCTS中的访问次数成比例,并保证遇到一组不同的棋局。在剩下的比赛中使用一个无限小的温度τ→0。为了保证有额外的探索,在根节点 的先验概率中加入Dirichlet噪声 P(s, a) = (1 − ε)p a + εη a ,其中 η ∼ Dir(0.03) 并且ε= 0.25;这种噪声确保所有落子都被尝试,但同时也会选择到不是很好的落子动作。为了节省运算量,显然会输的游戏被丢弃。丢弃阈值 v resign是自动选择的,以便保持误报率(AlphaGo如果没有丢弃游戏本来可以赢的概率)低于5%。为了测量误报率,我们在10%的自我博弈中没有执行丢弃。
【原文】Supervised learning. For comparison, we also trained neural network parameters θ SL by supervised learning. The neural network architecture was identical to AlphaGo Zero. Minibatches of data (s, π, z) were sampled at random from the KGS dataset, setting π a = 1 for the human expert move a. Parameters were optimized by stochastic gradient descent with momentum and learning rate annealing, using the same loss as in equation (1), but weighting the MSE component by a factor of 0.01. The learning rate was annealed according to the standard schedule in Extended Data Table 3. The momentum parameter was set to 0.9, and the L2 regularization parameter was set to c = 10 −4
【翻译】监督学习。作为对比,我们还通过监督学习训练了一个神经网络参数 。神经网络结构和AlphaGo Zero的是相同的。小批次数据 (s, π, z) 从KGS数据集中随机采样,将人类棋手的落子a设置为π a = 1 。参数和方程(1)一样,使用相同的损失,由动量和学习率退火的随机梯度下降算法进行优化,但通过一个参数0.01来衡量MSE。按照标准进度对学习速率进行的退火显示在扩展数据表3中。动量参数设置为0.9,L2正则化参数设置为c = 10-4。
【原文】By using a combined policy and value network architecture, and by using a low weight on the value component, it was possible to avoid overfitting to the values (a problem described in previous work 12 ). After 72 h the move prediction accuracy exceeded the state of the art reported in previous work 12,30–33 , reaching 60.4% on the KGS test set; the value prediction error was also substantially better than previously reported 12 . The validation set was composed of professional games from GoKifu. Accuracies and MSEs are reported in Extended Data Table 1 and Extended Data Table 2, respectively.
【翻译】通过使用策略网络和价值网络相结合的结构,并通过在价值组件中使用低权重,可以避免过拟合(在以往的工作中描述的问题)。72小时后,落子预测精度超过了在以往工作中得到的最好的程序,在KGS测试集上达到60.4%的预测率;价值预测误差也大大少于先前的报道。验证集是由GoKifu的专业比赛组成的。精度和均方误差分别显示在在扩展数据表1和表2中。
【翻译】搜索算法。相对于使用异步策略和价值MCTS算法(APVMCTS MCTS算法)的AlphaGo Fan和AlphaGo Lee,AlphaGo Zero使用了一个更简单的变种。
搜索树中的每个节点包含了边 ,其中,所有合法的动作 。每条边存储一组统计数据,
【原文】where N(s, a) is the visit count, W(s, a) is the total action value, Q(s, a) is the mean action value and P(s, a) is the prior probability of selecting that edge. Multiple simulations are executed in parallel on separate search threads. The algorithm proceeds by iterating over three phases (Fig. 2a–c), and then selects a move to play (Fig. 2d).
【翻译】其中 N(s, a)是访问计数,W(s, a) 是总动作值,Q(s, a) 是平均动作值,P(s, a) 是选择该条边的先验概率。在不同的搜索线程上并行执行多个模拟。该算法通过迭代超过三个阶段来进行(图2a - C),然后选择要进行的落子动作(图2d)。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。