MCTS Parameters
TAG currently supports the following main variants/parameterisations of MCTS within MCTSParams
(as of August 2024). (There are a few more technical ones; see the MCTSParams code for details.)
Given the options, it is recommended that an MCTS agent be defined by a JSON file, in which values for each of these can be specified. For more details and examples of specifying an MCTS agent in JSON, see this documentation.
- Tree Selection Stage
- The tree selection policy to use, treePolicy
- The policy that other players should use in the tree opponentTreePolicy
- UCB exploration constant K
- First Play Urgency FPU
- The exploration constant to use for some (non-UCB) tree policies, exploreEpsilon
- Information Set handling, information
- The final action selection policy, selectionPolicy
- Progressive Bias options, progressiveBias
- Tree Reuse between decisions, reuseTree
- Expansion Stage
- The maximum depth to grow the tree maxTreeDepth
- Heuristic to use to choose action expansion order actionHeuristic
- Progressive Widening options progressiveWidening
- Rollout Stage
- The maximum length of each rollout rolloutLength
- When to terminate each rollout rolloutTermination
- The rollout policy to use, rolloutType
- Any explicit opponent model to use, opponentModel
- Move Average Sampling (MAST), MAST
- Backup Stage
- The heuristic to use when valuing a leaf state, heuristic
- Paranoia, paranoid
- The backup policy to use, backupPolicy.
There are also parameters to set the budget available to the agent, in terms of elapsed time, iterations or calls to the forward model; see Computational Budget at the end of this page.
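As a starting point, a complete agent definition might look like the sketch below. The top-level "class" entry and the players.mcts.MCTSParams name follow the same convention as the JSON fragments later on this page, but are assumptions here rather than a definitive format; check the JSON documentation linked above. The parameter values are purely illustrative.
    {
        "class" : "players.mcts.MCTSParams",
        "K" : 1.0,
        "rolloutLength" : 10,
        "maxTreeDepth" : 100,
        "information" : "Information_Set",
        "selectionPolicy" : "ROBUST",
        "budgetType" : "BUDGET_FM_CALLS",
        "budget" : 4000
    }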
K
The weighting to give to the exploration term in the UCB equation. This should be scaled to the game score: if the result of a game is 1 for a win, and 0 or -1 for a loss, then a value of about 1.0 or sqrt(2) is the standard choice.
This scaling to the score will take place automatically if the normaliseRewards parameter is true (which is the default). In this case the largest and smallest final scores seen on all rollouts are tracked during each MCTS iteration, and used to scale all future rewards to the range [0, 1].
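In other words, if rmin and rmax are the smallest and largest final scores observed so far, each reward r is in effect rescaled as:
\[\tilde{r} = \frac{r - r_{\min}}{r_{\max} - r_{\min}}\]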
rolloutLength
The maximum number of actions to take during the rollout phase of MCTS, i.e. once we leave the tree.
rolloutLengthPerPlayer
This defaults to false. If true, then the rolloutLength parameter is interpreted per player, i.e. in a 3-player game the actual rollout length will be 3 x rolloutLength. This is useful for automatically scaling the rollout if the number of actions in a game is roughly proportional to the number of players.
rolloutTermination
This has five possible values. If rolloutLength is greater than zero, then the rollout will run for this number of actions, and then terminate once the specified condition is met:
- DEFAULT: Terminate as soon as the rolloutLength is reached.
- END_TURN: Terminate only at the end of our turn ('our' refers to the root player).
- END_ROUND: Terminate only at the end of a full game round.
- END_ACTION: Terminate only when we have finished a sequence of actions, and the next action is to be taken by another player. This is distinct from END_TURN in games where there are interrupt actions, for example.
- START_ACTION: Terminate only when opponents have finished a sequence of actions, and the next action is to be taken by us.
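For example, the following fragment (values purely illustrative) gives rollouts of up to 10 actions per player, continuing beyond that limit until the root player's turn ends:
    "rolloutLength" : 10,
    "rolloutLengthPerPlayer" : true,
    "rolloutTermination" : "END_TURN"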
maxTreeDepth
The maximum depth to grow the tree. If an MCTS iteration reaches this depth, then it will not expand any nodes beyond this level, and will move to the rollout phase.
information
This has three possible values:
- Open_Loop A perfect information game is assumed, i.e. that the copy of the game state received by the agent is the actual state and is known by all players. In this case the set of actions taken from the root defines the current state from the perspective of MCTS, but allows for the underlying state to be different due, for example, to a stochastic environment or actions of other players that are not tracked in the tree. On every iteration of MCTS the forward model is applied to the game state to advance it from the root through the tree.
- Closed_Loop As Open_Loop, but further assumes that the environment is deterministic, and that the same sequence of actions will always lead to exactly the same underlying state (as might be true in Chess, Go, or TicTacToe). Hence the forward model is not applied as we descend the tree. When a node is expanded, the state stored on its parent node is copied, and the forward model applied once to generate the new state for the expanded node, which is then stored. This approach significantly reduces the number of forward model calls made compared to Open_Loop, but will generally perform less well, despite the extra iterations, if the game is stochastic.
- Information_Set A simple form of Information Set MCTS is used. At the start of each iteration the root state is 'redeterminised' before the tree phase starts. Redeterminisation shuffles all the unknown information in the state to give a valid full state - or, if you prefer, it samples a full state from the current Information Set (i.e. one consistent with everything that we can see). Otherwise this is as for Open_Loop, with the path from the root through the tree defined by the sequence of actions taken. This is implemented using the AbstractGameState.copy(playerId) method, which should shuffle any decks we can't see. See the reference below for more details.
Whitehouse, Daniel, Edward J. Powley, and Peter I. Cowling. 2011. 'Determinization and Information Set Monte Carlo Tree Search for the Card Game Dou Di Zhu'. In 2011 IEEE Conference on Computational Intelligence and Games (CIG'11), 87–94. Seoul, Korea (South): IEEE. link.
selectionPolicy
This determines how the final action is selected after MCTS is complete.
- ROBUST will pick the action from the root node that has been visited most often.
- SIMPLE will pick the action from the root node that has the highest expected score (ignoring any exploration term).
EXP3 and Regret Matching (see treePolicy below) use their own specific final selection policies, and ignore this parameter.
treePolicy
This determines how MCTS makes decisions as it descends the tree. The supported options are:
- UCB The default Upper Confidence Bound for Trees algorithm:
\[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}}\]
Where K is the exploration constant, N is the total number of iterations through the node, and n_a is the number of times action a has been picked from this node. X̄_a is the mean value of taking action a from the node.
- UCB_Tuned This modifies the UCB formula to consider the observed variance of the rewards from rollouts. If this variance is low, then exploration will be reduced compared to UCB. The second reference below found this to be the best from several tested.
\[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}\min\left(\frac{1}{4},\ \sigma^2 + \sqrt{\frac{2\log(N)}{n_a}}\right)}\]
Auer, Peter, Cesa-Bianchi, Nicolo and Fischer, Paul, 2002. 'Finite-time analysis of the multiarmed bandit problem'. Machine Learning 47 (2-3): 235-256. link
Perick, Pierre, David L. St-Pierre, Francis Maes and Damian Ernst, 2012. ‘Comparison of different selection strategies in Monte-Carlo Tree Search for the game of Tron’. 2012 IEEE Conference on Computational Intelligence and Games (CIG), pp 242-249
- AlphaGo Uses a slight tweak on the exploration part of the UCB formula.
\[V(a)=\bar{X}_a + K\sqrt{\frac{N}{1 + n_a}}\]
This replacement of the logarithmic term is the form used in AlphaGo and AlphaZero. It is inspired by:
Auger, David, Adrien Couetoux, and Olivier Teytaud. ‘Continuous Upper Confidence Trees with Polynomial Exploration–Consistency’. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I 13, 194–209. Springer, 2013.
- EXP3 Uses the EXP3 bandit algorithm. This constructs a SoftMax distribution over all possible actions using:
\[V(a)=\frac{e^{\bar{X}_a / T}}{Z}\]
Where Z is the required normalisation constant and T is the temperature of the distribution. T is set using the exp3Boltzmann parameter, and defaults to 0.1. EXP3 will always make its final recommendation for an action at the root node (ignoring selectionPolicy) by sampling from a distribution over the number of visits made to each action, after amending this to ignore actions taken as part of explicit exploration (exploreEpsilon).
- RegretMatching Uses Regret Matching as the bandit algorithm. For details on this (plus the EXP3 algorithm) see:
Lisy, Viliam. 2014. ‘Alternative Selection Functions for Information Set Monte Carlo Tree Search’. Acta Polytechnica 54 (5): 333–340. link
RegretMatching makes a final recommendation at the root node by sampling from a probability distribution constructed from the average Regret Matching policy at the root node over all MCTS iterations (see reference above for more background), and ignores the selectionPolicy parameter.
- Uniform This just takes actions uniformly at random. This is not quite as silly as it sounds, and can be useful if combined with a Max-backup of some kind.
- Greedy This uses an ε-greedy policy. With probability exploreEpsilon a random action is taken; otherwise the action with the best average reward so far is taken.
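For instance, an EXP3 tree policy together with its associated parameters (values purely illustrative) could be specified as:
    "treePolicy" : "EXP3",
    "exp3Boltzmann" : 0.1,
    "exploreEpsilon" : 0.1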
FPU
This sets the First Play Urgency, and defaults to Infinity. It provides the value used as the exploration term for an action that has not yet been tried (see the right-hand side of the UCT equation above). The default ensures that all actions are tried once before any is tried a second time. Setting this to a lower value can be helpful in large action spaces, as it means that after trying some of the actions we may revisit the better ones before working through the remaining untried actions.
exploreEpsilon
This is only used for the Greedy, EXP3 and RegretMatching tree policies. It defines an ε-greedy exploration policy: with probability exploreEpsilon a random decision is taken at each node instead of using the Greedy, EXP3 or Regret Matching formulae.
opponentTreePolicy
This indicates how the search tree is built, and in particular how opponent actions are included. This has a few possible values:
- OneTree This is the default, and maintains a vector of rewards/node values during back-propagation so that each node records the value of a decision to each player independently. Hence a decision at a node is made just using the value to the deciding player. This is a much less conservative, and potentially more realistic, approach than Paranoid (especially in non-zero-sum games, or with 3+ players).
- SelfOnly This ignores other players completely, and only constructs the MCTS tree using our actions. Hence on each node transition there are implied, but unrecorded, opponent actions taken using the specified opponent model (which by default is a random policy). This has the advantage of exploring much deeper into the game for a given budget, especially if the branching factor of opponent actions is high. The downside is that it cannot account for opponent actions that could interfere with the detailed forward plan being constructed.
- MultiTree This constructs a 'SelfOnly' tree for each player to improve opponent modelling during MCTS search. Instead of using an explicit opponent model to take actions for the other players, these make decisions using their own trees. Each iteration will traverse all trees in parallel, and add one node to each. This retains the benefit of deeper search with an improved implicit opponent model compared to 'SelfOnly', but it cannot learn any action conditionalities. For details of the MultiTree algorithm, see
Goodman, James, Diego Perez-Liebana, and Simon Lucas. “MultiTree MCTS in Tabletop Games.” 2022 IEEE Conference on Games (CoG). IEEE, 2022. link
- OMA 'Opponent Move Abstraction'. This constructs a full tree as for MaxN, but also maintains statistics at each node as if SelfOnly were being used (i.e. it abstracts out the opponent moves). In the selection phase a policy is used that weights the two sets of statistics. A further parameter, omaVisits, controls the effective number of visits to use for the SelfOnly-like statistics.
- OMA_All As for OMA, but this also applies the abstraction to the other players in the tree. OMA only uses the abstraction for the root player's moves. For Opponent Move Abstraction, see
Baier, Hendrik, and Michael Kaisers. “Guiding multiplayer mcts by focusing on yourself.” 2020 IEEE Conference on Games (CoG). IEEE, 2020. link
- MCGS This enables a form of Graph Search using a Transposition Table approach. The previous methods all use an 'Open Loop' algorithm that defines the current node purely by the sequence of actions taken to reach it from the root node (and this only includes player actions, not actions attributable to the game, such as the precise card drawn from a shuffled deck). MCGS instead maintains a hashmap of all states seen so far with their MCTS statistics. After taking an action the new state is looked up in the hashmap: if it exists then we have the new node; if it does not then we expand by adding it to the hashmap with its first visit. This requires an additional parameter, MCGSStateKey. This must be a class that implements the IStateKey interface, which determines how a state is converted into a key for the hashmap, and hence how much generalisation between states takes place.
- MCGSSelfOnly This is as MCGS, but only models the actions of the acting agent. Opponent actions are modelled using a fixed opponent policy.
paranoid
The default of false means that opponents are modelled as optimising their own result (win/score). If true, then opponents are modelled in the search tree as trying to minimise the acting agent's result. This 'paranoid assumption' can lead to much more conservative play, but is also potentially less exploitable and works well in some games.
reuseTree
If set to true then any parts of the search tree from the previous decision that are still relevant will be retained. This is not always helpful, as it can take time to trim the tree to the useful parts, and, more importantly, with hidden information we may want to avoid polluting the tree data with results from invalid determinisations. (This is because in open loop search nodes are defined by the player actions that lead to them, and may therefore contain a mixture of actual world/information states depending on cards drawn or dice rolled, as these game-driven random events do not affect the tree structure.)
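A fragment combining these options (purely illustrative) might be:
    "opponentTreePolicy" : "MultiTree",
    "paranoid" : false,
    "reuseTree" : true
MultiTree is used here because it needs no additional parameters; MCGS would also require an MCGSStateKey entry naming an IStateKey implementation.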
rolloutType
The rollout policy to be used by the root player (for opponent policies see opponentModel below). This is defined with different values of rolloutType, which can then be parameterised further:
- RANDOM No further parameterisation possible. This uses the default random rollout policy.
- CLASS This requires a class that supports the AbstractPlayer interface, which will be used as the rollout policy. This class is specified by the rolloutClass parameter, and must be the fully qualified class name. The class can either use a no-argument constructor, or a constructor with a single String argument. In the latter case the argument is separated from the class name by a '|', for a full format of rolloutClass=package.className|constructorArg.
- PARAMS This adds more flexibility than CLASS, and allows much more detailed specification of the rollout policy. A further parameter, rolloutPolicyParams, must be specified as a full JSON fragment defining the policy (which must be a class that implements both AbstractPlayer and ITunableParameters). An example makes this clearest:
"rolloutType" : "PARAMS",
"rolloutPolicyParams" : {
"class" : "players.simple.OSLAHeuristic",
"heuristic" : {
"class" : "games.loveletter.LoveLetterHeuristic",
"COUNTESS_PLAY_THRESHOLD" : 1.0,
"FACTOR_CARDS" : 0.5,
"FACTOR_AFFECTION" : 1.0
},
},
- MAST This will use Move Average Sampling to define the rollout policy. This maintains an average of the reward obtained for every action played on this and previous iterations of MCTS. This can then be used to define a softmax policy using these action value estimates. For details of the MAST parameters see later on this page.
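For comparison, a CLASS rollout could be specified as below; players.simple.RandomPlayer is used purely as an example of a fully qualified class name, and any other AbstractPlayer implementation could be substituted:
    "rolloutType" : "CLASS",
    "rolloutClass" : "players.simple.RandomPlayer"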
opponentModel
oppModelType has exactly the same parameterisation options as rolloutType. It is used when an opponent decision is needed during a rollout, or if the Opponent Tree Policy is 'SelfOnly'.
If a more sophisticated opponent model is needed, then opponentModel can specify (via a JSON fragment) any class that implements AbstractPlayer. For example, a hand-crafted agent could be written, and then used as an opponent model with:
"opponentModel" : {
"class" : "games.dominion.players.BigMoney"
},
heuristic
This defines the heuristic function used to evaluate the leaf state reached at the end of each iteration (if this is not a game-end state), whose value is then back-propagated up the tree. This needs to be a JSON fragment that defines a class implementing IStateHeuristic and optionally ITunableParameters (if further parameters need to be specified). Two examples are shown below:
"heuristic" : {
"class" : "games.loveletter.LoveLetterHeuristic",
"COUNTESS_PLAY_THRESHOLD" : 1.0,
"FACTOR_CARDS" : 0.5,
"FACTOR_AFFECTION" : 1.0
},
"heuristic" : {
"class" : "players.heuristics.WinOnlyHeuristic"
},
The default heuristic used is the game-specific getHeuristicScore() method.
actionHeuristic
This is used to define a heuristic that returns an estimate of the value of a state/action pair. It must implement the IActionHeuristic interface. If one is supplied then it will be used to define the order of expansion of unvisited actions (which otherwise defaults to a random order).
Additionally, if specified, it can be used in any of visit initialisation, progressive widening, progressive bias or pUCT:
visitInitialisation
This initialises new nodes with a 'previous' number of visits at the heuristic estimate, if the initialiseVisits parameter is set to a non-zero value. This is based on the approach from
Gelly, Sylvain and David Silver, 'Combining online and offline knowledge in UCT', in Proceedings of the 24th International Conference on Machine Learning - ICML '07, Corvalis, Oregon: ACM Press, 2007, pp. 273–280. doi: 10.1145/1273496.1273531.
progressiveWidening
Progressive Widening restricts which actions are available at a node, and requires an ordering over actions from 'best' to 'worst'. If we have N visits to the node so far, then the number of actions, A, that will be considered in the Expansion/Tree phases is given by:
\[A=\lfloor{W}N^{B}\rfloor\]
The two constants, W and B, are specified by progressiveWideningConstant and progressiveWideningExponent respectively. These both default to 0.0, which disables progressive widening. The original algorithm (reference below) also has a kinit parameter giving a minimum for A. This is always 0 here, so that in practice W defines the minimum for A, and means we do not 'unprune'.
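For example, with W = 2 and B = 0.5, a node visited N = 25 times would consider A = floor(2 x 25^0.5) = 10 actions, rising to 20 actions after 100 visits.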
Original citation for this form of progressive widening:
Couëtoux, Adrien, et al. ‘Continuous Upper Confidence Trees’. International Conference on Learning and Intelligent Optimization, Springer, 2011, pp. 433–45.
An oft-cited inspiration for progressive widening is Chaslot et al. 2008, but this does not use the form of the equation above, and actually implements the subtly different progressive unpruning:
Chaslot, Guillaume, Mark H. M. Winands, H. Jaap van den Herik, Jos Uiterwijk and Bruno Bouzy, 2008. Progressive strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation 4 (03) : 343-357.
progressiveBias
Progressive Bias adds an additional term to the selection equation that uses an external heuristic function (introduced in the Chaslot et al. paper mentioned above).
\[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}} + \frac{h(s,a)}{1 + n_a}\]
The implementation in TAG requires an actionHeuristic but deviates from this canonical form. A parameter biasVisits specifies how much weight to apply to the heuristic estimate of the action. The larger this is, the longer the effect of the heuristic will last. When the action has been taken a number of times equal to biasVisits, the weighting between the heuristic and the empirically observed results will be 50:50.
pUCT
This multiplies the exploration term in any selection policy that has one (UCB, UCB-Tuned, AlphaGo) by the probability that the actionHeuristic would play that action in the current state. This synergises with First Play Urgency. The result is to bias selection towards actions preferred by the heuristic, and this is the mechanism used in AlphaZero to integrate the learned policy into future search (the same idea was applied simultaneously by Anthony et al.).
To apply this, set pUCT to true, and then decide on the pUCTTemperature to use in the Boltzmann (softmax) policy that is generated from the actionHeuristic. If pUCTTemperature is set to zero, then the raw values from actionHeuristic will be used, on the assumption that it already outputs a probability distribution.
Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. ‘Mastering the Game of Go without Human Knowledge’. Nature 550, no. 7676 (18 October 2017): 354–59. https://doi.org/10.1038/nature24270.
Anthony, Thomas, Zheng Tian, and David Barber. ‘Thinking Fast and Slow with Deep Learning and Tree Search’. In Advances in Neural Information Processing Systems, 5360–70, 2017.
useMASTAsActionHeuristic
If this is set to true then MAST will be used as the actionHeuristic (and any specified actionHeuristic parameter will be ignored).
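Pulling the last few sections together, a fragment such as the following (values purely illustrative) would use MAST as the action heuristic, bias selection via pUCT, and initialise new nodes as if they had already been visited a few times:
    "useMASTAsActionHeuristic" : true,
    "pUCT" : true,
    "pUCTTemperature" : 1.0,
    "initialiseVisits" : 5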
MAST
The Move Average Sampling Technique (MAST) was introduced by:
Bjornsson, Yngvi and Hilmar Finnsson, 2009. ‘CadiaPlayer: A Simulation-Based General Game Player’. IEEE Transactions on Computational Intelligence and AI in Games 1 (1): 4-15.
On each MCTS iteration this tracks all the actions taken, and gives them equal credit for the final result of that iteration. This maintains for each action the average result obtained across all iterations that used that action. This can then be used as a game-independent heuristic to guide opponent models, rollout or expansion policies.
To use this in the Expansion phase, set the useMASTAsActionHeuristic parameter to true.
There are three MAST parameters that can be specified:
- MAST. Values of Rollout, Tree or Both are valid. Defaults to Rollout. This defines which actions are used to maintain the MAST-estimate for an action; just those used in the Rollout phase, Tree phase, or a combination of both.
- MASTGamma. Defaults to 0.5. This is the weight used for the MAST data after the previous MCTS iteration; i.e. the effective number of visits to each action is decayed by this factor. Setting this to 0.0 will only use data from the current iteration in MAST.
- MASTBoltzmann. Defaults to 0.1. This is the temperature value to be used in any softmax policy over actions defined by the MAST values (as rollout policy or opponent model).
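A MAST rollout could then be configured with a fragment along these lines (the values shown are the defaults noted above):
    "rolloutType" : "MAST",
    "MAST" : "Rollout",
    "MASTGamma" : 0.5,
    "MASTBoltzmann" : 0.1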
backupPolicy
This controls the back-propagation of values through the tree after each iteration. Possible values are:
- MonteCarlo This is the default, and back-propagates the observed reward. The statistics at each node for the mean value are therefore the simple average of all iterations that went through that node.
- Lambda This applies a temporal-difference style update. It requires the setting of an additional parameter, backupLambda. At each node the estimated value, R', passed up to the parent is interpolated between the estimated value from the child node, R, and the current statistics on the node. The value of the node is updated as normal as the mean of all visits, but these values will not be the pure Monte Carlo rewards.
- MaxLambda This modifies Lambda to approximate off-policy Q-Learning instead of on-policy Temporal Difference learning.
- MaxMC This modifies MonteCarlo in a similar way to approximate an off-policy update. During back-propagation at a node we consider whether the action taken would have been the one recommended by the final recommendation policy (i.e. was it the best one we could have taken given current information, or was it an exploratory action). If it was, then the estimated reward, R, from the child node is left unchanged. If however it was an exploratory action, then we update R towards the value we estimate would have been received if we had taken the 'best' action.
The weighting, γ, used in this update is determined via the maxBackupThreshold parameter, which needs to be set as well. This has two effects. Firstly, the MaxMC rule is not applied unless the node has had at least this number of visits, defaulting to MonteCarlo behaviour until then. Secondly, it sets the weighting of the observed reward against the estimated max: high values mean we apply a small max weighting.
Khandelwal, Piyush, Elad Liebman, Scott Niekum, and Peter Stone. ‘On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search’. In International Conference on Machine Learning, 1319–28, 2016.
Frydenberg, Frederik, Kasper R. Andersen, Sebastian Risi, and Julian Togelius. ‘Investigating MCTS Modifications in General Video Game Playing’. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), 107–13. Tainan, Taiwan: IEEE, 2015. https://doi.org/10.1109/CIG.2015.7317937.
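For instance, an off-policy style backup (values purely illustrative) could be requested with:
    "backupPolicy" : "MaxMC",
    "maxBackupThreshold" : 50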
Computational Budget
MCTS (and RHEA and RMHC) support three types of computational budget, with the following parameters:
budgetType
- BUDGET_TIME means we have N milliseconds of elapsed wall time to make each decision.
- BUDGET_FM_CALLS means we have N forward model calls to make each decision.
- BUDGET_ITERATIONS means we have N iterations of MCTS to make each decision.
The first two are the most common for comparison of different algorithms. For cross-game comparison it may be better to use a forward model budget, as different games can have widely varying forward model efficiencies.
budget
Sets the N used in budgetType above.
breakMS
When using BUDGET_TIME, this will break off processing this number of milliseconds before the timer actually elapses. This is to cater for games that disqualify an entrant if they take longer than the budgeted time to make a decision. If disqualification is not a concern, just set this to zero.
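A typical time-budgeted configuration (values purely illustrative) would therefore be:
    "budgetType" : "BUDGET_TIME",
    "budget" : 40,
    "breakMS" : 5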