MCTS Parameters

TAG currently supports the following main variants/parameterisations of MCTS within MCTSParams (as of August 2024). (There are a few more technical ones; see the MCTSParams code for details.) Given the options, it is recommended that an MCTS agent be defined by a JSON file, in which values for each of these can be specified. For more details and examples of specifying an MCTS agent in JSON, see this documentation.

There are also parameters to set the budget available to the agent, in terms of elapsed time, iterations or calls to the forward model; see Computational Budget below.
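For orientation, a minimal JSON definition of an MCTS agent might look like the fragment below. The parameter names are those documented on this page; the class value and the specific numbers are illustrative only and should be checked against the MCTSParams code for your version of TAG.

    {
        "class" : "players.mcts.MCTSParams",
        "K" : 1.0,
        "rolloutLength" : 10,
        "maxTreeDepth" : 100,
        "budgetType" : "BUDGET_TIME",
        "budget" : 40
    }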

K

The weighting to give the exploration term in the UCB equation. This should be scaled to the game score: if the result of a game is 1 for a win, and 0 or -1 for a loss, then a value of about 1.0 or sqrt(2) is the standard choice. This scaling to the score will take place automatically if the normaliseRewards parameter is true (which is the default). In this case the largest and smallest final scores seen across all rollouts are tracked as MCTS iterates, and used to scale all future rewards to the range [0, 1].
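As an illustration of this normalisation (a sketch of the standard min-max scaling that the description implies; the exact implementation may differ in detail), a reward X observed at the end of an iteration is rescaled using the smallest and largest final scores seen so far:

\[X_{norm}=\frac{X - X_{min}}{X_{max} - X_{min}}\]

With rewards scaled to [0, 1], a K of around 1.0 then keeps the exploitation and exploration terms on comparable scales.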

rolloutLength

The maximum number of actions to take during the rollout phase of MCTS, i.e. once we leave the tree.

rolloutLengthPerPlayer

This defaults to false. If true then the rolloutLength parameter is per player, i.e. in a 3 player game, the actual rollout length will be 3 x rolloutLength. This is useful for automatic scaling of the rollout if the number of actions in a game is roughly proportional to the number of players.

rolloutTermination

This has five possible values. If rolloutLength is greater than zero, then the rollout will run for this number of actions, and then terminate once the specified condition is met (an example configuration fragment follows the list):

DEFAULT : Terminate as soon as rolloutLength is reached.

END_TURN : Terminate only at the end of our turn (‘our’ refers to the root player).

END_ROUND : Terminate only at the end of a full game round.

END_ACTION : Terminate only when we have finished a sequence of actions, and the next action is to be taken by another player. This is distinct from END_TURN in, for example, games with interrupt actions.

START_ACTION : Terminate only when opponents have finished a sequence of actions, and the next action is to be taken by us.
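For example, a rollout that runs for up to 20 actions per player and then continues to the end of the current round could be configured with a fragment along these lines (values are illustrative):

    "rolloutLength" : 20,
    "rolloutLengthPerPlayer" : true,
    "rolloutTermination" : "END_ROUND",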

maxTreeDepth

The maximum depth to grow the tree. If an MCTS iteration reaches this depth, then it will not expand any nodes beyond this level, and will move to the rollout phase.

information

This has three possible values:

  1. Open_Loop A perfect information game is assumed, i.e. that the copy of the game state received by the agent is the actual state and is known by all players. In this case the set of actions taken from the root defines the current state from the perspective of MCTS, but allows for the underlying state to be different due, for example, to a stochastic environment or actions of other players that are not tracked in the tree. On every iteration of MCTS the forward model is applied to the game state to advance it from the root through the tree.

  2. Closed_Loop As Open_Loop, but further assumes that the environment is deterministic, and that the same sequence of actions will always lead to exactly the same underlying state (as might be true in Chess, Go, or TicTacToe). Hence the forward model is not applied as we descend the tree. When a node is expanded, the state stored on its parent node is copied, and the forward model applied once to generate the new state for the expanded node, which is then stored. This approach significantly reduces the number of forward model calls made compared to Open Loop - but will generally perform less well despite more iterations if the game is stochastic.

  3. Information_Set A simple form of Information Set MCTS is used. At the start of each iteration the root state is ‘redeterminised’ before the tree phase starts. Redeterminisation shuffles all the unknown information in the state to produce a valid full state - or, if you prefer, it samples a full state from the current Information Set (i.e. one consistent with everything that we can see). Otherwise this is as for Open_Loop, with the path from the root through the tree defined by the sequence of actions taken. This is implemented using the AbstractGameState.copy(playerId) method, which should shuffle any decks we can’t see. See the reference below for more details.

    Whitehouse, Daniel, Edward J. Powley, and Peter I. Cowling. 2011. ‘Determinization and Information Set Monte Carlo Tree Search for the Card Game Dou Di Zhu’. In 2011 IEEE Conference on Computational Intelligence and Games (CIG’11), 87–94. Seoul, Korea (South): IEEE. link.

selectionPolicy

This determines how the final action is selected after MCTS is complete.

ROBUST will pick the action from the root node that has been visited most often.

SIMPLE will pick the action from the root node that has the highest expected score (ignoring any exploration term).

EXP3 and Regret Matching (see treePolicy below) use their own specific final selection policies, and ignore this parameter.

treePolicy

This determines how MCTS makes decisions as it descends the tree. The supported options are:

  1. UCB The default Upper Confidence Bound for Trees algorithm:
\[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}}\]

Where K is the exploration constant, N is the total number of visits through the node, \(n_a\) is the number of times action a has been picked from this node, and \(\bar{X}_a\) is the mean value of taking action a from the node.

  2. UCB_Tuned This modifies the UCB formula to consider the observed variance of the rewards from rollouts. If this variance is low, then exploration will be reduced compared to UCB. The second reference below found this to be the best from several tested.

    \[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}\min\left(\frac{1}{4},\ \sigma_a^2+\sqrt{\frac{2\log(N)}{n_a}}\right)}\]

    Auer, Peter, Cesa-Bianchi, Nicolo and Fischer, Paul, 2002. ‘Finite-time analysis of the multiarmed bandit problem’. Machine Learning 47 (2-3): 235-256. link

    Perick, Pierre, David L. St-Pierre, Francis Maes and Damian Ernst, 2012. ‘Comparison of different selection strategies in Monte-Carlo Tree Search for the game of Tron’. 2012 IEEE Conference on Computational Intelligence and Games (CIG), pp 242-249

  3. AlphaGo Uses a slight tweak on the exploration part of the UCB formula.

    \[V(a)=\bar{X}_a + K\sqrt{\frac{N}{1 + n_a}}\]

This change from a log term is used in AlphaGo and AlphaZero. It is inspired by:

Auger, David, Adrien Couetoux, and Olivier Teytaud. ‘Continuous Upper Confidence Trees with Polynomial Exploration–Consistency’. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I 13, 194–209. Springer, 2013.

  4. EXP3 Uses the EXP3 bandit algorithm. This constructs a SoftMax distribution over all possible actions using:

    \[V(a)=\frac{e^{\bar{X}_a / T}}{Z}\]

Where Z is the required normalisation constant and T is the temperature of the distribution. T is set using the exp3Boltzmann parameter, and defaults to 0.1. EXP3 will always make a final recommendation for an action at the root node (selectionPolicy) by sampling from a distribution over the number of visits made to each action, after amending this to ignore actions taken as part of explicit exploration (exploreEpsilon).

  5. RegretMatching Uses Regret Matching as the bandit algorithm. For details on this (plus the EXP3 algorithm) see:

    Lisy, Viliam. 2014. ‘Alternative Selection Functions for Information Set Monte Carlo Tree Search’. Acta Polytechnica 54 (5): 333–340. link

RegretMatching makes a final recommendation at the root node that samples from a probability distribution constructed from the average Regret Matching policy at the root node over all MCTS iterations (see reference above for more background), and ignores the selectionPolicy parameter.

  6. Uniform This just takes actions uniformly at random. This is not quite as silly as it sounds and can be useful if combined with a Max-backup of some kind.

  7. Greedy This uses an ε-greedy policy. With probability given by exploreEpsilon a random action is taken, otherwise the action with the best average reward so far is taken.
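As an illustration, the tree policy and any associated constants (such as the exp3Boltzmann and exploreEpsilon parameters described above and below) are set alongside it in the agent's JSON; for example (values are illustrative only):

    "treePolicy" : "EXP3",
    "exp3Boltzmann" : 0.1,
    "exploreEpsilon" : 0.1,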

FPU

This sets the First Play Urgency, and defaults to Infinity. It provides the value used for the exploration term of an action that has not yet been tried (see the right-hand side of the UCT equation above). The default ensures that all actions are tried once before any is tried a second time. Setting this to a lower value can be helpful in large action spaces, as it means that after trying some of the actions we may revisit the better ones before all of the untried actions have been sampled.

exploreEpsilon

This is only used for Greedy, EXP3 and RegretMatching Tree Policies. It is used to define an ε-greedy exploration policy. I.e. with probability of exploreEpsilon a random decision will be taken at each node instead of using the Greedy, EXP3 or Regret Matching formulae.

opponentTreePolicy

This indicates how the search tree is built, and in particular how opponent actions are included. This has a few possible values:

  1. OneTree This is the default, and maintains a vector of rewards/node values during back-propagation so that each node records the value of a decision to each player independently. Hence a decision at a node is made using only the value to the deciding player. This is a much less conservative, and potentially more realistic, approach than the paranoid assumption (see the paranoid parameter below), especially in non-zero-sum games or with 3+ players.

  2. SelfOnly This ignores other players completely, and only constructs the MCTS tree using our actions. Hence on each node transition there are implied, but unrecorded, opponent actions taken using the specified opponent model (which by default is a random policy). This has the advantage of exploring much deeper forward in the game for a given budget, especially if the branching factor of opponent actions is high. The downside comes in not being able to account for opponent actions that could interfere with the detailed forward plan being constructed.

  3. MultiTree This constructs a ‘SelfOnly’ tree for each player to improve opponent modelling during MCTS search. Instead of using an explicit opponent model to take actions for the other players, these make decisions using their own tree. Each iteration will traverse all trees in parallel, and add one node to each. This retains the benefit of deeper search with an improved implicit opponent model compared to ‘SelfOnly’, but it cannot learn any action conditionalities. For detail on the MultiTree algorithm, see

Goodman, James, Diego Perez-Liebana, and Simon Lucas. “MultiTree MCTS in Tabletop Games.” 2022 IEEE Conference on Games (CoG). IEEE, 2022. link

  4. OMA ‘Opponent Move Abstraction’. This constructs a full tree including opponent moves (as for OneTree), but also maintains statistics at each node as if SelfOnly were being used (i.e. it abstracts out the opponent moves). In the selection phase a policy is used that weights the two sets of statistics. A further parameter, omaVisits, controls the effective number of visits to use for the SelfOnly-like statistics.

  5. OMA_All As for OMA, but this also applies the abstraction to the other players in the tree. OMA only uses the abstraction for the root player’s moves. For Opponent Move Abstraction, see

Baier, Hendrik, and Michael Kaisers. “Guiding multiplayer mcts by focusing on yourself.” 2020 IEEE Conference on Games (CoG). IEEE, 2020. link

  6. MCGS This enables a form of Graph Search using a Transposition Table approach. The previous methods all use an ‘Open Loop’ algorithm that defines the current node purely by the sequence of actions taken to reach it from the root node (and this only includes player actions, not actions attributable to the game, such as the precise card drawn from a shuffled deck). MCGS instead maintains a hashmap of all states seen so far with their MCTS statistics. After taking an action the new state is looked up in the hashmap: if it exists then we have the new node; if it does not then we expand by adding it to the hashmap with its first visit. This requires an additional parameter, MCGSStateKey, which must be a class that implements the IStateKey interface. This determines how a state is converted into a key for the hashmap, and hence how much generalisation between states takes place (see the example fragment after this list).

  7. MCGSSelfOnly This is as for MCGS, but only models the actions of the acting agent. Opponent actions are modeled using a fixed opponent policy.
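As referenced under MCGS above, an MCGS configuration might look like the fragment below. The class name shown for MCGSStateKey is purely hypothetical, and expressing it as a fully qualified class-name string is an assumption; check the MCTSParams code for the exact value format expected.

    "opponentTreePolicy" : "MCGS",
    "MCGSStateKey" : "games.mygame.MyGameStateKey",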

paranoid

The default of false will mean that opponents are modeled as optimising their own result (win/score). If true then the opponent will be modeled in the search tree as trying to minimise the acting agent’s result. This ‘paranoid assumption’ can lead to much more conservative play, but is also potentially less exploitable and works well in some games.

reuseTree

If set to true then any parts of the search tree from the previous decision that are still relevant will be retained. This is not always helpful, as it can take time to trim the tree to the useful parts, and more importantly with hidden information we may want to avoid polluting the tree data with results from invalid determinisations. (This is because in open loop search nodes are defined by the player actions that lead to the node, and may therefore contain a mixture of actual world/information states depending on cards drawn or dice rolled, as these game-driven random events do not affect the tree structure.)

rolloutType

The rollout policy to be used by the root player (for opponent policies use opponentModelType). This is defined with different values of rolloutType, which can then be parameterised further:

  1. RANDOM No further parameterisation possible. This uses the default random rollout policy.

  2. CLASS This requires a class that supports the AbstractPlayer interface, and which will be used as the rollout policy. This class is then specified by the rolloutClass parameter, and must be the fully qualified class name. It can either use a no-argument constructor, or a constructor with a single String argument. In the latter case the argument is separated from the class name by a ‘|’, for a full format of rolloutClass=package.className|constructorArg (see the example fragment after this list).

  3. PARAMS This adds more flexibility than CLASS, and allows much more detailed specification of the rollout policy. A further parameter of rolloutPolicyParams must be specified as a full JSON fragment defining the policy (which must be a class that implements both AbstractPlayer and ITunableParameters). An example makes this clearest:

        "rolloutType" : "PARAMS",
        "rolloutPolicyParams" : {
        	"class" : "players.simple.OSLAHeuristic",
        	"heuristic" : {
        		"class" : "games.loveletter.LoveLetterHeuristic",
        		"COUNTESS_PLAY_THRESHOLD" : 1.0,
        		"FACTOR_CARDS" : 0.5,
        		"FACTOR_AFFECTION" : 1.0
        	}
        },
  4. MAST This will use Move Average Sampling to define the rollout policy. This maintains an average of the reward obtained for every action played on this and previous iterations of MCTS. This can then be used to define a softmax policy using these action value estimates. For details of the MAST parameters see later on this page.
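As referenced under CLASS above, specifying a rollout policy by class name might look like the fragment below (players.simple.OSLAPlayer is used purely to illustrate the format; any suitable AbstractPlayer class can be substituted):

    "rolloutType" : "CLASS",
    "rolloutClass" : "players.simple.OSLAPlayer",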

opponentModel

oppModelType has exactly the same parameterisation options as rolloutType. This is used when an opponent decision is needed during the rollout, or if the Opponent Tree Policy is ‘SelfOnly’. If a more sophisticated opponent model is required, then opponentModel can specify (via a JSON fragment) any class that implements AbstractPlayer. For example, a hand-crafted agent could be written, and then used as an opponent model with:

        	"opponentModel" : {
        		"class" : "games.dominion.players.BigMoney"
         },

heuristic

This is used to define the heuristic function that evaluates the state reached at the end of each iteration (if this is not a game-end state), with the resulting value back-propagated up the tree. This needs to be a JSON fragment that defines a class implementing IStateHeuristic and optionally ITunableParameters (if further parameters need to be specified). Two examples are shown below:

"heuristic" : {
   "class" : "games.loveletter.LoveLetterHeuristic",
   "COUNTESS_PLAY_THRESHOLD" : 1.0,
   "FACTOR_CARDS" : 0.5,
   "FACTOR_AFFECTION" : 1.0
},
            
"heuristic" : {
   "class" : "players.heuristics.WinOnlyHeuristic"
},
            

The default heuristic used is the game-specific getHeuristicScore() method.

actionHeuristic

This is used to define a heuristic that returns an estimate for a state/action pair. This must implement the IActionHeuristic interface. If one is supplied then it will be used to define the order of expansion of unvisited actions (which defaults to a random order).

It can also optionally be used to initialise new nodes as if they had already received a number of visits (set by the initialiseVisits parameter, if non-zero) with the heuristic estimate as their value. Additionally, if specified, it can be used in any of progressive widening, progressive bias or pUCT, described in the following sections:

progressiveWidening

Progressive Widening defines which actions are available at a node, and requires an ordering over actions from ‘best’ to ‘worst’. If we have N visits to the node so far, then the number of actions, A, that will be considered in the Expansion/Tree phases is given by:

\[A=\lfloor{W}N^{B}\rfloor\]

The two constants, W and B, are specified by progressiveWideningConstant and progressiveWideningExponent respectively. Both default to 0.0, which disables progressive widening. The original algorithm (reference below) also has a k_init parameter for the minimum value of A. This is always 0 here, so that in practice W defines the minimum for A, and means we do not ‘unprune’.
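As a worked example, with progressiveWideningConstant W = 2 and progressiveWideningExponent B = 0.5, a node that has received N = 25 visits considers

\[A=\lfloor 2 \times 25^{0.5}\rfloor = 10\]

of its highest-ranked actions, with more being added as visits accumulate.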

Original citation for this form of progressive widening:

Couëtoux, Adrien, et al. ‘Continuous Upper Confidence Trees’. International Conference on Learning and Intelligent Optimization, Springer, 2011, pp. 433–45.

An oft-cited inspiration for progressive widening is Chaslot et al. 2008, but this does not use the form of the equation above, and actually implements the subtly different progressive unpruning:

Chaslot, Guillaume, Mark H. M. Winands, H. Jaap van den Herik, Jos Uiterwijk and Bruno Bouzy, 2008. Progressive strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation 4 (03) : 343-357.

progressiveBias

Progressive Bias adds an additional term to the selection equation that uses an external heuristic function (introduced in the Chaslot et al. paper mentioned above):

\[V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}} + \frac{h(s,a)}{1 + n_a}\]

The implementation in TAG requires an actionHeuristic but deviates from this canonical form. A parameter, biasVisits, specifies how much weight to apply to the heuristic estimate of the action. The larger this is, the longer the effect of the heuristic will last. When the action has been taken a number of times equal to biasVisits, the weighting between the heuristic and the empirically observed results will be 50:50.

\[V(a)=(1-\beta)\bar{X}_a + \beta{h(s,a)} + K\sqrt{\frac{\log(N)}{n_a}}\] \[\beta=\sqrt{\frac{\text{biasVisits}}{\text{biasVisits} + 3n_a}}\]
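As a check on the 50:50 claim above, substituting n_a = biasVisits into the expression for β gives

\[\beta=\sqrt{\frac{\text{biasVisits}}{\text{biasVisits} + 3\,\text{biasVisits}}}=\sqrt{\frac{1}{4}}=\frac{1}{2}\]

so the heuristic estimate and the empirical mean are weighted equally at that point, with the heuristic's influence decaying thereafter.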

pUCT

This multiplies the exploration term in any selection policy that has one (UCB, UCB_Tuned, AlphaGo) by the probability that the actionHeuristic would play that action in the current state. This synergises with First Play Urgency. The result is to bias selection towards actions preferred by the heuristic, and this is the mechanism used in AlphaZero to integrate the learned policy into the search (the same idea was applied simultaneously by Anthony et al.).

To apply this, set pUCT to true, and then decide on the pUCTTemperature to use in the Boltzmann (softmax) policy that is generated from the actionHeuristic. If pUCTTemperature is set to zero, then the raw values from the actionHeuristic will be used, on the assumption that it already outputs a probability distribution.
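For example (values are illustrative; this assumes an actionHeuristic has also been configured as described above):

    "pUCT" : true,
    "pUCTTemperature" : 1.0,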

Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. ‘Mastering the Game of Go without Human Knowledge’. Nature 550, no. 7676 (18 October 2017): 354–59. https://doi.org/10.1038/nature24270.

Anthony, Thomas, Zheng Tian, and David Barber. ‘Thinking Fast and Slow with Deep Learning and Tree Search’. In Advances in Neural Information Processing Systems, 5360–70, 2017.

useMASTAsActionHeuristic

If this is set to true then MAST will be used as an actionHeuristic (and any specified actionHeuristic parameter will be ignored).

MAST

The Move Average Sampling Technique (MAST) was introduced by:

Bjornsson, Yngvi and Hilmar Finnsson, 2009. ‘CadiaPlayer: A Simulation-Based General Game Player’. IEEE Transactions on Computational Intelligence and AI in Games 1 (1): 4-15.

On each MCTS iteration this tracks all the actions taken, and gives them equal credit for the final result of that iteration. This maintains for each action the average result obtained across all iterations that used that action. This can then be used as a game-independent heuristic to guide opponent models, rollout or expansion policies. To set this for use in the Expansion, set the useMASTAsActionHeuristic parameter to true.

There are three MAST parameters that can be specified:

  1. MAST. Values of Rollout, Tree or Both are valid. Defaults to Rollout. This defines which actions are used to maintain the MAST-estimate for an action; just those used in the Rollout phase, Tree phase, or a combination of both.
  2. MASTGamma. Defaults to 0.5. This is the weight used for the MAST data after the previous MCTS iteration; i.e. the effective number of visits to each action is decayed by this factor. Setting this to 0.0 will only use data from the current iteration in MAST.
  3. MASTBoltzmann. Defaults to 0.1. This is the temperature value to be used in any softmax policy over actions defined by the MAST values (as rollout policy or opponent model).
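For example, to use MAST as the rollout policy with statistics gathered from both tree and rollout actions, a fragment along these lines could be used (values are illustrative):

    "rolloutType" : "MAST",
    "MAST" : "Both",
    "MASTGamma" : 0.5,
    "MASTBoltzmann" : 0.1,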

backupPolicy

This controls the back-propagation of values through the tree after each iteration. Possible values are:

MonteCarlo. This is the default, and back-propagates the observed reward. The statistics at each node for the mean value are therefore the simple average of all iterations that went through that node.

Lambda. This applies a temporal difference style update. It requires the setting of an additional parameter, backupLambda. At each node the estimated value, R’, passed up to the parent will be interpolated between the estimated value from the child node, R, and the current statistics on the node. The value of the node is updated as normal as the mean of all visits, but these values will not be the pure Monte Carlo rewards.

\[R'= \lambda R + (1 - \lambda)\bar{X}_a\]

MaxLambda. This modifies Lambda to approximate off-policy Q-Learning instead of on-policy Temporal difference learning.

\[R'= \lambda R + (1 - \lambda) \max_a \bar{X}_a\]

MaxMC. This modifies MonteCarlo in a similar way to approximate an off-policy update. During back-propagation at a node we consider whether the action taken would have been the one recommended by the final recommendation policy (i.e. was it the best one we could have taken given current information, or was it an exploratory action). If it was the best, then the estimated reward, R, from the child node is left unchanged. If it was an exploratory action, then we update R towards the value we estimate would have been received if we had taken the ‘best’ action.

\[R' = \gamma R + (1- \gamma) \max_a \bar{X}_a\]

γ is determined via the maxBackupThreshold parameter, which also needs to be set. This has two effects. Firstly, the MaxMC rule is not applied unless the node has had at least this number of visits, defaulting to MonteCarlo behaviour until then. Secondly, it sets the weighting of the observed reward against the estimated max: high values of maxBackupThreshold mean only a small weight is applied to the max term.

\[\gamma= 1 - \frac{N - \text{maxBackupThreshold}}{N}\]
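A MaxMC configuration might therefore look like the fragment below (the threshold value is illustrative):

    "backupPolicy" : "MaxMC",
    "maxBackupThreshold" : 20,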

Khandelwal, Piyush, Elad Liebman, Scott Niekum, and Peter Stone. ‘On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search’. In International Conference on Machine Learning, 1319–28, 2016.

Frydenberg, Frederik, Kasper R. Andersen, Sebastian Risi, and Julian Togelius. ‘Investigating MCTS Modifications in General Video Game Playing’. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), 107–13. Tainan, Taiwan: IEEE, 2015. https://doi.org/10.1109/CIG.2015.7317937.

Computational Budget

MCTS (and RHEA and RMHC) support three types of computational budget, with the following parameters:

budgetType

  • BUDGET_TIME means we have N milliseconds of elapsed wall time to make each decision.
  • BUDGET_FM_CALLS means we have N forward model calls to make each decision.
  • BUDGET_ITERATIONS means we have N iterations of MCTS to make each decision.

The first two are the most common when comparing different algorithms. For cross-game comparison it may be better to use a forward model budget, as different games can have widely varying forward model efficiencies.

budget

Sets the N used in budgetType above.

breakMS

When using BUDGET_TIME, this will break off processing this number of milliseconds before the timer actually elapses. This caters for games or competitions that disqualify an entrant that takes longer than the budgeted time to make a decision. If this is not a concern, just set it to zero.
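Putting these together, a wall-clock budget of 40ms with a 5ms safety margin might be specified as (values are illustrative):

    "budgetType" : "BUDGET_TIME",
    "budget" : 40,
    "breakMS" : 5,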