Global warming: Temperature estimation in annealers

Sampling from a Boltzmann distribution is NP-hard and so requires heuristic approaches. Quantum annealing is one promising candidate. The failure of annealing dynamics to equilibrate on practical time scales is a well understood limitation, but does not always prevent a heuristically useful distribution from being generated. In this paper we evaluate several methods for determining a useful operational temperature range for annealers. We show that, even where distributions deviate from the Boltzmann distribution due to ergodicity breaking, these estimates can be useful. We introduce the concepts of local and global temperatures that are captured by different estimation methods. We argue that for practical application it often makes sense to analyze annealers that are subject to post-processing in order to isolate the macroscopic distribution deviations that are a practical barrier to their application.


Introduction
Boltzmann distributions are important in many areas of science, and sampling from these distributions is a major bottleneck in many interesting applications. The tasks of uniform generation, approximate counting, and inference (e.g., estimation of marginal probabilities), are often NP-hard [1,2,3]. Heuristic samplers that sample approximately from a Boltzmann distributions are applied in practice to large scale problems (for example in machine learning [4]).
One approach to heuristic sampling is to use an annealer. Whether thermal or quantum, an annealer generates independent samples by slowly transforming an easily prepared initial state into a random final state associated to a given objective function [5,6]. In the case of the simulated thermal annealer (STA), an initial random sample is evolved through a schedule of decreasing temperature towards a specified terminal temperature [6,7]. In quantum annealing the initial state is a ground state of some driver Hamiltonian (often a unifrom superposition of states). During the annealing process the state is evolved by slowly changing the Hamiltonian towars the the target Hamiltonian [5].
Although annealers have primarily been considered in the context of optimization, they can also be used as heuristic samplers of Boltzmann distributions. With sufficient resources, STA samples from a Boltzmann distribution [6,8]. However, the resources required per sample to achieve this are prohibatory in interesting applications, so that it is typically run as a heuristic without theoretical guarantees. Previous studies have also indicated that samples produced by the D-Wave quantum annealers may produce samples well described by finite temperature Boltzmann distributions [9,10,11,12,13,14].
In this paper we investigate several methods for determining how close sample distributions produced by annealers are to a family of Boltzmann distributions parameterized by inverse temperature β. These methods estimate the parameter β best describing samples drawn from an annealer, and also provide measures of closeness. The annealers we evaluate are the latest-model D-Wave 1 quantum annealer -the D-Wave 2X (DW2X) [15,16,17,18], and an implementation of simulated thermal annealing on a single CPU. We consider two knobs for each annealer that modify the heuristic distributions generated: rescaling the STA terminal temperature/DW2X terminal energy scale, or changing annealing time (either in the DW2X or STA). Sample quality shown here does not reflect performance of optimally-tuned versions of these annealers, and are simply presented to compare various β estimation techniques.
We observe a significant difference between local (subspace) and global (full space) features of the annealer distributions. We find that even though samples are locally similar to a Boltzmann distribution, the global deviation can be large. This gives rise to a "global warming" effect: the fact that global distributional features indicate a higher temperature than local distributional features. We consider several estimators of inverse temperature and evaluate their efficacy. Some estimators are sensitive to details of the dynamics, and indicate a significant difference between the DW2X and STA. Other estimators are more sensitive to ergodicity breaking and macroscopic distribution features, where the DW2X and STA show a qualitatively similar behavior.
We treat our heuristic samplers as black-boxes and consider temperature estimation as the problem of determining the best fit amongst a single-parameter exponential family of models. This problem has a long history, and best practice is well established [19,20,21]. Inference of Ising models parameters under some systematic schemes is NP-hard [22,19]. However, heuristic approaches such as log-pseudolikelihood are known to perform well in practice [23], and some schemes are provably convergent with reasonable resources [24,25]. Bhattacharya et al. [25] recently considered the log-pseudo-likelihood estimator for β and found that estimation based on only a single sample is possible; their focus was primarily on the convergence properties of this estimator. Multi-parameter estimation (estimation of couplings and fields) is more commonly studied, and is pertinent to the class of Ising models we study, though beyond the scope of this paper. In this context, efficient methods of estimation for strongly interacting models include pseudo-likelihood and variational approaches [26,27,28].
Many recent papers have shown that physical quantum annealers approximate Boltzmann distributions [9,10,11,12,10,29,13,14]. In some of these approaches temperature estimators have been developed, and these estimators have been effectively applied in correcting the annealer parameterization to produce the desired distribution. A significant focus has been the impact of noise, or systematic specification errors, in D-Wave processors. Remedies have been proposed to allow more effective sampling, but scaling is either poor or unproven; in some methods only a restricted set of problem classes are appropriate. An extension to the temperature estimation method of Benedetti et al. [12] is discussed in supplementary materials, but we prefer the more standard estimators presented in the main text. Some work considering closeness to quantum Boltzmann distributions has appeared [14,13].
Our paper evaluates several standard methods, but differs from previous studies in that it uses insight specific to annealers in the analysis of deviations and development of temperature estimators. Noise sources and quantum features in physical quantum annealers are discussed only briefly. Some estimators we evaluate have a firm theoretical basis, such as maximum likelihood, but where this is lacking we will not focus on formal properties such as convergence, bias and variance.
Qualitatively, the deviation of the STA distributions from the Boltzmann distribution for hardto-sample Hamiltonians has been understood within physics and computer science since the idea of annealers was conceived [6,7,8]. As we modify the inverse temperature in the STA from its initial value to the terminal inverse temperature value (β T ), we move from a distribution where classical states are uniformly distributed to a distribution divided into disjoint subspaces of low energy (which can be identified qualitatively with modes of the probability distribution, or valleys in the free energy landscape). Under annealing dynamics, a state localized in one subspace cannot easily transition into another subspace that is separated by an energetic barrier once the inverse temperature becomes large (late in the annealing procedure) -this is called ergodicity breaking [7,8]. In the case of the STA we gradually decrease the annealing temperature. Temperature is in one-to-one correspondence with expected energy in a Boltzmann distribution, and equilibrated samples are characterized by tight energy ranges. These samples are partitioned into subspaces by the energy barriers as temperature decreases, at which point the samples in each subspace will evolve independently, and be characterized by a local distribution. Rare fluctuations do allow samples to cross barriers, but are exponentially suppressed in the height of the energy barrier later in the anneal. Therefore, the distribution between subspaces will reflect the distribution at the point in the anneal where dynamics between the subspaces became slow, rather than the equilibrated distribution associated to the terminal model. This effect is called "freeze-out". We provide a schematic in Figure 1.
Similarly, in the case of quantum annealing, we proceed through a sequence of quantum models of decreasing transverse field (and increasing classical energy scale). With respect to the terminal diagonal Hamiltonian, the energy is again decreasing throughout the anneal, and some characteristic mean energy defines the sample set at intermediate stages. Energy barriers become impassable as the transverse field weakens and tunneling becomes slow, so that the process of ergodicity breaking is qualitatively similar [5,14]. Tunneling dynamics are affected by energy barriers in a different manner to thermal excitation dynamics, which is why there is some enthusiasm for quantum annealing; for some problems it may not suffer the same dynamical slowdown that is true of STA [18].
For many problem classes, these points of ergodicity breaking become well defined in the large system size limit, and can often be directly associated to thermodynamic phase transitions [7,30].
In this paper we consider heuristic sampling from classical Ising spin models. The state x will consist of N spins, defined on {−1, +1} N . The Hamiltonian is where J and h are unitless model parameters called couplings and fields respectively. The Boltzmann distribution corresponding to this Hamiltonian at inverse temperature β is where Z is the partition function. Throughout the paper we will use the standard, though improper, abbreviation in which x can denote both the random variable X, and its realization. We study two problem classes compatible with the Chimera topology described in Section 4.1, tailored to the D-Wave architecture [31,32,16].

Outline
In Section 2 we introduce Kullback-Leibler divergence and mean square error on correlations as measures of closeness to the Boltzmann distribution, and from these develop standard estimators of inverse temperature.
In Section 3 we develop local self-consistent approximations for efficiently evaluating our inverse temperature estimators; we call these locally-consistent inverse temperature estimators. We argue that in the context of annealers the approximation may determine an inverse temperature significantly different (typically larger) than that obtained by a full (computationally intensive) evaluation method. However, we will show that the estimate has meaning in that it captures local distribution features. Applying our approximation to maximum likelihood estimation, we recover in a special case the commonly used pseudo-log-likelihood approximation method.
In Section 3.2 we argue for evaluating post-processed distributions in place of raw distributions, for the purposes of removing superficial deviations in the heuristic distribution, and for determining the practical (as opposed to superficial) limitations of heuristic annealers.
We then present experimental results relating to our objectives and estimators in Section 4. We conclude in Section 5.
In supplementary materials we present various supporting results to complement the main text, and additional results and methods. In particular we introduce a new multi-canonical approximation for β estimation inspired by Benedetti et al. [12], we address issues related to practical usage of a DW2X, and we develop a method for calculation of KL-divergence. As part of our work we study the RAN1 and AC3 models of the main text, as well as two problem classes not tailored to the DW2X architecture [33,34].

Practical guidelines for using annealers in sampling applications
Based on the results in this paper we offer the following advice for selecting and interpreting temperature estimation methods in the context of samples drawn from an annealer.
• It is important to define a suitable objective that is minimized by Boltzmann samples, and check that the objective is indeed small for the heuristic sampler for some temperature. It is not sufficient to find the best temperature, since there will always be a best temperature even for bad distributions. Comparisons of β estimates between heuristics are not meaningful in the absence of this analysis.
• Robust evaluation of a heuristic sampler will often require input from an independent (exact or heuristic) method, such as statistical estimates against which to compare. Attempting to quantify error in a locally self-consistent manner could be misleading.
• The DW2X rescaling parameter/STA terminal temperature, total annealing time, and postprocessing, should be tuned to the sampling objective.
• A temperature can be estimated accurately and efficiently by standard methods, such as the log-pseudo-likelihood method. If ergodicity breaking is a weak effect, the log-pseudo-likelihood estimator is sufficient.
• It is valuable to consider several different types of estimator, since different estimators may be sensitive to different distribution features. Disagreement amongst estimators may reveal a pattern of ergodicity breaking, or imply a path to error mitigation.
• Efficient post-processing can move the distribution towards Boltzmann distribution by correcting local deviations. We are interested in the best pracitical heuristic, and so efficient post-processing should be applied.
We argue in this paper, in line with previous literature and experimental results, that the temperature that best describes an annealer distribution is expected to be a function not only of the annealer parameterization, but of the target Hamiltonian. There is no single parameter β that is optimal for all Hamiltonians. Whilst this should be bourne in mind, closely related Hamiltonians (e.g. those of a given class, created by a slow learning procedure, or otherwise of comparable statistical properties) do yield comparable estimates for temperature, so that it may be efficient to estimate temperature properties on a small subset of the problems of interest and effectively generalize.

Estimators for temperature
We assume that annealers generate independent and identically distributed samples, according to a distribution P A . For the STA this is reasonable given powerful pseudo-random number generators. For the DW2X, correlated noise sources (discussed in Section 4.3) means this is an approximation that is more difficult to analyze. In Figure 14 we show evidence that these weak correlations in time do not strongly affect our results and conclusions. The experimental structure is demoted to supplementary materials. We are interested in comparing these heuristic distributions to a family of Boltzmann distributions (2) parameterized by inverse temperature β. Amongst such models we wish to find the best fit, and measure its goodness. We will consider the best temperature to be that which minimizes some objective function. Since the distribution P A is a heuristic distribution, and not Boltzmann, this temperature may vary between objectives. Given an objective that is minimized at some unique inverse temperature, we then need an estimator for this temperature working on the basis of finite sample sets. An effective estimator should be consistent, with low bias and variance [20,21,25,35,24]. The estimators we study will be consistent, and in some cases optimal with respect to variance and bias (e.g. the Maximum Likelihood estimators [20,21]).
Either to evaluate the objective or to estimate temperature (i.e. minimize the objective) note that we must evaluate some statistics from the Boltzmann distribution, e.g., the mean energy, an energy gap, or marginal distribution. Inference for any of these quantities is NP-hard in the model classes we study. It is often in practice easier to evaluate the energy, and perhaps log(Z), than marginal statistics, but estimation of all these quantities is slow in the worst case. For purposes of the models and temperatures explored we are able to accurately estimate the mean energy, log(Z), or marginal expectations under the Boltzmann distribution by either dynamic programming or parallel tempering [19,36,37]. With these values in hand we can efficiently evaluate our temperature estimators, and in most cases the objective (an exception, Kullback-Leibler divergence, is discussed in supplementary materials). However, a scalable estimator requires us to find effective approximation methods for these quantities or to define different estimators, and is the subject of Section 3.

Maximum likelihood (Minimum Kullback-Leibler divergence)
When comparing distributions, a natural objective function to minimize is the Kullback-Leibler (KL) divergence between the sampled distribution (from the annealer) P A and the corresponding Boltzmann distribution B β , as follows: The Kullback-Leibler divergence is an important information-theoretic quantity, which places various limitations on the efficacy of P A for modeling B β , and vice-versa [19] 2 . P A has no β dependence, so that at the minimum of this function with respect to β, we obtain an energy matching criterion EM(β) = 0, where The energy matching criterion yields the maximum likelihood estimator for β -the likelihood that the annealed samples were drawn from a Boltzmann distribution. Maximum likelihood is perhaps the most well established of procedures for estimating model parameters from data -in this case the data being the samples drawn from P A [21,20,19]. Note that the Boltzmann distribution is an exponential model, and so it is natural to define the estimator in terms of expected energy, which is the sufficient statistic associated to the parameter β [19].

Minimum mean square error on correlations
In the context of machine learning, an important potential application of annealers, the important feature of samples may be the quality of some statistics that are derived from them. In particular, a machine learning process may require accurate estimation of single variable expectations, and expectations for products of variables (correlations). For this reason we consider an alternative objective, the mean square error (MSE) on correlations: M is number of non-zero couplings. We consider specifically the mean error on correlations (excluding errors on single variable expectations) since the models we study experimentally are all zero-field problems so that E[x i ] = 0 for all β by symmetry. Unlike the KL-divergence, MSE is not a convex function of β in general, although intuitively this might be expected for many problem classes and reasonable heuristics. A derivative of (5) with respect to β will yield a criterion for local optimality. This is a complicated expression dependent on many statistics, but is straightforward to approximate numerically in our examples. We will find that in application to annealers the minimum for this second objective (5) can disagree with the maximum likelihood (minimum KL-divergence) estimator (4), typically being more sensitive to ergodicity breaking in our experiments. Note that, once ergodicity breaking has occurred, the mean energy can be improved as samples settle towards their respective local minima. The maximum likelihood value can be larger than that implied at the point of ergodicity breaking. By contrast, the distribution between the now disconnected subspaces that determines the correlations cannot be much improved as samples settle towards their local minima. Therefore the minimum MSE estimator will typically indicate a smaller value for β, better inline with the point of ergodicity breaking. We stress that MSE is not in any sense a special objective function in this regard; many variations are possible and should be chosen in an application orientated manner.

Local approximations for evaluating objectives and estimators
The problem with the objectives and estimators outlined in Section 2 is that their use requires inference with respect to the Boltzmann distribution that is independent of the heuristic annealer, which is NPhard to perform: estimation of either the expected energy, or correlations.
In this section we show how, beginning from the annealed distribution, we can build a reasonable approximation to the Boltzmann distribution and thereby evaluate the estimator self-consistently. The estimators are motivated as approximations to those of Section 2. However, we will show that even in cases where the approximation is poor, the estimator can still reveal useful information about the distribution.

Statistics of the heuristic distribution, and of the Boltzmann distribution, by local self-consistency
Our estimators and objectives require us to evaluate statistics of the annealed distribution P A . Estimates of mean energy or correlations based on P A can be obtained by evaluating those statistics from the sample set S = {x}, or equivalently, evaluating their corresponding expressions using the plug-in estimator to P A :P The quality of estimates depends on variance and sample size. In experiments we typically use sample sets of size 10 4 that are sufficient for temperature estimation and evaluation of the objectives. The evaluation of the KL-divergence is one exception: our approximation (6) is known to fail when applied to the entropy term − x P A (x) log P A (x) [38,39]. In the supplementary materials we discuss why evaluation is problematic, and propose a mitigation strategy. We must also evaluate energy, correlations and log(Z) under the Boltzmann distribution, which is NP-hard. However, under the assumption that P A (x) is close to the Boltzmann distribution we may make a locally-consistent approximation. The approximation to B β (x) iŝ where W β is a β-dependent kernel. It efficiently maps any state into a new state, with the property that the distribution is unchanged if it is a Boltzmann distribution The transition kernels used in Markov chain Monte Carlo (MCMC) methods, either singly or iteratively, are suitable candidates for W β [7,30]. The blocked Gibbs MCMC method is described in Section 4.2. The simplest example of W β is conditional resampling of a single variable, which is an element in the blocked Gibbs sampling procedure. All variables except i are unchanged, and i is resampled according to the conditional Boltzmann distribution B β (x i |x \ x i ) given the neighboring values. We label this kernel by (i), indicating the updated variable Applying the approximation (7) in combination with the kernel (9) to maximum likelihood estimation we obtain an energy matching criterion for (i). Each kernel i defines an energy matching criterion and an estimate for β, but it is not possible to simultaneously satisfy the criteria for all i. We can make a composite energy matching criterion by weighting each of the criteria equally: taking an average over EM(β) for each i. In this case we recover the maximum log-pseudo-likelihood (MLPL) estimator [23,25]. The MLPL estimator is normally derived and motivated slightly differently. An alternative way to combine the kernels {W (i) } is to define a composite kernel as a sum of the individual kernels. We prefer the MLPL estimator in this paper due to its prevalance in the literature and well established statistical properties. We discuss this further in supplementary materials, where alternative locally self-consistent estimators are also examined.
In the case of the Hamiltonian (1) the MLPL estimate is the solution to EM(β) = 0, where where is the effective field. Note that −2x i ζ i is the energy change of flipping the state of spin i. Provided there exists at least one ζ i (x) > 0, and one value ζ i (x) < 0 (at least one local excitation in some sample), then this equation has a unique finite solution which can be found, for example, by a bisection search method.
For our purposes, the MLPL estimator is a special case of a more general locally consistent estimator. We choose a kernel, approximate B β (7), and then evaluate the energy matching criterion (4). The locally consistent approach can also be applied straightforwardly to the MSE, and minimum MSE estimator, of Section 2.2. In the main text the only locally consistent estimator for which we present results is the MLPL estimator. In the supplementary materials we perform an experiment to demonstrate how the strength of the kernel impacts temperature estimation.
Consider the following interpretation for the role of the kernel: We take every sample that the annealer produces, and conditionally resample according to W β . We then take this new set of samples as an approximation to Boltzmann samples drawn according to B β . Since W β obeys detailed balance, it necessarily brings the distribution towards the Boltzmann distribution. Consider again Figure 1, and note that resampling single spins, or doing some other efficient conditional resampling procedure (i.e. some short-run MCMC procedure) does not lead to a significant macroscopic redistribution of the samples, except in the high-temperature regime where fast dynamical exploration of the space is possible. Thus, the approximation (7) will typically inherit the macroscopic bias of the sampling distribution throughP A , but correct local biases. The locally consistent estimator is therefore effective in capturing any local deviation in the distribution P A not representative of B β .

Towards global distribution features
Objective functions such as maximum likelihood and minimum mean square error on marginals are influenced by a combination of local and global distribution features. We have already seen that locally consistent estimators, such as MLPL, can assign a meaningful temperature for local deviations from the Boltzmann distribution. However, these may fail to capture macroscopic features. It would be useful to have an objective, or estimator for temperature, that reflects only the macroscopic distribution features. One way to do this is to manipulate P A so that the local distributional features are removed.
We have also not considered so far the practical application of annealers as heuristic samplers. By our definition, for an annealer to be useful it must do well on the appropriate objective, and be fast. However, these two aims are typically in tension. A method that allows one to trade off these two goals is post-processing. In post-processing we take individual samples, or the set of all samples, and apply some additional procedures to generate an improved set of samples. This requires additional resources and can be heuristic, or employed in a manner guaranteed to improve the objective.
Those distribution features that can be manipulated by post-processing will be called local. Local, since it is assumed that efficient post-processing will not be so powerful as to manipulate the macroscopic distribution in interesting cases. Amongst the easiest local feature to correct in annealers is local relaxation: the tendency of states to decrease in energy towards their local minima at the end of the anneal as illustrated in Figure 1.
Post-processing has two uses considered in this paper: to extract macroscopic distribution features (by discounting local distortions), and to improve the heuristic distributions. A post-processed distribution can be represented as where W β is again a kernel.
With post-processing, we now have three distributions of interest: P A (x), P β,A (x) and B β (x). Until this section we were interested exclusively in the closeness of P A (x) and B β (x), and under the assumption that B β ≈ P β,A (x) we developed efficient approximations to maximum likelihood (or minimum MSE) estimation in Section 3.1. We can now present a different interpretation. If we are interested in local deviations in the distribution, we should compare P A (x), P β,A (x); whereas if we are interested in global deviations we should compare P β,A (x) and B β (x). For practical purposes only the latter comparison makes sense, since we can efficiently correct local errors. However, the former is important in understanding why annealers fail, and how to correct their distributions. The locally self-consistent estimators (such as MLPL) minimize a divergence between P A (x) and P β,A (x), and so should be interpreted as local approximations (yielding a local temperature estimate). In the case that P A (x) = B β (x), then the distinction between global and local is no longer relevant, and this local approximation is a consistent and low variance estimator for the unique β describing the distribution. Implicit in our definition (11) is a restriction to "do no harm" post-processing, we post-process at the β that defines the Boltzmann distribution of interest (or that to which we wish to compare). The criterion (8) does no harm since it is guaranteed to move any distribution towards a Boltzmann distribution in some sense, and never away from it. In the case that W β involves only conditional resampling according to B β (rule (9) is one such case) it is straightforward to show that the It is reasonable to expect, though not guaranteed, that other objectives will improve under do no harm kernels. Heuristic approaches without such guarantees may sometimes do better in practice, but carry risks.
For purposes of isolating macroscopic features of the distribution it is ideal to apply enough postprocessing to remove the local distortions; but leave the macroscopic features intact. This is a balancing act that strictly exists only as a concept, since the distinction between local and global is blurred except perhaps in the large system size limit, and there will typically be several relevant scales not just two. In experiments we present results for post-processing consisting of one sweep of blocked Gibbs sampling (described in Section 4.2), a weak form of post-processing. For purposes of improving the heuristic, W β should be chosen powerful enough that the time-per-sample is not significantly impacted. One sweep of blocked Gibbs sampling meets the criterion of being a small overhead in time per-sample for the DW2X and STA under the operation conditions we examine, so we can infer something of the power of post-processing to efficiently correcting annealer non-idealities.
If the heuristic distribution is a function of β (11), we must take into consideration the dependence of P β,A on β in objective minimization. KL-divergence minimization becomes distinct from maximum likelihood in the case that samples are a function of β, and the energy matching criterion (4) is modified in the former case. This point is further discussed in supplementary materials.
In Section 3 we developed locally self-consistent estimators. We emphasize that with post-processing, these estimators can be made redundant, unless the post-processing method kernel (11) is significantly different from the kernel used in the local self-consistency trick (7). The self-consistency trick (11) uses information about how samples are redistributed under the kernel to determine β -if this kernel matches the post-processing kernel, then it detects the effect of post-processing and very little else. Therefore, in the evaluation of post-processed distributions care should be taken in applying and interpreting self-consistent approximations to β.

Experimental results
In Section 4.1 we present the two models we will study in the main text. We then describe in Section 4.2 a simple Markov Chain Monte Carlo procedure called blocked Gibbs sampling, and how we use this to create the STA. The blocked Gibbs method is also applied in our post-processing experiments. We then describe our usage of the DW2X as a sampler. Experimental results demonstrating the methods of Sections 2 and 3.2 are subsequently presented.

RAN1 and AC3 models
In this main section we consider only two models, RAN1 and AC3, which are spin-glass models compatible with the DW2X topology [40]. This topology is described by a Chimera graph shown in Figure  2; each variable is a circle with an associated programmable field h i , and each edge is associated to a programmable coupling J ij . Qubits are arranged in unit cells, each a K 4,4 bipartite graph of 8 qubits. Due to manufacturing errors, some qubits and couplings are defective and cannot be programmed.
The RAN1 and AC3 problems are defined on a Chimera graph with 1100 qubits across a 12x12 cell grid (C12). In experiments we consider models that exploit all available couplings and qubits (C12), as well as models that use only 127 qubits on a 4x4 cell subgrid (C4), and 32 qubits on a 2x2 cells subgrid (C2).
RAN1 is a simple spin-glass model without fields (h i = 0), and with independent and identically distributed couplings uniform on J ij = ±1 [16]. Recent work has indicated that for some algorithms RAN1 may be a relatively easy problem in which to discover optima [31], and that asymptotically there is no finite temperature spin-glass phase transition [32], making it questionable as a benchmark. However, the problem class demonstrates many interesting phenomena at intermediate scales and has become a well understood benchmark for experimental analysis.
The AC3 model is a simple variation of the RAN1 class. Again we have no fields, but the intra-cell couplings' values (those between variables in the same cell) are sampled uniformly at random from J ij = ± 1 /3, and inter-cell couplings set to J ij = −1 [16] 3 . By making couplings relatively stronger between cells, longer range interactions are induced through sequences of strongly correlated vertical, or horizontally, qubits. If we can consider the inter-cell couplings to dominate energetically, then the low energy solution space becomes compatable with that of a bipartite Sherrington Kirkpatrick model [41]. We find that the AC3 problem is an interesting departure from RAN1 since the solution space is less dependent on local interactions, and also because we modify the precision of couplings (from an alphabet of ±1, to an alphabet of ± 1 /3, −1).

Blocked Gibbs sampling, and simulated thermal annealing
Blocked Gibbs is a standard Markov chain Monte Carlo procedure closely related to the Metropolis algorithm procedure [7,42]. It is the basis for the STA in this paper, and also the post-processing results.
First note that because the Chimera graph is bipartite, it is 2-colorable. Given a coloring, the variables in a set of a given color are conditionally independent given the variables of the complementary set. Thus it is possibility to simultaneously resample all states in one set, each according to the probability where ζ(x) = (J + J T )x + h. We proceed through the colors in a fixed order, for each color resampling all variables. An update of all variables is called a sweep. This procedure can be iterated at a fixed temperature, and the distribution of samples is guaranteed to approach the Boltzmann distribution parameterized by β over sufficiently many sweeps. A graph coloring needn't be optimal, and can always be found (given sufficiently many colors), so that this algorithm generalizes in an obvious manner to non-bipartite graphs. The blocked Gibbs sampling procedure at large β is not very efficient in sampling for multi-modal distributions, since samples are immediately trapped by the nearest modes (which may be of high energy), and escape only over a long timescale. A more effective strategy for multi-modal problems is blocked Gibbs annealing, in which β is slowly increased towards some terminal value β T according to a schedule (a schedule assigns one temperature to each sweep of blocked Gibbs). In this paper we consider an annealing schedule that is a linear interpolation between 0 and β T . Given the restriction to a linear schedule the STA has two parameters: the total anneal time, and the terminal inverse temperature β T . The setting of these parameters is discussed in Section 4.4.

Quantum annealing with the D-Wave 2X
In the case of the DW2X, annealing is controlled by a time-dependent transverse field ∆(t) and an energy scale E(t). These quantities are shown in Figure 3 for the DW2X used in this paper. The physical temperature (T ) of the system varies with time and load on the device and is difficult to estimate. We have experimentally observed a physical temperature which is 22.9 in the median, with quartiles of 22.0 mK and 25.6 mK, over the data collection period. We did not analyze the time scales associated to temperature fluctuations in depth, but much of the variation occurs on long time scales, so that in a single experiment we typically found a tighter range of temperatures applied. The unitless Hamiltonian operator in effect during the anneal iŝ where s = t/t max is the rescaled time, the coefficient h is Planck's constant, andσ are the Pauli matrices. The Hamiltonian parameters can be considered rescaled to maximum value 1 (the function of the denominator max(·)). r and t max are the rescaling parameter and the anneal time parameter, respectively. Throughout this paper, we adopt the convention that the Hamiltonian for the problem is fixed, and treat r as a parameter of the heuristic (DW2X). In current operation of the DW2X, this manipulation is achieved in practice by turning off autoscale, and manually rescaling the values of {J ij , h i } submitted. We find the convention of modifying the quantum Hamiltonian (13) to be more intuitive than considering modification of the classical Hamiltonian that is submitted: when we rescale downwards we are implying the use of smaller energy scales E(t) in the annealer, and we anticipate that β is reduced.

Parameterization of the DW2X and STA in experiments
Our criteria for selection of the default rescaling parameter r and the anneal time is minimization of the "Time to Solution" (TTS) [16,31,43]. TTS is the expected time required to see a ground state for the first time. For the DW2X, it is optimal to choose a minimal programmable anneal time (20µs), and a maximum programmable energy scale r = 1. A second (and not uncorrelated) reason to use these parameters is that they are the default operation mode of the DW2X. Since our objective is not optimization, TTS is not the optimal way to use the DW2X, but represents a standard choice that allows for phenomena to be explored.
The STA parameters are chosen in simple manner in part according to TTS, and in part for convenient comparison with the DW2X. We choose the STA parameter β T so that the distribution of local excitations in the DW2X and STA are comparable in the median case of 100 randomly generated C12 problems: by equating the local properties' differences between the annealers, we reflect more interesting global features. To equate local properties β T is chosen equal to the MLPL estimate of β at full scale (C12) for the DW2X (see Figure 5). This implies β T = 3.54 for RAN1, and β T = 4.82 for AC3. In the case of RAN1 our choice β T = 3.54 is not too dissimilar to the value (β = 3) that had been previously used for optimization applications [16,31,43].
We choose two sets for the number of sweeps of the STA, and show both in most figures. The first set is chosen to minimize TTS in the median instance. For both RAN1 and AC3 we find an approximately linear trend in the width the Chimera graph, so that 12000, 4000 and 2000 sweeps were close to optimal for C12, C4 and C2 problems respectively.
The second set was chosen to match the time per sample of the DW2X. A recent efficient implementation of the Metropolis algorithm for RAN1 problems achieves a rate of 6.65 spin flips per nanosecond [43]. In C12 problems we have 1100 active qubits, and so 20µs would allow for 20000 ns 1100 spin flips * 6.65 spin flips/ns ≈ 120 sweeps (updates of all variables). For C4 and C2 we rescale linearly for simplicity (40 and 20 respectively). In experiments we evalute the STA and DW2X on the basis of sample batches, each batch consisting of 10 4 samples. We approximate the samples as independent and identically distributed. in the case of the DW2X this is an approximation and different programming procedures can impact quality of results. Our DW2X batches are generated by collecting 10 4 samples split across 10 programming cycles. This is a standard collection procedure that trades off quality of samples against timing considerations -including annealing time, programming time and read-out time. The programming cycles exploit spin-reversals, a noise mitigating technique that strongly suppresses correlations between programming cycles [44]. The effect of this batch structure are presented in Figure 14.

Maximum likelihood and maximum log-pseudo-likelihood estimation
In this section we consider the DW2X and STA without post-processing. The maximum log-pseudolikelihood (MLPL) estimate can be interpreted as a form of locally self-consistent approximation to maximum likelihood, as described in 3.1, and here we compare it to the more computationally intensive maximum likelihood (ML) estimate of Section 2.1. MLPL and maximum likelihood estimators indicate different values of β, reflecting the failure of the locally self-consistent approximation (8). MLPL captures a local temperature, consistent with the range over which the kernel (9) redistributes the sample. We consider the estimates with variation of the DW2X rescaling parameter r and the STA terminal temperature β T relative to the default settings, for our annealing procedures on 100 randomly generated RAN1 problems at each of 3 sizes: C2 (32 variables), C4 (127 variables), and C12 (1100 variables). Results for AC3 are presented in supplementary materials.
We first consider the behavior of the STA with a range of terminal inverse temperatures between 0 and the default value for the problem class. Due to ergodicity breaking, we expect samples to fall out of equilibrium before the terminal temperature is reached, and so the distribution may be characterized by a value of β smaller than β T , reflecting the range of inverse temperature for which dynamics slowed down. Figure 4 shows the maximum likelihood and MLPL estimates based upon the same sample sets. We see that the MLPL estimates follow a linear curve, which would indicate that the terminal model is indeed the best fit to the samples, with no evidence of the ergodicity breaking we describe. By contrast, the maximum likelihood estimator is concave, with β significantly smaller than β T .
The DW2X can also be manipulated by changing the rescaling parameter r on the interval [0, 1]. We thus repeat this experiment using sample sets from the DW2X. In Figure 5 we see that both locally (MLPL) and globally (ML) the estimators are concave as a function of this rescaling. As with the STA, estimates are consistently larger for MLPL than maximum likelihood, and maximum likelihood estimates decrease for larger, more complicated, problems. Maximum likelihood exhibits some small decrease with system size. A naive interpretation of the rescaling parameter might lead to a general hypothesis that β ∝ r, where the constant of proportionality can be determined by the terminal energy scale in annealing (see Figure 3). However, the physical dynamics of qubits are controlled by a single qubit freeze-out phenomenon that is discussed in supplementary material, with data summarized in Figure 6. The single qubit freeze-out figure implies that the non-linearity of the MLPL curve in spinglass problems is a function of the problem precision (granularity of the settings of J ij and h i ), or more specifically, the pattern of energy gaps. Problems of higher precision, such as the AC3 problem class, have less pronounced non-linearity. The single qubit freeze-out phenomenon also anticipates the system size dependence seen in Figure 5, and in some other models not presented. AC3 results, and further discussion of this point, are in supplementary materials.
We believe the phenomenon underlying both MLPL and maximum likelihood to be easily understood, and similar in both the DW2X and STA, although the interaction of the dynamics with the energy landscape may be quite different. Figure 1 provides a useful example to explain this. When ergodicity is broken, and the sample set is divided over subspaces, each subset relaxes towards the terminal model restricted to the subspace; the MLPL method effectively averages an estimate over these subspaces. By contrast the maximum likelihood estimate accounts in part for the distribution between modes, characterized at an early (higher energy) stage of the anneal that is better described by smaller inverse temperature. Note that the equilibrated distribution at the ergodicity breaking point is a quantum one for the DW2X, unlike thermal annealers, so we ought to understand the deviation in the local and global temperatures in terms of the quantum parameterization (∆(t), E(t)) [14]. Still, the classical description in terms of β (2) provides the correct intuition.
Response curves such as these can be used to choose a suitable parameterization of the terminal model -we can choose β T so as to minimize KL-divergence. Similar curves could be constructed for any parameter, and with a distribution that is subject to post-processing. In the absence of the maximum likelihood curve (due to absense of approximates to the energy), a local information curve such as MLPL curve can be used as a compromize. The maximum likelihood curves, and MLPL curve for the DW2X, are problem dependent. Some temperature estimators require knowledge of, or assumptions about, the form of these curves -notably the method of Benedetti et al. [12] and our multi-canonical method, both described in supplementary materials.
To judge quality of the approximation at the "best" β, it is appropriate to consider the quantity being minimized, KL-divergence. However, this is a difficult quantity to estimate from samples in the absence of parametric assumptions. We present a method for estimation of KL-divergence in supplementary materials, but find that for RAN1 and AC3 problems it is ineffective at C12 scales. This is an important reason to consider an alternative such as the MSE estimator.

The mean square error estimator
The mean square errors on correlations associated to the DW2X and STA for a typical instance of RAN1, and a typical instance of AC3, are shown for the C12 (1100 variables) problem size in Figure 7. Though there is significant variation between the curves associated to different instances of these models, we chose amongst 100 random instances exemplars that are typical in the minimizing temperature and the mean square error (β, MSE) for DW2X.
Both the DW2X and STA performance is best characterized at intermediate β values, and variation from the default annealer settings is also shown to modify MSE. In the RAN1 exemplar we demonstrate the effect of halving the terminal temperature or rescaling parameter, which improves MSE (except perhaps at very large values for β). In the case of the AC3 exemplar we demonstrate the effect of varying the anneal time. Results indicate that longer anneals tend to improve MSE at lower temperature. To prevent clutter we have shown only one variation of a default parameter per model. Switching the parameter being varied results are qualitatively similar. Annealing for longer is expected to allow equilibration to lower temperatures, and so a better match is to be expected. Annealing with smaller r or β T , concentrates annealing resources towards the initial part of the anneal where dynamics are effective (rather than at the end of the anneal where ergodicity breaking has already occurred and cannot be mitigated). It also reduces the tendency of samples to settle towards local minima that might distort the approximation for intermediate β. In the case of the DW2X, some noise sources and quantum mechanical phenomena can complicate this simple picture, but this is not obviously at play. Figure 8 shows the mean square error achieved against the point it is minimized, for a large set of C12 (1100 variable) problem instances. Ideally we wish for heuristics that are both sampling at large β and with small errors. Figure 9 shows the minimum MSE estimator in comparison to the maximum likelihood estimator. These two estimates indicate different operational temperature ranges. The fact that the maximum likelihood estimator is significantly larger is to be expected in annealers. Late in the anneal samples sink towards their respective local minima, decreasing the mean energy significantly (see Figure 1). The mean energy is strongly dependent on the local relaxation, and hence the local inverse temperature, which as we saw in the previous section is large. In the case of MSE, the correlation error is for most models not strongly affected by this local decrease in energy. It is more sensitive to the distribution between modes, which is set only by the temperatures characterizing ergodicity breaking.
By annealing with modified β T or r, improvements are made for sampling intermediate or small β. By contrast, it is relatively hard to sample effectively from large values of β by modifying these parameters; additional time resources are required to make an impact. For this reason, we may argue that, generally, the larger the inverse temperature estimate, the more useful the annealer will be for hard sampling applications. However, it is important to note that an estimate for β, independent of the objective measure, may be risky or misleading. In Figure 9(right), the DW2X system at full scale indicates a lower minimum MSE β than the DW2X system operating at half scale. However, we can see that at this larger β value the half scale system is still more effective as a heuristic.

Effectiveness of post-processing
In Sections 4.5 and 4.6 we demonstrated how adjusting anneal duration, or the terminal temperature, can allow better objective outcomes. In this section we consider briefly the effect of post-processing by one sweep of blocked Gibbs as discussed in Section 3.2. This post-processing changes dramatically the local distribution of samples, hence the MLPL estimate. However, the KL-divergence and mean square error, and the temperatures minimizing these objectives, are affected by a combination of the local and global distribution and so are modified in a non-trivial way by post-processing. Post-processing always strongly modifies local distribution properties, but only in the easy to sample regime (at small β) does it significantly impacts global distribution problems. Figure 10 shows that MSE on correlations are, as expected, improved by post-processing. The improvements are dramatic in the regime β ≈ 0, impressive over intermediate values of β, but almost negligible for larger β. After post-processing, the MSE curve has two minima. There is no guarantee there should be two local minima working with arbitrary distributions. The first local minimum (β = 0) is evidence for the power of post-processing. Since one sweep of blocked Gibbs samples effectively at β ≈ 0, independent of the initial condition P A , β = 0 will be a global minimum for any heuristic distribution. The second local minimum appears due to the closeness (at the macroscopic level) of the annealing distribution to some particular low-temperature Boltzmann distribution, a sweet spot of operation that may be of practical interest.
Post-processing reduces the error everywhere, but more so at smaller β where the time-scales for macroscopic redistribution are shorter. For this reason we expect the minimizing value of β to move leftward with post-processing. If the post-processing allows global redistribution of samples we may anticipate the disappearance of the local maximum separating the "easy for post-processing regime" from the "good for this annealer" regime; at which point a best operational regime for the annealer is less clear. However, we can assume that powerful post-processing of this kind is too expensive in the types of multi-modal problems where annealers are useful. Figure 11 shows the statistics for the local minimum mean square error estimator, and its relation to the local maximum likelihood estimator, to be compared against Figure 9 that has no post-processing. The local minimizer is the right most local minimum of the post-processed curve (see Figure 10), indicating the good operating regime. A global minimum is always at β = 0, but this is not of interest as it reflect the post-processor and not the heuristic.
In Figure 11(left) we see that, relative to the distribution without post-processing, there is a slight shift leftwards in all distributions, and significant shift downwards that appears approximately proportional to the MSE without post-processing. The effect of local relaxation is to give the impression of better estimation at low temperature. Here the effect is partially lifted to reveal that the macroscopic distribution may be characterized by slightly smaller β.
In Figure 11(right) we see how the minimum MSE estimate decreases in all annealers, but by a less significant amount than the downward trend in the maximum likelihood estimator. The two remain strongly correlated under post-processing. It is natural to expect that the local relaxation, which shifts samples towards their local minimum, may have a bigger impact on KL-divergence than MSE, because it is easy to raise the energy by post-processing which impacts maximum likelihood in a systematic manner, but it is difficult to redistribute samples macroscopically, which may be required to alter minimum MSE. The particular effects demonstrated on the exemplar instances of Figure 11 are reflected at the distribution level in Figure 12.
The estimation of KL-divergence is described in supplementary materials, and allows us to measure KL-divergence post-processed RAN1 and AC3 problems, with reasonable precision up to C4 scale. In Figure 13 we demonstrate results for an exemplar on a C4 graph. The pattern observed is qualitatively similar to Figure 10 in that we see a global minimum at β = 0 reflecting the effectiveness of the postprocessing, and a local minimum at intermediate β that reflects a promising region for application of the annealer as a heuristic. As discussed in the supplementary materials, bias can be a problem with our KL-divergence estimator. For this reason we present a C4 (rather than C12) sized problem and only the post-processed (rather than unprocessed and post-processed) estimates. To assess the bias (the sensitivity of the estimate to the finite sample set size) the jack knife bias-corrected estimator is also shown [45]. At C4 scale the bias does not significantly obscure the phenomena, particularly for the more powerful annealers (the STA with 4000 sweeps, and the DW2X).

Discussion
In this paper we have considered several temperature estimators applied in the context of a physical quantum annealer set up for optimization (DW2X), and a comparable simply parameterized simulated thermal annealer (STA). We have demonstrated how different objective measures of closeness to the Boltzmann distribution respond differently to local and global distribution features. An important phenomenon we observe is that in annealed distributions we have a range of temperature estimates according to the method employed. We have shown that estimators indicating larger temperature are those responsive to macroscopic (global) features of the distribution. From an estimator perspective we have global warming: the more effective the estimator is in capturing global distribution features the higher the temperature that is typically indicated. Ergodicity breaking qualitatively explains the origin of this phenomenon in annealers, both the DW2X and STA. We have provided some practical guidelines in Section 1.1.
Local distributional features, which are well characterized by a temperature, are easy to estimate by self-consistent methods. We showed in the main text standard methods for estimate temperature from samples drawn from a single distribution. Self-consistent approaches are efficient and approximate the target distribution from the same samples by which the heuristic is evaluated. We presented a simple form for this, but the principle generalizes. However, the main problem with such methods is that they may indicate a good fit on the basis of incomplete (or biased) information about the target distribution. If a heuristic sample set fails to see a representative set of modes in the distribution, then the evaluation will inevitably skewed by the missing information.
The local approximation method is able to capture an important difference between quantum and simulated annealers related to the difference between dynamics of the DW2X and STA. This is realized in the non-linear dependence of the local temperature estimate to variation of the DW2X rescaling parameter. Quantum simulations on single qubits provide a qualitatively accurate explanation for this phenomenon, and are described in supplementary material.
Describing the global distribution in terms of temperature(s) is more tricky; we proposed KLdivergence and MSE as measures of deviation from the Boltzmann distribution, and based on these objectives developed estimators for the best temperature. Each of these objectives is affected in slightly different ways by deviations locally and globally from the Boltzmann distribution. The maximum likelihood estimator is more strongly affected by the local distribution than the minimum MSE estimator, and indicates a larger estimate of inverse temperature. To remove the local distribution effects we have proposed to take the initial distribution and apply local post-processing in order to isolate the macroscopic distribution effects that are truely a limitation on practical performance. Applying some degree of post-processing may also be valuable in practice, in particular for the DW2X since the post-processing is strictly classical and complementary to the quantum dynamics.
We emphasize that because efficient post-processing allows significant manipulation of the local temperature, we consider this temperature not particularly important in practical applications. If we post-process, the post-processing temperature itself will be synonymous with the local temperature; the local temperature need not be measured.
Important ideas incidental to the main thread are discussed in supplementary materials. These include: a description of single qubit experiments that explain the non-linearity of MLPL estimates in the DW2X; consideration of the effect of embedding on the distribution of samples; the development of an effective estimator for the Kullback-Leibler divergence; a consideration of spin-reversal transformations to mitigate sampling error in the DW2X; and an experiment to test how the choice of kernel (7) affects the locally self-consistent temperature estimates.
At various points in this paper we have included results both for the DW2X and STA. We have motivated a default parameterization for each algorithm in Section 4.4, and it should be clear that these choices are not aimed at, or appropriate for, a competitive comparison. It would be a complicated task to make a fair comparison, and it would also detract from the main theme of this paper since it would distort or disguise the phenomena we wish to highlight. We chose a default DW2X parameterization suitable for optimization, and two STA parameterizations that allow for qualitative comparison. It has been shown in experiments that both annealers can be improved with simple parameter modifications. It is interesting that, despite the fact that the DW2X annealer has been designed and tuned for optimization, it produces good statistics at intermediate temperature ranges, and that the STA with long anneal time shows a qualitatively similar behavior.
When considering the value of annealers in inference problems, it is also important not to forget a variety of other powerful inference methods that may achieve a similar objective. In particular, simple variations on the STA such as annealed (or population) importance sampling methods and other multi-canonical MCMC methods can often be tailored to the graphical structure of the problem under investigation [46,47,7,48,37,49].
From the perspective of both errors on correlations and KL-divergence, the balance of evidence certainly indicates that there is potentially a sweet spot for application of either the DW2X or STA to sampling. This sweet spot may be problem type dependent, but can be tuned to a degree, by modification of the annealing parameters, and more importantly, by post-processing. However, evaluation of this sweet spot is difficult to do self-consistently, and someone interested in applications may have to undertake hard work to discover (and have confidence in) annealer performance. Having available curves such as those in Section 4.5, probably for some weakly post-processed distribution, would allow parameters of the annealer to be set optimally. It may seem computationally intensive (defeating the value of the heuristic) to evaluate the macroscopic distribution before using an annealer, but it is reasonable to assume that for some classes of problems at large scale, the local and global temperature properties will be common across the class. In other time-dependent applications of annealers the statistics of the distributions being learned change slowly, so that only periodic evaluations of the temperatures may be required.
An important potential application of quantum annealers is in machine learning [13,12], where other heuristic samplers (not annealers) are prevalent. A common heuristic used in machine learning is called contrastive divergence (CD) [50,42]. B β is approximated in CD by taking the ground truths (a set of training examples) and evolving them by an MCMC procedure. This is an example of the post-processing scheme used and recommended in this paper, except that annealed samples are replaced by the ground truths. Like the annealing distribution, the distribution of training examples may be incorrect in both its local and global features, the effect of the MCMC procedure is to tidy up local distribution deviations. After post-processing the distribution is used directly -in effect the post-processing temperature is taken to be correct 4 , without consideration of potential macroscopic deviations. The success of this algorithm in practice indicates that learning procedures may be quite tolerant of macroscopic deviations from the Boltzmann distribution in application provided the local temperature is correct. This would be good news since it may be computationally expensive to quantify macroscopic deviations, but it is easy to measure and manipulate local temperature in annealers.
One feature of D-Wave quantum annealers that might lead us to consider a different approach is the quantum part, as already discussed. The single qubit freeze-out, and dynamical slow-down at larger scales, are described by quantum models. The quantum Boltzmann distribution may be a better fit to sample sets drawn from the DW2X, and perhaps in "post-processing" we should think of the quantum space as the target, rather than the classical one. This is certainly a promising direction for future work.    MLPL becomes equivalent to maximum likelihood in the limit of a single qubit. The single qubit β dependence on the rescaling parameter, for the single qubit Hamiltonian H(x) = xi is shown. We plot the DW2X median value together with 25-75 quantiles as error bars. Simulation of the physical dynamics gives the Redfield curve, and the "freeze-out" curve shows is based on the assumption of a single freeze-out point. The theory thus is in agreement with experiment. The DW2X is not equilibrated at the end of the anneal even for a single qubit model. By contrast, in a system that is locally or globally equilibrated at the end of the anneal a linear dependence would be expected, as is seen for the STA.   Meansquare error 10,000 spinreversals 1,000 spinreversals 100 spinreversals 10 spinreversals 1 spinreversal Figure 14: Spin reverals are a noise mitigation technique, described in supplementary materials. Choosing batches with more spin-reversals aids sampling quality bringing us closer to the paradigm of independent and identically distributed samples, but at the price of additional programming time. At C12 scale, spin reversals have a significant impact on MSE for both AC3 (left) and RAN1 (right). Given M samples, the error achieved by a batch methods (m samples per spin-reversal with M/m spin reversals) is already close to that of the ideal scenario of one sample per spin reversal when m = 10. The signal is noisy when few spin-reveral transformations are used, and is only statistically significant after averaging over many distributions (100 in this figure). We establish the mean estimate, and its standard error by bootstrapping of a sample set of 10000 spin-reverals sampling each time 1000 samples.