Physics-Inspired Optimization for Quadratic Unconstrained Problems Using a Digital Annealer

The Fujitsu Digital Annealer (DA) is designed to solve fully connected quadratic unconstrained binary optimization (QUBO) problems. It is implemented on application-specific CMOS hardware and currently solves problems of up to 1024 variables. The DA's algorithm is currently based on simulated annealing; however, it differs from it in its utilization of an efficient parallel-trial scheme and a dynamic escape mechanism. In addition, the DA exploits the massive parallelization that custom application-specific CMOS hardware allows. We compare the performance of the DA to simulated annealing and parallel tempering with isoenergetic cluster moves on two-dimensional and fully connected spin-glass problems with bimodal and Gaussian couplings. These represent the respective limits of sparse versus dense problems, as well as high-degeneracy versus low-degeneracy problems. Our results show that the DA currently exhibits a time-to-solution speedup of roughly two orders of magnitude for fully connected spin-glass problems with bimodal or Gaussian couplings, over the single-core implementations of simulated annealing and parallel tempering Monte Carlo used in this study. The DA does not appear to exhibit a speedup for sparse two-dimensional spin-glass problems, which we explain on theoretical grounds. We also benchmarked an early implementation of the Parallel Tempering DA. Our results suggest an improved scaling over the other algorithms for fully connected problems of average difficulty with bimodal disorder. The next generation of the DA is expected to be able to solve fully connected problems up to 8192 variables in size. This would enable the study of fundamental physics problems and industrial applications that were previously inaccessible using standard computing hardware or special-purpose quantum annealing machines.


I. INTRODUCTION
Discrete optimization problems have ubiquitous applications in various fields and, in particular, many NP-hard combinatorial optimization problems can be mapped to a quadratic Ising model [1] or, equivalently, to a quadratic unconstrained binary optimization (QUBO) problem. Such problems arise naturally in many fields of research, including finance [2], chemistry [3,4], biology [5,6], logistics and scheduling [7,8], and machine learning [9][10][11][12]. For this reason, there is much interest in solving these problems efficiently, both in academia and in industry.
The impending end of Moore's law [13] signals that relying on traditional silicon-based computer devices is not expected to sustain the current computational performance growth rate. In light of this, interest in novel computational technologies has been steadily increasing. The introduction of a special-purpose quantum annealer by D-Wave Systems Inc. [14] was an effort in this direction, aimed at revolutionizing how computationally intensive discrete optimization problems are solved using quantum fluctuations.
Despite continued efforts to search for a scaling advantage of quantum annealers over algorithms on conventional off-the-shelf CMOS hardware, there is as yet no consensus. Efforts to benchmark quantum annealers against classical counterparts such as simulated annealing (SA) [15] have abounded [14,[16][17][18][19][20][21][22][23][24][25][26][27][28][29][30]. Although for some classes of synthetic problems a large speedup was initially found, those problems were subsequently shown to have a trivial logical structure, such that they can be solved more efficiently by more-powerful classical algorithms [31]. To the best of our knowledge, the only known case of speedup is a constant speedup for a class of synthetic problems [29] and, so far, there is no evidence of speedup for an industrial application. The hope is that future improvements to the quantum annealer and, in particular, to its currently sparse connectivity and low precision due to analog noise, will demonstrate the power of quantum effects in solving optimization problems [32,33]. With the same goal in mind, researchers have been inspired to push the envelope for such problems on novel hardware, such as the coherent Ising machine [32], as well as on graphics processing units (GPU) [27,28] and application-specific CMOS hardware [34,35]. Similarly, efforts to emulate quantum effects in classical algorithms-often referred to as quantum-or physics-inspired methods-run on off-the-shelf CMOS hardware have resulted in sizable advances in the optimization domain (see, e.g., Ref. [30] for an overview of different algorithms).
Fujitsu Laboratories has recently developed applicationspecific CMOS hardware designed to solve fully connected QUBO problems (i.e., on complete graphs), known as the Digital Annealer (DA) [34,35]. The DA hardware is currently able to treat Ising-type optimization problems of a size up to 1024 variables, with 26 and 16 bits of (fixed) precision for the biases and variable couplers, respectively. The DA's algorithm, which we refer to as "the DA", is based on simulated annealing, but differs in several ways (see Sec. II), as well as in its ability to take advantage of the massive parallelization possible when using a custom, application-specific CMOS hardware. Previous efforts of running simulated annealing in parallel include executing different iterations in parallel on an AP1000 massively parallel distributed-memory multiprocessor [36,37]. In addition to the DA, a version of the Digital Annealer, which we refer to as "the PTDA", and which uses parallel tempering Monte Carlo [38][39][40][41][42] for the algorithmic engine is now available. In particular, it has been shown that physics-inspired optimization techniques such as simulated annealing and parallel tempering Monte Carlo typically outperform specialized quantum hardware [30] such as the D-Wave devices.
Much of the benchmarking effort has centred around spin glasses, a type of constraint satisfaction problem, in part due to their being the simplest of the hard Boolean optimization problems. Furthermore, application-based benchmarks from, for example, industry tend to be structured and, therefore, systematic benchmarking is difficult. As such, spin glasses have been used extensively to benchmark algorithms on offthe-shelf CPUs [43,44], novel computing technologies such as quantum annealers [18,20,45,46], and coherent Ising machines [32]. In this paper, we benchmark the DA and the PTDA on spin-glass problems, comparing them to simulated annealing [47] and parallel tempering Monte Carlo with isoenergetic cluster moves [48,49] (a variant of Houdayer cluster updates [50] within the context of optimization and not the thermal simulation of spin-glass systems), both state-ofthe-art, physics-inspired optimization algorithms. For other alternative classical optimization techniques used in the literature to solve QUBO problems, the interested reader is referred to [22,30,51].
The paper is organized as follows. Section II describes the algorithms we have benchmarked. In Sec. III we probe the advantage of parallel-trial over single-trial Monte Carlo moves and in Sec. IV we discuss the methodology we have used for measuring time to solution. In Sec. V we introduce the problems benchmarked. The experimental results are presented and discussed in Sec. VI. Finally, our conclusions are presented in Sec. VII. The parameters used for our benchmarking are given in A.

II. ALGORITHMS
In this paper, we compare several Monte Carlo (MC) algorithms and their use for solving optimization problems.

A. Simulated Annealing
Simulated annealing (SA) [15] is a generic algorithm with a wide application scope. The SA algorithm starts from a random initial state at a high temperature. Monte Carlo updates at decreasing temperatures are then performed. Note that the temperatures used follow a predefined schedule. When the simulation stops, one expects to find a lowtemperature state, ideally the global optimum (see Algorithm 1 for details). The high-temperature limit promotes diversification, whereas the low-temperature limit promotes intensification. To increase the probability of finding the optimum, this process is repeated multiple times (referred to as "runs"), returning the best state found. The computational complexity of each Monte Carlo sweep in SA is O(N 2 ) for fully connected problems with N variables, because each sweep includes N update proposals, and each accepted move requires updating N effective fields, at a constant computational cost.

B. The Digital Annealer's Algorithm
The DA's algorithmic engine [34,35] is based on SA, but differs from it in three main ways (see Algorithm 2). First, it starts all runs from the same arbitrary state, instead of starting each run from a random state. This results in a small speedup due to its avoiding the calculation of the initial N effective fields and the initial energy for each run. Second, it uses a parallel-trial scheme in which each Monte Carlo step considers a flip of all variables (separately), in parallel. If at least one flip is accepted, one of the accepted flips is chosen uniformly at random and it is applied. Recall that in SA, each Monte Carlo step considers a flip of a single variable only (i.e., single trial). The advantage of the parallel-trial scheme is that it can boost the acceptance probability, because the likelihood of accepting a flip out of N flips is typically much higher than the likelihood of flipping a particular variable (see Sec. III). Parallel rejection algorithms on GPU [52,53] are examples of similar efforts in the literature to address the low acceptance probability problem in Monte Carlo methods. Finally, the DA employs an escape mechanism called a dynamic offset, such that if no flip was accepted, the subsequent acceptance probabilities are artificially increased by subtracting a positive value from the difference in energy associated with a proposed move. This can help the algorithm to surmount short, narrow barriers. Furthermore, the application-specific CMOS hardware allows for massive parallelization that can be exploited for solving optimization problems faster. For example, in the DA, evaluating a flip of all variables is performed in parallel, and when a flip is accepted and applied, the effective fields of all neighbours are updated in parallel. Note that this step requires a constant time, regardless of the number of neighbours, due to the parallelization on the hardware, whereas the computational time of the same step in SA increases linearly in the number of neighbours.
In order to understand the logic behind the DA, it is helpful to understand several architectural considerations that are specific to the DA hardware. In the DA, each Monte Carlo step takes the same amount of time, regardless of whether a flip was accepted (and therefore applied) or not. In contrast, in a CPU implementation of SA, accepted moves are typically much more computationally costly than rejected moves, that is, O(N ) vs. O(1) , due to the need to update N effective fields versus none if the flip is rejected. As a result, in the DA, the potential boost in acceptance probabilities (from using the parallel-trial scheme) is highly desirable. In addition, in the DA, the computational complexity of updating the effective fields is constant regardless of the connectivity of the graph. Comparing this with SA, the computational complexity of updating the effective fields is O(N ) for fully connected graphs, but it is O(d) for fixed-degree graphs (in which each node has d neighbours). Therefore, running SA on a sparse graph is typically faster than on a dense graph, but the time is the same for the DA. For this reason, it is expected that the speedup of the DA over SA be, in general, higher for dense graphs than for sparse ones.
Finally, it is worth mentioning that the pseudorandom number generator algorithm implemented in the hardware is similar to a twisted generalized feedback shift register algorithm and gives a sufficiently long period of 2 19937 [54].

C. Parallel Tempering with Isoenergetic Cluster Moves
In parallel tempering (PT) [38][39][40][41][42]55] (also known as replica-exchange Monte Carlo), multiple replicas of the system are simulated at different temperatures, with periodic exchanges based on a Metropolis criterion between neighbouring temperatures. Each replica, therefore, performs a random walk in temperature space, allowing it to overcome energy barriers by temporarily moving to a higher temperature. The higher-temperature replicas are typically at a high enough temperature that they inject new random states into the system, essentially re-seeding the algorithm continuously, obviating (at least partially) the need to perform multiple runs. PT has been used effectively in multiple research fields [41], and often performs better than SA, due to the increased mixing.
The addition of isoenergetic cluster moves (ICM) [48,50] to PT, which flip multiple variables at a time, can allow for better exploration of the phase space, but only if the variable clusters do not span the whole system [48,49]. ICM is a generalization of Houdayer's cluster algorithm, which was tailored for two-dimensional spin-glass problems [50]. To perform ICM, two copies (or more) of the system are simulated at the same temperature. The states of those two replicas are then compared, to find a cluster of variables (i.e., a connected component) that are opposite. In the case of QUBO problems, opposite variables are defined as having a product of zero. Finally, the move is applied by swapping the states of the opposite variables in the two replicas. The total energy of the two replicas is unchanged by this move, such that it is rejection free. The combination of PT and ICM, PT+ICM (also known as borealis; see Algorithm 3 [49]), has been shown to be highly effective for low-dimensionality spin-glass-like problems [20,30,56], but it does not provide a benefit for problems defined on fully connected graphs. This can be understood by noting that when the clusters span the system, ICM essentially results in swapping the states completely. for each replica, for each variable do 4: propose a flip 5: if accepted, update the state and effective fields 6: end for 7: for each pair of sequential replicas do 8: propose a replica exchange 9: if accepted, swap the temperatures between the replicas 10: end for 11: perform ICM update, swapping the states of a cluster of variables that have opposite states in the two replicas; update the states and the effective fields for both replicas 12: end for D. The Parallel Tempering Digital Annealer's Algorithm Because the DA's algorithm is based on SA, and given the often superior results that PT gives over SA (see, e.g., Ref. [30]), Fujitsu Laboratories has also developed a Paral-lel Tempering Digital Annealer (PTDA). We had access to an early implementation of a PTDA. In the PTDA, the sweeps in each replica are performed as in the DA, including the parallel-trial scheme, parallel updates, and using the dynamic offset mechanism, but the PT moves are performed on a CPU. The temperatures are set automatically based on an adaptive scheme by Hukushima et al. [57]. In this scheme, the high and low temperatures are fixed, and intermediate temperatures are adjusted with the objective of achieving an equal replicaexchange probability for all adjacent temperatures. Having equal replica-exchange acceptance probabilities is a common target, although other schemes exist [42].
The next generation of the Digital Annealer is expected to be able to simulate problems on complete graphs up to 8192 variables in size, to have faster annealing times, and to perform the replica-exchange moves on the hardware, rather than on a CPU. This is significant, because when performing a computation in parallel, if a portion of the work is performed sequentially, it introduces a bottleneck that eventually dominates the overall run time (as the number of parallel threads is increased). Amdahl's Law [58] quantifies this by stating that if the sequential part is a fraction α of the total work, the speedup is limited to 1/α asymptotically.

III. PARALLEL-TRIAL VERSUS SINGLE-TRIAL MONTE CARLO
To illustrate the advantage of parallel-trial Monte Carlo updates as implemented in the DA over single-trial Monte Carlo updates, let us calculate their respective acceptance probabilities. The acceptance probability for a particular Monte Carlo move is given by the Metropolis criterion A(∆E i , T ) ≡ e −∆Ei/T , where ∆E i denotes the difference in energy associated with flipping variable i, and T is the temperature. The single-trial acceptance probability is then given by where N is the number of variables. In contrast, the paralleltrial acceptance probability is given by the complement probability of not accepting a move, At low temperatures, we expect the acceptance probability to reach zero, in general. In the limit A → 0, a first-order approximation of the parallel-trial acceptance probability gives This indicates that in the best case, there is a speedup by a factor of N at low temperatures. In contrast, at a high enough temperature, all moves are accepted, hence A → 1. In this limit, it is clear that both the single-trial and parallel-trial acceptance probabilities reach 1, so parallel-trial Monte Carlo does not have an advantage over single-trial Monte Carlo. To quantify the difference between parallel-trial and singletrial Monte Carlo, we perform a Monte Carlo simulation at constant temperature for a sufficiently large number of sweeps to reach thermalization. Once the system has thermalized, we measure the single-trial and parallel-trial acceptance probabilities at every move. This has been repeated for a number of sweeps, and for multiple temperatures and multiple problems.
The results of such an experiment are presented in Fig. 1, for problems of size N = 64 of the four problem classes described in detail in Sec. V. The problem classes include two-dimensional (2D) and fully connected [Sherrington-Kirkpatrick (SK)] spin-glass problems with bimodal and Gaussian disorder. The results for all the problem classes except for the 2D-bimodal class follow the expected pattern of the acceptance probabilities reaching zero at low temperatures and one at high temperatures. In the 2D-bimodal case, there is a huge ground-state degeneracy, such that even at the ground state there are single variables for which a flip does not result in a change in energy. This results in a positive single-trial acceptance probability even at very low temperatures. For the same reason, the parallel-trial probability reaches one even for very low temperatures.
To quantify the acceptance probability advantage of parallel-trial over single-trial updates, it is instructive to study the parallel-trial acceptance probability divided by the singletrial acceptance probability, as presented in Fig. 2. For all problem classes except 2D-bimodal, the advantage at low temperatures is indeed a factor of N , as suggested by dividing Eq. (3) by Eq. (1). As explained above, in the 2D-bimodal case the single-trial acceptance probability is nonnegligible at low temperatures, leading to a reduced advantage. It is noteworthy that the advantage of the parallel-trial scheme is maximal at low temperatures, where the thermalization time is longer. As such, the parallel-trial scheme provides an acceptance probability boost where it is most needed.

IV. SCALING ANALYSIS
The primary objective of benchmarking is to quantify how the computational effort in solving problems scales as the size of the problem (e.g., the number of variables) increases. The algorithms we consider here are all stochastic, and a common approach to measuring the scaling of a probabilistic algorithm is to measure the total time required for the algorithm to find a reference energy (cost) at least once with a probability of 0.99. The reference energy is represented by the optimal energy if available or, otherwise, by the best known energy. We denote this time to solution by "TTS", and explain how it is calculated in the rest of this section.
We consider the successive runs of a probabilistic algorithm as being a sequence of binary experiments that might succeed in returning the reference energy with some probability. Let us formally define X 1 , X 2 , . . . , X r as a sequence of ran- P(X i = 1) = θ denotes the probability of success, that is, of observing the reference energy at the i-th run. Defining as the number of successful observations in r runs, we have That is, Y has a binomial distribution with parameters r and θ. We denote the number of runs required to find the reference energy with a probability of 0.99 as R 99 , which equals r such that P(Y ≥ 1|θ, r) = 0.99. It can be verified that and, consequently, that where τ is the time it takes to run the algorithm once. Because the probability of success θ is unknown, the challenge is in estimating θ.
Instead of using the sample success proportion as a point estimate for θ, we follow the Bayesian inference technique to estimate the distribution of the probability of success for each problem instance [22]. Having distributions of the success probabilities would be helpful in more accurately capturing the variance of different statistics of the TTS. In the Bayesian inference framework, we start with a guess on the distribution of θ, known as a prior, and update it based on the observations from consecutive runs of the algorithm in order to obtain the posterior distribution. Since the consecutive runs have a binomial distribution, the suitable choice of prior is a beta distribution [59] that is the conjugate prior of the binomial distribution. This choice guarantees that the posterior will also have a beta distribution. The beta distribution with parameters α = 0.5 and β = 0.5 (the Jeffreys prior) is chosen as a prior because it is invariant under reparameterization of the space and it learns the most from the data [60]. Updating the Jeffreys prior based on the observations from consecutive runs, the posterior distribution, denoted by π(θ), is π(θ) ∼ Beta(α + y, β + r − y), where r is the total number of runs and y is the number of runs in which the reference energy is found.
To estimate the TTS for the entire population of instances with similar parameters, let us assume that there are I instances with the same number of variables. After finding the posterior distribution π i (θ) for each instance i in set {1, 2, . . . , I}, we use bootstrapping to estimate the distribution of the q-th percentile of the TTS. This procedure is described in Algorithm 4. for each sampled instance j do 5: sample a value, p jb , from its posterior probability π j (θ) 6: calculate end for 8: find the q-th percentile of the set {R 99,jb } and denote it by R 99,qb 9: end for 10: consider the empirical distribution of (τ R 99,q1 , . . . , τ R 99,qB ) as an approximation of the true TTSq distribution The procedure for deriving the TTS is slightly different for the DA and the PTDA. The anneal time of the algorithmic engine of the DA is not a linear function of the number of runs for a given number of sweeps. We therefore directly measure the anneal time for a given number of iterations and a given number of runs where the latter is equal to the R 99 . Each iteration (Monte Carlo step) in the DA represents one potential update and each Monte Carlo sweep corresponds to N iterations.
It is important to note that the correct scaling is only observed if the parameters of the solver are tuned such that the TTS is minimized. Otherwise, a suboptimal scaling might be observed and incorrect conclusions could be made. Recall that the TTS is the product of the R 99 and the time taken per run τ . Let us consider a parameter that affects the computational effort taken, such as the number of sweeps. Increasing the number of sweeps results in the algorithm being more likely to find the reference solution, hence resulting in a lower R 99 . On the other hand, increasing the number of sweeps also results in a longer runtime, increasing τ . For this reason, it is typical to find that the TTS reaches infinity for a very low or very high number of sweeps, and the goal is to experimentally find a number of sweeps at which the TTS is minimized.

V. BENCHMARKING PROBLEMS
A quadratic Ising model can be represented by a Hamiltonian (i.e., cost function) of the form Here, s i ∈ {−1, 1} represent Boolean variables, and the problem is encoded in the biases h i and couplers J ij . The sums are over the vertices V and weighted edges E of a graph G = (V, E). It can be shown that the problem of finding a spin configuration {s i } that minimizes H, in general, is equivalent to the NP-hard weighted max-cut problem [61][62][63][64]. Spin glasses defined on nonplanar graphs fall into the NP-hard complexity class. However, for the special case of planar graphs, exact, polynomial-time methods exist [65]. The algorithmic engine of the Digital Annealer can optimize instances of QUBO problems in which the variables x i take values from {0, 1} instead of {−1, 1}. To solve a quadratic Ising problem described by the Hamiltonian represented in Eq. (8), we can transform it into a QUBO problem by taking s i = 2x i − 1.
In the following, we explain the spin-glass problems used for benchmarking.
2D-bimodal -Two-dimensional spin-glass problems on a torus (periodic boundaries), where couplings are chosen according to a bimodal distribution, that is, they take values from {−1, 1} with equal probability.
2D-Gaussian -Two-dimensional spin-glass problems where couplings are chosen from a Gaussian distribution with a mean of zero and a standard deviation of one, scaled by 10 5 .
SK-bimodal -Spin-glass problems on a complete graph-also known as Sherrington-Kirkpatrick (SK) spin-glass problems [66]-where couplings are chosen according to a bimodal distribution, that is, they take values from {−1, 1} with equal probability.
SK-Gaussian -SK spin-glass problems where couplings are chosen from a Gaussian distribution with a mean of zero and a standard deviation of one, scaled by 10 5 .
In all the problems, the biases are zero. The coefficients of the 2D-Gaussian and SK-Gaussian problems are beyond the precision limit of the current DA. In order to solve these problems using the DA, we have used a simple scheme to first scale the coefficients up to their maximum limit and then round to the nearest integer values. The maximum values for the linear and quadratic coefficients are given by the precision limits of the current DA hardware, that is, 2 25 −1 and 2 15 −1, respectively. The problem instances are not scaled when solving them using SA or PT (PT+ICM).
Our benchmarking experiment has been parametrized by the number of variables. Specifically, we have considered nine different problem sizes in each problem category and generated 100 random instances for each problem size. We have used the instance generator provided by the University of Cologne Spin Glass Server [67] to procure the 2D-bimodal and 2D-Gaussian instances. SK instances with bimodal and Gaussian disorder have been generated as described above. Each problem instance has then been solved by different Monte Carlo algorithms. Optimal solutions to the 2D-bimodal and 2D-Gaussian problems have been obtained by a branchand-cut [68] technique available via the Spin Glass Server [67]. The SK problems are harder than the two-dimensional problems and the server does not find the optimal solution within the 15-minute time limit. For the SK-bimodal and the SK-Gaussian problems with 64 variables, we have used a semidefinite branch-and-bound technique through the Biq Mac Solver [69] and BiqCrunch [70] to find the optimal solutions, respectively. For problems of a size greater than 64, the solution obtained by PT with a large number of sweeps (5·10 5 sweeps) is considered a good upper bound for the optimal solution. We refer to the optimal solution (or its upper bound) as the reference energy.

VI. RESULTS AND DISCUSSION
In this paper, we have used an implementation of the PT (and PT+ICM) algorithm based on the work of Zhu et al. [48,49]. The DA and PTDA algorithms are run on Fujitsu's Digital Annealer hardware. For the SA simulations we have used the highly optimized, open source code by Isakov et al. [47].
The DA solves only a fully connected problem where the coefficients of the absent vertices and edges in the original problem graph are set to zero. In our benchmarking study, we have included both two-dimensional and SK problems to represent the two cases of sparsity and full connectivity, respectively. Furthermore, we have considered both bimodal and Gaussian disorder in order to account for problems with high or low ground-state degeneracy. The bimodal disorder results in an energy landscape that has a large number of free variables with zero effective local fields. As a result, the degeneracy of the ground state increases exponentially in the number of free variables, making it easier for any classical optimization algorithm to reach a ground state. Problem instances that have Gaussian coefficients further challenge the DA, due to its current limitations in terms of precision.
In what follows, we discuss our benchmarking results, comparing the performance of different algorithms using twodimensional and SK spin-glass problems, with bimodal and Gaussian disorder. We further investigate how problem density affects the DA's performance. The parameters of the algorithms used in this benchmarking study are presented in A. Figure 3 illustrates the TTS results of the DA, SA, PT+ICM, and PTDA for 2D spin-glass problems with bimodal and Gaussian disorder. In all TTS plots in this paper, points and error bars represent the mean and the 5th and 95th percentiles of the TTS distribution, respectively. PT+ICM has the lowest TTS for problems with bimodal and Gaussian disorder and has a clear scaling advantage. In problems with bimodally distributed couplings, although SA results in a lower TTS for small-sized problems, the DA and SA demonstrate similar TTSs as the problem size increases. The performance of both SA and the DA decreases when solving harder problem instances with Gaussian disorder, with significantly reduced degeneracy of the ground states. However, in this case, the DA outperforms SA even with its current precision limit.

A. 2D Spin-Glass Problems
The PTDA yields higher TTSs than the DA in both cases of bimodal and Gaussian disorder, likely due to the CPU overhead of performing parallel tempering moves. Considering the number of problems solved to optimality, the PTDA outperforms the DA when solving 2D spin-glass problems with bimodal couplings; however, as shown in Fig. 3, the PTDA solves fewer problems with Gaussian disorder.
In order to estimate the q-th percentile of the TTS distribution, at least q% of problems should be solved to their corresponding reference energies. If there is no point for a given problem size and an algorithm in Fig. 3, it means that enough instances have not been solved to their reference energies and we therefore could not estimate the TTS percentiles. Increasing the number of iterations to 10 7 in the DA and the PTDA, we could not solve more than 80% of the 2D-Gaussian problem instances with a size greater than or equal to 400. Therefore, to gather enough statistics to estimate the 80th percentile of the TTS, we have increased the number of iterations to 10 8 in the DA. However, because of the excessive resources needed, we have not run the PTDA with such a large number of iterations.

2D-bimodal
In Fig. 4a, we observe that for a given problem size and number of sweeps, the DA reaches higher success probabilities than SA. As the problem size increases, the difference between the mean success probability curves of the DA and SA becomes less pronounced. Figure 4b illustrates the success probability correlation of 100 problem instances of size 1024. The DA yields higher success probabilities for 52 problem instances out of 100 instances solved.
For 2D-bimodal problems, the boost in the probability of updating a single variable due to the parallel-trial scheme is not effective enough to decrease the TTS or to result in better scaling (Fig. 3a). Since both the DA and SA update at most one variable at a time, increasing the probability of updating a variable in a problem with bimodal disorder, where there are a large number of free variables, likely results in a new configuration without lowering the energy value (see Sec. III for details).

2D-Gaussian
The performance of the DA and SA significantly degrades when solving the problems with Gaussian disorder, which are harder; however, the DA demonstrates clear superiority over SA. Figure 5 shows the residual energy (E), which is the relative energy difference (in percent) between the lowestenergy solution found and the reference-energy solution, for the largest problem size, which has 1024 variables. We observe that the DA outperforms SA, as it results in a lower residual energy for a given number of Monte Carlo sweeps (S ≤ 10 5 ). Furthermore, Fig. 6 illustrates that the paralleltrial scheme is more effective for this class of problems, which could be due to the decrease in the degeneracy of the ground  states. In Fig. 6a, we observe that the difference between mean success probabilities, for a problem of size 144 with 10 4 Monte Carlo sweeps, is larger compared to the bimodal disorder (Fig. 4a). The success probability correlation of 100 2D-Gaussian problem instances with 400 variables in Fig. 6b further demonstrates that the DA reaches higher success probabilities, which results in a lower TTS (Fig. 3b).

B. SK Spin-Glass Problems
We have solved the SK problem instances with the DA, SA, PT, and the PTDA. As explained in Sec. II, for the fully connected problems, the cluster moves have not been included in PT because the clusters of variables span the entire system. Our initial experiments verified that adding ICM to PT increases the computational cost without demonstrating any scaling benefit for this problem class.
The statistics of the TTS distribution of the DA, the PTDA, As the number of Monte Carlo sweeps approaches infinity, the residual energy of both algorithms will eventually reach zero. We have run SA for up to 10 6 sweeps and the DA for up to 10 8 iterations (10 8 /1024 ∼ 10 5 sweeps). Therefore, the data up to 10 5 sweeps is presented for both algorithms. We expect that by increasing the number of iterations to 10 9 (10 6 sweeps) in the DA, the residual energy would further decrease. and the SA and PT algorithms are shown in Fig. 7a for SKbimodal problem instances. Comparing the DA to SA and PT, we observe that the DA yields a noticeable, consistent speedup of at least two orders of magnitude as we approach the largest problem size. In the fully connected problems, accepting a move and updating the effective local fields in a CPU implementation of a Monte Carlo algorithm is computationally more expensive than for sparse problems. Figure 7a shows that each algorithm has solved at least 80% of the SK instances for all problem sizes. We attribute this to the fact that the reference energy for the complete graph problems is an upper bound on the exact optimal solution. We do not know how tight the upper bound is, but it represents, to the best of our knowledge, the best known solution.
To obtain insights on scaling, for each algorithm, we have fit an exponential function of the form y = 10 α+βN , where y and N are the means of the TTS distribution and the number of variables, respectively. Figure 7b shows the 90% confidence interval of the estimated scaling exponent β for the algorithms based on the statistics of the 50th and the 80th percentiles of the TTS distribution. For the 50th percentile, we observe that the PTDA yields superior scaling over the other three algorithms for the problem class with bimodal disorder. For the 80th percentile, there is not enough evidence to draw a strong conclusion on which algorithm scales better because the 90% confidence intervals of the estimated scaling exponents overlap. However, the PTDA has the lowest point estimate.
For the DA, SA, and PT algorithms, we have searched over a large number of parameter combinations to experimentally determine a good set of parameters (see A) while the parameters of the PTDA have been determined automatically by the hardware. We have further experimentally determined the optimal number of sweeps for all four algorithms. However, we do not rule out the possibility that the scaling of the algorithms might be suboptimal due to a non-optimal tuning of parameters. For example, the scaling of the PTDA might improve after tuning its parameters and PT might exhibit better scaling using a more optimized temperature schedule. Figure 8 illustrates the TTS statistics and the confidence interval of the scaling exponent for SK-Gaussian problem instances solved by the DA, SA, PT, and the PTDA. We observe that the DA continues to exhibit a constant speedup of at least two orders of magnitude over the other algorithms, with no strong scaling advantage, in solving spin-glass problems with Gaussian disorder.

The DA versus SA
The DA results in lower TTSs than SA for both SK-bimodal and SK-Gaussian problem instances. The reasons for this behaviour are twofold. First, the anneal time for the DA is independent of the number of variables and the density of the problem, whereas the computation time of a sweep in SA increases with the problem size and the problem density. Second, as shown in Fig. 9, the parallel-trial scheme significantly improves the success probability in fully connected spin-glass problems of size 1024 with both bimodal and Gaussian disorder. As expected, the boost in the low-degeneracy problem instances (with Gaussian coefficients) is higher.
Although the confidence intervals of the scaling exponents overlap, considering the statistics of the TTS 80 , the DA yields lower point estimates than SA for SK-bimodal and SK-Gaussian problems. In particular, β = 0.0021(7) [0.0019(3)] for the DA with bimodal [Gaussian] disorder, whereas β = 0.0027(6) [0.0028(5)] for SA with bimodal [Gaussian] disorder, thus providing a weak scaling advantage.
Our results on spin-glass problems with Gaussian disorder further indicate that the 16-bit precision of the hardware used in this study is not a limiting factor because the DA outperforms SA on instances of these problems. Since there is a high variance in the couplers of spin-glass instances with Gaussian disorder, we expect that the energy gap between the ground state and the first-excited state is likely greater than The 90% confidence interval of the estimated scaling exponent β based on the mean of the TTS 50 and TTS 80 for the SK-bimodal problem class. The label below each rectangle represents the TTS percentile on which the confidence interval is based. For each algorithm, the five largest problem sizes have been used to estimate the scaling exponent by fitting a linear model to log 10 TTS. The R 2 of the fitted model is greater than or slightly lower than 90% for each algorithm.

C. Spin-Glass Problems with Different Densities
Our results for the two limits of the problem-density spectrum suggest that the DA exhibits similar TTSs to SA on sparse problems, and outperforms SA on fully connected problems by a TTS speedup of approximately two orders of magnitude. To obtain a deeper understanding of the relation between the performance and the density, we have performed an experiment using random problem graphs with nine different densities. For each problem density, 100 problem instances with 1024 variables have been generated based on the Erdős-Rényi model [72], with bimodally distributed coupling coefficients, and zero biases. The parameters of the DA and SA have been experimentally tuned for each of the nine prob- lem densities (see A). Figure 10 shows the statistics of the TTS of the DA and SA for different problem densities. The TTS results for 2Dbimodal (d = 0.4%) and SK-bimodal (d = 100%) for a problem size of 1024, representing the limits of the density spectrum, are also included. The DA has lower TTSs than SA for all problem densities except for the sparsest problem set-2D-bimodal. Not enough 2D-bimodal instances were solved to optimality using the DA and SA in order to estimate the statistics of the TTS 80 distributions.  Figure 11 shows the success probabilities of 100 spin-glass problem instances of size 1024 with different densities solved by the DA and SA. The DA has higher success probabilities than SA by a statistically significant margin for all of the densities except for the sparsest problem set-2D-bimodal. We interpret these results as being due to both the increase in the success probabilities from using a parallel-trial scheme and the constant time required to perform each Monte Carlo step on the DA hardware architecture.

VII. CONCLUSIONS AND OUTLOOK
In this work we have compared the performance of the Digital Annealer (DA) and the Parallel Tempering Digital Annealer (PTDA) to parallel tempering Monte Carlo with and without isoenergetic cluster moves (PT+ICM and PT, respectively) and simulated annealing (SA) using random instances of sparse and fully connected spin-glass problems, with bimodal and Gaussian disorder.
Our results demonstrate that the DA is approximately two orders of magnitude faster than SA and PT in solving dense problems, while it does not exhibit a speedup for sparse problems. In the latter problem class, the addition of cluster updates to the PT algorithm is very effective in traversing the energy barriers, outperforming algorithms that act on a single flip neighbourhood, such as the DA and SA. For dense problems, the efficiency of the cluster moves diminishes such that the DA is faster, due to the parallel-trial scheme combined with the massive parallelization that is possible on application-specific CMOS hardware. Our results further support the position that the DA has an advantage over SA on random spin-glass problems with densities of 10% or higher.
In Sec. III we demonstrate that parallel-trial Monte Carlo can offer a significant boost to the acceptance probabilities over standard updating schemes. Furthermore, we show that this boost vanishes at high temperatures and is diminished for problems with high ground-state degeneracy. Our benchmarking results further support the view that the paralleltrial scheme is more effective in solving problems with low ground-state degeneracy because an accepted move is more likely not only to change the state configuration, but also to lower the energy value.
In the current early implementation of the PTDA, the TTS is higher than it is likely to be in the future, due to the CPU overhead in performing PT moves. However, the PTDA algorithm demonstrates better scaling than the other three algorithms for a fully connected spin-glass problem of average computational difficulty, with bimodal couplings.
In the next generation of the Digital Annealer, the hardware architecture is expected to allow the optimization of problems using up to 8192 fully connected variables. In addition, the annealing time is expected to decrease, and we conjecture that the TTS might decrease accordingly. Finally, we expect the replica-exchange moves in the PTDA to be performed on the hardware, which could improve the performance of the PTDA.
Our results demonstrate that pairing application-specific CMOS hardware with physics-inspired optimization methods results in extremely efficient, special-purpose optimization machines. Because of their fully connected topology and high digital precision, these machines have the potential to outperform current analog quantum optimization machines. Pairing such application-specific CMOS hardware with a fast interconnect could result in large-scale transformative optimization devices. We thus expect future generations of the Digital Annealer to open avenues for the study of fundamental physics problems and industrial applications that were previously inaccessible with conventional CPU hardware.

CONFLICT OF INTEREST STATEMENT
Authors MA, GR, and EV are employed by the company 1QBit. Authors TM and HT are employed by the company Fujitsu Laboratories Ltd. Author HGK was employed by 1QBit when this research was completed and is currently employed by the company Microsoft Research.
The authors declare that there is no further commercial or financial relationship that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS
Authors MA, GR, and HGK developed the methodology, implemented the code, performed the experiments, analyzed the results, and wrote the manuscript. Author EV partially contributed to implementing the code and conducting the experiments. Authors TM and HT carried out the experiments related to the PTDA algorithm. HGK's research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Interagency Umbrella Agreement No. IA1-1198. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

ACKNOWLEDGEMENT
The authors would like to thank Salvatore Mandrà for helpful discussions, Lester Szeto, Brad Woods, Rudi Plesch, Shawn Wowk, and Ian Seale for software development and technical support, Marko Bucyk for editorial help, and Clemens Adolphs for reviewing the manuscript. We thank Zheng Zhu for providing us with his implementation of the PT+ICM algorithm [48,49], and Sergei Isakov for the use of his implementation of the SA algorithm [47]. HGK would like to thank Bastani Sonnati for inspiration.

Appendix A: Simulation Parameters
In this section we present the parameters of the algorithms used to solve two-dimensional and SK spin-glass problems with bimodal and Gaussian disorder, then explain the parameters used to calculate the TTSs of the DA and SA when solving spin-glass problems with different densities.

2D & SK Spin-Glass Problems
For the DA, SA, and PT (PT+ICM) and each problem class, we have performed a grid search in the parameter space to determine the best parameters, using a subset of problem instances. The subset of instances used for parameter tuning include instances of size 576, 784, and 1024. The number of instances solved to optimality, the success probability, and the residual energy have been used to select the best parameter combination for the benchmarking study.
In order to find the optimal TTS for a given problem size, we have varied the number of sweeps (iterations in the DA and the PTDA) and have used the procedure shown in Algorithm 4 to find the empirical distribution of the TTS for each number of sweeps. We have then found the number of sweeps for which the TTS distribution has the lowest mean and report the statistics of that distribution as the optimal TTS results. In the calculation of the TTS, we have excluded the initialization and post-processing times. The time spent on replica-exchange moves, currently performed via CPU, has been considered to be part of the PTDA's execution time. The time that it takes to execute one run of SA and PT (and PT+ICM) has been measured using r4.8xlarge Amazon EC2 instances, which consist of Intel Xeon E5-2686 v4 (Broadwell) processors.
To set the grid for the high and the low temperatures, we have simulated the distribution of the energy differences associated with proposed moves and search in the vicinity of the 5th (80th) to the 10th (85th) percentiles of this distribution to find the best-performing low (high) temperature value. In PT (PT+ICM), because we are able to measure different quantities during the simulation, we have further ensured that the highest temperature has been chosen such that the Monte Carlo acceptance probabilities are between 0.6 and 0.8. The parameter values used in each algorithm are outlined below.

The DA parameters
For all experiments, we have used 100 runs, each starting at a vector of zeros. The temperature schedule is linear in the inverse temperature, and the temperature has been adjusted after every iteration. The DA uses the Metropolis criterion to accept Monte Carlo moves. It is worth mentioning that our early experimentation has suggested better performance for the linear inverse temperature schedule than the exponential temperature schedule.
Our investigation of different parameter combinations has further shown that the performance of the DA on spin-glass problems is indifferent to the dynamic offset mechanism. Therefore, we turn this feature off for our final experimentation. The high (T h ) and the low (T l ) temperatures used for each problem class are shown in Table I. TABLE I: High (T h ) and low (T l ) temperatures used in the DA, SA, and PT runs. Note that the row marked with "PT" also includes the PT+ICM parameters. The low and the high temperatures of SA for 2D-bimodal and SK-bimodal are selected according to Refs. [46,47]. The temperature values used in our simulations are unitless.

SA parameters
Each instance has been solved 100 times using SA and the temperature schedule has been set to be linear in the inverse temperature, which is a typical choice for SA in the literature [46,47]. The high (T h ) and the low (T l ) temperatures used for each problem class are shown in Table I.

PT (PT+ICM) parameters
Although the performance of replica-exchange algorithms is significantly dependent on the choice of temperature schedule, the temperatures at each replica have been set based on the commonly used geometric schedule [42]. After determining the low and the high temperatures, the number of replicas has been chosen such that the replica-exchange acceptance probabilities are above 0.2. The number of replicas has been set to 25, 60, 50, and 60, respectively, for the 2D-bimodal, SK-bimodal, 2D-Gaussian, and SK-Gaussian instances. In contrast to the runs of the DA and SA, the replicas in PT (PT+ICM) are not independent and to calculate the TTS, the whole PT (PT+ICM) algorithm has been repeated 30 times for each instance. This is time consuming, so each run (repeat) has been stopped immediately if the reference solution has been found. The high (T h ) and the low (T l ) temperatures used for each problem class are shown in Table I.

The PTDA parameters
The number of replicas has been set to 40 for both the SK-bimodal and SK-Gaussian problem instances, and the dynamic offset feature has been turned off. The high and the low temperatures and the temperature schedule are set internally by an automatic parameter-tuning strategy (see also Sec. II). As done for PT, since the replicas are dependent, the whole algorithm has been repeated 30 times to have enough observations to calculate the TTS for each instance.

Spin-Glass Problems with Different Densities
A grid-search approach on a subset of instances has been used to tune the parameters of the DA and SA for spin-glass problems, as explained in the previous section, separately for each density. The parameters of the DA and SA are the same as the ones used for 2D and SK spin-glass problems, except for the temperature values that are given below.