Discrete optimization using quantum annealing on sparse Ising models

This paper discusses techniques for solving discrete optimization problems using quantum annealing. Practical issues likely to affect the computation include precision limitations, ﬁnite temperature, bounded energy range, sparse connectivity, and small numbers of qubits. To address these concerns we propose a way of ﬁnding energy representations with large classical gaps between ground and ﬁrst excited states, efﬁcient algorithms for mapping non-compatible Ising models into the hardware, and the use of decomposition methods for problems that are too large to ﬁt in hardware. We validate the approach by describing experiments with D-Wave quantum hardware for low density parity check decoding with up to 1000 variables.


INTRODUCTION
D-Wave Systems manufactures a device [1][2][3][4]  (1) Here we have spin variables s i ∈ {−1, 1} indexed by the vertices V(G) of a graph G fixed by the device with allowed pairwise interactions given by the edges E(G) of this graph, and where the h i and J i,j dimensionless coefficients are real-valued. QA was proposed in Finilla et al. [5] and Kadowaki and Nishimori [6] as a method to optimize discrete energy functions. More recently, similar ideas were generalized to full quantum computation [7,8]. Here we explore practical implementation of QA. The quantum annealing process minimizes the Ising energy by evolving the ground state of an initial Hamiltonian H H H 0 = i ∈ V(G) σ σ σ x i to the ground state of a problem Hamiltonian H H H P = i ∈ V(G) h i σ σ σ z i + (i,j) ∈ E(G) J i,j σ σ σ z i ⊗ σ σ σ z j . The ground state of H H H 0 is a superposition state in which all spin configurations are equally likely, while at the end of the process the spin configurations with smallest energy with respect to H H H P are most likely to be measured. The efficiency of QA is determined in part by the energy gap separating ground and excited states during evolution. However, different representations of the same optimization problem may give different quantum gaps, and it is very difficult to know this gap in advance.
In this report, we focus not on problem representations giving rise to larger quantum gaps, but on representations ameliorating the limitations imposed by current experimental hardware. 1. Limited precision/control error. Physical devices impose limitations on the precision to which the programmable parameters h h h, J J J can be specified. Moreover, since the Ising model is only an approximation to the underlying physics there may be systematic errors causing a discrepancy between programmed h h h, J J J and the effective Ising approximation. To maximize the performance of QA in hardware we seek problem representations that are insensitive to these control errors. 2. Limited energy range/finite temperature. Technological limitations restrict the range of energy scales of h h h, J J J relative to the thermal energy k B T. For problems having few ground states and exponentially many excited states within k B T a limited range makes the optimization challenging. 3. Sparse connectivity. It is difficult in hardware to realize all pairwise couplings as the number of couplings grows as n 2 , for an n-qubit system. Thus, QA hardware offering large numbers of variables is likely to offer only sparse connectivity. Figure 1 shows an example of the connectivity graph G for recent hardware supplied by D-Wave Systems. Optimization problems of practical interest require coupling significantly different from G. 4. Small numbers of qubits. While Ising problems may be difficult to minimize even at scales of several hundred variables, real-world problems are often significantly larger. Moreover, the translation of any optimization problem into sparse Ising form will almost always require additional ancillary variables thereby increasing the size of the problem. The detrimental impact of hardware limitations (1) and (2) may be mitigated by finding problem representations in which there is a large (classical) gap between Ising ground and excited states. Notice that this does not guarantee that quantum gaps get any larger, though it seems unlikely they get worse. To deal with issue (3) we introduce additional variables to mediate arbitrary connectivity. For instance, ferromagnetically coupled spins can act as "wires" transporting long-range interactions; in general, the extra variables can take a more subtle role as ancillary variables. Finally, for issue (4), we demonstrate methods by which large problems may be solved through decomposition into smaller subproblems. In addressing these practical limitations we focus on solving Constraint Satisfaction Problems (CSPs) involving a finite set of binary variables. That is, we have a set of variables s i ∈ {−1, 1}, i = 1 . . . n, and a set of constraints, each corresponding to a non-empty subset F F F j of {−1, 1} n (constituting the configurations allowed by the constraint), and we wish to find s s s ∈ j F F F j . To represent such a problem using an Ising model, we construct an Ising-model penalty function for each constraint F F F j . We write a penalty function as Pen F j (z z z) = Pen F j (s s s, a a a) where s s s are the "decision" variables upon which F F F j depends and a a a are ancillary variables. This function should be represented in the Ising form (1) (with an additional constant offset for convenience) and must satisfy: min a a a Pen F j (s s s, a a a) where g > 0. We call g the gap of the penalty function. While a decision variable may be involved in many different constraints, the sets of ancillary variables for different constraints are disjoint. In general, larger gaps make it more likely that the hardware will find a solution satisfying the constraints and offers protection against noise and precision limitations. We add the penalty functions for each constraint to obtain an Ising model for our system of decision and ancillary variables. For any configuration of the decision variables that satisfies all the constraints, there will be some setting of the ancillary variables that makes all the penalty functions 0. On the other hand, any configuration of decision variables violating some constraint will have energy ≥ g. Thus, a ground state of the system solves the CSP if a solution exists.
We also address Constrained Optimization Problems (COPs), where in addition to the constraints F F F j there is an objective to be minimized over the feasible configurations. This can be accomplished by adding more terms to the Hamiltonian expressing the objective in Ising form. The objective should be scaled so that it does not overcome the penalty functions, causing the appearance of ground states that do not satisfy the constraints. In this case also, a larger gap g allows for better scaling of the objective.
This report shows how CSPs and COPs can be represented to ameliorate the hardware limitations listed above. The techniques we present are applicable to any hardware device offering only sparse pairwise variable interaction. In Section 2.1, we consider how to construct a penalty function for a given constraint with the largest possible gap, subject to bounds on the h's and J's imposed by the hardware. We supply a novel algorithm which exploits the sparsity of the hardware graph G. In Section 2.2, we consider how to fit a collection of penalty functions (and the corresponding graphs) onto a hardware graph of sparse connectivity. Generally, each variable participates in several constraints, necessitating the use of embedded chains of vertices of the hardware graph representing a single variable. We present a new heuristic embedding algorithm which scales well to large problems sizes. In Section 2.3, we consider how to deal with problems that are too large to be embedded in the hardware graph. We split the problem into subproblems small enough to embed, and coordinate solutions of the subproblems. We explore two coordination approaches: Belief Propagation (BP) and Dual Decomposition. Section 3 presents experimental results on using the D-Wave hardware to decode LDPC (low-density parity-check) codes [10,see Ch. 47]. Here, each constraint is a parity check on a small set of bits, and the objective is to minimize the Hamming distance to a received vector subject to satisfying all parity checks. Using the techniques presented in this paper on hardware very similar to Figure 1, we were able to decode problems with up to 1000 variables that were not correctly decoded by the standard BP decoding algorithm.

MAPPING PROBLEMS TO ISING MODELS
This section provides an algorithm to find Ising representations of constraints with large classical gaps. To emphasize the linear dependence of the Ising energy on its parameters we write Equation (1) as 1 Here θ i are the local fields h i , θ i,j are the couplings J i,j , and θ 0 represents a constant energy offset. When z z z = (s s s, a a a), we implicitly identify certain nodes in V(G) as decision variables s s s, and other nodes as ancillary variables a a a. We assume there are n decision variables and n a ancillary variables, and that the assignment of these variables to nodes (qubits) is given. The hardware's lower and upper bounds on the parameters are denoted by θ θ θ and θ θ θ respectively (e.g., for D-Wave hardware h i ∈ [−2, 2] and J i,j ∈ [−1, 1]).
For the purpose of this section we will assume that the number of variables, n + n a , is not too large, say less than 40. We will also require that the subgraph induced by the ancillary variables, G a a a , has low treewidth. This assumption allows us to efficiently enumerate the smallest k energy states of any Ising model in G a a a for, say, k ≤ 10000. The treewidth of the Chimera graph in Figure 1 that consists of N × N tiles of complete bipartite graphs K 4,4 is 4N. In our experiments, we used up to 4 × 4 Chimera tiles, so the tree width of G a a a was at most 16.
We maximize the penalty gap g separating s s s ∈ F F F from s s s ∈ F F F subject to the bounds on θ θ θ. This criterion gives rise to the following constrained optimization problem max g,θ θ θ g (3) subject to θ θ θ, φ φ φ(s s s, a a a) ≥ 0 ∀s s s ∈ F F F, ∀a a a (4) θ θ θ, φ φ φ(s s s, a a a) ≥ g ∀s s s ∈ F F F, ∀a a a ∃ a a a : θ θ θ, φ φ φ(s s s, a a a) = 0 ∀s s s ∈ F F F (6) θ θ θ ≤ θ θ θ ≤ θ θ θ.
Constraint (5) separates all s s s ∈ F F F, while constraints (4) and (6) make sure that the minimum penalty for s s s ∈ F F F is 0 [note that constraints (4)-(6) ensure (2)]. Here, constraint (6), involving the disjunction over a a a, makes this problem difficult since we do not know which of the ancillae settings gives zero energy for a particular feasible s s s, i.e., what is a a a (s s s) at which θ θ θ, φ φ φ s s s, a a a (s s s) = 0? However, we note that solving for θ θ θ given a a a (s s s), for s s s ∈ F F F, is a linear programme, and can be made relatively scalable in spite of the exponential number of constraints (the low treewidth assumption makes cut generation tractable [11]). Using binary variables and linear constraints, it is straightforward to transform this optimization problem into a mixed integer linear programme (MILP). Commercial MILP solvers can typically solve this problem if |V(G)| is no larger than about 10. For larger problems these solvers may be ineffective, partly because the linear programme relaxations are typically weak. We propose a method which scales better than a MILP. To address the disjunctive constraint it is easiest to encode a a a (s s s) by introducing n a |F F F| Boolean-valued (0/1) optimization variables β(s s s, i) for s s s ∈ F F F defined so that 2β(s s s, i) − 1 = a i (s s s). The shorthand β β β(s s s) is used to indicate the vector of length n a whose i-th component is β(s s s, i). Rather than directly maximizing g we have found empirically that it is faster to fix a g value and solve the following feasibility problem, FEAS(θ θ θ, β β β), to identify θ θ θ and β β β ≡ {β β β(s s s) | s s s ∈ F F F}: Separately, we find the largest g for which FEAS(θ θ θ, β β β) is satisfiable. Notice that in this formulation we ground constraint (6) by considering all possible ancilla settings a a a. We solve FEAS(θ θ θ, β β β) with Satisfiability Modulo Theory (SMT) solvers. SMT [12] generalizes Boolean satisfiability by allowing some Boolean variables to represent equality/inequality predicates over additional continuous variables. As mentioned earlier, for any given setting of Boolean variables, finding the continuous variables θ θ θ is a linear programme. This linear programme can be solved using a variant of the simplex method [13] that can efficiently propagate new constraints and infer "nogoods" on the β β β variables. A number of solvers for SMT are available, and experiments here rely on the MathSat solver [14]. A naive representation of FEAS(θ θ θ, β β β) requires 2 n + n a inequality constraints, 2 n a |F F F| implication constraints and |V(G)| + |E(G)| bound constraints. However, we can significantly reduce the number of constraints by exploiting the fact that G a a a has low treewidth. This allows us to solve any Ising model in G a a a using dynamic programming, or more specifically, variable elimination (VE). VE exploits the observation that summation distributes over minimization, i.e., min (a + b, a + c) = a + min (b, c). Thus, to minimize a sum of local interactions we pick an ordering in which to minimize (eliminate) one a i variable at a time, and push minimizations as far to the right in the summation as possible. If we record these minimizations in tables then they can be reused in different contexts to save recomputation. Bucket elimination [15] is a convenient way to structure this memoization, and can be encoded nicely within our constraint model.
More concretely, we solve an Ising model in G a a a by storing partial computations and spin states in tables as follows. Suppose that ancilla variables are eliminated in the order a n a , a n a − 1 , . . . , a 1 . Let V n a be the set of ancilla variables adjacent to a n a in G a a a . Clearly, for each setting of all variables in V n a we can deduce the optimal value of a n a . This information is collected in T n a , a table of size 2 |V na | . This table is assigned to the variable of largest index in V n a (alternatively, one can think that the variables in V n a become induced neighbors of each other). In general, V i consists of the neighbors of variable a i together with the variables on the tables assigned to a i but excluding a i itself (alternatively all its neighbors and induced neighbors). We then create T i , a table of size 2 |V i | , for all possible values of variables in V i . Notice that for each setting of the variables in V i we can calculate the smallest contribution to the objective of variable a i by adding linear and quadratic local terms to values from previously generated tables. This information is stored in T i and assigned to the variable of largest index in V i (alternatively, the variables in V i become induced neighbors of each other). In this way, after all variables have been eliminated we end up with one or more tables with a single entry (i.e., the corresponding V i is empty). The sum over the values stored in these singleton tables is the value of the optimal solution of the Ising model. Notice that the storage necessary for the tables is O( n a i = 1 2 |V i | ). For our work here, the important observation is that one can easily find a variable elimination order for which max i |V i | is the treewidth of G a a a using Gogate and Dechter [16]. In what follows, we assume that ancilla variables are ordered using this ordering.
In our case, the constraints are not as simple as just solving an Ising model in G a a a , since some of the coefficients are themselves variables (entries of θ θ θ). Nevertheless, in this parametric Ising model, each table entry can be replaced by a continuous variable m i (v v v i |s s s) (a message conveying all required information from previously eliminated variables), indexed by the decision variable setting s s s and a setting v v v i of variables in V i . Thus, the constraints on s s s Equations (4)-(6) become The purpose of message m i (v v v i |s s s) is to eliminate the ancilla setting of a i . Since in this case the Ising model is defined by variables θ θ θ, we do not know which setting of a i makes the contribution to the Ising model smallest. However, we can upper bound m i (v v v i |s s s) using the two possible values of a i , imposing two inequalities on m i (v v v i |s s s). This suffices for the case when s s s ∈ F F F because the messages will be themselves a lower bound on the value of the Ising model and we only require this value to be at least g in (7). When s s s ∈ F F F, we must make sure that the message takes the exact minimum value for the parametric Ising model. We impose these constraints by making sure that when a i corresponds to β(s s s, i), we lower bound the message so it takes the correct value.

Implementation concerns
For some problems many of the constraints arising from variable elimination are redundant, and many message variables can be shown to be equal. Eliminating such redundancies can dramatically shrink the size of FEAS(θ θ θ, β β β). Further consolidation can be obtained by exploiting automorphisms of G, and gauge symmetries of the Ising energy θ θ θ, φ φ φ(z z z) . Lastly, we note that the order in which constraints are presented to the SMT solver can significantly impact solving time. We have found that running multiple SMT solver instances each with a random shuffling of constraints often yields at least one solution quickly.
Currently, we are limited to problems defined on graphs G with at most 30-50 nodes, and up to |F F F| = 1000 feasible configurations.
To scale to larger sizes requires heuristics which give suboptimal gaps. One approach to better scaling is through graph embedding.
With embedding a problem is reduced to pairwise interactions without regard for the connectivity this reduction may generate. The connectivity is then mimicked in hardware with strong ferromagnetic interactions that slave qubits together: techniques for doing so are discussed in Section 2.2.

Examples
Consider the 3-bit parity check equation, that is, F F F is the set of four feasible solutions consisting of spin triplets with an even number of positive spins. Realizing this parity constraint as an Ising model requires at least one ancillary bit. The bound constraints are h i ∈ [−2, 2] for each i and J ij ∈ [−1, 1] for each edge (i, j). We assume first that G is the complete graph on 4 nodes, K 4 . Then it is straightforward to verify that (s 1 + s 2 + s 3 − 2a + 1) 2 /4 defines an optimal penalty function with gap 1. The same gap can be achieved if the hardware graph is the complete bipartite graph K 3,3 and we identify two qubits on the right side of the partition with two on the left side using ferromagnetic couplings. The first two decision variables are mapped to these two pairs of coupled qubits, so that, s 1 = a 1 and s 2 = a 2 , while the two remaining qubits are s 3 and an additional ancilla a 3 . In this case, the Ising model is +s 2 ( − a 2 − a 3 ) + s 3 (0.5a 1 + 0.5a 2 − a 3 ).
However, using the MILP or SMT model we can obtain a gap of 2 by placing the decision qubits s 1 , s 2 , s 3 on one side of K 3,3 , and all ancillary qubits a 1 , a 2 , a 3 on the other: −a 1 + a 2 − a 3 + s 1 (a 1 + a 2 + a 3 ) +s 2 ( − a 1 + a 2 + a 3 ) + s 3 (a 1 + a 2 − a 3 ).
As a more complex example consider the constraint

MAPPING ISING MODELS TO HARDWARE
Using the techniques in Section 2.1, we realize the constraints {F F F j | 1 ≤ j ≤ m} of a CSP as penalty Ising models {Pen F j (s s s, a a a) | 1 ≤ j ≤ m}, where each Pen F j is defined on a small subgraph G j of the hardware graph G. By choosing the subgraphs to be disjoint (or possibly intersecting at decision variables), we can solve all constraints simultaneously on the hardware. However, most variables s i will appear in multiple constraints, and we require that all instances of a variable take the same value. To do this, we identify several connected qubits with the same variable, and apply a strong ferromagnetic connection between those qubits during the annealing process, ensuring that they obtain the same spin. A connected set of qubits representing the same variable is known as a chain. Notice that the variables in a chain connecting two variables which must assume the same value are ancillary variables enforcing the equality constraint. Chains are simply penalty functions enforcing equality constraints.
The problem of embedding the Ising model of a CSP into G then consists of two parts: (1) choosing a placement of constraints onto disjoint subgraphs of G, and (2) routing chains to represent variables that appear in multiple constraints. This "place and route" model of embedding has been used with great success and scalability in VLSI physical design [17,18], where circuits with millions of variables may be mapped onto a single chip. Some of the techniques presented here have analogies in the VLSI literature. In this section, we describe techniques for placement and routing, as well as more general methods to map Ising models of arbitrary structure to G.

Placement
We consider graphs G consisting of a repeating pattern of unit cells, and assume that constraints (like parity check on 3 variables) can be represented within a single unit cell. In mapping constraints to unit cells within G, we try to place constraints that share variables close to one another: a good choice of placement will make the routing process more tractable. The techniques presented here can be generalized, but for simplicity we assume that the hardware graph G consists of an N × N grid of K 4,4 unit cells. Approaches to placement include: • Quadratic assignment: Define a flow A j,j between two constraints F F F j and F F F j to be the number of variables they have in common. Define a distance B u,u between two unit cells u and u in G to be the length of the shortest path between them.
Assuming two instances of a variable must be joined together by a path to create a chain, the quadratic assignment problem QAP(A, B) identifies a mapping from constraints to cells that roughly minimizes the sum of the chain lengths. (See [19] for background on QAP). Note that we do not require an optimal solution to the QAP problem in order to find a valid embedding; an approximate solution suffices.

www.frontiersin.org
September 2014 | Volume 2 | Article 56 | 5 used in QAP is one example). Provided that each constraint only shares variables with a small number of other constraints, changes in cost function can be evaluated quickly. • Recursive placement: As the hardware graph gets larger, both simulated annealing and quadratic assignment become computationally prohibitive. The cost can be reduced by recursively splitting the problem into pieces and then combining the solutions. First we partition the constraints into two regions such that the number of variables shared between regions is minimized, Then we partition the unit cells of G in two regions such that the number of edges between them is minimized. We continue to partition until the regions are small enough to be computationally tractable. Techniques for partitioning are discussed in Section 2.3.

Routing
Once constraints, and therefore variables, have been assigned to disjoint subgraphs of G, different instances of a variable must be joined together. The routing problem is formulated as follows: given G and a collection of disjoint terminal sets {T i ⊆ V(G) | i = 1 . . . m} representing the qubits on which variable s i has already been placed, find a collection of disjoint chains {S i } m i = 1 such that S i contains T i . The performance of the D-Wave hardware is dependent on the choice of chains, so we would also like to minimize either the maximum size of any chain or the total number of qubits used. This routing problem differs from traditional VLSI routing only in the choice of objective function and the hardware graph G, so several VLSI methods apply with some modification: • Multicommodity flow: The problem of finding a tree S 1 in G of minimal size containing a given terminal set T 1 is known as the Steiner Tree problem on graphs and is NP-hard. For a given T 1 = {z 1 , z 2 , . . . , z k 1 }, we can model a Steiner Tree for T 1 as a network flow. Root vertex z 1 is a source with k 1 − 1 units of commodity, while each of {z 2 , . . . , z k 1 } is a sink requiring one unit of commodity. All other vertices of G have a net-flow of zero. Then, a Steiner Tree S 1 for T 1 exists if and only if there is a feasible flow in this network. By adding binary variables that indicate whether or not a vertex of G is in S 1 , we can easily formulate this problem as a mixed integer linear programme (MILP). The general case, when m > 1, can be formulated as a MILP using m commodities, one for each terminal set (see [20,Ch. 3.6] or [21] for background on multicommodity flows). We add the constraint that each vertex of G can be in at most one Steiner tree, and require that the flow of commodity i can only be routed through vertices in Steiner tree S i . • Greedy Steiner Trees: We heuristically select a variable order and then greedily choose a Steiner Tree for each s i from the subgraph F of unused qubits in G. Several approximation algorithms for the Steiner tree problem are known, but perhaps the simplest is Kou et al. [22]: define a weighted complete graph K on V(F ), where the weight of an edge z 1 z 2 is the shortest-path distance between z 1 and z 2 in F. Choose a minimum spanning tree in K, and then assign s i to every vertex in F on the paths representing edges in the minimum spanning tree. This creates a chain for s i .
• Rip-up routing: Rip-up routing [18,23,24] is a variation of the greedy Steiner Tree algorithm that temporarily allows chains of variables to overlap. For each variable s i , we maintain a chain S i such that T i ⊆ S i . Initially S i is set to T i , but after the first iteration of the algorithm each S i is connected, and by the end of the algorithm the chains are (hopefully) disjoint. The algorithm iteratively updates each chain S i as follows: 1. For each vertex z in G, define a vertex weight wt(z) = α |{j = i:z ∈ S j }| , for some fixed α > 1. 2. Using a Steiner Tree approximation algorithm, choose an approximately minimal vertex-weighted Steiner Tree S i for the terminal set T i .
The algorithm terminates when all S i are disjoint or no improvements are found. By setting the weight of a vertex proportional to the number of variables it represents, the algorithm encourages chains to form on unused vertices whenever possible.
Multicommodity flow is an exact algorithm and therefore most successful when it is tractable (up to roughly 500 qubits). On the other hand Steiner tree approximation algorithms which are polynomial time can be used at much larger scales, and rip-up routing is usually more effective than greedy routing as it is less likely to get trapped in suboptimal local minima.

Global embedding techniques
For certain optimization problems it may be difficult or suboptimal to map individual constraints F F F j to Ising models with the connectivity structure of G. In these cases, we may instead map each F F F j to an Ising model Pen F j (z z z) of arbitrary structure (possibly using ancillary variables), and then attempt to map Pen(z z z) = j Pen F j (z z z) to G directly. Because of the limited qubit connectivity, we will again require chains to represent variables. The techniques described here attempt to map a problem graph P (in which z i and z j are adjacent if they have a non-zero interaction in Pen(z z z)) to G without assuming P or G have any particular structure. Constructing chains such that if two variables are adjacent in P then there is at least one edge between their chains in G is known as the minor-embedding problem [25]. Minorembedding is NP-hard; the best known algorithm has running time O(2 (2k + 1) log k |V(P)| 2k 2 2|V(P)| 2 |V(P)|), where k is the branchwidth of G [26]. While there are deep theoretical results about minors in the theory developed by Robertson and Seymour [27], none of the known exact algorithms are even remotely practical for the scale of problems we are interested in. Nevertheless, heuristics can be effective provided that P is not too large compared to G. When G is the Chimera graph in Figure 1, one strategy is to treat P as a subgraph of a fixed complete graph. The ideal Chimera graph with 8N 2 qubits was designed to have a minor-embedding of a complete graph on 4N + 1 variables [9,25], and [28] provides algorithms for embedding complete graphs when qubits are missing. The heuristics described below, while slower, can embed much larger problems and also have more flexibility than constraint-based techniques. These heuristics may also be used to improve chain lengths of an embedding, regardless of how that embedding was found.

Shortest-path-based chains
This algorithm uses efficient shortest-path computations to construct a chain for each variable, based on the locations of its neighbors' chains. Chains are temporarily allowed to overlap, and each qubit is assigned a weight based on how many variables are currently assigned to it. Suppose we want to find a chain S 0 for variable s 0 , and s 0 is adjacent to s 1 , . . . , s k which already have chains S 1 , . . . , S k respectively. By computing the weighted shortest path from each S i to every other qubit in G, we can select a "root" qubit z * for S 0 which minimizes the weight of the qubits needed to create a connection between s 0 and each of s 1 , . . . , s k . We then take the union of the shortest paths from z * to each S j as the chain for S 0 . The details of this algorithm can be found in Cai et al. [in preparation].

Simulated annealing
An alternative approach uses simulated annealing to attempt to improve partial embeddings. A partial embedding is an assignment of a chain S i to each variable s i ∈ V(P) such that all S i are disjoint. An edge s i s j of P is unfulfilled if there is no edge in G joining S i and S j . The partial embedding is assigned a score, consisting of the sum of the distances between S i and S j for unfulfilled edges s i s j plus a small positive constant times the sum of the squares of the cardinalities of all chains. We try to minimize the score by considering moves of the following types: 1. Given qubit z that is not in any chain, but is adjacent to a chain S i , adjoin z to S i . 2. Given qubit z in chain S i , remove z from S i (if S i \{z} is still a valid chain) and either leave it unassigned or adjoin it to a neighboring chain S j . 3. Given two qubits in different chains, z 1 ∈ S i and z 2 ∈ S j respectively, if z 1 is adjacent to S j and z 2 is adjacent to S i , and S i \{z 1 } and S j \{z 2 } are still valid chains, switch z 1 from S i to S j and z 2 from S j to S i .
Simulated annealing often produces better results than shortestpath-based chains but takes much longer.

SOLVING LARGER PROBLEMS
The tools in Sections 2.1 and 2.2 allow for mapping of arbitrary CSPs to Ising models with connectivity constraints. However, it may be difficult to fit a given problem in the current D-Wave hardware due to its limited number of qubits. In this section we outline approaches to solving large problems by repeatedly calling QA hardware to optimize smaller subproblems. One way of dealing with a large problem is to decompose it into regions. Smaller regional subproblems are then solved, and each region sends some form of feedback to its neighboring regions, which in turn modifies each regional subproblem. The process is repeated until all regions agree on shared variables or some stopping criteria is triggered.
Each region of a CSP is, essentially, a smaller CSP. When we partition a CSP into regions, each region should be as large as possible subject to being embeddable in the hardware graph. Partitions should be chosen so that regions have as few variables in common as possible to minimize the communication between regions.
To decompose the CSP in this way we may use graph and hypergraph partitioning techniques. One mechanism is the following: define a node for each constraint of the CSP and a weighted edge between two nodes counting the variables their constraints share. Then a min-cut balanced partition of the graph will produce similar regions with a minimal number of pairs of shared variables. The min-cut balanced partitioning problem is NP-hard in general, but the Kernighan-Lin algorithm [29] performs well in practice. It starts with a random partition and iteratively swaps sequences of vertices between regions based on improvements to the edge-weights. The software package Metis [30] has implemented a "multi-level" version of Kernighan-Lin that is scalable to tens of thousands of nodes. Since embedding in the hardware graph puts bounds on the number of variables, for regional decomposition to be effective the problem must exhibit some sparsity.
Decomposing the Ising model for the original CSP into R regions produces similar regional subproblems; a region R will have couplingsĥ (R) andĴ (R) derived from (and often equal to) those of (1), yielding a problem on a graph G (R) = (V (R) , E (R) ) (where n R = |V (R) |). We assume the size and sparsity structure of this problem, defined by G (R) , is compatible with the hardware. Next, we consider two approaches to coordinate the solutions of these regional subproblems produced by the decomposition. In our implementations, we chose the regions so that h (R) i splits h i evenly across regions R containing qubit i, but each edge ij ∈ E(G) belongs to exactly one region so that J (R) i,j = J i,j . During the run of the algorithms, for each region R, only the linear terms were updated.

Belief propagation
Belief propagation (BP) [31] is an algorithm in which messages are passed between regions and variables. Messages represent beliefs about the minimal energy conditional on each possible value of a variable. In min-sum BP, messages from variables (i, j, . . .) to regions (R, S, . . .) are updated using the formula where N(i) is the set of regions containing i. In the reverse direction, messages are updated as where similarly N(R) is the set of variables in R. Specifically, μ i→R (z i ) represents the aggregate of beliefs about z i from all regions excluding R, while μ R→i (z i ) is the minimum energy for R aggregating beliefs about all variables in R excluding i. These messages are iteratively passed between regions and variables until messages converge or another criterion is met when messages do not converge. BP is known to perform well for certain CSPs, and is a standard decoding algorithm for LDPC [32].
Two advantages of using hardware-sized regions, over performing BP in which each region is a single constraint, are (1) we can internalize within regions many of the short cycles of variable interactions, which are known to cause failure of BP, and (2) we need only pass messages for variables that appear in multiple regions, as variables appearing in a single region may be resolved after the BP algorithm terminates. Using min-cut heuristics to construct regions augments these benefits.

Dual decomposition
Dual decomposition (DD) uses Lagrangian relaxation methods, a standard approach for dealing with large scale optimization problems [11,33,34]. The Lagrangian relaxation of an Ising model (1) consists of a concave optimization problem of the form max λ λ λ ∈ L(λ λ λ) where . Valid multipliers λ λ λ must lie within = {λ λ λ | R R = 1 λ λ λ (R) = 0 0 0}. As before the Ising models defined by h h h (R) and J J J (R) have sparsity structure G (R) determined once after the partitioning, and the multipliers λ λ λ do not change the sparsity structure of J J J (R) . Notice that evaluating L(λ λ λ) (and thus computing a supergradient) decomposes into R independent regional subproblems of the form (8). The concave optimization problem that defines the relaxation can be solved using subgradient optimization methods (e.g., [33]).
The Lagrangian relaxation is a lower bound on the optimal value of the original Ising model. In particular, if after solving the Lagrangian function L(λ λ λ) all the solutions z z z (R) agree at the regional boundaries, we have indeed solved the original problem. If the solutions z z z (R) differ at regional boundaries, a simple heuristic to try to derive a solution to the CSP is to use regional majority vote or randomized rounding to determine the values of the spins at the boundaries followed by a straightforward greedy descent procedure.
The Lagrangian relaxation can also be used to compute lower bounds in a branch-and-bound approach.

Large-neighborhood local search
Greedy descent or local search is an iterative optimization method that moves from one spin configuration z z z to another by flipping the spin that reduces the objective value of the Ising model the most. A straightforward variation of local search [35] is largeneighborhood local search (LNLS) (see [36]), in which many spins may be flipped simultaneously. In each iteration of LNLS, we select a subset of the spinsz z z that are allowed to change (the rest are fixed), and we optimize the subproblem overz z z in hardware so as to minimize the overall objective value of the Ising model.
In contrast with the decomposition methods in Sections 2.3.1 and 2.3.2, LNLS does not rely on a single partitioning of the original problem: a new subproblem may be selected at each iteration. However, we must ensure that subproblems are embeddable in hardware. For instance, if we have an embedding of a complete graph K m , any subset of m variables can be optimized while keeping all the other spins fixed. An important consideration is the selection of the variables that are fixed at each iteration. One heuristic is to pick a variable at random and grow a breadth-first search tree to a fixed number of variables.

RESULTS
In this section, we report results of our experiments using D-Wave hardware for solving LDPC problems. In our tests, an instance of LDPC consists of a parity check matrix G G G ∈ {0, 1} r×n , and a bit string (message) y y y ∈ {0, 1} n . A codeword x x x is any bit string whose parity check vector, G G Gx x x, is 0 0 0 modulo 2. The goal is to decode message y y y by finding the codeword x x x whose Hamming distance to y y y is as small as possible. The parity check matrix G G G is sparse. We chose G G G so that n was between 100 and 1000, r ≈ 0.70n, the number of non-zeros on each row was between 3 and 5, and the number of non-zeros in each column was between 2 and 4. The message y y y was chosen by first picking a random codeword x x x, and then flipping pn random bits, where p is the error rate. We tested instances for which p ∈ {8%, 10%, 12%, 14%}.
The standard decoding algorithm for LDPC uses belief propagation as described in Section 2.3.1, with each region defined to be a single parity check. We generated our instances at random, and, to get an indication of the potential usefulness of the hardware, considered only instances that had a unique optimal solution and for which standard belief propagation did not converge within 1000 iterations. Optimal solutions used to baseline hardware performance were obtained by dynamic programming applied to regional subproblems [16]. While dynamic programming is faster than hardware solution at these size subproblems, dynamic programming will not scale to much larger subproblems due to prohibitive memory requirements.
We solved the instances using D-Wave hardware (such as the one in Figure 1, for which h i ∈ [−2, 2] and J i,j ∈ [−1, 1]) and the decomposition strategies of Section 2.3. Each region consisted of up to 20 parity checks from G G G and was mapped to the hardware in the following way. First we added ancillary variables so that each parity check involved exactly 3 variables 2 . Each of these checks was then placed in a cell using the techniques of Section 2.2.1. Inside a cell we used the two embeddings described in the example of Section 2.1.3, one with gap 2 (K 3,3 ) and one with gap 1 (K 4 ). We observed that the problems submitted to the hardware using K 3,3 -based embeddings had a significantly improved success probability as compared to the problems that used K 4 -based embeddings (see Figure 3). This in turn allowed for higher precision for the subproblems FIGURE 3 | Hardware performance aggregated over all regional subproblems.
is the relative error of a solution returned by the hardware compared to the true optimal value of the Ising model. Left: comparing K 3,3 -based embedding with K 4 -based embedding. Right: comparing Belief Propagation and Dual Decomposition.
submitted to the hardware as required by the decomposition algorithms. Thus, in the following experiments we used K 3,3 -based embeddings.
We implemented both region-based Belief Propagation (BP) and Dual Decomposition (DD). We optimized each regional subproblem using up to 10 D-Wave hardware calls, in order to adjust the strength of the ferromagnetic couplings used to identify chains. Each hardware call took 10,000 samples with an annealing time of 20 μs (see [37] for more details on the hardware's operating specifications). In the case of DD, we used a standard subgradient algorithm once, and used a simple randomized rounding procedure and local search to try to produce codewords from the subproblems generated throughout the subgradient optimization.
We generated 18 random instances as described above. BP managed to solve all the instances, while DD solved 14 of 18, encountering difficulties for error rates p = 12% for instances with more than 600 bits. We attribute the higher reliability of BP to the fact that the problems sent to the hardware had lower precision than DD. On the other hand, typically DD required fewer hardware calls to converge. Among all the hardware samples we collected, 95% of them were within 4% of optimum (see Figure 3).

CONCLUSION
We have outlined a general approach for coping with intrinsic issues related to the practical use of quantum annealing. To address these issues we proposed methods for finding Ising problem representations that have a large classical gap between ground states and first excited states, practical methods for embedding Ising models that are not compatible with the hardware graph, and decomposition methods to solve problems that are larger than the hardware. As an application of our techniques, we described how we implemented LDPC decoding problems in D-Wave hardware. Our approach has enabled us to solve LDPC decoding problems of up to 1000 variables. The current hardware implementation of QA tested here is roughly as fast as an efficient implementation of simulated annealing, but these results offer the promise of hybrid quantum/classical algorithms that surpass purely classical solution as QA hardware matures.
As future work, we would like to improve upon the scalability of the current method for constructing penalty functions with large gaps. This would allow larger component subproblems and reduce the need for minor embedding between subproblems. Further, the methods we have described here for finding penalty functions assume an assignment of decision variables to qubits. Different assignment choices lead to different results and different hardware performance. We do not currently have an effective method for this assignment.