Graph Neural Networks for Maximum Constraint Satisfaction

Many combinatorial optimization problems can be phrased in the language of constraint satisfaction problems. We introduce a graph neural network architecture for solving such optimization problems. The architecture is generic; it works for all binary constraint satisfaction problems. Training is unsupervised, and it is sufficient to train on relatively small instances; the resulting networks perform well on much larger instances (at least 10-times larger). We experimentally evaluate our approach for a variety of problems, including Maximum Cut and Maximum Independent Set. Despite being generic, we show that our approach matches or surpasses most greedy and semi-definite programming based algorithms and sometimes even outperforms state-of-the-art heuristics for the specific problems.


INTRODUCTION
Constraint satisfaction is a general framework for casting combinatorial search and optimization problems; many well-known NP-complete problems, for example, k-colorability, Boolean satisfiability and maximum cut can be modeled as constraint satisfaction problems (CSPs).Our focus is on the optimization version of constraint satisfaction, usually referred to as maximum constraint satisfaction (MAX-CSP), where the objective is to satisfy as many constraints of a given instance as possible.There is a long tradition of designing exact and heuristic algorithms for all kinds of CSPs.Our work should be seen in the context of a recently renewed interest in heuristics for NPhard combinatorial problems based on neural networks, mostly GNNs (for example, Khalil et al., variables, we can apply the model to instances of arbitrary size 1 .Our loss function rewards solutions with many satisfied constraints.Thus, our networks learn to satisfy the maximum number of constraints which naturally puts the focus on the optimization version MAX-CSP of the constraint satisfaction problem. This focus on the optimization problem allows us to train unsupervised, which is a major point of distinction between our work and recent neural approaches to Boolean satisfiability (Selsam et al., 2019) and the coloring problem (Lemos et al., 2019).Both approaches require supervised training and output a prediction for satisfiability or coloring number.Furthermore, our approach not only returns a prediction whether the input instance is satisfiable, but it returns an (approximately optimal) variable assignment.The variable assignment is directly produced by a neural network, which distinguishes our end-to-end approach from methods that combine neural networks with conventional heuristics, such as Khalil et al., (2017) and Li et al., (2018).
We experimentally evaluate our approach on the following NP-hard problems: the maximum 2-satisfiability problem (MAX-2-SAT), which asks for an assignment maximizing the number of satisfied clauses for a given Boolean formula in 2-conjunctive normal form; the maximum cut problem (MAX-CUT), which asks for a partition of a graph in two parts such that the number of edges between the parts is maximal (see Figure 1); the 3colorability problem , which asks for a 3-coloring of the vertices of a given graph such that the two endvertices of each edge have distinct colors.We also consider the maximum independent set problem (MAX-IS), which asks for an independent set of maximum cardinality in a given graph.Strictly speaking, MAX-IS is not a maximum constraint satisfaction problem, because its objective is not to maximize the number of satisfied constraints, but to satisfy all constraints while maximizing the number of variables with a certain value.We include this problem to demonstrate that our approach can easily be adapted to such related problems.
Our experiments show that our approach works well for all four problems and matches competitive baselines.Since our approach is generic for all MAX-CSPs, those baselines include other general approaches such as greedy algorithms and semidefinite programming (SDP).The latter is particularly relevant, because it is known (under certain complexity theoretic assumptions) that SDP achieves optimal approximation ratios for all MAX-CSPs (Raghavendra, 2008).For MAX-2-SAT, our approach even manages to surpass a state-of-the-art heuristic.In general, our method is not competitive with the highly specialized state-of-the-art heuristics.However, we demonstrate that our approach clearly improves on the stateof-the-art for neural methods on small and medium-sized binary CSP instances, while still being completely generic.We remark that our approach does not give any guarantees, as opposed to some traditional solvers which guarantee that no better solution exists.
Almost all models are trained on quite small training sets consisting of small random instances.We evaluate those models on unstructured random instances as well as more structured benchmark instances.Instance sizes vary from small instances with 100 variables and 200 constraints to medium sized instances with more than 1,000 variables and over 10,000 constraints.We observe that RUN-CSP is able to generalize well from small instances to instances both smaller and much larger.The largest (benchmark) instance we evaluate on has approximately 120,000 constraints, but that instance required the use of large training graphs.Computations with RUN-CSP are very fast in comparison to many heuristics and profit from modern hardware like GPUs.For medium-sized instances with 10,000 constraints inference takes less than 5 s.

Related Work
Traditional methods for solving CSPs include combinatorial constraint propagation algorithms, logic programming techniques and domain specific approaches, for an overview see Apt (2003), Dechter (2003).Our experimental baselines include a wide range of classical algorithms, mostly designed for specific problems.For MAX-2-SAT, we compare the performance to that of WalkSAT (Selman et al., 1993;Kautz, 2019), which is a popular stochastic local search heuristic for MAX-SAT.Furthermore, we use the state-of-the-art MAX-SAT solver Loandra (Berg et al., 2019), which combines linear search and core-guided algorithms.On the MAX-CUT problem, we compare our method to multiple implementations of a heuristic approach by Goemans and Williamson (1995).This method is based on semi-definite programming (SDP) and is particularly popular since it has a proven approximation ratio of αx0.878.Other MAX-CUT baselines utilize extremal optimization (Boettcher and Percus, 2001) and local search (Benlic and Hao, 2013).For MAX-3-COL, we measure the results against HybridEA (Galinier and Hao, 1999;Lewis et al., 2012;Lewis, 2015), which is an evolutionary algorithm with state-of-the-art performance.Furthermore, a simple greedy coloring heuristic (Brélaz, 1979) is also used as a comparison.ReduMIS is a state-of-the-art MAX-IS solver that combines kernelization techniques and evolutionary algorithms.We use it as a MAX-IS baseline, together with a simple greedy algorithm.
Beyond these traditional approaches there have been several attempts to apply neural networks to NP-hard problems and more specifically CSPs.An early group of papers dates back to the 1980s and uses Hopfield Networks (Hopfield and Tank, 1985) to approximate TSP and other discrete problems using neural networks.Hopfield and Tank use a single-layer neural network with sigmoid activation and apply gradient descent to come up with an approximative solution.The loss function adopts soft assignments and uses the length of the TSP tour and a term penalizing incorrect tours as loss, hence being unsupervised.This approach has been extended to k-colorability (Dahl, 1987;Takefuji and Lee, 1991;Gassen and Carothers, 1993;Harmanani et al., 2010) and other CSPs (Adorf and Johnston, 1990).The loss functions used in some of these approaches are similar to ours.
Newer approaches involve modern machine learning techniques and are usually based on GNNs.NeuroSAT (Selsam et al., 2019), a learned message passing network for predicting satisfiability, reignited the interest in solving NPcomplete problems with neural networks.Prates et al., (2019) use GNNs to learn TSP and trained on instances of the form (G, ℓ ± ε) where ℓ is the length of an optimal tour on G.They achieved good results on graphs with up to 40 nodes.Using the same idea, Lemos et al., (2019) learned to predict k-colorability of graphs scaling to larger graphs and chromatic numbers than seen during training.Yao et al., (2019) evaluated the performance of unsupervised GNNs for the MAX-CUT problem.They adapted a GNN architecture by Chen et al., (2019) to MAX-CUT and trained two versions of their network, one through policy gradient descent and the other via a differentiable relaxation of the loss function which both achieved similar results.Amizadeh et al., (2019) proposed an unsupervised architecture for CIRCUIT-SAT, which predicts satisfying variable assignments for a given formula.Khalil et al., (2017) proposed an approach for combinatorial graph problems that combines reinforcement learning and greedy search.They iteratively construct solutions by greedily adding nodes according to estimated scores.The scores are computed by a neural network, which is trained through Q-Learning.They test their method on the MVC, MAX-CUT, and TSP problems, where they outperform traditional heuristics across several benchmark instances.For the #P-hard weighted model counting problem for DNF formulas, Abboud et al., (2019) applied a GNN-based message passing approach.Finally, Li et al., (2018) use a GNN to guide a tree search for MAX-IS.

CONSTRAINT SATISFACTION PROBLEMS
Formally, a CSP-instance is a triple I (X, D, C), where X is a set of variables, D is a domain, and C is a set of constraints of the form (x 1 , . . ., x ℓ , R) for some R4D ℓ .A constraint language is a finite set Γ of relations over some fixed domain D, and I is a Γ-instance if R ∈ Γ for all constraints (x 1 , . . ., x ℓ , R) ∈ C.An assignment α : X → D satisfies a constraint (x 1 , . . ., x ℓ , R) if (α(x 1 ), . . ., α(x ℓ )) ∈ R, and it satisfies the instance I if it satisfies all constraints in C. CSP(Γ) is the problem of deciding whether a given Γ-instance has a satisfying assignment and finding such an assignment if there is one.MAX−CSP(Γ) is the problem of finding an assignment that satisfies the maximum number of constraints.
For example, an instance of 3-COL has a variable x v for each vertex v of the input graph, domain D {1, 2, 3}, and a constraint (v, w, R 3 ≠ ) for each edge vw of the graph.Here, R 3 ≠ {(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2)} is the inequality relation on {1, 2, 3}.Thus 3-COL is CSP({R 3 ≠ }).In this paper, we only consider binary CSPs, that is, CSPs whose constraint language only contains unary and binary relations.From a theoretical perspective, this is no real restriction, because it is well known that every CSP can be transformed into an "equivalent" binary CSP (see Dechter, 2003).Let us review the construction.Suppose we have a constraint language Γ of maximum arity k ≥ 3 over some domain D. We construct a binary constraint language Γ as follows.The domain D of Γ consists of all elements of D as well as all pairs (a, R) where R ∈ Γ and a is a tuple occurring in R. For every R ∈ Γ, we add a unary relation Q R consisting of all pairs (a, R) ∈ D where a ∈ R.Moreover, for 1 ≤ i ≤ k we add a binary "projection" relation P i consisting of all pairs ((a, R), a i ) for R ∈ Γ, say of arity ℓ ≤ k, and a (a 1 , . . ., a ℓ ) ∈ R. Finally, for every instance I (X, D, C) of CSP(Γ) we construct an instance I ( X, D, C) of CSP( Γ), where X consists of all variables in X and a new variable y c for every constraint c (x 1 , . . ., x ℓ , R) ∈ C and C consists of a tuple constraint (y c , Q R ) and projection constraints (y c , x i , P i ) for all 1 ≤ i ≤ ℓ ≤ k.Here, the tuple constraints select for every constraint c (x, R) ∈ C a tuple a ∈ R and the projection constraints ensure a consistent assignment to the original variables X4 X.Then the instances I and I are equivalent in the sense that I is satisfiable if and only if I is and there is a one-toone correspondence between the satisfying assignments.
However, the construction is not approximation preserving.For example, it is not the case that an assignment satisfying 90% of the constraints of I yields an assignment satisfying 90% of the constraints of I.It is possible to fix this by adding weights to the constraints, making it more expensive to violate projection constraints.Moreover, and arguably more importantly in this context, it is not clear how well our method works on CSPs of higher arity when translated to binary CSPs using this construction.We leave a thorough experimental evaluation of CSPs with higher arities for future work.

Architecture
We use a randomized recurrent GNN architecture to evaluate a given problem instance using message passing.For any binary constraint language Γ a RUN-CSP network can be trained to approximate MAX-CSP(Γ).Intuitively, our network can be viewed as a trainable communication protocol through which the variables of a given instance can negotiate a value assignment.With every variable x ∈ X we associate a short-term state s (t)  x ∈ R k and a hidden (long-term) state h (t)  x ∈ R k which change throughout the message passing iterations t ∈ {0, . . ., t max }.The short-term state vector s (0)   x for every variable x is initialized by sampling each value independently from a normal distribution with zero mean and unit variance.All hidden states h (0)   x are initialized as zero vectors.
Every message passing step uses the same weights and thus we are free to choose the number t max ∈ N of iterations for which RUN-CSP runs on a given problem instance.This number may or may not be identical to the number of iterations used for training.The state size k and the number of iterations used for training t tr max and evaluation t ev max are the main hyperparameters of our network.
Variables x and y that co-occur in a constraint c (x, y, R) can exchange messages.Each message depends on the states s (t)  x , s (t) y , the relation R, and the order of x and y in the constraint but not on the internal long-term states h (t)  x , h (t) y .The dependence on R implies that we have independent message generation functions for every relation R in the constraint language Γ.The process of message passing and updating the internal states is repeated t max times.We use linear functions to compute the messages as preliminary experiments showed that more complicated functions did not improve performance while being less stable and less efficient during training.Thus, the messaging function for every relation R is defined by a trainable weight matrix M R ∈ R 2k×2k as S R s (t)  x , s (t) The output of S R consists of two stacked k-dimensional vectors, which represent the messages to x and y, respectively.Note that the generated messages depend on the order of the variables in the constraint.This behavior is desirable for asymmetric relations.For symmetric relations we modify S R to produce messages independently from the order of variables in c.In this case we use a smaller weight matrix M R ∈ R k×2k to generate both messages.Note that the two messages can still be different, but the content of each message depends only on the states of the endpoints.The internal states h x and s x are updated by an LSTM cell based on the mean of the received messages.For a variable x which received the messages m 1 , . . ., m ℓ the new states are thus computed by For every variable x and iteration t ∈ {1, . . ., t max }, the network produces a soft assignment φ (t) (x) from the state s (t) x .In our architecture we use φ (t) (x) softmax Ws (t)   x with W ∈ R d×k trainable and d |D| (domain size of the CSP).In φ, the linear function reduces the dimensionality while the softmax function enforces stochasticity.The soft assignments φ (t) (x) can be interpreted as probabilities of a variable x receiving a certain value v ∈ D. If the domain D contains only two values, we compute a "probability" p (t) (x) σ Ws (t)   x for each node with W ∈ R 1×k .The soft assignment is then given by φ . To obtain a hard variable assignment α (t) : X → D, we assign the value with the highest estimated probability in φ (t) (x) for each variable x ∈ X.From the hard assignments α (1) , . . ., α (t ev max ) , we select the one with the most satisfied constraints as the final prediction of the network.This is not necessarily the last assignment α (t ev max ) .
x , s (t−1) y for x ∈ X do.//combine messages and update r (t)  x : x Algorithm 1: Network Architecture.
Algorithm 1 specifies the architecture in pseudocode.Figure 2 illustrates the message passing graph for a MAX-2-SAT instance and the internal update procedure of RUN-CSP.Note that the network's output depends on the random initialization of the short-term states s (0)  x .Those states are the basis for all messages sent during inference and thus for the solution found by RUN-CSP.By applying the network multiple times to the same input and choosing the best solution, we can therefore boost the performance.
We did evaluate more complex variants of this architecture with multi-layered messaging functions and multiple stacked recurrent cells.No increase in performance was observed with these modifications, while the running time increased.Replacing the LSTM cells with GRU cells slightly decreased the performance.Therefore, we use the simple LSTM-based architecture presented here.

Loss Function
In the following we derive our loss function used for unsupervised training.Let I (X, D, C) be a CSP-instance.Assume without loss of generality that D {1, . . ., d} for a positive integer d.Given I, in every iteration our network will produce a soft variable assignment φ : X → [0, 1] d , where φ(x) is stochastic for every x ∈ X.Instead of choosing the value with the maximum probability in φ(x), we could obtain a hard assignment α : X → D by independently sampling a value for each x ∈ X from the distribution specified by φ(x).In this case, the probability that any given constraint (x, y, R) ∈ C is satisfied by α can be expressed by where A R ∈ {0, 1} d×d is the characteristic matrix of the relation R with (A R ) i,j 15(i, j) ∈ R. We then aim to minimize the combined negative log-likelihood over all constraints: The graph corresponding to the MAX-2-SAT-instance f (¬X 1 ∨x 2 )∧(x 1 ∨x 3 )∧(x 2 ∨x 3 ).The nodes for the variables are shown in green.The functions through which the variables iteratively exchange messages are shown in blue (B) An illustration of the update mechanism of RUN-CSP.The trainable weights of this function are shared across all nodes, which allows RUN-CSP to process instances with arbitrary structure.
We combine the loss function L CSP throughout all iterations with a discount factor λ ∈ [0, 1] to get our training objective: L φ t t ≤ t tr max , I : This loss function allows us to train unsupervised since it does not depend on any ground truth assignments.Furthermore, it avoids reinforcement learning, which is computationally expensive.In general, computing optimal solutions for supervised training can easily turn out to be prohibitive; our approach completely avoids such computations.
We remark that it is also possible to extend the framework to weighted MAX-CSPs where a real weight is associated with each constraint.To achieve this, we can replace the averages in the loss function and message collection steps by weighted averages.Negative constraint weights can be incorporated by swapping the relation with its complement.We demonstrate this in Section 4.2 where we evaluate RUN-CSP on the weighted MAX-CUT problem.

EXPERIMENTS
To validate our method empirically, we performed experiments for MAX-2-SAT, MAX-CUT, 3-COL and MAX-IS.For all experiments, we used internal states of size k 128; state sizes up to k 1024 did not increase performance for the tested instances.We empirically chose to use t tr max 30 iterations during training and, unless stated otherwise, t ev max 100 for evaluation.Especially for larger instances it proved beneficial to use a relatively high t ev max .In contrast, choosing t tr max too large during training (t tr max > 50) resulted in unstable training.During evaluation, we use 64 parallel runs for each instance and use the best result.Further increasing this number mainly increases the runtime but has no real effect on the quality of solutions.We trained most models with 4,000 instances split into in 400 batches.Training is performed for 25 epochs using the Adam optimizer with default parameters and gradient clipping at a norm of 1.0.The decay over time in our loss function was set to λ 0.95.We provide a more detailed overview of our implementation and training configuration in the Supplementary Material.
We ran our experiments on machines with two Intel Xeon 8160 CPUs and one NVIDIA Tesla V100 GPU but got very similar runtime on consumer hardware.Evaluating 64 runs on an instance with 1,000 variables and 1,000 constraints takes about 1.5 s, 10,000 constraints about 5 s, and 20,000 constraints about 8 s.Training a model takes less than 30 min.Thus, the computational cost of RUN-CSP is relatively low.

Maximum 2-Satisfiability
We view MAX-2-SAT as a binary CSP with domain D {0, 1} and a constraint language consisting of three relations R 00 (for clauses with two negated literals), R 01 (one negated literal) and R 11 (no negated literals).For example, R 01 {(0, 0), (0, 1), (1, 1)} is the set of satisfying assignments for a clause (¬x∨y).For training a RUN-CSP model we used 4,000 random 2-CNF formulas with 100 variables each.The number of clauses was sampled uniformly between 100 and 600 for every formula and each clause was generated by sampling two distinct variables and then independently negating the literals with probability 0.5.

Random Instances
For the evaluation of RUN-CSP in MAX-2-SAT we start with random instances and compare it to a number of problemspecific heuristics.All baselines can solve MAX-SAT for arbitrary arities, not only MAX-2-SAT, while RUN-CSP can solve a variety of binary MAX-CSPs.The state-of-the-art MAX-SAT Solver Loandra (Berg et al., 2019) won the unweighted track for incomplete solvers in the Max-SAT Evaluation 2019 (Bacchus et al., 2019).We ran Loandra in its default configuration with a timeout of 20 min on each formula.To put this into context, on the largest evaluation instance used here (9,600 constraints) RUN-CSP takes less than 7 min on a single CPU core and about 5 s using the GPU.WalkSAT (Selman et al., 1993;Kautz, 2019) is a stochastic local search algorithm for approximating MAX-SAT.We allowed WalkSAT to perform 10 million flips on each formula using its "noise" strategy with parameters n 2 and m 2000.Its performance was boosted similarly to RUN-CSP by performing 64 runs and selecting the best result.
For evaluation we generated random formulas with 100, 400, 800, and 1,600 variables.The ratio between clauses and variables was varied in steps of 0.1 from 1 to 6. Figure 3 shows the average percentage of satisfied clauses in the solutions found by each method over 100 formulas for each size and density.The methods yield virtually identical results for formulas with less than 2 clauses per variable.For denser instances, RUN-CSP yields slightly worse results than both baselines when only 100 variables are present.However, RUN-CSP matches the results of Loandra for formulas with 400 variables and outperforms it for instances with 800 and 1,600 variables.The performance of WalkSAT degrades on these formulas and is significantly worse than RUN-CSP.

Benchmark Instances
For more structured formulas, we use MAX-2-SAT benchmark instances from the unweighted track of the MAX-SAT Evaluation 2016 (Argelich, 2016) based on the Ising spin glass problem (De Simone et al., 1995;Heras et al., 2008).We used the same general setup as in the previous experiment but increased the timeout for Loandra to 60 min.In particular we use the same RUN-CSP model trained entirely on random formulas.Table 1 contains the achieved numbers of unsatisfied constraints across the benchmark instances.All methods produced optimal results on the first and the third instance.RUN-CSP slightly deviates from the optimum on the second instance.For the fourth instance RUN-CSP found an optimal solution while both WalkSAT and Loandra did not.On the largest benchmark formula, RUN-CSP again produced the best result.
Thus, RUN-CSP is competitive for random as well as spinglass-based structured MAX-2-SAT instances.Especially on larger instances it also outperforms conventional methods.Furthermore, training on random instances generalized well to the structured spin-glass instances.

Regular Graphs
In this section we evaluate RUN-CSP's performance on this problem.Yao et al., (2019) proposed two unsupervised GNN architectures for MAX-CUT.One was trained through policy gradient descent on a non-differentiable loss function while the other used a differentiable relaxation of this loss.They evaluated their architectures on random regular graphs, where the asymptotic MAX-CUT optimum is known.We use their results as well as their baseline results for Extremal Optimization (EO) (Boettcher and Percus, 2001) and a classical approach based on semi-definite programming (SDP) (Goemans and Williamson, 1995) as baselines for RUN-CSP.To evaluate the sizes of graph cuts, Yao et al., (2019) introduced a relative performance measure called P-value given by P(z) z/n−d/4 d/4 √ where z is the predicted cut size for a d-regular graph with n nodes.Based on results of Dembo et al., (2017), they showed that the expected P-value of d-regular graphs approaches P * ≈ 0.7632 as n → ∞.P-values close to P * indicate a cut where the size is close to the expected optimum and larger values are better.While Yao et al. trained one instance of their GNN for each tested degree, we trained one network model on 4,000 Erdős-Rényi graphs and applied it to all graphs.For training, each graph had a node count of n 100 and a uniformly sampled number of edges m ∼ U (100,2000).Thus, the model was not trained specifically for regular graphs.Table 2 reports the mean P-values across 1,000 random regular graphs with 500 nodes for different degrees.For every method other than RUN-CSP, we provide the values as reported by Yao et al.While RUN-CSP does not match the cut sizes produced by extremal optimization, it clearly outperforms both versions of the GNN as well as the classical SDP-based approach.

Benchmark Instances
We performed additional experiments on standard MAX-CUT benchmark instances.The Gset dataset (Ye, 2003) is a set of 71 weighted and unweighted graphs that are commonly used for testing MAX-CUT algorithms.The dataset contains three different types of random graphs.Those graphs are Erdős-Rényi graphs with uniform edge probability, graphs where the connectivity gradually decays from node 1 to n, and 4-regular toroidal graphs.Here, we use two unweighted graphs for each type from this dataset.We reused the RUN-CSP model from the previous experiment but increased the number of iterations for evaluation to t ev max 500.Our first baseline by Choi and Ye (2000) uses an SDP solver based on dual scaling (DSDP) and a reduction based on the approach of Goemans and Williamson (1995).Our second baseline Breakout Local Search (BLS) is based on the combination of local search and adaptive perturbation (Benlic and Hao, 2013).Its results are among the best known solutions for the Gset dataset.For DSDP and BLS we report the values as provided in the literature.Table 3 reports the achieved cut sizes for RUN-CSP, DSDP, and BLS.On G14 and G15, which are random graphs with decaying node degree, the graph cuts produced by RUN-CSP are similar in size to those reported for  DSDP.For the Erdős-Rényi graphs G22 and G55 RUN-CSP performs better than DSDP but worse than BLS.Lastly, on the toroidal graphs G49 and G50 all three methods achieved the best known cut size.This reaffirms the observation that our architecture works particularly well for regular graphs.
Although RUN-CSP did not outperform the state-of-the-art heuristic in this experiment it performed at least as well as the SDP based approach DSDP.

Weighted Maximum Cut Problem
Additionally, we evaluate RUN-CSP on the weighted MAX-CUT problem, where every edge e ∈ E has an associated weight w e ∈ {1, −1}.The aim is to maximize the objective: where the partition S, T of V defines a cut.We can apply RUN-CSP to this problem by training a model for the constraint language Γ {R , R ≠ } over the domain D {0, 1}.Here, R and R ≠ are the equality and inequality relations, respectively.We model every positive edge as a constraint with R ≠ and every negative edge with R .We trained a RUN-CSP network on 4,000 random Erdős-Rényi graphs with n 100 nodes and m ∼ U(100, 300) edges.The weights w e ∼ {1, −1} were drawn uniformly for each edge.We evaluate this model on 10 benchmark instances obtained from the Optsicom Project2 , namely the 10 smallest graphs of set 2. These instances are based on the lsing spin glass problem and are commonly used to evaluate heuristics empirically.All 10 graphs have n 125 nodes and m 375 edges.Khalil et al., (2017) utilize reinforcement learning to guide greedy search heuristics for combinatorial problems including weighted MAX-CUT.They evaluated their method on the same benchmark instances for weighted MAX-CUT and compared the performance to a classical greedy heuristic (Kleinberg and Tardos, 2006) and an SDP-based method (Goemans and Williamson, 1995).Furthermore, they approximated the optimal values by running CPLEX for 1 h on every instance.We use their reported results and baselines for a comparison with RUN-CSP. Crucially, Khalil et al., (2017) trained their network on random variations of the benchmark instances, while RUN-CSP was trained on purely random data.Table 4 provides the achieved cut sizes.On all but one benchmark instance RUN-CSP yields the largest cuts and on five out of 10 instances it even found the optimal cut value.The classical approaches based on Greedy Search and SDP performed substantially worse than both neural methods.

Coloring
Within coloring we focus on the case of three colors, i.e., we consider CSPs over the domain {1, 2, 3} with the inequality relation R ≠ .In general, RUN-CSP aims to satisfy as many constraints as possible and therefore approximates MAX-3-COL.Instead of evaluating on MAX-3-COL, we evaluate on its practically more relevant decision variant 3-COL which asks whether a given graph is 3-colorable without conflicts.We turn RUN-CSP into a classifier by predicting that a given input graph is 3-colorable if and only if it is able to find a conflict-free vertex coloring.

Hard Instances
We evaluate RUN-CSP on so-called "hard" random instances, similar to those defined by Lemos et al., (2019).These instances are a special subclass of Erdős-Rényi graphs where an additional edge can make the graph no longer 3-colorable.We describe our exact generation procedure in the Supplementary Material.We trained five RUN-CSP models on 4,000 hard 3-colorable instances with 100 nodes each.In Table 5 we present results for RUN-CSP, a greedy heuristic with DSatur strategy (Brélaz, 1979), and the state-of-the-art heuristic HybridEA (Galinier and Hao, 1999;Lewis et al., 2012;Lewis, 2015).HybridEA was allowed to make 500 million constraint checks on each graph.We observe that larger instances are harder for all tested methods and between the three algorithms there is a clear hierarchy.The state-of-the-art heuristic HybridEA clearly performs best and finds solutions even for some of the largest graphs.RUN-CSP finds optimal colorings for a large fraction of graphs with up to 100 nodes and even a few correct colorings for graphs of size 200.The weakest algorithm is DSatur which even fails on most of the small 50 node graphs and gets rapidly worse for larger instances.
Choosing larger or more training graphs for RUN-CSP did not significantly improve its performance on larger hard graphs.We assume that a combination of increasing the state size, complexity of the message generation functions, and number and size of training instances is able to achieve better results, but on the cost of efficiency.
In Table 5 we do not report results for GNN-GCP by Lemos et al., (2019) as the structure of the output is fundamentally different.While the three algorithms in Table 5 output a coloring, GNN-GCP outputs a guess on the chromatic number without providing a proof that this is achievable.We trained instances of GNN-GCP on 32,000 pairs of hard graphs of size 40 to 60 (small) and 50 to 100 (medium).For testing, we restricted the model to only choose between the chromatic numbers 3 and 4, when allowing a wider range of possible values, the accuracy of GNN-GCP drops considerably.The network was able to achieve test accuracies of 75% (respectively 65% when trained and evaluated on medium instances).The model generalizes fairly well, with the small model achieving 64% on the medium test set and the large model achieving 74% on the small test set, almost matching the performance of the network trained on graphs of the respective size.On a set of test instances of hard graphs with 150 nodes, GNN-GCP achieved an accuracy of 52% (54% for the model trained on medium instances).Thus, the model performs significantly worse than RUN-CSP which achieves 81% (GNN-GCP 59%) accuracy on a test set of graphs of size 100, and 68% on graphs of size 150 where GNN-GCP achieves up to 54%.The numbers for RUN-CSP are larger than those reported in Table 5 since in the table only 3-colorable instances were considered.Here, the accuracy is computed over 3-colorable instances as well as their non-3-colorable counter parts.By design, RUN-CSP achieves perfect classification on negative instances.
Overall, we see that despite being designed for maximization tasks, RUN-CSP outperforms greedy heuristics and neural baselines on the decision variant of 3-COL for hard random instances.

Structure Specific Performance
On the example of the coloring problem, we evaluate generalization to other graph classes.We expect a network trained on instances of a particular structure to adapt toward this class and outperform models trained on different graph classes.We briefly evaluate this hypothesis for four different classes of graphs.
Erdős-Rényi Graphs: Graphs are generated by uniformly sampling m distinct edges between n nodes.Geometric Graphs: A graph is generated by first assigning random positions within a 1 × 1 square to n distinct nodes.Then an edge is added for every pair of points with a distance less than r.
Powerlaw-Cluster Graphs: This graph model was introduced by Holme and Kim (2002).Each graph is generated by iteratively adding n nodes and connected to m existing nodes.After each edge is added, a triangle is closed with probability p, i.e., an additional edge is added between the new node and a random neighbor of the other endpoint of the edge.
Regular Graphs: We consider random 5-regular graphs as an example for graphs with a very specific structure.
We trained five RUN-CSP models on 4,000 random instances of each type where each graph had between 50 and 100 nodes.We refer to these groups of models as M ER , M Geo , M Pow and M Reg .Five additional models M Mix were trained on a mixed dataset with 1,000 random instances of each graph class.The exact parameters for generating the graphs can be found in the Supplementary Material.Note that the parameters for each class were purposefully chosen such that most graphs are not 3colorable.This allows us to evaluate the relative performance on the maximization task.Table 6 contains the percentage of unsatisfied constraints over the models on 1,000 fresh graphs of each class.We observe that all models perform well on the class of structures they were trained on and M Reg yields the worst performance on all other classes.Both M Geo and M Pow outperform M ER on Erdős-Rényi graphs while M ER outperforms M Geo on Powerlaw-Cluster and M Pow on geometric graphs.When averaging over all four classes, M Mix produces the best results, despite not achieving the best results for any particular class.Additionally, we observe a very low variance in performance between the different models trained on the same dataset.Only the models trained on relatively narrow graph classes, namely regular graphs and to some extent also Powerlaw-Cluster graphs, exhibit a higher variance.
Overall, this demonstrates that training on locally diverse graphs (e.g., geometric graphs or a mixture of graph classes) leads to good generalization toward other graph classes.While all tested networks achieved competitive results on the structure that they were trained on, they were not always the best for that particular structure.Therefore, our original hypothesis appears to be overly simplistic and restricting the training data to the structure of the evaluation instances is not necessarily optimal.

Independent Set
Finally, we experimented with the maximum independent set problem MAX-IS.The independence condition can be modeled through a constraint language Γ IS with one binary relation R IS {(0, 0), (0, 1), (1, 0)}.Here, assigning the value 1 to a variable is interpreted as including the corresponding node in the independent set.MAX-IS is not simply MAX-CSP(Γ IS ), since the empty set will trivially satisfy all constraints.Instead, MAX-IS is the problem of finding an assignment which satisfies R IS at all edges while maximizing an additional objective function that measures the size of the independent set.To model this in our framework, we extend the loss function to reward assignments with many variables set to 1.For a graph G (V, E) and a soft assignment φ : V → [0, 1], we define Here, L CSP is the standard RUN-CSP loss for Γ IS and κ adjusts the relative importance of L CSP and L size .Intuitively, smaller values for κ decrease the importance of L size which favors larger independent sets.A naive weighted sum of both terms turned out to be unstable during training and yielded poor results, whereas the product in Eq. 6 worked well.For training, L MIS is combined across iterations with a discount factor λ as in the standard RUN-CSP architecture.

Random Instances
We start by evaluating the performance on random graphs.We trained a network on 4,000 random Erdős-Rényi graphs with 100 nodes and m ∼ U(100, 600) edges each and with κ 1.For evaluation we use random graphs with 100, 400, 800 and 1,600 nodes and a varying number of edges.For roughly 6% of all predictions, the predicted set contained induced edges (just a single edge in most cases), meaning the predicted sets where not independent.We corrected these predictions by removing one of the endpoints of each induced edge from the set and only report results after this correction.We compare RUN-CSP against two baselines: ReduMIS, a state-of-the-art MAX-IS solver (Akiba and Iwata, 2016;Lamm et al., 2017) and a greedy heuristic, which we implemented ourselves.The greedy procedure iteratively adds the node with lowest degree to the set and removes the node and its neighbors from the graph until the graph is empty.Figure 4 shows the achieved independent set sizes, each data point is the mean IS size across 100 random graphs.For graphs with 100 nodes, RUN-CSP achieves similar sizes as ReduMIS and clearly outperforms the greedy heuristic.On larger graphs our network produces smaller sets than ReduMIS.However, RUN-CSP's performance remains similar to the greedy baseline and, especially on denser graphs, outperforms it.

Benchmark Instances
For more structured instances, we use a set of benchmark graphs from a collection of hard instances for combinatorial problems (Xu, 2005).The instances are divided into five sets with five graphs each.These graphs were generated through the RB Model (Xu and Li, 2003;Xu et al., 2005), a model for generating hard CSP instances.A graph of the class frbc-k consists of c interconnected k-cliques and the MAX-IS has a forced size of c.The previous model trained on Erdős-Rényi graphs did not perform well on these instances and produced sets with many induced edges.Thus, we trained a new network on 2,000 instances we generated ourselves through the RB model.The exact generation procedure of this dataset is provided in the Supplementary Material.We set κ 0.1 to increase the importance of the independence condition.The predictions of the new model contained no induced edges for all benchmark instances.Table 7 contains the achieved IS sizes.We observe that RUN-CSP yields similar results to the greedy heuristic.While our network does not match the state-ofthe-art heuristic, it beats the greedy approach on large instances with over 100,000 edges.We have presented a universal approach for approximating MAX-CSPs with recurrent neural networks.Its key feature is the ability to train without supervision on any available data.Our experiments on the optimization problems MAX-2-SAT, MAX-CUT, 3-COL and MAX-IS show that RUN-CSP produces high quality approximations for all four problems.Our network can compete with traditional approaches like greedy heuristics or semi-definite programming on random data as well as benchmark instances.For MAX-2-SAT, RUN-CSP was able to outperform a state-of-the-art MAX-SAT Solver.Our approach also achieved better results than neural baselines, where those were available.RUN-CSP networks trained on small random instances generalize well to other instances with larger size and different structure.Our approach is very efficient and inference takes only a few seconds, even for larger instances with over 10,000 constraints.The runtime scales linearly in the number of constraints and our approach can fully utilize modern hardware, like GPUs.Overall, RUN-CSP seems like a promising approach for approximating Max-CSPs with neural networks.The strong results are somewhat surprising, considering that our networks consist of just one LSTM cell and a few linear functions.We believe that our observations point toward a great potential of machine learning in combinatorial optimization.

Future Work
We plan to extend RUN-CSP to CSPs of arbitrary arity and to weighted CSPs.It will be interesting to see, for example, how it performs on 3-SAT and its maximization variant.Another possible future extension could combine RUN-CSP with traditional local search methods, similar to the approach by Li et al., (2018) for MAX-IS.The soft assignments can be used to guide a tree search and the randomness can be exploited to generate a large pool of initial solutions for traditional refinement methods.

FIGURE 3 |
FIGURE 3 | Percentage of satisfied clauses of random 2-CNF formulas for RUN-CSP, Loandra and WalkSAT.Each data point is the average of 100 formulas; the ratio of clauses per variable increases in steps of 0.1.

FIGURE 4 |
FIGURE 4| Independent set sizes on random graphs produced by RUN-CSP, ReduMIS and a greedy heuristic.The sizes are given as the percentage of nodes contained in the independent set.Every data point is the average for 100 graphs; the degree increases in steps of 0.2.

TABLE 1 |
MAX-2-SAT: Number of unsatisfied constraints for MAX-2-SAT benchmark instances derived from the Ising spin glass problem.

TABLE 2 |
MAX-CUT: P-values of graph cuts produced by RUN-CSP, Yao, SDP, and EO for regular graphs with 500 nodes and varying degrees.We report the mean across 1,000 random graphs for each degree.

TABLE 3 |
MAX-CUT: Achieved cut sizes on Gset instances for RUN-CSP, DSDP, and BLS.

TABLE 6 |
MAX-3-COL: Percentages of unsatisfied constraints for each graph class under the different RUN-CSP models.Values are averaged over 1,000 graphs and the standard deviation is computed with respect to the five RUN-CSP models.

TABLE 5 |
3-COL: Percentages of hard 3-colorable instances for which optimal 3colorings were found by RUN-CSP, Greedy, and HybridEA.We evaluate on 1,000 instances for each size.We provide mean and standard deviation across five different RUN-CSP models.

TABLE 7 |
MAX-IS: Achieved IS sizes for the benchmark graphs.We report the mean and std.deviation for the five graphs in each group.Frontiers inArtificial Intelligence | www.frontiersin.orgFebruary 2021 | Volume 3 | Article 580607 10