A machine learning framework for neighbor generation in metaheuristic search

This paper presents a methodology for integrating machine learning techniques into metaheuristics for solving combinatorial optimization problems. Namely, we propose a general machine learning framework for neighbor generation in metaheuristic search. We first define an efficient neighborhood structure constructed by applying a transformation to a selected subset of variables from the current solution. Then, the key of the proposed methodology is to generate promising neighbors by selecting a proper subset of variables that contains a descent of the objective in the solution space. To learn a good variable selection strategy, we formulate the problem as a classification task that exploits structural information from the characteristics of the problem and from high-quality solutions. We validate our methodology on two metaheuristic applications: a Tabu Search scheme for solving a Wireless Network Optimization problem and a Large Neighborhood Search heuristic for solving Mixed-Integer Programs. The experimental results show that our approach is able to achieve a satisfactory trade-off between the exploration of a larger solution space and the exploitation of high-quality solution regions on both applications.


Introduction
Combinatorial Optimization (CO) is an important class of optimization problems in Operations Research (OR) and Computer Science (CS).In general, a CO problem is defined by a set of decision variables, a constrained solution space and an objective function.The goal of CO is to find optimal solutions with respect to the objective in the solution space.
Classical methods for solving CO problems can be roughly divided into three classes: exact methods, heuristics and metaheuristics.Mixed-Integer Programming (MIP) is one of the main paradigms of exact methods for modeling complex CO problems.Over the last decades, there has been increasing interest in improving the ability to solve MIPs effectively.Modern MIP solvers incorporate a variety of complex algorithmic techniques, such as primal heuristics [1], Branch and Bound (B&B) [2], cutting planes [3] and pre-processing, which results in complex and sophisticated software tools.
Exact methods are guaranteed to find optimal solutions as well as a proof of their optimality.On the one hand, due to the NP-hardness nature of many CO problems, solving them to optimality within an exact algorithm is still a very challenging task.On the other hand, in many practical applications, one is more interested in getting a good solution within a reasonable time rather than finding an optimal one.Those practical requirements have motivated the development of specific heuristics and metaheuristics (MHs).Specific heuristics are usually designed for solving a certain type of CO problem, whereas MHs are frameworks for designing heuristics for solving general CO problems, and provide guidelines to integrate basic heuristic concepts with high-level diversification strategies.
It is worth noting that a lot of information can be produced and observed from MH processes, and therefore, a large volume of data can be collected.These data might provide valuable information about the optimization status of the process, the characteristics of the problem, the structures and properties of high-quality solutions in the solution regions being visited.However, such knowledge has not been fully exploited by traditional MH algorithms.
That can be viewed as a disadvantage when compared to application-targeted CO algorithms, in which exploiting the domain knowledge of the problem is generally preferable.Meanwhile, real-world CO problems have a rich structure.With similar instances repeatedly solved in many applications, statistical characteristics and patterns appear.This provides the opportunity for Machine Learning (ML) to extract structural properties of the problem from data and automatically produce learning-based MH strategies.
ML is a subfield of Artificial Intelligence (AI) that involves developing algorithms to learn knowledge from data and make predictions on new problems.In recent years, the application of ML techniques in CO became an emerging research area with quite a number of contributions for different purposes.On the one hand, some research has been devoted to develop heuristics for solving CO problems by ML, i.e., to perform "endto-end learning" to directly generate good solutions for a CO problem.On the other hand, ML has been applied in combination with CO algorithms, where the learning based algorithms have the potential to achieve better performances, either because the current strategy of performing some auxiliary tasks is computationally expensive or because they are poorly understood from the mathematical viewpoint.For a detailed review of "learn to optimize", the interested reader is referred to the survey [4].
Specifically for metaheuristics, ML techniques can be used to infer patterns from data generated from MH processes.Integrating the extracted knowledge into the search strategies can lead MHs to search the solution space more efficiently and significantly improve the current performance.Recently, the application of ML techniques for MHs has attracted increasing research interest and we refer the interested reader to the following surveys [5,6].
In this paper, we focus on the integration of ML techniques into MHs and propose a general learning-based framework for neighbor generation in MHs.We first define an efficient neighborhood structure by applying a transformation to a selected subset of variables from the current solution.Then, the problem is to determine how to select a subset of variables that leads to a promising neighborhood of solutions containing a descent of the objective in the solution space.By conducting a classification task, our method learns good variable selection policies from both structural characteristics of the problem and high-quality solutions.We will demonstrate the effectiveness of our approach on two applications: a Tabu Search scheme for solving a Wireless Network Optimization problem and a Large Neighborhood Search heuristic for solving MIPs.

Background
In this section, we introduce the necessary background and notation.

Combinatorial Optimization
CO problems are a class of optimization problems with a set of decision variables and a defined solution space.Without loss of generality, a CO problem can be formulated into a constrained optimization problem as follows: where the index set V := {1, . . ., n} of decision variables is partitioned into B, G, C, which are the index sets of binary, general integer and continuous variables, respectively.Since many CO problems are NP-hard, determining optimal solutions by exact methods requires in the worst case exponential time and might be intractable, especially for large-size applications.In many practical applications where the CO problems are hard and complex, practitioners are often interested in finding good-quality solutions in an "acceptable" amount of computing time rather than solving the problem to optimality.Therefore, heuristic algorithms are developed to efficiently compute highquality solutions.

Algorithm 1 NS-based metaheuristic
Input: an initial solution x;

Metaheuristics
Metaheuristics are general framework strategies for designing heuristics for solving CO problems, and provide guidelines to integrate basic heuristic schemes such as local search with high-level diversification strategies.A large part of metaheuristics are built on top of a basic neighborhood search (NS) scheme, where NS is a local search procedure that starts from an initial solution x and iteratively search for improving solutions by exploring a series of neighborhoods.
We consider an instance p ∈ P of a CO problem, where P is the set of problem instances.X is the solution space, i.e., the set of feasible solutions.The neighborhood is defined as follows.Let x be a solution of an instance p such that x ∈ X .The neighborhood N (x) of solution x is a subset of solutions defined from x in the solution space X , i.e., N (x) ⊆ X .
In general, the structure of the neighborhoods and how these neighborhoods are explored, are designed according to the characteristics of the problem at hand.A neighborhood structure is typically determined by a transformation operator that applies a move to the current solution in the solution space.
A transformation operator is typically represented by ∆ : X → 2 X , which is a function (or a set of functions) that maps a solution x to a set of solutions.If the operator ∆ is defined with a set M of parameters, the operator can be defined by ∆ : X × M → 2 X .For instance, a neighborhood N (x) → ∆(x, m) can be constructed by applying the transformation operator with m ∈ M to the current solution x.
At each NS iteration, the neighborhood N (x) is generated by a ∆(x) or ∆(x, m) function.A basic template for NS-based metaheuristic is shown in Algorithm 1.

Representation Learning for CO
In ML, there is a vast library of models for representing CO problems depending on the format of input data of the CO task.For example, the model could be a linear function or some non-linear Artificial Neural Network (ANN) with a set of parameters to be optimized.In Deep Learning (DL), there are many types of neural networks and architectures available for modeling CO problems.For instance, a Multilayer Perceptron (MLP) is the simplest architecture of feedforward neural networks and can be used to model problems with fixed-size input data; Graph Neural Networks (GNNs) are developed to process data that are naturally representable with graphs; Recurrent Neural Networks (RNNs) can be applied to process CO problems with sequential data, etc.
In particular, due to the ubiquity of graph data, problems over graphs arise in numerous application domains.Moreover, given the fact that the vast majority of CO problems have a discrete nature, many of them are naturally described in graphs or can be modeled into a graph structure.For instance, a network optimization problem can be naturally modeled by a graph and a generic MIP instance can also be represented into a bipartite graph [7].These graphs have inherent structural commonalities and patterns, and there is a potential to exploit valuable graph features to learn patterns from data.Therefore, graph representation learning with the application of various GNNs [8,9,10,11] has recently emerged as a popular approach for studying CO with a machine learning perspective.Without loss of generality, GNNs learn a graph embedding based on the input features and the graph structure.In a nutshell, higherlevel representations of a node or an edge are obtained by kernel convolutions which leverage its local structure.
In the literature, there has been an effort to learn algorithms for solving specific CO problems and many of them apply GNNs as the representation model.On the one hand, the first attempted paradigm was "end-to-end" learning for generating a heuristic solution by a ML model [12,13,14,15,16,17].However, these methods are typically limited to specific CO problems in which a heuristic solution can be easily constructed, and scaling to large-size instances is an issue.On the other hand, since a wide range of constrained CO problems can be formulated into a MIP model, there has also been increasing interest in learning decision rules to improve MIP algorithms [7,18,19,20,21,22].While it is shown that this direction has a great potential to improve the state-of-the-art of MIP algorithms, convincing generalization performances, and transfer learning across instances have not been fully tackled yet.

Methodology
In this section, we present our framework for learning to generate high-quality solution neighbors for MH search.In Section 3.1 and 3.2, we will first introduce an efficient neighborhood structure by applying a transformation to a selected subset of variables from the current solution.Then, in Section 3.3 the problem becomes to determine how to select a subset of variables that leads to a promising neighborhood, i.e., one containing a descent direction of the objective in the solution space.By conducting a classification task, our method learns promising variable selection policies from the structural characteristics of the problem and high-quality solutions.

Transformation Operator and Neighbor Generation
The definition of the neighborhood plays a critical role in the NS-based metaheuristics.It specifies how the metaheuristic search moves in the solution space.As introduced in Section 2.2, the structure of a neighborhood is typically determined by a transformation operator, since the latter specifies how the current solution is perturbed and transformed to solution neighbors.
For instance, in a classical Traveling Salesman Problem (TSP) where a set of n cities and the distances between each pair of cities are given, the task is to find the shortest tour that visits each city exactly once.Given an arbitrary tour as the initial solution, a simple transformation operator can be defined by firstly removing k edges from the current tour and then adding k other edges to construct a new tour.This is known as the "k-OPT" neighborhood.For a generic CO problem at hand, there are many ways of defining a transformation operator.In general, two main aspects must be taken into consideration: exploitation (or intensification) and exploration (or diversification) of the solution space.
On the one hand, the smaller the perturbation induced by the operator, the closer the constructed neighborhood to the current solution.The metaheuristic search will thoroughly exploit the local solution regions around the current solution.On the other hand, with more perturbation or more randomness induced by the operator, the neighbors can be defined far from the current solution.The heuristic search will have a larger chance to explore more solution regions that have been less visited before.On the extreme case, when an operator completely perturbs the solution or the solution is allowed to be fully changed, the neighborhood could be expanded to the entire solution space, and the complexity of exploring this neighborhood will be very high, as high as solving the original problem.Instead, the heuristic search will become totally random when the operator perturbs a part of the solution randomly.Moreover, the design of the transformation operator also strongly depends on the type of problem and its representation.The effectiveness of an operator might not be the same on different types of problems.
In order to guide the metaheuristic search to promising regions of the solution space, we will present a general ML framework for neighbor generation.More precisely, we will design transformation operators with parameterizations that are able to construct neighborhoods containing high-quality solutions.

Variable Selection for Neighbor Generation
In this section, we will define a neighborhood structure in NS-based metaheuristics.As mentioned before, we aim at generating promising neighbors by efficient transformation operators.
In NS, improving solutions are typically found from neighbors defined by perturbing only a part of the solution.In practice, in order to control the size of the neighborhood, the ratio of perturbation is typically set to a relatively small value.Although this ratio can be increased by diversification strategies when no improving solution is found from the last search, it is still very common that the best solutions found by two consecutive local search iterations share a large part of variable values.Nevertheless, this "partial evolution" of the local optima induces a class of transformation operators for constructing structural neighborhoods, where the transformation is defined on a subset of variables.Definition 1. A"subset" transformation operator is a function ∆ : X × Y → 2 X , where X denotes the solution space, V is the index set of variables, and Y = {0, 1} |V| defines a binary decision space for selecting a subset of V. Consequently, for y ∈ Y, the neighborhood N (x) of a solution x ∈ X is defined as ∆(x, y), and only variables x i with y i = 1 can have their value changed by transforming x into a neighbor in N (x).
When the components of y are all set to 1, all the variables in V will be allowed to change.As discussed before, generating neighbors over the entire variable set V might result in a large local problem.On the other hand, the "partial evolution" of improving solutions in local search also indicates that it is possible to define efficient transformation operators only on a subset of variables.Moreover, common characteristics are often present in good solutions in many applications.For instance, in a classical TSP problem, two cities that are far from each other are typically disconnected in good solutions, whereas cities close to each other have a higher probability of being connected.Hence, ML techniques can be employed to learn structural information from high-quality solutions and select a subset of variables that has a high probability of defining a neighborhood that contains a descent point of the objective in the solution space.

Learning a Variable Selection Policy for Structural Neighbor Generation
Given an instance of a generic CO problem and a solution x ∈ X , we denote its current state as s ∈ S, where S is the state space of the problem and typically consists of a set of selected features including x.In order to define a structural neighborhood using the subset transformation operator (defined by Definition 1), an instantiation of Y is required to select a subset of variables.Hence, a variable selection policy π for selecting y ∈ Y can be defined by

Now, the question is
How to design a variable selection policy by which the subset transformation operator defines neighborhoods containing high-quality solutions?
As mentioned before, the high-quality solutions found during the metaheuristic search often share common characteristics and patterns, and more importantly, many improved solutions in the neighborhood often share a part of the variables with the same values.Hence, we propose a general learning-based framework to extract these characteristics from data, and exploit the learned knowledge to guide the metaheuristic search towards compact and promising solution regions.Specifically, the framework learns a variable selection policy and the high-quality neighborhood structure will then be defined by selecting promising variables, the values of which are allowed to be changed from the current solution.
The pipeline of the framework consists of three components: data generation, machine learning and neighbor generation design.The framework is depicted in Figure 1.In the following, we will instantiate the learning problem for variable selection as a classification task and train a variable selection policy through supervised learning.Within our method, the solution neighbors in metaheuristic search are generated based on our trained model.

Generation of a Training Data Set
We aim to learn a policy π that maps the state s of a problem instance, to a highquality y (referred to as label in the following), the binary classification decisions for selecting the best subset of variables that leads to a successful change of the current solution.For this purpose, one typically generate a training data set by applying an algorithm (referred to as expert) which explores the neighborhood of some states s 1 , . . ., s N , allowing to modify the values of all the variables, and then detects which are the variables whose value modification results in an improvement.Hence a label y i is associated with each s i (i = 1, . . ., N ) and we thus have a training data set

Training the Variable Selection Policy
The policy we want to determine depends on parameters θ ∈ Θ of a model (see below), where Θ is the set of candidate parameters.Each policy is thus denoted π θ and, given a training data set D train = {(s i , y i ))} N i=1 , we aim to determine parameters θ * such that π θ * (s i ) is as close as possible to y i (i = 1, . . ., N ).For this purpose, we solve the following classification problem: where L (π θ (s i ), y i ) denotes the loss function and will be defined according to the CO application (see more details in Sections 4.3.1,4.3.2 and 5.2).In summary, given a state s ∈ S, we use y = π θ * (s) for selecting which variables are allowed to be changed when moving to neighbors.
Modeling Since CO problems generally have a discrete nature and many of them can be represented by a graph structure, we will apply GNN architectures [8,9,10] for parametrizing the variable selection policy.More precisely, if the state of a CO instance is modeled by a graph, GNNs can be implemented as the representation model for π θ .GNNs exhibit some appealing properties for processing data in graph format.First of all, GNNs are size-and-order invariant to input data, i.e., they can process graphs of arbitrary size and topology, and the graph model is invariant to the ordering of the input elements, which brings a critical advantage compared to other neural networks.Moreover, GNNs can exploit the sparsity of the graph by localized propagation of information, making them an ideal class of models for embedding sparse CO problems [7].
The basic architecture of GNNs consists of 3 modules: the input module, the convolution module, and the output module.In the input module, the state s of the problem is fed into the GNN model.The input module embeds the features of s.The convolution module propagates the features of the graph components by graph convolution layers [7,22] and further embeds the features.
In particular, the architecture of the graph convolution layers used in this paper applies the message passing operator, defined as where v (h−1) i ∈ R d denotes the feature vector of node i in layer (h − 1), e j,i ∈ R m denotes the feature vector of edge (j, i) from node j to node i, N (i) denotes the set of nodes adjacent to i in the graph, and f φ denote the embedding functions paramaterized by ψ and φ in layer h and are typically represented by neural networks (e.g., MLP architectures).The output module maps the embedding of the state into a distribution in the binary decision space for each variable and the final output is the class prediction of the variable (typically 1 for being selected into the subset, 0 for not being selected).

Neighbor Generation Design
After training the classification model for variable selection, the next step is to apply the pretrained model for neighbor generation.It is important to note that, the trained classification model itself is probabilistic, and it maps the current solution state of an instance into a probability distribution in the binary decision space for each variable.One still needs to select a strategy to make binary decisions on the variables.
The most straightforward way is to apply a greedy strategy.The decision is made by always picking the class with a higher probability and the resulting strategy is deterministic.Another way is to sample decisions from the distribution, and hence the strategy is probabilistic.In general, the greedy strategy only selects the deterministic subset of variables and exploits the learned knowledge from the training data set.Instead, the probabilistic strategy selects from all the possible neighborhoods with a probability preference defined by the classification model, thus results in a better exploration of the less visited solution regions.There is no guarantee that one strategy is always better than the other.In practice, they can be combined.For each task, one should search for a good trade-off between exploitation of the local solution regions and exploration of the global solution space.
Finally, the template for a NS-based metaheuristic guided by the variable selection policy π is given in Algorithm 2. The neighborhood structure N (x) is defined by the subset transformation operator that selects a subset of variables according to policy π θ * , and modifies some of their values to lead the heuristic search to promising solution regions.

Application 1: Tabu Search in Wireless Network Optimization
In this section, we apply our framework to the first case study, a Wireless Network Optimization (WNO) problem.We will demonstrate the effectiveness of our learningbased framework by generating solution neighbors in a Tabu Search scheme.

The Tactical WNO Problem
When telecommunications are necessary but standard networks are unavailable, as in the case of disaster relief operations, small temporary wireless networks, called tactical Algorithm 2 NS-based metaheuristic guided by the variable selection policy Input: an initial solution x and a variable selection policy π θ * ; x * ← x; repeat y ← π θ * (s) where s is the state associated with x; networks, are set up.These typically connect between 10 and 50 nodes (the key locations that must communicate) in a single network.The design of such networks can be optimized such that the network's weakest link is maximized.This, in turn, guarantees that all nodes can receive important information in a timely manner.
Tactical wireless network design is a complex non-linear combinatorial optimization problem that includes three sub-problems: the design of the topology (P 0 ), the configuration of the network (P 1 ), and the configuration of the antennas (P 2 ).These sub-problems are nested such that P 1 is defined given a topology, and P 2 is defined given a topology and a network configuration, namely max where T is the space of valid topologies and o is the objective function of the problem.The problem has been proposed by an industrial partner and the modeling of the full tactical wireless network design optimization problem can be found in [23].
Given a set of nodes V , the topology t ∈ T can be any undirected tree (V, E) that describes how information travels in the network between every pair of nodes.Each edge in the topology represents a direct connection between two antennas in the network.A network configuration selects a root node (also known as a master hub or gateway) and assigns waveforms and channels/frequencies to the edges.The waveforms are the communication protocols that depend on the local structure of the edges and the channels and frequencies characterize their radio signals.Together with the antenna configurations, they determine which edges interfere with each other.An antenna configuration requires to define an angular alignment with respect to the azimuth as well as a set of activated beams, in the case of multi-beam antennas.Given all these properties, the radio signals can be physically modeled by also taking into account the path losses and fade margins of the terrain between every pair of nodes.The data transmission speed (called direct throughput) TP uv for all the edges [u, v] ∈ E can then be computed and, depending on the traffic scenario X and its distribution of congestion n X uv in the edges, their effective throughputs TP uv /n X uv can also be computed.There are three traffic scenarios: scenario A is a single communication between any two nodes, scenario B is all nodes communicating with the root node, and scenario C is all nodes communicating with each other.For all traffic scenarios, the congestion of any directed edge (u, v) can be computed as a function of the number of descendants desc v defined by the root node selected in the network configuration.
To evaluate a single topology t ∈ T , problem P 1 , which is itself a complex combinatorial optimization problem, must be solved, and solving it requires solving P 2 many times as well.In turn, problem P 2 can be efficiently solved by a simple geometrically-based heuristic.Solving P 1 to optimality every time that a topology needs to be evaluated is too costly, especially in a NS context.It can be approximated in such a way that it can be solved efficiently by exhaustive enumeration.We denote the resulting approximated objective function by f (t).However, this approximation does not necessary provide a feasible network because it does not define a proper frequency assignment.Alternatively, P 1 can also be tackled directly by considering greedy frequency assignments which do provide feasible networks, although this takes considerably more time.We denote this final objective function by f (t).
Given such procedures for solving P 1 and P 2 , P 0 , given by max t∈T f (t), can be solved with a NS-based MH in the space of topologies, where local directions of descent are computed with f (t).

Neighborhoods
The neighborhood structure dictates how the local search can move in the space of solutions.Our neighborhood N is the edge-swap neighborhood.To construct the neighbor topologies, for each tree edge e ∈ E in the current topology, we remove it, which disconnects the topology into two connected components, and we consider every possible way of reconnecting these two connected components with another edge.Hence, ∆(t) is the set of all topologies that can be obtained from t by an edge-swap move.
An example of neighbor topologies is depicted in Figure 2, where the edges in blue were "swapped".

Tabu Search
Tabu Search (TS) is a MH that explores the solution space by temporarily not allowing to move in the direction from which it came.The only case in which this rule does not apply is when moving in such a direction improves the incumbent solution.This is done by using tabu lists that record the opposites of the last few moves, which then become tabu for the next few moves.
For our edge-swap neighborhood, two tabu lists are necessary: a list L drop for forbidding edges to be dropped and a list L add for forbidding edges to be added.Let t ∈ N (t) be a neighbor of the current topology t, obtained by replacing an edge e with an edge e : we consider t as tabu if e ∈ L drop or (non exclusive) e ∈ L add .When a move is actually made from the current topology, the newly dropped edge is added to L add (it cannot be added back for the next few moves) and the newly added edge is added to L drop (it can not be dropped for the next few moves).
The lengths of the tabu lists determine for how many moves the dropped/added edges have tabu status.As observed in [23]

Learning to Generate Edge-Swap Neighbors for TS
As a NS-based MH, TS can be roughly described as a NS scheme plus a "tabu" strategy for preventing cycling by keeping a short-term memory of visited solutions stored in the tabu list.As discussed above, the topology design for the wireless network optimization problem can be solved by applying a topology TS algorithm (Algorithm 3).In this TS scheme, a neighborhood N (t) of the current solution t is constructed by applying an edge-swap move operator.Specifically, the move operator consists of two steps: an edge will be dropped from the current topology (by enforcing the value of the dropped edge variable from 1 to 0) and another edge will be added to complete a new topology (by enforcing the value of the added edge variable from 0 to 1).As a result, the edge-swap neighborhood consists of all the edge-swap moves.
In Algorithm 3, the neighborhood search at each TS iteration explores the entire edge-swap neighborhood by enumerating all possible moves.Although the algorithm is guaranteed to find a locally optimal solution by exploiting the entire edge-swap neighborhood, it is generally slow in terms of computing time since a complex subproblem (P 1 ) has to be solved to evaluate each "edge-swap" move, and another subproblem (P 2 ) has to be solved to evaluate each solution in P 1 .
A potential improvement of the enumeration strategy is to apply a random strategy to sample from droppable edge variables and addable edge variables, thus evaluating only a subset of possible moves.The random strategy generally explores a larger solution space than the enumeration strategy because, if the overall amount of time for the algorithm is fixed, it is able to do more iterations.However, random sampling might not be efficient enough for guiding the search towards high-quality neighborhoods.To achieve a better trade-off between exploitation and exploration, we propose to exploit structural characteristics of improving moves, and learn good variable selection policies for selecting "drop" edge variables and "add" edge variables to generate size-reduced, but high-quality edge-swap neighborhoods.The scheme for variable selection consists of two components: a classifier for selecting the "drop" edge variables and another classifier for selecting the "add" edge variables.The scheme is depicted in Figure 3 and a detailed pseudocode will be given in Algorithm 4.

Learning to Drop Edges
Let S denote the state space of the topologies of a WNO instance, and let Y d = {0, 1} |I| denote the set of possible binary decisions for the droppable edge variables, where I is the index set of droppable edges in the current topology.We aim to learn a "drop-edge" policy that maps each state s ∈ S to a label π d θ * (s) = y d ∈ Y d .The generation of a training dataset is explained in Section 4.4.1.Roughly speaking, we consider a set J of WNO instances, and apply T j iterations of Algorithm 3 to each instance j ∈ J.At each iteration of each instance, we determine the edges that can be dropped and replaced by another edge to produce a better topology.This gives a label for every visited state.Let {s i=1 be the states of the j th instance, and let {y i=1 be the corresponding labels.As explained in Section 3.3.2, the policy π d θ * is obtained by solving the following problem: where L is the loss function.
Feature Design Since the topologes naturally have a graph structure, we represent each state s by a graph.More precisely, given a WNO instance with n nodes, we consider d features for each node and e features for each edge.The matrix of node features is denoted by V ∈ R n×d and the tensor of edge features is denoted by E ∈ R n×n×e .The features used for learning which edges have the best potential to lead to an improvement when dropped are listed in Table 1.
GNN Model Because of the intrinsic graph structure of the topologies, it is natural to apply GNN to model the drop-edge classifier.The chosen GNN architecture has 3 modules: the input module, the convolution module and the output module.The output module embeds the hidden features extracted from the convolution module and maps the embedding of each edge in the current topology into a two-neuron output.
Loss Function From the data generation process, we observed that class distribution is unbalanced and only less than 50% of droppable edges lead to an improving move.Since the objective of the learning is to select as many "improving" edges to be dropped as possible, it is reasonable to put more effort in improving the predictions on the Table 1: Description of the features in the "droppable" graph (N, E).More detailed definitions of the listed features can be found in [23].
minority class.Therefore, we apply a weighted Cross Entropy (WCE) loss to train the model.Specifically, we add a penalty factor λ for the "improving" class and 1 − λ for the "non-improving" class.Formally, the WCE loss is defined as where ŷ denotes the prediction of probability given by the policy model, y denotes a label, and I is the index set of droppable edges in the current topology.The penalty factor λ ∈ [0.5, 1] imposes a larger loss for the "improving" class during training.

Learning to Add Edges
After dropping an edge, the next step is to select an edge to be added from all addable edges.Let S denote the state space of all graphs obtained from a topology by dropping an edge, and let Y a = {0, 1} |I| be the set of possible binary decisions for addable edges, where I is the index set of addable edges.We aim to learn an "add-edge" policy π a θ * that maps each state s ∈ S to a label y a ∈ Y a .The generation of a a training dataset is similar to what was done for dropping edges.Algorithm 3 is applied to a set J of WNO instances for a few iterations in order to determine which edges have the potential to improve a topology once one of their edges has been dropped.More details are given in Section 4.4.1.Let {s (j) i } T j i=1 be the states of the j th instance, and let {y a(j) i } T j i=1 be the corresponding labels.The policy π a θ * is obtained by solving the following problem: .
Feature Design As was done for dropping edges, we model a state s by a graph.Given a WNO instance with n nodes, d features for each node, and e features for each edge, we denote by V ∈ R n×d the matrix of node features and by E ∈ R n×n×e the tensor of edge features.The features used for learning which edges have the best potential to lead to an improvement when added are listed in Table 2.
Tensor Feature Description Table 2: Description of the features in the "addable" graph (N, E).

GNN Model
As for the drop-edge classifier, we also apply GNN to model the addedge classifier for selecting the edge to be added.The GNN architecture is the same as for the drop-edge classifier.

Loss Function
The class distribution for addable edges is even more unbalanced than for the droppable case.Actually, only less than 10% of addable edges lead to an improving move.Therefore, we also applied the WCE loss to train the model with an emphasis on the minority class.

Numerical Experiments
This section contains the experimental results for the WNO application.After presenting the data collection, in Sections 4.4.2 and 4.4.3,we discuss the experimental setting and the evaluation metrics, respectively.Finally, the results are reported and discussed in Section 4.4.4.

Data Collection
Problem Instance Generation For the numerical experiments, the instances were generated in the following way.First, independent coordinates for the explicit nodes v ∈ V are iteratively generated using where U 0 , U 1 , U 2 , U 3 ∼ U(0, 1) are independent and identically distributed (IID).These coordinates are then scaled to match the average distance ratio of 10 km.This coordinate generation is repeated until the minimum distance is above 2 km and the maximum distance is below 150 km.Once this is achieved, for each possible edge [u, v] with u, v ∈ V , a random uniform variable is sampled for the path loss and another is sampled for the fade margin, both according to empirical cumulative distribution functions derived from North-American datasets.
We generated small instances of 10 nodes, as well as larger instances of 30 nodes.For these instances, the TS MH is initialized with a minimum spanning tree with a specific distance based on the path losses and fade margins of the terrain between every pair of nodes [23].
Training Data Generation To collect data for training the drop-edge and addedge classifiers, we call Algorithm 3 to evaluate all possible moves in the edge-swap neighborhood, and compute labels according to the learning task.In particular, given a WNO instance, and an initialization of the topology, we execute Algorithm 3 as follows: for each TS iteration, we first call the TS algorithm to evaluate all possible edge-swap moves.For each droppable edge, we evaluate all the addable edges.If the best resulting edge-swap move within the neighborhood of the droppable edge leads to a better approximated pseudo-objective, then the droppable edge is labeled as "improving"; otherwise, it is labeled as "non-improving".Given the dropped edge, the label of each addable edge is also decided by the quality of the resulting edge-swap move.If the move leads to a better approximated pseudo-objective, then the addable edge within the neigborhood of the corresponding dropped edge is labeled as "improving"; otherwise, it is labeled as "non-improving".

TS Guided by GNNs
The TS algorithm with GNN classifiers (TS-GNN) is obtained by using the learned policies π d θ * and π a θ * to generate edge-swap neighbors.Hence, the pseudocode of TS-GNN is the same as Algorithm 3, except that the neighborhood N (t) of t results from the subset transformation operator that performs edge-swaps according to π d θ * and π a θ * .A detailed pseudocode is given in Algorithm 4.
Algorithm 4 P 0 Topology Tabu Search with GNN classifiers Input: an initial topology t, a variable selection policy π d θ * for dropping edges, and a variable selection policy π a θ * for adding edges; where s is the state associated with t; for every e i in t such that y d i = 1 do set s i equal to the state obtained from s by dropping e i from topology t; y a ← π a θ * (s i ); for each reconnecting edge e j such that y a j = 1 do set t i,j equal to the topology obtained from t by dropping edge e i and adding edge e j ; end Make move t ← t and update tabu lists L drop and L add ; if f (t) > f (t * ) then t * ← t; end until termination condition is reached ; return t *

Experimental Setting
Training We train the drop-edge policy and add-edge policy on the collected training dataset separately.The training dataset is generated from 30 small instances of 10 nodes.The collected data is split into training (70%), validation (10%), and test (20%) sets.
Evaluation We evaluated the compared algorithms listed in Section 4.4.4 on two evaluation sets.The first evaluation set contains 50 small instances of 10 nodes.In addition, to evaluate the generalization performance of our approach on larger instances, we also evaluate the model trained on small instances on a large evaluation set with 5 instances of 30 nodes.
Experimental Environment Our code is written in Python 3.9 and we use Pytorch 1.60 [24], Pytorch Geometric 1.7.0 [25] for implementing and training the GNNs.

Evaluation Metrics
We use the primal integral [26] to measure the performance of the compared algorithms.The primal integral was originally proposed to measure the performance of primal heuristics for solving mixed-integer programs.The metric takes into account both the quality of solutions and the computing time spent to find those solutions during the solving process.To define the primal integral, we first consider a primal gap function p(t) as a function of time, defined as where x(t) is the incumbent solution at time t, and γ(•) ∈ [0, 1] is the scaled primal gap where f (x) denotes the objective value given solution x, xopt is either the optimal solution or the best one known for the instance and xinit is the initial solution.
Let t max > 0 be the time limit for executing the heuristic.The primal integral measure is then defined as

Results
In order to validate our approach, we compared multiple versions of our TS-GNN algorithm with the baselines.Specifically, we compared the following algorithms: • TS with No-Classifier, the baseline topology TS algorithm with an enumeration strategy that evaluates all the possible edge-swap neighbors; • TS with Random-Add-Classifier, the baseline topology TS algorithm with a sampling strategy that randomly selects edges to be added for generating edge-swap neighbors; • TS with Random-Add-Drop-Classifier, the baseline topology TS algorithm with a sampling strategy that randomly selects both the edges to be dropped and the edges to be added for generating edge-swap neighbors; • TS with GNN-Add-Classifier, the TS algorithm plus the GNN classifier for selecting the edges to be added for generating edge-swap neighbors; • TS with GNN-Drop-Classifier, the TS algorithm plus the GNN classifier for selecting the edges to be dropped for generating edge-swap neighbors; • TS with GNN-Add-Drop-Classifier, the TS algorithm plus the GNN classifiers both for selecting the edges to be dropped and for selecting the edges to be added.
We evaluated the performance of all the compared algorithms on both small instances with 10 nodes and larger instances with 30 nodes.For each test set, we computed the average primal integral defined in Section 4.4.3 as well as the average number of iterations.The primal integral is our main performance metric for evaluating metaheuristics and it measures the speed of convergence of the objective over the entire search time (the smaller, the better).In addition, the number of iterations counts the number of neighborhoods explored by the algorithm.With the same running time, it reflects how much exploration is guaranteed by modifications made to the basic TS scheme.Indeed, larger number of iterations for the same amount of time indicates a faster moving between neighborhoods and is generally preferable, although the outcome of the search still depends on the quality of the neighborhoods.The results are shown in Figures 4 and 5 for instances with 10 and 30 nodes, respectively.From these results, we observe that all our ML-based TS algorithms with GNN classifiers outperform the original TS baseline with no classifier.The No-Classifier baseline employs an enumeration strategy that evaluates all the possible moves in the edge-swap neighborhood.Although it is guaranteed to find the best move at each TS iteration, its more extensive exploitation results in the lowest number of iterations over all the compared algorithms.For small instances of 10 nodes, the performance of No-Classifier is still decent, as the solution space of these instances is small for exploration.However, it gives the worst results compared to all other algorithms on larger instances because The algorithms with random classifiers, on the contrary, are more efficient in terms of number of iterations.The Random-Add-Classifier algorithm explores a random subset of addable edges in the edge-swap neighborhood and the Random Add-Drop Classifier generates an even smaller neighborhood by sampling from both droppable and addable edges.On small instances with 10 nodes, Random-Add-Classifier achieves a better performance with a lower primal integral although the Random-Add-Drop-Classifier algorithm has a larger number of iterations.This is because, for small instances, the quality of the solutions in the sampled neighborhoods is more important than the number of iterations.Whereas for larger instances of 30 nodes, Random-Add-Drop-Classifier performs better than Random-Add-Classifier since the sampling efficiency becomes more important for exploring larger edge-swap neighborhoods.
On the one hand, the No-Classifier baseline can fully exploit each edge-swap neighborhood (exploitation) but has a low efficiency in terms of number of iterations.The baselines with random classifiers are more effective in increasing the number of iterations by reducing the size of each neighborhood (exploration), however, the quality of the neighborhood is restricted by its sampling efficiency.On the other hand, our complete ML-based algorithm with GNN-Drop-Add-Classifier achieves a smaller primal integral than all the baseline algorithms with a reasonably large number of iterations.Moreover, as the size of instances increases, the GNN-Drop-Add-Classifier algorithm becomes more competitive and significantly outperforms all the compared algorithms.These results demonstrate that our approach offers a good trade-off in both exploration and exploitation of the solution space.Since the GNN classifiers are only trained with data generated from small instances, the results also show that our method generalizes well on larger instances.

Application 2: Large Neighborhood Search in MIP
In this section, we demonstrate how to apply our methodology to another application: a Large Neighborhood Search (LNS) heuristic for solving MIPs.

Large Neighborhood Search
LNS is a refinement heuristic, i.e., given an initial solution, it is applied to improve the solution by exploring a "large" neighborhood.There are several ways of describing a LNS scheme.We adopt the following simple one based of 3 building blocks: • destroy function d: fixes the values of a subset of variables to the current solution x and "destroy" the rest.The output of this function is a sub-MIP with a neighborhood N (x).The size of the LNS neighborhood, i.e. the number of variables to be destroyed, will be selected as a hyperparameter; • repair function r: rebuilds the destroyed solution (in some cases, the repaired solution can be worse than the destroyed solution), typically by solving a sub-MIP defined by N (x); • accept function a: decides whether the new solution should be accepted or rejected.
Given as an input a feasible solution x, LNS searches for the best feasible solution x in the neighborhood of x (the size of the neighborhood is a parameter) and then updates x ← x.This is repeated until a termination condition is met.The pseudocode of the scheme is given in Algorithm 5.

Learning to Generate Neighbors for LNS
In a LNS heuristic, the solution neighbors are obtained by the destroy function that selects a subset of integer variables to be "destroyed", thus freed for neighbor generation.The remaining integer variables will be fixed to their values in the current solution.The classic LNS algorithm applies a randomized sampling strategy or hand-crafted rules for defining the destroy function (see, e.g., [27]).However, our observation shows that only a small subset of variables have the potential to improve the current solution by changing their values and such a subset is strongly dependent to the structure of the problem.Hence, in order to optimize the performance of LNS, we aim at learning new variable selection strategies to select a subset of variables to be freed.In particular, we investigate the dependencies between the state of the problem, defined by a set of both static and dynamic features collected from the LNS procedure (e.g., context of We now show how to use a GNN model for learning a destroy policy.The LNS neighborhoods will then be defined based on our pretrained GNN classifier.. Let S denote the state space of a MIP instance, and let Y = {0, 1} |I| denote all candidate subsets of variables to be destroyed, where I is the index set of integer variables.We aim to learn a destroy policy π θ * that maps each state s to a label y ∈ Y. To generate a training dataset, we consider a set J of instances, and apply T j iterations of the local branching heuristic [28] to each instance j ∈ J.We thus determine improving subsets of variables which give us labels.More details about this process can be found in Section 5.3.1.Let {s (j) i } T j i=1 be the states of the j th instance, and let {y (j) i } T j i=1 be the corresponding labels.As explained in Section 3.3.2, the policy π θ * can be obtained by solving the following classification problem:

Feature Design
We represent each state s of a MIP by a bipartite graph (V, C, E) [7].More precisely, assume that the considered MIP has n variables, m constraints, d features for the variables and q features for the constraints.The variables of the MIP with their feature matrix V ∈ R n×d are represented on one side of the graph.On the other side are nodes corresponding to the constraints with C ∈ R m×q being their feature matrix.A variable node i and a constraint node j are connected by an edge (i, j) if variable i appears in constraint j of the MIP.Finally, E ∈ R n×m×e denotes the tensor of edge features, with e being the number of features for each edge.The features in the bipartite graph are listed in Table 3.
Tensor Feature Description V sol val Solution value.
C bias Right-hand side value of the constraint.
Table 3: Description of the features in the bipartite graph s = (V, C, E).

GNN Model
Given that the state of a MIP instance can be represented as a bipartite graph, we propose to use GNN to parameterize the model for the destroy policy.Our GNN architecture again consists of 3 modules: the input module, the convolution module, and the output module.For a bipartite graph, a convolution layer is decomposed into two half-layers: one half-layer propagates messages from variable nodes to constraint nodes through edges, and the other one propagates messages from constraint nodes to variable nodes.We refer the reader to [7] for more details.The output module embeds the features extracted from the convolution module for the prediction of each variable, which maps the graph representation embedding of each variable into a two-neuron output.

Loss Function
The class distribution is highly unbalanced.In the training dataset, we observed that only less than 10% variables belong to the "destroy" class, i.e. the variables in this class have the potential to improve the current solution by changing their values.In order to adapt to the imbalanced distribution, we applied the WCE loss and focal loss [29] to train the model.

LNS Guided by GNNs
Our refined LNS heuristic with GNN classifier (LNS-GNN) is obtained by using the destroy policy π θ * in Algorithm 5. Hence, given a solution x and its associated state s, we set y equal to π θ * (s), we then destroy some variables x i with y i = 1 to get N (x), and we finally obtain a repaired neighbor x of x by solving the sub-MIP defined by N (x).

Numerical Experiments
In this section, we present the experimental results for the ML-based LNS.As for the WNO application, we first present the data collection, then, we discuss the experimental setting in Section 5.3.2 and the results in Section 5.3.3.The evaluation metrics remain the same as presented in Section 4.4.3.

Data Collection
MIP Benchmark To train, evaluate and compare algorithms, we have considered 126 MIP instances taken from the MIPLIB [30] dataset.For each instance, an initial feasible solution is required to start the LNS heuristic.We use an intermediate solution found by SCIP [31], typically the best solution obtained by SCIP at the end of the root node computation, i.e., before branching.
Training Data Generation To collect data for the classification task, we use the Local Branching (LB) algorithm of Fischetti and Lodi [28].More precisely, given a MIP instance and an initial incumbent x, we call the LB algorithm to explore as many neighbors as possible, with a time limit of 600 seconds, and by limiting to 25% the number of destroyed variables.The best solution x * found by LB is compared with x and the variables with changed values are labeled as improving variables for x.These improving variables are then distroyed from x, and the partial solution is repaired by calling a MIP solver.The resulting solution becomes the new incumbent to which we repeat the same process.This is done for a set J of instances, as explained in Section 5.2.

Experimental Setting
Training In the classification task for producing a good destroy policy, the GNN model learns from the features of the MIP formulation and its incumbent solution.We have trained our GNN classifier with 29 of the 126 considered MIP instances.
Evaluation The algorithms are compared and evaluated on the remaining 97 binary MIP instances.

Results
We compare our LNS-GNN algorithm with the following algorithms: • LNS-Random, the baseline LNS algorithm with a sampling strategy that randomly selects variables to destroy; • LB, the LB algorithm [28].
For all these algorithms, we have limited to 40 the number of variables that can be destroyed to generate neighbors.
For measuring the heuristic performance of the compared algorithms, we also compute the average primal integral defined in Section 4.4.3.We ran the listed algorithms for 60 seconds on each instance.The results are shown in Figure 6.From the results in Figure 6, we can see that the primal integral of the LB baseline is the largest over the entire solving time, which indicates that exploring all the possible neighbors within the same Hamming distance to the current solution is computationally expensive.Moreover, the LNS-Random baseline performs better than LB baseline by applying a random sampling strategy for selecting the subset of variables to be freed.Our LNS-GNN algorithm presents the best heuristic behavior in terms of the primal integral, showing that the pretrained GNN model is able to produce structural neighborhoods that contain improving solutions.These results demonstrate that our approach achieves a better trade-off between exploitation and exploration of the promising solution regions.Although, potentially, any of the baseline algorithms could be tuned to obtain better results, this is true for LNS-GNN too, and overall we believe that these results show clear promise.

Conclusions and Future Work
In this work, we presented a methodology for integrating machine learning techniques into metaheuristics for solving combinatorial optimization problems.Namely, we proposed a general ML framework for neighbor generation in metaheuristic search.We firstly defined a neighborhood structure constructed by applying a transformation operator to a selected subset of variables from the current solution.Then, the key of the proposed methodology is to learn a variable selection policy by which the subset transformation operator is able to produce neighborhoods containing high-quality solutions.We formulated the variable selection problem as a classification problem that exploits structural information from the characteristics of the problem and high-quality solutions.
We demonstrated our methodology on two applications.The first problem we addressed occurs in the context of Wireless Network Optimization, where a Tabu Search metaheuristic is used for the topology design sub-problem.In a predefined topology neighborhood structure, we trained classification models to select sized-reduced, but high-quality neighborhoods.This allowed the metaheuristic search to execute more iterations within the same amount of time.As each iteration requires to solve a series of complex combinatorial sub-problems, more iterations entail a greater exploration of the solution space.In addition, to demonstrate the broader applicability of our approach, we also applied our framework to the Large Neighborhood Search metaheuristic for solving MIPs.The experimental results of the two applications have shown that our approach is able to learn a satisfactory trade-off between the exploration of a larger solution space and the exploitation of promising solution regions in metaheuristic search.
Although deep neural networks such as GNNs have been wildly applied to represent combinatorial optimization problems, the current GNNs might not be expressive enough to capture all the crucial patterns from data [11].For future research, it would be interesting to develop more expressive ML models that exhibit better transferability and scalability across broader classes of problems.

Figure 1 :
Figure 1: A learning-based framework for neighbor generation.
, reasonable lengths for L drop and L add are ( |V | − 1)/2 and |V | (|V | − 1)/2 , respectively, where x means x rounded to the nearest integer.The pseudocode for the resulting topology Tabu Search metaheuristic is given in Algorithm 3.

V
coordinates X and Y coordinates in km desc v normalized desc v = ((|desc v | − 1)/(|V | − 1)) E pass loss path loss in dB fade margin fade margin in dB waveform binary (point-to-point or point-to-multipoint) throughput by mixed scenarios of A, B, C edge indices edge indices in current topology

V
coordinates X and Y coordinates in km desc v normalized |desc v| = ((|desc v| − 1)/(|V | − 1)) E edge type binary indicator of addable edges pass loss path loss in dB fade margin fade margin in dB edge indices edge indices in current topology

Figure 4 :
Figure 4: Evolution of the average primal integral and the average number of iterations over time on evaluation datasets for the instances of 10 nodes.

Figure 5 :
Figure 5: Evolution of the average primal integral and the average number of iterations over time on evaluation datasets for the instances of 30 nodes.
Algorithm 3 P 0 Topology Tabu Search Input: an initial topology t; t * ← t; t * ← t; L drop ← ∅; L add ← ∅; * ); if f (t ) > f (t * ) then t * ← t ; end Make move t ← t and update tabu lists L drop and L add ; if f (t) > f (t * ) then t * ← t; end until termination condition is reached ; return t *