Automated Discovery of Local Rules for Desired Collective-Level Behavior Through Reinforcement Learning

Complex global behavior patterns can emerge from very simple local interactions between many agents. However, no local interaction rules have been identified that generate some patterns observed in nature, for example the rotating balls, rotating tornadoes and the full-core rotating mills observed in fish collectives. Here we show that locally interacting agents modeled with a minimal cognitive system can produce these collective patterns. We obtained this result by using recent advances in reinforcement learning to systematically solve the inverse modeling problem: given an observed collective behavior, we automatically find a policy generating it. Our agents are modeled as processing the information from neighbor agents to choose actions with a neural network and move in an environment of simulated physics. Even though every agent is equipped with its own neural network, all agents have the same network architecture and parameter values, ensuring in this way that a single policy is responsible for the emergence of a given pattern. We find the final policies by tuning the neural network weights until the produced collective behavior approaches the desired one. By using modular neural networks with modules using a small number of inputs and outputs, we built an interpretable model of collective motion. This enabled us to analyse the policies obtained. We found a similar general structure for the four different collective patterns, not dissimilar to the one we have previously inferred from experimental zebrafish trajectories; but we also found consistent differences between policies generating the different collective pattern, for example repulsion in the vertical direction for the more three-dimensional structures of the sphere and tornado. Our results illustrate how new advances in artificial intelligence, and specifically in reinforcement learning, allow new approaches to analysis and modeling of collective behavior.

If in one of such systems we observe a particular collective configuration, how can we infer the local rules that produced it? Researchers have relied on the heuristic known as the modeling cycle [15,16]. The researcher first proposes a set of candidate local rules based on some knowledge of the sensory and motor capabilities of the agents. The rules are then numerically simulated and the results compared with the desired outcome. This cycle is repeated, subsequently changing the rules until an adequate match between simulated trajectories and the target collective configuration is found.
Studies in collective behavior might benefit from a more systematic method to find local rules based on known global behavior. Previous work has considered several approaches. Several authors have started with simple parametric rules of local interactions and then tuned the parameters of the interaction rules via evolutionary algorithms based on task-specific cost functions [17,18]. When the state space for the agents is small, a more general approach is to consider tabular rules, which specify a motor command for every possible sensory state [19].
These approaches have limitations. Using simple parametric rules based on a few basis functions produces models with limited expressibility. Tabular mapping has limited generalization ability. As an alternative not suffering from these problems, neural networks have been used as the function approximator [20][21][22]. Specifically, neural network based Q-learning has been used to study flocking strategies [23] and optimal group swimming strategies in turbulent plumes [24]. Q-learning can however run into optimization problems when the number of agents is large [25]. Learning is slow if we use Q-functions of collective states (e.g., the location and orientation of all agents) and actions, because the dimensionality scales with the number of agents. When implementing a separate Q-function for the state and action of each agent, the learning problem faced by each agent is no longer stationary because other agents are also learning and changing their policies simultaneously [26]. This violates the assumptions of Q-learning and can lead to oscillations or sub-optimal group level solutions [27].
Despite these difficulties, very recent work using inverse reinforcement learning techniques has been applied to find interaction rules in collectives [28,29]. These approaches approximate the internal reward function each agent is following, and require experimental trajectories for all individuals in the collective. Here, we follow a different approach in which we aim at finding a single policy for all the agents in the collective, and with the only requirement of producing a desired collective configuration.
Our approach includes the following technical ingredients. We encode the local rule as a sensorimotor transformation, mathematically expressed as a parametric policy, which maps the agent's local state into a probability distribution over an agent's actions. As we are looking for a single policy, all agents have the same parametric policy, with the same parameter values, identically updated to maximize a group level objective function (total reward during a simulated episode) representing the desired collective configuration. A configuration of high reward was searched for directly, without calculating a group-level value function and thus circumventing the problem of an exploding action space. For this search, we use a simple algorithm of the class of Evolution Strategies (ES), which are biologically-inspired algorithms for black-box optimization [30,31]. We could have chosen other black-box optimization algorithms instead, such as particle swarm algorithms [32]. However, this ES algorithm has recently been successful when solving Multi-Agent RL problems [33], and when training neural-network policies for hard RL problems [34].
We applied this approach to find local rules for various experimentally observed schooling patterns in fish. Examples include the rotating ball, the rotating tornado [35], the fullcore rotating mill [36], and the hollow-core mill [37]. To our knowledge, with the exception of the hollow-core mill [38,39], these configurations have not yet been successfully modeled using SPP models [11].

METHODS
We placed the problem of obtaining local rules of motion (policies) that generate the desired collective patterns in the reinforcement learning framework [40]. As usual in reinforcement learning, agents learn by maximizing a reward signal obtained from their interaction with the environment in a closed-loop manner, i.e., the learning agents's actions influence its later inputs Figure 1. To describe this interaction, it is necessary to specify a model of the agents and the environment, a reward function, a policy representation and a method to find the gradient of the reward function with respect to the parameters of the policy. Both the environment update and the reward are history-independent, and thus can be described in the framework of multi-agent Markov decision processes [41]. We describe the four components (agent and environment model, reward function, policy parameterization, and learning algorithm) in the following subsections.

A Model of the Agent and the Environment
We model fish as point particles moving trough a viscous threedimensional environment. In this section, we explain how we update the state of each and every fish agent.
Let us define a global reference frame, with Z axis parallel to the vertical, and X and Y in a horizontal plane. Length is expressed in an arbitrary unit, which we call body length (BL) because it corresponds to the body length of agents in the retina experiments that we describe in the Supplementary Text.
In this reference frame, we consider a fish agent moving with a certain velocity. We describe this velocity as three numbers: the speed V, elevation angle θ (i.e., its inclination angle is π 2 − θ ) and azimuth angle φ. In the next time step, we update the X, Y, and Z coordinates of the fish as where δ corresponds to the duration of a time step (see Table 1 and Table S2 for parameter values). The elevation angle, azimuth angle change, and speed change are updated based on three outputs of the policy network, p 1 , FIGURE 1 | Framework to obtain an interaction model producing a desired collective behavior. Upper: The policy network is a neural network that transforms the observation of an agent into its action. Its parameters are the same for all agents and chosen to produce a target collective behavior. For each focal fish (purple), the observation used to generate actions is formed by: the horizontal and vertical components of the focal fish, plus the three components of the relative position and three components of the absolute velocity of each of the neighbors. In the diagram, a single example neighbor fish is shown (green). From the observations of each focal fish, the policy network generates an action for each fish, modifying their heading and speed in the next time step according to the physics imposed for the environment. The new locations, velocities and orientations define a new state. Lower: Repeating the loop described above a number of times, we obtain a trajectory for each fish. We compare the trajectories with the selected desired configuration (rotating sphere, rotating tornado, rotating hollow-core mill and rotating full-core mill) producing a reward signal that it is used to change the parameters of the policy network.  The azimuth, φ, is updated using the first output of the policy, p 1 : with φ max the maximum change in orientation per unit time, and δ is the time step duration.
The elevation angle, θ , is calculated based on the second output of the policy network, p 2 , as where the maximum elevation is θ max Finally, the speed change is the sum of two components: a linear viscous drag component (with parameter α) and an active propulsive thrust determined by the third output of the policy network, p 3 , The parameter V max is the maximum active change of speed of a fish. This equation for the change in velocity captures that deceleration in fish is achieved through the passive action of viscous forces [42]. At the beginning of each simulation, we initialize the positions and velocities of all fish randomly (see Supplementary Text for details). The same state update Equations (1)- (6) are applied identically to all fish while taking into account that each one has a different position, speed, orientation and neighbors.

Reward Function
In our simulations, the final behavior to which the group converges is determined by the reward function. We aim to model four different collective behaviors, all of which have been observed in nature. These behaviors are called the rotating ball [16], the rotating tornado [35], the rotating hollow core mill [11,37], and the rotating full core mill [36].
At each time step, the configuration of agents allows to compute an instantaneous group level reward r(t). The objective of the reinforcement learning algorithm is to find ways to maximize the reward R in the episode, which is the sum of r(t) over the N steps of simulation time. The instantaneous reward r(t) is composed of several additive terms. In this section we will explain the terms used in the instantaneous reward function for the rotating ball, and in Supplementary Text we give mathematical expressions for the terms corresponding to the four collective structures.
The first term is composed of collision avoidance rewards, r c . It provides an additive negative reward for every pair of fish (i, j) based on their mutual distance d i,j . Specifically, for each neighbor we use a step function that is zero if d i,j > D c and −1 otherwise. This term is meant to discourage the fish from moving too close to one another.
The second term is an attraction reward, r a , which is negative and proportional to the sum of the cubed distances of all fish from the center of mass of the group. This attraction reward will motivate the fish to stay as close to the center of mass as possible while avoiding mutual collisions due to the influence of the collision reward. Together with r c , it promotes the emergence of a dense fish ball.
The third term in the instantaneous reward, r r , is added to promote rotation. We calculate for each fish i its instantaneous angular rotation about the center of mass in the X − Y plane, i . The rotation term, r r , is the sum of beta distributions of that angular rotation across all fish.
The fourth and final term, r v , penalizes slow configurations. It is a step function that is 0 if the mean speed is above V min and −1 otherwise. V min is small enough to have a negligible effect in the trained configuration, but large enough to prevent the agents from not moving. As such, this last term encourages the agents to explore the state-action space by preventing them from remaining still.
The reward functions designed to encourage the emergence of a rotating tornado and the rotating mills are described in the Supplementary Text but they in general consist of similar terms.
Unlike previous work in which each agent is trying to maximize an internal reward function [28,29], we defined the reward functions globally. Although each agent is observing and taking actions independently, the collective behavior is achieved by rewarding the coordination of all the agents, and not their individual behaviors.

The Policy Network
We parameterize our policy as a modular neural network with sigmoid activation functions, Figure 2. In our simulations, all fish in the collective are equipped with an individual neural network. Each network receives the state observed by the corresponding agent and outputs an action that will update its own position and velocity.
All the networks have the same weight values, but variability in the individual behaviors is still assured for two reasons. First, we use stochastic policies, which makes sense biologically, because the same animal can react differently to the same stimulus. In addition, a stochastic policy enables a better exploration of the state-action space [43]. Second, different fish will still act following different stochastic distributions if they have different surroundings. In the next section we will describe these networks and the implementation of stochasticity in the policy.

Inputs and Outputs
At each time step, the input to the network is information about the agent surroundings. For each focal fish, at every time step we consider an instantaneous frame of reference centered on the focal fish, with the z axis parallel to the global Z axis, and with the y axis along the projection of the focal velocity in the X − Y plane. For each neighbor, the variables considered are x i , y i , z i (the components of the neighbor i position in the new frame of reference) and v x,i , v y,i , v z,i (the components of the neighbor velocity in the new frame of reference). In addition, we also use v y and v z (the components of the focal fish velocity in the new frame of reference). Please note that the frame is centered in the focal fish, but it does not move nor rotate with it, so all speeds are the same as in the global frame of reference.
The policy network outputs three numbers, p 1 , p 2 , and p 3 (see next section for details), that are then used to update the agent's azimuth, elevation angle and speed, respectively.

Modular Structure
To enable interpretability, we chose a modular structure for the policy neural network. Similar to our previous work [44], we chose a network architecture with two modules, Figure 2. Both modules have the 8 inputs we detailed above and two hidden layers of 100 and 30 neurons.
The first module, the pairwise-interaction module, contains 6 output neurons, Figure 2A. For each neighbor i, they produce The outputs are symmetrized with respect to reflections in the x − y and y − z planes, with the exception of p 1,i (mean azimuth angle change, anti-symmetrized with respect to the x − y plane) and p 2,i (mean elevation change, anti-symmetrized with respect to the y − z plane).
The previous values, {p µ j,i , p σ j,i } j=1,2,3 , encode the mean and the scaled variance of three normal distributions (clipped between 0 and 1) used to sample three variables, For each neighbor, i, we sample values of p 1,i , p 2,i and p 3,i independently from the respective distributions, Figure 2B.  The second module, the aggregation module, has a single output, W ( Figure 2C). It is symmetrized with respect to both the x − y and y − z planes. It is clipped between −15 and 15 and there is an exponential non-linearity after the single-neuron readout signal to make it positive.
The final output combines both modules, Figure 2D. It is calculated by summing the outputs of the pairwise-interaction modules applied to the set of all neighbors, I, using the outputs of the aggregation module as normalized weights, where we combined p 1,i , p 2,i , and p 3,i as components of a vector P i . The final outputs used to update the dynamics of the agent, p 1 , p 2 , p 3 are the components of P.
Everywhere in this paper, the set of neighbors considered, I, consists of all the other fish in the same environment as the focal. Even if this is the case, note that the introduction of the aggregation module acts as a simulated attention that selects which neighbors are more relevant for a given state and policy.

Optimizing the Neural Network Parameters
Following previous work [33,34], we improved the local rule using an "Evolution Strategies" algorithm [30,31]. The text in this section is an explanation of its main elements.
Let us denote by #» ω the neural network weights at every iteration of the algorithm, and by R( #» ω) the reward obtained in that iteration (sum during the episode). A change in parameters that improves the reward can be obtained by following the gradient with small steps (gradient ascent), where λ is the learning rate (see Supplementary Text for the values of hyper-parameters in our simulations). Repeating this gradient ascent on the reward function, we approach the desired collective behavior over time. As we explain below, we perform this gradient while co-varying the parameters of all agents.
In contrast, the naive application of policy gradient would be equivalent to performing the gradient with respect to the parameters of one of the agents, keeping the parameters of others constant. This could produce learning inefficiencies or even failure to find the desired policy.
We estimate the gradient numerically from the rewards of many simulations using policy networks with slightly different parameters. We first sample K vectors #» ǫ i independently from a spherical normal distribution of mean 0 and standard deviation σ , with as many dimensions as parameters in the model. We define # » ǫ i+K = − #» ǫ i , i = 1,...,K. We calculate 2K parameter vectors #» ω i = #» ω + #» ǫ i . We then conduct a single simulation of a fish collective -all agents in the same environment sharing the same values #» ω i -and we record the reward R i . Then, we use 1 to perform gradient ascent [34].
We refer to Figure 3 for four example training runs using the algorithm, and to Figure 4 for the pseudo-code of the algorithm.

Measurements of XY-Plane Interaction
As in previous work [44], we described changes in the azimuth angle using the approximate concepts of attraction, repulsion, and alignment. We define the attraction-repulsion and the alignment score as useful quantifications of these approximate concepts. Please note that these scores are not related to reward. We obtained the attraction-repulsion and alignment scores from a centered and scaled version of p µ 1 : where we chose to only explicitly highlight its dependence with the relative neighbor orientation in the XY plane, φ i . This relative neighbor orientation can be calculated as the difference of the azimuth angle of the neighbor and the azimuth angle of the focal fish. Attraction-repulsion score is defined by averagingp 1,i over all possible relative orientations of the neighbor in the XY plane.
We would say there is attraction (repulsion) when the score is positive (negative). The alignment score is defined as As in [44], we arbitrarily decided that alignment is dominant (and thus that the point is in the alignment area) ifp 1,i changes sign when changing the relative orientation of the neighbor in the XY plane, φ i . Otherwise, it is in an attraction or repulsion area, depending on the sign of the attraction-repulsion score [44]. Under this definition, repulsion (attraction) areas are the set of possible relative positions of the neighbor which would make a focal fish turn away (toward) the neighbor, independently of the neighbor orientation relative to the focal fish

RESULTS
To simulate collective swimming, we equipped all fish with an identical neural network. At each time step, the neural network analyzes the surroundings of each fish and produces an action for that fish, dictating change in its speed and turning, Figure 1. Under such conditions, the neural network encodes a local rule and by varying the weights within the network, we can modify the nature of the local rule and thus the resulting group level dynamics. As in previous work [44], we enabled interpretability by using a neural network built from two modules with a few inputs and few outputs each, Figure 2. A pairwise-interaction module outputs turning and change of speed with information from a single neighbor, i, at a time. It is composed of two parts. The first part outputs in a deterministic way mean values and variances for each of the three parameters encoding turning and change of speed, Figure 2A. The second part consists in sampling each parameter, p 1,i , p 2,i , p 3,i , from a clipped Gaussian distribution with the mean and variance given by the outputs of the first part, Figure 2B.
An aggregation module outputs a single positive number expressing the importance carried by the signal of each neighbor, Figure 2C. The final outputs of the complete modular neural network are obtained by summing the results of the pairwiseinteraction module, weighting them by the normalized outputs of the aggregation module, Figure 2D. The final outputs, p 1 , p 2 and p 3 , determine the motor command. We perform these computations for each agent, and use the outputs to determine the position and speed of each agent in the next time step (Equations 1-3).
We introduced a reward function, measuring how similar are the produced trajectories to the desired group behavior (see section 2 for details). We used one of four different reward functions to encourage the emergence of one of four different collective configurations, all of which have been observed in natural groups of fish. These patterns are the rotating ball, the rotating tornado, the rotating hollow core mill and the rotating full core mill.
We used evolutionary strategies to gradually improve the performance of the neural network at the task of generating the desired collective configurations. The value of the reward function increased gradually during training for all four patterns, Figure 3. After several thousand time steps, the reward plateaued and the group collective motion was visually highly similar to the desired one (see Supplementary Videos 1-4). We tested that the agents learned to generate the desired collective configurations by using independent quantitative quality indexes (Supplementary Text and Figure S1). They show that agents learn first to come together into a compact formation, and then to move in the right way, eventually producing the desired configurations. We also checked that the configurations are formed also when the number of agents is different to the number used in training (see Supplementary Videos 5-8 for twice as many agents as those used in training).

Description of Rotating Ball Policy
Here, we use the low dimensionality of each module in terms of inputs and outputs to describe the policy with meaningful plots. We describe here the policy of the rotating ball ( Figure 5). The equivalent plots for the other three configurations can be found in Figures S2-S4.
The pairwise-interaction module outputs three parameters for each focal fish, all bounded between 0 and 1. The first one, p 1 , determines the change in azimuth, that is, rotations in the XY plane (Figure 5, first column). To further reduce dimensionality in the plots, we simplify the description of the policy in this XY plane by computing attraction-repulsion and orientation scores (see Equations 12, 13, section 2). These scores quantify the approximate concepts of attraction, repulsion and orientation. In Figure 5, we plot the alignment score in the areas where alignment is dominant, and the attraction-repulsion score otherwise.
The attraction areas give the neighbor positions in this XY plane which make a focal fish (located at x i = y i = 0) swim toward the neighbor, independently of the neighbor orientation ( Figure 5, first column, orange). The focal fish turns toward the neighbor if the neighbor is far (> 1.5 BL). Repulsion areas are the relative positions of the neighbor which make a focal fish swim away from the neighbor (Figure 5, first column, purple). If the neighbor is closer than 1 BL, but not immediately in front or at the back of the focal, the focal tends to turn away from the neighbor (purple). In the areas where alignment is dominant (alignment areas), we plotted the alignment score ( Figure 5, first column, gray). If the neighbor is at an intermediate distance, or in front or in the back of the focal fish, the focal fish tends to orient with respect to the neighbor in the XY plane.
The second parameter, p 2 , determines the elevation angle ( Figure 5, second column). p 2 is 0.5 on average, and thus elevation angle is zero on average, when the neighbor is on the same XY plane as the focal (Figure 5, second column, second row). When the neighbor is above, the focal agent tends to choose a negative elevation when the neighbor is close (Figure 5, second column, blue), and a positive elevation if it is far (red). The #» ω 0 ←Xavier initialization for i = 1, ..., n epochs do for j = 1, ..., K random perturbation do sample ǫ j ← σ # » N(0, I) reset environment to random initialization S i,j compute R + j , reward of simulation with parameter #» ω i−1 + ǫ j reset environment to previous initialization S i,j compute R − j , reward of simulation with parameter #» anneal learning rate: λ ← γ λ λ anneal s.d. of perturbations: σ ← γ σ σ end for FIGURE 4 | Evolution strategies algorithm. FIGURE 5 | Policy producing a rotating ball, as a function of neighbor relative location. Each output (three from the pair-interaction, one from the aggregation) is shown in a different column. All columns have three diagrams, with the neighbor 1 BL above (top row), in the same XY plane (middle row) and 1 BL below the focal (bottom row). Speed of both the focal and neighbor has been fixed to the median in each configuration. In addition, in columns 2-4, we plot the average with respect to an uniform distribution over all possible relative neighbor orientations (from −π to π). First column: Interaction in the XY plane (change of azimuth). Instead of plotting p 1 , we explain interaction using the approximate notions of alignment (gray), attraction (orange), and repulsion (purple) areas, as in [44]. Alignment score (gray) measures how much the azimuth changes when changing the neighbor orientation angle, and it is computed only in the orientation areas (see Methods, Section 2.5). Attraction (orange) and repulsion scores (purple) measure how much the azimuth change when averaging across all relative orientation angles, and we plot it only outside orientation areas. Second column: Elevation angle, through the mean value of the p 2 parameter. Blue areas indicate that the focal fish will move downwards (p 2 < 0.5), while red areas indicate that the focal fish will move upwards (p 2 > 0.5). Third column: Active change in speed, through the mean value of the p 3 parameter. Darker areas (large mean p 3 ) indicate increase in speed, and lighter areas indicate passive coast. Fourth column: Output of the aggregation module. Neighbors in the darker areas weight more in the aggregation. opposite happens when the neighbor is below: the focal agent tends to choose a positive elevation when the neighbor is close, and a negative elevation if it is far (Figure 5, second column, third row).
The third parameter, p 3 , determines the active speed change (Figure 5, third column). The active speed change is small, except if the neighbor is in a localized area close and behind the focal.
The aggregation module outputs a single positive output, determining the weight of each neighbor in the final aggregation. In the rotating ball policy, the neighbors that are weighted the most in the aggregation are the ones closer than 1 BL from the focal (Figure 5, third column). Neighbors located in a wide area behind the focal, but not exactly behind it, are assigned a moderate weight.
Note that the aggregation module is not constrained to produce a local spatial integration, since the network has access to every neighboring fish. However, we can observe how an aggregation module like the one shown for the rotating ball ( Figure 5, third column) would preferably select a subset of closest neighbors-i.e., integration is local in space. This is also true for the other configurations: for each desired collective configuration we could always find policies with local integration (e.g., Figures S2-S4).

Description of Policy Differences Between Configurations
In the previous section, we described the policy we found to best generate a rotating ball. The policies we found that generate the other three configurations have many similarities and some consistent differences, Figure 6. In this section, we will highlight these differences.
The policy generating a tornado has an attraction-repulsion pattern somewhere in between the rotating ball and the full core milling (Figure 6, first row). The major feature that distinguishes this policy from the others is a strong repulsion along the z-axis, i.e., the area of the plot where the focal fish changes elevation to move away from the neighbor is larger (Figure 6, second column, second row, blue area). This was expected, as to form the tornado the agents need to spread along the z-axis much more than they do in the other configurations.
The policy generating a full-core mill has an increased repulsion area, particularly in the frontal and frontal-lateral areas (Figure 6, third column, first row, purple areas). The policy generating a hollow-core mill has an increased alignment area, weakening the repulsion area (Figure 6, fourth column, first row, gray area). It also has an increased area with a high change in velocity (Figure 6, fourth column, third row, red area). Both mill policies have an extended aggregation area, especially the hollow core (Figure 6, third and fourth column, fourth row), and lack an area where the focal fish changes elevation to move away from the neighbor (Figure 6, third and fourth column, second row, red area). The almost absence of repulsion in the z-direction makes both mills to be 2D structures, whereas repulsion in the sphere and tornado makes them 3D.
The highlighted differences between policies are robust (see Figure 5 and the Supplementary Material for Figures S2-S4 and Figures S6-S13 to see the robustness of the results in different runs). For instance, we have compared the policies by restricting the fish to move at the median speed is each configuration. The highlighted differences are still valid when the fish move with a common median speed across configurations (Figure S5).

Adding a Retina to the Agents
In the preceding section, the observations made by each agent were simple variables like position or velocities of neighbors. This simplification aided analysis, but animals do not receive external information in this way but by sensory organs.
We checked whether we could achieve the group configuration we have studied when the input to the policy for each agent is the activation of an artificial retina observing the other agents. The retina is modeled using a three-dimensional ray-tracing algorithm: from each agent, several equidistant rays project until they encounter a neighbor, or up to a maximum ray length r. The state, i.e., the input to the policy, is the list of the ray lengths. Information about the relative velocity was also given as input to the policy by repeating this computation with the current position and orientation of the focal, but the previous position of all the other agents. See Supplementary Text for a detailed description.
We approximated the policy using a single fully-connected network. Using the interaction and attention modules described in section 2.3.2 would not have added interpretability in this case, because the number of inputs is too large. By using the same evolutionary strategy, we were able to obtain a decision rule leading to the desired collective movement configurations (Supplementary Videos 9-12).
Although these configurations were qualitatively similar to the ones we obtained with the modular network (Figure S14), the average inter-agent distance was greater. This resulted in less compact configurations (Tables S3, S5). This effect might be the result of the increased complexity and decreased accuracy of the inputs given to the policy by the retina, which causes the agents to avoid each other more than in simulations without retina.

DISCUSSION
We have applied evolutionary strategies (ES) to automatically find local rules able to generate desired group level movement patterns. Namely, we found local rules that generate four complex collective motion patterns commonly observed in nature [35,45]. Three of these four patterns have, to our knowledge, not yet been generated using self-propelled particle models with local rules.
We used neural networks as approximators of the policy, the function mapping the local state to actions. The naive use of a neural network would produce a black-box model, that can be then analyzed with different post-hoc explainability strategies (inversion [46], saliency analysis [47], knowledge distillation [48], etc.; see [49,50] for reviews). Instead, as we did in [44], we designed and trained a model that is inherently transparent (interpretable). This alternative improves our confidence in the model understandability: the model is its own exact explanation [51].  Figure 5, except in the case of the elevation (second column), that is equivalent to the panel in the top row. For each configuration, the speed of the focal and neighbor has been fixed to the configuration median: 0.42, 0.32, 0.75, and 1.54 BL/s, respectively.
We used a modular policy network, composed by two modules. Each module is an artificial neural network with thousands of parameters, and therefore it is a flexible universal function approximator. However, we can still obtain insight, because each module implements a function with low number of inputs and outputs that we can plot [44]. Similar to what we obtained from experimental trajectories of zebrafish [44], we found the XY-interaction to be organized in areas of repulsion, orientation, and attraction, named in order of increasing distance from the focal fish. We were also able to describe differences between the policies generating each of the configurations.
To find the local rules generating the desired configurations, we used a systematic version of the collective behavior modeling cycle [15]. The traditional collective behavior modeling cycle begins with the researcher proposing a candidate rule and tuning it through a simulation based feedback process. Here, we parameterize the local rule as a neural network. Since neural networks are highly expressive function approximators which can capture a very diverse set of possible local rules, our method automates the initial process of finding a set of candidate rules. As in the case of the traditional modeling cycle, our method also relies on a cost function (the reward function) and numerical simulations to measure the quality of a proposed rule. The process of rule adaptation is automated by following a gradient of the reward function with respect to neural network weights. Just like the modeling cycle, our method uses iterations that gradually improve the policy until it converges to a satisfactory solution.
There are theoretical guarantees for convergence in tabular RL, or when linear approximators are used for the value functions [40]. However, these theoretical guarantees do not normally extend to the case where neural networks are used as function approximators [52], nor to multi-agent RL [25]. Here we report an application where the ES algorithm was able to optimize policies parameterized by deep neural networks in multi-agent environments. To our knowledge, there were no theoretical guarantees for convergence for our particular setting.
The method we have proposed could have several other interesting applications. In cases where it is possible to record rich individual level data sets of collective behavior, it can be possible to perform detailed comparisons between the rules discovered by our method and the ones observed in experiments [44,53]. The method could also be applicable to answer more hypothetical questions such as what information must be available in order for a certain collective behavior to emerge. Animals may interact in a variety of ways including visual sensory networks [54], vocal communication [55], chemical communication [56] and environment modification (stigmergy) [57]. Animals also have a variety of cognitive abilities such as memory and varying sensory thresholds. By removing or incorporating such capabilities into the neural networks it is now possible to theoretically study the effects these factors have on collective behavior patterns.
Here we relied on an engineered reward function because the behaviors we were modeling have not yet been recorded in quantitative detail. In cases where trajectory data is available, detailed measures of similarity with observed trajectories can be used as a reward [33,58]. Moreover, we can use adversarial classifiers to automatically learn these measures of similarity [19,59]. Further interesting extensions could include creating diversity within the group by incorporating several different neural networks into the collective and studying the emergence of behavioral specialization and division of labor [60].
The present work may be used as a normative framework when the rewards used represent important biological functions. While prior work using analytic approaches has been successful for simple scenarios [42,61], the present approach can extend them to situations in which no analytic solution can be obtained.

DATA AVAILABILITY STATEMENT
The datasets were generated using the software from https:// gitlab.com/polavieja_lab/rl_collective_behaviour.

AUTHOR CONTRIBUTIONS
AL and GP devised the project. TC, AL, FH, and GP developed and verified analytical methods. TC wrote the software, made computations and plotted results with supervision from AL, FH, and GP. All authors discussed the results and contributed to the writing of the manuscript.