Assessing the role of environmental factors on Baltic cod recruitment, a complex adaptive system emergent property

has led many researchers to advocate ANNs as an attractive, non-linear alternative to traditional statistical methods. The goal of this study is to design a Baltic cod recruitment model (FishANN) that can account for complex ecosystem interactions. To this end, we (1) build a quantitative model representation of the conceptual understanding of the complex ecosystem interactions driving Baltic cod recruitment dynamics, and (2) apply the model to strengthen the current capability to project future changes in Baltic cod recruitment. FishANN is demonstrated to bring multiple stressors together into one model framework and estimate the relative importance of these stressors while interpreting the complex non-linear interactions between them. Additional requirements to further improve the current study in the future are also proposed.


INTRODUCTION
Natural living systems are characterized by a high level of complexity, which results from the diversity of their biological components and from the diversity of possible types of interaction (physical, chemical, trophic, behavioral, cognitive;Planque et al., 2014). Many biological interactions are non-linear and biological systems display a remarkable ability to constantly adapt and reconfigure themselves (Planque et al., 2014). Such systems, including marine ecosystems, are characterized by multiple possible outcomes and by the potential for rapid changes and ecological surprises (Doak et al., 2008;Levin and Lubchenco, 2008;Scheffer, 2009;Glaser et al., 2011;Möllmann et al., 2011). This display of high complexity, non-linearity, and adaptability has identified these systems as Complex Adaptive Systems (CAS, Levin, 2005).
For decades, fish recruitment has been a matter of intensive research (e.g., Ricker, 1954;Cushing, 1971;Rothschild, 2000) and is recognized as a key element in management decisions (Fernandes et al., 2015). Models of stock-recruitment (e.g., Ricker, 1954;Beverton and Holt, 1957;Cushing, 1996) commonly used for recruitment prediction often explain only a minor fraction of the inter-annual variation in recruitment. This lack of predictive strength has been assumed to be explained by the influence of different factors in the environment (O'Brien and Little, 2006;Margonski et al., 2010). Consequently, this research area has evolved from considering simply the spawning stock biomass (SSB), to include also use of environmental factors and relationships that can modulate recruitment (Köster et al., 2005;Schirripa and Colbert, 2006). Presently, stock recruitment models along with other regression methods including environmental variables (Planque and Buffaz, 2008), are commonly used for recruitment prediction (Fernandes et al., 2010). This use of environmental information to improve the ability to predict recruitment, could contribute considerably to fisheries management (De Oliveira et al., 2005;Yuan et al., 2012).
A large number of studies have been undertaken using different techniques, to utilize such environmental information to predict recruitment (e.g., Chen and Ware, 1999;Bailey et al., 2005;Dreyfus-León and Chen, 2007;Dreyfus-León and Schweigert, 2008;MacKenzie et al., 2008;Ruiz et al., 2009). Nevertheless, the recruitment problem remains difficult because the mechanisms behind such complex relationships are often poorly understood; this in turn leading to the failure of some proposed relationships, methods, and performance estimations, when new data become available (Myers et al., 1995). Such failures may be related to new controls, which were not previously considered (Myers et al., 1995;Planque and Buffaz, 2008), or to limitations of the available data (Schirripa and Colbert, 2006).
The Baltic Sea has been studied extensively and is identified as being a region of rich data with extensive time series from environmental and biological monitoring (Eero et al., 2015). It is an ecosystem that "has been and will continue to be impacted by various stressors including climate change, exploitation, and eutrophication" (MacKenzie et al., 2012). During the last three decades, it is hypothesized that the Baltic has undergone a regime shift, altering the functioning and structure of its zooplankton and fish communities. The new stable state has been considered as "cod hostile" (e.g., Möllmann et al., 2008Möllmann et al., , 2009Lindegren et al., 2010) due to reduced spawning success in cod, owing to adverse hydrological conditions as well as increased predation on cod eggs and declining food resources for cod larvae (Casini et al., 2009).
Several changes in the biological parameters of Baltic cod, managements and fisheries have occurred in recent years, all of which could have potentially had a positive influence on the development of the stock . However, there was a fail in the analytical stock assessment in 2014, leaving unclear the present stock status. Changes in ecological and environmental conditions have led to an unusual situation for Baltic cod and it has been suggested that the stock is "in distress" (Eero et al., 2015). Clearly, the recruitment dynamics of Baltic cod are "more complex than previously thought" (Köster et al., 2005), thus, our understanding in the status of cod in the Baltic Sea and efforts in solving stock assessment and recruitment issues is "ongoing and will likely continue in the coming years" (Eero et al., 2015). The current conceptual model of the understanding of Baltic cod recruitment dynamics is given by Köster et al. (2003) and Köster et al. (2005), and references therein. "Hydrographic conditions in the central Baltic were affected by large-scale climatic conditions during the 80s and 90s, resulting in higher than normal temperatures in the intermediate and bottom water and declining salinity and oxygen concentrations in the deep Baltic basins" (Köster et al., 2005). Reproductive success of the top-predator cod declined and "anoxic conditions in deep layers at important spawning sites caused severe egg mortalities" (Köster et al., 2005). The observed recruitment decline of the 1980s was believed to be related to these "climate-induced changes in the physical environment" (Köster et al., 2005). Herring stock sizes and especially sprat, important planktivorous predators and in the case of sprat a key predator on cod eggs, increased substantially. Furthermore, "a temperature-related increase in the sprat stock intensified egg predation" (Köster et al., 2005).
Even though the hydrographic conditions facilitative for survival of early life stages improved during the 90s, cod recruitment success remained far below average. The lack of recruitment recovery in the mid-90s, "despite improved hydrographic conditions for egg development, is related to poor larval survival" (Köster et al., 2005) and cod egg predation (Köster and Möllmann, 2000;Casini et al., 2009). A decline in the abundance of the copepod Pseudocalanus sp. (Möllmann et al., 2008) related to lower salinity and increased predation by clupeid fish, mainly sprat (Casini et al., 2009), caused food limitation for first-feeding cod larvae. Few exceptions in recruitment in the 90s have been suggested to be due to relatively high wind speeds (affecting transport and prey encounter via turbulence), and below average temperatures, favoring oxygen conditions for egg survival.
It has long been recognized that non-linear population processes and environmental forcing can generate dramatic fluctuations in recruitment of marine fish and invertebrate stocks (Glaser et al., 2011;Mäntyniemi et al., 2015). It is difficult to clarify and model the mechanisms controlling recruitment by using conventional methods because fish populations have complex and non-linear response, and reactions to biotic interreactions and environmental changes (Olden and Jackson, 2001). Machine-learning techniques such as Artificial Neural Networks (ANNs) have been proposed as "an appropriate approach with some desirable properties to address such problems" (Dreyfus-León and Chen, 2007;Uusitalo, 2007;Fernandes et al., 2013). ANNs gained momentum in mid-1980s (Rumelhart et al., 1986 and subsequently they have been used in a variety of ecological applications. ANNs mimic biological neural networks to fit and find patterns in complex data (Lawrence, 1993). For a more comprehensive treatment of ANN applications in ecology, we suggest the books by Lek and Guegan (2000), Lek et al. (2005), and Recknagel (2006).
ANNs have been shown to approximate competently many complex functions and have demonstrated advantages in predictive ability over general linear models (Hastie et al., 2001). In many cases they are documented to perform better than multiple regressions when there are non-linear relationships between the variables Brey et al., 1997). Their main advantage compared with multiple regression models is that they do not require an a priori model as they can learn from existing data. Additionally, ANNs can model systems that involve complex non-linear relationships between variables and outcomes multiple dependent variables, can learn from "noisy data" (Kasabov, 1996) and the outputs of the multilayer perceptron (MLP), a type of ANN, can be interpreted as "probabilities" (Watts and Worner, 2008). They can generalize well if over-fitting is avoided, in other words they can fit and classify data they have not been trained on with accuracy. They also do not require the transformation of non-linear variables, unlike classical linearization techniques, which can "improve the results but have often failed to fit the data" (Sun et al., 2009). All these features make the application of ANNs "a powerful approach for exploring complex biological problems such as recruitment forecasting" (Chen and Ware, 1999).
This utility of neural networks for solving non-linear problems has been demonstrated in aquatic studies and has led many researchers to comment on neural networks as an attractive (non-linear) alternative to traditional statistical methods (Olden and Jackson, 2002). A thorough revision of the use of neural networks in fisheries research was performed by Suryanarayana et al. (2008) where applications in forecasting, fisheries management distribution and classification since 1978 where reviewed. In addition, Quetglas et al. (2011) also reviewed the utilities of neural networks in marine ecology and fisheries science during the 1990s and 2000s. Some interesting applications of ANN to management and forecasting of fisheries include the modeling of recruitment, stock biomass, abundance, catch of different fisheries, and distribution (Komatsu et al., 1994;Lek et al., 1996;Mastrorillo et al., 1997;Chen and Ware, 1999;Laë et al., 1999;Chen et al., 2000;Gevrey et al., 2003;Maravelias et al., 2003;Zhou, 2003;Joy and Death, 2004;Chen and Hare, 2006;Glaser et al., 2011;Olden et al., 2011;Russo et al., 2011Russo et al., , 2014. Many published works on ANNs in marine ecology compare this approach with classical multivariate statistical methods, such as multiple linear regression (MLR) or discriminant analysis (DA). Quetglas et al. (2011) state that in all cases, ANNs either outperformed (e.g., Baran et al., 1996;Lek et al., 1996;Mastrorillo et al., 1997;Laë et al., 1999;Ibarra et al., 2003;Wagner et al., 2006) or at least performed as well (e.g., Power et al., 2005;Fang et al., 2009) as classical linear and non-linear modeling methods, such as linear regression and generalized additive models with respect to prediction accuracy.
There are only two publications in literature that have applied ANNs in studies on Baltic cod. Wieland and Jarre-Teichmann (1997) used an artificial neural network model to successfully predict the vertical distribution of Baltic cod eggs. Fuchs (1996) utilized ANN as a tool to investigate how the changed hydrography influences the cod distribution in the Baltic. To our knowledge, ANNs have not been applied to investigate recruitment of Baltic Cod. Thus, the goals of this study were to (1) build a quantitative model representation of the conceptual understanding of the complex ecosystem interactions driving Baltic cod recruitment dynamics, and (2) apply the model to strengthen the current capability to project future changes in Baltic cod recruitment.

Data Sources
The data utilized cover the ICES subdivisions 25, 26, and 28-2 in the Baltic Sea (Figure 1), thus encompassing the three major basins known as nursery grounds of eastern Baltic Cod. Here, we analyze biotic and abiotic information as combined time series from all three subdivisions, assuming little effect of transport between spawning areas (Köster et al., 2005). Cod Reproductive Volume from May to August (RVmay, RVaug) were calculated and applied separately as the water column volume with salinity above 11 and dissolved oxygen above 2 ml/l, i.e., proper conditions ensuring fertilization and egg development of cod (Plikshs, 2014). Datasets where obtained from M. Plikshs (Fishery Resources Research Department, Riga, Latvia) and B.R. Mackenzie (Technical University of Denmark, Charlottenlund, Denmark). The effects of natural and fishing mortality (i.e., survival and harvest) were considered variable over time. Natural and Predation Mortality parameters (NatMort) at Age 2 of Cod, are basin-wide estimates derived from the Stochastic Multispecies Model (SMS), a data-driven model frequently used for generating analytical fish stock assessments and forecasts as basis for ICES scientific advice (Lewy and Vinther, 2004). Fishing mortality (FishMort) and SSB estimates are derived from published ICES stock assessments (ICES, 2010).
As an estimate of recruitment we used the number of cod recruits to the fishery at an age of 2. Earlier estimates of year-class strength are difficult to assess due to cannibalism effects (Köster et al., 2003). As a result, regular single species assessments of cod employ Age-2 as the youngest age group and presentations of stock-recruitment relationships refer in general to age-group 2 (Köster et al., 2003) as the first age fully recruited to and sampled by the fishery. Abundance index of Age-2 Baltic Cod is derived from published ICES stock assessments and was used as an estimate of the number of cod recruits to the fishery (Köster et al., 2003). Additionally, we consider habitat driven variables, proposed by Köster et al. (2005), that potentially explain changes in recruitment regimes, namely: Oxygen-related Egg Survival (OES), Relative egg Predation Pressure (RPP), Egg abundance (EA), and Larval Abundance (LA; Köster et al., 2005). We obtained the published 1976-1999 time series of these variables from Köster et al. (2005). Information on the variables employed in this study including duration of time series and data source are presented in Table 1. Additionally, 2000-2009 data time series were used as independent test sets to assess the predictive accuracy of those models for which all input and target variables were available during this period (also presented in Table 1). Time series plots for the 9 input variables and cod at Age-2 are presented in Figure 2. In order to enable a potential operational 2-year-ahead forecast of eastern Baltic cod recruits we only consider model input variables available 2 years ahead of the estimated recruit abundance. For habitat driven variables and SSB this is representative of their direct influence on the earliest life stages of the cod (eggs and larvae). In case of natural and fishing mortality here we use a sum of the given mortality term on cod age 0, 1, and 2 from 2years back with respect to the given cod Age-2 recruit abundance estimate. While these indices do not match the conceptual direct influence these variables have on cod recruitment to the fishery in a given year, they nevertheless explain a large part of the target variable variability. This approach is deliberately different from a typical analytical stock assessment model procedure, which would replace the in-year natural and fishing mortality estimates with the average from last few years as proxies Frontiers in Marine Science | www.frontiersin.org  for these variables, when running the models in forecast mode.

Modeling Approach
In terms of structure, ANNs are non-linear models that apply weighted links between input neurons (predictors), hidden neurons, and output neurons (results) related to a response. Training any type of supervised ANN consists of using a response training dataset to adjust the connection weights in order to minimize the error between observed and predicted response values. These weights are the parameters of ANNs, initially chosen from random numbers and are refined until convergence or maximum clock time set is over. In feed-forward networks, neurons from one layer are interconnected to all neurons of neighboring layers, but no connections are established within a layer or feedback connection. After being trained to match inputs Frontiers in Marine Science | www.frontiersin.org and response outputs with existing data, an ANN can be used for forecasting with new inputs. We use feed-forward neural networks, with a single hidden layer, trained in supervised mode using the back propagation algorithm (Rumelhart et al., 1986). The modeling process in back-propagation neural networks is more direct, as there is no necessity to specify a mathematical relationship between the input variables and output responses (Goh, 1995). This category of neural networks, referred to as MLPs, is commonly used in ecological studies as MLPs are suggested to be universal approximators of any continuous function (Hornik and White, 1989;Olden et al., 2004). The MLP was first used in ecological studies in fisheries in Komatsu et al. (1994) and Lek et al. (1995), and since then has been successfully applied to several different problems in ecology (Kimes et al., 2000;Quetglas et al., 2011). In our study, calculations were done in the MathWorks Matlab environment using the Neural Networks Toolbox.

Model Training and Testing Procedure
The MLP neural network models we used are trained via Bayesian regularized method (BR) using MATLAB's native "trainbr" function. BR training tends to show better performance when the relationship between variables is non-linear (Sovan et al., 1996;Archontoula et al., 2003). It also gives potentially better generalizations because of its capability to automatically select regularization parameters to optimize the network architecture and limit the likelihood of over-fitting Winkler, 1999, 2000). We chose the estimated mean of squared errors (MSE) as the representative metric of performance of the model. The smaller the MSE, the better the model fit. To further minimize the risk of over-fitting the model, a validation stop criterion was introduced to the BR training function, hence the model training stopped as soon as it failed to minimize its error for n-number consecutive training cycles, here n = 20 (e.g., Sjöberg, 1995;Amari et al., 1997).
To further avoid overtraining and to maximize generalizability (i.e., ability to project correctly in unsupervised mode) three-fold cross-validation was used to assess model prediction accuracy, meaning that 1/3 of all available data points from the 1976-1999 input time series were always excluded from training a given set of models. In the training set the data was further divided randomly for training and validation (70 and 30%, respectively) with a different random split for each model. The training data set was used to iteratively adjust the connection weights until a minimum error was found. The validation data set was used to monitor network error after each training cycle, enabling training to be terminated when the network began to over-train or over-fit the data. We compared the test set performances from the three-fold cross-validation and selected the model with the best of the three test set MSEs.

Model Ensemble
Stochasticity needs to be considered as an important trade-off when using data-driven approaches such as ANNs. In order to account for the variability in model results, dependent on the initial choice of weights and random data splitting during training, we employed an ensemble model approach similar to the suggestions discussed in Zhou (2003), and De Oña and Garrido (2014). For each training data set we trained an Nnumber of replicate models (herein N = 35), with identical architecture (see Section Model Ensemble) but different set of initial weights. The performance of the given ensemble member model (m#; # = 1, 2, . . . , 35) was estimated on the test set through a cross-validation procedure described above. The ensemble model performance was considered equal to the median of all 35 replicate models.
To test the hypothesis that the larger the number of ensemble members, the higher the performance of the ensemble, we performed a sensitivity experiment, the results of which are presented in Figure 3. In this analysis we checked the test set performance of each ensemble and plotted the MSE of the ensemble best model as a function of the number of ensemble members ranging from 1 to 90, again applying three-fold crossvalidation as above. Each ensemble configuration was run with 10 replicates, and only the median of the 10 results was plotted in Figure 3.
We found that there is a marked and consistent increase in performance (decrease in MSE) of the ensemble best as the number of ensemble members increased from 1 to around 10 to 15. However, no consistent and significant gain in performance could be achieved when increasing the number of ensemble members beyond the number 35. Figure 3 also illustrates that this result does not depend on whether we select the best, the mean or the median of the three-folds during cross-validation when used to report the performance of each individual ensemble members. While the absolute value of the ensemble MSE changes, all three lines show little if any gain in ensemble performance beyond 35 ensemble members.

Model Architecture
The general architecture of the neural network model used in this study is the same, i.e., there is one input layer, one output layer, and a single so-called hidden layer with three computing units (neurons). The number of neurons was set to three, optimized based on the results of sensitivity analysis performed according to the following procedure. Using a number of neurons in the range from 1 to 30 we trained 20 nets with 9 input variables for every number of neurons and recorded their average (mean and median) performance (MSE criterion), both on training and test sets (Figure 4). The 20 nets differed in terms of their initial weights and distribution of training and test set on the 1976-1999 input time series. The mean of the 20 replicates revealed no consistent trends but the median results from 20 replicates revealed that while training errors kept decreasing until the number of neurons reaches 7 and then showed little if any change, the median test set error reached a global minimum already at three neurons in the hidden layer. Therefore, we opted for the MLP model configuration with the simplest architecture and highest generalization capability, i.e., with three neurons in the hidden layer.

Input Layer Structure-Feature Selection
For the purpose of designing an optimal model capable of predicting eastern Baltic cod recruits 2-years ahead we constructed a series of MLP neural networks that differ from each other by the number of input variables considered, on a range from one to nine. In order to reduce the number of potential input variable combinations we used a quantitative feature selection method based on the results of variable importance analysis, described in detail below. Additionally, we adopted a qualitative criterion which restricted the number of potential input variables to five, which had available data points from the entire time period under study (RVmay, RVaugust, SSB, NatMort, FishMort). Therefore, feature selection was applied independently on two original models: MLP1-with nine input variables, and MLP2-with five input variables. Based on the results of feature selection analysis, which discarded those variables whose importance was significantly lower than others based on the median variable importance from the MLP ensemble, we then identified and retrained two new models with a reduced number of input variables (MLP3 and MLP4). Performance of reduced models was compared to the original ones with a full feature suite. Finally, we include one more model into the analysis with only a single input variable SSB (MLP5), thus mimicking the simplest possible stock-recruitment relationship model.
Variable importance-based feature selection was performed using two metrics derived from the distribution of connection weights inside a trained model: (i) product-of-standardizedweights (PSW) first defined in Garson (1991) and as applied in Russo et al. (2011), and (ii) product-of-connection-weights (PCW) based on Olden and Jackson (2002) and Olden et al. (2004). Measures of variable importance for both methods were obtained using MATLAB code. The two approaches are briefly described below. For full documentation we refer to the publications mentioned above.
In the PSW method the algorithm partitions hidden-output connection weights into components associated with each input neuron using absolute values of connection weights. It is important to note that Garson's PSW algorithm uses the absolute values of the connection weights, thus the variable contributions cannot reveal anything about the direction of the relationship between the input and output variables. The PCW approach of Olden and Jackson (2002) calculates the product of the raw input-hidden and hidden-output connection weights between each input neuron and output neuron and sums the products across all hidden neurons. According to Olden et al. (2004), the application of Garson's algorithm may often be inadequate because using absolute connection weights does not account for counteracting connection weights in cases where there are opposite directions for incoming and outgoing weights from the hidden layer neurons. Therefore, when the two metrics revealed contrasting variable importance ranks, we base our judgments on the PCW approach. Table 2 lists all the models considered in this study, both before and after the qualitative feature selection.

Feature Selection Analysis
Results of the feature selection process based on input variable importance analysis are presented in Figure 5. Using two different algorithms to estimate the neural network connection weights associated with individual input variables, we arrived at similar conclusions as to which features (input variables) are potentially redundant in the two models which use the maximum number of available input variables for either the 1976-1999 training period (MLP1) or the entire time period under study ). Both the PCW and the PSW methods applied to the MLP1 ensemble showed on average the lowest importance given to: RVmay, RVaug, FishMort, OES, and EA. According to the PSW method neither of these variables passed a 10% relative importance in determining the model output, taken as the mean of all ensemble members. NatMort and RPP were ranked as the most important variables with LA and SSB values also reaching above 10% on average across all ensemble members. NatMort and RPP were also the variables with highest average connection weights according to the PCW method, which additionally revealed the opposing direction of influence of these two variables-observation not possible to make with the PSW approach. Here, the connection strength of SSB and LA only slightly exceed the average weights assigned to connections traced back to the two RV variables, FishMort, OES, and EA. Based on these results, we reduced the feature space of MLP1 to four input variables with highest average connection weights (SSB, NatMort, RPP, and LA) which then determined the structure of the new model MLP3. Figure 5 also shows the results of variable importance analysis on MLP3. We observed that RPP and NatMort retained their highest relative variable importance, with SSB showing a consistently lower importance relative to the other inputs when compared to MLP1. Analysis of connection weights in the MLP2 model ensemble confirmed the high relative importance of NatMort. The second most important variable, according to both the PCW and PSW algorithms, was RVmay. According to the PCW results, FishMort showed no consistent direction of influence on the model output and displayed the weakest average connection with the output. Consequently, we decided to rule out only this variable when reducing the feature space of MLP2 for the purpose of building the MLP4 model. As in the case of MLP3, the reduction of the feature space did not alter the relative weight distribution significantly. NatMort and RVmay were retained as the most influential variables for determining the model outputs.
In Section Model Performance Evaluation, we further analyze the effects of our feature selection analysis on model performance in order to choose the optimal FishANN MLP configuration.

Model Performance Evaluation
In Table 2 we compared the performance of the five MLP models using two statistical metrics: MSE and squared correlation of determination coefficient (r 2 ). Please note that both the MSE and r 2 values reported here represent the ensemble median model performance on those samples from the 1976 to 2001 time-series for which there was a complete suite of input variables available. In terms of MSE the best performance was provided by the MLP4 ensemble, although the improvement over MLP2 was marginal. While the r 2 of MLP4 was in fact 0.01 lower than that of MLP2, we must bear in mind that r 2 also likely increased due to a mere increase in the number of features. To test that, we calculated the adjusted r 2 of MLP4 (0.75) which turned out slightly higher than that of MLP2 (0.74). These results suggest that the variable importance-based feature selection performed on MLP2 led to an improvement in the model performance in form of MLP4.
Reducing the number of features in the MLP1 model resulted in an even more pronounced improvement in the model performance in terms of MSE and r 2 . Nonetheless, the improved model MLP3 was found to have a lower overall performance than MLP2. It is interesting that the most complex model with nine input variables (MLP1) had a lower r 2 than the simplest stockrecruitment model with only SSB in its feature space (MLP5). Reducing the feature space of MLP1 led to MLP3 with an r 2 above that of MLP5, however, with an adjusted r 2 (0.67) still lower than that of MLP5 (0.70).
These observations and conclusions were additionally confirmed by the results displayed in the Taylor diagram (Figure 6), which shows a more comprehensive multi-statistical analysis of the differences in performance between the five models. Considering not only the correlation of determination coefficient r, but also the central root mean square difference (RMSD) and standard deviation patterns relative to the ICES data. While the differences in terms of correlation and RMSD are narrow between all models, and small gains in reducing the feature space from MLP1 to MLP3, and MLP2 to MLP4, there is an increase in the standard deviation exhibited by the reduced models, with MLP3 standard deviation being in fact the closest to the ICES reference value.
In comparison with previous Baltic cod recruitment models in similar studies, Köster et al. (2003)  environmentally sensitive stock-recruitment relationships (Köster et al., 2001) and performed linear regressions of cod recruitment (in numbers) on SSB, using cod Age 2 dependent variables per subdivisions 25, 26, and 28 (R 2 = 0.16, R 2 = 0.23, and R 2 = 0.41, respectively). Röckmann et al. (2007) presented a linear regression of the spawning stock size in time series 1976-1999, with combined model estimates that "explained 72% of the variance of ICES standard stock assessment estimates." In this approach, recruitment refers to 0 group cod (Age 0). Margonski et al. (2010) demonstrated two different models in their model comparisons for the Eastern Baltic Cod stock: a linear and additive model (R 2 = 0.73) and a linear and polynomial model (R 2 = 0.725). We assume that the authors considered Age-2 cod as recruiting to the fishery, although this was not stated in the paper. Time-series comparison of MLP-derived and ICES-based Age-2 numbers estimates displayed in Figure 7 provide also additional information: (a) how well the different ensemble median MLPs perform in the distinct high and low recruitment regimes, and (b) what is the level of uncertainty due to picking any random ensemble member model projection, here defined as the range of variability around the median within each ensemble model.
In Figure 7 we see that when considering the range of ensemble variability as twice the standard variation of ensemble, then the majority of ICES Age-2 numbers data points fall inside the model projections. The narrowest range of ensemble projections was exhibited by models MLP3 and MLP4. While models MLP1 and MLP3 underestimated only a few very high recruitment years in the 1970s and early 1980s, MLP2 and MLP4 also overestimated a few low recruitment events in the last two decades. Nevertheless, all these four models in general perform well in both the high (1970-1980s) and low recruitment regimes (after 1990). This is in marked contrast to MLP5 which though seems to capture the overall level and trends in the low recruitment period, fails to capture both the timing and magnitude of the high recruitment events.
One important difference between the MLP4 and MLP2 ensembles, is for instance the fact that MLP4 shows a very consistent direction of forecast in the last 5 years of the time series, unlike MLP2 whose range of predictions reveal a wide spread with possible both decreasing and increasing trend toward Frontiers in Marine Science | www.frontiersin.org FIGURE 6 | Taylor diagram for multi-statistic model performance evaluation of the 5 MLP models in comparison to the ICES Age-2 numbers reference. RMSD-centered root mean square difference which is a variation of the mean square of errors used to train the MLP models. Correlation of determination here equivalent to the square root of r 2 reported in Table 2. Standard deviation and RMSD units expressed on log scale. the end of the time series, with the half an order of magnitude range of predicted Age-2 numbers.
Considering all the statistical metrics reviewed in Table 2 and Figure 6, the additional information inferred from Figure 7, as well as the fact that preference should always be given to a simpler model, we conclude that the best FishANN configuration is obtained by using the MLP4 ensemble. Furthermore, we propose that the range of MLP4 ensemble projections could be interpreted as a measure of the model uncertainty, possibly replaced with formal confidence intervals for the purpose of making operational model forecasts.
Unfortunately, due to lack of availability of published data on RPP and LA we cannot fully evaluate the predictive potential of MLP3 to the same extent as the currently preferred model MLP4. Considering the marginally lower performance of MLP3 relative to MLP4, MLP3 could still demonstrate better generalizability when presented with more independent input variables.

Variable Importance in Ensemble Modeling
Box plots in Figure 5 provided the summary of the distribution of connection weights and inferred variable importance across all ensemble members from MLP1-4. We have already discussed which of the input variables play the key role in the respective models, when taking the ensemble average into considering. It is however interesting to try and analyze the specific differences between the two individual ensemble members which might represent extreme responses under low and high recruitment regimes.
In Figure 8 we show the time series projections of Age-2 recruits from ICES, and MLP4 ensemble median and two individual algorithms from the ensemble: m23 and m29. We can see that m23 cannot properly detect the very high recruitment levels in the early parts of the time period under study, and at the same time, it grossly overestimates recruitment in the latter part of the time series. In fact, the results of m23 alone would not allow to distinguish a low and high recruitment regime which is a pronounced feature of the ICES estimated time series. The results of m23 are in marked contrast to the m29 ensemble member which can capture the high recruitment signal and at the same time follows the observed low recruitment very well. Perhaps the only advantage of m23 over m29 is the fact that it captures the recruitment in the 1985-1990 transition period much better, namely, it does not simulate a high recruitment year in 1985a feature not seen in the data but present in almost every MLP4 ensemble member algorithm.
In the four bottom panels of Figure 8 we analyze the differences in assigned connection weights between m23 and m29. In the case of m23, both the PCW and PSW approach lead to a similar conclusion that NatMort and RVmay were most influential in the model with only their rank reversed in the two approaches. SSB played a visibly lower role than all other input variables. In the case of m29, there are very strong differences between PCW-and PSW-based interpretations of the variable importance. While both PCW and PSW point at an increased role of SSB in m29 relative to m23, the latter assigns nearly 40% relative importance to SSB making it the key variable in the model. The PCW approach draws a very different picture with RVmay being by far the most dominant input variable, with NatMort also having a stronger connection weight to the output than SSB.
In order to try and resolve which approach draws a more accurate picture of variable importance in MLP4, we actually use the results of MLP5 which we know was solely based on the relationship between SSB and number of recruits. Figure 7 clearly illustrated that SSB-based MLP5 captured neither the timing nor the magnitude of the very high recruitment events from the beginning of the time series under study. It is therefore highly unlikely that assigning a higher relative weight to SSB would make m29 fit the high recruitment regime so much better than m23. On the other hand, giving more weight to SSB also does not explain the large overestimates in the low recruitment period, as MLP5 rather tended to underestimate recruitment after 1990. Thus, we conclude based on these observations that the PCW approach is more adequate to describing the variable importance in ANN models. It therefore also follows that RVmay might be the key variable needed to correctly project high recruitment events 2 years ahead. On the other hand, we also note that m29 projects false recruitment peaks, i.e., in 1985 and in 1996. Figure 2 reveals that there was a large increase in RVmay in 1994, 2 years prior to the project 1996 peak. It thus appears that a high RVmay in that year did not lead to a marked increase in recruitment, as predicted by m29, and that other factors need also be considered. Therefore, while it appears that NatMort and RVmay are key predictors for FishANN in general, the challenge remains to assign such relative weights to SSB and RVaug to further maximize the 2-year ahead predictive capacity of MLP4.
Moreover, we conclude based on these results that it is imperative not to assign a single fixed set of weights to individual processes represented by the selected input variables in Baltic cod recruitment models as their role in explaining recruitment patterns is very likely shifting depending on the environmental and recruitment regime (see also Margonski et al., 2010 for a discussion on that). Due to the adaptive structure of neural networks (i.e., variable activation of its computing units in response to changes in the input space), mimicking a set of weighted regression models in place of a single best fit model, such a variable distribution of weights is accomplished. Nonetheless, as indicated by the substantial differences between individual model ensemble members, distinct patterns of assigned weights may lead to model projections that are both statistically sound (high performance in terms of r 2 and MSE) but which have distinct biases of great consequence in potentially using these tools as support for fish stock management.
One approach would be to use expert judgment or another criterion (e.g., precautionary) to select a single best algorithm. The second approach, preferred by us, is to take an ensemble median/mean result with associated confidence intervals from a number of algorithms of the same structure, both in terms of complexity of the computing units and the number of input variables considered. In this study we also demonstrated that taking the ensemble approach still enables interpretation of variable importance through a sum of both positive and negative non-linear interactions inside an ANN model. We thus presented a pragmatic approach to a very complex problem of predicting fish recruitment under a highly variable environment.
As a potential future development, we hypothesize that applying a weighted majority vote scheme to our ensemble model building in place of the simple median majority vote would further increase the probability of correctly classifying recruitment numbers under unknown sets of inputs variables (Kuncheva, 2004). Implementing such a more complex ensemble scheme is currently hindered by the limited number of sample points in the training dataset, but is likely necessary when dealing with very contrasting environmental and recruitment regimes as in case of the eastern Baltic cod.

Assumptions and Limitations
At this stage, there are some points we need to put into consideration. First, more data may be needed to ensure the actual quality of prediction and training. As stated in Doan and Liong (2004) "the length of the test set has an impact on the prediction capability of the trained model" and our results, given the small amount of data in which the model was trained in, could be open to interpretations. Lek et al. (1996) mention that "repeated resampling of test data sets reduces the potential for sampling error caused by choice of a validation or test data set." We believe we have used means to address to this issue (i.e., k-fold cross validation) but a longer data set could provide a better overview. While sampling error is not common when "extremely large and functionally homogenous data sets are used and it is likely to be a factor when used for actual ecological data sets where data availability is limited" (Kemp et al., 2007). This could be a reason behind the fact that the model underestimates high recruitment. It is also possible that data may contain some errors. FishANN seems to constantly overestimate one specific recruitment year. At this point, it needs to be said that we didn't account deep-water salinity data in our study, as suggested by Heikinheimo (2008). We would have considered including average deep-water salinity as an additional input variable, should its higher explanatory power relative to RV be also relevant in our case. It is noted that Heikinheimo (2008) examined the variability in Baltic cod age-0 recruitment. All in all, deep-water salinity data was not included due to lack of sufficient availability of the data set. The ICES database offers access to relevant salinity data only until 2004. In order to facilitate reproducibility of results we opted to restrict ourselves to those input variables that are readily available, even though this information could add potential improvement of model predictive skills. Third, we assumed for our data that mechanisms across Baltic subdivisions are roughly the same, since the integrated abundance of all spawning areas should be unaffected by transport between spawning areas (Köster et al., 2005). We kept this spatial homogeneity within the Central Baltic, similar to studies by Sparholt (1996) and Jarre-Teichmann et al. (2000). We can't be sure to what extent this decision affects the interpretations of complex non-linear relationships for recruitment.
Fourth, the level of analysis in variable importance presented is restricted to semi-linear relations. There might be some additional non-linear effects between them. The question if the available inputs capture this encrypted variability may be more suitable to investigate with emerging tools for non-linear analysis of the relationship of variables used in neural network models (e.g. Fischer, 2015).
The "black box" term is no longer an appropriate term for ANNs. Recent advances in the field of environmental sciences have provided a "set of techniques to determine the relative importance of each input variable" (Fischer, 2015). These techniques include sensitivity analyses Recknagel et al., 1997), partial derivatives (Dimopoulos et al., 1999;Reyjol et al., 2001;Gevrey et al., 2006), "input variable relevancies and neural interpretation diagrams" (Ozesmi and Ozesmi, 1999), significance randomization tests (Olden, 2000;Olden and Jackson, 2002). All these approaches are "based on the fact that the contribution of each independent variable depends on the magnitude and direction of the connection weights between neurons" (Olden et al., 2008). However, only a handful of methods have been documented for visualizing non-linear effects of input variables in literature (Garson, 1991;Dimopoulos et al., 1995Dimopoulos et al., , 1999Lek et al., 1996;Fischer, 2015), an important issue to consider when dealing with complex ecosystem scenarios. So far, to our knowledge, there are no relevant studies using these techniques for predicting fish recruitment with ANNs.

Recommendations for Future Work
In 2008, Houde (2008) highlighted some critical research needs in understanding causes of recruitment variability: • Implementing/continuing long-term surveys on early-life stages. • Better understanding of larval behavior.
• New and improved coupled biological-physical models: Most coupled biological-physical models have been developed as "explanatory or inferential tools" (Miller, 2007) and relatively few as tools to test hypotheses: the biggest contribution to understanding variability in recruitment ultimately may come from that approach. • Research that goes beyond life stages.
• Understanding effects of climate change.
• To enable not only short-term but also long-term recruitment projections under scenarios of climate change/variability.
For the future, the statistical nature of ANNs could be combined with dynamic IBMs, fuzzy logic (Chen and Ware, 1999;Chen and Hare, 2006) or expert knowledge to address more complex long term recruitment scenario forecasting in fisheries. By combining process knowledge and statistics (e.g., capelin migration model- Huse and Ellingsen, 2008) we could also enable future scenario testing for long term recruitment hypotheses, and provide "continuous learning" operational predictive models, i.e., with evolving connection weights. This approach, together with an articulate methodology to overview ranking of non-linear variable importance techniques, could enable automated process monitoring, biological systems control, and advance decision making to a new level. We believe that this might be an appropriate approach for a CAS and that ANN tools can contribute significantly toward this direction. The case of the Baltic cod stock exemplifies the multitude of effects biotic interactions (clupeid predation on cod eggs, competition between clupeids, and cod larvae for common zooplankton resources) and climate variability may have on a fish stock, and it underscores the importance of knowledge of these processes for understanding the dynamics of fish stocks.

AUTHOR CONTRIBUTIONS
All authors listed, have made substantial, direct and intellectual contribution to the work, and approved it for publication.