
This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the

We show how complexity theory can be introduced in machine learning to help bring together apparently disparate areas of current research. We show that this model-driven approach may require less training data and can potentially be more generalizable, as it shows greater resilience to random attacks. In an algorithmic space the order of its elements is given by their algorithmic probability, which arises naturally from computable processes. We investigate the shape of a discrete algorithmic space when performing regression or classification using a loss function parametrized by algorithmic complexity, demonstrating that the property of differentiability is not required to achieve results similar to those obtained using differentiable programming approaches such as deep learning. In doing so we use examples that enable the two approaches to be compared (small ones, given the computational power required for estimations of algorithmic complexity). We find and report that 1) machine learning can successfully be performed on a non-smooth surface using algorithmic complexity; 2) solutions can be found using an algorithmic-probability classifier, establishing a bridge between a fundamentally discrete theory of computability and a fundamentally continuous mathematical theory of optimization methods; 3) a formulation of an algorithmically directed search technique in non-smooth manifolds can be defined and conducted; 4) exploitation techniques and numerical methods for algorithmic search to navigate these discrete non-differentiable spaces can be performed; in application to (a) the identification of generative rules from data observations; (b) solutions to image classification problems that are more resilient against pixel attacks than neural networks; (c) the identification of equation parameters from a small data-set in the presence of noise in a continuous ODE system problem; (d) the classification of Boolean NK networks by

Given a labeled data-set, a loss function is a mathematical construct that assigns a numerical value to the discrepancy between a predicted model-based outcome and its real outcome. A cost function aggregates all losses incurred into a single numerical value that, in simple terms, evaluates how close the model is to the real data. The goal of minimizing an appropriately formulated cost function is ubiquitous and central to any machine learning algorithm. The main heuristic behind most training algorithms is that fitting a sufficiently representative training set will result in a model that will capture the structure behind the elements of the target set, where a model is fitted to a set when the absolute minimum of the cost function is reached.

The algorithmic loss function that we introduce is designed to quantify the discrepancy between an inferred program (effectively a computable model of the data) and the data.

Algorithmic complexity (

Given a Turing-complete language ^{1}

The previous definition can be understood as the amount of information needed to define

In (

The algorithmic probability (

Solomonoff (

The Coding theorem (

Interestingly, the use of algorithmic probability and information theory to define AI algorithms has theoretically been proposed before (

The main task of a loss function is to measure the discrepancy between a value predicted by the model and the actual value as specified by the training data set. In most currently used machine learning paradigms this discrepancy is measured in terms of the differences between numerical values, and in the case of cross-entropy loss, between predicted probabilities. Algorithmic information theory offers us another option for measuring this discrepancy: in terms of the algorithmic distance or information deficit between the predicted output of the model and the real value, which can be expressed by the following definition:

Definition 1. For computable

There is a strong theoretical argument to justify the Def. 1. Let us recall that given a set

Let us assume a prefix-free universal Turing machine and denote by

Thus, by minimizing the algorithmic loss function over the samples, along with the algorithmic complexity of

An algorithmic cost function must be defined as a function that aggregates the algorithmic loss incurred over a supervised data sample. At this moment, we do not have any reason, theoretical or otherwise, to propose any particular loss aggregation strategy. As we will show in subsequent sections, considerations such as continuity, smoothness and differentiability of the cost function are not applicable to the algorithmic cost function. We conjecture that any aggregation technique that correctly and uniformly weights the loss incurred through all the samples will be equivalent, the only relevant considerations being training efficiency and the statistical properties of the data. However, in order to remain congruent with the most widely used cost functions, we will, for the purpose of illustration, use the sum of the squared algorithmic differences
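The aggregation described above can be sketched in a few lines. This is an illustrative stand-in only: `algo_distance` here uses a crude compression-based proxy (via `zlib`) for the conditional algorithmic complexity, not the CTM/BDM estimates the paper relies on, and the function names are our own.

```python
import zlib

def algo_distance(y, y_hat):
    """Proxy for the conditional complexity K(y | y_hat).

    Stand-in for illustration only: a compression-based estimate of how
    much harder y is to describe given y_hat, instead of a proper
    CTM/BDM approximation.
    """
    c_joint = len(zlib.compress((y_hat + y).encode()))
    c_given = len(zlib.compress(y_hat.encode()))
    return max(c_joint - c_given, 0)

def algorithmic_cost(samples, model):
    """Sum of squared algorithmic losses over a labeled sample set."""
    return sum(algo_distance(y, model(x)) ** 2 for x, y in samples)

# Toy usage: a "model" that echoes its input.
data = [("0101", "0101"), ("0011", "0011")]
identity = lambda x: x
cost = algorithmic_cost(data, identity)
```

Any other uniform aggregation (e.g. a plain sum of losses) would fit the conjecture above equally well; the squared sum is used only to stay congruent with common practice.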

One of the main fields of application for automated learning is the classification of objects. These classification tasks are often divided into supervised and unsupervised problems. In its most basic form, a supervised classification task can be defined, given a set of objects

Now, it is important to note that it is not constructive to apply the algorithmic loss function (Def. 1) to the abstract representations of classes that are commonly used in machine learning classifiers. For instance, the output of a softmax function is a vector of probabilities that represent how likely it is for an object to belong to each of the classes (

Accordingly, in order to apply the algorithmic loss function to classification tasks, we need to seek a model that outputs an

We can find the needed output in the definition of the classification problem: let us define a class

For instance, if we have a computable classifier

Now, on the other hand, the general definition of the algorithmic loss function (def. 1) states that

What the ^{2}

Definition 2. Given a training set

In short, we assign to each object the closest class according to its algorithmic distance to one of the centroids in the set of objects
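The nearest-centroid rule just described can be sketched as follows. The distance function is pluggable; in practice it would be a CTM/BDM approximation to the conditional complexity, while here a Hamming distance serves as a deliberately crude stand-in, and the class names are hypothetical.

```python
def classify(obj, centroids, distance):
    """Assign obj to the class whose centroid is algorithmically closest.

    `distance` is any approximation to the conditional complexity
    K(obj | centroid).
    """
    return min(centroids, key=lambda c: distance(obj, centroids[c]))

# Toy distance: Hamming distance as a crude stand-in for K(x|y).
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))

centroids = {"class_a": "0000", "class_b": "1111"}
label = classify("0001", centroids, hamming)
```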

Now, in a strong algorithmic sense, we can say that a classifier is optimal if the class assigned to each object fully describes this object minus incidental or incompressible information. In other words, if a classifier is optimal and we know the class of an object, then we know all its characteristics, except those that are unique to the object and not shared by other objects within the class.

Formally, a classifier ^{3}

For example, let

Now, consider the classifier

Let ^{4}

Now, the fact that the degree of sophistication

The next theorem shows that minimizing the stated cost function guarantees that the classifier is optimal in a strong algorithmic sense:

Theorem 3. If a classifier f minimizes the cost function

Proof. Assume that f is not optimal. Then there exist

Now, note that given that

While theoretically sound, the proposed algorithmic loss (Def. 1) and classification (Def. 2) cost functions rely on the uncomputable mathematical object

The Coding Theorem Method (

Definition 4. Let us recall that a relation

The previous definition is based on the Coding theorem (

In the case where

When

The Block Decomposition Method (BDM (

Definition 5. We define the coarse conditional BDM of X with respect to the tensor

The sub-objects

The motivation behind this definition is to enable us to consider partitions for the tensors

The term
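A minimal sketch of the coarse conditional decomposition, under one possible reading of Def. 5 (the exact bookkeeping of the multiplicity terms is our assumption): blocks of X absent from Y contribute their full CTM value plus a multiplicity term, while shared blocks contribute only when their multiplicities differ. The `ctm` lookup table is hypothetical.

```python
from collections import Counter
from math import log2

def coarse_conditional_bdm(x_blocks, y_blocks, ctm):
    """Sketch of coarse conditional BDM(X|Y) for a fixed partition.

    x_blocks, y_blocks: lists of sub-objects produced by the partition
    strategy; ctm: a (hypothetical) table mapping each block to its
    CTM complexity estimate.
    """
    nx, ny = Counter(x_blocks), Counter(y_blocks)
    total = 0.0
    for block, n in nx.items():
        if block not in ny:
            # Blocks absent from Y must be described in full.
            total += ctm[block] + (log2(n) if n > 1 else 0)
        elif n != ny[block]:
            # Shared blocks only cost their multiplicity difference.
            total += log2(n) if n > 1 else 0
    return total

ctm = {"01": 3.0, "10": 3.0, "11": 2.5}  # toy CTM values
val = coarse_conditional_bdm(["01", "01", "11"], ["11"], ctm)
```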

The previous definition featured the adjective coarse because we can define a stronger version of conditional BDM approximating

Definition 6. The strong conditional BDM of X with respect to Y corresponding to the partition strategy

While we assert that the pairing strategy minimizing the given sum will yield the best approximation to K in all cases, prior knowledge of the algorithmic structure of the objects can be used to facilitate the computation by reducing the number of possible pairings to be explored, especially when using the domain specific version of conditional BDM. For instance, if two objects are known to be produced from local dynamics, then restricting the algorithmic comparisons by considering pairs based on their respective position on the tensors will, with high probability, yield the best approximation to their algorithmic distance.

Of the three methods introduced in this section, Conditional CTM yields the best approximation to conditional algorithmic complexity and should be used whenever possible. Coarse conditional BDM and strong conditional BDM are designed as a way to extend the applicability of conditional CTM to objects for which a good CTM approximation is unknown. These extensions work by separating the larger object into smaller blocks for which we have an approximation of

It is easy to see that under the same partition strategy, strong conditional BDM will always present a better approximation to

The properties of strong and coarse conditional BDM and their relation with entropy are shown in the appendix (

In the previous sections we proposed algorithmic loss and cost functions (Def. 1) for supervised learning tasks, along with means to compute approximations to these theoretical mathematical objects. Here we ask how to perform model parameter optimization based on such measures. Many of the most widely used optimization techniques rely on the cost function being (sufficiently) differentiable, smooth and convex (

Let us start with a simple bilinear regression problem. Let

According to the Def. 1, the loss function associated with this optimization problem is

On the left we have a visualization of the algorithmic cost function, as approximated by coarse BDM, corresponding to a simple bilinear regression problem. From the plot we can see the complex nature of the optimization problem. On the right we have confirmation of these intuitions in the fact that the best performing optimization algorithm is a random pooling of 5,000 points.

The established link between algorithmic information and algorithmic probability theory (

Algorithmic probability theory establishes and quantifies the fact that the most probable computable program is also the least complex one (

By searching for the solution using the algorithmic order we can meet both requirements in an efficient amount of time. We start with the least complex solution, therefore the most probable one, and then we move toward the most complex candidates, stopping once we find a good enough value for

Definition 7. Let

minCost =

Return

where the halting condition is defined in terms of the number of iterations or a specific value for
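The search procedure of Def. 7 can be sketched as follows: scan candidate parameters in order of increasing (approximate) algorithmic complexity, track the best cost seen, and stop on the halting condition. The ordering here uses a deliberately crude simplicity proxy; in practice it would come from a CTM/BDM approximation.

```python
def algorithmic_search(candidates_by_complexity, cost,
                       max_iter=1000, good_enough=0.0):
    """Sketch of Def. 7: scan parameters in algorithmic order.

    candidates_by_complexity: iterable of parameter values already
    sorted from algorithmically simplest to most complex.
    """
    best, min_cost = None, float("inf")
    for i, theta in enumerate(candidates_by_complexity):
        c = cost(theta)
        if c < min_cost:
            best, min_cost = theta, c
        # Halting condition: iteration budget or a good-enough cost.
        if i + 1 >= max_iter or min_cost <= good_enough:
            break
    return best, min_cost

# Toy run: integers ordered by a crude simplicity proxy (fewer digits
# first), searching for the parameter minimizing |theta - 6|.
ordered = sorted(range(100), key=lambda n: (len(str(n)), n))
best, c = algorithmic_search(ordered, lambda t: abs(t - 6), good_enough=0)
```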

The algorithmic cost function is not expected to reach zero. In a perfect fit scenario, the loss of a sample is the relative algorithmic complexity of

By Def. 7, it follows that the parameter space is countable and computable. This is justified, given that any program is bound by the same requirements. For instance, in order to fit the output of the function

Now, consider a fixed model structure

Computing the algorithmic order needed for definition 7 is not a trivial task and requires the use of approximations to

Given the way that algorithmic parameter optimization works, the optimization time, as measured by the number of iterations, will converge faster if the optimal parameters have low algorithmic complexity. Therefore they are more plausible in the algorithmic sense. In other words, if we assume that, for the model we are defining, the parameters have an underlying algorithmic cause, then they will be found faster by algorithmic search, sometimes much faster. How much faster depends on the problem and its algorithmic complexity. In the context of artificial evolution and genetic algorithms, it has been previously shown that, by using an algorithmic probability distribution, the exponential random search can be sped up to quadratic (

Following the example of inferring the function in

The assumption that the optimum parameters have an underlying simplicity bias is strong, but has been investigated (

At the same time, we are aware that, for the example given, mathematical analysis-based optimization techniques have a perfect and efficient solution in terms of the gradient of the MSE cost function. While algorithmic search is faster than random search for a certain class of problems, it may be slower for another large class of problems. However, algorithmic parameter optimization (Def. 7) is a domain- and problem-independent general method. While this new field of algorithmic machine learning that we are introducing is at an early stage of development, in the next sections we set forth some further developments that may help boost the performance of our algorithmic search for specific cases, such as greedy search over the subtensors, and there is no reason to believe that more boosting techniques will not be developed and introduced in the future.

Thus far we have provided the mathematical foundation for machine learning based on the power of algorithmic probability at the price of operating on a non-smooth loss surface in the space of algorithmic complexity. While the directed search technique we have formulated succeeds with discrete problems, here we ask whether our tools generalize to problems in a continuous domain. To gain insight, we evaluate whether we can estimate parameters for ordinary differential equations. Parameter identification is well-known to be a challenging problem in general, and in particular for continuous models when the data-set is small and in the presence of noise. Following (

An elementary cellular automaton (ECA) is a discrete and linear binary dynamical system where the state of a node is defined by the states of the node itself and its two adjacent neighbors (

Aside from the Turing-complete rule with number 110, the others were randomly selected among all 256 possible ECA. The training set was composed of 275 black and white images, 25 for each automaton or “class”. An independent validation set of the same size was also generated, along with a test-set with 1,375 evenly distributed samples. An example of the data in these data sets is shown in

Two

First we will illustrate the difficulty of the problem by training neural networks with simple topologies over the data. In total we trained three ^{5}

However, as shown in (

The algorithmic probability model chosen consists of eleven

• First, we start with the eleven

• Then, we perform algorithmic optimization, but only changing the bits contained in the upper left

• After minimizing with respect to only the upper left quadrant, we minimize over the upper right

• We repeat the procedure for the lower left and lower right quadrants.

These four steps are illustrated in
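The four-step greedy procedure above can be sketched as follows, for a generic cost function standing in for the algorithmic cost of the classifier center (the function name and the toy cost are our own illustration).

```python
def greedy_quadrant_fit(center, cost, size=4):
    """Sketch of the four-step greedy optimization over quadrants.

    center: a size x size binary matrix (list of lists); cost: any
    approximation of the algorithmic cost of the classifier center.
    Bits are flipped quadrant by quadrant (upper-left, upper-right,
    lower-left, lower-right), keeping only flips that lower the cost.
    """
    half = size // 2
    quadrants = [(r, c) for r in (0, half) for c in (0, half)]
    best = cost(center)
    for r0, c0 in quadrants:                 # the four steps in order
        for i in range(r0, r0 + half):
            for j in range(c0, c0 + half):
                center[i][j] ^= 1            # try flipping one bit
                trial = cost(center)
                if trial < best:
                    best = trial             # keep the flip
                else:
                    center[i][j] ^= 1        # revert
    return center, best

# Toy cost: distance to an all-ones target matrix.
target = [[1] * 4 for _ in range(4)]
cost = lambda m: sum(m[i][j] != target[i][j]
                     for i in range(4) for j in range(4))
fitted, final_cost = greedy_quadrant_fit([[0] * 4 for _ in range(4)], cost)
```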

The evolution of the center for the four steps of the greedy algorithmic information optimization method used to train the model in the first experiment. This classifier center corresponds to class 11.

The next problem was to classify black and white images representing the evolution of elementary cellular automata. In this case, we are classifying according to the initialization string that produced the corresponding evolution for a randomly chosen automaton. The classes for the experiment consisted of 10 randomly chosen binary strings, each 12 bits in length. These strings correspond to the binary representation of the following integers:

The training, validation and test sets each consisted of two hundred

We trained and tested a group of neural network topologies on the data in order to establish the difficulty of the classification task. These networks were an (adapted version of) Fernandes’ topology and 4 naive neural networks that consisted of a flattened (fully-connected) layer, followed by 1, 2, and 5 groups of layers, each consisting of a fully connected linear layer with rectilinear activation (ReLU) function followed by a dropout layer, ending with a linear layer and a softmax unit for classification. The adaptation of the Fernandes topology was only for the purpose of changing the kernel of the pooling layer to

The best performing network was the shallower one, which consists of a flattened layer, followed by a fully connected ReLU, a dropout layer, a linear layer with 10 inputs and a softmax unit. This neural network achieved an accuracy of 60.1%. At 18.5%, the performance of Fernandes’ topology was very low, being barely above random choice. This last result is to be expected, given that the topology is domain specific, and should not be expected to extend well to different problems, even though at first glance the problem may seem to be related.

The algorithmic probability model

However, for this particular problem, the coarse version of conditional BDM proved inadequate for approximating the universal algorithmic distance

• First, we computed all the outputs of all possible 12-bit binary strings for each of the first 128 ECA for a total of 528,384 pairs of 12 bit binary vectors and

• Then, by considering only the inner 6 bits of the vectors (discarding the 3 bits on the left and the 3 bits on the right) and, similarly, the inner

• where

• If a particular pair

In the end we obtained a database of the algorithmic distance between all
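At toy scale, the database construction can be sketched as follows. Note the assumptions: periodic boundary conditions and standard Wolfram rule numbering; the paper's actual procedure may differ in boundary handling, and the scale here (4 rules, 4-bit inputs) is far below the 528,384 pairs described above.

```python
def eca_step(state, rule):
    """One step of an elementary cellular automaton (periodic bounds,
    Wolfram rule numbering)."""
    n = len(state)
    return [
        (rule >> (state[(i - 1) % n] * 4 + state[i] * 2
                  + state[(i + 1) % n])) & 1
        for i in range(n)
    ]

def build_pairs(rules, width):
    """Enumerate (input, output, rule) triples for all binary inputs.

    Each pair witnesses that a small program (the ECA rule) maps the
    input to the output, which bounds their conditional complexity
    from above, as in the database construction above.
    """
    pairs = []
    for rule in rules:
        for x in range(2 ** width):
            state = [(x >> i) & 1 for i in range(width)]
            pairs.append((tuple(state), tuple(eca_step(state, rule)), rule))
    return pairs

pairs = build_pairs(range(4), 4)   # toy scale: 4 rules x 16 inputs
```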

The previous procedure might at first seem to be too computationally costly. However, just as with Turing Machine based CTM (

The trained model

An NK network is a dynamical system that consists of a binary Boolean network where the parameter

Given the extensive computational resources it would require to compute a CTM database, such as the one used in

Specifically, the number of possible Boolean operations of degree
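The counting argument is elementary: a Boolean function on k inputs assigns one output bit to each of the 2^k input rows, so there are 2^(2^k) distinct truth tables. A small sketch (function names our own):

```python
def n_boolean_functions(k):
    """Number of Boolean functions on k inputs: one output bit for each
    of the 2**k input rows, hence 2**(2**k) distinct truth tables."""
    return 2 ** (2 ** k)

# Enumerate every 2-input Boolean rule as a 4-bit truth table.
rules_k2 = [[(t >> row) & 1 for row in range(4)]
            for t in range(n_boolean_functions(2))]
```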

Following this idea we defined a classifier where the model

For classifying according to the Boolean rules assigned to each node, we used 10 randomly generated (ordered) lists of 4 binary Boolean rules. These rules were randomly chosen (with repetitions) from

To classify according to network topology we used 10 randomly generated topologies consisting of 10 binary matrices of size

Kauffman networks are a special case of Boolean NK networks where the number of incoming connections for each node is two, that is,

For this problem we used a different type of algorithmic cluster center. For the Boolean rules classifier, the model

The use of different kinds of models for this task showcases another layer of abstraction that can be used within the wider framework of algorithmic probability classification: context. Rather than using binary tensors, we can use a structured object that has meaning for the underlying problem. Yet, as we will show next, the underlying mechanics will remain virtually unchanged.

Let us go back to Definition 2, which states that to train both models we have to minimize the cost function

So far we have approximated

Following similar steps to the ones used in

So far we have presented supervised learning techniques that, in a way, diverge. In this section we will introduce one of the ways in which the two paradigms can coexist and complement each other, combining statistical machine learning with an algorithmic-probability approach.

The choice of an appropriate level of model complexity that avoids both under- and over-fitting is a key hyperparameter in machine learning. Indeed, on the one hand, if the model is too complex, it will fit the data used to construct the model very well but generalize poorly to unseen data. On the other hand, if the complexity is too low, the model will not capture all the information in the data. This is often referred to as the bias-variance trade-off, because a complex model will exhibit large variance, while an overly simple one will be strongly biased. Most traditional methods feature this choice in the form of a free hyperparameter via, e.g., what is known as regularization.

A family of mathematical techniques or processes that has been developed to control over-fitting of a model goes under the rubric “regularization”, which can be summarized as the introduction of information from the model to the training process in order to prevent over-fitting of the data. A widely used method is the Tikhonov regularization (

The core premise of the previous function is that we are disincentivizing fitting toward certain parameters of the model by assigning them a higher cost in proportion to λ, which is a hyperparameter that is learned empirically from the data. In current machine learning processes, the most commonly used weighting functions are the sum of the

We can employ the basic form of

Definition 8. Let

The justification of the previous definition follows from algorithmic probability and the coding theorem: Assuming an underlying computable structure, the most probable model that fits the data is the least complex one. Given the universality of algorithmic probability, we argue that the stated definition is general enough to improve the plausibility of the model of any machine learning algorithm with an associated cost function. Furthermore, the stated definition is compatible with other regularization schemes.
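A minimal sketch of the penalized selection this definition induces, assuming the complexity estimates come from some computable approximation to K(M) such as BDM (the triples and values below are purely illustrative):

```python
def select_model(candidates, lam):
    """Pick the model minimizing J(M) + lam * K_hat(M) (Def. 8 sketch).

    candidates: (name, fit_cost, complexity_estimate) triples; the
    complexity estimates stand in for a BDM/CTM approximation of K(M),
    and lam is the empirically learned hyperparameter.
    """
    return min(candidates, key=lambda m: m[1] + lam * m[2])

# A complex model fits slightly better, but the penalty prefers the
# simpler, algorithmically more probable one.
models = [("simple", 1.0, 10.0), ("complex", 0.8, 40.0)]
chosen = select_model(models, lam=0.05)
```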

Just as with the algorithmic loss function (Def. 2), the resulting function is not smooth, and therefore cannot be optimized by means of gradient-based methods. One option for minimizing this class of functions is by means of algorithmic parameter optimization (Def. 7). It is important to recall that computing approximations to the algorithmic probability and complexity of objects is a recent development, and we hope to promote the development of more powerful techniques.

Another, perhaps more direct way to introduce algorithmic probability into the current field of machine learning, is the following. Given that in the field of machine learning all model inference methods must be computable, the following inequality holds for any fixed training methodology:

Accordingly, the heuristic for our definition of algorithmic probability weighting is that, to each training sample, we assign an importance factor (weight) according to its algorithmic complexity value, in order to increase or diminish the loss incurred by the sample. Formally:

Definition 9. Let

We define the weighted approximation to the algorithmic complexity regularization of J or algorithmic probability weighting as

We have opted for flexibility regarding the specification of the function
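Given that flexibility, one concrete choice of the weighting function is sketched below. The exponential form is our own assumption, chosen so that algorithmically random (e.g. salted) samples receive small weights while structured samples keep weights near 1; the complexity values would in practice come from BDM estimates.

```python
from math import exp

def sample_weights(complexities, scale=1.0):
    """Sketch of Def. 9: weight each sample by its estimated complexity.

    complexities: approximations to K(x_i) for each training sample;
    the exponential decay is one possible choice of f, down-weighting
    samples that look algorithmically random.
    """
    k_min = min(complexities)
    return [exp(-scale * (k - k_min)) for k in complexities]

# Toy usage: the third "sample" looks algorithmically random.
weights = sample_weights([10.0, 11.0, 30.0], scale=0.2)
```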

As its name implies, the previous Def. 9 can be considered analogous to sample weighting, which is normally used as a means to confer predominance or diminished importance on certain samples in the data set according to specific statistical criteria, such as survey weights and inverse variance weighting (

Now, given that the output of f and its parameters are constant from the point of view of the parameters of the model M, it is easy to see that if the original cost function J is continuous, differentiable, convex and smooth, so is the weighted version

A key step to enabling progress between a fundamentally discrete theory such as computability and algorithmic probability, and a fundamentally continuous theory such as that of differential equations and dynamical systems, is to find ways to combine both worlds. As shown in

A visualization of the algorithmic cost function, as approximated by coarse BDM, corresponding to the parameter approximation of an ordinary differential

The average Euclidean distance between the solution inferred by algorithmic optimization and the hidden parameters of the ODE in

Following optimization, a classification function was defined to assign a new object to the class corresponding to the center

The accuracy of the Tested Classifiers.

Classifier | Test Set | Training Set
---|---|---
Naive Networks | |
1 | 38.88% | 95.63%
2 | 39.70% | 95.63%
3 | 40.36% | 100%
4 | 39.05% | 100%
Fernandes’ | 98.8% | 99.63%
Algorithmic Class | 98.1% | 99.27%

Expected number of vulnerabilities per sample.

Classifier | Total Vulnerabilities | Per Sample | Percentage of Pixels (%)
---|---|---|---
Fernandes’ (DNN) | 190,850 | 138.88 | 13.56
Algorithmic Classifier | 15,125 | 11 | 1

From the data we can see that the algorithmic classifier outperformed the four

Last but not least, we have tested the robustness of the classifiers by measuring how good they are at resisting one-pixel attacks ((

Algorithmic information theory tells us that algorithmic probability classifier models should have a relatively high degree of resilience in the face of such attacks: if an object belongs to a class according to a classifier it means that it is algorithmically close to a center defining that class. A one-pixel attack constitutes a relatively small information change in an object. Therefore there is a relatively high probability that a one-pixel attack would not alter the information content of an image enough to increase the distance to the center in a significant way. In order to test this hypothesis, we systematically and exhaustively searched for vulnerabilities in the following way: a) One by one, we flipped (from 0 to 1 or vice versa) each of the

From the results we can see that for the DNN, 13.56% of the pixels are vulnerable to one-pixel attacks, and that only 1% of the pixels manifest that vulnerability for the algorithmic classifier. These results confirm our hypothesis that the algorithmic classifier is significantly more robust in the face of small perturbations compared to the deep network classifier designed without a specific purpose in mind. It is important to clarify that we are not stating that it is not possible to increase the robustness of a neural network, but rather pointing out that algorithmic classification has a high degree of robustness

The accuracy obtained using the different classifiers is represented in

In order to test the generalization of the CTM database computed for this experiment, we tested our algorithmic classifying scheme on a different instance of the same basic premise: binary images of size

The results are summarized in

Furthermore, by analyzing the confusion matrix plot (

The confusion matrix for the neural network (left) and the algorithmic information classifier (right) while classifying binary vectors representing the degree of connection (parameter k) of 300 randomly generated Boolean NK networks. From the plots we can see that the algorithmic classifier can predict with relatively high accuracy the elements belonging to class

A second task was to classify a set of binary vectors of size 40 that represent the evolution of an NK network of four nodes (

The task was to determine whenever a random Boolean network belonged to the frozen, critical or chaotic phase by determining when

For the task at hand we trained the following classifiers: a neural network, gradient boosted trees and a convolutional neural network. The first was a naive neural network consisting of a ReLU layer, followed by a dropout layer, a linear layer and a final softmax unit for classification. For the convolutional model we used prior knowledge of the problem, applying a specialized topology that consisted of 10 convolutional layers with a kernel of size 24, each kernel representing a stage of the evolution, with a ReLU, a pooling layer of kernel size 24, a flattened layer, a fully connected linear layer and a final softmax layer. The tree-based classifier manages an accuracy of 35% on the

For comparison purposes, we trained a neural network and a logistic regression classifier on the data. The neural network consisted of a naive topology consisting of a ReLU layer, followed by a dropout layer, a linear layer and a softmax unit. The results are shown in

From the results obtained we can see that the neural network, with 92.50% accuracy, performed slightly better than the algorithmic classifier (91.35%) on the test set. The logistic regression accuracy is a bit further behind, at 82.35%.

However, the difference in the performance of the topology test set is much greater, with both the logistic regression and the neural network reaching very high error rates. In contrast, our algorithmic classifier reaches an accuracy of 72.4%.

As a first experiment in algorithmic weighting, we designed an experiment using the MNIST dataset of handwritten digits (

The binarization was performed by using a simple mask: if the value of a (gray scale) pixel was above 0.5 then the value of the pixel was set to one, using zero in the other case. This transformation did not affect the performance of any of the deep learning models tested, including the LeNet-5 topology (
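The simple mask described above can be written directly (function name our own; images are taken as nested lists of grayscale values in [0, 1]):

```python
def binarize(image, threshold=0.5):
    """Binarize a grayscale image with a simple mask: pixels above the
    threshold become 1, everything else 0."""
    return [[1 if px > threshold else 0 for px in row] for row in image]

binary = binarize([[0.1, 0.7], [0.5, 0.9]])
```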

Next we salted or corrupted 40% of the training samples by randomly shuffling 30% of their pixels. An example of these corrupted samples can be seen in
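The salting procedure can be sketched as follows (the function name and the fixed seed are our own; the paper does not specify the shuffling implementation):

```python
import random

def salt(image, pixel_fraction=0.3, rng=None):
    """Corrupt a binary image by randomly shuffling a fraction of its
    pixels, as applied to 40% of the training samples."""
    rng = rng or random.Random(0)
    flat = [px for row in image for px in row]
    # Pick the positions to corrupt, then permute their values.
    idx = rng.sample(range(len(flat)), int(len(flat) * pixel_fraction))
    vals = [flat[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        flat[i] = v
    n = len(image[0])
    return [flat[i:i + n] for i in range(0, len(flat), n)]

salted = salt([[0, 1, 0, 1]] * 4)
```

Note that shuffling preserves the multiset of pixel values: the corruption rearranges information rather than adding or removing ink.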

At left we have an image representing a binarized version of a randomly chosen sample. At right we have the salted version of the same sample, with 30% of its pixels randomly shuffled.

Finally, we trained 10 neural networks with increasing depth, setting aside 20% of the training data as a verification set, thereby obtaining neural networks of increasing depth and, more importantly, variance. The topology of these networks consisted of a flattened layer, followed by an increasing number of fully connected linear layers with rectified linear (ReLU) activation functions, and a final softmax layer for classification. In order to highlight the effect of our regularization proposal, we abstained from using other regularization techniques and large batch sizes. For instance, optimizing using variable learning rates such as RMSProp along with small stochastic batches is an alternative way of steering the samples away from the salted samples.

For purposes of comparison, the neural networks were trained with and without weighting, using the option sample_weight for train_on_batch in Keras. The training parameters for the networks, which were trained using Keras on Python 3, were the following:

Stochastic gradient descent with batch size of 5,000 samples.

40 epochs (therefore 80 training stages), with the exception of the last model with 10 ReLU layers, which was trained for 150 training stages.

Categorical crossentropy as loss function.

ADAM optimizer.
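The effect of passing sample_weight with a categorical cross-entropy loss, as in the setup above, amounts to the following (a pure-Python sketch, not the Keras implementation; the normalization by the weight sum is our assumption):

```python
from math import log

def weighted_crossentropy(y_true, y_pred, weights):
    """Sample-weighted categorical cross-entropy.

    y_true: one-hot labels; y_pred: predicted probabilities; weights:
    per-sample algorithmic weights as in Def. 9. Each sample's loss is
    scaled by its weight before averaging.
    """
    total = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        total += -w * sum(ti * log(max(pi, 1e-12))
                          for ti, pi in zip(t, p))
    return total / sum(weights)

loss = weighted_crossentropy(
    y_true=[[1, 0], [0, 1]],
    y_pred=[[0.9, 0.1], [0.2, 0.8]],
    weights=[1.0, 0.5],
)
```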

The hyperparameters for the algorithmic weighting function used were:

Following the theoretical properties of algorithmic regularization, by introducing algorithmic probability weighting we expected to steer the fitting of the target parameters away from random noise and toward the regularities found in the training set. Furthermore, the convergence toward the minimum of the loss function is expected to be significantly faster, in another instance of algorithmic probability speed-up (

The differences in the accuracy of the models observed through the experiments as a function of variance (number of ReLU layers) are summarized in

The first two (upper) plots show the difference between the mean and maximum accuracy obtained through the training of each of the models. The last two (lower) plots show the evolution of accuracy through training for the data sets. The data sets used are (training, test and validation), with data from the MNIST dataset. The (training and validation) data sets were salted, with 40% of the data randomly corrupted, while the test set was not. From the first two plots we can see that the accuracy of the models trained with algorithmic sample weights is consistently higher than that of the models trained without them, and this effect increases with the variance of the models. The drops observed after 4 ReLU layers occur because, up to depth 10, the number of training rounds was constant, while deeper models need more training rounds to reach a minimum of the cost function. When directly comparing the training history of the models of depth 6 and 10 we can see that the stated effect is consistent. Furthermore, at 10 rectilinear units, the unweighted model shows significant overfitting, while the model using the algorithmic weights still leaves room for improvement.

On the left are shown the differences between the slopes of the linear approximation to the evolution of the loss function for the first six weighted and unweighted models. The linear approximation was computed using linear regression over the first 20 training rounds. On the right we have the loss function of the models with 10 ReLU units. From both plots we can see that training toward the minimum of the loss function is consistently faster on the models with the algorithmic complexity sample weights, and that this difference increases with the variance of the model.
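The slope of the linear approximation described above can be computed with an ordinary least-squares fit over the first training rounds; a minimal sketch (function name ours):

```python
import numpy as np

def loss_slope(loss_history, n_rounds=20):
    """Slope of the least-squares line fitted to the first n_rounds
    values of a training-loss history, as in the linear approximation
    described above; np.polyfit performs the regression."""
    y = np.asarray(loss_history[:n_rounds], dtype=float)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# A synthetic loss curve decaying linearly at rate -0.05 per round:
history = [1.0 - 0.05 * t for t in range(30)]
slope = loss_slope(history)   # ≈ -0.05
```

A more negative slope for the weighted models corresponds to the faster convergence reported above.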

The difference in accuracy with respect to the percentage of corrupted pixels and samples in the data set for weighting function (5), for a neural network of depth 4 (four rectilinear units). A positive value indicates that the network trained on the weighted samples reached greater accuracy. The maximum difference was reached at 70% of the samples with 40% of pixels corrupted. From the plot we can see that the networks trained on the weighted samples steadily gained in accuracy until the maximum point was reached. The values shown are the average differences over five networks trained on the same data.

As the data show (

Here we have presented a mathematical foundation within which to solve supervised machine learning tasks using algorithmic information and algorithmic probability theories in discrete spaces. We believe this is the first time a symbolic inference engine has been integrated with more traditional machine learning approaches, constituting not only a path toward bringing symbolic computation and statistical machine learning together, but also allowing a state-to-state and cause-and-effect correspondence between model and data, and therefore a powerful, interpretable, white-box approach to machine learning. This framework is applicable to any supervised learning task, does not require differentiability, and is

We have shown specific examples of its application to different problems. These problems included the estimation of the parameters of an ODE system; the classification of the evolution of elementary cellular automata according to their underlying generative rules; the classification of binary matrices with respect to 10 initial conditions that evolved according to a random elementary cellular automaton; the classification of the evolution of a Boolean NK network with respect to 10 associated binary rules or 10 different network topologies; and the classification of the evolution of a randomly chosen network according to its connectivity (the parameter
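For reference, the generative systems in the cellular automaton tasks can be reproduced with a few lines of code. The sketch below evolves an elementary cellular automaton (a Wolfram rule, 0–255) with periodic boundaries; it is our own illustration of the kind of object classified above, not the paper's experimental code.

```python
import numpy as np

def eca_evolve(rule, init, steps):
    """Evolve an elementary cellular automaton (Wolfram rule 0-255)
    from a binary initial row for a number of steps, with periodic
    boundary conditions. Returns an array of shape (steps+1, width)."""
    table = [(rule >> i) & 1 for i in range(8)]    # output bit per 3-cell pattern
    rows = [np.asarray(init, dtype=int)]
    for _ in range(steps):
        r = rows[-1]
        left, right = np.roll(r, 1), np.roll(r, -1)
        neigh = 4 * left + 2 * r + right           # neighborhood code 0-7
        rows.append(np.array([table[c] for c in neigh]))
    return np.array(rows)

# Rule 110 evolved from a single black cell:
init = np.zeros(16, dtype=int); init[8] = 1
ev = eca_evolve(110, init, 8)
```

Space-time diagrams such as `ev` (binary matrices) are exactly the inputs whose underlying generative rule the algorithmic classifier is asked to recover.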

While simple, the ODE parameter estimation example illustrates the range of applications even in the context of a simple set of equations where the unknown parameters are those explored above in the context of a neural network (

From the results obtained from the first classification task (6.2), we can conclude that our vanilla algorithmic classification scheme performed significantly better than the non-specialized vanilla neural network tested. For the second task (

For finding the underlying topology and the Boolean functions associated with each node, the naive neural network achieved a performance of 92.50%, compared to 91.35% for our algorithmic classifier. However, when classifying with respect to the topology, our algorithmic classifier showed a significant difference in performance, with 39.75% greater accuracy. There was also a significant difference in performance on the fourth task, with the algorithmic classifier reaching an accuracy of 70%, compared to 43% for the best neural network tested.

We also discussed some of the limitations and challenges of our approach, as well as ways to combine it with, and complement, more traditional statistical approaches in machine learning. Chief among these limitations is the current lack of a comprehensive Turing-machine-based conditional CTM database, required for the strong version of conditional BDM. We expect to address this limitation in the future.

It is important to emphasize that we are not claiming that no neural network is able to obtain results similar to, or even better than, those of our algorithms. Nor do we claim that algorithmic probability classification in its current form is better on any given metric than the extensive existing methods developed for deep learning classification. We have, however, introduced a completely different view, with a new set of strengths and weaknesses, which with further development could represent a better-grounded alternative suited to a subset of tasks beyond statistical classification, where finding generative mechanisms or first principles is the main goal, with all its attendant difficulties and challenges.

Publicly available datasets were analyzed in this study. This data can be found here:

SH-O designed and executed experiments, HZ conceived and designed experiments. HZ, JR, NK, and JT contributed to experiments and supervision. SH-O, HZ, JR, AU, and NK contributed to interpreting results and conceiving experiments. SH-O and HZ wrote the paper. All authors revised the paper.

SH-O was supported by grant SECTEI/137/2019 awarded by Subsecretaría de Ciencia, Tecnología e Innovación de la Ciudad de México (SECTEI).

Authors HZ, SH-O, and JR are employed by the company Oxford Immune Algorithmics Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

All modern computing languages possess this property, given that they are stored and interpreted by binary systems.

An unlabeled classificatory algorithmic information schema has been proposed by (

The notion of degree of sophistication for computable objects was defined by Koppel (

A number

We say that an NN topology is naive when its design does not use specific knowledge of the target data.