Algorithmic Probability-Guided Machine Learning on Non-Differentiable Spaces

We show how complexity theory can be introduced in machine learning to help bring together apparently disparate areas of current research. We show that this model-driven approach may require less training data and can potentially be more generalizable, as it shows greater resilience to random attacks. In an algorithmic space the order of its elements is given by their algorithmic probability, which arises naturally from computable processes. We investigate the shape of a discrete algorithmic space when performing regression or classification using a loss function parametrized by algorithmic complexity, demonstrating that differentiability is not required to achieve results similar to those obtained using differentiable programming approaches such as deep learning. In doing so we use examples that enable the two approaches to be compared (small, given the computational power required for estimations of algorithmic complexity). We find and report that 1) machine learning can successfully be performed on a non-smooth surface using algorithmic complexity; 2) solutions can be found using an algorithmic-probability classifier, establishing a bridge between a fundamentally discrete theory of computability and a fundamentally continuous mathematical theory of optimization methods; 3) a formulation of an algorithmically directed search technique in non-smooth manifolds can be defined and conducted; and 4) exploitation techniques and numerical methods for algorithmic search can be used to navigate these discrete non-differentiable spaces, with applications to (a) the identification of generative rules from data observations; (b) image classification problems that are more resilient against pixel attacks than neural networks; (c) the identification of equation parameters from a small data set in the presence of noise in a continuous ODE system; and (d) the classification of Boolean NK networks by (1) network topology, (2) underlying Boolean function, and (3) number of incoming edges.

Notice that in both equations we have the sum over all the pairs that are in both sets, $Adj(X)$ and $Adj(Y)$, with the difference being in the terms corresponding to the multiplicity. Now we have to consider two cases. If $n^x_i = n^y_i$ we have the equality. Otherwise, in the first equation we have terms of the form $\log(n^x_j)\, f(n^x_k, n^y_k)$, which, by the Def. of $f$, is 0; analogously for the second equation. Therefore, we have the equality.

Coarseness and Relationship With Entropy
As mentioned in the previous section, the goal behind the definition of coarse conditional BDM, BDM(X|Y), is to measure the amount of information contained in X that is not present in Y. Ideally, this is measured by the conditional algorithmic information K(X|Y). Definition 5 includes the adjective coarse given that, as we will show in this section, its behaviour is closer to Shannon entropy H than to the algorithmic information measure K, relying heavily on the entropy-like behaviour of BDM.
The conditional algorithmic information content function K is incomputable and therefore represents a theoretical ideal that cannot be reached in practice. By construction, coarse conditional BDM is an approximation to this measure. However, it differs in not taking into account two sources of information: the information content shared between base blocks and the position of each block.
As an example of the first limitation, consider the string 101010...10 and its negation 010101...01. Intuitively, we know that both strings are algorithmically close, but for a partition strategy that divides the string into non-overlapping substrings of size 2, the Adj sets {({10}, n)} and {({01}, n)} are disjoint. Therefore conditional BDM assigns the maximum BDM value to what is in fact shared information content. Even with this limitation, we argue that conditional BDM is a better approximation to K than entropy, mainly because BDM uses the CTM approximation for each block, rather than just its distribution, along with the information content of its multiplicity, thus giving a more accurate approximation to the overall algorithmic information content of the non-shared base blocks.
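To make the block bookkeeping concrete, the following Python sketch (the helper names are ours, not taken from any implementation of the method) builds the Adj multisets of the two strings under a non-overlapping partition of size 2 and confirms that they are disjoint, so no block of the first string is discounted by the second.

```python
from collections import Counter

def adj(string, block_size):
    """Adj set of a string under a non-overlapping partition:
    a multiset {block: multiplicity}."""
    blocks = [string[i:i + block_size] for i in range(0, len(string), block_size)]
    return Counter(blocks)

x = "10" * 10   # 1010...10
y = "01" * 10   # 0101...01, the negation of x

adj_x, adj_y = adj(x, 2), adj(y, 2)
print(adj_x)                     # Counter({'10': 10})
print(adj_y)                     # Counter({'01': 10})
print(set(adj_x) & set(adj_y))   # set(): the Adj sets are disjoint, so the
                                 # information shared between x and y is invisible
                                 # to conditional BDM under this partition
```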
The second limitation can become a significant factor when the size of the base blocks is small compared to that of the objects analysed, given that positional information can then become the dominant component of the information content within an object. This is an issue, shared with entropy, that conditional BDM inherits from the numerical challenges of CTM in BDM. However, conditional BDM has the added benefit that it is defined for finite tensors generated from different distributions by assuming the so-called universal distribution ([40]) (known to dominate any other approach) as the underlying distribution between the two 'events'.

[Figure 11: Conditional BDM (BDM(X|Y)) and conditional entropy (H(X|Y)), respectively, corresponding to 5000 pairs of strings randomly chosen from a distribution where the expected number of 1s is the value shown on the x axis, divided by the conditional BDM or conditional entropy of the first element of the pair with respect to an unrelated randomly chosen binary string. All strings are of length 20. The partition strategy used for BDM is sets of size 1. The plot shows that conditional BDM manages to capture the statistical relationship between finite strings generated from the same distribution.]

Empirical Comparison with Entropy
Owing to the origins of the BDM function, the asymptotic relationship between coarse conditional BDM and conditional entropy follows from the relationship between BDM and entropy ([49]). In this section we focus on empirical evidence for this relationship, along with exploring the impact of the partition strategy for unidimensional objects. Further theoretical properties that establish the well-behavedness of conditional BDM are set forth in the Appendix (Section 10.1). For this numerical experiment we generated a sample of 19,000 random binary strings of length 20 that are pairwise related, coming from one of 19 biased distributions where the expected number of 1s varies from 1 to 19. For each pair we computed the conditional BDM with partitions of size 1 and divided it by the conditional BDM of the first string with respect to a random string drawn from a uniform distribution. To both the divisor and the dividend we added 1 to avoid divisions by zero. We repeated the experiment for conditional entropy. Both results were normalized by dividing the quotients by the maximum value obtained for each distribution. In Figure 11 we show the average obtained for each biased distribution.
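For reference, the following sketch reproduces the conditional-entropy side of this experiment (the conditional BDM side would require the precomputed CTM tables and is omitted); the function names are ours, and the per-distribution normalization by the maximum value is left out for brevity.

```python
import random
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def cond_entropy(x, y):
    """Empirical conditional block entropy H(X|Y) = H(X,Y) - H(Y), blocks of size 1."""
    return entropy(Counter(zip(x, y))) - entropy(Counter(y))

def biased_string(n, p):
    """Binary string of length n with each bit set to 1 with probability p."""
    return "".join("1" if random.random() < p else "0" for _ in range(n))

n = 20
for k in range(1, 20):                    # expected number of 1s: 1..19
    p, ratios = k / n, []
    for _ in range(1000):                 # related pairs sampled per distribution
        x, y = biased_string(n, p), biased_string(n, p)   # pair from the same distribution
        z = biased_string(n, 0.5)                         # unrelated uniform string
        ratios.append((cond_entropy(x, y) + 1) / (cond_entropy(x, z) + 1))
    print(k, sum(ratios) / len(ratios))
```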
From Figure 11 we can see that as the underlying distribution associated with the strings becomes increasingly biased, the expected shared information content of two related strings is higher (conditional BDM is lower) when compared to the conditional BDM of two unrelated strings. This behaviour is congruent with what we expect and observe for conditional entropy. That the area under the normalized curve is smaller is expected, given that BDM is a finer-grained information-content measure than entropy and is not perfectly symmetric, since BDM and CTM are computational approximations to an uncomputable function and are also inherently more sensitive to the fundamental limits of computable random number generators.

[Figure 12: Average normalized conditional BDM for pairs of strings drawn from the same distribution: uniform (ten 1s expected), biased 3/20 (three 1s expected), biased 1/4 (five 1s expected) and biased 7/20 (seven 1s expected). The x axis indicates the partition size used to compute the respective conditional BDM value, which was normalized by dividing it by the partition size.]

The Impact of the Partition Strategy
As shown in previous results ([49]), BDM better approximates the universal measure K(X) as the number of elements resulting from applying the partition strategy {α_i} to X decreases. However, this is not the case for conditional BDM. Instead, BDM(X|Y) is a good approximation to K(X|Y) when Adj(X) and Adj(Y) share a high number of base tensors, and the probability of this occurring decreases as the number of elements in the partition decreases. For this reason we must point out that conditional BDM depends on the chosen partition strategy {α_i}.
As a simple example, consider the binary string X = 11110000 and its inverse Y = 00001111. Since we have the CTM approximation for strings of size 8, the best BDM value for each string is obtained when Adj(X) = {(11110000, 1)} and Adj(Y) = {(00001111, 1)}. However, given that the elements of the partitions are different, we have that BDM(11110000|00001111) = BDM(11110000) = 25.1899, even though intuitively we know that, algorithmic information-wise, they should be very close. Conditional BDM is, however, able to capture this with non-overlapping partitions of sizes 1 to 4, assigning a value of 0 to BDM(X|Y).
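A short sketch of this example (again with our own helper names) shows how the partition size determines whether the two Adj sets overlap:

```python
from collections import Counter

def adj(s, size):
    # Non-overlapping partition of s into blocks of the given size, as a multiset
    return Counter(s[i:i + size] for i in range(0, len(s), size))

X, Y = "11110000", "00001111"

for size in (8, 4, 2, 1):
    ax, ay = adj(X, size), adj(Y, size)
    print(size, dict(ax), dict(ay), "shared:", set(ax) & set(ay))
# size 8: no shared blocks, so BDM(X|Y) collapses to BDM(X)
# sizes 4, 2, 1: every block of X appears in Y with the same multiplicity,
#                and the conditional value drops to 0, as described above
```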
We conjecture that there is no general strategy for finding a best partition strategy. This is an issue shared with conditional block entropy, and with the original BDM definition. At worst, conditional BDM will behave like conditional entropy when the two are comparable, while in the best cases it remains close to the ideal of conditional algorithmic complexity. Thus the partition strategy can be considered a hyperparameter that can be empirically optimized from the available data.
We performed a numerical experiment to observe this behaviour by generating 2,400,000 random binary strings of size 20, with groups of 600,000 strings belonging to one of four distributions: uniform (ten 1s expected), biased 3/20 (three 1s expected), biased 1/4 (five 1s expected) and biased 7/20 (seven 1s expected). We then formed pairs of strings belonging to the same distribution and computed the conditional BDM using partition sizes from 1 to 20, for a total of 30,000 pairs per data point, normalizing the result by dividing it by the partition size to avoid this factor becoming dominant. In Figure 12 we show the average obtained for each data point.
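A minimal sketch of this sweep is given below. The coarse conditional BDM used here is one reading of the definition described above (blocks of X absent from Y are charged their CTM value plus a multiplicity term; shared blocks contribute only a multiplicity term, and only when the multiplicities differ), and `ctm` is a placeholder for the precomputed CTM lookup table, so the printed values illustrate the procedure rather than reproduce the reported numbers.

```python
import random
from collections import Counter
from math import log2

def ctm(block):
    # Placeholder for the precomputed CTM table; NOT real CTM values.
    return len(block)

def adj(s, size):
    # Non-overlapping partition as a multiset {block: multiplicity}
    return Counter(s[i:i + size] for i in range(0, len(s), size))

def conditional_bdm(x, y, size):
    """Coarse conditional BDM(X|Y) under one reading of the definition above."""
    ax, ay = adj(x, size), adj(y, size)
    total = 0.0
    for block, n in ax.items():
        if block not in ay:
            total += ctm(block) + log2(n)   # non-shared block: full value
        elif n != ay[block]:
            total += log2(n)                # shared block, differing multiplicity
    return total

def biased_string(n, p):
    return "".join("1" if random.random() < p else "0" for _ in range(n))

n = 20
pairs = 30_000    # per data point, as in the experiment; reduce (e.g. to 1_000) for a quick run
for p in (10 / 20, 3 / 20, 5 / 20, 7 / 20):        # the four distributions
    for size in range(1, n + 1):
        vals = [conditional_bdm(biased_string(n, p), biased_string(n, p), size) / size
                for _ in range(pairs)]
        print(p, size, sum(vals) / len(vals))
```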