Knowledge Transfer Between Artificial Intelligence Systems

We consider the following fundamental question: how could a legacy "student" Artificial Intelligence (AI) system learn from a legacy "teacher" AI system or a human expert without re-training and, most importantly, without requiring significant computational resources? Here "learning" is understood broadly as the ability of one system to mimic the responses of the other to incoming stimulation, and vice versa. We call such learning Artificial Intelligence knowledge transfer. We show that if the internal variables of the "student" AI system have the structure of an n-dimensional topological vector space and n is sufficiently high, then, with probability close to one, the required knowledge transfer can be implemented by simple cascades of linear functionals. In particular, for n sufficiently large, with probability close to one, the "student" system can successfully and non-iteratively learn k ≪ n new examples from the "teacher" (or correct the same number of mistakes) at the cost of two additional inner products. The concept is illustrated with an example of knowledge transfer from one pre-trained convolutional neural network to another.


INTRODUCTION
The explosive development of neuroinformatics and Artificial Intelligence (AI) in recent years gives rise to new fundamental scientific and societal challenges. Developing technologies, professions, vocations, and corresponding educational environments for the sustained generation of an ever-growing number of AI systems is currently recognized as among the most crucial of these (Hall and Pesenti, 2017). Nurturing and growing relevant human expertise is considered one way to address the challenge. The next step, however, is to develop technologies whereby one or several AI systems produce a training environment for others, leading to a fully automated passage of knowledge and experience between otherwise independent AI agents.
Knowledge transfer between Artificial Intelligence systems has been the subject of extensive discussion in the literature for more than two decades (Gilev et al., 1991; Jacobs et al., 1991; Pratt, 1992; Schultz and Rivest, 2000; Buchtala and Sick, 2007; see also the comprehensive review of Pan and Yang, 2010). Several technical ideas to achieve AI knowledge transfer have been explored to date. Using, or salvaging, parts of the "teacher" AI system in the "student" AI, followed by re-training of the "student", has been proposed and extensively tested in Yosinski et al. (2014) and Chen et al. (2015). Alternatives to AI salvaging include model compression (Bucila et al., 2006), knowledge distillation (Hinton et al., 2015), and privileged information (Vapnik and Izmailov, 2017). These approaches have demonstrated substantial success in improving the generalization capabilities of AIs as well as in reducing computational overheads (Iandola et al., 2016) in cases of knowledge transfer from a larger AI to a smaller one. Regardless of which of the above strategies is followed, however, their computational implementation, even for the case of transferring or learning just a handful of new examples, often requires either significant resources, including access to large training sets and the power needed for training, or the availability of privileged information that may not necessarily be available to end-users. This contrasts sharply with natural intelligence: recent empirical evidence reveals that single neurons in the human brain are capable of rapid learning of new stimuli (Ison et al., 2015). Thus new frameworks and approaches are needed.
In this contribution we provide a new framework for an automated, fast, and non-destructive process of knowledge spreading across AI systems of varying architectures. In this framework, knowledge transfer is accomplished by means of Knowledge Transfer Units comprising mere linear functionals and/or their small cascades. The main mathematical ideas are rooted in measure concentration (Gibbs, 1902; Lévy, 1951; Gromov, 1999, 2003; Gorban, 2007) and stochastic separation theorems (Gorban and Tyukin, 2017, 2018) revealing peculiar properties of random sets in high dimensions. We generalize some of the latter results here and show how these generalizations can be employed to build simple one-shot Knowledge Transfer algorithms between heterogeneous AI systems whose state may be represented by elements of a linear vector space of sufficiently high dimension. Once knowledge has been transferred from one AI to another, the approach also allows the "student" AI to "unlearn" the new knowledge without the need to store a complete copy of the "student" AI created prior to learning. We expect that the proposed framework may pave the way for a fully functional new phenomenon: a Nursery of AI systems, in which AIs quickly learn from each other whilst keeping their pre-existing skills largely intact.
The paper is organized as follows. In section 2 we introduce a general framework for computationally efficient, non-iterative AI Knowledge Transfer and present two algorithms for transferring knowledge between a pair of AI systems in which one operates as a teacher and the other as a student. These results are based on Stochastic Separation Theorems (Gorban and Tyukin, 2017), of which the relevant versions are provided here as the mathematical background justifying the approach. Section 3 illustrates the approach with examples, and section 4 concludes the paper.

General Setup
Consider two AI systems: a student AI, denoted AI s , and a teacher AI, denoted AI t . These legacy AI systems process input signals, produce internal representations of the input, and return outputs. We further assume that some relevant information about the inputs, internal signals, and outputs of AI s can be combined into a common object, x, representing, but not necessarily defining, the state of AI s . The objects x are assumed to be elements of R n .
Over a period of activity, the system AI s generates a set S of objects x. The exact composition of the set S may depend on the task at hand. For example, if AI s is an image classifier, we may be interested only in a particular subset of AI s input-output data related to images of a certain known class. Relevant inputs and outputs of AI s corresponding to objects in S are then evaluated by the teacher, AI t . If the outputs of AI s differ from those of AI t for the same input, then an error is registered in the system. Objects x ∈ S associated with errors are combined into the set Y. The procedure thus gives rise to two disjoint sets: the set Y of objects associated with errors, and the set M = S \ Y of objects associated with correct responses. A diagram schematically representing the process is shown in Figure 1. The knowledge transfer task is to "teach" AI s so that (a) AI s no longer makes such errors, (b) the existing competencies of AI s on the set of inputs corresponding to internal states x ∈ M are retained, and (c) knowledge transfer from AI t to AI s is reversible in the sense that AI s can "unlearn" the new knowledge by modifying just a fraction of its parameters, if required.
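As a minimal sketch of this bookkeeping step (function and variable names here are hypothetical, not part of the paper's formal setup), the partition of S into M and Y can be expressed as:

```python
import numpy as np

def partition_states(states, student_outputs, teacher_outputs):
    """Split the state representations S into the error set Y and the
    set M of correctly handled states, by comparing the student's
    outputs with the teacher's labels for the same inputs."""
    errors = np.array([s != t for s, t in zip(student_outputs, teacher_outputs)])
    Y = states[errors]      # states associated with registered errors
    M = states[~errors]     # states on which the student agrees with the teacher
    return M, Y

# Toy usage: five states in R^3; the teacher disagrees on items 1 and 4.
S = np.arange(15, dtype=float).reshape(5, 3)
M, Y = partition_states(S, [0, 1, 0, 1, 1], [0, 0, 0, 1, 0])
```

By construction M and Y are disjoint and together exhaust S, matching the setup above.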
Before proceeding with the proposed solution to the above AI Knowledge Transfer problem, basic yet fundamental properties of the sets Y and M need to be understood. These properties are summarized and illustrated in Theorems 1, 2, and 3 below.

Stochastic Separation Theorems for Non-iterative AI Knowledge Transfer
Let the sets M and Y be drawn i.i.d. from the same distribution at random. What is the probability that there is a linear functional separating Y from M and, most importantly, is there a computationally simple and efficient way to determine it? Below we provide three k-tuple separation theorems: for an equidistribution in the unit ball B n (1) (Theorems 1 and 2) and for a product probability measure with bounded support (Theorem 3). These two special cases cover or, indeed, approximate a broad range of practically relevant situations, including, e.g., Gaussian distributions (which reduce asymptotically to the equidistribution in B n (1) for n large enough) and data vectors in which each attribute is a numerical and independent random variable. The computational complexity of determining the separating functionals, as specified by the theorems and their proofs, can be remarkably low. If no pre-processing is involved, then deriving the functionals stemming from Theorems 1 and 2 requires merely k = |Y| vector additions and, possibly, an approximate solution of a constrained optimization problem in two dimensions. For large data sets, this is a significant advantage over support vector machines, whose worst-case computational complexity is O((k + M)^3) (Bordes et al., 2005; Chapelle, 2007).

FIGURE 1 | AI Knowledge transfer diagram. AI s produces a set of its state representations, S. The representations are labeled by AI t into the set of correct responses, M, and the set of errors, Y. The student system, AI s , is then augmented by an additional "corrector" eliminating these errors.

Consider the case when the underlying probability distribution is an equidistribution in the unit ball B n (1), and suppose that M = {x 1 , . . . , x M } and Y = {x M+1 , . . . , x M+k } are i.i.d. samples from this distribution. We are interested in determining the probability P 1 (M, Y) that there exists a linear functional separating M from Y. An estimate of this probability is provided in the following theorem.
If the pair (δ, ε) is a solution of the nonlinear optimization program in (1), then the corresponding separating hyperplane is ℓ 0 (x). The proof of the theorem is provided in the Appendix. Figure 2 shows how estimate (1) of the probability P 1 (M, Y) behaves as a function of |Y| for fixed M and n. As one can see from this figure, when k exceeds some critical value (k = 9 in this specific case), the lower-bound estimate (1) of the probability P 1 (M, Y) drops. This is not surprising since the bound (1) is (A) based on conservative estimates, and (B) derived for just one class of separating hyperplanes, ℓ 0 (x). Furthermore, no prior pre-processing and/or clustering was assumed for the set Y. An alternative estimate that allows us to account for possible clustering in the set Y is presented in Theorem 2.
If the pair (δ, ε) is a solution of the nonlinear optimization program in (3), then the corresponding separating hyperplane is defined accordingly. The proof of the theorem is provided in the Appendix.
Examples of estimates (3) for various parameter settings are shown in Figure 3. As one can see, in the absence of the pairwise strictly-positive-correlation assumption, β 2 = 0, the estimate's behavior as a function of k is similar to that of (1). However, the presence of moderate pairwise positive correlation results in significant boosts to the values of P 1 .
Remark 1. Estimates (1) and (3) for the probability P 1 (M, Y) that follow from Theorems 1 and 2 assume that the underlying probability distribution is an equidistribution in B n (1). They can, however, be generalized to equidistributions in ellipsoids and to Gaussian distributions (cf. Gorban et al., 2016a,b). Tighter probability bounds could also be derived if the upper-bound estimates of the volumes of the corresponding spherical caps in the proofs of Theorems 1 and 2 were replaced with their exact values (see e.g., Li, 2011).
Remark 2. Note that Theorems 1 and 2 not only provide lower-bound estimates of the probability that two random i.i.d. drawn samples from B n (1) are linearly separable, but also explicitly present the separating hyperplanes. The latter hyperplanes are similar to Fisher linear discriminants in that the discriminating direction (the normal to the hyperplane) is the difference between the centroids.
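The centroid-difference construction of Remark 2 admits a short sketch. The midpoint threshold used below is a simplifying assumption for illustration; it is not the optimized (δ, ε) threshold specified by the theorems:

```python
import numpy as np

def centroid_hyperplane(M, Y):
    """Separating functional l(x) = <w, x> - c, with w the normalized
    difference between the centroids of Y and M (cf. Remark 2).
    The threshold c is a simple midpoint choice, not the optimized
    (delta, eps) value from Theorems 1 and 2."""
    w = Y.mean(axis=0) - M.mean(axis=0)
    w = w / np.linalg.norm(w)
    c = 0.5 * (w @ Y.mean(axis=0) + w @ M.mean(axis=0))
    return lambda x: w @ x - c

# Toy usage in high dimension: a small shifted cluster of "errors" Y
# is easily cut off from the bulk M by this single linear functional.
rng = np.random.default_rng(0)
n = 200
M = rng.normal(size=(1000, n))        # "correct" states
Y = rng.normal(size=(10, n)) + 1.0    # small cluster of errors
l = centroid_hyperplane(M, Y)
```

In high dimension the two projected clouds concentrate tightly around their means, so even this naive threshold separates them cleanly.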
Whilst having explicit separation functionals as well as thresholds is an obvious advantage from a practical viewpoint, the estimates associated with such functionals do not account for more flexible alternatives. In what follows we present a generalization of the above results that accounts for such a possibility and extends the applicability of the approach to samples from product distributions. The results are provided in Theorem 3.
Theorem 3. Consider the linear space E = span{x j − x M+1 | j = M + 2, . . . , M + k}, and let the cardinality |Y| = k of the set Y be smaller than n. Consider the quotient space R n /E. Let Q(x) be a representation of x ∈ R n in R n /E, and let the coordinates of Q(x i ), i = 1, . . . , M + 1, be independent random variables i.i.d. sampled from a product distribution in a unit cube with variances σ i > 0. The proof of the theorem is provided in the Appendix.
Having introduced Theorems 1-3, we are now ready to formulate our main results: algorithms for non-iterative AI Knowledge Transfer.

Knowledge Transfer Algorithms
Our first algorithm, Algorithm 1, considers cases where Auxiliary Knowledge Transfer Units, i.e., functional additions to the existing student AI s , are single linear functionals. The second algorithm, Algorithm 2, extends Auxiliary Knowledge Transfer Units to two-layer cascades of linear functionals.
The algorithms comprise two general stages: a pre-processing stage and a knowledge transfer stage. The purpose of the pre-processing stage is to regularize and "sphere" the data. This operation brings the setup close to the one considered in the statements of Theorems 1 and 2. The knowledge transfer stage constructs Auxiliary Knowledge Transfer Units in a way that closely follows the argument presented in the proofs of Theorems 1 and 2.

Algorithm 1: Single-functional AI Knowledge Transfer

1. Pre-processing
(a) Centering. For the given set S, determine the set average, x̄(S), and generate the centered sets S c and Y c .
(b) Regularization. Let λ 1 ≥ . . . ≥ λ n be the eigenvalues of Cov(S c ) and h 1 , . . . , h n be the corresponding eigenvectors. If the conditioning of Cov(S c ) is too large, project S c and Y c onto an appropriately chosen set of m < n eigenvectors, h n−m+1 , . . . , h n , where H = (h n−m+1 · · · h n ) is the matrix comprising the m significant principal components of S c .
(c) Whitening. For the centered and regularized dataset S r , derive its covariance matrix, Cov(S r ), and generate the whitened sets S w and Y w .

2. Knowledge transfer
(a) Clustering. Partition Y w into clusters Y w,1 , . . . , Y w,p so that elements of these clusters are, on average, pairwise positively correlated; that is, there are β 1 ≥ β 2 > 0 bounding these correlations.
(b) Construction of Knowledge Transfer Units. For each i = 1, . . . , p, construct separating functionals ℓ i from the sets Y w,i and S w \ Y w,i , respectively, with the thresholds c i chosen as in the proofs of Theorems 1 and 2.
(c) Integration. Integrate the Auxiliary Knowledge Transfer Units into the decision-making pathways of AI s . If, for an x generated by an input to AI s , any ℓ i (x) ≥ 0, then report x accordingly (swap labels, report as an error, etc.).
When the assumptions of the theorems hold, the covariance matrix of the whitened data is close to the identity matrix, and the functionals ℓ i are good approximations of (10). In this setting, one might expect that the performance of the knowledge transfer stage would also be closely aligned with the corresponding estimates (1), (3).
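The two stages of Algorithm 1 can be sketched end-to-end as follows. This is an illustrative implementation under simplifying assumptions: the clustering is a minimal k-means, and each threshold c i is set just below its cluster's projections rather than by the optimized choice from Theorems 1 and 2; all names are hypothetical.

```python
import numpy as np

def kmeans(X, p, iters=20, seed=0):
    """Minimal k-means for Step 2(a); any proximity-based clustering would do."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=p, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for i in range(p):
            if (labels == i).any():
                centers[i] = X[labels == i].mean(0)
    return labels

def fit_transfer_units(S, Y, m=None, p=2):
    """Algorithm 1 sketch. Returns (units, preprocess): each unit (w, c)
    defines l(x) = <w, x> - c on preprocessed (centered/whitened) states."""
    mean = S.mean(axis=0)                              # 1(a) centering
    Sc, Yc = S - mean, Y - mean
    lam, H = np.linalg.eigh(np.cov(Sc, rowvar=False))  # 1(b) regularization
    if m is not None:                                  # keep m leading components
        H, lam = H[:, -m:], lam[-m:]
    W = H / np.sqrt(lam)                               # 1(c) whitening map
    Sw, Yw = Sc @ W, Yc @ W
    labels = kmeans(Yw, p)                             # 2(a) clustering of errors
    units = []
    for i in range(p):                                 # 2(b) per-cluster functionals
        Yi = Yw[labels == i]
        if len(Yi) == 0:
            continue
        w = Yi.mean(axis=0) - Sw.mean(axis=0)
        w /= np.linalg.norm(w)
        c = (Yi @ w).min()                             # flags the whole cluster
        units.append((w, c))
    return units, (lambda x: (x - mean) @ W)

# Toy usage: two well-separated clusters of "errors" around a bulk of states.
rng = np.random.default_rng(0)
S = rng.normal(size=(400, 10))                    # correctly handled states
Y = np.vstack([rng.normal(size=(10, 10)) + 3.0,   # two clusters of errors
               rng.normal(size=(10, 10)) - 3.0])
units, pre = fit_transfer_units(S, Y, p=2)
flagged = [any(w @ pre(y) - c >= 0 for w, c in units) for y in Y]
```

By construction every error in Y triggers at least one unit, which mirrors Step 2(c): any input whose state fires a unit is reported accordingly.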
Remark 3. Note that the whitening transformation requires the matrix Cov(S r ) to be invertible. The latter property is guaranteed by the regularization step in the pre-processing stage of Algorithm 1.

Remark 4. Clustering at Step 2.a can be achieved by classical k-means algorithms (Lloyd, 1982) or any other method (see e.g., Duda et al., 2000) that groups elements of Y w into clusters according to spatial proximity.
Remark 5. Auxiliary Knowledge Transfer Units in Step 2.b of Algorithm 1 are derived in accordance with the standard Fisher linear discriminant formalism. This, however, need not be the case, and other methods, e.g., support vector machines (Vapnik, 2000), could be employed for this purpose. It is worth mentioning, however, that support vector machines might be prone to overfitting (Han, 2014) and their training often involves iterative procedures such as sequential minimal optimization (Platt, 1999).
Furthermore, instead of the sets Y w,i and S w \ Y w,i , one could use a somewhat more aggressive division: Y w,i and S w \ Y w , respectively.
Depending on the configuration of the samples S and Y, Algorithm 1 may occasionally create Knowledge Transfer Units, ℓ i , that "filter" errors too aggressively. That is, some x ∈ S w \ Y w may accidentally trigger a non-negative response, ℓ i (x) ≥ 0, and, as a result, their corresponding inputs to AI s could be ignored or mishandled. To mitigate this, one can increase the number of clusters and, correspondingly, of Knowledge Transfer Units. This will increase the probability of successful separation and hence alleviate the issue. An alternative practical strategy, when the system is evolving in time, is to limit the number of Knowledge Transfer Units by retaining only the most relevant ones, taking into account acceptable performance rates and the size of the "relevant" set S. The link between these is provided in Theorems 1, 2. On the other hand, if increasing the number of Knowledge Transfer Units or dismissing less relevant ones is undesirable for some reason, then two-functional units could be a feasible remedy. Algorithm 2 presents a procedure for such an improved AI Knowledge Transfer.

Algorithm 2: Two-functional AI Knowledge Transfer

1: Do as in Step 2.b of Algorithm 1. At the end of this step, first-stage functionals ℓ i , i = 1, . . . , p, will have been derived.
2: For each set Y w,i , i = 1, . . . , p, evaluate the functionals ℓ i for all x ∈ S w \ Y w,i and identify elements x such that ℓ i (x) ≥ 0 and x ∈ S w \ Y w (incorrect error assignment). Let Y e,i be the set containing such elements x.
3: If there is an i ∈ {1, . . . , p} such that |Y e,i | + |Y w,i | > m, then increment the value of p: p ← p + 1, and return to Step 2.a.
4: If all sets Y e,i are empty, then proceed to Step 2.c.
5: For each pair of ℓ i and Y w,i ∪ Y e,i with Y e,i non-empty, orthogonally project the sets Y w,i and Y e,i onto the hyperplane ℓ i (x) = 0, forming the sets L i (Y w,i ) and L i (Y e,i ), and construct second-stage functionals ℓ 2,i separating L i (Y w,i ) from L i (Y e,i ).
6: Integrate the Auxiliary Knowledge Transfer Units into the decision-making pathways of AI s . If, for an x generated by an input to AI s , any of the predicates (ℓ i (x) ≥ 0) ∧ (ℓ 2,i (x) ≥ 0) holds true, then report x accordingly (swap labels, report as an error, etc.).
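The resulting two-stage decision rule can be sketched as follows (a minimal illustration; the functional weights are assumed to come from Algorithm 2, and all names are hypothetical):

```python
import numpy as np

def cascade_flags(x, units):
    """Evaluate two-layer Knowledge Transfer Units on a preprocessed state x.

    units: list of ((w1, c1), second) pairs, where second is either None
    (no refinement needed for that cluster) or a pair (w2, c2).
    Returns True iff some unit fires, i.e. l_i(x) >= 0 and, where present,
    l_{2,i}(x) >= 0 -- the conjunction from Step 6 of Algorithm 2."""
    for (w1, c1), second in units:
        if w1 @ x - c1 >= 0:
            if second is None:
                return True
            w2, c2 = second
            if w2 @ x - c2 >= 0:
                return True
    return False

# Toy usage in R^2: the first stage fires for x[0] >= 1, the second for
# x[1] >= 1, so only states satisfying both predicates are reported.
units = [((np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 1.0))]
```

The second-stage functional thus acts as a veto: states that trigger ℓ i but lie on the wrong side of ℓ 2,i are left untouched, which is exactly how the cascade suppresses incorrect error assignments.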
In what follows we illustrate the approach, as well as the application of the proposed Knowledge Transfer algorithms, in the relevant problem of designing a computer vision system for pedestrian detection in live video streams.

EXAMPLE
Let AI s and AI t be two systems developed, e.g., for the purposes of pedestrian detection in live video streams. Technological progress in embedded systems and the availability of platforms such as the Nvidia Jetson TX2 have made hardware deployment of such AI systems at the edge of computer vision processing pipelines feasible. These platforms, however, lack the computational power needed to run state-of-the-art large-scale object detection solutions such as ResNet (He et al., 2016) in real time. Smaller convolutional neural networks such as SqueezeNet (Iandola et al., 2016) could be a way forward. Still, these latter systems have hundreds of thousands of trainable parameters, which is typically several orders of magnitude larger than in, e.g., Histograms of Oriented Gradients (HOG) based systems (Dalal and Triggs, 2005). Moreover, training these networks requires substantial computational resources and data.
In this section we illustrate the application of the proposed AI Knowledge Transfer technology and demonstrate that it can be successfully employed to compensate for the lack of power of an edge-based device. In particular, we suggest that the edge-based system be "taught" by the state-of-the-art teacher in a non-iterative and near-real-time way. Since our building blocks are linear functionals, such learning does not lead to significant computational overheads. At the same time, as we show later, the proposed AI Knowledge Transfer results in a major boost to the system's performance in the conditions of the experiment.

Definition of AI s and AI t and Rationale
In our experiments, the teacher AI, AI t , was modeled by an adaptation of SqueezeNet (Iandola et al., 2016) with circa 725 K trainable parameters. The network was trained on a "teacher" dataset comprising 554 K non-pedestrian (negative) and 56 K pedestrian (positive) images. The positives were then subjected to standard augmentation accounting for various geometric and color perturbations. The network was trained for 100 epochs, which took approximately 16 h on an Nvidia Titan Xp to complete. The student AI, AI s , was modeled by a linear classifier with HOG features (Dalal and Triggs, 2005) and 2016 trainable parameters. The values of these parameters resulted from training AI s on a "student" dataset, a sub-sample of the "teacher" dataset comprising 16 K positives (55 K after augmentation) and 130 K negatives, respectively. The choice of the AI s and AI t systems enabled us to emulate interaction between low-power edge-based AIs and their more powerful counterparts that could be deployed on a higher-spec embedded system or, possibly, on a server or in a computational cloud.
We note that for both AI t and AI s the set of negatives is several times larger than the set of positives, which makes the datasets somewhat unbalanced. Unbalanced datasets are not uncommon in object detection tasks, and there are several reasons why they may emerge in practice. Every candidate for inclusion in the set of positives is typically subjected to thorough human inspection. This makes the process time-consuming and expensive and, as a result, imposes limitations on the achievable size of the set. Negatives are generally easier to generate. Note also that the set of negatives in our experiments is essentially the set of all objects that are not pedestrians. This latter set has a significantly broader spectrum of variations than the set of positives. Accounting for this larger variability without imposing further prior assumptions or knowledge can be achieved via larger samples, and this was the strategy we adopted here.
In order to make the experiment more realistic, we assumed that the internal states of both systems are inaccessible for direct observation. To generate the sets S and Y required in Algorithms 1 and 2, we augmented system AI s with an external generator of HOG features of the same dimension. We assumed, however, that positives and negatives from the "student" dataset are available for the purposes of knowledge transfer. A diagram representing this setup is shown in Figure 4. A candidate image is evaluated by the two systems simultaneously as well as by the HOG feature generator. The latter generates 2016-dimensional vectors of HOGs and stores these vectors in the set S. If the outputs of AI s and AI t do not match, then the corresponding feature vector is added to the set Y.

FIGURE 4 | Knowledge transfer diagram between a teacher AI and a student AI augmented with a HOG-based feature generator.

Error Types Addressed
In this experiment we consider and address two types of errors: false positives (original Type I errors) and false negatives (original Type II errors). The error types were determined as follows. An error is deemed a false positive (for the original data) if AI s reported the presence of a correctly sized full-figure image of a pedestrian in a given image patch whereas no such object was there. Similarly, an error is deemed a false negative (for the original data) if a pedestrian was present in the given image patch but AI s did not report it.
Our main focus was to replicate a deployment scenario in which AI t is capable of evaluating only small image patches at once in a given processing quantum. At the same time, AI s is supposed to be able to process the whole frame in reasonable time, but its accuracy is lower. This makes assessment of all relevant images by AI t computationally non-viable and, in addition, rules out automated detection of Type II errors (original false negatives) when AI s is scanning the image at thousands of positions per frame. On the other hand, the number of positive responses of AI s is limited to several dozen smaller-size patches per frame, which is assumed to be well within the processing capabilities of AI t . In view of these considerations, we therefore focused mainly on errors of Type I (original false positives). Nevertheless, in section 3.4 we discuss possible ways to handle Type II errors in the original system and provide an illustrative example of how this might be done.
It is worth mentioning that the output labels of the chosen teacher AI, AI t , do not always match ground-truth labels. AI t may make an occasional error too; examples of such errors are provided in Figure 5. Notwithstanding these, the performance of AI t was markedly superior to that of AI s , and hence, for the sake of testing the concept, these rare occasional errors of AI t were discarded in the experiments.

Datasets Used in Experiments and Experimental Error Types
To test the approach we used the NOTTINGHAM video (Burton, 2016), containing 435 frames of live footage taken with an action camera. The video, as per manual inspection, contains 4,039 full-figure images of pedestrians.
For the purposes of training and testing Knowledge Transfer Units, the video was passed through AI s , and AI s returned detections of pedestrian shapes. These detections were assessed by AI t and labeled accordingly (see the diagram in Figure 4). At this stage, the decision-making threshold in AI s was varied from −0.3 to 2 to capture a reasonably large sample of false positives whose scores are near the decision boundary. This labeled set of feature vectors was partitioned into two non-overlapping subsets: Set 1, consisting of circa 90% of true positives and 90% of false positives, and Set 2, its complement. HOG features corresponding to original Type I errors (false positives) in Set 1, as well as all HOG features extracted from the 55 K images of positives that had been used to train AI s , were combined into the training set. This training set was then used to derive Knowledge Transfer Units for AI s . Sets 1 and 2 constitute different testing sets in our example. The first testing set (Set 1) enables us to assess how the modified AI s copes with removing "seen" original Type I errors (false positives) in the presence of "unseen" true positives. The second testing set (Set 2) is used to assess the generalization capabilities of the final system.
We note that labeling of false positives involved outputs of AI t rather than ground-truth labels. Visual inspection of the AI t labels revealed that they contain a few dozen false positives too. In what follows, and unless stated otherwise, we shall refer to these definitions.
Results of the application of Algorithms 1 and 2, as well as an analysis of their performance on the testing sets, are provided below.

Results
We generated 10 different realizations of Sets 1 and 2, resulting in 10 different samples of the training and testing sets. The algorithms were applied to all these combinations. A single run of the pre-processing step, Step 1, took, on average, 23.5 s to complete on an Apple laptop with a 3.5 GHz A7 processor. After the pre-processing step only 164 principal components were retained, resulting in a significant reduction of the dimensionality of the feature vectors. In our experiments, pre-processing also included normalization of the whitened vectors so that their L 2 norm was always one. This brings the data onto the unit sphere, which is somewhat more aligned with Theorems 1 and 2. Steps 2 of Algorithms 1 and 2 took 1 and 24 ms, respectively. This is a major speed-up in comparison with complete re-training of AI s (several minutes) or AI t (hours). Note also that complete re-training does not offer any guarantees that the errors will be mitigated.
Prior to running Steps 2 of the algorithms, we checked whether the feature vectors corresponding to errors (false positives) in the training set are correlated. This allows an informed choice of the number-of-clusters parameter, p, in Algorithms 1 and 2. A 3D color-coded visualization of the correlations, i.e., ⟨x i , x j ⟩, between pre-processed (after Step 1) elements in the set Y is shown in Figure 6, left panel. A histogram of the lengths of all vectors in the training set is shown in Figure 6, right panel.
Observe that the lengths concentrate neatly around √164, as expected. According to Figure 6, elements of the set Y are mostly uncorrelated. There are, however, a few correlated elements which, we expect, will be accounted for in Step 2.a of the algorithms. In the absence of noticeably widespread correlations we set p = 30, which is equivalent to roughly 25 elements per cluster. Examples of correlations within clusters after Step 2 was complete are shown in Figure 7. Note that the larger the threshold, the higher the expected correlation between elements, and the higher the chances that such a Knowledge Transfer Unit will operate successfully (see Theorems 1 and 2).
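The concentration of lengths around √164 is a generic high-dimensional effect rather than a peculiarity of this dataset. A quick numerical check, under the simplifying assumption that the whitened coordinates behave like i.i.d. standard normal variables:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 164, 5000
X = rng.normal(size=(N, n))          # stand-in for whitened feature vectors
norms = np.linalg.norm(X, axis=1)

# For whitened data, E||x||^2 = n, so norms cluster tightly around sqrt(n):
# the relative spread of the length shrinks roughly like 1/sqrt(2n).
print(norms.mean(), np.sqrt(n), norms.std())
```

The sample mean of the lengths lands very close to √164 ≈ 12.8 with a standard deviation well below one, matching the narrow histogram in Figure 6.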
Performance of Algorithms 1 and 2 on the testing sets generated from the NOTTINGHAM video is summarized in Figures 8, 9. In these figures we show the behavior of AI s with and without Knowledge Transfer Units. Blue squares correspond to AI s after Algorithm 1, and green triangles illustrate the application of Algorithm 2. Note that the maximal number of false positives in Figure 8 does not exceed 400. This is because the threshold was now varied in the operationally feasible interval [0, 2] as opposed to the interval [−0.3, 2] used to gather training data. As Figure 8 shows, the performance of AI s on Testing set 1 (Set 1) improves drastically after the application of both algorithms. Algorithm 2 outperforms Algorithm 1 by some margin, which is most noticeable in the plot in Figure 8D. The near-ideal performance in Precision-Recall space can be explained by the fact that the training set contained full information about the false positives (but not the true positives). It is important to observe that Algorithm 2 left nearly all "unseen" true positives unflagged.
As for the results shown in Figure 9, the performance patterns of AI s after the application of both algorithms change. Algorithm 1 results in a minor drop in Recall, and Algorithm 2 recovers this drop to the baseline performance. Precision improves slightly for both algorithms, and the True positives vs. False positives curves for AI s with Knowledge Transfer Units dominate those of plain AI s . This suggests that the proposed Knowledge Transfer framework not only allows the system to acquire new knowledge as specified by labeled data but also has the capacity to generalize this knowledge further. The degree of such generalization will obviously depend on the statistics of the data. Yet, as Figure 9 demonstrates, it is a viable possibility.
Our experiments showed how the approach could be used for filtering Type I errors in the original system. The technology, however, could be used to recover Type II errors too (false negatives in the original system), should the data be available. Several strategies might be invoked to obtain such data. The first approach is to use background subtraction to detect a moving object and pass the object through both AI s and AI t . The second approach is to enable AI s to report detections that are classified as negatives but are still reasonably close to the decision boundary. This is the strategy we adopted here. We validated HOG features in AI s corresponding to scores in the interval [−0.3, 0] with the teacher AI, AI t . Overall, 717 HOG vectors were extracted by this method, of which 307 were labeled by AI t as positives and 410 were considered negatives. We took one of the HOG feature vectors labeled as a true positive, x v , and constructed (after applying the pre-processing transformation from the previous experiments) several separating hyperplanes with the weights given by x v /‖x v ‖ and thresholds c v varying in [0, 1]. The results are summarized in Table 2. As before, we observe a strong concentration-of-measure effect: the Knowledge Transfer Unit shows extreme selectivity for sufficiently large values of c v . In this particular case, c v = 0.25 provides the maximal gain at the lowest risk of expected future error [see, e.g., the right-hand side of (1) in Theorem 1 at k = 1, ε = 0.75 for an estimate].
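The single-prototype unit used here can be sketched as follows, with synthetic stand-ins for the data (the construction from x v and c v follows the text; the concrete vectors below are illustrative only):

```python
import numpy as np

def prototype_unit(x_v, c_v):
    """Knowledge Transfer Unit l(x) = <x_v/||x_v||, x> - c_v built from a
    single labeled example x_v; fires (returns True) when l(x) >= 0."""
    w = x_v / np.linalg.norm(x_v)
    return lambda x: w @ x - c_v >= 0

rng = np.random.default_rng(2)
n = 164
x_v = rng.normal(size=n)
x_v /= np.linalg.norm(x_v)            # features normalized to the unit sphere
unit = prototype_unit(x_v, c_v=0.25)

# By concentration of measure, a random unit vector in R^164 is almost
# orthogonal to x_v, so the unit is highly selective: it fires on x_v
# itself but on almost no random probe.
probes = rng.normal(size=(1000, n))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
false_alarms = sum(unit(p) for p in probes)
```

This mirrors the observed selectivity in Table 2: already at c v = 0.25 the projection of a random unit vector onto x v rarely reaches the threshold, while the stored example always does.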

CONCLUSION
In this work we proposed a framework for instantaneous knowledge transfer between AI systems whose internal state used for decision-making can be described by elements of a high-dimensional vector space. The framework enables the development of non-iterative algorithms for knowledge spreading between legacy AI systems with heterogeneous, non-identical architectures and varying computing capabilities. Feasibility of the framework was illustrated with an example of knowledge transfer between two AI systems for automated pedestrian detection in video streams.
At the basis of the proposed knowledge transfer framework are separation theorems (Theorems 1-3) stating peculiar properties of large but finite random samples in high dimension. According to these results, k < n random i.i.d. elements can be separated, with high probability, from M ≫ n elements i.i.d. sampled from the same distribution by a few linear functionals. The theorems are proved for equidistributions in a ball and in a cube. The results can be trivially generalized to equidistributions in ellipsoids and to Gaussian distributions. Discussing these in detail is beyond the scope of this work. Nevertheless, generalizations to other meaningful distributions, relaxation of the independence requirement, and a broader view on how the proposed technology could be used in multi-agent AI systems are presented in a technical report.