Statistical Analysis of Multi-Relational Network Recovery

In this paper, we develop asymptotic theories for a class of latent variable models for large-scale multi-relational networks. In particular, we establish consistency results and asymptotic error bounds for the (penalized) maximum likelihood estimators when the size of the network tends to infinity. The basic technique is to develop a non-asymptotic error bound for the maximum likelihood estimators through large deviations analysis of random fields. We also show that these estimators are nearly optimal in terms of minimax risk.


Introduction
A multi-relational network (MRN) describes multiple relations among a set of entities simultaneously. Our work on MRNs is mainly motivated by their applications to knowledge bases, which are repositories of information. Examples of knowledge bases include WordNet (Miller, 1995), the Unified Medical Language System (McCray, 2003), and the Google Knowledge Graph (https://developers.google.com/knowledge-graph). They have been used as information sources in many natural language processing tasks such as word-sense disambiguation and machine translation (Gabrilovich and Markovitch, 2009; Scott and Matwin, 1999; Ferrucci et al., 2010). A knowledge base often includes knowledge on a large number of entities and relations, yet little is known about whether and to what extent the underlying distribution of an MRN can be recovered, especially when there are a large number of nodes and relations.
The results in this paper fill this void by studying the error bounds and asymptotic behavior of the estimators of the M_ijk's for a general class of models. This is a challenging problem for the following reasons. Traditional statistical inference for latent variable models often requires a (proper or improper) prior distribution for θ_i. In such settings, one works with the marginalized likelihood with θ_i integrated out. For the analysis of MRNs, the sample size and the latent dimensions are often so large that such inference approaches are computationally infeasible. For instance, a small-scale MRN could have a sample size of a few million, and the dimension of the embeddings can be several hundred. Therefore, in practice, the prior distribution is often dropped, and the latent variables θ_i are treated as additional parameters and estimated by maximizing the likelihood or a penalized likelihood function. The parameter space is thus substantially enlarged by the addition of the θ_i's, whose total dimension is proportional to the number of entities. As a result, in the asymptotic analysis, we face a double-asymptotic regime in both the sample size and the parameter dimension.
In this paper, we develop results for the (penalized) maximum likelihood estimator of such models and show that under regularity conditions the estimator is consistent. In particular, we overcome the difficulty induced by the double-asymptotic regime via non-asymptotic bounds for the error probabilities. Then, we show that the distribution of MRN can be consistently estimated in terms of average Kullback-Leibler (KL) divergence even when the latent dimension increases slowly as the sample size tends to infinity. A probability error bound is also provided together with the upper bound for the risk (expected KL divergence).
We further study the lower bound and show the near-optimality of the estimator in terms of minimax risk. Besides the average KL divergence, similar results can be established for other criteria such as link prediction accuracy.
The outline of the remaining sections is as follows. In Section 2, we provide the model specification and formulate the problem. Our main results are presented in Section 3. Finite sample performance is examined in Section 4 through simulated and real data examples.
Concluding remarks are included in Section 5.
2 Problem setup

2.1 Notation

Let | · | denote the cardinality of a set and × the Cartesian product. The set {1, . . . , N} is denoted by [N]. The sign function sgn(x) is defined to be 1 for x ≥ 0 and 0 otherwise. The logistic function is denoted by σ(x) = e^x/(1 + e^x). Let 1_A be the indicator function of event A. We use U[a, b] to denote the uniform distribution on [a, b] and Ber(p) to denote the Bernoulli distribution with success probability p. The KL divergence between Ber(p) and Ber(q) is written as D(p‖q) = p log(p/q) + (1 − p) log[(1 − p)/(1 − q)]. We use ‖ · ‖ to denote the Euclidean norm for vectors and the Frobenius norm for matrices.
For two positive real sequences {a_n} and {b_n}, we write a_n = O(b_n) if lim sup_{n→∞} a_n/b_n < ∞. Similarly, we write a_n = Ω(b_n) if lim sup_{n→∞} b_n/a_n < ∞ and a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0. We write a_n ≲ b_n if lim sup_{n→∞} a_n/b_n ≤ 1. When {a_n} and {b_n} are negative sequences, a_n ≲ b_n means lim inf_{n→∞} a_n/b_n ≥ 1. In some places, we use b_n ≳ a_n as interchangeable notation for a_n ≲ b_n. Finally, if lim_{n→∞} a_n/b_n = 1, we write a_n ∼ b_n.
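As a quick numerical illustration of the Bernoulli KL divergence defined above, the following minimal Python sketch (the function name is ours, not from the paper) evaluates D(Ber(p)‖Ber(q)).

```python
# Minimal sketch: KL divergence between Ber(p) and Ber(q),
# D(p || q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)).
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence D(Ber(p) || Ber(q)); assumes 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

# Example: divergence of a fair coin from a biased coin.
print(bernoulli_kl(0.5, 0.8))  # approximately 0.2231
```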

Model
Consider an MRN with N entities and K relations. Given i, j ∈ [N] and k ∈ [K], the triple λ = (i, j, k) corresponds to the edge from entity i to entity j of relation k. Let Λ = [N] × [N] × [K] denote the set of all edges. We assume in this paper that an edge can be either present or absent in the network and use Y_λ ∈ {0, 1} to indicate the presence of edge λ. In some scenarios, the status of an edge may have more than two types; our analysis can be generalized to accommodate these cases.
We associate each entity i with a vector θ_i of dimension d_E and each relation k with a vector w_k of dimension d_R. Let E ⊆ R^{d_E} be a compact domain where the embeddings θ_1, . . . , θ_N live. We call E the entity space. Similarly, we define a compact relation space R ⊆ R^{d_R} for the relation-specific parameters w_1, . . . , w_K. Let x = (θ_1, . . . , θ_N, w_1, . . . , w_K) be a vector in the product space Θ = E^N × R^K. The parameters associated with edge λ = (i, j, k) are then x_λ = (θ_i, θ_j, w_k). We assume that, given x, the elements of {Y_λ | λ ∈ Λ} are mutually independent and that the log odds of Y_λ = 1 is φ(x_λ), that is,

log [P(Y_λ = 1 | x) / P(Y_λ = 0 | x)] = φ(x_λ). (2)

Here φ is defined on E² × R, and φ(x_λ) is often called the score of edge λ.
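To make the link between scores and edge probabilities concrete, here is a minimal sketch, not taken from the paper, in which a generic score function φ induces the edge probability through the logistic link; the toy φ and all variable names are illustrative assumptions.

```python
# Illustration: given embeddings theta_i, theta_j and a relation parameter w_k,
# the score phi determines the edge probability via M_ijk = sigma(phi(theta_i, theta_j, w_k)).
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_probability(theta_i, theta_j, w_k, phi):
    """Probability that edge (i, j, k) is present, for a generic score function phi."""
    return sigma(phi(theta_i, theta_j, w_k))

# Example with a toy score function (inner product of theta_i and theta_j, ignoring w_k).
rng = np.random.default_rng(0)
theta_i, theta_j, w_k = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
toy_phi = lambda ti, tj, wk: float(ti @ tj)
print(edge_probability(theta_i, theta_j, w_k, toy_phi))
```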
We will use Y to represent the N × N × K tensor formed by {Y_λ | λ ∈ Λ} and M(x) to represent the corresponding probability tensor, whose entries are M_λ(x) = P(Y_λ = 1 | x) = σ(φ(x_λ)). We assume the Y_λ's are independent draws from Ber(M_λ(x*)), where x* stands for the true value of x. In the above model, the probability of the presence of an edge is entirely determined by the embeddings of the corresponding entities and the relation-specific parameters. This imposes a low-dimensional latent structure on the probability tensor M* = M(x*). We specify our model using a generic function φ, which includes various existing models as special cases. Below are two examples of φ.
The first example is the distance model,

φ(θ_i, θ_j, w_k) = b_k − ‖θ_i + a_k − θ_j‖, (4)

where θ_i, θ_j, a_k ∈ R^d, b_k ∈ R, and w_k = (a_k, b_k). In the distance model, relation k from node i to node j is more likely to exist if θ_i shifted by a_k is closer to θ_j under the Euclidean norm.
The second example is the bilinear model,

φ(θ_i, θ_j, w_k) = θ_i^⊤ diag(w_k) θ_j, (5)

where θ_i, θ_j, w_k ∈ R^d and diag(w_k) is the diagonal matrix with w_k as its diagonal elements. Model (5) is a special case of the more general bilinear form θ_i^⊤ W_k θ_j with a full d × d relation matrix W_k.

Very often, only a small portion of the network is observed (Min et al., 2013). We assume that each edge in the MRN is observed independently with probability γ and that the observation of an edge is independent of Y. Let S ⊂ Λ be the set of observed edges; its elements are independent draws from Λ. For convenience, we use n to represent the expected number of observed edges, namely, n = E[|S|] = γ|Λ| = γN²K. Our goal is to recover the underlying probability tensor M* based on the observed edges {Y_λ | λ ∈ S}.
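The sketch below is an illustrative, non-authoritative implementation of the two example score functions and of the partial-observation mechanism; in particular, the sign convention in the distance score is an assumption consistent with the verbal description above, and all names are ours.

```python
# Sketch of the two example score functions and of sampling the observed set S.
import numpy as np

def phi_distance(theta_i, theta_j, a_k, b_k):
    # Higher score when theta_i shifted by a_k is closer to theta_j.
    return b_k - np.linalg.norm(theta_i + a_k - theta_j)

def phi_bilinear(theta_i, theta_j, w_k):
    # theta_i^T diag(w_k) theta_j, written without forming the diagonal matrix.
    return float(np.sum(theta_i * w_k * theta_j))

def sample_observed_edges(N, K, gamma, rng):
    # Each triple (i, j, k) is observed independently with probability gamma.
    return np.argwhere(rng.random((N, N, K)) < gamma)  # rows are observed (i, j, k)

rng = np.random.default_rng(1)
S = sample_observed_edges(N=50, K=3, gamma=0.1, rng=rng)
print(len(S), "observed triples out of", 50 * 50 * 3)
```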
Remark 1. Ideally, if there exists x* such that Y_λ = sgn(M_λ(x*) − 1/2) for all λ ∈ Λ, then Y can be recovered with no error under x*. This is, however, rarely the case in practice, especially for large-scale MRNs. A relaxed assumption is that Y can be recovered by some low-dimensional x* up to random noise,

Y_λ = sgn(φ(x*_λ) + ε_λ), (6)

where the ε_λ's are independent errors following the standard logistic distribution. By introducing the noise term, we formulate the deterministic MRN as a random graph. The model described in (2) is an equivalent but simpler form of (6).

Estimation
According to (2), the log-likelihood function of our model is

l(x) = Σ_{λ∈S} {Y_λ log M_λ(x) + (1 − Y_λ) log[1 − M_λ(x)]}. (7)

We omit the terms Σ_{λ∈S} log γ + Σ_{λ∉S} log(1 − γ) in (7) since γ is not a parameter of interest. To obtain an estimator of M*, we take the following steps.
1. Obtain the maximum likelihood estimator (MLE) of x*,

x̂ = argmax_{x∈Θ} l(x). (8)

2. Use the plug-in estimator

M̂ = M(x̂) (9)

as an estimator of M*.
In (8), the estimator x̂ is a maximizer over the compact parameter space Θ = E^N × R^K. The dimension of Θ is m = N d_E + K d_R, which grows linearly in the number of entities N and the number of relations K.
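As an illustration of the estimation step, the following sketch evaluates the log-likelihood (7) on the observed triples, assuming the scores φ(x_λ) for λ ∈ S have already been computed; the numerically stable form used in the code is algebraically identical to the Bernoulli log-likelihood, and all names are illustrative.

```python
# Sketch of the observed-data log-likelihood in (7), dropping the gamma terms as in the text.
import numpy as np

def log_likelihood(scores, y):
    """sum_{lambda in S} [ y * log sigma(score) + (1 - y) * log(1 - sigma(score)) ]."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    # Stable form: y * score - log(1 + exp(score)).
    return float(np.sum(y * scores - np.logaddexp(0.0, scores)))

# Example: two observed edges, one present and one absent.
print(log_likelihood([2.0, -1.0], [1, 0]))
```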

Evaluation criteria
We consider the following criteria to measure the error of the above-mentioned estimator.
They will be used in both the main results and numerical studies.
(a) Average KL divergence of the predictive distribution from the true distribution,

L(M̂, M*) = (1/|Λ|) Σ_{λ∈Λ} D(M*_λ ‖ M̂_λ). (10)

(b) Mean squared error of the predicted scores,

MSE_φ(x̂, x*) = (1/|Λ|) Σ_{λ∈Λ} {φ(x̂_λ) − φ(x*_λ)}². (11)

(c) Link prediction error, denoted by err.

Remark 2. The latent attributes of entities and relations are often not identifiable, so the MLE x̂ is not unique. For instance, in (4), the values of φ and M(x) remain the same if we replace θ_i and a_k by Γθ_i + t and Γa_k, respectively, where t is an arbitrary vector in R^{d_E} and Γ is an orthonormal matrix. Therefore, we consider the mean squared error of the scores, which are identifiable.
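For criterion (a), the following sketch computes the average KL divergence between two probability tensors; the clipping constant and the array layout are implementation choices of ours, not part of the paper.

```python
# Sketch of the average KL divergence loss in (10): the mean over all triples in Lambda
# of D(M*_lambda || M_hat_lambda).
import numpy as np

def average_kl(M_star, M_hat, eps=1e-12):
    """Average KL divergence between entrywise Bernoulli distributions."""
    p = np.clip(M_star, eps, 1 - eps)
    q = np.clip(M_hat, eps, 1 - eps)
    kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    return float(np.mean(kl))

# Example on a tiny 2 x 2 x 1 tensor.
M_star = np.array([[[0.9], [0.2]], [[0.5], [0.7]]])
M_hat  = np.array([[[0.8], [0.3]], [[0.5], [0.6]]])
print(average_kl(M_star, M_hat))
```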

Main Results
We first provide results for the MLE in terms of the KL divergence between the estimated and the true models. Specifically, we investigate the tail probability P(L(M̂, M*) > t) and the expected loss E[L(M̂, M*)]. In Section 3.1, we discuss upper bounds for these two quantities.
The lower bounds are provided in Section 3.2. In Section 3.3, we extend the results to penalized maximum likelihood estimators (pMLE) and other loss functions. All proofs are deferred to the Appendix.

Upper bounds
We first present an upper bound for the tail probability P(L(M̂, M*) > t) in Lemma 1. The result depends on the tensor size, the number of observed edges, the functional form of φ, and the geometry of the parameter space Θ. The lemma explicitly quantifies the impact of these elements on the error probability and is key to the subsequent analyses. Lemma 2 gives a non-asymptotic upper bound for the expected loss (risk). We then establish the consistency of M̂ and the asymptotic error bounds in Theorem 1.
We will make the following assumptions throughout this section.
Assumption 1. The entity space E ⊆ R^{d_E} and the relation space R ⊆ R^{d_R} are Euclidean balls of radius U.
Assumption 2. The function φ is Lipschitz continuous under the Euclidean norm, that is, |φ(θ_i, θ_j, w_k) − φ(θ'_i, θ'_j, w'_k)| ≤ α ‖(θ_i, θ_j, w_k) − (θ'_i, θ'_j, w'_k)‖ for all (θ_i, θ_j, w_k), (θ'_i, θ'_j, w'_k) ∈ E² × R, where α is the Lipschitz constant.
Assumption 1 is imposed for technical convenience; the results can be easily extended to general compact parameter spaces. Let C = sup_{x∈Θ, λ∈Λ} |φ(x_λ)|. Without loss of generality, we assume that C ≥ 2.
Lemma 1. Consider M̂ defined in (9) and the average KL divergence L in (10). Under Assumptions 1 and 2, for every t > 0, β > 0, and 0 < s < nt, the tail probability P(L(M̂, M*) ≥ t) satisfies the bound in (14).

In the proof of Lemma 1, we use Bennett's inequality to develop a uniform bound that does not depend on the true parameters, which is sufficient for the current analysis. Readers who need sharper bounds can replace Bennett's bound in the proof with the usual large-deviation rate function, which yields a sharp exponential bound that depends on the true parameters; we do not pursue this direction in this paper.
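For reference, a standard statement of Bennett's inequality for bounded, mean-zero, independent summands is reproduced below (a textbook fact, not a result of this paper); the comment records how it relates to the function h used in the proof of Lemma 1.

```latex
% Bennett's inequality (standard form): X_1, ..., X_n independent, E[X_i] = 0,
% |X_i| <= b, and V = sum_i Var(X_i). Then for every s >= 0,
\[
  \mathbb{P}\Bigl(\textstyle\sum_{i=1}^{n} X_i \ge s\Bigr)
  \;\le\;
  \exp\!\Bigl(-\tfrac{V}{b^{2}}\, h_{0}\!\bigl(\tfrac{bs}{V}\bigr)\Bigr),
  \qquad h_{0}(u) = (1+u)\log(1+u) - u .
\]
% With h(u) = (1 + 1/u)\log(1+u) - 1 as used in the proof of Lemma 1, one has
% (V/b^2)\, h_0(bs/V) = (s/b)\, h(bs/V), so the two parametrizations give the same bound.
```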
Lemma 2 below gives an upper bound on the risk E[L(M̂, M*)], which follows from Lemma 1.
Lemma 2. Consider M̂ defined in (9) and the loss function L in (10). Let C₁ = 18C, C₂ = 8√3 αU, and C₃ = 2 max{C₁, C₂}. If Assumptions 1 and 2 hold and n/m ≥ C₂ + e, then E[L(M̂, M*)] ≤ C₃ (m/n) log(n/m).

We are interested in the asymptotic behavior of the tail probability in two scenarios: (i) t is a fixed constant and (ii) t decays to zero as the number of entities N tends to infinity. The following theorem gives asymptotic upper bounds for the tail probability and the risk.
Theorem 1. Consider M̂ defined in (9) and the loss function L in (10). Let the number of entities N → ∞ and let C, K, U, d_E, d_R, α, and γ be fixed constants. If Assumptions 1 and 2 hold, we have the following asymptotic inequalities.
When t is a fixed constant, the tail bound (16) holds; when t = 10C (m/n) log(n/m), the tail bound (17) holds. Furthermore, the risk E[L(M̂, M*)] is of order O((m/n) log(n/m)). The consistency of M̂ is implied by (16), with rate of convergence |log P(L(M̂, M*) ≥ t)| = Ω(N²); the rate decreases to Ω(N log N) for the choice of t producing (17). Consistency is also implied by (17), which shows that L(M̂, M*) = O((m/n) log(n/m)) with high probability. We show in the next section that this upper bound is reasonably sharp.
The condition that K, U, d_E, d_R, and α are fixed constants can be relaxed. For instance, we can let U, d_E, d_R, and α grow slowly at the rate O(log N) and K at the rate O(N). We can also let γ go to zero provided that (m/n) log(n/m) = o(1).

Lower bounds
We show in Theorem 2 that the minimax risk is of order Ω(m/n). This matches the upper bound O((m/n) log(n/m)) in Theorem 1 up to a logarithmic factor and therefore implies the near-optimality of M̂ in (9). To begin with, we introduce the following definition and assumption.

Regularization
In this section, we extend the asymptotic results in Theorem 1 to regularized estimators. In practice, regularization is often used to prevent overfitting. We consider a regularization similar to the elastic net (Zou and Hastie, 2005),

p_ρ(x) = ρ₁ ‖x‖₁ + ρ₂ ‖x‖², (22)

where ‖·‖₁ stands for the L₁ norm and ρ₁, ρ₂ ≥ 0 are regularization parameters. The pMLE is

x̂ = argmax_{x∈Θ} {l(x) − p_ρ(x)}. (23)

Note that the MLE in (8) is a special case of the pMLE above with ρ₁ = ρ₂ = 0. Since x̂ is shrunk towards 0, without loss of generality we assume that E and R are centered at 0. We generalize Theorem 1 to the pMLE in the following theorem.
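As an illustration of the penalized objective, the sketch below evaluates the log-likelihood minus an elastic-net-style penalty, assuming the penalty takes the form in (22) applied to the stacked parameter vector x; the function and variable names are ours.

```python
# Sketch of the penalized objective maximized by the pMLE in (23).
import numpy as np

def penalized_log_likelihood(scores, y, x, rho1, rho2):
    """log-likelihood (7) evaluated on observed scores, minus rho1*||x||_1 + rho2*||x||^2."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    loglik = float(np.sum(y * scores - np.logaddexp(0.0, scores)))
    penalty = rho1 * np.sum(np.abs(x)) + rho2 * np.sum(np.asarray(x) ** 2)
    return loglik - penalty

# Toy usage with two observed edges and a three-dimensional parameter vector.
scores = np.array([1.5, -0.3]); y = np.array([1, 0]); x = np.array([0.5, -1.0, 2.0])
print(penalized_log_likelihood(scores, y, x, rho1=1e-4, rho2=1e-3))
```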

Other loss functions
We present some results for the mean squared error loss MSE_φ defined in (11).

Corollary 1. Under the setting of Theorem 1 with the loss function replaced by MSE_φ, we have the following asymptotic results: a tail probability bound when t is a fixed constant and, furthermore, a corresponding bound on the risk.

Corollary 2. Under the setting of Theorem 2 with the loss function replaced by MSE_φ, an analogous lower bound holds.

For the link prediction error err, we need the following additional assumption.

Assumption 4. There exists ε > 0 such that |M*_λ − 1/2| ≥ ε for every λ ∈ Λ.
Corollary 3. Under the setting of Theorem 3 with the loss function replaced by err and Assumption 4 added, we have the following asymptotic results: a tail probability bound when t is a fixed constant and, furthermore, a corresponding bound on the risk.

Sparse representations
We are interested in sparse entity embeddings and relation parameters. Let ‖·‖₀ be the number of non-zero elements of a vector and τ be a prespecified sparsity level of x (i.e., the proportion of non-zero elements). Let m_τ = mτ be the upper bound on the number of non-zero parameters, that is, ‖x*‖₀ ≤ m_τ. Consider the estimator

x̂ = argmax_{x∈Θ, ‖x‖₀ ≤ m_τ} l(x), (32)

which maximizes the log-likelihood under the L₀ constraint ‖x‖₀ ≤ m_τ.
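The following sketch is not the paper's estimation algorithm; it only illustrates the L₀ constraint in (32) via hard thresholding, i.e., projecting a parameter vector onto the set {x : ‖x‖₀ ≤ m_τ}, which could be interleaved with gradient steps in a projected-ascent heuristic.

```python
# Hard-thresholding projection onto the L0 ball ||x||_0 <= m_tau (illustrative only).
import numpy as np

def hard_threshold(x, m_tau):
    """Zero out all but the m_tau largest-magnitude entries of x."""
    x = np.asarray(x, dtype=float).copy()
    keep = np.argsort(np.abs(x))[::-1][:m_tau]   # indices of the m_tau largest entries
    mask = np.zeros(x.size, dtype=bool)
    mask[keep] = True
    x[~mask] = 0.0
    return x

print(hard_threshold([0.1, -2.0, 0.5, 3.0], m_tau=2))  # [0., -2., 0., 3.]
```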
Theorem 4. Consider M̂ defined through (32) and (9) and the loss function L in (10). Let the number of entities N → ∞ and let τ, C, K, U, d_E, d_R, and α be absolute constants. Under Assumptions 1 and 2, the following asymptotic inequalities hold.
Tail bounds analogous to (16) and (17) hold when t is a fixed constant and when t = 10C (m_τ/n) log(n/m_τ), respectively, with m replaced by m_τ; furthermore, the corresponding risk bound holds. We omit the results for other loss functions as well as the lower bounds since they can be obtained analogously.

Numerical Examples
In this section, we demonstrate the finite sample performance of M̂ through simulated and real data examples. Throughout the numerical experiments, the AdaGrad algorithm (Duchi et al., 2011) is used to compute x̂ in (8) or (23). It is a first-order optimization method that combines stochastic gradient descent (SGD) (Robbins and Monro, 1951) with adaptive step sizes for finding local optima. Since the objective function in (8) is non-convex, a global maximizer is not guaranteed. Our objective function usually has many global maximizers, but, empirically, we found that the algorithm works well for MRN recovery and that the recovery performance is insensitive to the choice of the starting point of SGD. Computationally, SGD is also more appealing for handling large-scale MRNs than more expensive global optimization methods.
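The sketch below shows one AdaGrad update of the kind used here, applied to a toy maximization problem; the learning rate, the toy objective, and all names are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of an AdaGrad step (Duchi et al., 2011) for maximization:
# accumulate squared gradients and scale each coordinate's step accordingly.
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad ascent update; accum holds the running sum of squared gradients."""
    accum += grad ** 2
    x_new = x + lr * grad / (np.sqrt(accum) + eps)
    return x_new, accum

# Toy usage: maximize -||x - 1||^2, whose gradient is 2 * (1 - x).
x = np.zeros(3)
accum = np.zeros(3)
for _ in range(200):
    grad = 2.0 * (1.0 - x)
    x, accum = adagrad_step(x, grad, accum)
print(x)  # approaches [1, 1, 1]
```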

Simulated Examples
In the simulated examples, we fix K = 20 and d_E = 20 and consider various choices of N ranging from 100 to 10,000 to investigate the estimation performance as N grows. The function φ we consider is a combination of the distance model (4) and the bilinear model (5), where θ_i, θ_j, a_k, b_k ∈ R^d and w_k = (a_k, b_k). We independently generate the elements of θ*_i, a*_k, and b*_k from the normal distributions N(0, 1), N(0, 1), and N(0, 0.25), respectively, where N(µ, σ²) denotes the normal distribution with mean µ and variance σ². To guarantee that the parameters lie in a compact set, the normal distributions are truncated to the interval [−20, 20]. Given the latent attributes, each Y_ijk is generated from the Bernoulli distribution with success probability M_ijk(x*) = σ(φ(x*_λ)).

The goal of our analysis is to recover the unobserved part of the knowledge base. We adopt the ranking procedure commonly used in the knowledge graph embedding literature to evaluate link predictions. Given a valid triple λ = (i, j, k), we rank the estimated scores of all the invalid triples inside Λ_{·jk} = {(i', j, k) | i' ∈ [N]} in descending order and call the rank of φ(x̂_λ) the head rank of λ, denoted by H_λ. Similarly, we define the tail rank T_λ and the relation rank R_λ by ranking φ(x̂_λ) among the estimated scores of invalid triples in Λ_{i·k} and Λ_{ij·}, respectively. For a set V of valid triples, the prediction performance can be evaluated by rank-based criteria: the mean rank (MR), the mean reciprocal rank (MRR), and hits at q (Hits@q), defined respectively as the average rank over V, the average reciprocal rank over V, and the proportion of ranks in V that do not exceed q. The subscripts E and R represent the criteria for predicting entities and relations, respectively. Models with higher MRRs and Hits@q's, or lower MRs, are preferable. In addition, MRR is more robust to outliers than MR.
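The ranking procedure can be sketched as follows. For simplicity the code ranks a triple's score among all head replacements (the "raw" ranking) rather than excluding other valid triples as described above; the (N × N × K) score layout and all names are assumptions for illustration.

```python
# Sketch of rank-based evaluation: head rank, MR, MRR, and Hits@q.
import numpy as np

def head_rank(scores, i, j, k):
    """Rank (1 = best) of scores[i, j, k] among all head replacements (i', j, k)."""
    column = scores[:, j, k]
    return int(1 + np.sum(column > scores[i, j, k]))

def rank_metrics(ranks, q=10):
    ranks = np.asarray(ranks, dtype=float)
    return {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean(), f"Hits@{q}": (ranks <= q).mean()}

# Toy example with random scores and three "valid" triples.
rng = np.random.default_rng(2)
scores = rng.normal(size=(20, 20, 3))
valid = [(1, 4, 0), (7, 2, 1), (3, 3, 2)]
print(rank_metrics([head_rank(scores, *t) for t in valid]))
```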
The three models described in (4), (5), and (36) are considered in our data analysis, and we refer to them as Models 1, 2, and 3, respectively. For each model, the latent dimension d takes values in {50, 100, 150, 200, 250}. Due to the high dimensionality of the parameter space, the L₂-penalized MLE is used to obtain the estimated latent attributes x̂, with tuning parameters ρ₁ = 0 and ρ₂ chosen from {0, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵} in (22). Since information-criterion-based selection of the dimension and tuning parameter is computationally intensive for a dataset of this scale, we set aside 5,000 of the unobserved valid triples as a validation set and select the d and ρ₂ that produce the largest MRR_E on this validation set. The model with the selected d and ρ₂ is then evaluated on the test set consisting of the remaining 5,000 unobserved valid triples.
The computed evaluation criteria on the test set are listed in Table 1. The table also includes the selected d and ρ₂ for each of the three score models. Models 2 and 3 yield similar performance: their MRRs are very close to 1, and their Hits@q values are higher than 90%, suggesting that the two models identify the valid triples very well.
Although Model 1 is inferior to the other two models in terms of most of the criteria, it outperforms them in MR_E. This implies that Models 2 and 3 could perform extremely badly on a few triples.
In addition to Models 1-3, we also display the performance of the canonical polyadic (CP) decomposition (Hitchcock, 1927) and a tensor factorization approach, RESCAL (Nickel et al., 2011). Their MRR_E and Hits_E@10 results on the WordNet dataset are extracted from Trouillon et al. (2016) and Nickel et al. (2016), respectively. Both methods, especially CP, are outperformed by Model 3.

Concluding remarks
In this article, we focused on the recovery of large-scale MRNs from a small portion of observed edges. We studied a generalized latent space model in which entities and relations are associated with latent attribute vectors, and we analyzed the statistical error of recovery. MLEs and pMLEs over a compact space are considered for estimating the latent attributes and the edge probabilities. We established non-asymptotic upper bounds for the estimation error in terms of the tail probability and the risk, based on which we then studied the asymptotic properties when the size of the MRN and the latent dimension go to infinity simultaneously.
A matching lower bound up to a log factor is also provided.
We kept φ generic for the theoretical development. The choice of φ is usually problem-specific in practice. Developing a data-driven method for selecting an appropriate φ is an interesting direction for future work.
Besides latent space models, sparsity (Tran et al., 2020) and clustering assumptions (Jung et al., 2019) have been used to impose low-dimensional structures on single-relational networks. An MRN can be seen as a combination of several heterogeneous single-relational networks, and the distribution of edges may vary dramatically across relations. It is therefore challenging to impose appropriate sparsity or cluster structures on MRNs. More empirical and theoretical studies are needed to quantify the impact of heterogeneous relations and to incorporate this information when recovering MRNs.

Appendix
Proof of Lemma 1.
Let f(x, Y) = l(x) − l(x*) be the log-likelihood ratio. Therefore, f is a random field living on Θ; by writing f(x), we omit the second argument. In explicit form, f(x) = Σ_{λ∈Λ} Z_λ, where Z_λ = 1_{λ∈S} {Y_λ log[M_λ(x)/M*_λ] + (1 − Y_λ) log[(1 − M_λ(x))/(1 − M*_λ)]}. We have E[Z_λ] = −γ D(M*_λ‖M_λ(x)) and |Z_λ| ≤ C, so f has properties (i)-(iii). Based on the definition of Θ_t and property (ii), and from property (iii), we obtain the bounds (39) and (40). According to Lemma 3 in the Appendix, when C ≥ 2 the variance of Z_λ is bounded, and a bound on the variance of f(x) follows. By Bennett's inequality, where 0 < s < nt and h(u) = (1 + 1/u) log(1 + u) − 1 is an increasing function for u > 0, we control the deviation of f(x) from its mean. Hence, by the bounds in (39) and (40), the tail probability is controlled. Let z = argmax_{x∈Θ_t} f(x) be the random vector on Θ_t at which f reaches its maximum. Let N_{ε,E} and N_{ε,R} be the ε-covering centers for E and R, respectively. Since E and R are balls of radius U, we can find ε-coverings such that |N_{ε,E}| ≤ (1 + 2U/ε)^{d_E} and |N_{ε,R}| ≤ (1 + 2U/ε)^{d_R}.
Proof of Lemma 2. To bound E[L(M̂, M*)], set s = nt/2 and β = 1 + t in (14). By Fubini's theorem, the risk equals the integral of the tail probability over t. Let C₃ = 2 max{C₁, C₂} and t₀ = C₃ (m/n) log(n/m). When t ≥ t₀ and n/m ≥ C₂ + e, the tail probability decays fast enough that the integral beyond t₀ is of smaller order. Hence the result follows by combining (47) and (49).

Proof of Theorem 1. When t is a constant, let s be an absolute constant and β = m → ∞ in Lemma 1. We analyze the order of the three exponential terms on the right-hand side of (14); both the second and the third terms are asymptotically negligible compared with the first. It follows that the bound for fixed t holds.

When t = [2C/h(1/2)] (m/n) log(n/m), let s = m and let β be an absolute constant. Among the exponential terms, the third term exp{−nβh(β)} is negligible, and the stated tail bound follows. To bound the risk, we use an approach similar to the proof of Lemma 2: let s = m and β = 1 + t; it follows that the risk is bounded at the same order. Since h(1/2) ≥ 1/5, we obtain the stated results.
Lemma 3. For all x, y ∈ [−C, C], the stated inequality holds.

Proof. By symmetry, we only need to show the result for x ≥ 0. For any fixed x ∈ [0, C], we have g'(x) = g(x) = 0. It remains to show that g'(y)(y − x) > 0 for all y ∈ [−C, C] \ {x}; then g attains its minimum at y = x and g(y) ≥ 0 on [−C, C]. Equivalently, we want to show the corresponding inequality for the derivative. Note that (σ(y) − σ(x))/(y − x) is the slope of a secant line of the logistic function and reaches its minimum at y = C, so it suffices to verify the inequality at y = C. Let h(x) denote the left-hand side of that inequality; taking its derivative completes the argument.

To prove the lower bound in Theorem 2, we will use Lemmas 4-6. Since Lemma 4 (Massart, 2007) and Lemma 5 (Cover and Thomas, 2006) are well-established results in the literature, we skip their proofs.
Lemma 4 (Gilbert-Varshamov bound). There exists a subset V of the d-dimensional hypercube {−1, 1}^d of size at least exp{d/8} such that the Hamming distance between any two distinct elements of V is at least d/4.

Lemma 5 (Fano's inequality). Let V be a uniform random variable taking values in a finite set V with cardinality |V| ≥ 2. For any Markov chain V → X → V̂,

P(V̂ ≠ V) ≥ 1 − [I(V; X) + log 2] / log |V|,

where I(V; X) is the mutual information between V and X.
Proof of Theorem 2.
Under the construction above, x_λ ∈ N_r(u_0) for every λ ∈ Λ; hence, according to Assumption 3, the corresponding bound holds. We will find x* in a small vicinity such that (20) holds. According to the Gilbert-Varshamov bound in Lemma 4, there exists a suitable packing set; likewise, from (62) we obtain the analogous bound for u = (w_1, . . . , w_K) ∈ V_R and v = (w'_1, . . . , w'_K) ∈ V_R with u ≠ v. Note that z, x^{(i)} ∈ V; according to Pinsker's inequality and (60), the separation bound holds. For all x ≠ x' with x, x' ∈ V and N ≥ 2, the packing condition holds, and hence it holds when x^{(i)} ≠ z. Let P_i denote the probability measure under x^{(i)}. The results above verify the required separation. Assign a prior on x that is uniform on V, and denote by P_V the Bayes average probability with respect to this prior. By Fano's inequality in Lemma 5, the testing error is bounded from below in terms of I(x; Y_S), the mutual information between x and Y_S. It can be bounded by the maximum pairwise KL divergence of Y_S under P_i and P_j. Since σ(·) is the logistic function, its derivative σ'(x) = σ(x)(1 − σ(x)) < 1. By Assumption 2, φ(·) is Lipschitz continuous with coefficient α, so σ(φ(·)) is also Lipschitz continuous; by Lemma 6, the pairwise KL divergence is bounded for all i, j ∈ [N]. Hence, there exists x^{(i)} ∈ V such that the desired separation holds. Let x* = x^{(i)} and P = P_i. The conclusion then follows from (68).

Proof of Theorem 3. We show the result by continuing the proofs of Lemma 1 and Theorem 1 with some modifications. Let f_ρ(x) be the penalized log-likelihood ratio. According to (43), there exists a point among the ε-covering centers that is close to z = argmax_{x∈Θ_t} f_ρ(x). It follows that when |S| ≤ n(1 + β) and f_ρ(z) ≥ 0, the corresponding bound holds with ε = s/[√3 αn(1 + β)]. Hence, we can rewrite (45) as

P(L(M̂, M*) ≥ t) ≤ P(sup_{x∈Θ_t} f_ρ(x) ≥ 0, |S| ≤ n(1 + β)) + P(|S| > n(1 + β)),

where

s_ρ = s + [(N + K)s / (αn(1 + β))] [(2/3) ρ₁ + (2/√3) ρ₂ U] + √2 ρ₁ (N + K) U + ρ₂ (N + K) U².
Note that z, x (i) ∈ V, according to Pinsker's inequality and (60), For all x = x with x, x ∈ V and N ≥ 2, Hence when x (i) = z, Let P i denote the probability measure under x (i) . Results above show that Assign a prior on x that is uniform on V and denote by P V the Bayes average probability with respect to the prior. By Fano's inequality in Lemma 5, where I(x; X S ) is the mutual information between x and Y S . It can be bounded by the maximum pairwise KL divergence of Y S under P i and P j as follows, Since σ(·) is logistic function, the derivative σ (x) = σ(x) (1 − σ(x)) < 1. By Assumption 2, φ(·) is Lipschitz continuous with coefficient α , we get that σ(φ(·)) is also Lipschitz continuous σ (φ(u)), by Lemma 6 we get for all i, j ∈ [N ]. Hence, there exists x (i) ∈ V such that Let x * = x (i) , P = P i and It follows from (68) that Proof of Theorem 3. We will show the result by continuing the proof of Lemma 1 and Theorem 1 with some modifications. Let f ρ (x) be the penalized log likelihood ratio, we have According to (43), there exists x among the -covering centers such that where z = argmax x∈Θt f ρ (x). It follow that when |S| ≤ n(1 + β) and f ρ (z) ≥ 0, with = s √ 3αn(1+β) . Hence, we can rewrite (45) as P L(M , M * ) ≥ t ≤ P sup x∈Θt f ρ (x) ≥ 0, |S| ≤ n(1 + β) + P (|S| > n(1 + β)) where s ρ = s + (N + K)s αn(1 + β) 2 3 ρ 1 + 2 √ 3 ρ 2 U + √ 2ρ 1 (N + K)U + ρ 2 (N + K)U 2 .