Simulating Subject Communities in Case Law Citation Networks

We propose and evaluate generative models for case law citation networks that account for legal authority, subject relevance, and time decay. Since Common Law systems rely heavily on citations to precedent, case law citation networks present a special type of citation graph which existing models do not adequately reproduce. We describe a general framework for simulating node and edge generation processes in such networks, including a procedure for simulating case subjects, and experiment with four methods of modelling subject relevance: using subject similarity as linear features, as fitness coefficients, constraining the citable graph by subject, and computing subject-sensitive PageRank scores. Model properties are studied by simulation and compared against existing baselines. Promising approaches are then benchmarked against empirical networks from the United States and Singapore Supreme Courts. Our models better approximate the structural properties of both benchmarks, particularly in terms of subject structure. We show that differences in the approach for modelling subject relevance, as well as for normalizing attachment probabilities, produce significantly different network structures. Overall, using subject similarities as fitness coefficients in a sum-normalized attachment model provides the best approximation to both benchmarks. Our results shed light on the mechanics of legal citations as well as the community structure of case law citation networks. Researchers may use our models to simulate case law networks for other inquiries in legal network science.

The community structure of CLCNs has received significantly less attention. This, however, is a rich area of research that retraces to seminal works in network science [12,13]. Understanding communities broadly as connected subgraphs with denser within-set connectivity than without [14] allows us to automatically uncover network communities by iteratively removing links between otherwise dense subgraphs [13] or stochastically modelling link probabilities [15,16]. A wide range of community detection techniques [17][18][19][20][21] as well as measures for evaluating community quality [22][23][24] have been studied. To our knowledge, two prior works have examined community structures in case law. Bommarito et al. [25] develop a distance measure for citation networks which they exploit to uncover communities in USSC judgments. Mirshahvalad and colleagues [26] use a network of European Court of Justice judgments to empirically benchmark a proposed method for identifying the significance of detected communities through random link perturbation.
Studying community structures in CLCNs can reveal deeper insights for both legal studies and network science. For legal studies, how far communities in CLCNs mirror legal doctrinal areas (e.g., torts and contracts) is telling of judicial (citation) practices. A judge who cites solely on doctrinal considerations should generate likewise doctrinal communities; one who cites for other (legal or political) reasons would transmit noisier signals. Community detection algorithms could also further the task of legal topic classification. Thus far, this has primarily been studied from a text-classification approach [27,28].
For network science, CLCNs present a special case of the citation networks that have been studied extensively by the field. Studies mapping scientific papers as complex networks have demonstrated that they exhibit classic scale-free degree distributions [29] (but cf [30]). This has been attributed to preferential attachment, in that papers which have been cited more will be cited more. Other factors shaping paper citations include age [31] and text similarity [32]. These variables' interacting influences on citation formation yield rich structural dynamics in citation networks. For instance, over time, some papers come to be entrenched as central graph nodes while others fade into obsolescence, showing that age alone does not determine centrality [33]. Numerous generative models, discussed further in Part 2.2, have thus been proposed for citation networks, including for web hyperlinks [33][34][35].
As [1]'s findings suggest, however, the structure of CLCNs may differ from those of these traditional citation networks. In law, judges must consider the authority and relevance of precedent, amongst other things, when citing cases in their judgments. The doctrine of precedent further requires them to prefer certain citations to others. It is thus worth studying how CLCNs relate to traditional citation networks.
To this end, we examine how far generative models proposed for traditional citation networks can successfully replicate CLCNs. After a brief review of existing models (Section 2.2), we propose and evaluate a CLCN-tailored model that attempts to account for the unique mechanics of legal citations. The model is premised on an attachment function that attempts to capture aspects of legal authority, subject relevance, and time decay (Section 2.3). As measures for legal authority and decay are well-established, we focus on how subject relevance may be modelled. We devise a method for simulating node-level subjects and experiment with alternative attachment functions that incorporate these vectors in four different ways: using subject cosine similarity as a standalone linear feature in the attachment model; using the same as fitness coefficients [36]; constraining nodes to citing within subject-conditional "local worlds" [37]; using subjects to generate subject-sensitive PageRank scores [38] (Section 2.3.3). We then study by simulation the topological and community properties of networks produced by these alternative models (Section 3.1) and benchmark promising models (and baselines) against two empirical CLCNs: early decisions of the United States Supreme Court and of the Singapore Court of Appeal (Section 3.2). 1 We find that using subject similarity scores as fitness coefficients within a sum-normalized probability function best approximates these actual networks. However, key differences remain between the simulated and actual networks, suggesting that other factors influencing legal citations are remain unaccounted for. Nonetheless, our work represents a first step towards better capturing and studying the mechanics of case law citation networks.

MATERIALS AND METHODS
Section 2.1 sets the legal theory and context behind case law citation formation. Section 2.2 explores how far these are captured by existing models. Section 2.3 explains our proposed models. Section 2.4 describes the simulation protocol. Section 2.5 explains the graph metrics used to evaluate the simulations. Section 2.6 details the benchmark datasets.

Legal Context
We define a CLCN as a graph G(N, E) where all nodes n ∈ N are legal case judgments and all edges e ∈ E citations between them. 2 Let k, k − and k + respectively denote the degree, in-degree, and out-degree distributions of G. Nodes may have attributes such as the authoring judge, decision date, legal subject, and the text of the judgment. Edges may be weighted (e.g., if a judgment cites another more than once), and have attributes such as whether the citation affirms or overrules the cited case.
Like all citation networks, CLCNs are time-directed and acyclic. 3 CLCNs are unique, however, because the probability that a new node n t cites an earlier n t−1 (denoted P((n t , n t−1 )) and the entire distribution P) is shaped by legal theory and practice.
Posner [39] identifies five overarching reasons for legal paper citations, namely to: 1. Acknowledge priority or influence of prior art 2. Provide bibliographic or substantive information 3. Focus disagreements 4. Appeal to authority 5. Reinforce the prestige of one's own or another's work Reason (4) is particularly pertinent to case citations in Common Law systems characterized by the doctrine of binding precedent. The doctrine, in brief, means propositions of law central to a court's essential holding are taken as binding law for future purposes. Lower courts are bound to follow these holdings. While courts at the same level of hierarchy are not technically bound the same way, great deference is generally accorded to past cases nonetheless.
Recent studies have thus sought to measure legal authority with network centrality measures calibrated for the legal domain [2,3]. Beyond citation counts, a judgment's authority is further shaped by its subject areas and time context [40]. Lawyers do not think of judgments as simply authoritative in the abstract, but within a given doctrinal subject area (i.e., torts or contracts) and at a given time. Precedential value waxes and wanes as a judgment gets entrenched by subsequent citations, ages into obsolescence over time, and as other complementary or substitute judgments emerge [7,39]. Relevance, authority, and age are thus key, interconnected drivers of CLCN link formation [8].

Existing Models
How far are these legal mechanics captured by existing network models? In this section, we review existing citation network generative models and consider how they may be used to simulate CLCNs. Note that this paper is not a comprehensive review and will only highlight illustrative examples.

Degree-based Models
Classic Barabasi-Albert ("BA") [41] preferential attachment sets P BA ki j k j . This model famously recovers scale-free degree distributions observed in empirical networks. However, in this model earlier nodes acquire a significant and permanent advantage over later ones, particularly if the former are cited early on. Thus, a known limitation [42] of using BA model for simulating citation networks is that, since P BA 0 whenever k 0, new nodes (which necessarily have k − 0) are very unlikely to gain citations. Thus, the final graph may be such that most subsequent nodes cite the root node. This, of course, does not occur in empirical CLCNs (see also Section 3.2).
The "copying" model [34] offers a partial workaround. Links are determined by first randomly choosing one node from N as a "prototype", denoted n p . Destinations are then either selected randomly from N by a coin toss with manually-specified probability α or copied from n p otherwise. While the model does not explicitly include k in its attachment process, notice that nodes with zero in-degree may only be cited under the former branch, while nodes with high in-degree are more likely to be cited under the latter branch. The copying model can therefore be broadly understood as a mixture between the Erdos-Renyi and BA models with mixture intensity controlled by α. Setting α 1 recovers Erdos-Renyi completely, though the model with α 0 is not completely equivalent to BA. This allows the copying model to produce scale-free degree distributions while leaving open the possibility for k 0 nodes to be cited. However, these nodes are still less likely (depending on α) to be cited, as they cannot be cited under the copying branch. Moreover, the random process used for selecting prototypes and deciding whether to copy does not accord with legal intuitions. We do not expect new judgments (or papers) to randomly choose older judgments to either cite or copy citations from.
Another alternative proposed by Bommarito et al. ("BEA") [43] is a generalizable attachment function which considers inand out-degree separately. More precisely, where k − i and k + i are node i's in-and out-degree respectively, and {α, β} ∈ R are parameters for tuning their influences on P. Denoting {k − i , k + i } as a single feature vector X i and {α, β} as weight vector B, (1) may be rewritten as Because the softmax has a vector smoothing effect, using it over conventional sum normalization ensures that non-zero citation probabilities are assigned to all nodes, even for nodes where k 0. Seen this way, BEA provides a readily-extensible framework for modelling citation networks. BX is capable of encompassing an arbitrary range of weights and features. 4 The BA and BEA models may be seen as instances of what Pham et al. [44] call the General Temporal ("GT") model. GT generalizes k into an arbitrary function of node degree A(k), known as the "attachment kernel", such that P GT ∝ A(k). The GT framework allows a large class of degree-based attachment models to be specified and estimated by maximum likelihood. For instance [44], simulate networks with A(k) 3((log(max(k, 1))) 2 + 1. GT thus offers an attractive framework for modelling legal authority in CLCN link formation. But, despite its name, GT attachment does not explicitly model node age. Yet, age has been identified as a factor driving citation networks, including CLCNs [8].

Aging Models
More generally, degree-based models generally ignore the welldocumented influence of node age on citation formation [31,42,45]. By contrast, "aging" models [31,45] propose introducing a decay vector w(n, t) such that P ∝ A(k) × w(n, t). Here, w(n, t) can be any standard decay function which takes maps every n to weights bounded by [0, 1] based on their arrival time t n . Decay functions are further monotonically non-increasing with item age a n t − t n , and assign weight 1 to nodes with a n 0 [46]. For instance, a simple sliding window assigns all items younger than a cut-off age to weight 1 and all other items to weight 0. The specific aging model proposed by Wang et al. [31] uses node in-degree and exponential decay such that P aging ∝ k − × exp(−τa). In exponential decay functions, the parameter τ controls decay rate and induces a fixed half-life of ln (2) τ . Thus, the aging model gives younger nodes a higher chance to be cited than older ones with the same in-degree. However, nodes with k − 0 still have zero probability of being cited, regardless of age.
One extension of the aging model which incorporates intuitions from copying models are Singh et al.'s "relay" models [33]. Like copying models, relay models first choose a prototype n p and performs a coin toss to determine the next step. But unlike copying models, prototypes are chosen by BA preferential attachment instead of randomly. The first coin's head probability is given by exp(−τa n ) (that is, exponential decay). On heads, n p itself is cited and the process ends. On tails, a second coin toss with manually-specified head probability θ decides if citations are "relayed" (on heads) or if n p will be cited nonetheless (on tails). In a "relay", a new prototype is selected from within the set of nodes citing n p via a specified distribution D (Singh et al. use either uniform-random or preferential attachment). The process repeats until broken by a coin toss or the maximum specified relay depth is reached (in which case the final prototype is cited).
The exponential decay which parameterizes the first coin toss means aged papers are less likely to be cited themselves than they are to relay citations to younger papers citing them. At the same time, since prototypes are chosen initially by preferential attachment and subsequently re-chosen by D, relay models incorporate aspects of degree-based, scale-free models [33]. show that relay models better fit empirically-observed distributions of paper citation age gaps (i.e. the age difference between citing and cited papers) than the classic aging model. Relay models thus provide a more sophisticated method to account for both degree and age simultaneously. What remains missing, however, is a way to incorporate subject relevance as well. We thus turn to examine "fitness" models.

Fitness Models
Fitness models [47] attempt to account for each node's innate ability to compete for citations. This is generally achieved by introducing a vector of node fitness coefficients η i . For instance, the Bianconi-Barabasi model [36] introduces a vector of uniformrandomly sampled ηs to classic preferential attachment such that P BB η i ki j η j kj . Introducing η weakens the monopoly k holds over citation probabilities in P BA . A fit node has a good chance of being cited even if its degree is low (though not if its degree is zero) [48]. further propose introducing a time-decay vector w, such that the final attachment function becomes P η i A(k i )w j η j A(k j )w . Notice that fitness in this regard may represent any arbitrary attribute other than degree which is believed to influence citation probabilities. For instance [42], use the ratio between (a) the theoretical number of citations a node should receive under BA and (b) the actual number received to measure the "relevance" of a node.
Likewise, if we conceptualize η, A(k), and w as capturing legal relevance, authority, and time effects respectively, this threevariable model appears ideal for modelling legal citations. Here, degree-based centrality scores (an instance of A(k)) have been shown to capture legal authority well (see Section 2.3.1 below). Modelling time effects with w is also relatively standard. The crux, therefore, is devising etas that capture subject relevance. This turns on the distribution it is sampled from. Drawing fitness uniformly from [0, 1], as in [36], yields in expectation an evenly distributed node ranking inconsistent with the intuition that nodes sharing subjects with the citing node should be fitter (that is, more relevant) than others.
One workaround is to calculate eta empirically from the text content of actual papers [35]. use the cosine similarity between (stopped and lemmatized) word frequency distributions of two papers' texts to their content similarity. They then propose a "three-feature model" which places content similarity scores alongside in-degree preferential attachment and power-law time decay as competing node attachment distributions. The model randomly chooses one of the three (with probability α, β, c respectively, α + β + c 1) as the final attachment function. This creates a probabilistic mixture between the content similarity, degree, and aging models, although only one model is ultimate used to generate any given edge.
Using text similarity measures to capture content overlap is intuitively logical and allows us to exploit the growing literature on text embeddings [49] (which find increasing representation in legal studies as well [7,50]). The main drawback is that because text is difficult to simulate, we are limited to simulating edges between actual, existing nodes. To illustrate, for CLCNs, we can compute text similarity between empirically-observed case judgments and simulate citations between them. This would, of course, reveal important insights about CLCNs. However, generating the case judgments themselves would be difficult. We would not be able to deviate from empirically-observed node attributes.
To summarise, existing literature provides a wealth of citation network generation models. Each have their own strengths and weaknesses when theoretically applied to CLCNs. At the same time, we are not aware of any study that attempts to do this. Building on this literature, the next Section proposes a model tailored for CLCNs.

Modelling Case Law Citation Networks
Following [43], we start at time t 0 with G 0 comprising |N 0 | 1 node and |E 0 | 0 edges. For each t till a specified stop T, |N t | ∼ Po(λ) new nodes are added. Each n t cites E nt ∼ Po(μ) prior nodes (re-drawn independently per node). {λ, μ} ∈ R thus control network growth rates. The main innovation of our model lies in the attachment function. In the abstract, we use a probability function P f (Λ, ρ, w) where Λ generalizes A(k) above to encompass any function, including functions not based on k alone, capable of measuring legal authority, ρ measures subject relevance, and w is a time-decay function.
The goal is to calibrate these variables in a legally-contextualized manner. Below we expand on each variable in turn.

Authority
In place of k − and k + above, we use Kleinberg's [51] authority and hub scores. These have been shown to accurately recover legallysignificant cases from CLCNs [2,3,5,40,52]. We denote authority and hub scores as A(k − ) and A(k + ) respectively. While we might intuitively expect in-degree to be more representative of authority than out-degree, legal scholars have found that out-degree-based scores can be a better predictor of future case influence [40]. Cases which discuss and synthesize a large number of authorities tend to represent important disputes into which significant legal and financial resources are poured. For similar reasons, they also tend to become important legal checkpoints themselves. We therefore include both score types in the model. In any event, α or β could be set to zero to remove either score.

Time
Following the aging model, we use a standard exponential decay where w(n, t) exp(−τa n ). This suits the legal context because citations to centuries-old judgments are not uncommon. Thus, discrete decay functions like the sliding window that apply a standard discount to all papers above a certain age would not model this observation well. Given the law's respect for old authority, we assume throughout this paper that τ 0.01, resulting in a precedent half-life of about 70 periods. Of course, future work could explore how τ may be empirically estimated (see [31]) and how it may vary over time, jurisdiction, subject, or even judge.

Subject Relevance
To derive relevance, we first need to simulate subjects for each node. Drawing inspiration from Latent Dirichlet Allocation ("LDA") [53], we assign each node a vector of m subjects ϕ i ∼ Dir(ψ, m). ψ is a m-sized vector that controls subject skew. If we want some subjects to occur more than others, possibly following a power-law, a similarly skewed ψ may be used. As a null model, however, we may set ψ 1 m . One drawback of the Dirichlet is that non-zero probabilities may occur across many subjects. This is inconsistent with how legal cases generally discuss only a few subjects. Thus, we set a minimum cut-off of 0.1 below which subject values are floored to 0. The vector is then normalized to sum back to 1. Should this cutoff result in an entirely zero vector, one randomly-chosen subject is assigned weight 1. Because LDA treats documents as finite mixtures over m latent overlapping 'topics' that are in turn multinomial distributions across words, such a cut-off is intuitively similar to assuming that any subject generating less than 10% of the words in a judgment is not a subject that should be ascribed to it.
These subject vectors are analogous to overlapping community belonging coefficients [24], though it is always possible to partition nodes into discrete subjects by taking, for instance, max(ϕ i )). Here, we interpret ϕ as non-fuzzy subject proportions rather than probabilities. That is, a case with ψ i {0.51, 0.25, 0.24} has subjects 1, 2, and 3, each with probability 1. But it is primarily about subject 1, in that 51% of its content is expected to come from the same.
Given ϕ, subject relevance can be modelled in at least four ways: As Linear Features: First, we can derive subject similarity scores ρ sim n t ,N t−1 g(ϕ n t , ϕ N t−1 ), where g is some vector similarity measure. Many options for g exist, but for now we default to cosine similarity given its established use in document clustering, including for legal documents [54]. The simplest way to model relevance is then to include ρ sim as a standalone linear feature with its own weight γ such that P linear ∝ w × (αA(k − ) + βA(k + ) + cρ sim ).
As Fitness Coefficients: Including ρ sim linearly is attractively simple, but may fail to account for interaction effects between authority, relevance, and time. Thus, a second approach is to model ρ sim as fitness coefficients, so P fitness ∝ ρ sim × w × (αA(k − ) + βA(k + )). This is broadly similar to the model proposed in [48], except that fitness values are computed from simulated subjects. Notice that, unlike with the linear features approach, using subject similarities as coefficients ensures that prior nodes with zero subject overlap with the citing node will be assigned P 0 as well.
As Locality Constraints: Another more direct to enforce this is to limited nodes to citing within subject-conditional "local words" [37]. Within each locality, nodes can be selected by any subject-conditional probability distribution [37].'s original paper used the uniform distribution. To account for legal authority and time effects, we continue to choose nodes using HITS scores and exponential decay. To approximate the idea of nodes being authoritative within subjects, HITS scores are recomputed within the subgraph of nodes sharing at least one subject with the citer. More precisely then, P local ∝ w × (αA local (k − ) + βA local (k + )), with the subscript 'local' denoting that the vector is computed on a subject-local subgraph.
For Subject-Sensitive PageRank: Another way to interact subject overlap with degree-based authority is to use ϕ to compute so-called topic-sensitive PageRank ("TSPR") scores [38] that may be used in place of HITS scores. While conventional PageRank [55] produces one global ranking that disregards node topic, TSPR first calculates m (the total number of topics) different rankings by setting non-uniform personalization vectors for each topic given by where N m is the set of nodes with subject j. Given a query node q with topic weights ϕ q , TSPR then returns TSPR q v ϕ q · PR j , with PR j being personalized PageRank scores for topic j. While TSPR has not been studied in legal networks literature, it offers a promising way to simultaneously account for authority and subject relevance using just one centrality measure. Thus, The subject models above are, of course, based on the literature reviewed in Section 2.2. The best approach for modelling subject Frontiers in Physics | www.frontiersin.org July 2021 | Volume 9 | Article 665563 relevance is not obvious. Neither are the approaches mutually exclusive. For instance, after constraining the citable node set by subject, we may still include ρ as a linear feature while also using TSPR scores to model authority. Other combinations are also theoretically possible. But doing so may lead to contradictions. For example, calculating TSPR within a subject-constrained subgraph will return the simple PageRank score of that subgraph. It may also overplay the importance of subject relevance. For now, we study the properties of the networks produced by each approach independently.
To summarise, the proposed subject models begin with one root node and, at each time step t, adds |N t | ∼ Po(λ) nodes with E n t ∼ Po(μ) edges per node. Attachment probabilities are specified generally by P f (A(k), w(t), ϕ), where A(k) is some degree-based centrality measure including simple in/outdegree, HITS scores, and TSPR, w(t) is an exponential decay function, and ϕ ∼ Dir(ψ, m) a vector of simulated node subjects which may be incorporated into P in four different ways proposed here (though we do not rule out alternatives).

Model Simulations
To study the properties of our proposed models, we simulate 50 iterations of T 500 steps for each subject model. Building on [43], we experimented with softmax as well as sum normalization for each model. For ease of reference, below we refer to the four proposed subject models as Linear, Fitness, Locality, and TSPR respectively. We use brackets to identify the normalization scheme. To illustrate, TSPR (sum) refers to a sum-normalized attachment model based on exponentially-decayed subjectsensitive PageRank scores.
For baseline comparison, we also simulated the BA, BEA, aging, copying, and relay models. We ran BA with degree-based preferential attachment (not in-degree), following the original model. Likewise, BEA was simulated with α and β both equal to 1. The aging baseline follows [31]'s specification, using only indegree and an exponential decay with tau 0.01. A softmaxnormalized alternative was tested as well. The copying model was run with copying probability α 0.5 (not to be confused with the in-degree weight α in our models). Finally, we used preferential attachment relay and set relay depth at 1, τ 0.3, and θ 0.9. This follows optimal parameters found by Singh et al. for approximating the scientific paper networks they studied.
Including 5 baselines, subject models, and alternative normalizations, a total of 16 different models are run for 50 iterations each. 5 To promote comparisons across models, we fix a few key parameters in our simulations. First, |N t | is fixed universally at 1. Thus, exactly one node is added at every step for every simulation. Second, the number of subjects is fixed at 30. Third, within each iteration we first draw all E nt s from Po(5) and all ϕ n t s from Dir 1 30 , 30 and use the same inputs across all models/approaches. This means the out-degree distribution of all models in the same iteration will be similar. Further, because the same subject vectors are used across all parameterizations within the same iteration, only 50 × 501 individual node subject vectors are generated. 6 Fourth, all weights {α, β, c} are set at 1 whenever relevant (though recall that c is only used by the linear feature subject model). Finally, an exponential decay with τ 0.01 is used for all models (except the relay model).
A few implementation details are worth noting. First, because E nt is randomly-drawn, it can exceed the total number of nodes in the existing, citable graph. Further, some attachment models result in zero citation probabilities for certain nodes, further limiting the citable node set. Thus, whenever N t−1 < E n t , we draw only |N t−1 destinations (while still using P). As a result, a node's realized out-degree can be lower than its initially-drawn out-degree. This is more likely to occur in the Locality models since nodes may only cite within subject-local subgraphs. This accounts for minor differences in total edge counts across model simulations. Second, nodes and edges are added in batches after attachment probabilities and edge destinations for every new node at a given t is determined. All computations are based solely on G t−1 , so nodes and edges added at the same t do not influence computations for each other. Third, after P is calculated, destination nodes are selected without replacement, so E n t unique destination nodes are always drawn. This follows prior literature which (implicitly) samples without replacement [43].
Finally, we use NetworkX's [56] Python package to compute HITS scores. Since convergence is not guaranteed, we allow the algorithm to run for a maximum of 300 power iterations, three times the package default. We modify the package slightly to return prevailing scores if convergence is not achieved by then. To facilitate convergence (and save computational resources), we exploit the intuition that HITS scores for step t + 1 should not differ too much from those of step t and provide HITS scores from previous steps as warm starts. Note that this cannot be done for Locality. Because the citable node set in that model varies from node to node, relevant prior HITS scores vary.

Model Properties
An important preliminary question is whether the subject models yield scale-free degree distributions even as they seek to model time and relevance effects. As our simulation protocol fixes outdegree distributions, so comparing out-degree or total degree distribution is less meaningful. Thus, we begin by examining each model's realised in-degree distributions. To compute the average distribution across 50 iterations of the same model, we stack distributions on each other to produce a 50 × 501 matrix of indegree counts. We then take the column-wise mode of this matrix 7 as the average distribution and compute the frequency-rank distribution of the same.
To further examine how the baseline and subject models differ in subject structure, we also derive subject signatures for each network. These are broadly inspired by [33]'s temporal bucket signatures. More precisely, denote the subject edge histogram of a graph G whose nodes fall within m subjects as a size m × m matrix H(i, j) where each entry is the total number of times nodes with subject j have cited nodes with subject i. Because one node may have many subjects, a single edge can add to many entries. 8 Thus, |H| ≥ |E|.
The global-sum-normalized matrix H norm H |H| then yields the unconditional probability distribution for the cited and citing subject pairings of an arbitrary edge. Meanwhile, normalizing H column-wise so that H col H |H(j)| yields in each column the probability of subject i being cited conditional on the citing edge having subject j. In this way, H, H norm and H col offer different insights on the subject signature of a single network. Subject signatures for each model may then be computed by averaging these matrices across model iterations.
Finally, we compute a range of network density and community quality metrics for selected models. These include intra/inter-community edge ratios [57] and link modularity, being [18]'s modularity scores extended to the case of overlapping, directed communities [22]. We compute these metrics against (1) the simulated ϕs themselves (as ground truth subject labels) and (2) communities recovered by k-clique percolation. k-Clique percolation is a useful baseline because it is an established method for overlapping community detection [14]. Briefly, it recovers communities by percolating k-sized cliques to adjacent k-cliques (i.e., those sharing k-1 nodes). Its time-efficiency also means running the algorithm on all simulated networks is more practical than other equally established but less efficient algorithms, such as Girvan-Newman edge-betweenness [13]. We fix k at 3, the smallest meaningful input, to allow more, smaller communities to be returned. Though a directed k-clique algorithm exists [58], we could not find any open-source implementation. As an accessible baseline was desired, we relied on NetworkX's undirected implementation instead. Code for k-clique percolation and all community quality metrics are from CDLIB [59].

Empirical Benchmarks
Studying the structural properties and subject signatures of the simulations identifies certain more promising approaches for modelling CLCNs. As a final step, we benchmark selected approaches against two empirical CLCNs. The first is the internal network of USSC judgments that is well-studied in legal network science. To obtain legal subjects, we join Fowler et al.'s [3] edgelist with metadata from the Spaeth database [60], particularly the "issue areas" identified for each case. The second is an internal network of SGCA judgments that has also been studied in prior work [7]. The dataset covers citations between all reported decisions of the SGCA from 1965 (the year Singapore gained independence) to 2017. Judgments are assigned to subjects using catchwords provided by the Singapore Law Reports, the authoritative reporter of SGCA judgments. Following [28], we map these subject labels to 31 unique subject areas (including the "Others" category). Note that in both datasets subjects are overlapping in that the same case may belong to more than one subject.
These networks represent different CLCN archetypes. Although both the United States and Singapore inherited English law, the legal, social, and time contexts in which each system originated and developed is vastly different. Further, while the USSC primarily (and selectively) reviews cases of federal and constitutional significance, the SGCA, as its name suggests, routinely considers appeals on matters of substantive (including private) law. The datasets are also practically usefully because both provide human-labelled legal subjects. To be sure, this also implies that comparisons between the two networks must be made with caution. On top of their different legal contexts, the legal subjects provided by each database also differ. The Spaeth database uses broad issue areas such as "Civil Rights" and "Due Process". The Singapore dataset uses specific doctrinal areas such as torts and contracts. Below we refrain from drawing comparisons between the two networks except on broad properties such as degree distributions.
As we are primarily interested in network generation, we focus on the first 2001 nodes of the USSC network and the first 1001 of the SGCA network (making allowance for one root node). More USSC nodes were used because the early USSC graph was sparser. The resultant USSC and SGCA benchmark graphs had 777 and 779 edges each. The USSC benchmark spans from the year 1791 (the first node) to 1852 (the 2001th node), while the SGCA benchmark spans from 1970 to 1999. More detailed properties of both graphs are discussed in Section 3.2.
To set up the comparison, we first calibrated the model with the empirical properties of each network. Specifically, USSC simulations used an average edge rate per case of 777 2001 0.388. Subject vectors were drawn across 14 issue areas from Dir(ψ us , 14), with ψ us being the normalized issue frequency distribution of the benchmark graph. Likewise, SGCA simulations were run using edge rates of 779 1001 0.778 with subjects drawn from Dir(ψ sg , 31). As with the initial simulations, we fix |N t | 1 for all t and pre-draw all |E t |s and ϕs per iteration.
All benchmark simulations are run and assessed using the same protocols and metrics described in Section 2.4. The implementation details noted there apply. Additionally, we exploit the subject signatures to compute vector distances between the simulated and empirical graphs. In particular, we take the distance between H norm as an indicator of the aggregate differences between the subject structure of two graphs. We also measure distances between the main diagonals/off-diagonals alone for insight into differences between intra/inter-subject structure. Column-wise distances between H col s can be taken as a measure of per-subject differences in structure.
Amongst the wide range of graph distance measures available (see [61] for a review) we use the L1 distance because of its simple, intuitive interpretation as the sum of absolute differences. 9 More precisely, L1(H 1 , H 2 ) j i |H 1 (i, j) − H 2 (i, j)|. At the same time, 8 To illustrate, if a node with subjects 1 and 2 cites a node with subjects 3 and 4, entries(3,1), (4,1), (3,2) and (4,2) are all incremented. 9 We also experimented with Kullbeck-Leibler divergence and obtained similar results, though that measure often returns infinity.
Frontiers in Physics | www.frontiersin.org July 2021 | Volume 9 | Article 665563 classic measures such as the Hamming and Jaccard distances are meant primarily for binary (adjacency) matrices. Future work could explore more tailored distance measures. In particular, since H and H norm may also be interpreted as adjacency matrices for a weighted and directed meta-graph between the subjects, a distance measure specialized for such graphs (e.g. [62,63]) may be useful. Importantly, note that subject indexes must be aligned before computing most vector distances, including the L1, as entry order affects results. In our context, this is akin to ensuring that subject i of H 1 is comparable to subject i of H 2 . As our simulated subjects are arbitrary, we cannot order them by substantial content: say, to place torts at index 0 and contracts at index 1. Instead, we place the most frequent subject at index 0 and the least frequent at m − 1. Since the most common subjects are often also more likely to be cited (simply as a function of frequency), frequency indexing desirably concentrates probabilities towards the topleft quarter of the matrix, presenting a visually-readable signature (see Section 3.2). When computing signature distances, therefore, we are comparing citation patterns across subjects as ranked by frequency. If the most common subjects in graphs G 1 and G 2 both primarily cite other common subjects, signature distances would be relatively small. But if common subjects in G2 primarily cite its least common subjects, signature distances would be larger.

Model Simulations
As shown in Figure 1, most subject models successfully generate scale-free in-degree distributions similar to baselines. We also observe that Aging (sum) produces an average in-degree distribution with one node monopolizing most edges. Because aging models consider only in-degree, sum-normalization leads exactly to the problem, discussed in Section 2.2, where new nodes are never cited. This is partially addressed in by softmax normalization, but the Aging (softmax) model still manifests a visibly more imbalanced in-degree distribution than others. The same is also true for the BEA model. This is because even though the softmax function generally assigns non-zero probabilities, small-valued input elements are quickly assigned vanishingly small values. Consider for example that softmax(5, 5, 1) 0.495, 0.495, 0.009 at three decimal places. This affects models like BEA and Aging, which use simple degree counts as inputs, more directly because differences in feature values are larger.
The models are further differentiated by subject signatures. Figure 2 shows that the baseline models do not reproduce intrasubject citations. Instead, citation densities are evenly spread across all subjects. As expected, this also applies to the softmax-normalized subject models (except Locality). The softmax function's smoothing effect is particularly significant here because we use featuressuch as authority score and cosine similaritiesthat range within [0, 1]. Much of the potentially differentiating information captured by these variables are expressed in terms of small decimal differences which are easily smoothened away. Therefore, except when used in the context of the Locality model (which would impose strict subject constraints to begin with), softmax normalization appears unsuited for our purposes.
The remaining subject models have largely similar subject signatures. To further distinguish them, we examine specific graph properties presented in Tables 1 and 2. While our subject models are similar to baselines in some respects like connectivity, they have clear structural differences, particularly in terms of subject communities.
General Properties: Clustering coefficients were generally low (save in the BEA and Aging (softmax) models). All models produced giant components encompassing most (around 99.9%) of the graph. However, in-degree distributions in baselines are generally more imbalanced than in the subject models. Gini coefficients for the baselines ranged between 0.52 (relay) to 0.998 (Aging (sum)) whereas those for the subject models ranged between 0.417 (Linear (sum)) to 0.598 (TSPR (sum)).
Subject Structure: Relative to all others, sum-normalized subject models yielded more intra-subject edges, lower expansion and conductance scores for communities defined by Frontiers in Physics | www.frontiersin.org July 2021 | Volume 9 | Article 665563 the gold labels, as well as higher link modularities. These networks therefore exhibit stronger within-subject clustering (an attribute which, to recall, should in theory characterize CLCNs). Conversely, the softmaxed subject models differed only slightly from the baselines. To be sure, stronger conformity with subject labels is not necessarily better, since legal citations are influenced by more than subject relevance alone. Depending on the extent of subject clustering desired, therefore, TSPR approaches may offer a middle-ground. k-Clique Structure: The k-clique metrics paint a less coherent picture. There is no clear correlation between how well the models retrace gold label subjects and the average number, size, and quality of the communities recovered by k-clique percolation. The average number (size) of communities uncovered amongst sum-normalized subject models varies from around 28 to 82 (7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18) whereas the softmaxed subject models tend to yield around 70 k-clique communities of 5-8 nodes. Insofar as our models approximate the legal citation process, these results suggest that k-clique percolation may be less useful for clustering (and classifying) case law by legal subjects. Communities recovered by the algorithm do not appear to reflect actual legal areas, suggesting that legal subject clustering does not follow the assumptions of k-clique percolation. Nonetheless, the observed k-clique communities may be the result of other clustering mechanics inherent in legal citation networks. To this extent, they offer an independent basis for assessing the structural similarity of different simulated models. For instance, it is clear from Table 2 that the BEA model, which always results in exactly 1 large k-clique community that encapsulates the whole network, is structurally distinct from the rest. The sum-normalized Fitness and TSPR models also stand out from the other sum-normalized subject models as they tend to produce fewer but larger k-clique communities (see Table 1).
In sum, preliminary simulations demonstrate that the specific approach used to incorporate subject relevance induces significantly different network structures. While using subject cosine similarities as fitness values in a sum-normalized model may be theoretically similar to setting subject constraints on citable localities, the resulting networks differ in key aspects such as in-degree distribution and k-clique structure. The normalization function used also makes a key difference. This in turn underscores the importance of carefully selecting the subject model. Here we do not declare any one model as correct or best. To the extent that an ideal model exists, its parameterization likely turns on the specific type of CLCN we are trying to simulate. The legal and institutional context underlying the network would be relevant. Citation practices in one court at a given time may fall closer to the Fitness model, whereas another court may cite more in line with the TSPR model.
That said, it is also clear that certain subject models, especially those using softmax normalization, are unlikely to successfully capture the nuances of legal citations for the reasons given above. For benchmarking purposes, therefore, we focus on approaches that appear more promising, being Linear (sum), Fitness (sum), Locality (softmax), and TSPR (sum). For brevity, we only present results for more promising baselines as well (being BA, Copying, and Relay). 10    Notes: Values represent the mean (standard deviation) of the relevant metric over 50 iterations. Relay was simulated with θ 0.9, τ −0.3, relay depth 1, and preferential relay. Relay is not a sum-normalized model but because the relay mechanism relies on sum-normalized preferential attachment, it is expedient to present its properties here instead of in a separate table.    Notes: Values represent the mean (standard deviation) of the relevant metric over 50 iterations. Copy was simulated with α 0.5. Note that Copy is not a softmax-normalized model but is tabulated here for brevity. 10 Results for the other baselines and varying paramaterizations of Copying and Relay(on file) do not affect our conclusions.

Empirical Benchmarks
Figures 3 and 4 compare in-degree distributions produced by the calibrated baseline and subject models against actual distributions from the USSC and SGCA benchmarks respectively. Both benchmark distributions are broadly reproduced by all models (including baselines), though BA and Fitness (sum) produce slightly gentler-sloping distributions in both cases. At the same time, Figures 5 and 6 show that only the subject models successfully reproduce empirical subject signatures. To see this, first notice that since subject distributions for both the USSC and SGCA are imbalanced, citation densities are naturally concentrated within the top left quarter (which indexes the most frequent subjects). Second, observe that main diagonals for both benchmarks are also denser than their off-diagonals, showing that subject overlap materially influences actual legal citations. Although the general top-left concentration is reproduced by all models (recall that simulated subjects are drawn from a Dirichlet parameterized by actual subject frequencies), only subject models exhibit the benchmarks' distinctive signature. These observations apply to both H norm and H col signatures. As expected, H col densities for both the actual and simulated graphs are concentrated upwards across all rows, reflecting how nodes with the most frequent subjects are relatively likely to be cited, regardless of citing subject. This may be explained by subject frequency imbalance. To illustrate, most nodes, by definition, will have subject 0. Assuming node i has subjects {0, 1} and node j has {0, 30}, j may cite i under the subject models due to the shared subject 0. The edge (j, i) would count not only towards H(0, 0) but H(30, 0) and H(30, 1) as well.
Downward density gradients for the benchmarks' signatures are noticeably less smooth than for the simulations. This is clearest for the SGCA benchmark, where (within the top 15 subjects) some less frequent subjects are more likely to be cited than the most frequent. This is suggestive of specific correlations between legal subjects. For example, subject 15 may be legally very relevant to and therefore often cited by subject 5 even though the former rarely occurs. Our models do not currently account for such subject covariance and future work could explore this further. Figure 7 tabulates L1 distances between the benchmark and simulated networks and provides further confirmation that the subject models offer a closer approximation of CLCNs than baselines. Total, diagonal, off-diagonal, and column-wise distances are consistently smaller for the subject models than for the baselines. Note that results are similar if we include other (previously discussed) baselines, including alternative paramaterizations of Copying and Relay, and also if we use Kullbeck-Leibler divergence instead of L1 distance. 11 Surprisingly, none of the subject models are clearly superior to the others. All clock in comparable numbers for every metric.
Therefore, to determine which models produce better empirical fit, we look into detailed network properties for each benchmark as presented in Tables 3 and 4. Results here are consistent with the preliminary simulations. Clustering coefficients and giant component percentages for all models are broadly similar, but in-degree distributions vary across models.
As expected, subject models yield networks with stronger subject communities than all baselines. The subject models fit the SGCA benchmark well, each producing around 800 intrasubject edges compared to the benchmark's 889. They also yield expansion, conductance, and link modularity scores indicating better subject community quality. However, the community quality metrics for the benchmark's network are even better, suggesting that our models can place even more weight on subject overlap.
Fit for the USSC benchmark is less clear. The USSC network has relatively few intra-subject edges. Resultantly, the baselines fall closer to the benchmark on this metric. However, community quality for the subject models are significantly closer to the actual. Re-creating the USSC network may therefore require model paramaterizations which create smaller but even more distinctive communities.
In terms of k-clique structure, the model closest to both benchmarks is Fitness (sum). The model produces on average 13.62 communities of 3.85 nodes (compared to 24 and 4.042) and

DISCUSSION
The mechanics of case law citations involve complex interactions between the legal authority of a case, its relevance to the citer's subjects, as well as the case's age. We may represent the case law citation function in abstract as a probability distribution P f (Λ, ρ, w), where each input variable captures each of these attributes respectively. Determining the exact functional form to be used, however, is difficult. Numerous reasonable approaches can be imagined. For subject relevance alone, we experimented with using subject similarities as linear features, fitness values, to constrain citable horizons by subject, and to use subject-sensitive centrality scores. These all draw on existing literature on citation networks, but other approaches may be possible.
In this light, this paper studied by simulation the expected properties of networks generated by the four approaches above and compared them against established network models. We then compared more promising models against two actual CLCNs   Notes: Except in the "actual" column, values are the mean(standard deviation) of the relevant statistic over 50 simulations per model. For the actual network properties the case issue areas assigned by the Spaeth database are used as subject labels. k-Clique expansion, conductance, and link modularity scores are only computed for and averaged within simulations which return at least 1 k-clique community and should be interpreted in this light. 45, 15, 28, and 13 iterations of the Copy, Linear, Locality, and TSPR models respectively returned 0 k-clique communities. All other models tabulated always returned at least 1. In each row, results numerically closest to the benchmark are bolded.
Frontiers in Physics | www.frontiersin.org July 2021 | Volume 9 | Article 665563 from the USSC and SGCA respectively. Our findings underscore the importance of a legally-informed model of link generation process. The properties of all proposed subject models were substantially different from those of the baseline models, (provided that sum-normalization was used), and closer to empirical benchmarks in terms of both graph properties and subject signature. Amongst the range of approaches studied, we found softmax normalization generally unsuited for the task because it smoothens away distinguishing differences in case attributes.
We also see that model properties vary substantially when alternative means of modelling subject relevance were tested. All subject models performed comparably in terms of subject signature distance from both benchmarks. However, while three of the four subject models studied yielded average in-degree distributions very close to both benchmarks, the Fitness (sum) model yielded a noticeably gentler degree-rank slope. Nonetheless, Fitness (sum) best fits the k-clique structure of both benchmarks.
The emergent picture is that all subject models provide plausible (and superior to baseline) approaches for modelling CLCNs. For our two benchmarks, however, Fitness (sum) slightly edges out the other approaches as the most promising method for modelling CLCNs. To recall, this model takes the cosine similarity between the subject belonging vectors of two cases as a fitness coefficient that modifies the weight computed from each case's authority and age. Notice that this model is similar to the General Temporal model in many respects, and may be seen as an extension of that model tailored to the legal citation context. An important caveat here is that although Fitness (sum) is relatively better when compared to alternative models, it is not in absolute terms a perfect approximation of either benchmark. Key differences remain between the actual and fitness-simulated networks. Further, alternative models may be better suited for CLCNs from other courts.
Our findings hint at possible universality in terms of the way courts think about subject relevance when deciding which cases to cite, though we hasten to add that the two benchmarks ran do not offer sufficient evidence. Universal or otherwise, the alternative models proposed may be useful for examining how courts and possibly individual judges differ when selecting cases to cite. If court/judge A's citation network is better approximated by model X while court/judge B's network is better approximated by model Y. Our models are also helpful for generating better simulated data that may be used to research other questions in legal network science. For instance, researchers could benchmark centrality algorithms against networks simulated with these models to study how far these algorithms recover legally-significant nodes. This, to recall Section 1, is a rich area of legal networks research. Since the data is simulated, it becomes possible to specifically dictate which nodes are legally-significant.

LIMITATIONS AND FUTURE WORK
This paper is the first to study how the unique mechanics of case law citations may be simulated and studied using network models. As a first step, it necessarily leaves a number of important questions unexplored.
First, although we have identified promising approaches for modelling CLCNs, alternative models for legal authority, subject relevance, and time decay remain to be studied. We have focused on comparing different methods for modelling subject relevance because this is the least explored question. Nonetheless, the effect of varying authority and time-decay models are worth studying further. Varying time-decay models in particular may yield insights on how quickly the value of precedent depreciates (a question often raised by legal scholars [8,39]), as well as how much deference different courts accord to antiquated precedent.  Second, future work can explore a larger space of model parameterizations. An immediate extension would be using time-variant network growth rates. That is, both λ and μ may vary across time. Further, the feature weights α, β, and c may be adjusted to generate and also reflect different judicial attitudes towards assessing authority and relevance. It may be possible to learn these weights from empirical data, whether using exponential random graph models or machine learning techniques. This would provide a means of quantitatively measuring which factors most influence legal citation decisions, providing a common metric for comparing how these differ (or remain the same) across judges, time, and space.
Third, this study was limited by data availability. Despite growing literature in case law citations analysis, few publicly available edgelists can be linked to case-level (subject) metadata. For now, we have benchmarked our models against two empirical networks produced by apex Common Law courts. Future work can consider how closely these models approximate the citation mechanics of courts in other jurisdictions, particularly those of Civil Law jurisdictions where the doctrine of precedent theoretically holds less sway.
Fourth, while we have focused primarily on network structure, the microscopic properties of our proposed networks remain largely unexplored. Future work could use node-level metrics such as centrality and accessibility to study what kinds of cases and subjects are most likely to become entrenched in the core of such networks (e.g. [64]). Additionally, the task of replicating and studying CLCNs would also benefit from also exploiting the textual content of case judgments, as was done in [35]. Methods that exploit both network structure and node attributes for community detection (e.g. [65]) could be explored. This would connect our work to the growing literature on legal language processing (e.g. [66,67]).

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.