Does splitting make a sentence easier?

In this study, we focus on sentence splitting, a subfield of text simplification, motivated largely by an unproven idea that if you divide a sentence into pieces, it should become easier to understand. Our primary goal is to find out whether this is true. In particular, we ask: does it matter whether we break a sentence into two, three, or more pieces? We report on our findings based on Amazon Mechanical Turk. More specifically, we introduce a Bayesian modeling framework to investigate to what degree a particular way of splitting a complex sentence affects readability, along with a number of other parameters adopted from diverse perspectives, including clinical and cognitive linguistics. The Bayesian modeling experiment provides clear evidence that bisecting a sentence enhances readability to a degree greater than simplifications with more splits.


1. Introduction
In text simplification, one question people often fail to ask is whether the technology they are pursuing truly helps people better understand texts. This curious indifference may reflect a tacit recognition of the partiality of the datasets covered by past studies (Xu et al., 2015) or some murkiness that surrounds the goal of text simplification.
As a way to address the situation, we examine the role of simplification in text readability, with a particular focus on sentence splitting. The goal of sentence splitting is to break a sentence into small pieces in a way that collectively preserves the original meaning. A primary question we ask in this study is: does splitting a text affect readability? In the face of the large effort spent in the past on sentence splitting, it comes as a surprise that none of the studies put this question directly to people; in most cases, they ended up asking whether generated texts "looked simpler" than the original unmodified versions (Zhang and Lapata, 2017), which is a far cry from directly asking about their readability. We are not even sure whether there was any agreement among people on what constitutes simplification.
Another related question is how many pieces we should break a sentence into. Two, three, or more? In this study, we ask whether there is any difference in readability between two and up to five splits. We also report on how good or bad sentence splits generated by a fine-tuned language model are, compared with those produced by humans.
We took care in the experiments with humans (described later) to avoid using the word simple, whose interpretation may vary from person to person. Rather than asking people about simplicity, we asked them how easy texts were for them to read (details in Section 4.1).
A general strategy we follow in this study is to elicit judgments from people on whether simplification made a text more readable for them (Section 4.2) and to conduct a Bayesian analysis of their responses through multiple methods (logistic regression and decision tree) to identify factors that may have influenced their decisions (Section 4.3).

2. Related work
Historically, there have been extensive efforts in ESL (English as a Second Language) to explore the use of simplification as a way to improve the reading performance of L2 (second language) students. Crossley et al. (2014) presented an array of evidence showing that simplifying text did lead to improved text comprehension by L2 learners, as measured by reading time and the accuracy of their responses to associated questions. They also noticed that simple texts had less lexical diversity, greater word overlap, and greater semantic similarity among sentences than more complicated texts. Crossley et al. (2011) argued for the importance of cohesiveness as a factor influencing readability. Meanwhile, elaborative modification of text was found to play a role in enhancing readability; it involves adding information to make the language less ambiguous and rhetorically more explicit. Ross et al. (1991) reported that despite the fact that it made a text longer, the elaborative manipulation of a text produced positive results, with L2 students scoring higher on comprehension questions for modified texts than for the original unmodified versions.
Meanwhile, on another front, Mason and Kendall (1978) conducted experiments with 98 fourth graders and found that segmentation of text enabled poor readers to better respond to comprehension questions, especially when dealing with difficult passages, while it had no significant effect on advanced readers, demonstrating that it is low-ability readers who benefit the most from the manipulation. Rello et al. (2013) looked at how people with dyslexia respond to a particular reading environment where they had access to simpler lexical alternatives of words they encountered in a text, and found that it improved their scores on a comprehension test.
While there have been concerted efforts in the past in the NLP community to develop metrics and corpora purported to serve studies in simplification (Xu et al., 2015; Zhang and Lapata, 2017; Narayan et al., 2017; Botha et al., 2018; Sulem et al., 2018a; Niklaus et al., 2019; Kim et al., 2021), these efforts fell far short of addressing how they contribute to improving text comprehensibility. Part of our goal is to break away from a prevailing view that relegates readability to the sidelines.
The data for the present study are found at https://github.com/tnomoto/fewer_splits_are_better.
Elsewhere in NLP, some work showed how one might leverage text simplification to improve downstream tasks such as machine translation (Štajner and Popović, ; Sulem et al., ).
3. Procedure

We perform two rounds of experiments, one focusing on two- vs. three-sentence-long simplifications and the other on two- vs. four-or-more-sentence-long segmentations. The second study is largely a repeat of the first, except for the tasks we administered to humans. In what follows, we describe the first study. The second study appears in Section 5.

4. Study 1

4.1. Setup
For this part of the study, we look at two- vs. three-sentence-long simplifications and use two sources, the Split and Rephrase Benchmark (v1.0; SRB, henceforth; Narayan et al., 2017) and WikiSplit (Botha et al., 2018), to create tasks for humans.
SRB consists of complex sentences aligned with a set of multisentence simplifications varying in size from two to four.WikiSplit follows a similar format except that each complex sentence is accompanied only by a two-sentence simplification.We asked Amazon Mechanical Turk workers (Turkers, henceforth) to score simplifications on linguistic qualities and indicate whether they have any preference between two-sentence and three-sentence versions in terms of readability.
We randomly sampled a portion of SRB, creating test data (call it H), which consisted of triplets of the form ⟨S₀, A₀, B₀⟩, . . ., ⟨Sᵢ, Aᵢ, Bᵢ⟩, . . ., ⟨Sₘ, Aₘ, Bₘ⟩, where Sᵢ is a complex sentence, Aᵢ is a corresponding two-sentence simplification, and Bᵢ is a three-sentence version. While A alternates between versions created by BART and by humans, B deals only with manual simplifications (see Table 1 for a further explanation).

WikiSplit was created by drawing on Wikipedia edits, via a process that involves tracing the history of edits people made to sentences and identifying those that were split into two in a later edit. Its authors provided no information as to what prompted people to do so. In SRB, a split version was not created by breaking down a complex sentence but by stitching together texts occurring independently in WebNLG (from which SRB is sourced) so that their combined meaning roughly matches that of the complex sentence.
We further rearranged the component texts so that the flow of events they depict comes in line with that of the complex sentence. We emphasize that, in contrast to Li and Nenkova ( ), this study is not about identifying conditions under which people favor a split sentence.
We used WikiSplit, together with part of SRB, exclusively to fine-tune BART into a single-split (bipartite) simplification model, and SRB to develop the test data to be administered to humans for linguistic assessments. SRB was derived from WebNLG (Gardent et al., 2017) by making use of RDFs associated with textual snippets to assemble simplifications.

FIGURE 1
A screen capture of a HIT. This is what a Turker would be looking at when taking the test.
Separately, we extracted from WikiSplit and SRB another dataset B, consisting of complex sentences as source and two-sentence simplifications as target, i.e., B = {⟨S′₀, A′₀⟩, . . ., ⟨S′ₙ, A′ₙ⟩}, to use in fine-tuning a language model (BART-large). The fine-tuning was carried out using code available at GitHub.
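For concreteness, such a fine-tuning step can be sketched with the Hugging Face transformers library. This is an illustration under assumed settings, not the GitHub code referred to above; the column names "source" and "target" are ours.

```python
# A sketch of bipartite fine-tuning: complex sentence in, two-sentence
# simplification out (hyperparameters are illustrative).
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def encode(batch):
    # Tokenize the complex sentence and its two-sentence target.
    enc = tok(batch["source"], truncation=True, max_length=256)
    enc["labels"] = tok(batch["target"], truncation=True,
                        max_length=256)["input_ids"]
    return enc

# train_set: a datasets.Dataset built from B with "source"/"target" columns,
# e.g., train_set = train_set.map(encode, batched=True)
args = Seq2SeqTrainingArguments(output_dir="bart-split", num_train_epochs=3,
                                per_device_train_batch_size=8)
# Seq2SeqTrainer(model=model, args=args, train_dataset=train_set,
#                data_collator=DataCollatorForSeq2Seq(tok, model=model)).train()
```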
A task, or a HIT (Human Intelligence Task), asked Turkers to work on a three-part language quiz. The initial problem section introduced a worker to three short texts, corresponding to a triplet ⟨Sᵢ, Aᵢ, Bᵢ⟩; the second section asked about the linguistic qualities of Aᵢ and Bᵢ along three dimensions: meaning, grammar, and fluency; and in the third, we asked workers to answer two comparison questions (CQs): (1) whether Aᵢ and Bᵢ are more readable than Sᵢ, and (2) which of Aᵢ and Bᵢ is easier to understand.
Figure 1 gives a screen capture of the initial section of the task. Shown under Source is a complex sentence, or Sᵢ for some i. Text A and Text B correspond to Aᵢ and Bᵢ, which appear in a random order. Questions and choices are also displayed randomly. The questions we asked workers are shown in Table 2A. Notably, we avoided asking them about the simplicity of the alternative texts, as was done in previous studies. In total, there were 221 HITs (Table 1), each administered to seven people. All of the participants were self-reported native speakers of English with a college degree or above. Participation was limited to residents of the US, Canada, UK, Australia, and New Zealand.

4.2. Preliminary analysis
Table 2 gives a breakdown of responses to the comparison questions on two- and three-sentence-long texts. A question labeled ⟪S, BART-A⟫|q asks a Turker which of Source and BART-A he or she finds easier to understand, where BART-A is a BART-generated two-sentence simplification. We had 791 (113 × 7) responses, out of which 32% said they preferred Source, 67% liked BART-A better, and 1% replied that they were not sure. Another question, labeled ⟪S, HUM-A⟫|q, compares Source with HUM-A, a two-sentence-long simplification by a human. It got 756 (108 × 7) responses. The result is generally parallel to ⟪S, BART-A⟫|q: the majority of people favored a two-sentence simplification over a complex sentence. The fact that three-sentence versions were also favored over complex sentences suggests that breaking up a complex sentence works, regardless of how many pieces it is broken into. More people voted for bipartite over tripartite simplifications.

Tables 3A, B show average scores on fluency, grammar, and meaning retention of simplifications, comparing BART-A and HUM-B, on one hand, and HUM-A and HUM-B, on the other, on a scale of 1 (poor) to 5 (excellent). In either case, we did not see much divergence between A and B in grammar and meaning; it is in fluency that they diverged the most. A t-test found the divergence to be statistically significant. Two-sentence simplifications generally scored higher on fluency (over 4.0) than their three-sentence counterparts (below 4.0).
Table 3C gives Pearson correlations between the predictors and human responses on readability. We say more about this later.
Table 4 gives examples of BART-A and HUM-A/B. A general outline of the rest of the study is as follows. We turn the question of whether splitting enhances readability into a formal hypothesis that can be answered by statistical modeling. Part of that involves translating the relevant texts, i.e., HUM (BART)-A and HUM-B, separately into a vector of independent variables or features, and setting up a target variable, which we fill in with a worker's response to Q5 (Section 4.1), i.e., "Between A and B, which is easier to understand?" We include among the features a specific feature we call SPLIT, which keeps the count of sentences that make up a text and which takes on true or false depending on whether that count is equal to 2. Our plan is to prove or disprove the hypothesis by looking at how much impact SPLIT has on predicting the response a worker gave to Q5 in the AMT Evaluation Form (Table 2A).

4.3. The Bayesian perspective
We adopt a Bayesian approach to modeling the Turk data from Section 4.2. The choice reflects our desire to avoid overfitting to the data and to express uncertainty about the true values of model parameters, as the data we have do not come in large numbers (Study 1: 1,547; Study 2: 1,106). The decision was mainly motivated by our concern about the limited availability of data we had access to.

As Table 3 indicates, BART-A is generally comparable to HUM-A in the quality of its outputs, suggesting that what it generates is mostly indistinguishable from simplifications by humans.

4.3.1. Models
To identify potential factors that may have influenced Turkers' decisions, we build two types of Bayesian model, logistic regression and decision tree, both based on predictors assembled from the past literature on readability and related fields.

4.3.2. Logistic regression (LogReg)
We consider a regression of the following form:

Y ∼ Ber(λ),  λ = 1 / (1 + exp(−(β₀ + Σᵢ βᵢ Xᵢ)))    (1)

Ber(λ) is a Bernoulli distribution with a parameter λ. βᵢ represents a coefficient tied to a random variable (predictor) Xᵢ, where β₀ is an intercept. We assume that every βᵢ, including the intercept, follows a normal distribution with mean 0 and variance σᵢ. Y takes either 1 or 0: Y = 1 if the associated sentence (which the predictors represent) is liked (a preferred choice) and 0 if it is not.
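To make the generative story in Equation (1) concrete, the following toy snippet, our own illustration with hypothetical coefficients, draws a single preference judgment from the model:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0 = -0.2                                      # intercept (hypothetical)
beta = np.array([1.1, 0.4])                       # coefficients (hypothetical)
x = np.array([1.0, -0.5])                         # standardized predictor values
lam = 1.0 / (1.0 + np.exp(-(beta0 + beta @ x)))   # inverse-logit link
y = rng.binomial(1, lam)                          # Y ~ Ber(lambda); 1 = preferred
```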
Equally useful in explaining the relationships between potential causes and the outcome are Bayesian tree-based methods (Chipman et al., ; Linero, ; Nuti et al., ). They become a viable choice when extensive non-linearity exists between predictors and the outcome; we take up one of them, GMT, in the next section, while leaving the MCMC-based variants unexplored here.

4.3.3. Decision tree (GMT)
We work with Greedy Modal Tree (GMT), a recent invention by Nuti et al. (2021), which enables the construction of a (binary) decision tree that accommodates Bayesian uncertainty. Given a sequence of data points D = {0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5} and corresponding outcomes {1, 1, 1, 1, 0, 0, 0}, GMT looks for a midpoint between two successive numbers that creates a division of the label set that maximizes the probability of the target labels occurring.

GMT constructs a decision tree by recursively bisecting the data space along each dimension (or feature). At each step of the bisection, it looks at how much gain it gets in terms of the partition probability by splitting the space that way, and picks the most probable among all the possible partitions. More specifically, it carries out the bisection operation to seek a partition Π⋆ such that:

Π⋆ = argmax_w p(Π_w | D),  with p(Π_w | D) ∝ L(D | Π_w) p(Π_w),

where w indicates an index of a partition. GMT defines the likelihood function L by way of the Beta function. If we split D into {0, 0.25, 0.5, 0.75}₁ and {1.0, 1.25, 1.5}₂, the corresponding Ls GMT gives will be B(5, 1) and B(4, 1), respectively. Thus L(D | Π) ∝ B(5, 1) · B(4, 1). In GMT, the partition prior p(Π) is defined somewhat arbitrarily, as some uniform value determined by how deep the node is, how many features there are, etc.

The importance of a feature according to GMT is given as follows:

importance(r) = (1/M) Σ_{m=1}^{M} p(Π_{r,m} | D_m),    (5)

where M is the total count of nodes in the tree, m is an index referring to a particular node or partitioned data, and Π_{r,m} indicates a bisection under feature r. Equation (5) means that the importance of a feature is measured by a combined likelihood of the partitions it brings about while constructing the tree. Overall, GMT provides an easy way to incorporate Bayesian uncertainty into a decision tree without having to deal with costly operations such as MCMC.
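A minimal sketch of the bisection step, our own illustration rather than Nuti et al.'s code, scores candidate splits of the toy sequence D above by products of per-block Beta likelihoods; it recovers the B(5, 1) · B(4, 1) split from the worked example:

```python
from scipy.special import beta

def block_likelihood(labels):
    """Beta-function likelihood of a block: B(#ones + 1, #zeros + 1)."""
    ones = sum(labels)
    return beta(ones + 1, len(labels) - ones + 1)

def best_split(points, labels):
    """Scan midpoints between successive values; keep the most likely bisection."""
    best = None
    for i in range(1, len(points)):
        threshold = (points[i - 1] + points[i]) / 2
        score = block_likelihood(labels[:i]) * block_likelihood(labels[i:])
        if best is None or score > best[1]:
            best = (threshold, score)
    return best

D = [0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]
y = [1, 1, 1, 1, 0, 0, 0]
print(best_split(D, y))   # splits at 0.875: B(5, 1) * B(4, 1) = 0.2 * 0.25
```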

4.4. Predictors
We use the predictors shown in Table 5. They come in six categories: synthetic, cohesion, cognitive, classic, perception, and structural.

A synthetic feature indicates whether the simplification was created by BART, taking true if it was and false otherwise. Those found under cohesion are our adaptations of SYNSTRUT and CRFCWO, which are among the features created to measure cohesion across sentences (McNamara et al., 2014). SYNSTRUT gauges the uniformity and consistency across sentences by looking at their syntactic similarities, or by counting nodes in a common subgraph shared by neighboring sentences. We substituted SYNSTRUT with TREE EDIT DISTANCE (Boghrati et al., 2018), as it allows us to handle multiple subgraphs, in contrast to SYNSTRUT, which only looks for a single common subgraph. CRFCWO gives a normalized count of tokens found in common between two neighboring sentences. We emulate it here with the Szymkiewicz-Simpson coefficient, given as O(X, Y) = |X ∩ Y| / min(|X|, |Y|).

Predictors in the cognitive class are taken from works in clinical and cognitive linguistics (Roark et al., 2007; Boghrati et al., 2018). They reflect various approaches to measuring the cognitive complexity of a sentence. For example, YNGVE scoring defines the cognitive demand of a word as the number of non-terminals to its right in a derivation rule that is yet to be processed. The following are descriptions of the features we put to use.
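Before turning to the individual features, here is a minimal sketch of the Szymkiewicz-Simpson coefficient just mentioned (the tokenization is ours):

```python
def overlap_coefficient(x_tokens, y_tokens):
    """Szymkiewicz-Simpson coefficient: |X intersect Y| / min(|X|, |Y|)."""
    x, y = set(x_tokens), set(y_tokens)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))

print(overlap_coefficient("the cat sat".split(),
                          "the cat slept".split()))   # 2 / 3 = 0.666...
```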

4.4.1. YNGVE
Consider Figure 2A. YNGVE gives every edge in a parse tree a number reflecting its cognitive cost. NP gets "1" because it has a sister node VP to its right. The cognitive cost of a word is defined as the sum of the numbers on the path from the root to the word. In Figure 2A, "Vanya" would get 1 + 0 + 0 = 1, whereas "home" gets 0. Averaging the words' costs gives us the Yngve complexity.
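A rough sketch of the computation over an NLTK parse tree, our own reading of the scheme described above: each child is charged the number of sisters to its right, and a word's cost sums the charges on its root-to-leaf path.

```python
from nltk import Tree

def yngve_complexity(tree):
    costs = []
    def walk(node, cost):
        if isinstance(node, str):               # a word (leaf)
            costs.append(cost)
            return
        n = len(node)
        for i, child in enumerate(node):
            walk(child, cost + (n - 1 - i))     # sisters to the right of child
    walk(tree, 0)
    return sum(costs) / len(costs)              # average word cost

t = Tree.fromstring("(S (NP (N Vanya)) (VP (V walks) (NP (N home))))")
print(yngve_complexity(t))   # "Vanya" -> 1, "walks" -> 1, "home" -> 0
```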

4.4.2. FRAZIER
FRAZIER scoring views the syntactic depth of a word (the distance from a leaf to the first ancestor that occurs leftmost in a derivation rule) as the most important factor in determining sentence complexity. If we run FRAZIER on the sentence in Figure 2A, it will produce scores like the ones shown in Figure 2B: "Vanya" gets 1 + 1.5 = 2.5, "walks" 1, and "home" 0 (it has no leftmost ancestor). Roark et al. (2007) reported that both YNGVE and FRAZIER worked well in discriminating subjects with mild memory impairment.

4.4.3. SUBSET and SUBTREE

SUBSET and SUBTREE are both measures based on the idea of the Tree Kernel (Collins and Duffy, 2002; Moschitti, 2006; Chen et al., 2022). The former considers how many subgraphs two parses share, while the latter considers how many subtrees. Notably, subtrees are structures that end with terminal nodes. (In the kernel computation, C(a) denotes the number of children of node a, c_a^(i) represents the i-th child of a, and we let σ > 0.)
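The following is a compact sketch of the Collins and Duffy (2002) subset-tree kernel, our own illustration rather than necessarily the implementation used here; σ is the decay weight mentioned above.

```python
import math
from nltk import Tree

def production(node):
    """A node's production: its label plus its children's labels/tokens."""
    return (node.label(),
            tuple(c.label() if isinstance(c, Tree) else c for c in node))

def delta(n1, n2, sigma=1.0):
    """Weighted count of common fragments rooted at nodes n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if all(not isinstance(c, Tree) for c in n1):       # preterminal node
        return sigma
    return sigma * math.prod(
        1 + (delta(c1, c2, sigma) if isinstance(c1, Tree) else 0)
        for c1, c2 in zip(n1, n2))

def subset_tree_kernel(t1, t2, sigma=1.0):
    """Sum delta over all node pairs of the two parses."""
    return sum(delta(a, b, sigma)
               for a in t1.subtrees() for b in t2.subtrees())
```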

4.4.4. SPLIT

SPLIT is a structural feature indicating whether the text consists of exactly two sentences or extends beyond that: true if it consists of exactly two and false otherwise. We are interested in whether the specific number of sentences a simplification contains (i.e., 2) is in any way relevant to readability. We expect that how this comes out will have a direct impact on how we think about the best way to split a sentence for enhanced readability.

4.4.5. SAMSA

SAMSA is a recent addition to the battery of simplification metrics that have been put forward in the literature. It looks at how much of the propositional content in the source remains after a sentence is split (Sulem et al., 2018b); the greater, the better. There is another variant called SAMSA_ABL, which has the term penalizing for the length violation removed. We ignore that metric here, as we found it highly correlated with SAMSA (Pearson r > 0.80; p ≪ 0.001) on the datasets we worked with, which renders the attribute rather redundant.

4.4.6. Classic readability features
We also included features that have long been established as standard in the readability literature. They are Dale-Chall Readability, Flesch Reading Ease, and Flesch-Kincaid Grade Level (Kincaid et al., 1975; Flesch, 1979; Chall and Dale, 1995).
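For illustration, these three scores can be obtained with an off-the-shelf package such as textstat (a sketch; any comparable implementation would do):

```python
import textstat

text = "The cat sat on the mat. It was warm, and the cat was pleased."
print(textstat.dale_chall_readability_score(text))  # Dale-Chall Readability
print(textstat.flesch_reading_ease(text))           # Flesch Reading Ease
print(textstat.flesch_kincaid_grade(text))          # Flesch-Kincaid Grade Level
```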

4.4.7. Perceptual features
Those found in the perception category come from judgments Turkers made on the quality of the simplifications we asked them to evaluate. We did not provide any specific definition or instruction as to what constitutes grammaticality, meaning, or fluency during the task. One could argue that their responses were spontaneous and perceptual.
We standardized all of the features by turning them into z-scores, where z = (x − x̄)/σ.
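With the features assembled into a matrix, the standardization amounts to the following (a minimal sketch):

```python
import numpy as np

def standardize(features):
    """Column-wise z-scores: z = (x - mean) / std."""
    x = np.asarray(features, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```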

4.5. Evaluation (Study 1)

4.5.1. Setup
We set up the training data in the following way. For each HIT, we translated the associated A- and B-type simplifications separately into two data points of the form ⟨x, Y⟩, where x is an array of predictor values extracted from the relevant simplification, and Y is an indicator that specifies whether the text that x comes from is the preferred form of simplification. Y can be thought of as a single worker's response to ⟪A, B⟫|q on a specific HIT assignment. If a worker finds A easier than B, Y for x_A (the encoding of A) will be 1 and that for x_B 0; if the other way around, vice versa. The goal of a model is to predict what Y would be, given the predictors.
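Concretely, each worker's answer yields two complementary data points, along these lines (a hypothetical helper; the names are ours):

```python
def hit_to_rows(x_a, x_b, prefers_a):
    """Turn one worker's answer to <<A, B>> into two training rows:
    the preferred simplification gets Y = 1, the other Y = 0."""
    return [(x_a, 1 if prefers_a else 0),
            (x_b, 0 if prefers_a else 1)]
```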

4.5.2. Logistic regression (LogReg)
We trained the logistic regression (Equation 1) using BAMBI (Capretto et al., 2020; https://bambinos.github.io/bambi/main/index.html), with a burn-in of 50,000 while making 4,000 draws on four (Hamiltonian) MCMC chains. As a way to isolate the effect (or importance) of each predictor, we did two things: one was to look at the posterior distribution of each factor, i.e., the coefficient β tied to a predictor, and see how far it is removed from 0; the other was to conduct an ablation study in which we looked at how the absence of a feature affected the model's performance, measured with a metric known as the "Watanabe-Akaike Information Criterion" (WAIC) (Watanabe, 2010; Vehtari et al., 2016), a Bayesian incarnation of AIC (Burnham and Anderson, 2003).
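A minimal BAMBI setup along the lines just described might look like this. It is a sketch, not our exact code: the predictor names are abbreviated and the data are synthetic stand-ins so the snippet runs end to end.

```python
import arviz as az
import bambi as bmb
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["SPLIT", "FLUENCY", "MEANING"])
df["Y"] = rng.binomial(1, 0.5, size=200)            # synthetic preferences

model = bmb.Model("Y ~ SPLIT + FLUENCY + MEANING", df, family="bernoulli")
idata = model.fit(draws=4000, tune=50000, chains=4,       # burn-in of 50,000
                  idata_kwargs={"log_likelihood": True})   # needed for WAIC

print(az.summary(idata))   # posterior means, credible intervals, R-hat
print(az.waic(idata))      # WAIC, used in the ablation comparisons below
```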
In addition to WAIC, we worked with two measures to gauge the performance of the models we were building: root mean square error (RMSE) and accuracy (ACC). RMSE tells us the extent to which a predicted value diverges from the ground truth, and ACC is how often the model makes a correct binary prediction. ACC is based on the formula y⋆ = argmax_{c ∈ {A, B}} p(c | d), where d is a data point and c is a class, with "A" and "B" representing a bipartite and a tripartite construction, respectively. Now, Figure 3A shows what the posterior distributions of the parameters associated with the predictors looked like after 4,000 draw iterations with MCMC. None of the chains associated with the parameters exhibited divergence. We achieved R̂ between 1.0 and 1.02 for all βᵢ, a fairly solid stability (Gelman and Rubin, 1992), indicating that all the relevant parameters had successfully converged.
At first glance, it is a bit challenging to know what to make of Figure 3A, but a generally accepted rule of thumb is to regard distributions that center around 0 as less important in explaining observations than those that appear away from zero. If we go along with this rule, the most likely candidates to have affected readability are EASE, SUBSET, FK GRADE, GRAMMAR, MEANING, FLUENCY, SPLIT, and OVERLAP. What remains unclear is to what degree these predictors affected readability.
One good way to find this out is to perform an ablation study, a method that isolates the effect of an individual factor by examining how seriously its removal from a model degrades the model's performance. The result of the study is shown in Table 6. Each row represents the WAIC performance of a model with a particular predictor removed. Thus, "TED" in Table 6 represents a model that includes all the predictors in Table 5 except TED. The row in blue represents the full model, which had none of the features disabled. Appearing above the base model means that the removal of a feature had a positive effect, i.e., the feature is redundant; appearing below means that the removal had a negative effect, indicating that we should not forgo the feature. A feature becomes more relevant as we go down the table and less relevant as we go up. Thus, the most relevant is FLUENCY, followed by MEANING; the least relevant is SUBTREE, followed by DALE, and so forth. As Table 6 shows, the predictors we need to keep to explain readability are GRAMMAR, SPLIT, FK GRADE, EASE, MEANING, and FLUENCY (call them "select features"). Notably, BART is in the negative realm, meaning that from the perspective of readability, people did not care whether the simplification was carried out by a human or a machine. SAMSA was also found in the negative domain, implying that from the perspective of information, a two-sentence split carries just as much information as a three-way division of a sentence.

WAIC is given as follows: WAIC = Σ_{i=1}^{n} log E[p(yᵢ | θ)] − Σ_{i=1}^{n} V[log p(yᵢ | θ)], where E[p(yᵢ | θ)] represents the average likelihood under the posterior distribution of θ, and V[α] represents the sample variance of α, i.e., V[α] = 1/(S − 1) Σₛ (αₛ − ᾱ)², with αₛ a sample draw from p(α). A higher WAIC score indicates a better model; n is the number of data points. R̂ = the ratio of within- and between-chain variances, a standard tool to check for convergence (Lambert, ). The closer the ratio is to unity, the more likely the MCMC chains have converged.
To further nail down the extent to which they are important, we ran another ablation experiment involving the select features alone. The result is shown in Table 6. At the bottom is FLUENCY; second from the bottom is SPLIT, followed by MEANING, and so forth. As we go up the table, a feature becomes less and less important. The posterior distributions of these features are shown in Figure 3B. Not surprisingly, they are found away from zero, with FLUENCY the furthest away. The result indicates that, contrary to the popular wisdom that classic readability metrics such as EASE and FK GRADE are of little use, they had a large sway on people when they made decisions about readability. We found that the features had 1.0 ≤ R̂ ≤ 1.01, a near-perfect stability. The settings for MCMC, i.e., the number of burn-ins and that of draws, were the same as before.

4.5.3. Greedy modal tree (GMT)
The setup follows what was done with LogReg, working with the same binary class Y = {1, 0}, with the former indicating a preference for bisection over trisection and the latter the other way around. The testing was conducted using cross-validation, where we split the data into training and testing blocks in such a way as to keep the same split ratio as we had for LogReg. We postpone the rest of the review until Section 6, where we discuss multi-collinearity.
5. Study 2: going beyond trisection

5.1. Setup

In the second part of the study, we looked at whether the observation we made in Study 1 (bi- vs. tri-section) holds for cases involving four or more divisions. In particular, we asked people to compare a bisected sentence against simplifications more than three sentences long. The test data were constructed out of WebNLG (Gardent et al., 2017), giving us 158 HITs. A total of seven people were assigned to each task. They worked on a question like the one shown in Figure 4. As in Study 1, the task asked a Turker to respond to questions regarding three texts: a source sentence (Source), its two-sentence simplification (Text A), and another simplification four or five sentences long (Text B); the two lengths of B appeared an equal number of times across HITs (79 four-sentence-long Bs and 79 five-sentence-long Bs).
The participants came from the same regions as in the previous experiment (US, Canada, UK, Australia, and New Zealand) and self-reported being native speakers of English with an educational background above high school.

5.2. Method
We repeated what we had done in the previous study: we applied LogReg and GMT to the responses from Amazon Turkers, using the same set of predictors described in Section 4.4. Hyperparameters were kept unchanged. As in Study 1, our goal was to predict which of the two types of simplification, one consisting of two sentences and the other of four or more, humans prefer, given the predictors.
We report RMSEs of the models and which of the features they found the most important.

5.3. Evaluation
Table 7 shows the outcome of the study. An overwhelming majority went for two-sentence simplifications (HUM-A) over versions with more than three sentences. When pitted directly against four- or five-sentence-long simplifications, more than half of the participants preferred the shorter bipartite renditions (see the lower section of Table 7).
Table 8 shows the main results. Table 8C shows R̂ = 1.0, indicating steadfast stability for MCMC (number of draws: 4,000; burn-in: 20,000; number of chains: 4). In contrast to what we found in Study 1, SPLIT (highlighted in green) has fallen into the negative realm (above the baseline), suggesting that it is less relevant to predicting human preferences. Be that as it may, we consider this a spurious effect, due to the particular way the model is constructed, on two grounds: (1) it runs counter to what we know about SPLIT from Table 8B, namely that it is the feature most highly correlated with the dependent variable; and (2) we have findings from GMT which indicate a strong association of the feature with the target. We say more on this in the following section.
We also defer a discussion of the strengths of the predictors and the system performance of LogReg and GMT until after we usher in the idea of multi-collinearity in Section 6.

6. Multi-collinearity
Multi-collinearity occurs when independent variables (predictors) in a regression model are correlated with one another, making their true effects on the dependent variable amorphous and hard to interpret. Our goal in this section is to investigate whether, or how seriously, the data from Studies 1 and 2 are affected by multi-collinearity and to find out, if this is the case, what we can do to alleviate the issue. We introduce the idea of Variance Inflation Factors (VIFs; Frost, 2019). A VIF provides a way to measure to what extent a given predictor can be inferred from the rest of the predictors it accompanies, which together form a pool of independent variables intended to explain the dependent variable in a regression model. VIF is given by 1/(1 − R²), where R² is an R-squared value indicating the degree of variance in a predictor that can be explained by the other predictors via a regression. A high value means a high correlation. There is no formally grounded threshold on VIF beyond which we should be concerned; recommendations in the literature range from 2.5 to 10 (Frost, 2019). For this study, we set a cutoff at 5, dropping predictors with a VIF beyond 5, to the extent that the features we value, such as SPLIT, GRAMMAR, and FLUENCY, remain intact.

Table 9 gives VIF values for the predictors in the original pool (Table 9A) and for those we were left with after throwing away high-VIF features (Table 9B). The question is what impact this de-collinearizing operation has on performance as well as on the standing of the predictors. We find an answer in Table 10. Tables 10A, B give the results of an ablation analysis we conducted: we trained LogReg on the set of features listed in Table 9B, to the exclusion of a specific feature we were focusing on. Table 10A is for Study 1 and Table 10B for Study 2. We find, in either case, SPLIT among the features that belong to the positive realm, meaning that it is of relevance to explaining human responses on readability. Table 10C compares pre- vs. post-de-collinearization results. It looks at whether de-collinearizing had any effect on how LogReg and GMT perform in classification. While the results are somewhat mixed for RMSE, both models saw an increase in ACC across the board, confirming that de-collinearization works for GMT. Also of note is a large improvement in WAIC for LogReg (base): WAIC jumped from −1,901 to −949 in Study 1 and from −868 to −735 in Study 2. Furthermore, Table 10A strongly suggests that multi-collinearity is a major cause of the unexpected fall of SPLIT into the negative region in Table 8.
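The screening just described can be sketched with statsmodels as follows; it is our own illustration of the procedure (the protected feature names follow the text above), not the exact code used in the study.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, keep=("SPLIT", "GRAMMAR", "FLUENCY"), cutoff=5.0):
    """Iteratively drop the highest-VIF predictor above the cutoff,
    sparing the features we value."""
    cols = list(df.columns)
    while True:
        vifs = {c: variance_inflation_factor(df[cols].values, i)
                for i, c in enumerate(cols)}
        over = {c: v for c, v in vifs.items() if v > cutoff and c not in keep}
        if not over:
            return df[cols], vifs            # de-collinearized pool + VIFs
        cols.remove(max(over, key=over.get))  # worst offender goes first
```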
Figures 5A, B look at Study 1. They show a list of predictors ranked by partition probability before and after de-collinearization. Partition probabilities are the numbers determined by Equation (5), averaged over 28 cross-validation runs. We emphasize that while we see SPLIT come in third in Figure 5A, there is no practical difference between SPLIT and other closely ranked features such as TED, SAMSA, TNODES, and SUBTREE, whose partition probabilities are 0.062, 0.061, 0.061, and 0.060, respectively, whereas SPLIT got 0.061. In Figure 5B, the standings of the predictors are more clearly demarcated. We see SPLIT appear in the middle, implying that its contribution to classification is rather limited. Figures 5C, D deal with Study 2. Figure 5C gives a ranking before de-collinearization, and Figure 5D one after. We notice that SPLIT moved up the ladder from 13th, where it was before de-collinearization, to 2nd after.

Table 10C shows the models' performance in classification tasks. The number of folds for Study 1 was set to 28 and that for Study 2 to 21, so as to keep the size of the test data at ∼100. One thing that stands out in the results is that de-collinearization had a clear effect on ACC, pushing it a few notches up the scale across the board. Its effect on RMSE is somewhat mixed: it works for some setups (GMT/Study 1, LogReg/Study 2) but not for others (GMT/Study 2, LogReg/Study 1), suggesting that we should not equate RMSE with ACC. We see LogReg and GMT generally performing on par, except that GMT is visibly ahead of LogReg in Study 1, with or without de-collinearization.
While the impact of SPLIT on classification with GMT turned out not to be as clear-cut or as strong as with LogReg, we argue that its consistent appearance at the higher end of the rankings provides reasonable grounds for counting it among the factors that positively influence readability.

7. Conclusion
In this study, we asked two questions: does cutting up a sentence help the reader better understand the text? And if so, does it matter how many pieces we break it into? We found that splitting does allow the reader to better interact with the text (Table 2), and moreover, two-sentence simplifications are clearly favored over simplifications consisting of three sentences or more (Tables 2, 6, 7, and Figures 5B, D). As Table 7 has shown, increasing the divisions may not result in increased readability: people found sentences with four or five segments no better than those with no splits at all.
Why breaking a sentence in two makes it a better simplification is something of a mystery. A possible answer may lie in the disruption that splitting may cause to the sentence-level discourse structure, whose integrity, Crossley et al. (2011, 2014) argued, constitutes a critical part of simplification, a topic that we believe is worth further exploration in the future. Another avenue for future exploration is uncovering the relationship between the order in which splits are presented and readability. While it is hard to pin down what it is at the moment, there is a sense that placing splits in one particular order gives a more readable text than placing them in another.
We leave the study with one caveat. The cohort of people we solicited for the current study consists of generally well-educated adults who speak English as their first language. The results we found may therefore not necessarily hold for L2 learners, minors, or those who do not have a college-level education, nor do they extend beyond English.

TABLE 3
(A) Shows average scores and standard deviations of HUM-A and the corresponding HUM-B. HUM-A is more fluent than HUM-B. **p < 0.01. (B) Shows average scores and standard deviations of BART-A and the corresponding HUM-B. BART-A is significantly more fluent than HUM-B. We find in (C) Pearson correlations between Y and the predictors. Y is a dependent variable indicating whether the sentence is preferred over an alternative. A feature's ability to distinguish between HUM(BART)-A and HUM-B is thus orthogonal to its relationship with Y (e.g., GRAMMAR, MEANING).
TABLE 6
Results of the ablation study in WAIC (Study 1). p_waic (Spiegelhalter et al., 2002): a measure to estimate the complexity of the model; the greater, the more complex. d_waic = the distance in WAIC to the top model. se = standard error of WAIC estimates. dse = standard error of differences in WAIC estimates between the top model and each of the rest. ↑ means that higher is better. ↓ indicates the opposite. The best section gives WAICs for the select features. The full (base) model row is highlighted in blue (#D2DDFF).

FIGURE 3
(A) Posterior distributions of coefficients (β's) in the full model (Study 1). The further a distribution moves away from 0, the more relevant it becomes to predicting the outcome. (B) Posterior distributions of the coefficient parameters in the reduced model (Study 1).

FIGURE 4
An online work screen.

TABLE 1
(Study 1) A breakdown of H. 113 HITs involve a type-A text (bipartite split) generated by BART-large; 108 involve a type-A text created by humans. There were 221 of type B (three-sentence split), all of which were produced by humans.
TABLE 5 (fragment)
Structural: SPLIT — true if the text is two sentences long; false if it is longer (categorical). Informational: SAMSA — measures how much of the original content is preserved in the target (Sulem et al., 2018b) (scale).

TABLE 7
(Study 2) Comparison of two- vs. four- and five-sentence-long simplifications. The majority went for bipartite versions. HUM-B4: four-sentence-long simplification. HUM-B5: five-sentence-long simplification. The number indicates the number of votes supporting a particular choice.
We say data are de-collinearized if they are cleared of multi-collinearity-inducing predictors.

TABLE 9
VIFs (variance inflation factors) of the predictors. (A) Gives VIFs, each comparing one predictor with what is left of Table 5. (B) Gives the set of predictors we are left with after the removal of those that are correlated with the predictor pool. In particular, we removed features correlated with SPLIT so that its VIF stays below 5.