Fractality and Variability in Canonical and Non-Canonical English Fiction and in Non-Fictional Texts

This study investigates global properties of three categories of English text: canonical fiction, non-canonical fiction, and non-fictional texts. The central hypothesis of the study is that there are systematic differences with respect to structural design features between canonical and non-canonical fiction, and between fictional and non-fictional texts. To investigate these differences, we compiled a corpus containing texts of the three categories of interest, the Jena Corpus of Expository and Fictional Prose (JEFP Corpus). Two aspects of global structure are investigated, variability and self-similar (fractal) patterns, which reflect long-range correlations along texts. We use four types of basic observations, (i) the frequency of POS-tags per sentence, (ii) sentence length, (iii) lexical diversity, and (iv) the distribution of topic probabilities in segments of texts. These basic observations are grouped into two more general categories, (a) the lower-level properties (i) and (ii), which are observed at the level of the sentence (reflecting linguistic decoding), and (b) the higher-level properties (iii) and (iv), which are observed at the textual level (reflecting comprehension/integration). The observations for each property are transformed into series, which are analyzed in terms of variance and subjected to Multi-Fractal Detrended Fluctuation Analysis (MFDFA), giving rise to three statistics: (i) the degree of fractality (H), (ii) the degree of multifractality (D), i.e., the width of the fractal spectrum, and (iii) the degree of asymmetry (A) of the fractal spectrum. The statistics thus obtained are compared individually across text categories and jointly fed into a classification model (Support Vector Machine). Our results show that there are in fact differences between the three text categories of interest. In general, lower-level text properties are better discriminators than higher-level text properties. 
Canonical fictional texts differ from non-canonical ones primarily in terms of variability in lower-level text properties. Fractality seems to be a universal feature of text, slightly more pronounced in non-fictional than in fictional texts. Based on these corpus results, we point out some avenues for future research leading toward a more comprehensive analysis of textual aesthetics, e.g., using experimental methodologies.


BOX COUNTING
There are several methods to measure fractality and the scaling behavior of structures. These methods typically represent measurements at different scales. Fractal analysis techniques have been widely applied to images (Wendt and Abry, 2007; Li et al., 2009; Wendt et al., 2009; Ji et al., 2013), including artworks (Taylor, 2002; Redies et al., 2007; Spehar et al., 2016). They are therefore of special interest for analyzing aesthetic phenomena.
One of the most widely used fractal analysis methods is box counting, which is mathematically straightforward and easy to apply. Given an object S and a δ > 0, one finds $N_\delta(S)$, the smallest number of subsets of diameter at most δ needed to cover S. For 1d objects, the subsets are rulers and δ is their length. For 2d objects, the subsets are boxes and δ is their side length, and so forth. The growth rate of $N_\delta(S)$ as δ → 0 reflects the degree of fractality of S. If $N_\delta(S)$ can be approximated by

$$N_\delta(S) \approx c\,\delta^{-D_B}$$

for a constant c, then $D_B$ is called the box-counting dimension and indicates how complex S is. Mehri and Lashkari (2016) applied this method to seven famous text books and computed their degree of fractality by averaging the fractality degrees of word occurrences. The results revealed that all texts are fractal and that their fractal dimensions differ slightly. Fractality patterns of series sometimes do not lend themselves to analysis with a single scaling measure. If different subsets of a series exhibit different types of scaling behavior, the series is multifractal. Chatzigeorgiou et al. (2017) used box counting to find the origin of multifractality in the word-length representation of texts in several Western languages. They showed that the long-range correlations in natural language are related to the clustering of long words, i.e. rare and often highly informative content words.
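The box-counting estimate described above can be illustrated with a minimal sketch. It is not the procedure of Mehri and Lashkari (2016). It assumes that the object is a set of integer positions (for instance, token indices at which a word occurs), that boxes are intervals of a fixed grid rather than a minimal cover, and that the dimension is read off as the slope of log N against log(1/δ). The function names `box_counting_dimension` and `cantor_positions` are hypothetical and serve only this illustration.

```python
import numpy as np


def box_counting_dimension(positions, box_sizes):
    """Estimate D_B from N_delta ~ c * delta**(-D_B) for integer positions.

    Assumption: boxes are the grid intervals [k * delta, (k + 1) * delta),
    a common practical stand-in for the minimal cover.
    """
    positions = np.asarray(positions, dtype=np.int64)
    box_sizes = np.asarray(box_sizes, dtype=np.int64)
    # N_delta: number of distinct grid boxes that contain at least one position.
    counts = np.array([np.unique(positions // s).size for s in box_sizes])
    # The slope of log N_delta against log(1 / delta) estimates D_B.
    slope, _ = np.polyfit(np.log(1.0 / box_sizes), np.log(counts), 1)
    return slope, counts


def cantor_positions(level):
    """Integers below 3**level whose base-3 digits are all 0 or 2."""
    pos = np.array([0], dtype=np.int64)
    for _ in range(level):
        pos = np.concatenate([3 * pos, 3 * pos + 2])
    return pos


sizes = [3**k for k in range(6)]
d_cantor, n_boxes = box_counting_dimension(cantor_positions(6), sizes)
d_line, _ = box_counting_dimension(np.arange(3**6), sizes)
print(d_cantor, d_line)
```

For the discrete Cantor set the estimate comes out at log 2 / log 3 (about 0.63), whereas a fully occupied line gives 1. Integer positions and integer box sizes are used so that box membership does not depend on floating-point rounding.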

WAVELET-BASED METHODS
Fractal analysis methods based on wavelets are another family of techniques for studying scale-invariant properties of signals (Muzy et al., 1993; Wendt and Abry, 2007; Leonarduzzi et al., 2016). The wavelet transform (WT) is a method for analyzing non-stationary signals. The WT of a signal X is defined as (Mallat, 1999):

$$T_\psi[X](a, t_0) = \frac{1}{a} \int_{\mathbb{R}} X(t)\, \psi\!\left(\frac{t - t_0}{a}\right) dt,$$

and it describes the content of X around a time parameter $t_0$ and a scale parameter a. ψ is the analyzing wavelet, whose first n + 1 moments are zero, i.e. $\int_{\mathbb{R}} t^m \psi(t)\, dt = 0$ for $0 \le m \le n$, which makes the WT insensitive to possible polynomial trends of order n in the signal, something which is necessary for multifractal analysis (Muzy et al., 1994; Arneodo et al., 1995). The WT modulus maxima (WTMM) method is a well-known technique for analyzing multifractality and is based on the WT coefficients. WTMM is defined by the local maxima L(a) of $|T_\psi[X](a, t)|$ at a given scale a. Then the following partition function is defined:

$$Z(q, a) = \sum_{t \in L(a)} |T_\psi[X](a, t)|^q \sim a^{\tau(q)}.$$

If the signal is monofractal, τ(q) is a linear function of q, so that a single scaling exponent describes the signal. For multifractal signals, the scaling behavior cannot be explained with one value, so τ(q) depends nonlinearly on q. Based on WT and WTMM, other methods have been developed for discrete and multi-dimensional series (for example, see Wendt and Abry, 2007; Leonarduzzi et al., 2016). Although wavelet-based methods have been applied in a variety of fields, they have rarely been used in text processing. Leonarduzzi et al. (2017) applied the wavelet p-leader method to the sentence-length series of novels that were written either for young people or for adults. The authors showed that the latter category is more diverse in terms of its degree of multifractality.
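As an illustration of the ideas in this section, the following sketch computes a discretized wavelet transform and a simplified modulus-maxima partition function. It rests on several assumptions that are not taken from the cited works: the analyzing wavelet is the Mexican hat (whose moments of order 0 and 1 vanish), the transform is normalized by 1/a, and the partition function sums |T|^q over the local maxima at each scale without the maxima-line chaining and supremum of the full WTMM method. The names `mexican_hat`, `cwt`, `partition_tau` and the parameter `support` are hypothetical.

```python
import numpy as np


def mexican_hat(t):
    # (1 - t^2) * exp(-t^2 / 2): the moments of order 0 and 1 are zero.
    return (1.0 - t**2) * np.exp(-0.5 * t**2)


def cwt(x, scales, support=8):
    """Discretized T[X](a, t0) = (1/a) * sum_t X(t) * psi((t - t0) / a)."""
    x = np.asarray(x, dtype=float)
    coeffs = np.empty((len(scales), x.size))
    for i, a in enumerate(scales):
        half = int(np.ceil(support * a))  # kernel truncated at +/- support * a
        if 2 * half + 1 > x.size:
            raise ValueError("scale too large for the length of the series")
        k = np.arange(-half, half + 1)
        kernel = mexican_hat(k / a) / a
        # The kernel is symmetric, so convolution equals correlation here.
        coeffs[i] = np.convolve(x, kernel, mode="same")
    return coeffs


def partition_tau(x, scales, qs, support=8):
    """Slope of log Z(q, a) against log a, with Z summed over local maxima."""
    modulus = np.abs(cwt(x, scales, support))
    log_z = np.empty((len(qs), len(scales)))
    for j, a in enumerate(scales):
        edge = int(np.ceil(support * a))
        m = modulus[j, edge:-edge]  # drop coefficients affected by the borders
        is_max = (m[1:-1] > m[:-2]) & (m[1:-1] >= m[2:])
        maxima = m[1:-1][is_max]
        for i, q in enumerate(qs):
            log_z[i, j] = np.log(np.sum(maxima**q))
    log_a = np.log(np.asarray(scales, dtype=float))
    return np.array([np.polyfit(log_a, row, 1)[0] for row in log_z])


# A linear trend leaves the interior coefficients at (numerically) zero.
trend = 3.0 + 2.0 * np.arange(512)
print(np.max(np.abs(cwt(trend, [2, 4, 8])[:, 64:-64])))

# Scaling exponents for a random walk, a rough stand-in for a monofractal series.
rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(2048))
print(partition_tau(walk, [2, 4, 8, 16], [0, 1, 2, 3]))
```

Restricting q to non-negative values keeps the sum stable. For negative q, very small maxima dominate, which is the reason the full WTMM method takes a supremum along maxima lines.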

FRACTALITY AND CROSS-CORRELATION ANALYSIS
Fractal analysis can be extended to more than one series in order to find relations between the fractal behaviors of multiple series. Detrended Cross-Correlation Analysis (DCCA) (Podobnik and Stanley, 2008) and Multi-Fractal Detrended Cross-Correlation Analysis (MFDCCA) (Jiang and Zhou, 2011) are two methods for analyzing correlations between two series. Ghosh et al. (2019) applied MFDCCA, also known as MFDXA, to study correlations between two poems by Tagore, one written in Bengali and one in English. They found a nonlinear correlation between the poems. In a similar study, birdsong and human speech were compared by computing the mutual information decay of the signals, and it was concluded that the two vocal communication signals have similar dynamics (Sainburg et al., 2019).

SUPPLEMENTARY TABLES AND FIGURES

Table S3: Accuracy of classification (in %) using the mean value of the text properties for the non-fictional/fictional distinction (Task 1) and the canonical/non-canonical distinction (Task 2). Means ± SD are listed (N = 10). All values are significantly different (p ≤ 0.05) from random accuracy (50%), except where indicated by a †.

Text properties      Task 1        Task 2
High-Level           79.5 ± 2.0    57.2 ± 3.3
Low- & High-Level    96.7 ± 1.0    73.5 ± 1.7

Figure S1: Mean R² of the linear fits to the fluctuation function of the MFDFA method for different values of q and for different text properties in canonical (a), non-canonical (b) and non-fictional texts (c) in the corpus. The color coding is shown on the right-hand side.

Figure S2: The percentage of texts for which the degree of multifractality is significantly larger than the degree of multifractality of their surrogates (p < 0.05). The different text properties are indicated at the bottom. The colors of the bars represent the text categories, separately and all together, as shown at the top of the figure.