Predicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies

We present a hypothesis-driven study on the variation of melody phrases in a collection of Dutch folk songs. We investigate the variation of phrases within the folk songs through a pattern matching method which detects occurrences of these phrases within folk song variants, and ask: do the phrases which show less variation have different properties than those which show more? We hypothesize that theories on melody recall may predict variation, and accordingly investigate phrase length, the position and number of repetitions of a given phrase in the melody in which it occurs, as well as expectancy and motif repetivity. We show that all of these predictors account for the observed variation to a moderate degree, and that, as hypothesized, phrases vary less when they are short, contain highly expected melodic material, occur relatively early in the melody, and contain small pitch intervals. However, a large portion of the variance is left unexplained by the current model, which leads us to a discussion of future approaches to studying the memorability of melodies.


DETAILS ON DETECTING PHRASE OCCURRENCES
We use a combination of three pattern matching methods, which have been shown to agree best with human judgements of phrase occurrences (Janssen et al., Under Revision): city-block distance (Steinbeck, 1982), local alignment (Smith and Waterman, 1981) and structure induction (Meredith, 2006).

Music representations
For city-block distance and local alignment, melodies are represented as pitch sequences. Pitches (the perceived heights of the melody notes) are represented by integers derived from their MIDI note numbers. The notes in the pitch sequences are weighted by their duration, i.e., a given pitch is repeated depending on the length of the note. We represent a crotchet or quarter note by 16 pitch values, a quaver or eighth note by 8 pitch values, and so on. Note onsets of small duration units, especially triplets, may fall between these sampling points, which shifts their onset slightly in the representation. Structure induction uses (onset, pitch) pairs to represent notes in the melodies.
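As an illustration, the duration weighting described above can be sketched as follows; the notes in the example are hypothetical, and only the sampling resolution of 16 values per quarter note is taken from the text:

```python
SAMPLES_PER_QUARTER = 16  # resolution stated in the text: 16 values per crotchet

def duration_weighted_pitches(notes):
    """Expand (midi_pitch, duration_in_quarters) pairs into a pitch sequence
    in which each pitch is repeated proportionally to its duration."""
    sequence = []
    for pitch, quarters in notes:
        # Round to the nearest sampling point; onsets of small duration units
        # (e.g., triplets) may therefore shift slightly, as noted in the text.
        sequence.extend([pitch] * round(quarters * SAMPLES_PER_QUARTER))
    return sequence

# A quarter-note C5 followed by two eighth-note D5s:
seq = duration_weighted_pitches([(72, 1.0), (74, 0.5), (74, 0.5)])
# 16 samples of pitch 72 followed by 16 samples of pitch 74
```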
In order to deal with transposition differences in folk songs, Van Kranenburg et al. (2013) transpose melodies to the same key using pitch histogram intersection. We take a similar approach. For each melody, a pitch histogram is computed with MIDI note numbers as bins, the count of each note number weighted by its total duration in the melody. The pitch histogram intersection of two histograms h_s and h_t with shift σ is defined as

PHI(h_s, h_t, σ) = Σ_{k=1}^{r} min(h_{t,k+σ}, h_{s,k}),  (S1)

where k denotes the index of the bin, and r the total number of bins. We define a non-existing bin to have value zero. For each tune family, we randomly pick one melody, and for each other melody in the tune family we compute the σ that yields the maximum value of the histogram intersection, and transpose that melody by σ semitones. This process results in pitch-adjusted sequences.
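A minimal sketch of the histogram intersection and shift search, assuming histograms are stored as sparse mappings from MIDI note number to total duration (the function names and the search range are ours):

```python
from collections import Counter

def pitch_histogram(notes):
    """Duration-weighted pitch histogram: MIDI note number -> total duration."""
    h = Counter()
    for pitch, duration in notes:
        h[pitch] += duration
    return h

def histogram_intersection(h_s, h_t, shift):
    """Sum of bin-wise minima after shifting h_t by `shift` semitones;
    non-existing bins count as zero (Counter returns 0 for missing keys)."""
    return sum(min(h_s[k], h_t[k - shift]) for k in h_s)

def best_shift(h_s, h_t, search=range(-24, 25)):
    """Shift (in semitones) that maximizes the histogram intersection."""
    return max(search, key=lambda s: histogram_intersection(h_s, h_t, s))
```

For example, if melody t is melody s transposed up by two semitones, the maximizing shift is −2, i.e., t is transposed down by two semitones to match s.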
To deal with different notations of note durations, we perform a similar correction. Analogous to Equation S1, we define a duration histogram intersection DHI of two duration histograms h_s and h_t, of which the σ maximizing DHI is chosen as the designated shift.
This σ is then used to calculate the multiplier of the onsets of melody t in relation to melody s, before transforming the pitch and duration values of melody t into a duration-weighted pitch sequence:

Mult(t, s) = 2^σ  (S3)
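The duration correction can be sketched analogously, under the assumption that duration histogram bins are logarithmic (base 2) in some duration unit, so that a shift of σ bins corresponds to the multiplier 2^σ of Equation S3:

```python
import math
from collections import Counter

def duration_histogram(durations):
    """Histogram over log2 of note durations; with logarithmic bins, a shift
    of sigma bins corresponds to scaling every duration by 2**sigma.
    (The logarithmic binning is our assumption; durations are taken to be
    powers of two of some base unit.)"""
    h = Counter()
    for d in durations:
        h[round(math.log2(d))] += 1
    return h

def best_duration_shift(h_s, h_t, search=range(-8, 9)):
    """Shift maximizing the duration histogram intersection (Counter
    returns 0 for missing bins)."""
    return max(search, key=lambda s: sum(min(h_s[k], h_t[k - s]) for k in h_s))

def onset_multiplier(sigma):
    """Equation S3: Mult(t, s) = 2**sigma."""
    return 2.0 ** sigma
```

For instance, if melody t is notated in note values twice as long as melody s, the maximizing shift is −1 and the onset multiplier 0.5.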

City-block distance
For city-block distance, the query sequence q, with pitch values q i , is compared with every sequence p of the same length, with pitch values p i , from the melody being searched for matches. If many pitch values are identical, city-block distance is small.
From each melody, we choose the pitch sequences p which have the lowest city-block distance to the query sequence, and determine their position in the melody.
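This matching step can be sketched as follows; the normalization of the distance by sequence length is an assumption on our part:

```python
def city_block(q, p):
    """City-block (L1) distance between two equal-length pitch sequences.
    Normalization by sequence length is an assumption here."""
    return sum(abs(a - b) for a, b in zip(q, p)) / len(q)

def best_city_block_matches(query, melody):
    """Slide the query over the melody and return the lowest distance
    together with the positions of the windows that achieve it."""
    n = len(query)
    dists = [(city_block(query, melody[i:i + n]), i)
             for i in range(len(melody) - n + 1)]
    best = min(d for d, _ in dists)
    return best, [i for d, i in dists if d == best]
```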

Local alignment
To compute the optimal local alignment, a matrix A is filled recursively according to Equation S5. The matrix is initialized as A(i, 0) = 0, i ∈ {0, . . . , n}, and A(0, j) = 0, j ∈ {0, . . . , m}:

A(i, j) = max{ 0,
               A(i−1, j−1) + subs(q_i, s_j),
               A(i, j−1) + W_insertion,
               A(i−1, j) + W_deletion }  (S5)

W_insertion and W_deletion define the weights for inserting an element from melody s into segment q, and for deleting an element from segment q, respectively. subs(q_i, s_j) is the substitution function, which gives a weight depending on the similarity of the notes q_i and s_j.
We apply local alignment to pitch-adjusted sequences. In this representation, local alignment is not affected by transposition differences, and it should be robust with respect to time dilation. For the insertion and deletion weights, we use W_insertion = W_deletion = −0.5, and we define the substitution score as

subs(q_i, s_j) = 1 if q_i = s_j, and −1 otherwise.

We normalize the maximal alignment score by the number of notes n in the query segment to obtain the similarity of the found match with the query segment. The position of the pitch sequence associated with the maximal alignment score is determined through backtracing.
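The recursion, weights, and normalization described above can be sketched as follows; the +1/−1 substitution scores are an assumption, and backtracing for the match position is omitted:

```python
def local_alignment_similarity(q, s, w_indel=-0.5):
    """Smith-Waterman local alignment on pitch sequences with the weights
    given in the text; the substitution score is assumed to be +1 for a
    matching pitch and -1 otherwise. Returns the maximal alignment score
    normalized by the query length n."""
    n, m = len(q), len(s)
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subs = 1.0 if q[i - 1] == s[j - 1] else -1.0
            A[i][j] = max(0.0,
                          A[i - 1][j - 1] + subs,  # substitution / match
                          A[i][j - 1] + w_indel,   # insertion from s into q
                          A[i - 1][j] + w_indel)   # deletion from q
            best = max(best, A[i][j])
    return best / n
```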

Structure induction
Structure induction measures the difference between melodic segments through so-called translation vectors. The translation vector T between points in two melodic segments can be seen as the difference between the points q_i and s_j in (onset, pitch) space.
The maximally translatable pattern (MTP) of a translation vector T for two melodies q and s is then defined as the set of melody points q_i which can be transformed to melody points s_j with the translation vector T.
We use the pattern matching method SIAM, defining the similarity of two melodies as the size of the largest match achievable through translation with any vector, normalized by the length n of the query melody:

sim(q, s) = max_T |MTP(T, q, s)| / n

The maximally translatable patterns leading to the highest similarity are selected as matches, and their positions are determined by checking the onsets of the first and last note of the MTPs.
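A compact sketch of SIAM for this purpose: counting how often each translation vector occurs between query and melody points yields the size of the largest maximally translatable pattern (determining match positions is omitted):

```python
from collections import Counter

def siam_similarity(query, melody):
    """SIAM-style similarity: notes are (onset, pitch) pairs; for every pair
    of a query point and a melody point, record the translation vector that
    maps one onto the other. The most frequent vector gives the size of the
    largest maximally translatable pattern, normalized by the query length."""
    vectors = Counter((m_onset - q_onset, m_pitch - q_pitch)
                      for q_onset, q_pitch in query
                      for m_onset, m_pitch in melody)
    return max(vectors.values()) / len(query)
```

For example, a query that recurs in a melody shifted in both onset and pitch yields similarity 1.0, since a single translation vector maps every query point onto a melody point.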

Combination of the measures
The similarity thresholds which result in the best agreement between human annotations of phrase occurrences and algorithmically determined matches were found through optimization on the training corpus. For a given query phrase, all similarity measures were used to determine whether or not a match was found in a given melody. For city-block distance, matches with dist(q, p) ≤ 0.9792 are retained; for local alignment, matches with sim(q, s) ≥ 0.5508; and for structure induction, matches with sim(q, s) ≥ 0.5833. Only if at least two of the three similarity measures retained matches was the melody in question considered to contain an occurrence.
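The combination rule follows directly from the thresholds above (function and constant names are ours):

```python
# Thresholds optimized on the training corpus, as reported in the text.
CITY_BLOCK_MAX = 0.9792   # city-block distance: retain if dist <= threshold
LOCAL_ALIGN_MIN = 0.5508  # local alignment: retain if sim >= threshold
STRUCTURE_MIN = 0.5833    # structure induction: retain if sim >= threshold

def contains_occurrence(cb_dist, la_sim, si_sim):
    """A melody is considered to contain an occurrence of the query phrase
    if at least two of the three measures retain a match."""
    votes = sum([cb_dist <= CITY_BLOCK_MAX,
                 la_sim >= LOCAL_ALIGN_MIN,
                 si_sim >= STRUCTURE_MIN])
    return votes >= 2
```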

Pitch reversal
Pitch reversal is the linear combination of two other principles, registral direction and registral return. The principle of registral direction states that after large implicative intervals, a change of direction is more expected than a continuation of the direction. The principle is undefined for the tritone, or six semitones.

The other component of pitch reversal, registral return, states that if the realized interval has a different direction than the implicative interval, the sizes of the intervals are expected to be similar, i.e., they should not differ by more than two semitones. If the implicative interval describes a tone repetition, or if the difference between two consecutive pitch intervals of opposite direction is too large, registral return is zero; otherwise it has the value 1.5.
PitchRev(s_j) = PitchRev_dir(s_j) + PitchRev_ret(s_j)  (S13)

Figure S1, drawn after a figure by Schellenberg, shows a schematic overview of the different values pitch reversal can take under different conditions.
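The pitch-reversal computation can be sketched as follows. The component values are taken from Schellenberg's (1997) revised model as described above: +1/−1 for registral direction after a large implicative interval (seven or more semitones, zero after five or fewer), and 1.5 for registral return; intervals are signed semitone counts, and the function name is ours:

```python
def pitch_reversal(implicative, realized):
    """Pitch reversal as the sum of registral direction and registral return.
    Arguments are signed semitone counts; returns None for the tritone,
    for which the principle is undefined."""
    if abs(implicative) == 6:
        return None  # tritone: principle undefined
    direction_change = implicative * realized < 0
    # Registral direction: only large implicative intervals contribute.
    if abs(implicative) >= 7:
        rd = 1.0 if direction_change else -1.0
    else:
        rd = 0.0
    # Registral return: 1.5 if the realized interval changes direction and
    # differs in size by at most two semitones; zero after a tone repetition.
    rr = 0.0
    if implicative != 0 and direction_change and \
            abs(abs(implicative) - abs(realized)) <= 2:
        rr = 1.5
    return rd + rr
```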

Motif repetivity
In Figure S2 we show the second (also fourth) and sixth phrase of the example melody. FANTASTIC (Müllensiefen, 2009) represents the relationship between adjacent notes as follows: pitches can either stay the same (s1) or move up or down by a diatonic pitch interval (e.g., u4, d2). In this representation, it does not matter whether, e.g., a step down (d2) spans one or two semitones. Durations either stay equal (e), get quicker (q), or get longer (l).
The sixth phrase of the example melody (the second phrase shown in the figure) consists of seven M-Types, of which only one (u4l) appears twice. This leads to the entropy of two-note M-Types:

H(2) = −(2/7 log₂(2/7) + 5 · (1/7) log₂(1/7)) / log₂(7) = (0.52 + 5 · 0.40) / 2.81 ≈ 0.89  (S16)

For the longer M-Types there are no repetitions, hence the entropy is maximal at H(3, 4, 5, 6) = 1.0. This leads to a motif repetivity of

MR = −(0.89 + 4 · 1.0) / 5 ≈ −0.98.

Figure S1. A visualization of the pitch reversal principle as defined by Schellenberg (1997), drawn after his figure. The vertical axis represents the size of the implicative interval, from 0 to 11, and the horizontal axis the size of the realized interval, from 0 to 12, which can have either the same direction (right side of the panel) or a different direction (left side of the panel).

Figure S2. The second (also fourth) and sixth phrase of the example melody, with symbols (u3e, d2l, d3q, u4l, d4e, u4e, d2q, and so on) representing the pitch and duration relationships between adjacent notes. Notes can either stay at the same pitch (s1) or move up or down by a diatonic pitch interval (e.g., u4, d2). Durations can either be equal (e), quicker (q), or longer (l).
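The entropy computation above can be reproduced as follows; apart from u4l, the two-note M-Type symbols are placeholders, since only the count distribution (one type occurring twice, five occurring once) matters for the entropy:

```python
import math
from collections import Counter

def normalized_entropy(mtypes):
    """Entropy of the M-Type distribution of a phrase, normalized by the
    maximum possible entropy log2(n) for n M-Type tokens."""
    counts = Counter(mtypes)
    n = len(mtypes)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)

# Seven two-note M-Types, of which only u4l appears twice; the remaining
# symbols are placeholders for the five M-Types that occur once each.
h2 = normalized_entropy(['u3e', 'd2l', 'd3q', 'u4l', 'd4e', 'u4e', 'u4l'])
# Longer M-Types show no repetition, so their normalized entropy is 1.0;
# motif repetivity is minus the mean entropy over the five M-Type lengths.
mr = -(h2 + 4 * 1.0) / 5
```

With full precision, h2 is approximately 0.90 (0.89 with the rounded intermediate values used above) and mr approximately −0.98, in agreement with Equation S16.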