<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="brief-report" dtd-version="2.3">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2022.831684</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Brief Research Report</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Vocabulary Demands of Informal Spoken English Revisited: What Does It Take to Understand Movies, TV Programs, and Soap Operas?</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Ha</surname>
<given-names>Hung Tan</given-names>
</name>
<xref rid="c001" ref-type="corresp"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/1529496/overview"/>
</contrib>
</contrib-group>
<aff><institution>School of Foreign Languages, University of Economics Ho Chi Minh City (UEH)</institution>, <addr-line>Ho Chi Minh City</addr-line>, <country>Vietnam</country></aff>
<author-notes>
<fn id="fn0001" fn-type="edited-by"><p>Edited by: Mila Vulchanova, Norwegian University of Science and Technology, Norway</p></fn>
<fn id="fn0002" fn-type="edited-by"><p>Reviewed by: Rining Wei, Xi'an Jiaotong-Liverpool University, China; Joanna Kolak, University of Warsaw, Poland</p></fn>
<corresp id="c001">&#x002A;Correspondence: Hung Tan Ha, <email>hatanhung1991@gmail.com</email>, <ext-link ext-link-type="uri" xlink:href="https://orcid.org/0000-0002-5901-7718">orcid.org/0000-0002-5901-7718</ext-link>
</corresp>
<fn id="fn0003" fn-type="other">
<p>This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>21</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>831684</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>12</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>02</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Ha.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Ha</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The article presents a methodological update on the lexical profile of informal spoken English, with an emphasis on movies, television programs, and soap operas. The study analyzed Mark Davies&#x2019;s mega-corpora, containing approximately 625 million words in total, using Paul Nation&#x2019;s comprehensive and up-to-date British National Corpus/Corpus of Contemporary American English (BNC/COCA) wordlists. The analyses showed that viewers would need vocabulary knowledge at the 3,000 and 5,000 word frequency levels to understand 95 and 98% of the words in scripted dialogs, respectively. Soap operas were found to be less lexically demanding than TV programs and movies. The findings are expected to fill the methodological gap between vocabulary assessment and vocabulary profiling research.</p>
</abstract>
<kwd-group>
<kwd>lexical coverage</kwd>
<kwd>BNC</kwd>
<kwd>COCA</kwd>
<kwd>TV programs</kwd>
<kwd>movies</kwd>
<kwd>soap operas</kwd>
</kwd-group>
<counts>
<fig-count count="0"/>
<table-count count="4"/>
<equation-count count="0"/>
<ref-count count="45"/>
<page-count count="7"/>
<word-count count="5249"/>
</counts>
</article-meta>
</front>
<body>
<sec id="sec1" sec-type="intro">
<title>Introduction</title>
<p>Vocabulary is the most fundamental aspect of language, and its importance has been constantly reaffirmed (<xref ref-type="bibr" rid="ref27">Nation, 2013</xref>; <xref ref-type="bibr" rid="ref38">Webb, 2020</xref>). Together with the development of wordlists, such as the British National Corpus (BNC) lists (<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>) and the British National Corpus/Corpus of Contemporary American English (BNC/COCA) lists (<xref ref-type="bibr" rid="ref28">Nation, 2017</xref>), researchers in the field of vocabulary studies have continuously offered new perspectives on our vocabulary knowledge as well as the lexical resources we would need to comprehend different text genres (<xref ref-type="bibr" rid="ref32">Nurmukhamedov and Webb, 2019</xref>).</p>
<p>While vocabulary assessment and vocabulary profiling are arguably the two areas of vocabulary studies that receive the most attention, research in vocabulary testing appears to be moving so much faster that findings in vocabulary profiling are being left behind, creating a gap in research methodology. For example, while vocabulary tests have long employed <xref ref-type="bibr" rid="ref28">Nation&#x2019;s (2017)</xref> up-to-date BNC/COCA wordlist as the source of test items (<xref ref-type="bibr" rid="ref23">McLean and Kramer, 2015</xref>, <xref ref-type="bibr" rid="ref24">2016</xref>; <xref ref-type="bibr" rid="ref25">McLean et al., 2015</xref>; <xref ref-type="bibr" rid="ref44">Webb et al., 2017</xref>), many recent studies on the lexical coverage of texts still relied on <xref ref-type="bibr" rid="ref26">Nation&#x2019;s (2006)</xref> BNC wordlist (<xref ref-type="bibr" rid="ref1">Al-Surmi, 2014</xref>; <xref ref-type="bibr" rid="ref5">Dang and Webb, 2014</xref>; <xref ref-type="bibr" rid="ref40">Webb and Paribakht, 2015</xref>; <xref ref-type="bibr" rid="ref30">Nurmukhamedov, 2017</xref>; <xref ref-type="bibr" rid="ref36">Tegge, 2017</xref>). This has led to a situation where researchers who utilize these modern vocabulary tests cannot reliably relate their results to existing findings in the field. Attempts have been made to fill the methodological gap (<xref ref-type="bibr" rid="ref17">Hsu, 2018</xref>; <xref ref-type="bibr" rid="ref43">Yang and Coxhead, 2020</xref>; <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov, 2021</xref>); however, they are few, and certain areas of the field, including scripted and unscripted spoken discourse, remain uncovered. 
As a result, research on the relationship between phonological vocabulary knowledge and listening comprehension (<xref ref-type="bibr" rid="ref3">Cheng and Matthews, 2018</xref>; <xref ref-type="bibr" rid="ref21">Lange and Matthews, 2020</xref>; <xref ref-type="bibr" rid="ref14">Ha, 2021b</xref>) still has to rely on the findings of <xref ref-type="bibr" rid="ref41">Webb and Rodgers (2009a</xref>,<xref ref-type="bibr" rid="ref42">b)</xref>, which are more than 10&#x2009;years old and ripe for updating. In response to the pressing need for methodologically updated findings, the present study was conducted to revisit the vocabulary demands of informal spoken English.</p>
</sec>
<sec id="sec2">
<title>Literature Review</title>
<sec id="sec3">
<title>Receptive Vocabulary Knowledge and Listening Comprehension</title>
<p>For decades, vocabulary researchers have documented a strong link between receptive vocabulary knowledge and listening comprehension (<xref ref-type="bibr" rid="ref37">van Zeeland and Schmitt, 2013</xref>; <xref ref-type="bibr" rid="ref3">Cheng and Matthews, 2018</xref>; <xref ref-type="bibr" rid="ref21">Lange and Matthews, 2020</xref>; <xref ref-type="bibr" rid="ref14">Ha, 2021b</xref>). Among the most important outcomes of this work are the concepts of <italic>lexical demand</italic> and <italic>lexical coverage</italic>. In short, lexical demand refers to the proportion of words in a text a learner needs to know to adequately comprehend it. It is generally agreed that the minimum threshold for acceptable comprehension is 95% coverage, while 98% coverage is required for optimal text comprehension (<xref ref-type="bibr" rid="ref35">Schmitt et al., 2011</xref>; <xref ref-type="bibr" rid="ref37">van Zeeland and Schmitt, 2013</xref>). As <xref ref-type="bibr" rid="ref18">Hu and Nation (2000)</xref> explained, when learners know 95% of the running words in a text, they encounter one unfamiliar token in every 20 words; that ratio drops to one in every 50 when they know 98% of the tokens, a considerable difference for a 3% gap in coverage.</p>
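The coverage arithmetic above follows directly from the definition: at a coverage of <italic>c</italic>, the proportion of unknown tokens is 1 &#x2212; <italic>c</italic>, so one unknown token appears on average every 1/(1 &#x2212; <italic>c</italic>) running words. A minimal sketch (the function name is illustrative, not from the source):

```python
# Average number of running words per unknown token implied by a
# lexical coverage figure: at 95% coverage, 5% of tokens are unknown,
# i.e. one in every 20 words; at 98%, one in every 50.
def unfamiliar_token_interval(coverage: float) -> int:
    """Return the average gap (in running words) between unknown tokens."""
    return round(1 / (1 - coverage))

assert unfamiliar_token_interval(0.95) == 20
assert unfamiliar_token_interval(0.98) == 50
```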
</sec>
<sec id="sec4">
<title>Lexical Profile of Spoken English</title>
<p>A major advantage of lexical coverage studies is that they are often based on word frequency lists, which show teachers and learners the fastest route to their teaching or learning targets. These wordlists use the &#x201C;word family&#x201D; as the unit of word counting. A word family generally comprises a headword and all of its inflectional and derivational forms up to the Level 6 affix criteria (also known as WF6; <xref ref-type="bibr" rid="ref45">Bauer and Nation, 1993</xref>; <xref ref-type="bibr" rid="ref29">Nation, 2020</xref>). For example, the WF6 family for <italic>add</italic> in <xref ref-type="bibr" rid="ref28">Nation&#x2019;s (2017)</xref> BNC/COCA lists includes <italic>added, adding, addition, additional, additionality, additionally, additions, additive, additives, adds.</italic> The WF6 has been the foundation for several aspects of vocabulary studies (<xref ref-type="bibr" rid="ref32">Nurmukhamedov and Webb, 2019</xref>).</p>
<p>Past studies based on <xref ref-type="bibr" rid="ref26">Nation&#x2019;s (2006)</xref> British National Corpus (BNC) lists have produced comprehensive guidelines for English teachers and learners on what and how much to learn. For example, for informal spoken English, learners would need to know around 2,000&#x2013;3,000 word families to achieve 95% coverage and 5,000&#x2013;7,000 word families for 98% coverage (<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>; <xref ref-type="bibr" rid="ref41">Webb and Rodgers, 2009a</xref>,<xref ref-type="bibr" rid="ref42">b</xref>; <xref ref-type="bibr" rid="ref1">Al-Surmi, 2014</xref>; <xref ref-type="bibr" rid="ref36">Tegge, 2017</xref>). Academic spoken English, the kind encountered in TED talks, academic seminars, and university lectures, is somewhat more lexically demanding, generally requiring knowledge of the 4,000 and 8,000 most frequent word families in the BNC word list for 95 and 98% coverage, respectively (<xref ref-type="bibr" rid="ref4">Coxhead and Walls, 2012</xref>; <xref ref-type="bibr" rid="ref5">Dang and Webb, 2014</xref>; <xref ref-type="bibr" rid="ref30">Nurmukhamedov, 2017</xref>).</p>
</sec>
<sec id="sec5">
<title>Research Gap and the Present Study</title>
<p>Improvement demands change. As the English we use keeps developing, it is not surprising that the guidelines for vocabulary teaching and learning built on <xref ref-type="bibr" rid="ref26">Nation&#x2019;s (2006)</xref> BNC lists would eventually become obsolete and therefore require revisiting (<xref ref-type="bibr" rid="ref34">Schmitt et al., 2017</xref>). In an influential paper, <xref ref-type="bibr" rid="ref34">Schmitt et al. (2017)</xref> suggested two directions that future studies should take to replicate past lexical profile research. The first is to increase the sample size: past studies employed relatively small corpora, and their findings &#x201C;now need to be checked with larger, more comprehensive corpora&#x201D; (<xref ref-type="bibr" rid="ref34">Schmitt et al., 2017</xref>, p. 217). The second suggestion <xref ref-type="bibr" rid="ref34">Schmitt et al. (2017)</xref> put forward is to improve the research methodology. Despite being extremely helpful and informative, <xref ref-type="bibr" rid="ref26">Nation&#x2019;s (2006)</xref> BNC word list is now 15&#x2009;years old, contains primarily British English, and is &#x201C;due for updating and revision&#x201D; (<xref ref-type="bibr" rid="ref34">Schmitt et al., 2017</xref>, p. 218). In an attempt to create a better version of the BNC lists, Paul Nation introduced the British National Corpus/Corpus of Contemporary American English (BNC/COCA) wordlists in 2012, which were later updated in 2017. The BNC/COCA lists are highly regarded by researchers (<xref ref-type="bibr" rid="ref6">Dang and Webb, 2016</xref>; <xref ref-type="bibr" rid="ref7">Dang et al., 2020</xref>). As <xref ref-type="bibr" rid="ref34">Schmitt et al. (2017)</xref> stressed, &#x201C;Assuming the new combined BNC-COCA lists are a better indication of word frequency, then everything that has been done using the original BNC-based lists is ripe for replication using these new lists&#x201D; (<xref ref-type="bibr" rid="ref34">Schmitt et al., 2017</xref>, p. 218).</p>
<p>The present study was conducted in response to the call of <xref ref-type="bibr" rid="ref34">Schmitt et al. (2017)</xref> and aimed to revisit the vocabulary demands of informal spoken English. It is often assumed that the investigation of informal spoken English should involve real-life conversations (<xref ref-type="bibr" rid="ref22">Love et al., 2017</xref>). However, lexical research has demonstrated that examining scripted English yields similar results and is just as helpful (<xref ref-type="bibr" rid="ref41">Webb and Rodgers, 2009a</xref>,<xref ref-type="bibr" rid="ref42">b</xref>; <xref ref-type="bibr" rid="ref1">Al-Surmi, 2014</xref>; <xref ref-type="bibr" rid="ref11">Davies, 2021</xref>). To date, four studies have investigated the lexical profile of informal spoken English through soap operas (<xref ref-type="bibr" rid="ref1">Al-Surmi, 2014</xref>), podcasts (<xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov, 2021</xref>), TV programs (<xref ref-type="bibr" rid="ref41">Webb and Rodgers, 2009a</xref>), and movies (<xref ref-type="bibr" rid="ref42">Webb and Rodgers, 2009b</xref>). <xref ref-type="bibr" rid="ref36">Tegge (2017)</xref> is not counted because song lyrics do not always reflect real-life conversations. <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov (2021)</xref> employed up-to-date research methodology, so a replication of that study would be unnecessary. As a result, only three of these studies need to be revisited as <xref ref-type="bibr" rid="ref34">Schmitt et al. (2017)</xref> suggested. <xref rid="tab1" ref-type="table">Table 1</xref> shows the sample size, the wordlist used, and the key findings of these past studies.</p>
<table-wrap position="float" id="tab1">
<label>Table 1</label>
<caption><p>A summary of past studies on the vocabulary demands of soap operas, TV programs, and movies.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="middle">Corpus</th>
<th align="left" valign="middle">Word list used</th>
<th align="center" valign="middle">Number of episodes</th>
<th align="center" valign="middle">Number of words</th>
<th align="center" valign="middle">Findings</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Soap operas<break/>
<xref ref-type="bibr" rid="ref1">Al-Surmi (2014)</xref></td>
<td align="left" valign="top">BNC<break/>(<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>)</td>
<td align="center" valign="top">254</td>
<td align="center" valign="top">1,290,000</td>
<td align="center" valign="top">2,000 WFs&#x2014;95%<break/>5,000 WFs&#x2014;98%</td>
</tr>
<tr>
<td align="left" valign="top">TV programs<break/><xref ref-type="bibr" rid="ref41">Webb and Rodgers (2009a)</xref></td>
<td align="left" valign="top">BNC<break/>(<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>)</td>
<td align="center" valign="top">88</td>
<td align="center" valign="top">264,384</td>
<td align="center" valign="top">3,000 WFs&#x2014;95%<break/>7,000 WFs&#x2014;98%</td>
</tr>
<tr>
<td align="left" valign="top">Movies <xref ref-type="bibr" rid="ref42">Webb and Rodgers (2009b)</xref></td>
<td align="left" valign="top">BNC<break/>(<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>)</td>
<td align="center" valign="top">318</td>
<td align="center" valign="top">2,841,887</td>
<td align="center" valign="top">3,000 WFs&#x2014;95%<break/>6,000 WFs&#x2014;98%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>WF, Word Family.</p>
</table-wrap-foot>
</table-wrap>
<p>By employing the most comprehensive and up-to-date wordlist, the BNC/COCA, as well as the largest available corpus of scripted English (more details in the Methodology section), the present study seeks to answer a single research question:</p>
<p><italic>Would the previous findings concerning the lexical demands of TV programs, movies, and soap operas change when larger corpora and the BNC/COCA wordlist are applied?</italic></p>
</sec>
</sec>
<sec id="sec6" sec-type="methods">
<title>Methodology</title>
<sec id="sec7">
<title>Data Collection</title>
<p>The data analyzed in the present study were the TV (<xref ref-type="bibr" rid="ref9">Davies, 2019a</xref>), Movies (<xref ref-type="bibr" rid="ref10">Davies, 2019b</xref>), and SOAP (<xref ref-type="bibr" rid="ref8">Davies, 2012</xref>) corpora, which were officially purchased and used under an academic license provided by Mark Davies.<xref rid="fn0004" ref-type="fn"><sup>1</sup></xref> Together, the TV, Movies, and SOAP corpora arguably constitute the largest available corpus of informal spoken English, containing approximately 625 million tokens in total (<xref ref-type="bibr" rid="ref11">Davies, 2021</xref>). Information regarding the three corpora is presented in <xref rid="tab2" ref-type="table">Table 2</xref>.</p>
<table-wrap position="float" id="tab2">
<label>Table 2</label>
<caption><p>General information about the corpora (<xref ref-type="bibr" rid="ref8">Davies, 2012</xref>, <xref ref-type="bibr" rid="ref9">2019a</xref>,<xref ref-type="bibr" rid="ref10">b</xref>).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="middle">Corpus</th>
<th align="center" valign="middle">Period</th>
<th align="center" valign="middle">Number of episodes/scripts</th>
<th align="center" valign="middle">Number of words</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Soap operas</td>
<td align="center" valign="top">2001&#x2013;2012</td>
<td align="center" valign="top">22,000</td>
<td align="center" valign="top">100,783,900</td>
</tr>
<tr>
<td align="left" valign="top">TV programs</td>
<td align="center" valign="top">1950&#x2013;2018</td>
<td align="center" valign="top">75,000</td>
<td align="center" valign="top">326,201,276</td>
</tr>
<tr>
<td align="left" valign="top">Movies</td>
<td align="center" valign="top">1930&#x2013;2018</td>
<td align="center" valign="top">25,000</td>
<td align="center" valign="top">199,479,302</td>
</tr>
<tr>
<td align="left" valign="top">Total</td>
<td/>
<td align="center" valign="top">122,000</td>
<td align="center" valign="top">626,464,478</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec8">
<title>Data Analysis</title>
<p>Preliminary analysis revealed three major issues that had to be addressed before the purchased corpora were ready for further analysis. The first and most important was context-defining words. Davies&#x2019;s spoken corpora are flooded with words that &#x201C;represent the tone of style of speech&#x201D; (<xref ref-type="bibr" rid="ref11">Davies, 2021</xref>, p. 16) or give additional information about the context, all of which are surrounded by parentheses, for example, (<italic>Enginechuggingnoisily</italic>), (<italic>doorknocking</italic>), (<italic>treecracking</italic>), and (<italic>gunfire</italic>). These are &#x201C;non-speech&#x201D; words (<xref ref-type="bibr" rid="ref11">Davies, 2021</xref>, p. 16) and therefore should not be included in the analysis. The second problem was hyphenated words (<italic>second-hand, sky-high</italic>&#x2026;), which lexical profiling software cannot read. The third issue involved words accidentally stuck together (<italic>whatcouldpossiblybegoing, whatchancehasawomangot</italic>&#x2026;) and other typographical errors, which lexical profiling software falsely classified as &#x201C;Not in the lists.&#x201D;</p>
<p>The parentheses surrounding context-defining words were replaced with &#x201C;&#x003C;&#x201D; and &#x201C;&#x003E;&#x201D; so that Range could identify and automatically exclude these words through its &#x201C;ignore &#x2018;&#x003C; &#x003E;&#x2019;&#x201D; function. These words accounted for 592,690, 2,577,943, and 1,689,465 tokens in the SOAP, TV, and Movies corpora, respectively. Hyphens in hyphenated words were then replaced with spaces so that the component words could be classified at their own frequency levels. Finally, words classified as &#x201C;Not in the lists&#x201D; due to typos were corrected and returned to their frequency levels. These modifications were made using the mass search-and-replace function of Notepad++ (hotkeys: Ctrl + Shift + F).</p>
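The cleaning steps described above were carried out with Notepad++ search-and-replace; for readers who prefer a scripted pipeline, a minimal Python equivalent of the two systematic replacements (the function name and example strings are illustrative, not from the source):

```python
import re

def preprocess(script: str) -> str:
    """Sketch of the corpus-cleaning steps: re-delimit parenthesized
    context-defining words and split hyphenated words."""
    # "(gunfire)" -> "<gunfire>", so a profiler's "ignore '< >'"
    # option can exclude these non-speech words from the counts.
    script = re.sub(r"\(([^()]*)\)", r"<\1>", script)
    # "second-hand" -> "second hand", so each component word is
    # classified at its own frequency level.
    script = script.replace("-", " ")
    return script

print(preprocess("(doorknocking) I bought it second-hand."))
# -> <doorknocking> I bought it second hand.
```

The typo corrections, by contrast, had to be handled case by case and do not lend themselves to a single rule.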
<p>The corpora&#x2019;s lexical profile was then analyzed with Range (<xref ref-type="bibr" rid="ref16">Heatley et al., 2002</xref>), a computer program that classifies words into frequency levels according to the word lists it is used with. Range was chosen for the analysis due to the researcher&#x2019;s familiarity with it. In fact, an analysis with AntWordProfiler 1.5.1 (<xref ref-type="bibr" rid="ref2">Anthony, 2021</xref>) yielded near-identical results for the corpora, so the choice of program should not be a cause for concern. Range can automatically identify and read contractions (cannot, do not&#x2026;) and connected speech (wanna, gonna, kinda&#x2026;). For instance, Range counts <italic>cannot</italic> as <italic>can</italic> and <italic>not</italic>, and <italic>wanna</italic> as a family member of <italic>want.</italic></p>
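The classification logic that Range and AntWordProfiler apply can be pictured with a toy sketch: each token is assigned the level of the first 1,000-word family list that contains it. The lists below are tiny hypothetical stand-ins, not the real BNC/COCA data, and the family memberships shown (e.g. <italic>wanna</italic> under <italic>want</italic>) mirror the behavior described above:

```python
# Toy frequency-level classifier in the style of Range: each token gets
# the level of the first family list containing it; unknown tokens are
# reported as "Not in the lists". Lists here are illustrative only.
FAMILY_LISTS = {
    1: {"the", "i", "can", "not", "want", "wanna"},  # 1,000 level
    2: {"achieve", "lawyer"},                        # 2,000 level
}

def level_of(token):
    for level, families in FAMILY_LISTS.items():
        if token.lower() in families:
            return level
    return "Not in the lists"

assert level_of("wanna") == 1      # counted with the "want" family
assert level_of("Lawyer") == 2
assert level_of("zyzzyva") == "Not in the lists"
```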
<p>The up-to-date, comprehensive, 25-level BNC/COCA wordlists (<xref ref-type="bibr" rid="ref28">Nation, 2017</xref>) were used with Range for the analysis. The BNC/COCA wordlists contain twenty-five 1,000-word levels that reflect current British and American English. They are accompanied by four supplementary lists of proper nouns (<italic>Abraham, Portuguese, Waterloo&#x2026;</italic>), marginal words (<italic>hm, yee, phew&#x2026;</italic>), transparent compounds (<italic>racecar, railway, sailboat&#x2026;</italic>), and acronyms (<italic>PHD, NATO, MPHI&#x2026;</italic>; <xref ref-type="bibr" rid="ref29">Nation, 2020</xref>).</p>
</sec>
</sec>
<sec id="sec9" sec-type="results">
<title>Results</title>
<p><xref rid="tab3" ref-type="table">Table 3</xref> presents the number of words and their proportion at each frequency level in the BNC/COCA wordlist for the SOAP, TV, and Movies corpora. Proper nouns, marginal words, transparent compounds, and acronyms were treated as separate word levels.</p>
<table-wrap position="float" id="tab3">
<label>Table 3</label>
<caption><p>The number of tokens at each word level.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top" rowspan="2">Word list</th>
<th align="center" valign="top" colspan="2">Soap operas</th>
<th align="center" valign="top" colspan="2">TV programs</th>
<th align="center" valign="top" colspan="2">Movies</th>
</tr>
<tr>
<th align="char" valign="top" char=".">Token</th>
<th align="char" valign="top" char=".">Percentage</th>
<th align="char" valign="top" char=".">Token</th>
<th align="char" valign="top" char=".">Percentage</th>
<th align="char" valign="top" char=".">Token</th>
<th align="char" valign="top" char=".">Percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td align="char" valign="top" char=".">1,000</td>
<td align="char" valign="bottom" char=".">88,429,576</td>
<td align="char" valign="bottom" char=".">83.211</td>
<td align="char" valign="bottom" char=".">274,421,665</td>
<td align="char" valign="bottom" char=".">84.737</td>
<td align="char" valign="bottom" char=".">167,842,359</td>
<td align="char" valign="bottom" char=".">85.465</td>
</tr>
<tr>
<td align="char" valign="top" char=".">2,000</td>
<td align="char" valign="bottom" char=".">2,802,518</td>
<td align="char" valign="bottom" char=".">2.637</td>
<td align="char" valign="bottom" char=".">13,285,921</td>
<td align="char" valign="bottom" char=".">4.102</td>
<td align="char" valign="bottom" char=".">7,442,263</td>
<td align="char" valign="bottom" char=".">3.790</td>
</tr>
<tr>
<td align="char" valign="top" char=".">3,000</td>
<td align="char" valign="bottom" char=".">904,801</td>
<td align="char" valign="bottom" char=".">0.851</td>
<td align="char" valign="bottom" char=".">5,264,468</td>
<td align="char" valign="bottom" char=".">1.626</td>
<td align="char" valign="bottom" char=".">2,707,299</td>
<td align="char" valign="bottom" char=".">1.379</td>
</tr>
<tr>
<td align="char" valign="top" char=".">4,000</td>
<td align="char" valign="bottom" char=".">661,411</td>
<td align="char" valign="bottom" char=".">0.622</td>
<td align="char" valign="bottom" char=".">3,287,219</td>
<td align="char" valign="bottom" char=".">1.015</td>
<td align="char" valign="bottom" char=".">1,905,709</td>
<td align="char" valign="bottom" char=".">0.970</td>
</tr>
<tr>
<td align="char" valign="top" char=".">5,000</td>
<td align="char" valign="bottom" char=".">373,582</td>
<td align="char" valign="bottom" char=".">0.352</td>
<td align="char" valign="bottom" char=".">2,262,514</td>
<td align="char" valign="bottom" char=".">0.699</td>
<td align="char" valign="bottom" char=".">1,303,088</td>
<td align="char" valign="bottom" char=".">0.664</td>
</tr>
<tr>
<td align="char" valign="top" char=".">6,000</td>
<td align="char" valign="bottom" char=".">406,273</td>
<td align="char" valign="bottom" char=".">0.382</td>
<td align="char" valign="bottom" char=".">1,438,116</td>
<td align="char" valign="bottom" char=".">0.444</td>
<td align="char" valign="bottom" char=".">804,752</td>
<td align="char" valign="bottom" char=".">0.410</td>
</tr>
<tr>
<td align="char" valign="top" char=".">7,000</td>
<td align="char" valign="bottom" char=".">148,023</td>
<td align="char" valign="bottom" char=".">0.139</td>
<td align="char" valign="bottom" char=".">894,267</td>
<td align="char" valign="bottom" char=".">0.276</td>
<td align="char" valign="bottom" char=".">513,066</td>
<td align="char" valign="bottom" char=".">0.261</td>
</tr>
<tr>
<td align="char" valign="top" char=".">8,000</td>
<td align="char" valign="bottom" char=".">325,532</td>
<td align="char" valign="bottom" char=".">0.306</td>
<td align="char" valign="bottom" char=".">790,405</td>
<td align="char" valign="bottom" char=".">0.244</td>
<td align="char" valign="bottom" char=".">480,565</td>
<td align="char" valign="bottom" char=".">0.245</td>
</tr>
<tr>
<td align="char" valign="top" char=".">9,000</td>
<td align="char" valign="bottom" char=".">133,064</td>
<td align="char" valign="bottom" char=".">0.125</td>
<td align="char" valign="bottom" char=".">630,785</td>
<td align="char" valign="bottom" char=".">0.195</td>
<td align="char" valign="bottom" char=".">357,809</td>
<td align="char" valign="bottom" char=".">0.182</td>
</tr>
<tr>
<td align="char" valign="top" char=".">10,000</td>
<td align="char" valign="bottom" char=".">101,528</td>
<td align="char" valign="bottom" char=".">0.096</td>
<td align="char" valign="bottom" char=".">405,553</td>
<td align="char" valign="bottom" char=".">0.125</td>
<td align="char" valign="bottom" char=".">212,453</td>
<td align="char" valign="bottom" char=".">0.108</td>
</tr>
<tr>
<td align="char" valign="top" char=".">11,000</td>
<td align="char" valign="bottom" char=".">121,723</td>
<td align="char" valign="bottom" char=".">0.115</td>
<td align="char" valign="bottom" char=".">371,479</td>
<td align="char" valign="bottom" char=".">0.115</td>
<td align="char" valign="bottom" char=".">200,019</td>
<td align="char" valign="bottom" char=".">0.102</td>
</tr>
<tr>
<td align="char" valign="top" char=".">12,000</td>
<td align="char" valign="bottom" char=".">39,175</td>
<td align="char" valign="bottom" char=".">0.037</td>
<td align="char" valign="bottom" char=".">296,720</td>
<td align="char" valign="bottom" char=".">0.092</td>
<td align="char" valign="bottom" char=".">162,351</td>
<td align="char" valign="bottom" char=".">0.083</td>
</tr>
<tr>
<td align="char" valign="top" char=".">13,000</td>
<td align="char" valign="bottom" char=".">31,546</td>
<td align="char" valign="bottom" char=".">0.030</td>
<td align="char" valign="bottom" char=".">214,223</td>
<td align="char" valign="bottom" char=".">0.066</td>
<td align="char" valign="bottom" char=".">138,193</td>
<td align="char" valign="bottom" char=".">0.070</td>
</tr>
<tr>
<td align="char" valign="top" char=".">14,000</td>
<td align="char" valign="bottom" char=".">16,172</td>
<td align="char" valign="bottom" char=".">0.015</td>
<td align="char" valign="bottom" char=".">158,327</td>
<td align="char" valign="bottom" char=".">0.049</td>
<td align="char" valign="bottom" char=".">87,108</td>
<td align="char" valign="bottom" char=".">0.044</td>
</tr>
<tr>
<td align="char" valign="top" char=".">15,000</td>
<td align="char" valign="bottom" char=".">13,456</td>
<td align="char" valign="bottom" char=".">0.013</td>
<td align="char" valign="bottom" char=".">125,456</td>
<td align="char" valign="bottom" char=".">0.039</td>
<td align="char" valign="bottom" char=".">72,640</td>
<td align="char" valign="bottom" char=".">0.037</td>
</tr>
<tr>
<td align="char" valign="top" char=".">16,000</td>
<td align="char" valign="bottom" char=".">10,716</td>
<td align="char" valign="bottom" char=".">0.010</td>
<td align="char" valign="bottom" char=".">94,951</td>
<td align="char" valign="bottom" char=".">0.029</td>
<td align="char" valign="bottom" char=".">51,432</td>
<td align="char" valign="bottom" char=".">0.026</td>
</tr>
<tr>
<td align="char" valign="top" char=".">17,000</td>
<td align="char" valign="bottom" char=".">14,145</td>
<td align="char" valign="bottom" char=".">0.013</td>
<td align="char" valign="bottom" char=".">86,102</td>
<td align="char" valign="bottom" char=".">0.027</td>
<td align="char" valign="bottom" char=".">49,654</td>
<td align="char" valign="bottom" char=".">0.025</td>
</tr>
<tr>
<td align="char" valign="top" char=".">18,000</td>
<td align="char" valign="bottom" char=".">5,749</td>
<td align="char" valign="bottom" char=".">0.005</td>
<td align="char" valign="bottom" char=".">59,239</td>
<td align="char" valign="bottom" char=".">0.018</td>
<td align="char" valign="bottom" char=".">33,576</td>
<td align="char" valign="bottom" char=".">0.017</td>
</tr>
<tr>
<td align="char" valign="top" char=".">19,000</td>
<td align="char" valign="bottom" char=".">7,517</td>
<td align="char" valign="bottom" char=".">0.007</td>
<td align="char" valign="bottom" char=".">55,526</td>
<td align="char" valign="bottom" char=".">0.017</td>
<td align="char" valign="bottom" char=".">32,049</td>
<td align="char" valign="bottom" char=".">0.016</td>
</tr>
<tr>
<td align="char" valign="top" char=".">20,000</td>
<td align="char" valign="bottom" char=".">22,218</td>
<td align="char" valign="bottom" char=".">0.021</td>
<td align="char" valign="bottom" char=".">47,365</td>
<td align="char" valign="bottom" char=".">0.015</td>
<td align="char" valign="bottom" char=".">25,496</td>
<td align="char" valign="bottom" char=".">0.013</td>
</tr>
<tr>
<td align="char" valign="top" char=".">21,000</td>
<td align="char" valign="bottom" char=".">11,693</td>
<td align="char" valign="bottom" char=".">0.011</td>
<td align="char" valign="bottom" char=".">32,524</td>
<td align="char" valign="bottom" char=".">0.010</td>
<td align="char" valign="bottom" char=".">18,716</td>
<td align="char" valign="bottom" char=".">0.010</td>
</tr>
<tr>
<td align="char" valign="top" char=".">22,000</td>
<td align="char" valign="bottom" char=".">1,752</td>
<td align="char" valign="bottom" char=".">0.002</td>
<td align="char" valign="bottom" char=".">26,938</td>
<td align="char" valign="bottom" char=".">0.008</td>
<td align="char" valign="bottom" char=".">16,209</td>
<td align="char" valign="bottom" char=".">0.008</td>
</tr>
<tr>
<td align="char" valign="top" char=".">23,000</td>
<td align="char" valign="bottom" char=".">3,494</td>
<td align="char" valign="bottom" char=".">0.003</td>
<td align="char" valign="bottom" char=".">24,791</td>
<td align="char" valign="bottom" char=".">0.008</td>
<td align="char" valign="bottom" char=".">15,384</td>
<td align="char" valign="bottom" char=".">0.008</td>
</tr>
<tr>
<td align="char" valign="top" char=".">24,000</td>
<td align="char" valign="bottom" char=".">6,980</td>
<td align="char" valign="bottom" char=".">0.007</td>
<td align="char" valign="bottom" char=".">14,304</td>
<td align="char" valign="bottom" char=".">0.004</td>
<td align="char" valign="bottom" char=".">10,604</td>
<td align="char" valign="bottom" char=".">0.005</td>
</tr>
<tr>
<td align="char" valign="top" char=".">25,000</td>
<td align="char" valign="bottom" char=".">1,900</td>
<td align="char" valign="bottom" char=".">0.002</td>
<td align="char" valign="bottom" char=".">14,351</td>
<td align="char" valign="bottom" char=".">0.004</td>
<td align="char" valign="bottom" char=".">9,029</td>
<td align="char" valign="bottom" char=".">0.005</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Proper nouns</td>
<td align="char" valign="bottom" char=".">7,880,203</td>
<td align="char" valign="bottom" char=".">7.415</td>
<td align="char" valign="bottom" char=".">8,284,432</td>
<td align="char" valign="bottom" char=".">2.558</td>
<td align="char" valign="bottom" char=".">4,868,519</td>
<td align="char" valign="bottom" char=".">2.479</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Marginal words</td>
<td align="char" valign="bottom" char=".">2,882,002</td>
<td align="char" valign="bottom" char=".">2.712</td>
<td align="char" valign="bottom" char=".">7,942,157</td>
<td align="char" valign="bottom" char=".">2.452</td>
<td align="char" valign="bottom" char=".">4,998,170</td>
<td align="char" valign="bottom" char=".">2.545</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Transparent compounds</td>
<td align="char" valign="bottom" char=".">208,202</td>
<td align="char" valign="bottom" char=".">0.196</td>
<td align="char" valign="bottom" char=".">1,027,350</td>
<td align="char" valign="bottom" char=".">0.317</td>
<td align="char" valign="bottom" char=".">578,711</td>
<td align="char" valign="bottom" char=".">0.295</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Acronyms</td>
<td align="char" valign="bottom" char=".">597,920</td>
<td align="char" valign="bottom" char=".">0.563</td>
<td align="char" valign="bottom" char=".">1,849,428</td>
<td align="char" valign="bottom" char=".">0.571</td>
<td align="char" valign="bottom" char=".">1,181,463</td>
<td align="char" valign="bottom" char=".">0.602</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Not in the lists</td>
<td align="char" valign="top" char=".">108,330</td>
<td align="char" valign="top" char=".">0.102</td>
<td align="char" valign="top" char=".">445,612</td>
<td align="char" valign="top" char=".">0.138</td>
<td align="char" valign="top" char=".">269,403</td>
<td align="char" valign="top" char=".">0.137</td>
</tr>
<tr>
<td align="char" valign="top" char=".">Total</td>
<td align="char" valign="top" char=".">106,271,199</td>
<td align="char" valign="top" char=".">100</td>
<td align="char" valign="top" char=".">323,852,190</td>
<td align="char" valign="top" char=".">100</td>
<td align="char" valign="top" char=".">196,388,091</td>
<td align="char" valign="top" char=".">100</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Around 85% of the tokens in the three corpora were accounted for by the most frequent 1,000 word families in the BNC/COCA lists, and nearly 90% were covered by the first two 1,000-word levels. This may be because the first two 1,000-word levels in the BNC/COCA lists primarily contain words taken from spoken corpora (<xref ref-type="bibr" rid="ref29">Nation, 2020</xref>). The proportion of tokens decreased gradually as word frequency went down, and beyond the 5,000 level the figures dropped below 1% for all three corpora, underscoring the importance of high-frequency words.</p>
<p>Another detail that deserves attention is the proportion of proper nouns in the three corpora, which was considerable, especially for the SOAP corpus. Proper nouns normally do not cause significant difficulty for reading comprehension, since they are easily recognized by their initial capitalization. However, concerns have been raised about the effect of proper nouns on listening comprehension (<xref ref-type="bibr" rid="ref20">Kobeleva, 2012</xref>; <xref ref-type="bibr" rid="ref19">Klassen, 2021</xref>). In general, auxiliary words, namely proper nouns (PNs), marginal words (MWs), transparent compounds (TCs), and acronyms, accounted for approximately 11% of the SOAP corpus and nearly 6% of the TV and Movies corpora.</p>
<p>The cumulative coverage at each word frequency level is presented in <xref rid="tab4" ref-type="table">Table 4</xref>. At this stage, two assumptions were put forward: the first supposes that learners did not know and could not recognize PNs, MWs, TCs, and acronyms; the second assumes that learners knew or could easily recognize these words.</p>
<table-wrap position="float" id="tab4">
<label>Table 4</label>
<caption><p>Cumulative coverage with and without proper nouns, marginal words, transparent compounds, and acronyms.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="middle" rowspan="2">Word List</th>
<th align="center" valign="middle" colspan="2">Soap operas</th>
<th align="center" valign="middle" colspan="2">TV programs</th>
<th align="center" valign="middle" colspan="2">Movies</th>
</tr>
<tr>
<th align="char" valign="middle" char=".">Without</th>
<th align="char" valign="middle" char=".">With</th>
<th align="char" valign="middle" char=".">Without</th>
<th align="char" valign="middle" char=".">With</th>
<th align="char" valign="middle" char=".">Without</th>
<th align="char" valign="middle" char=".">With</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">1,000</td>
<td align="char" valign="bottom" char=".">83.211</td>
<td align="char" valign="bottom" char=".">94.097</td>
<td align="char" valign="bottom" char=".">84.737</td>
<td align="char" valign="bottom" char=".">90.635</td>
<td align="char" valign="bottom" char=".">85.465</td>
<td align="char" valign="bottom" char=".">91.385</td>
</tr>
<tr>
<td align="left" valign="top">2,000</td>
<td align="char" valign="bottom" char=".">85.848</td>
<td align="char" valign="bottom" char=".">96.734</td>
<td align="char" valign="bottom" char=".">88.839</td>
<td align="char" valign="bottom" char=".">94.738</td>
<td align="char" valign="bottom" char=".">89.254</td>
<td align="char" valign="bottom" char=".">95.175</td>
</tr>
<tr>
<td align="left" valign="top">3,000</td>
<td align="char" valign="bottom" char=".">86.700</td>
<td align="char" valign="bottom" char=".">97.585</td>
<td align="char" valign="bottom" char=".">90.465</td>
<td align="char" valign="bottom" char=".">96.364</td>
<td align="char" valign="bottom" char=".">90.633</td>
<td align="char" valign="bottom" char=".">96.553</td>
</tr>
<tr>
<td align="left" valign="top">4,000</td>
<td align="char" valign="bottom" char=".">87.322</td>
<td align="char" valign="bottom" char=".">98.208</td>
<td align="char" valign="bottom" char=".">91.480</td>
<td align="char" valign="bottom" char=".">97.379</td>
<td align="char" valign="bottom" char=".">91.603</td>
<td align="char" valign="bottom" char=".">97.523</td>
</tr>
<tr>
<td align="left" valign="top">5,000</td>
<td align="char" valign="bottom" char=".">87.674</td>
<td align="char" valign="bottom" char=".">98.559</td>
<td align="char" valign="bottom" char=".">92.178</td>
<td align="char" valign="bottom" char=".">98.077</td>
<td align="char" valign="bottom" char=".">92.267</td>
<td align="char" valign="bottom" char=".">98.187</td>
</tr>
<tr>
<td align="left" valign="top">6,000</td>
<td align="char" valign="bottom" char=".">88.056</td>
<td align="char" valign="bottom" char=".">98.942</td>
<td align="char" valign="bottom" char=".">92.622</td>
<td align="char" valign="bottom" char=".">98.521</td>
<td align="char" valign="bottom" char=".">92.676</td>
<td align="char" valign="bottom" char=".">98.597</td>
</tr>
<tr>
<td align="left" valign="top">7,000</td>
<td align="char" valign="bottom" char=".">88.195</td>
<td align="char" valign="bottom" char=".">99.081</td>
<td align="char" valign="bottom" char=".">92.899</td>
<td align="char" valign="bottom" char=".">98.797</td>
<td align="char" valign="bottom" char=".">92.938</td>
<td align="char" valign="bottom" char=".">98.858</td>
</tr>
<tr>
<td align="left" valign="top">8,000</td>
<td align="char" valign="bottom" char=".">88.502</td>
<td align="char" valign="bottom" char=".">99.387</td>
<td align="char" valign="bottom" char=".">93.143</td>
<td align="char" valign="bottom" char=".">99.041</td>
<td align="char" valign="bottom" char=".">93.182</td>
<td align="char" valign="bottom" char=".">99.103</td>
</tr>
<tr>
<td align="left" valign="top">9,000</td>
<td align="char" valign="bottom" char=".">88.627</td>
<td align="char" valign="bottom" char=".">99.512</td>
<td align="char" valign="bottom" char=".">93.337</td>
<td align="char" valign="bottom" char=".">99.236</td>
<td align="char" valign="bottom" char=".">93.365</td>
<td align="char" valign="bottom" char=".">99.285</td>
</tr>
<tr>
<td align="left" valign="top">10,000</td>
<td align="char" valign="bottom" char=".">88.722</td>
<td align="char" valign="bottom" char=".">99.608</td>
<td align="char" valign="bottom" char=".">93.463</td>
<td align="char" valign="bottom" char=".">99.361</td>
<td align="char" valign="bottom" char=".">93.473</td>
<td align="char" valign="bottom" char=".">99.393</td>
</tr>
<tr>
<td align="left" valign="top">11,000</td>
<td align="char" valign="bottom" char=".">88.837</td>
<td align="char" valign="bottom" char=".">99.723</td>
<td align="char" valign="bottom" char=".">93.577</td>
<td align="char" valign="bottom" char=".">99.476</td>
<td align="char" valign="bottom" char=".">93.575</td>
<td align="char" valign="bottom" char=".">99.495</td>
</tr>
<tr>
<td align="left" valign="top">12,000</td>
<td align="char" valign="bottom" char=".">88.874</td>
<td align="char" valign="bottom" char=".">99.759</td>
<td align="char" valign="bottom" char=".">93.669</td>
<td align="char" valign="bottom" char=".">99.568</td>
<td align="char" valign="bottom" char=".">93.657</td>
<td align="char" valign="bottom" char=".">99.578</td>
</tr>
<tr>
<td align="left" valign="top">13,000</td>
<td align="char" valign="bottom" char=".">88.903</td>
<td align="char" valign="bottom" char=".">99.789</td>
<td align="char" valign="bottom" char=".">93.735</td>
<td align="char" valign="bottom" char=".">99.634</td>
<td align="char" valign="bottom" char=".">93.728</td>
<td align="char" valign="bottom" char=".">99.648</td>
</tr>
<tr>
<td align="left" valign="top">14,000</td>
<td align="char" valign="bottom" char=".">88.919</td>
<td align="char" valign="bottom" char=".">99.804</td>
<td align="char" valign="bottom" char=".">93.784</td>
<td align="char" valign="bottom" char=".">99.683</td>
<td align="char" valign="bottom" char=".">93.772</td>
<td align="char" valign="bottom" char=".">99.692</td>
</tr>
<tr>
<td align="left" valign="top">15,000</td>
<td align="char" valign="bottom" char=".">88.931</td>
<td align="char" valign="bottom" char=".">99.817</td>
<td align="char" valign="bottom" char=".">93.823</td>
<td align="char" valign="bottom" char=".">99.722</td>
<td align="char" valign="bottom" char=".">93.809</td>
<td align="char" valign="bottom" char=".">99.729</td>
</tr>
<tr>
<td align="left" valign="top">16,000</td>
<td align="char" valign="bottom" char=".">88.941</td>
<td align="char" valign="bottom" char=".">99.827</td>
<td align="char" valign="bottom" char=".">93.852</td>
<td align="char" valign="bottom" char=".">99.751</td>
<td align="char" valign="bottom" char=".">93.835</td>
<td align="char" valign="bottom" char=".">99.756</td>
</tr>
<tr>
<td align="left" valign="top">17,000</td>
<td align="char" valign="bottom" char=".">88.955</td>
<td align="char" valign="bottom" char=".">99.840</td>
<td align="char" valign="bottom" char=".">93.879</td>
<td align="char" valign="bottom" char=".">99.777</td>
<td align="char" valign="bottom" char=".">93.860</td>
<td align="char" valign="bottom" char=".">99.781</td>
</tr>
<tr>
<td align="left" valign="top">18,000</td>
<td align="char" valign="bottom" char=".">88.960</td>
<td align="char" valign="bottom" char=".">99.846</td>
<td align="char" valign="bottom" char=".">93.897</td>
<td align="char" valign="bottom" char=".">99.796</td>
<td align="char" valign="bottom" char=".">93.878</td>
<td align="char" valign="bottom" char=".">99.798</td>
</tr>
<tr>
<td align="left" valign="top">19,000</td>
<td align="char" valign="bottom" char=".">88.967</td>
<td align="char" valign="bottom" char=".">99.853</td>
<td align="char" valign="bottom" char=".">93.914</td>
<td align="char" valign="bottom" char=".">99.813</td>
<td align="char" valign="bottom" char=".">93.894</td>
<td align="char" valign="bottom" char=".">99.814</td>
</tr>
<tr>
<td align="left" valign="top">20,000</td>
<td align="char" valign="bottom" char=".">88.988</td>
<td align="char" valign="bottom" char=".">99.874</td>
<td align="char" valign="bottom" char=".">93.929</td>
<td align="char" valign="bottom" char=".">99.828</td>
<td align="char" valign="bottom" char=".">93.907</td>
<td align="char" valign="bottom" char=".">99.827</td>
</tr>
<tr>
<td align="left" valign="top">21,000</td>
<td align="char" valign="bottom" char=".">88.999</td>
<td align="char" valign="bottom" char=".">99.885</td>
<td align="char" valign="bottom" char=".">93.939</td>
<td align="char" valign="bottom" char=".">99.838</td>
<td align="char" valign="bottom" char=".">93.916</td>
<td align="char" valign="bottom" char=".">99.837</td>
</tr>
<tr>
<td align="left" valign="top">22,000</td>
<td align="char" valign="bottom" char=".">89.001</td>
<td align="char" valign="bottom" char=".">99.886</td>
<td align="char" valign="bottom" char=".">93.947</td>
<td align="char" valign="bottom" char=".">99.846</td>
<td align="char" valign="bottom" char=".">93.925</td>
<td align="char" valign="bottom" char=".">99.845</td>
</tr>
<tr>
<td align="left" valign="top">23,000</td>
<td align="char" valign="bottom" char=".">89.004</td>
<td align="char" valign="bottom" char=".">99.890</td>
<td align="char" valign="bottom" char=".">93.955</td>
<td align="char" valign="bottom" char=".">99.854</td>
<td align="char" valign="bottom" char=".">93.932</td>
<td align="char" valign="bottom" char=".">99.853</td>
</tr>
<tr>
<td align="left" valign="top">24,000</td>
<td align="char" valign="bottom" char=".">89.011</td>
<td align="char" valign="bottom" char=".">99.896</td>
<td align="char" valign="bottom" char=".">93.959</td>
<td align="char" valign="bottom" char=".">99.858</td>
<td align="char" valign="bottom" char=".">93.938</td>
<td align="char" valign="bottom" char=".">99.858</td>
</tr>
<tr>
<td align="left" valign="top">25,000</td>
<td align="char" valign="bottom" char=".">89.012</td>
<td align="char" valign="bottom" char=".">99.898</td>
<td align="char" valign="bottom" char=".">93.964</td>
<td align="char" valign="bottom" char=".">99.862</td>
<td align="char" valign="bottom" char=".">93.942</td>
<td align="char" valign="bottom" char=".">99.863</td>
</tr>
<tr>
<td align="left" valign="top">Total</td>
<td align="char" valign="top" char="." colspan="2">106,271,199</td>
<td align="char" valign="top" char="." colspan="2">323,852,190</td>
<td align="char" valign="top" char="." colspan="2">196,388,091</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Clearly, without knowledge of PNs, MWs, TCs, and acronyms, viewers could not reach even the minimum coverage threshold for comprehension, which is worrying. However, when PNs, MWs, TCs, and acronyms were assumed to be known, the optimistic scenario depicted in <xref ref-type="bibr" rid="ref41">Webb and Rodgers (2009a</xref>,<xref ref-type="bibr" rid="ref42">b)</xref>, <xref ref-type="bibr" rid="ref36">Tegge (2017)</xref>, and <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov (2021)</xref> re-emerged. Generally, the 2,000&#x2013;3,000 most frequent word families in <xref ref-type="bibr" rid="ref28">Nation&#x2019;s (2017)</xref> BNC/COCA lists were enough to cover 95% of the tokens in the corpora, and knowledge of the most frequent 4,000&#x2013;5,000 word families was required to reach the optimal threshold for comprehension.</p>
<p>Certain differences can be observed between the corpora. To be more specific, soap operas demanded the least lexical knowledge of the three genres for both 95% (2,000 word families) and 98% (4,000 word families) coverage. TV programs and movies shared similar lexical demands at the 98% threshold. However, the analysis showed that the Movies corpus reached 95% coverage at the 2,000-word level, whereas the TV corpus required knowledge of the 3,000-word level for 95% coverage. Still, it is worth noting that the actual difference was very small, approximately 0.2%, and tended to shrink at lower frequency levels. It is therefore safe to state that movies and TV programs shared relatively similar vocabulary demands.</p>
</sec>
<sec id="sec10" sec-type="discussions">
<title>Discussion</title>
<p>The paper revisited research on the vocabulary profiles of soap operas (<xref ref-type="bibr" rid="ref1">Al-Surmi, 2014</xref>), TV programs, and movies (<xref ref-type="bibr" rid="ref41">Webb and Rodgers, 2009a</xref>,<xref ref-type="bibr" rid="ref42">b</xref>) to see whether changes in sample size and research methodology would lead to changes in findings. The analyses showed that if the BNC/COCA wordlist were used as the indicator of word frequency, the lexical demands of the text genres examined would generally be reduced. To be more specific, learners basing their vocabulary learning on the BNC/COCA lists would only need to learn 4,000 (instead of 5,000) word families to understand 98% of the words in soap operas, and 5,000 (instead of 6,000 and 7,000) word families for movies and TV programs, respectively. It should be noted that a difference of 1,000&#x2013;2,000 word families can translate into 2&#x2013;4&#x2009;years of English learning or even more (<xref ref-type="bibr" rid="ref39">Webb and Chang, 2012</xref>; <xref ref-type="bibr" rid="ref33">Ozturk, 2016</xref>).</p>
<p>However, the so-called &#x201C;reduced lexical demands&#x201D; are, in practice, the additional effect of the four supplementary lists (PNs, MWs, TCs, and acronyms). Comparing the cumulative coverage at the 3,000 and 5,000 levels reveals similar figures between BNC/COCA-based and BNC-based studies. And if we were to go further and add the proportions of the four supplementary lists found in this study to the cumulative coverage figures at any word level in <xref ref-type="bibr" rid="ref41">Webb and Rodgers (2009a</xref>,<xref ref-type="bibr" rid="ref42">b)</xref> and <xref ref-type="bibr" rid="ref1">Al-Surmi (2014)</xref>, the same reduced lexical demands (or even lower ones) would be recorded. Still, this does not mean that the BNC lists could simply be combined with the four supplementary lists from the BNC/COCA. <xref ref-type="bibr" rid="ref29">Nation (2020)</xref> introduced the BNC/COCA wordlist with clear rationales that have been corroborated by other researchers (<xref ref-type="bibr" rid="ref6">Dang and Webb, 2016</xref>; <xref ref-type="bibr" rid="ref7">Dang et al., 2020</xref>): these lists were designed to work together, and scholars should avoid mixing lists from different sources, a questionable practice that invites unnecessary problems.</p>
<p>The shift in lexical profiling studies from the BNC (<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>) to the BNC/COCA (<xref ref-type="bibr" rid="ref28">Nation, 2017</xref>) could be said to be inevitable, as it harmonizes different strands of vocabulary research. Since most vocabulary tests now draw their items from the BNC/COCA (<xref ref-type="bibr" rid="ref23">McLean and Kramer, 2015</xref>; <xref ref-type="bibr" rid="ref25">McLean et al., 2015</xref>; <xref ref-type="bibr" rid="ref44">Webb et al., 2017</xref>; <xref ref-type="bibr" rid="ref13">Ha, 2021a</xref>), it would be methodologically inconsistent to relate students&#x2019; results on a vocabulary test based on the BNC/COCA wordlist to the findings of lexical studies that employed the BNC lists. Together with the development of phonological vocabulary tests (<xref ref-type="bibr" rid="ref25">McLean et al., 2015</xref>; <xref ref-type="bibr" rid="ref13">Ha, 2021a</xref>), teachers and vocabulary researchers can now make a reliable connection between students&#x2019; aural vocabulary knowledge and what they can be expected to understand in the real world.</p>
<p>The study&#x2019;s findings were also in line with <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov (2021)</xref>, who employed <xref ref-type="bibr" rid="ref28">Nation&#x2019;s (2017)</xref> BNC/COCA lists to investigate the lexical profile of English podcasts. <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov (2021)</xref> found that vocabulary knowledge at the 3,000 and 5,000 levels would cover 96.75 and 98.26% of the words in English podcasts, respectively. These results suggest that learning the 5,000 most frequent word families in the BNC/COCA wordlist, an attainable learning goal for English learners in most contexts, could help learners achieve unsupported listening comprehension of informal spoken English. This claim is supported by <xref ref-type="bibr" rid="ref37">van Zeeland and Schmitt&#x2019;s (2013)</xref> study, which showed that knowing 98% of the running words in a listening text results in a very high degree of listening comprehension.</p>
<p>The study recorded considerable proportions of PNs, MWs, TCs, and acronyms, which aligned well with <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov (2021)</xref>. Vocabulary researchers tend to assume that these words are easily recognized by learners (<xref ref-type="bibr" rid="ref26">Nation, 2006</xref>; <xref ref-type="bibr" rid="ref41">Webb and Rodgers, 2009a</xref>,<xref ref-type="bibr" rid="ref42">b</xref>; <xref ref-type="bibr" rid="ref31">Nurmukhamedov and Sharakhimov, 2021</xref>). Although this assumption may be acceptable for reading, concerns have been raised about whether it is appropriate to extend it to listening (<xref ref-type="bibr" rid="ref20">Kobeleva, 2012</xref>; <xref ref-type="bibr" rid="ref19">Klassen, 2021</xref>). It is true that without the support of orthographic form, proper nouns and even acronyms are difficult to distinguish, and for listening-only formats such as podcasts, such concerns are well founded. However, television programs, movies, and soap operas, which are strongly supported by visual input, differ from podcasts. Non-verbal clues, such as facial expressions, body gestures, and lip movements, are considered to be of significant support to the processing of aural input (<xref ref-type="bibr" rid="ref15">Harris, 2003</xref>). Such visual clues may help viewers recognize and understand PNs, MWs, TCs, and acronyms.</p>
</sec>
<sec id="sec11" sec-type="conclusions">
<title>Conclusion</title>
<p>This study&#x2019;s findings offer updates on the lexical profiles of informal spoken English by employing up-to-date research methodology and a large sample size. In general, it is evident that the BNC/COCA wordlist (<xref ref-type="bibr" rid="ref28">Nation, 2017</xref>) would give English learners and teachers a shorter route to their intended goals. This BNC/COCA-based update also connects research findings on vocabulary profiling and vocabulary assessment, which have been difficult to reconcile for several years owing to incompatible methodologies.</p>
<p>Despite being informative, this brief research report bears certain limitations. First, the paper revisited the findings of several studies at once, giving readers a broad overview of the new findings and how they relate to one another. It was therefore not possible for the researcher to go deeper and explore the variation in lexical coverage among texts. Future research should examine each corpus in isolation and investigate the variation in lexical demands within each text genre. Second, although the study showed the number of word families learners would need to achieve 95 and 98% coverage of informal spoken English, it cannot guarantee that learners would successfully comprehend a text even when these requirements are satisfied. <xref ref-type="bibr" rid="ref12">Graham (2006)</xref> showed that people may understand every single word in a text and still fail to grasp its general meaning, which could be due to other factors such as the learners&#x2019; language proficiency and metacognitive awareness. Researchers are therefore encouraged to investigate these issues, possibly by replicating <xref ref-type="bibr" rid="ref37">van Zeeland and Schmitt&#x2019;s (2013)</xref> research methodology.</p>
</sec>
<sec id="sec12" sec-type="data-availability">
<title>Data Availability Statement</title>
<p>The data analyzed in this study is subject to the following licenses/restrictions: The corpora that support the findings of this study are available from Mark Davies. Restrictions apply to the availability of these corpora, which were used under academic license for this study. Data are available from <ext-link xlink:href="https://www.english-corpora.org/" ext-link-type="uri">https://www.english-corpora.org/</ext-link> with the permission of Mark Davies. Requests to access these datasets should be directed to <email>mark.davies@corpusdata.org</email>.</p>
</sec>
<sec id="sec13">
<title>Author Contributions</title>
<p>The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of Interest</title>
<p>The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="sec15" sec-type="disclaimer">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="ref1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Al-Surmi</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). &#x201C;<article-title>TV shows, word coverage, and incidental vocabulary learning</article-title>,&#x201D; in <source>Teaching and Learning English in the Arabic-Speaking World</source>. eds. <person-group person-group-type="editor"><name><surname>Bailey</surname> <given-names>K.</given-names></name> <name><surname>Damerow</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>London</publisher-loc>: <publisher-name>Routledge</publisher-name>), <fpage>132</fpage>&#x2013;<lpage>147</lpage>.</citation></ref>
<ref id="ref2"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Anthony</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <source>Ant Word Profiler (Version 1.5.1)</source>. <publisher-loc>Tokyo, Japan</publisher-loc>: <publisher-name>Waseda University</publisher-name>.</citation></ref>
<ref id="ref45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bauer</surname> <given-names>L.</given-names></name> <name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>1993</year>). <article-title>Word families</article-title>. <source>Int. J. Lexicogr.</source> <volume>6</volume>, <fpage>253</fpage>&#x2013;<lpage>279</lpage>. doi: <pub-id pub-id-type="doi">10.1093/ijl/6.4.253</pub-id></citation></ref>
<ref id="ref3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>J.</given-names></name> <name><surname>Matthews</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>The relationship between three measures of L2 vocabulary knowledge and L2 listening and reading</article-title>. <source>Lang. Test.</source> <volume>35</volume>, <fpage>3</fpage>&#x2013;<lpage>25</lpage>. doi: <pub-id pub-id-type="doi">10.1177/0265532216676851</pub-id></citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coxhead</surname> <given-names>A.</given-names></name> <name><surname>Walls</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>TED talks, vocabulary, and listening for EAP</article-title>. <source>TESOL ANZ J.</source> <volume>20</volume>, <fpage>55</fpage>&#x2013;<lpage>65</lpage>.</citation></ref>
<ref id="ref5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dang</surname> <given-names>T.</given-names></name> <name><surname>Webb</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>The lexical profile of academic spoken English</article-title>. <source>Engl. Specif. Purp.</source> <volume>33</volume>, <fpage>66</fpage>&#x2013;<lpage>76</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.esp.2013.08.001</pub-id></citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dang</surname> <given-names>T. N. Y.</given-names></name> <name><surname>Webb</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>Evaluating lists of high-frequency words</article-title>. <source>ITL Int. J. Appl. Linguist.</source> <volume>167</volume>, <fpage>132</fpage>&#x2013;<lpage>158</lpage>. doi: <pub-id pub-id-type="doi">10.1075/itl.167.2.02dan</pub-id></citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dang</surname> <given-names>T. N. Y.</given-names></name> <name><surname>Webb</surname> <given-names>S.</given-names></name> <name><surname>Coxhead</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Evaluating lists of high-frequency words: teachers&#x2019; and learners&#x2019; perspectives</article-title>. <source>Lang. Teach. Res.</source> <fpage>136216882091118</fpage>. doi: <pub-id pub-id-type="doi">10.1177/1362168820911189</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). Corpus of American soap operas. Available at: <ext-link xlink:href="https://www.english-corpora.org/soap/" ext-link-type="uri">https://www.english-corpora.org/soap/</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref9"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name></person-group> (<year>2019a</year>). The movie corpus. Available at: <ext-link xlink:href="https://www.english-corpora.org/movies/" ext-link-type="uri">https://www.english-corpora.org/movies/</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref10"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name></person-group> (<year>2019b</year>). The TV corpus. Available at: <ext-link xlink:href="https://www.english-corpora.org/tv/" ext-link-type="uri">https://www.english-corpora.org/tv/</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>The TV and movies corpora: design, construction, and use</article-title>. <source>Int. J. Corpus Ling.</source> <volume>26</volume>, <fpage>10</fpage>&#x2013;<lpage>37</lpage>. doi: <pub-id pub-id-type="doi">10.1075/ijcl.00035.dav</pub-id></citation></ref>
<ref id="ref12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graham</surname> <given-names>S.</given-names></name></person-group> (<year>2006</year>). <article-title>Listening comprehension: The learners&#x2019; perspective</article-title>. <source>System</source> <volume>34</volume>, <fpage>165</fpage>&#x2013;<lpage>182</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.system.2005.11.001</pub-id></citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ha</surname> <given-names>T. H.</given-names></name></person-group> (<year>2021a</year>). <article-title>A Rasch-based validation of the Vietnamese version of the listening vocabulary levels test</article-title>. <source>Lang. Test. Asia</source> <volume>11</volume>, <fpage>1</fpage>&#x2013;<lpage>19</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s40468-021-00132-7</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ha</surname> <given-names>T. H.</given-names></name></person-group> (<year>2021b</year>). <article-title>Exploring the relationships between various dimensions of receptive vocabulary knowledge and L2 listening and reading comprehension</article-title>. <source>Lang. Test. Asia</source> <volume>11</volume>, <fpage>1</fpage>&#x2013;<lpage>20</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s40468-021-00131-8</pub-id></citation></ref>
<ref id="ref15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harris</surname> <given-names>T.</given-names></name></person-group> (<year>2003</year>). <article-title>Listening with your eyes: the importance of speech-related gestures in the language classroom</article-title>. <source>Foreign Lang. Ann.</source> <volume>36</volume>, <fpage>180</fpage>&#x2013;<lpage>187</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1944-9720.2003.tb01468.x</pub-id></citation></ref>
<ref id="ref16"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Heatley</surname> <given-names>A.</given-names></name> <name> <surname>Nation</surname> <given-names>I. S. P.</given-names></name> <name> <surname>Coxhead</surname> <given-names>A.</given-names></name></person-group> (<year>2002</year>). Range: A program for the analysis of vocabulary in texts. Available at: <ext-link xlink:href="http://www.victoria.ac.nz/lals/about/staff/paul-nation" ext-link-type="uri">http://www.victoria.ac.nz/lals/about/staff/paul-nation</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hsu</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>The most frequent BNC/COCA mid- and low-frequency word families in English-medium traditional Chinese medicine (TCM) textbooks</article-title>. <source>Engl. Specif. Purp.</source> <volume>51</volume>, <fpage>98</fpage>&#x2013;<lpage>110</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.esp.2018.04.001</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>M.</given-names></name> <name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>2000</year>). <article-title>Unknown vocabulary density and reading comprehension</article-title>. <source>Read. Foreign Lang.</source> <volume>13</volume>, <fpage>403</fpage>&#x2013;<lpage>430</lpage>.</citation></ref>
<ref id="ref19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klassen</surname> <given-names>K.</given-names></name></person-group> (<year>2021</year>). <article-title>Proper name theory and implications for second language reading</article-title>. <source>Lang. Teach.</source> <fpage>1</fpage>&#x2013;<lpage>7</lpage>. doi: <pub-id pub-id-type="doi">10.1017/S026144482100015X</pub-id></citation></ref>
<ref id="ref20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kobeleva</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>Second language listening and unfamiliar proper names: comprehension barrier?</article-title> <source>RELC J.</source> <volume>43</volume>, <fpage>83</fpage>&#x2013;<lpage>98</lpage>. doi: <pub-id pub-id-type="doi">10.1177/0033688212440637</pub-id></citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lange</surname> <given-names>K.</given-names></name> <name><surname>Matthews</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Exploring the relationships between L2 vocabulary knowledge, lexical segmentation, and L2 listening comprehension</article-title>. <source>Stud. Second Lang. Learn. Teach.</source> <volume>10</volume>, <fpage>723</fpage>&#x2013;<lpage>749</lpage>. doi: <pub-id pub-id-type="doi">10.14746/ssllt.2020.10.4.4</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Love</surname> <given-names>R.</given-names></name> <name><surname>Dembry</surname> <given-names>C.</given-names></name> <name><surname>Hardie</surname> <given-names>A.</given-names></name> <name><surname>Brezina</surname> <given-names>V.</given-names></name> <name><surname>McEnery</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>The Spoken BNC2014: designing and building a spoken corpus of everyday conversations</article-title>. <source>Int. J. Corpus Ling.</source> <volume>22</volume>, <fpage>319</fpage>&#x2013;<lpage>344</lpage>. doi: <pub-id pub-id-type="doi">10.1075/ijcl.22.3.02lov</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McLean</surname> <given-names>S.</given-names></name> <name><surname>Kramer</surname> <given-names>B.</given-names></name></person-group> (<year>2015</year>). <article-title>The creation of a new vocabulary levels test</article-title>. <source>Shiken</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>11</lpage>.</citation></ref>
<ref id="ref24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McLean</surname> <given-names>S.</given-names></name> <name><surname>Kramer</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>The development of a Japanese bilingual version of the new vocabulary levels test</article-title>. <source>VERB</source> <volume>5</volume>, <fpage>2</fpage>&#x2013;<lpage>5</lpage>.</citation></ref>
<ref id="ref25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McLean</surname> <given-names>S.</given-names></name> <name><surname>Kramer</surname> <given-names>B.</given-names></name> <name><surname>Beglar</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>The creation and validation of a listening vocabulary levels test</article-title>. <source>Lang. Teach. Res.</source> <volume>19</volume>, <fpage>741</fpage>&#x2013;<lpage>760</lpage>. doi: <pub-id pub-id-type="doi">10.1177/1362168814567889</pub-id></citation></ref>
<ref id="ref26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>2006</year>). <article-title>How large a vocabulary is needed for reading and listening?</article-title> <source>Can. Mod. Lang. Rev.</source> <volume>63</volume>, <fpage>59</fpage>&#x2013;<lpage>82</lpage>. doi: <pub-id pub-id-type="doi">10.3138/cmlr.63.1.59</pub-id></citation></ref>
<ref id="ref27"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>2013</year>). <source>Learning Vocabulary in another Language.</source> <edition>2nd</edition> <italic>Edn</italic>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation></ref>
<ref id="ref28"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>2017</year>). The BNC/COCA level 6 word family lists (version 1.0.0) [data file]. Available at: <ext-link xlink:href="http://www.victoria.ac.nz/lals/staff/paul-nation.aspx" ext-link-type="uri">http://www.victoria.ac.nz/lals/staff/paul-nation.aspx</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref29"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Nation</surname> <given-names>I. S. P.</given-names></name></person-group> (<year>2020</year>). About the BNC/COCA headword lists. Available at: <ext-link xlink:href="https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists" ext-link-type="uri">https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists</ext-link> (Accessed January 26, 2022).</citation></ref>
<ref id="ref30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nurmukhamedov</surname> <given-names>U.</given-names></name></person-group> (<year>2017</year>). <article-title>Lexical coverage of TED talks: implications for vocabulary instruction</article-title>. <source>TESOL J.</source> <volume>8</volume>, <fpage>768</fpage>&#x2013;<lpage>790</lpage>. doi: <pub-id pub-id-type="doi">10.1002/tesj.323</pub-id></citation></ref>
<ref id="ref31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nurmukhamedov</surname> <given-names>U.</given-names></name> <name><surname>Sharakhimov</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Corpus-based vocabulary analysis of English podcasts</article-title>. <source>RELC J.</source> <fpage>0033688220979315</fpage>. doi: <pub-id pub-id-type="doi">10.1177/0033688220979315</pub-id></citation></ref>
<ref id="ref32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nurmukhamedov</surname> <given-names>U.</given-names></name> <name><surname>Webb</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Lexical coverage and profiling</article-title>. <source>Lang. Teach.</source> <volume>52</volume>, <fpage>188</fpage>&#x2013;<lpage>200</lpage>. doi: <pub-id pub-id-type="doi">10.1017/S0261444819000028</pub-id></citation></ref>
<ref id="ref33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ozturk</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Second language vocabulary growth at advanced level</article-title>. <source>Lang. Learn. J.</source> <volume>44</volume>, <fpage>6</fpage>&#x2013;<lpage>16</lpage>. doi: <pub-id pub-id-type="doi">10.1080/09571736.2012.708054</pub-id></citation></ref>
<ref id="ref34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmitt</surname> <given-names>N.</given-names></name> <name><surname>Cobb</surname> <given-names>T.</given-names></name> <name><surname>Horst</surname> <given-names>M.</given-names></name> <name><surname>Schmitt</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>How much vocabulary is needed to use English? Replication of van Zeeland and Schmitt (2012), Nation (2006) and Cobb (2007)</article-title>. <source>Lang. Teach.</source> <volume>50</volume>, <fpage>212</fpage>&#x2013;<lpage>226</lpage>. doi: <pub-id pub-id-type="doi">10.1017/S0261444815000075</pub-id></citation></ref>
<ref id="ref35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmitt</surname> <given-names>N.</given-names></name> <name><surname>Jiang</surname> <given-names>X.</given-names></name> <name><surname>Grabe</surname> <given-names>W.</given-names></name></person-group> (<year>2011</year>). <article-title>The percentage of words known in a text and reading comprehension</article-title>. <source>Mod. Lang. J.</source> <volume>95</volume>, <fpage>26</fpage>&#x2013;<lpage>43</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1540-4781.2011.01146.x</pub-id></citation></ref>
<ref id="ref36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tegge</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>The lexical coverage of popular songs in English language teaching</article-title>. <source>System</source> <volume>67</volume>, <fpage>87</fpage>&#x2013;<lpage>98</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.system.2017.04.016</pub-id></citation></ref>
<ref id="ref37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Zeeland</surname> <given-names>H.</given-names></name> <name><surname>Schmitt</surname> <given-names>N.</given-names></name></person-group> (<year>2013</year>). <article-title>Lexical coverage in L1 and L2 listening comprehension: The same or different from reading comprehension?</article-title> <source>Appl. Linguis.</source> <volume>34</volume>, <fpage>457</fpage>&#x2013;<lpage>479</lpage>. doi: <pub-id pub-id-type="doi">10.1093/applin/ams074</pub-id></citation></ref>
<ref id="ref38"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <source>The Routledge Handbook of Vocabulary Studies</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Routledge</publisher-name>.</citation></ref>
<ref id="ref39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S. A.</given-names></name> <name><surname>Chang</surname> <given-names>A. C.-S.</given-names></name></person-group> (<year>2012</year>). <article-title>Second language vocabulary growth</article-title>. <source>RELC J.</source> <volume>43</volume>, <fpage>113</fpage>&#x2013;<lpage>126</lpage>. doi: <pub-id pub-id-type="doi">10.1177/0033688212439367</pub-id></citation></ref>
<ref id="ref40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S.</given-names></name> <name><surname>Paribakht</surname> <given-names>T. S.</given-names></name></person-group> (<year>2015</year>). <article-title>What is the relationship between the lexical profile of test items and performance on a standardized English proficiency test?</article-title> <source>Engl. Specif. Purp.</source> <volume>38</volume>, <fpage>34</fpage>&#x2013;<lpage>43</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.esp.2014.11.001</pub-id></citation></ref>
<ref id="ref41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S.</given-names></name> <name><surname>Rodgers</surname> <given-names>M. P. H.</given-names></name></person-group> (<year>2009a</year>). <article-title>Vocabulary demands of television programs</article-title>. <source>Lang. Learn.</source> <volume>59</volume>, <fpage>335</fpage>&#x2013;<lpage>366</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1467-9922.2009.00509.x</pub-id></citation></ref>
<ref id="ref42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S.</given-names></name> <name><surname>Rodgers</surname> <given-names>M. P. H.</given-names></name></person-group> (<year>2009b</year>). <article-title>The lexical coverage of movies</article-title>. <source>Appl. Linguis.</source> <volume>30</volume>, <fpage>407</fpage>&#x2013;<lpage>427</lpage>. doi: <pub-id pub-id-type="doi">10.1093/applin/amp010</pub-id></citation></ref>
<ref id="ref44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>S.</given-names></name> <name><surname>Sasao</surname> <given-names>Y.</given-names></name> <name><surname>Ballance</surname> <given-names>O.</given-names></name></person-group> (<year>2017</year>). <article-title>The updated vocabulary levels test</article-title>. <source>ITL Int. J. Appl. Linguist.</source> <volume>168</volume>, <fpage>33</fpage>&#x2013;<lpage>69</lpage>. doi: <pub-id pub-id-type="doi">10.1075/itl.168.1.02web</pub-id></citation></ref>
<ref id="ref43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>L.</given-names></name> <name><surname>Coxhead</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>A corpus-based study of vocabulary in the new concept English textbook series</article-title>. <source>RELC J.</source> <fpage>003368822096416</fpage>. doi: <pub-id pub-id-type="doi">10.1177/0033688220964162</pub-id></citation></ref></ref-list>
<glossary>
<def-list>
<title>Abbreviations</title>
<def-item><term>BNC</term><def><p>British National Corpus</p></def></def-item>
<def-item><term>COCA</term><def><p>Corpus of Contemporary American English</p></def></def-item>
<def-item><term>ESL</term><def><p>English as a second language</p></def></def-item>
<def-item><term>MW</term><def><p>Marginal words</p></def></def-item>
<def-item><term>NOW</term><def><p>News on the Web</p></def></def-item>
<def-item><term>PN</term><def><p>Proper nouns</p></def></def-item>
<def-item><term>TC</term><def><p>Transparent compounds</p></def></def-item>
</def-list>
</glossary>
<fn-group>
<fn id="fn0004"><p><sup>1</sup><ext-link xlink:href="https://www.english-corpora.org/" ext-link-type="uri">https://www.english-corpora.org/</ext-link></p></fn>
</fn-group>
</back>
</article>