DATA REPORT article
Front. Artif. Intell.
Sec. Natural Language Processing
Volume 8 - 2025 | doi: 10.3389/frai.2025.1557137
This article is part of the Research Topic: Semantics and Natural Language Processing in Agriculture
A lexicon obtained and validated by a data-driven approach for organic residues valorization in emerging and developing countries
Provisionally accepted
French Agricultural Research Centre for International Development (CIRAD), Montpellier, France
Open dumping remains the main management process for organic residues in middle- and low-income countries [1]. According to that study, the organic fraction makes up 44% of municipal solid waste, yet waste recycling or valorization reaches only about 7%, 4.7%, and 21% in Sub-Saharan Africa, the Caribbean/Latin America, and South Asia, respectively. It is therefore of interest to determine the status of organic residue valorization in those regions. This question can be explored through textual analysis, and the method presented here is a first step to that end. When no thematic lexicon is available, text mining can be used to extract one from a bibliographic corpus in order to drive a state-of-the-art review of the valorization of organic residues in agriculture in developing countries. In this work, text mining and Natural Language Processing (NLP) methods are used to generate a specialized lexicon for this specific area.

Defining the relevance of terms is challenging and is discussed in this data paper. Terminology extraction methods are generally based on benchmarks (i.e. gold standards) or on manually validated terms [2], but experimental protocols that take different kinds of relevance into account to consolidate the process are understudied. Such a protocol needs to integrate expert knowledge, the agreement of experts on the definitions and the associated evaluation, and the task to be carried out. This paper highlights how this construction is conducted by considering different points of view on relevance in a multidisciplinary context. It is important to note that a lexicon specifically focused on organic residue valorization does not exist in agricultural semantic resources such as AgroPortal, which includes more than 200 ontologies, thesauri, and lexicons [3].

The present work uses a text mining approach to construct a thematic lexicon from a corpus related to the valorization of organic waste in developing countries.
The data collection method is detailed first, followed by a section on the technique adopted to select, annotate, and validate the lexicon. Finally, perspectives for future work are given in a concluding section. This exploratory methodology could be used to guide a more in-depth and oriented text analysis of scientific publications (i.e. scientometric analysis). Moreover, it can be reused and/or adapted in other domains depending on the purpose. In our ongoing work, we use this lexicon to conduct a semantic analysis of scientific publications dealing with organic residue valorization in emerging and developing countries.

Proposed Method to Collect the Data

Several online databases (WoS, Ovid, Scopus, Google Scholar, HAL, Cairn.info, AGRIS, and Agritrop) were consulted in 2021 to extract articles, published until 2021, relating to the biotransformation and valorization in agriculture of organic residues in emerging and developing countries. The terms used for the bibliographic search, expressed as specific queries for each database, are detailed in the Appendix section of this paper.

The equation used in the Web of Science collection was then adapted to the specificities of the other databases. Because advanced search was not available for most of the free online databases, a global thematic search was adopted for them (Appendix 1). The search returned 24 186 references, which were then sorted to remove duplicates and to keep references in English only. A total of 7 692 references were used to generate the dataset available in the Excel file (Initial_Corpus_References.xlsx) on the repository [4]. The corpus of the dataset combines articles, reports, book sections, and student theses, with their bibliographic references (authors, year of publication, title, doi, and url).

BioTex [5] was used to perform an Automatic Term Extraction (ATE) on the corpus. The terms extracted (e.g. rumen, humic acid, nutrient recovery, …) give a semantic point of view of the theme of the text. This tool was developed for biomedical term extraction [6] and was later adapted to extract terms associated with food security [7]. First, BioTex performed a linguistic screening through syntactic patterns (noun-noun, adjective-noun, …). Then, in order to rank the terms extracted from the "titles" corpus, the F-TF-IDF-C score integrated into BioTex was applied. This measure combines (i) C-value (4), to favor the multi-word terms extracted, and (ii) TF-IDF (Term Frequency-Inverse Document Frequency), to highlight discriminative terms [6].

Text mining was performed on the titles of the corpus using the BioTex tool [5], and the result can be found in the associated Excel file (Extracted_Terms.xlsx) on the repository [4]. The first column contains the 19 580 terms obtained from the extraction. The second column ("term") presents the terms, which consist of single words or compound nouns (e.g. mulch, effluents, soil amendments, bagasse cocomposting). The rank, in the last column, reflects the ordering of the terms by their discriminative score (i.e. F-TF-IDF-C).

Five specialist raters conducted a first annotation on 200 candidate terms sampled among the 19 580, in order to exclude terms irrelevant to the topic of interest, following the guideline file (Annotation_guidelines.pdf) available on the repository [4]. The specialists were researchers in the "Recyclage et risque" unit of CIRAD in France, working on the recycling of organic residues in agriculture and the associated risks. The group was specialized in biochemistry, agronomy, microbiology, ecology, soil science, and environmental assessment, using both monitoring and modelling approaches. Each rater was asked to assign each candidate term to (i) organic residues (OWT) and/or (ii) biotransformation process (TM) and/or (iii) valorization in agriculture (AV), or (iv) none of them (None), following the first annotation guide.
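The F-TF-IDF-C ranking mentioned above can be illustrated with a minimal Python sketch. It is not the BioTex implementation: it assumes that the two scores are combined by a harmonic mean (an F-measure-style combination, as the name suggests), that TF-IDF is aggregated by taking its maximum over titles, and that the C-value length factor is log2(length + 1) so that single-word terms are not zeroed. The function names, the toy titles, and the candidate list are hypothetical.

```python
import math


def count_in(tokens, term_tokens):
    """Number of times the token sequence term_tokens occurs in tokens."""
    n = len(term_tokens)
    return sum(tokens[i:i + n] == term_tokens for i in range(len(tokens) - n + 1))


def tf_idf(term, docs):
    """Best TF-IDF of a term over tokenized titles (max aggregation is an assumption)."""
    counts = [count_in(doc, term.split()) for doc in docs]
    df = sum(c > 0 for c in counts)
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    return max(c / len(doc) for c, doc in zip(counts, docs) if doc) * idf


def c_value(term, freq):
    """C-value: frequency, discounted when the term is nested in longer candidates.

    Assumption: length factor log2(len + 1) instead of log2(len).
    """
    words = term.split()
    longer = [f for other, f in freq.items()
              if other != term and count_in(other.split(), words) > 0]
    penalty = sum(longer) / len(longer) if longer else 0.0
    return math.log2(len(words) + 1) * (freq[term] - penalty)


def f_tfidf_c(tfidf_score, c_score):
    """Harmonic mean of the two scores (assumed combination rule)."""
    if tfidf_score <= 0 or c_score <= 0:
        return 0.0
    return 2 * tfidf_score * c_score / (tfidf_score + c_score)


def rank_terms(titles, candidates):
    """Return (term, score) pairs sorted by decreasing combined score."""
    docs = [t.lower().split() for t in titles]
    freq = {c: sum(count_in(d, c.split()) for d in docs) for c in candidates}
    scores = {c: f_tfidf_c(tf_idf(c, docs), c_value(c, freq)) for c in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus of titles and lowercase candidate terms.
titles = [
    "composting of sewage sludge",
    "sewage sludge reuse in agriculture",
    "anaerobic digestion of manure",
]
for term, score in rank_terms(titles, ["sewage sludge", "sludge", "manure"]):
    print(f"{term}: {score:.3f}")
```

In this toy example, "sludge" only ever occurs inside "sewage sludge", so its C-value (and hence its combined score) drops to zero, which illustrates how the C-value component favors the multi-word term.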
Each category is defined in the annotation guidelines. Table 1 shows an example of the first annotation step conducted by the specialists.

Fleiss' kappa [8], which measures agreement between several raters, equals 0.52 for this first annotation, corresponding to poor agreement between the 5 raters. The 4 categories chosen to annotate the candidate terms appeared to be too restrictive: terms indirectly associated with one or more of the categories were excluded by several raters.

In a second annotation guideline, the manual labelling process focused on the overall degree of pertinence with respect to the topic of valorization of organic residues. In this context, each candidate term was annotated according to 3 classes: (i) very pertinent when it was directly connected to one or more categories (i.e. OWT+, TM+, AV+), (ii) pertinent when it was indirectly connected to one or more categories (i.e. OWT, TM, AV), and (iii) irrelevant (i.e. None).

A second annotation of the same 200 sampled terms was then conducted. All results of the two annotation rounds are available in the file "Raters_Annotation_Results.xlsx" in our dataset [4]. Fleiss' kappa was calculated for 3 and for 5 raters; its value decreased from 0.84 to 0.60 as the number of raters increased. A closer comparison showed that the 3 raters with the high kappa value selected more terms indirectly related to one or more categories. In order to include as many terms indirectly related to the subject as possible, it was decided to apply the logic of these 3 annotators to the categorization of the remaining terms.

In Table 2, the results are evaluated in terms of precision (percentage of pertinent terms) obtained over the top k extracted terms (P@k). The results confirm that the ranking function of BioTex is suitable, since it places relevant terms at the top of the list.
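The two measures used above, Fleiss' kappa and P@k, can be sketched in a few lines of Python. This is an illustrative sketch rather than the procedure actually run by the authors: the helper names and the toy labels are hypothetical, each rater is assumed to give exactly one label per term, and a term is counted as pertinent for P@k whenever its label is not "None".

```python
def labels_to_counts(labels_per_item, categories):
    """Turn per-item lists of rater labels into per-item category counts."""
    return [[row.count(c) for c in categories] for row in labels_per_item]


def fleiss_kappa(counts):
    """Fleiss' kappa from counts[i][j] = number of raters giving item i category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_exp = sum(s * s for s in shares)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when a single category is always used")
    return (p_obs - p_exp) / (1 - p_exp)


def precision_at_k(ranked_labels, k):
    """Share of pertinent terms (label != 'None') among the top k ranked terms."""
    top = ranked_labels[:k]
    return sum(label != "None" for label in top) / len(top)


# Hypothetical labels: 4 terms, 3 raters, the 3 classes of the second guideline.
categories = ["very pertinent", "pertinent", "None"]
labels = [
    ["very pertinent", "very pertinent", "very pertinent"],
    ["pertinent", "pertinent", "None"],
    ["None", "None", "None"],
    ["pertinent", "very pertinent", "pertinent"],
]
print(fleiss_kappa(labels_to_counts(labels, categories)))

# Hypothetical consolidated labels of the terms, in ranking order.
print(precision_at_k(["very pertinent", "pertinent", "None", "pertinent"], 2))
```

Computing kappa once on all raters and once on a subset, as done in the text for 5 and 3 raters, only requires restricting each inner list of labels to the chosen raters before calling `labels_to_counts`.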
For instance, the precision for k=100 and k=200 is high (more than 80%), but the recall will be low because many relevant terms are not proposed. A precise recall value is difficult to calculate because no gold standard is available.

The dataset detailed above can be found in the CIRAD Dataverse repository [4].

One of the 5 specialists then pursued the annotation, with a degree of relevance, on the remaining extracted terms. It was decided to continue the categorization with the degree of pertinence and to apply the logic of the 3 annotators with the high kappa value explained above. The categorization took the rater about one week of work. The same five raters were then asked to verify and finalize the selection of terms related to the biotransformation and valorization in agriculture of organic residues in low-income countries. All verified relevant terms were combined in the last file on the repository (Pertinent_Terms.xlsx), which contains terms that are indirectly (first sheet) or directly (second to fourth sheets) related to the topic.

Of the 19 580 initial candidate terms, about 75% were not associated with the topic of interest (Table 2). Irrelevant terms included words related neither to organic residues, nor to biotransformation, nor to valorization in agriculture, such as: absence, certification, design, effect, fecundity, fitness, gray, immune response, integration analysis, low cost, marker genes, …. Among the 25% of relevant terms, 2 079 were closely associated with organic residue valorization in emerging and developing countries, such as sludge, sewage, livestock, manure, slurry, anaerobic digestion, composting, and vermicomposting. Several terms can be found in the glossary of terms related to livestock and manure management [9] and figure among the terms with high pertinence in this dataset.
Moreover, some of the relevant terms are cited in the literature in connection with the biotransformation (e.g. anaerobic digestion, composting, bioethanol, biohydrogen) and valorization in agriculture (e.g. biofertilization, organic fertilizers, amendments) of organic residues (e.g. rice straw, sugarcane bagasse, animal manure) [10], [11].

The produced lexicon is currently used in a semantic-driven analysis of our corpus based on the CorTexT software [12]. In the context of this ongoing multidisciplinary work, we obtain deeper knowledge regarding the bioconversion and valorization in agriculture of organic residues in low-income countries, as highlighted in Figure 1-A and B.

The text-mining tool used in this work is based on statistical criteria that highlight discriminative terms. This method identifies significant terms that are present in the texts. As future work, the proposed framework could be extended by extracting term variations [13], which makes it possible to recognize rare and/or unsystematic terms as well as synonyms. Moreover, embedding approaches [14], language models [15], and generative methods based on Large Language Model (LLM) techniques [16] could be applied to recognize new terms. Language model techniques are based on generic models like BERT (Bidirectional Encoder Representations from Transformers) [15] or on specific ones like AgriBERT (Knowledge-Infused Agricultural Language Models for Matching Food and Nutrition) [17], which is dedicated to the agriculture domain. These models can be fine-tuned for specific tasks, and could thus be used to improve terminology extraction. Note that the use of language models could be relevant for specific NLP tasks and domains such as agriculture [18]. As future work, we plan to compare the methods applied in this paper with other approaches to terminology extraction based on language models, including LLMs [19]. LLMs could also be used to expand our initial lexicon.
This would make it possible to extract variations of existing terms and synonyms, as well as new terms. However, in the context of our work, the objective is to conduct a semantic analysis of terms present in the corpus, so words or phrases that appear in the lexicon but not in our dataset (i.e. corpus) are of limited use.

Web of Science Core Collection query: WOS, FSTA and Biosis

TS = ("sewage sludge" OR "crop residue*" OR "agricultural waste" OR "industrial waste" OR "food waste" OR "household waste" OR "organic waste" OR "urban waste" OR "co-product*" OR "byproduct*" OR "biomass" OR "organic waste product*" OR mulch OR digestate* OR compost*)
AND TS = (decomposition OR fermentation OR anaerobic OR aerobic OR methanisation OR composting OR vermicomposting OR fertilization OR bokashi OR biodegradation OR mineralization OR recycling OR "agricultural valuation" OR biotransformation OR mulching)
AND TS = (africa OR "acp countries" OR "central america" OR "south america" OR "latin america" OR "south east asia" OR "south asia" OR afghanistan OR angola OR albania OR argentina OR armenia OR antigua OR azerbaijan OR burundi OR benin OR "burkina faso" OR bangladesh OR bosnia OR belarus OR belize OR bolivia OR brazil OR bhutan OR botswana OR "central african republic" OR china OR "ivory coast" OR cameroon OR congo OR colombia OR comoros OR "cape verde" OR "costa rica" OR cuba OR djibouti OR dominica OR "dominican republic" OR algeria OR ecuador OR egypt OR eritrea OR ethiopia OR fiji OR micronesia OR gabon OR georgia OR ghana OR guinea OR gambia OR grenada OR guatemala OR guyana OR honduras OR haiti OR indonesia OR india OR iran OR iraq OR jamaica OR jordan OR kazakhstan OR kenya OR kyrgyzstan OR cambodia OR kiribati OR "lao people's democratic republic" OR lebanon OR liberia OR libya OR "saint lucia" OR "sri lanka" OR lesotho OR morocco OR moldova OR madagascar OR maldives OR mexico OR "marshall islands" OR "north macedonia" OR mali OR myanmar OR montenegro OR mongolia OR mozambique OR mauritania OR montserrat OR mauritius OR malawi OR malaysia OR namibia OR niger OR nigeria OR nicaragua OR niue OR nepal OR nauru OR pakistan OR panama OR peru OR philippines OR palau OR "papua new guinea" OR " Democratic People's Republic of Korea" OR "north korea" OR paraguay OR "palestinian territory" OR rwanda OR sudan OR senegal OR "saint helena, ascension and tristan da cunha" OR "solomon islands" OR "sierra leone" OR "el Salvador" OR somalia OR serbia OR "south sudan" OR "sao tome and principe" OR suriname OR eswatini OR "syrian arab republic" OR chad OR togo OR thailand OR tajikistan OR tokelau OR turkmenistan OR "timor-leste" OR tonga OR tunisia OR turkey OR tuvalu OR tanzania OR uganda OR ukraine OR uzbekistan OR "saint vincent and the grenadines" OR venezuela OR "vietnam" OR vanuatu OR "wallis and futuna" OR samoa OR yemen OR "south africa" OR zambia OR zimbabwe)

Because the equivalent of the WoS query returned a very high number of results, the following query was used for Scopus, restricted to the environmental sciences and agricultural subject areas.
Then, only articles, reviews, and conference papers were selected.

TITLE-ABS-KEY ( "sewage sludge" OR "crop residue*" OR "agricultural waste" OR "industrial waste" OR "food waste" OR "household waste" OR "organic waste" OR "urban waste" OR "coproduct*" OR "by-product*" OR "biomass" OR "organic waste product*" OR mulch OR digestate* OR compost* ) AND TITLE-ABS-KEY ( decomposition OR fermentation OR anaerobic OR aerobic OR methanisation OR composting OR vermicomposting OR fertilisation OR bokashi OR biodegradation OR mineralisation OR recycling OR "agricultural valuation" OR biotransformation OR mulching ) AND ( LIMIT-TO ( SUBJAREA , "ENVI" ) OR LIMIT-TO ( SUBJAREA , "AGRI" ) ) AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "re" ) OR LIMIT-TO ( DOCTYPE , "cp" ) )

Google Scholar, HAL, Cairn.info, AGRIS:
Advanced search was not available on the free databases; the search was thus conducted with a general query on the topic, which was:
- In French: « biotransformation et valorisation en agriculture dans les contextes des pays du Sud »
- In English: « biotransformation et valorization in agriculture in low-income countries »
The query was then tested by adding Africa, Latin America, then South-East Asia.
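The queries above share one structure: terms are joined with OR inside each thematic group, and the groups are joined with AND under a database-specific field tag. A minimal Python sketch of that assembly is given below. The helper names are hypothetical, the quoting rule (quote only multi-word terms) is a simplifying assumption that does not reproduce the exact quoting of the published queries, and the term lists are shortened for illustration.

```python
def or_group(terms):
    """Join terms with OR, quoting multi-word terms (assumed quoting rule)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups, field_tag):
    """AND together one OR-group per theme, each under the same field tag."""
    return " AND ".join(f"{field_tag} {or_group(g)}" for g in groups)


# Shortened, illustrative term groups (residues, processes, regions).
residues = ["sewage sludge", "organic waste", "mulch"]
processes = ["composting", "vermicomposting"]
regions = ["africa", "south asia"]

# "TS =" is the topic field tag of the WoS query above.
print(build_query([residues, processes, regions], "TS ="))
# "TITLE-ABS-KEY" is the field used in the Scopus query above (no region group there).
print(build_query([residues, processes], "TITLE-ABS-KEY"))
```

Keeping the term groups as plain lists makes it easier to adapt one thematic search to the syntax of several databases, which is the adaptation step described for the Web of Science equation.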
Keywords: text mining, Organic waste, Biological transformation, Agriculture, Valorization
Received: 09 Jan 2025; Accepted: 01 Aug 2025.
Copyright: © 2025 Rakotomalala and Roche. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Christiane Rakotomalala, French Agricultural Research Centre for International Development (CIRAD), Montpellier, France
Mathieu Roche, French Agricultural Research Centre for International Development (CIRAD), Montpellier, France
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.