FoodRepo: An Open Food Repository of Barcoded Food Products

In the past decade, digital technologies have started to profoundly influence healthcare systems. Digital self-tracking has facilitated more precise epidemiological studies, and in the field of nutritional epidemiology, mobile apps have the potential to alleviate a significant part of the journaling burden by, for example, allowing users to record their food intake via a simple scan of packaged products' barcodes. Such studies thus rely on databases of commercialized products, their barcodes, ingredients, and nutritional values, which are not yet openly available with sufficient geographical and product coverage. In this paper, we present FoodRepo (https://www.foodrepo.org), an open food repository of barcoded food items, whose database is programmatically accessible through an application programming interface (API). Furthermore, an open license gives anyone the right to share and reuse FoodRepo data, including for commercial purposes. With currently more than 21,000 items available on the Swiss market, our database represents a solid starting point for large-scale studies in the field of digital nutrition, with the aim of leading to a better understanding of the intricate connections between diets and health in general, and metabolic disorders in particular.


INTRODUCTION
Metabolic disorders, such as diabetes or obesity, have become a major public health concern, with increasingly large parts of the global population affected (1,2). Nutritional epidemiologists hope to better understand the underlying causes, the potential treatments and prevention strategies by analyzing population and individual patterns through studies that generally rely on surveying dietary habits. Traditional food-intake survey methods are based on questionnaires filled in by participants at a given frequency. The frequency of diet records is an important factor contributing to the accuracy of the study (3). Multiple-day diet records might provide good accuracy when not based on memory, but require strong motivation and time commitment by the participants. Approaches like multiple/single 24-h recalls, which involve a specialized interviewer performing surveys in person or on the phone with the participants, require less engagement, but pose issues with missing data as they rely on short-term memory. Finally, so-called Food Frequency Questionnaires, where participants are asked to indicate the frequency of intake of certain foods over long periods of time (typically 1 year), demand minimal commitment from participants, therefore allowing for large cohort studies on long-term dietary habits. However, the likelihood of missing or incorrect data increases as they count on participants' long-term memory. Overall, self-reported dietary data present biases which limit their applications, especially when they heavily rely on participants' memory (4). Such limitations, which should be properly addressed in further epidemiological studies, may be overcome with more advanced recording methodologies such as dietary biomarkers and digital technologies (5).
Recent technological advances, and in particular the emergence and almost complete market penetration of smartphones, have offered interesting surveying alternatives. In particular, mobile phones have been successfully deployed in several food-related studies (6), for example using food photography (7-12). Other research has also explored the possibility of recording dietary habits by asking participants to scan the barcodes of their consumed food (13,14). Although further investigations are required to assess self-reporting biases, these advances in nutritional research have triggered the release of mobile apps oriented mainly toward diabetes and weight-loss self-management (15-19), showing the willingness and interest of users to monitor their food intake if it provides potential health benefits.
The further expansion of self-monitoring for research and medical purposes relies on comprehensive and continuously updated food databases. A few databases of barcoded products already exist, for example Open Food Facts (20) or the USDA Food Composition Databases (21). While they each have their strengths, not all of them are openly accessible, and they often have limited product coverage and are not regularly updated. For Switzerland, we did not find any database whose product coverage was sufficiently high, whose data was completely open, and which was easily accessible through an Application Programming Interface (API). The last point was particularly important to us, as APIs are necessary for third parties to dynamically use the data in their products and services. Our approach was therefore to build an openly accessible database of barcoded food products with sufficiently high coverage, accessible through a stable API. Rather than focusing on a wide geographic range, we focused on a small country (Switzerland) in order to obtain the necessary coverage. The focus on the Swiss market has the further benefit of requiring support for multiple languages from the beginning, thus making the system readily expandable to other countries, which we are now planning to do.
Here, we present this system, which we call FoodRepo (https://www.foodrepo.org), an openly accessible database of barcoded food products, and we describe the data-acquisition framework, its quality control and maintenance. The word repository is meant to be understood as a data repository, where the community can deposit an increasing number of datapoints on food products. The growing community around FoodRepo and the validation of new products make our database robust, scalable and self-sustainable in the long run. Currently, the FoodRepo database mostly holds products sold in Switzerland, from the main grocery stores in the country. Its international expansion is under development.
Any item in the database is accessible through the FoodRepo website (for an example of products contained in the FoodRepo database, please see Figure 1A) or via our API, described in section Usage Notes. The CC-BY-4 license under which our database is released allows its exploitation by different types of users, from academic researchers to commercial partners. For instance, a Swiss consumers association is using FoodRepo data in their NutriScan mobile app (22) to make the food package information more accessible, and to provide their users with an overall nutritional score.
Beyond this specific example, the FoodRepo database opens the way for promising research opportunities in the field of digital epidemiology and personalized nutrition. Notably, we foresee that, through dietary live-tracking, this database can support studies which combine other recent technological developments and new findings in our understanding of the human metabolism. For example, phone-connected devices for continuous monitoring of blood glucose levels have recently been made available to diabetic patients (23,24), and numerous direct-to-consumer devices to estimate glucose levels have appeared on the market. A plethora of other wireless sensors are now also available to record various physiological parameters such as heart rate or blood pressure, marking a new era of "high-throughput human phenotyping" (25). Studies that would simultaneously track participants' parameters, food intake, glycemic response and physical activity might provide detailed insights on the variability of individual metabolic responses. Interestingly, one of the factors which has recently been found to account for a large part of this variability is the microbiota (26-30). Large-scale testing of these hypotheses through self-tracking could contribute to the assessment of the complex metabolic response of the human body to different energy sources. This requires detailed records of food intake that include nutritional information as well as eating times (31) and food portion sizes (32-34), all challenges that FoodRepo may help to overcome.
However, we highlight an important limitation of all food databases. Generally, the curators of such repositories cannot ensure the validity of the data reported by the producers on the nutrition facts labels. It is indeed well known in the literature that there might be large discrepancies between the reported nutrients and the actual food content, due to different factors, such as food pre-processing or the different industry standards (35)(36)(37)(38)(39)(40). Therefore, all studies using databases such as the one presented here would do well to assess the validity of such data and ideally quantify the reporting errors, especially when using the reported data on nutritional values.
Analyses of the database evolution will give interesting indications of dietary trends and of the overall change in the nutritional quality of packaged food. Although the database itself does not provide information on buying frequency, the continuous introduction of specific products in the market, and thus in the database, can potentially indicate how retailers react to customer demands and changing dietary habits.

METHODS
The database building and maintenance process relies on the following steps: (i) collection of product pictures from local retailers, (ii) data extraction from the pictures, (iii) validation of the extracted data, and (iv) permanent storage in the database (Figure 2). For the initial build of the database, we designed a specific pipeline (bootstrap workflow, Figure 2A), which allowed us to validate the first 20,000 food products in a few months. Given the dynamic nature of our data and the cost of the bootstrap workflow, we designed a second pipeline (currently under development) which relies on the growing FoodRepo community. This workflow (community-based, Figure 2B) allows us to keep up with the new and seasonal products introduced to the market by the retail shops, as well as to ensure the scalability and self-sustainability of FoodRepo in the long run.
The bootstrap workflow (Figure 2A) consists of 3 main steps. The first step entailed a massive manual data collection from three large grocery store chains in Switzerland (specifically Migros, Coop, and Lidl), upon approval from the shops. We hired students to take pictures of all barcoded food items in retail shops located in the Lausanne area. To facilitate the data collection, we specifically designed a simple phone app with which students could scan the products' barcode and take pictures of the front and back of the package, the product's name, ingredients list, and nutrition facts. These pictures were then automatically uploaded to the database. At the end of this step, students had collected on average 4.4 pictures per item. The second step focused on the extraction of information contained in the pictures. Due to the presence of multi-language ingredients and the often wrinkled surfaces of item packaging, Optical Character Recognition (OCR) systems could not achieve reliable accuracy. We therefore opted for a crowd-sourced solution, and in particular we decided to recruit workers on Amazon Mechanical Turk (41) (AMT). AMT is a platform connecting requesters to workers, the latter being financially compensated for completing tasks requiring human intelligence (HITs, Human Intelligence Tasks). Here, we designed a graphical user interface (GUI) allowing workers to transcribe the text they could read from product pictures. Specifically, the GUI presented text boxes where AMT workers provided the product name, nutritional values (in a table format) and ingredients, in every language present on the label (German and/or French for almost all items; Italian and/or English in addition for some products). Three different HITs were set up: one for nutrients, one for product name and one for ingredients. For the last two, we set up qualification rounds for AMT workers, as their transcription involved some language skills. AMT workers could choose to either enter from scratch the information they saw on the pictures, or to approve/modify the suggestions given by an OCR (42) system. At the end of the second step, all annotated products were uploaded into the database, flagged as ready for validation. The third step was thus dedicated to data validation, which was based on extensive manual checking by the FoodRepo team, and was additionally informed by manual reports from visitors to the FoodRepo website and by error-detection analyses of nutritional values. Such online reports are encouraged by the presence of a "report an issue" button on each product web page, which prompts a visitor to file an issue when spotting a potential error. Details about the error-detection analyses are given in the Technical Validation section. Before the final validation of the data, the FoodRepo team as well as students manually checked all products thoroughly.
FIGURE 1 | (A) Screenshot from the webpage of a product on the FoodRepo website. (B) Schematic representation of the pipeline behind our API. When a user or an application (left column) sends a call to the API, the request is handled by the server that hosts the API (middle column). This server then sends a query to the server which hosts the FoodRepo database (right column), where the query is handled by the Elastic Search engine. The data is returned to the API server, which performs final formatting before giving it back to the user or the application. (C) Distribution of API response times, color-coded according to different sections of the back-end pipeline, as shown in (B). In green (main plot and inset), the response times of the Elastic Search server to the application server; in blue, the full time needed for a user to have the data after a call to our API.
The community-based workflow (Figure 2B) is similar to the bootstrap workflow, but instead of counting on AMT workers, it relies on the growing FoodRepo community. As new products become available in retail shops, FoodRepo users can submit them by uploading the corresponding package pictures, using the FoodRepo smartphone app. Currently, the information extraction is still performed by the FoodRepo team, but additional features are being implemented in the app, which will allow users to directly type the product details contained on the package. Before user-provided information is permanently stored in the FoodRepo database, consistent entries will need to be submitted by at least three different FoodRepo users. If such consensus is not reached after seven independent submissions (i.e., there are still fewer than three consistent entries), the item will be manually analyzed by the FoodRepo team for definitive validation and inclusion into the database. This procedure will ensure minimal intervention from our team, while still guaranteeing the reliability of the data. The FoodRepo team is currently fostering the development of an active community through which the continuity of FoodRepo is assured, and which will likely accelerate the birth of independent exploitations of the database, from both public and private partners.
Pictures: URL to the front picture of the sample product, e.g., https://goo.gl/PyjjNa
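The consensus rule of the community-based workflow (three consistent entries, with manual review after seven submissions) can be illustrated with a minimal Python sketch. This is not the FoodRepo implementation: the function names, the dictionary representation of an entry, and in particular the notion of "consistent" used here (identical fields after trimming whitespace and ignoring case) are assumptions made for illustration, and each submission is assumed to come from a different user.

```python
from collections import Counter

MIN_CONSISTENT = 3   # consistent entries required (threshold stated in the text)
MAX_SUBMISSIONS = 7  # submissions allowed before manual review (stated in the text)


def normalize(entry):
    """Hypothetical consistency key: same fields after whitespace/case cleanup."""
    return tuple(
        sorted((key, " ".join(str(value).split()).casefold())
               for key, value in entry.items())
    )


def consensus_status(submissions):
    """Classify a product given its submissions, one per distinct user.

    Returns ("accepted", entry), ("manual_review", None) or ("pending", None).
    """
    counts = Counter()
    first_seen = {}
    for entry in submissions[:MAX_SUBMISSIONS]:
        key = normalize(entry)
        counts[key] += 1
        first_seen.setdefault(key, entry)
        if counts[key] >= MIN_CONSISTENT:
            # Enough matching transcriptions: keep the first matching entry.
            return "accepted", first_seen[key]
    if len(submissions) >= MAX_SUBMISSIONS:
        # No consensus after seven submissions: hand over to the team.
        return "manual_review", None
    return "pending", None
```

A stricter or looser definition of consistency (for example, comparing nutrient values numerically) would only require changing `normalize`.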

DATA RECORDS
While here we only provide the link to the front image of the product, an API call would provide the links to all pictures available for the requested products. A complete description of the fields provided by the API is available in the API documentation, on the project's GitHub repository.
the database (see Table 1). The programmatic access to the database is allowed by an API, described in the section Usage Notes.

TECHNICAL VALIDATION
As described in the Methods section, during the bootstrap stage (Figure 2A) the final validation was performed manually by the FoodRepo team, while in the community workflow (Figure 2B), the accuracy of the data is ensured by the consensus test (the FoodRepo team intervenes only if fewer than three matches are achieved after the uploads of the same product by seven different users). We highlight here that FoodRepo strictly reflects the information printed on product packages, even when suspicious values are present on the labels. All validation processes have thus been set up to detect transcription errors. Within this rationale, computational analyses were implemented for the detection of outliers, in particular regarding the nutritional values. These tests reflect basic constraints, such as the mass upper limit:

p + f + c ≤ 100 (1)

where p, f, c are respectively the product's protein, fat and carbohydrate concentrations expressed in grams per 100 g of product. From Equation (1), one can also derive other linear inequalities for a single nutrient or pairs of nutrients, namely p + f ≤ 100, p + c ≤ 100, and c + f ≤ 100. These simple tests allowed us to detect transcription errors in earlier versions of the database, as illustrated by the outliers in Figure 3A, which shows the distribution of products in the fat-carbohydrates space with the joint mass boundary. Similarly, other typos could be spotted by checking that the concentration of a subclass of nutrient is smaller than that of the parent class. This is the case, for instance, for sugars vs. carbohydrates, or saturated fat vs. fat, as shown in Figure 3B.
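As an illustration, the mass upper limit p + f + c ≤ 100 and the subclass checks can be written as a short Python function. This is a sketch and not the code used by the FoodRepo team: the dictionary keys, the function name and the rounding tolerance `tol` are assumptions introduced here.

```python
def find_transcription_issues(nutrients, tol=0.0):
    """Return a list of violated constraints for one product.

    `nutrients` maps hypothetical field names to grams per 100 g of product.
    `tol` is an assumed allowance for label rounding (0 by default).
    """
    p = nutrients.get("protein", 0.0)
    f = nutrients.get("fat", 0.0)
    c = nutrients.get("carbohydrates", 0.0)
    issues = []

    # Each single concentration must lie between 0 and 100 g per 100 g.
    for name, value in nutrients.items():
        if value < 0 or value > 100 + tol:
            issues.append(f"{name} outside [0, 100]")

    # Mass upper limit: macronutrients cannot exceed the product's own mass.
    # The pairwise inequalities (p + f, p + c, c + f) follow from this one.
    if p + f + c > 100 + tol:
        issues.append("protein + fat + carbohydrates > 100")

    # A subclass cannot exceed its parent class.
    if nutrients.get("sugars", 0.0) > c + tol:
        issues.append("sugars > carbohydrates")
    if nutrients.get("saturated_fat", 0.0) > f + tol:
        issues.append("saturated_fat > fat")

    return issues
```

Since the database mirrors the printed label, a flagged product is only a candidate for re-checking against its pictures, not necessarily a transcription error.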
Another simple relation that helps check products' nutrition facts can be derived from the standard approximation of energy density based on nutrient composition (45):

E ≈ 4p + 9f + 4c (2)

where the product's energy content E is expressed in kCal/100 g. Combining expressions 1 and 2 provides upper and lower boundaries for the energy content (for example Figure 3C). In this case however, not all dots that fall outside the boundaries were due to typos in transcription. Indeed, the approximation in Equation (2) does not take into account the different contribution to energy of carbohydrates such as polyols, which account for less than 4 kCal/g. This is why products such as candies and chewing gums would fall below the energy boundaries.
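One possible way to combine the two expressions is sketched below in Python, assuming the usual conversion factors of 4, 9 and 4 kCal/g for protein, fat and carbohydrates. For a given carbohydrate concentration c, the remaining mass 100 − c contributes at least 0 kCal (if it contains neither protein nor fat) and at most 9 kCal/g (if it is entirely fat), which gives 4c ≤ E ≤ 900 − 5c. The function names and the tolerance are assumptions, and the exact boundaries drawn in Figure 3C may have been derived differently.

```python
KCAL_PER_GRAM = {"protein": 4.0, "fat": 9.0, "carbohydrates": 4.0}


def estimated_energy(p, f, c):
    """Approximate energy (kCal/100 g) from p, f, c in g per 100 g."""
    return (KCAL_PER_GRAM["protein"] * p
            + KCAL_PER_GRAM["fat"] * f
            + KCAL_PER_GRAM["carbohydrates"] * c)


def energy_bounds(c):
    """Lower and upper energy bounds (kCal/100 g) for a carbohydrate value c."""
    lower = KCAL_PER_GRAM["carbohydrates"] * c            # rest has no p or f
    upper = lower + KCAL_PER_GRAM["fat"] * (100.0 - c)    # rest is all fat
    return lower, upper


def energy_flag(energy, c, tol=0.0):
    """Return "below", "above" or None for a labelled energy value."""
    lower, upper = energy_bounds(c)
    if energy < lower - tol:
        return "below"
    if energy > upper + tol:
        return "above"
    return None
```

With these bounds, a polyol-sweetened product labelled with 70 g of carbohydrates and 170 kCal per 100 g is flagged as "below", which is the situation described for candies and chewing gums rather than a transcription error.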

USAGE NOTES
In order to facilitate access to the database, we built an openly accessible API. Any end user, including third-party apps or services, can send API requests to retrieve specific data. The API pipeline is illustrated in Figure 1B. User requests are handled on an application server, where an Elastic Search (ES) application handles the queries on another cloud computing service, based in Ireland. The ES response is then returned to the user after JSON formatting and compression (on demand). We checked that handling the request between the two servers does not critically compromise the total user-response time. We ran a series of single-page API calls, every 6 h, over a week, in order to measure the full response time and the application server response time. We observed that the latter was consistently fast across all experiments (in the range of 20-50 ms) and that the bottleneck was rather the transmission between the end user and the application server (the average full response time was about 250 ms; see Figure 1C). For a quick introduction to the API endpoints, users are welcome to try them out on the API Playground page (46). Furthermore, on the project's GitHub repository, one can also find use cases (47) in Python, Ruby, Curl and JavaScript, as well as examples of complex queries which include fuzzy searches (48). When fetching a large amount of data, we suggest requesting compressed data and using the possibility to include/exclude specific fields of each product [for details, see the API documentation (46)]. In this way, one could reduce the response payload size by up to a factor of 10.
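The following Python sketch shows how a client might request compressed data with some fields excluded, using only the standard library. Everything specific to the API in this sketch is an assumption made for illustration and should be checked against the API documentation (46): the base path `/api/v3`, the `products` endpoint, the `barcodes` and `excludes` query parameters, and the token-based `Authorization` header. Compression is requested here through the standard HTTP `Accept-Encoding: gzip` header.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://www.foodrepo.org/api/v3"  # assumed base path


def build_products_request(api_key, barcodes, excludes=(), compressed=True):
    """Build (but do not send) a request for products by barcode.

    Parameter names and the Authorization scheme are assumptions.
    """
    params = {"barcodes": ",".join(barcodes)}
    if excludes:
        # Dropping unneeded fields keeps the payload small.
        params["excludes"] = ",".join(excludes)
    headers = {
        "Accept": "application/json",
        "Authorization": f"Token token={api_key}",
    }
    if compressed:
        headers["Accept-Encoding"] = "gzip"
    return Request(f"{API_BASE}/products?{urlencode(params)}", headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body, un-gzipping if needed."""
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)
```

A call such as `fetch_json(build_products_request(key, ["0000000000000"], excludes=["images"]))` would then return the decoded response, where the barcode is a placeholder and `key` is a personal API key.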
We remind readers that all contents (other than computer software) made available by FoodRepo on its websites, apps or services are licensed under the Creative Commons Attribution 4.0 International License. We would, however, like to highlight that product images may contain copyrighted material such as brand logos.

ACKNOWLEDGMENTS
We are grateful to Migros, Coop, and Lidl for access to their retail shops.

NOMENCLATURE
• API: Application Programming Interface, a set of tools and methods that allow two pieces of software to communicate. The FoodRepo API allows other applications to get and use the data.
• CC-BY-4: Creative Commons public license, with the "Attribution" term. It implies that anyone is free to share and transform the content of FoodRepo, even for commercial purposes, with the obligation to properly give credit to FoodRepo, and to indicate any modification without claiming direct endorsement from FoodRepo. For a detailed description, see the license text at https://creativecommons.org/licenses/by/4.0/
• OCR: Optical Character Recognition, tools that allow for automatic conversion of text contained in images to machine-readable formats.