TECHNOLOGY AND CODE article
BioMaster: An Integrated Database and Analytic Platform to Provide Comprehensive Information About BioBrick Parts
- 1School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- 2Centre for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
Synthetic biology seeks to create new biological parts, devices, and systems, and to reconfigure existing natural biological systems for custom-designed purposes. The standardized BioBrick parts are the foundation of synthetic biology. The incomplete and flawed metadata of BioBrick parts, however, are a major obstacle for designing genetic circuit easily, quickly, and accurately. Here, a database termed BioMaster http://www.biomaster-uestc.cn was developed to extensively complement information about BioBrick parts, which includes 47,934 items of BioBrick parts from the international Genetically Engineered Machine (iGEM) Registry with more comprehensive information integrated from 10 databases, providing corresponding information about functions, activities, interactions, and related literature. Moreover, BioMaster is also a user-friendly platform for retrieval and analyses of relevant information on BioBrick parts.
As a branch of the emerging biological sciences in the 21st century, synthetic biology is dedicated to design and construct novel biological parts, devices, and systems for specific purposes, and to make the process of engineering biology easier (Heinemann and Panke, 2006; Bartley et al., 2017). One of the important directions for synthetic biology is to build standard systems using standard devices which are comprised of standard parts. Therefore, to standardize biological parts and devices, Tom Knight at MIT introduced the concept of BioBrick standard in 2003 (Smolke, 2009).
BioBrick parts are biological parts that conform to a restriction-enzyme-based assembly standard, containing defined prefix and suffix sequences which can be recognized by specific restriction endonucleases (Canton et al., 2008; Shetty et al., 2008). Different assembly methods are constantly being reported, such as 3A assembly (Shetty et al., 2008), Gibson assembly (Gibson et al., 2009), Golden Gate assembly (Engler et al., 2008), MIDAS (van Dolleweerd et al., 2018), and CasHRA (Zhou et al., 2016). The standardization of BioBrick parts makes the compatibility and assembly of different biological parts more efficient, and the development of tools for biological computer-aided design (CAD) possible. Since the first successfully synthetic toggle switch and repressilator (Gardner et al., 2000), several related CAD tools, aiming at making the design easier and high throughput, have been developed, such as GenoCAD (Cai et al., 2010), TinkerCell (Chandran et al., 2009), Gene Designer (Villalobos et al., 2006), and Cello (Nielsen et al., 2016; Chen et al., 2020). GenoCAD is a computer tool that helps users assemble genetic circuits from a rich library of parts, compliantly with any of six implemented BioBrick standards. TinkerCell is a visual modeling tool that supports a hierarchy of biological parts. Gene Designer is a software for fast and easy design of synthetic DNA segments and uses advanced algorithms for codon optimization. And Cello provides a design environment for genetic circuit design automation. Thus, engineering biology, both manually and with CAD platforms, requires large amounts of BioBrick data of high quality (Nora et al., 2019). However, these powerful design processes are not supported by adequate information of standardized biological parts (Decoene et al., 2018).
So far, several data registries and repositories have been established to collect biological parts. For example, the Virtual Parts Repository consists of about 3000 biological parts (Cooling et al., 2010; Misirli et al., 2014), PAMDB contains 118 circuits and 165 parts (Huynh and Tagkopoulos, 2016), JBEI-ICE includes about 2302 items (Ham et al., 2012), SEVA-DB has a number of 185 vectors (Martinez-Garcia et al., 2015), while the international Genetically Engineered Machine (iGEM) Registry of standard biological parts (Smolke, 2009; Vilanova and Porcar, 2014) is, currently, the largest BioBrick database, containing over 40,000 items, cataloged as promoters, ribosomal binding sites (RBS), coding sequences, terminators, and so on. However, the iGEM Registry lacks joint analysis and retrieval with other databases, which are numerous and contain various biochemical, genetic, and molecular biological information.
Therefore, in this study, to complement and improve the metadata of iGEM BioBrick parts, we integrated 11 traditional biological databases (Supplementary Table S1), including iGEM Registry, UniProt (Bateman et al., 2017), QuickGO (Binns et al., 2009), KEGG (Kanehisa et al., 2017), BioGRID (Oughtred et al., 2019), BRENDA (Jeske et al., 2019), ExplorEnz (McDonald et al., 2009), STRING (Szklarczyk et al., 2017), PubMed (Sayers et al., 2020), EPD (Perier et al., 2000), and PromEC (Hershberg et al., 2001). These databases supplement the information of functions, sites, interactions, feature keys, and references. The integrated database, termed BioMaster, also provides a more user-friendly platform for searching, browsing, and analyzing BioBrick information.
Materials and Methods
In this study, the UniProt database was taken as a hub for connecting the iGEM Registry with other databases. We first downloaded the entire BioBrick database from the iGEM Registry1. The database contains the parts table, the sequence, and features of BioBrick parts (O’Donovan et al., 2002). Then BLAST was used to find matched sequence in the UniProt database. The other nine databases were then integrated through UniProt unique identifier. Finally, we developed a user-friendly database, termed BioMaster, for easy searching, browsing, and analyzing all these BioBrick information (Figure 1).
Data Sorting of the Sequence Alignment Results
There was no information associated with other databases in the iGEM Registry. To get the corresponding information, UniProt was used as a hub for connecting the iGEM Registry with other nine databases. The connection between the iGEM Registry and UniProt was established by sequence alignment on featured (functional) sequences in iGEM Registry or full BioBrick sequences in case of no available feature information. The sequence alignment was performed by BLAST (Altschul et al., 1997). Each alignment result of BLAST includes three main parameters: Score, expect value (E-value), and Identity. The Score is the sum of similarity of each pair between two sequence, so it is related to the length of the sequence. The E-value is the expected number of high score pairs and decreases exponentially as the score increases. The Identity is the proportion of which two sequences have the same residues at the same position after alignment. The results of BLAST are always displayed with a descending Score. In most cases, however, the order is different if a different standard is used for sorting. Therefore, to comprehensively consider all three parameters, we used their weighted average (p), calculated as the following:
where w1, w2, and w3 are their weights, respectively. As the Identity is ranged from 0 to 1, the Score and E-value were adjusted to this range. The Score was divided by the maximum in the substitution matrix (Smax) (Altschul, 1991) and the length of the query sequence. After taking the negative logarithm of 10 of E-value, the value falls into a range of 1–200. To adjust it to 0–1, we calculated the logarithm of 6 again and divide it by 3 as shown in the above equation.
To determine the value of w1, w2, and w3, we tested from 0.1 to 0.8 with an interval of 0.1, respectively. About 200 items of BLAST results, manually selected, formed a training set, to obtain optimal combination. We found that the screening accuracy was highest when w1 = 0.1, w2 = 0.5, and w3 = 0.4.
To validate the data sorting model, we labeled about 8500 BLAST results, which were classified by our model and support vector machine (SVM). The threefold validation accuracy of SVM is around 0.93, lower than the accuracy of our model (about 0.98). The area under ROC curve (AUC) value of our model is 0.99, which is also better than the AUC value of SVM (0.94) (Supplementary Figure S1A). Meanwhile, the average of Score, E-value, and Identity of the positive items also demonstrates that the performance of our model is better than SVM (Supplementary Figure S1B).
Results and Web Interface
To complete the information of iGEM BioBrick parts, we developed a database termed BioMaster, which provides a user-friendly query interface and a set of tools for a comprehensive overview of BioBrick parts.
The Amount of Data
BioMaster has all the data from the iGEM Registry. In addition, it also integrates corresponding information from 10 databases (UniProt, STRING, BioGRID, KEGG, QuickGO, PubMed, BRENDA, ExplorEnz, EPD, and PromEC) to complement the iGEM BioBrick datasheet. The iGEM Registry contains 47,934 entries of BioBrick information, among which 44,565 entries have feature information. Through sequence alignment with UniProt database and data screening, we found corresponding 3747 items in UniProt. Via their UniProt IDs, information from other nine databases was integrated: 100,000 items of interaction information from STRING and BioGRID, 94,260 pieces of annotations from QuickGO, 6203 pieces of gene information from KEGG, 2694 pieces of homology information from KO, information and annotations of 1336 enzymes from KEGG ENZEMY, BRENDA, and ExplorEnz, some promoters information from EPD and PromEC, and more than 1 million associated references from PubMed. According to the iGEM Registry, BioBrick parts are divided into 25 categories (Supplementary Table S2), of which the eight main categories are promoters, RBS, coding sequences, translational units, terminators, DNA, plasmids, and composite parts.
BioMaster supports a total of four search methods: ID search, Keyword search, Team Wiki search, and sequence alignment by BLAST (Figure 2B). The ID search automatically recognizes IGEM_ID, EPD_ID, UniProt_ID, and Gene_Name. Keywords search returns parts in the iGEM Registry related to the keywords. Team Wiki can search for the team names and keywords related to the iGEM teams. Searching for an amino acid sequence or nucleotide sequence is also practicable via BLAST against BioMaster database.
Figure 2. Snapshots of BioMaster. (A) The data in iGEM Registry, taking the BioBrick BBa_K1745002 as an example. (B) The searching page of BioMaster. (C) Search result page of BBa_K1745002. (D,E) Information of BBa_K1745002 in BioMaster database. For detailed contents, please check on the website.
The search results include a total of no more than 1000 related items sorted into different groups (Figure 2C), such as terminators, DNA, RNA, and reporters. The detailed information can be accessed by selecting any item of results. It shows the information obtained from the iGEM Registry (Figure 2D) as well as other databases related to the part in detail (Figure 2E). In order to display the interaction between different parts or parts and proteins clearly, cytoscape.js was implemented to show interaction diagrams. The information of the corresponding BioBrick parts will be shown by clicking on the nodes and links. In addition, synthetic biology open language (SBOL) is used to visualize the feature information and display it as a graph. The search results can be freely downloaded and is open for user’s comments. Detailed usage tutorials are available on our website in the form of videos8.
Taking the part BBa_K1745002 for example, Figure 2A shows all the information for this part in the iGEM Registry, from which we found that this part is an enzyme, aralkylamine N-acetyltransferase (AANAT), but there is almost no information other than the sequence and features. In our database, BBa_K1745002 was matched to the UniProt_ID Q16613 correctly (Figure 2E), which provides the information about its function, catalytic activity, structure, related reference, and so on. Via the UniProt_ID, massive information, such as related pathways, optimal experimental conditions, KM values, IC50 values, inhibitors, and interaction networks, was integrated from KEGG, Brenda, String, and BioGrid. Therefore, this part is better characterized in our database.
Tools for Analyses and Visualization of BioBrick Parts
Several tools for analyses and visualization of BioBrick parts were integrated in BioMaster. We integrated a JAVA-based visual aiding design software, SBOL designer, into BioMaster to provide visualization and designing of biological components (Figure 3A) (Roehner et al., 2016; Madsen et al., 2019). On the other hand, concerning the limit amount of BioBrick entries, BioMaster provides two prediction tools to conduct bioinformatic exploration on unknown sequences: the promoter prediction (Umarov and Solovyev, 2017) and the enzyme commission (EC) number prediction (Dalkiran et al., 2018). The promoter prediction implemented uses convolutional neural network to recognize both prokaryotic and eukaryotic promoters (Figure 3B). The EC prediction was integrated into BioMaster to predict the enzymatic function for uncharacterized protein sequence. An EC number is given as the prediction result, and more information can be obtained in BRENDA via the EC number (Figure 3C).
Figure 3. Tools for BioBrick analyses and visualization. (A) A SBOL-based visualization tool for genetic circuit design. (B) Promoter prediction. (C) EC number prediction.
The design of genetic circuit requires reliable, comprehensive, and complete information of BioBrick parts. To complete the BioBrick metadata of the iGEM Registry, we developed a database called BioMaster. Various information from 10 popular databases was integrated to the iGEM parts with diverse and detailed information. Meanwhile, BioMaster provides convenient retrieval, user-friendly interface, and potential predictions. In the future, we will integrate more data in breadth and depth from other synthetic biological databases and publications to connect the massive amounts of data pieces (Nora et al., 2019), and further improve the analysis tools on BioMaster. It is hoped that BioMaster could provide a platform to connect and standardize scattered biological part data, and to make designing of new biological systems easier both automatically and manually.
Data Availability Statement
Publicly available datasets were analyzed in this study. These data can be found here: http://www.biomaster-uestc.cn.
HY, JS, and CD collected the data, analyzed the data, developed the web, and wrote the manuscript. JH and F-BG designed the database and wrote the manuscript. BW designed the database, analyzed the data, and wrote the manuscript. All authors contributed to the article and approved the submitted version.
This work was supported by the National Natural Science Foundation of China (Nos. 31971176 and 31800616) and the Fundamental Research Funds for the Central Universities (No. A03018023601045).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Thanks to the contribution of iGEM team members of UESTC-software-2018 and UESTC-software-2019.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2021.593979/full#supplementary-material
- ^ http://parts.igem.org/Registry_API
- ^ http://www.biomaster-uestc.cn
- ^ https://laravel.com/
- ^ https://v2.bootcss.com/
- ^ https://cn.vuejs.org/
- ^ https://hub.docker.com/u/igem2019uestcsoftware
- ^ https://www.elastic.co/cn/
- ^ http://www.biomaster-uestc.cn/public/index.php/main/tutorial
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. doi: 10.1093/nar/25.17.3389
Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C., and Apweiler, R. (2009). QuickGO: a web-based tool for gene ontology searching. Bioinformatics 25, 3045–3046. doi: 10.1093/bioinformatics/btp536
Cooling, M. T., Rouilly, V., Misirli, G., Lawson, J., Yu, T., Hallinan, J., et al. (2010). Standard virtual biological parts: a repository of modular modeling components for synthetic biology. Bioinformatics 26, 925–931. doi: 10.1093/bioinformatics/btq063
Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay, R., Atalay, V., and Dogan, T. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinform. 19:334. doi: 10.1186/s12859-018-2368-y
Decoene, T., De Paepe, B., Maertens, J., Coussement, P., Peters, G., De Maeseneire, S. L., et al. (2018). Standardization in synthetic biology: an engineering discipline coming of age. Crit. Rev. Biotechnol. 38, 647–656. doi: 10.1080/07388551.2017.1380600
Gibson, D. G., Young, L., Chuang, R.-Y., Venter, J. C., Hutchison, C. A. III, and Smith, H. O. (2009). Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345. doi: 10.1038/nmeth.1318
Ham, T. S., Dmytriv, Z., Plahar, H., Chen, J., Hillson, N. J., and Keasling, J. D. (2012). Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools. Nucleic Acids Res. 40:e141. doi: 10.1093/nar/gks531
Hershberg, R., Bejerano, G., Santos-Zavaleta, A., and Margalit, H. (2001). PromEC: an updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites. Nucleic Acids Res. 29, 277–277. doi: 10.1093/nar/29.1.277
Martinez-Garcia, E., Aparicio, T., Goni-Moreno, A., Fraile, S., and de Lorenzo, V. (2015). SEVA 2.0: an update of the standard european vector Architecture for de-/re-construction of bacterial functionalities. Nucleic Acids Res. 43, D1183–D1189.
Nora, L. C., Westmann, C. A., Martins-Santana, L., Alves, L. D. F., Oliveira Monteiro, L. M., Guazzaroni, M.-E., et al. (2019). The art of vector engineering: towards the construction of next-generation genetic tools. Microb. Biotechnol. 12, 125–147. doi: 10.1111/1751-7915.13318
O’Donovan, C., Martin, M. J., Gattiker, A., Gasteiger, E., Bairoch, A., and Apweiler, R. (2002). High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinform. 3, 275–284. doi: 10.1093/bib/3.3.275
Roehner, N., Beal, J., Clancy, K., Bartley, B., Misirli, G., Grunberg, R., et al. (2016). Sharing structure and function in biological design with SBOL 2.0. ACS Synth. Biol. 5, 498–506. doi: 10.1021/acssynbio.5b00215
Sayers, E. W., Beck, J., Brister, J. R., Bolton, E. E., Canese, K., Comeau, D. C., et al. (2020). Database resources of the national center for biotechnology information. Nucleic Acids Res. 48, D9–D16.
Szklarczyk, D., Morris, J. H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., et al. (2017). The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368.
Umarov, R. K., and Solovyev, V. V. (2017). Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 12:e0171410. doi: 10.1371/journal.pone.0171410
van Dolleweerd, C. J., Kessans, S. A., Van de Bittner, K. C., Bustamante, L. Y., Bundela, R., Scott, B., et al. (2018). MIDAS: a modular DNA assembly system for synthetic biology. ACS Synth. Biol. 7, 1018–1029. doi: 10.1021/acssynbio.7b00363
Villalobos, A., Ness, J. E., Gustafsson, C., Minshull, J., and Govindarajan, S. (2006). Gene designer: a synthetic biology tool for constructing artificial DNA segments. BMC Bioinform. 7:285. doi: 10.1186/1471-2105-7-285
Keywords: synthetic biology, genetic circuit design, database integration, BioBrick parts, iGEM Registry
Citation: Wang B, Yang H, Sun J, Dou C, Huang J and Guo F-B (2021) BioMaster: An Integrated Database and Analytic Platform to Provide Comprehensive Information About BioBrick Parts. Front. Microbiol. 12:593979. doi: 10.3389/fmicb.2021.593979
Received: 31 August 2020; Accepted: 04 January 2021;
Published: 21 January 2021.
Edited by:Trevor Carlos Charles, University of Waterloo, Canada
Reviewed by:Yi Xiong, Shanghai Jiao Tong University, China
Rahmi Lale, Norwegian University of Science and Technology, Norway
Copyright © 2021 Wang, Yang, Sun, Dou, Huang and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Beibei Wang, firstname.lastname@example.org
†These authors have contributed equally to this work