Baseline Evaluation of Bioinformatics Capacity in Tanzania

Genome-wide association studies, internal transcribed spacer (ITS) data analysis, variant calling, genome annotation, RNASeq, proteomics and other tasks as specified.


Abstract
Background Even though genomics technologies have advanced considerably, Sub Saharan Africa countries have not fully reaped the benefits because they lack sufficient capacity to use these technologies. The lack of documentation on existing bioinformatics capacity in these countries hinders guidance on leveraging resources and the identification of areas for improvement. The main objective of this study was to map out the interest and capacity for conducting bioinformatics and related research in Tanzania. Our findings identify critical areas for skills and infrastructure development for bioinformatics research. The study is a cross-sectional, explorative, descriptive study among Tanzanian researchers in public and private academic and research institutions.

Results
Out of 84 respondents, 50 (59.5%) were male. More than half, 44 (52.4%), were aged 26-32 years. The largest group, 41 (48.8%), were master's degree holders with at least one publication related to bioinformatics. Eighty (95.2%) were willing to join the bioinformatics network and initiative in Tanzania. The major challenge, faced by 22 (26.2%) of respondents, was the lack of training and skills.
The most used resources for bioinformatics analyses were BLAST, PubMed and GenBank. The most frequently performed analyses were sequence alignment and phylogenetics, reported by 57 (67.9%) and 42 (50%) of respondents, respectively. The most frequently used statistical software packages were SPSS and R. A quarter of the respondents were conversant with computer programming.

Conclusion
Early career and young scientists were the largest group of responders engaged in bioinformatics research and activities across surveyed institutions in Tanzania. The use of bioinformatics tools for analysis is still low, including basic analysis tools such as BLAST, GenBank, sequence alignment software, Swiss-prot and TrEMBL. There is also poor access to resources and tools for bioinformatics analyses. As a way to address the skills and resources gaps, we recommend various modes of training and capacity building of relevant bioinformatics skills and provision of infrastructure so as to improve bioinformatics capacity in Tanzania.

Background
Sub Saharan Africa (SSA) countries face difficulties in access to quality health care despite the significant disease burden in these countries [1]. The hurdle extends to leveraging new technologies such as genomics and bioinformatics to resolve significant issues such as food insecurity, poverty and disease [2]. Even though the cost of using bioinformatics technologies for research has dropped, the lack of strong human capacity with the expertise to use these technologies to effectively analyze and interpret results is still a limiting factor for many African countries, Tanzania included [3,4].
Recently the field of genomics has become instrumental in medical research and provision of healthcare.
diagnosis, understanding prevention as well as treatment of several disease conditions [5][6][7]. To achieve this, the ability to generate data and perform bioinformatics analysis has become critical for biomedical scientists [8], particularly in the face of continued fall of the cost of data generation and analysis using the trending technologies. However, the lack of capacity in SSA countries to run such analysis may hinder effective benefits from the application of genomics in medicine and research [5].
Several initiatives have been established to address the gap. One of them is the Human Heredity and Health in Africa (H3Africa) Pan African Bioinformatics Network (H3ABioNet) [3]. This African initiative was established to facilitate the development of bioinformatics capacity in the continent and hence health genomics research [8,9]. H3Africa has been successful at mobilizing resources and at developing researchers' networks and capacity for various research resources, including biobanks, and for analysis of genomics data through H3ABioNet [3].
Through this network, both human and infrastructure bioinformatics needs have been addressed in Africa through training, the setting up of standardized bioinformatics analysis workflows, access to expertise in various domains and data harmonization [3]. There is a need to document existing human capacity for conducting bioinformatics related research in Tanzania. This will enable effective leveraging of existing resources and strategizing for further building of sustainable expertise in the country. To the best of our knowledge, there is no documentation on existing bioinformatics capacity in the country.
To address this challenge, this study aimed to evaluate the existing human expertise and capacity to use bioinformatics tools for research in public and private institutions. On the one hand, the documentation is expected to guide the leveraging of present resources and the identification of areas for improvement. On the other hand, it will also support the efforts of the H3Africa and H3ABioNet projects to build bioinformatics capacity in Africa. The study findings may help to make recommendations for improvement in bioinformatics training and research in Tanzania, a model that can be emulated in other SSA countries.
The survey began with an introduction to bioinformatics research and an explanation of the objectives of the study and the information expected from the participant. Participants were assured of the anonymity and privacy of their responses and that collected data would be used in an aggregated format. Information captured included respondents' demographics such as employment institution, age group, gender, level of seniority and area of research. Other questions related to the years of work experience, the number of publications in the area of bioinformatics and the highest level of education attained. The sections that followed investigated access to and knowledge about infrastructure and software tools for bioinformatics analysis. Questions about access to computing facilities and computer operating systems regularly used were also asked.
We evaluated the skill levels on selected Microsoft Office tools and selected statistical packages as well as the frequency of use of some basic bioinformatics resources such as PubMed, Swiss-prot & We also asked questions intended to investigate the knowledge and type of computer programming languages used by the participants as well as the computer database management systems preferred. We sought to identify challenges that respondents face in bioinformatics research. The broader problems were re-categorized into electric power and internet, mentorship and research network, computer infrastructure, training skills.
Finally, we interrogated the participants' willingness to join the bioinformatics network and initiative in Tanzania under the TGN.
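The re-categorization of reported challenges into broader groups, described above, can be pictured as a simple keyword match. The Python sketch below is an illustration only: the four category names come from the text, but the keyword lists and the function name `categorize` are hypothetical, since the actual coding rules used in the survey are not given.

```python
# Hypothetical keyword lists; the survey's actual coding rules are not stated.
CATEGORY_KEYWORDS = {
    "electric power and internet": ("power", "electricity", "internet", "bandwidth"),
    "mentorship and research network": ("mentor", "network", "collaborat"),
    "computer infrastructure": ("computer", "server", "hardware", "software"),
    "training skills": ("training", "skill", "course"),
}


def categorize(response):
    """Return every broad category whose keywords appear in a free-text response."""
    text = response.lower()
    matched = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Responses matching no keyword are flagged for manual review.
    return matched or ["uncategorized"]


if __name__ == "__main__":
    for answer in ("Lack of training in data analysis",
                   "Frequent power cuts and slow internet",
                   "No funding"):
        print(answer, "->", categorize(answer))
```

A response can fall into more than one category under this scheme, so category counts need not sum to the number of respondents.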
The survey responses were exported from REDCap into a comma-separated file for analysis. Analysis of the results was conducted using R [13] software integrated into RStudio version 1.2.5033. Descriptive statistics, including frequency tables, pie and bar plots, were used to summarize the responses.
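The descriptive summaries in this study take the form of counts with percentages, n (%). As a rough illustration of that step, the following Python sketch tabulates one categorical column of a comma-separated export. It is not the authors' R code; the column name `highest_education`, the function name `frequency_table` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def frequency_table(rows, column):
    """Return (category, count, percent) tuples, most frequent first."""
    # Skip empty answers so percentages are based on valid responses only.
    counts = Counter(row[column] for row in rows if row[column])
    total = sum(counts.values())
    return [
        (category, n, round(100 * n / total, 1))
        for category, n in counts.most_common()
    ]


# Hypothetical stand-in for the exported survey file.
SAMPLE_CSV = """record_id,highest_education
1,Master
2,Bachelor
3,Master
4,PhD
5,Master
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for category, n, pct in frequency_table(rows, "highest_education"):
    print(f"{category}: {n} ({pct}%)")
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle. The denominator here is the number of non-empty answers in the column, which is one possible convention among several.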

Results
Demographic characteristics of respondents
A total of 90 respondents from academic and non-academic institutions participated in the survey. The results were exported from REDCap to Microsoft Excel. Six respondents were removed because they acknowledged at the beginning of the survey that they did not know anything about bioinformatics, leaving 84 respondents for analysis. The majority of respondents were male, 50 (59.5%), and more than half, 44 (52.4%), were aged 26-32 years (Table 1). The highest education level attained by most respondents was a master's degree, 41 (48.8%), followed by a bachelor's degree, 25 (29.8%) (Table 1).
The number of publications related to bioinformatics was mostly in the range of 1-4, as reported by 24 (28.6%) of the respondents (Table 1). Altogether, only 28 (33.3%) of the surveyed respondents had at least one publication about bioinformatics. In comparison, 56 (66.7%) did not have any publications in bioinformatics.
The most common area of research or practice among the respondents was molecular biology, 18 (21.4%), followed by medicine, 15 (17.9%) (Table 1).

Access to infrastructures for bioinformatics analysis
Eighty-one (96.4%) of the respondents used their personal computers (laptops) for bioinformatics work. A small percentage (less than 10%) reported having access to institutional servers abroad or cloud computing (Table 1). Fifty-seven (67.9%) of these respondents run their computers on Windows 8. Only twelve (14.3%) of these respondents have the Linux operating system on their computer systems (Table 1).

Knowledge and use of computer programming language and database management systems
Only a quarter of respondents reported using a computer programming language, and 15 (17.9%) use a database management system. The most used programming language is Python, used by 8 (9.5%) of the respondents. The most widely used database management systems were Microsoft Access and MariaDB/MySQL, which were used by 14 (16.7%) and 6 (7.1%) of the respondents, respectively (Table 1). The majority of respondents, 50 (59.5%), were from research institutions (Table 2). The respondents were from a total of 33 institutions (Table 3).

Bioinformatics tools and resources usage
Of the surveyed bioinformatics tools and resources, the least used was QIAGEN CLC Main Workbench, which 57 (67.9%) of the respondents reported never having used. This was followed by the DNA Data Bank of Japan (DDBJ), which 52 (61.9%) had never used. The most used resources were BLAST, PubMed and GenBank (Figure 2).

Statistical software package use
Least use of statistical software packages was reported for WinBUGS by 78 (92.9%) of respondents, followed by MedCalc by 73 (86.9%) (Figure 4). The most frequently used software packages were SPSS and R, for which respondents reported expert, high and intermediate skills (Figure 4).

Discussion
To the best of our knowledge, this is the first study to assess the level of bioinformatics capacity in Tanzania. We found that the majority of the respondents were male (59.5%) and in the age group 26-32 years (52.4%). The mean work experience of the respondents was 6.2 years, indicating a young group of scientists. The most common highest education level was a master's degree (48.8%), followed by a bachelor's degree. When asked to rate their seniority on a scale of 0-100, the respondents rated themselves with a mean seniority of 39.1, further indicating that they perceive themselves as junior in the area of bioinformatics practice. Only 21.4% were PhD holders; this is a pool of scientists that can mentor their early-career counterparts. Interestingly, the most common current area of specialization was molecular biology (21.4%), while only a few (8.3%) described their research interest as entirely in genomics and bioinformatics, suggesting that molecular biology scientists are diversifying their careers into bioinformatics.
This survey showed that the infrastructure and the human capacity for conducting bioinformatics related research in Tanzania are underdeveloped. Specifically, 96.4% of the respondents perform bioinformatics analysis using personal computers/laptops, with only about 10% having access to advanced infrastructure such as high-performance computers, cloud computing and institutional servers. This severely limits the capacity to conduct bioinformatics related research, as it usually involves massive datasets and requires reliable, high computing capacity that personal computers alone cannot provide [14]. More than 67% of the respondents use the Windows operating system (OS), which does not support many genomics and bioinformatics analysis platforms, in contrast to only about 14.3% who use the Linux OS, which supports a broad range of bioinformatics analysis tools.
For most of the respondents, usage of standard bioinformatics analysis tools was also low; therefore, it comes as no surprise that 66.7% of respondents had no publication related to bioinformatics at all. These findings align with the review by Lyantagaye (2013), who noted that the level of bioinformatics research in Tanzania was still in its infancy, with a lack of investment and underdeveloped infrastructure. The review noted the presence of one modern laboratory at SUA, capable of generating molecular biology and genomics data, and the STM-1 SEACOM undersea fibre-optic cable that was expected to increase the internet speed and bandwidth [2]. The situation is not unique to Tanzania. Karikari (2015) noted a low level of bioinformatics capacity in terms of personnel and infrastructure in Ghana, with frequent electrical power failures, unreliable internet connections, and lack of high-speed computing power being some of the significant infrastructural challenges [15]. In Africa, three countries are responsible for a large fraction of the bioinformatics output from the continent: South Africa, Kenya, and Nigeria. The existence of H3ABioNet has, to a large extent, helped to reduce this disparity by empowering other countries in Africa to participate in and contribute to bioinformatics [8,16].
Bioinformatics draws on multiple disciplines, including mathematics, computer science, statistics and others. Statistics and programming are among the disciplines that play significant roles in building reproducible methods for biological discovery and validation, especially for complex, high-dimensional data as encountered in genomics. Therefore, assessing the knowledge and level of usage of statistics and programming among the respondents was essential. We found that only a quarter of respondents reported using a computer programming language and 17.9% use a database management system. The most used programming language is Python, used by 8 (9.5%) of the respondents, and the database management systems most used were Microsoft Access and MySQL. Both Python and MySQL find wide application in bioinformatics [17]. However, a large proportion of respondents lack programming skills. Short training may help to improve the skills of these researchers. It was also evident that the knowledge and usage of statistical packages are mostly based on IBM's SPSS package. On the one hand, many respondents are using the R statistical package. On the other hand, packages like WinBUGS and SAS are rarely used by bioinformatics researchers in Tanzania.
There are many bioinformatics tools and resources that respondents said they could access, with PubMed, which they use to retrieve scientific literature, being the most popular. The other frequently used resources are GenBank and some sequence alignment tools, showing good progress as users can access relevant and essential resources. The use of commercial products such as CLC Workbench (a QIAGEN platform for DNA, RNA and protein sequence data analysis) was limited, probably due to a shortage of funding. More than half of the respondents reported one or more problems that they face in relation to bioinformatics practice in Tanzania. The largest share of respondents (26.2%) reported a lack of training and skills as a significant problem. Only a few respondents (2.4%) reported inadequate electrical power supply and lack of internet access as challenges. The reduced cost of internet connectivity and improvement of bandwidth have helped other African nations improve their bioinformatics infrastructure and capacity [18]. Tanzania has equally benefited from the bandwidth improvement, and this may be the reason that few respondents cited internet connectivity as a challenge. Capacity building through training and infrastructural support for bioinformatics research remain the major challenges, as noted in other African countries [4,7,15,18].
Regarding the most commonly performed analyses, sequence alignment and phylogenetics were used by 67.9% and 50% of the respondents, respectively. Other methods of analysis, such as GWAS, were less commonly used. Both the most and the least frequent applications may need to be covered by training modules in long or short term training.
In our study, most of the respondents, 40 (47.6%), reported learning bioinformatics at bachelor's degree level, followed by 27 (32.1%) who learned during master's training and only 18 (21.4%) during PhD training. Conferences and workshops also serve as essential sources of bioinformatics skills for some respondents (28.6%), while a small percentage (15.5%) used online resources to learn bioinformatics skills. The latter may have benefitted from the opportunity provided by the H3ABioNet [19] in addition to other training opportunities such as those used in other countries [20][21][22].
It is possible that most of the surveyed Tanzanian bioinformatics researchers were either trained abroad or learned bioinformatics through postgraduate research projects. Today, no full bioinformatics or computational biology degree program exists in the country. There are bioinformatics courses that are part of undergraduate and postgraduate degree programs at the University of Dar es Salaam (UDSM) and  [18,24]. In the early days of bioinformatics, the discipline was not embedded as part of undergraduate curricula in South Africa. To address the gap, students registered for postgraduate degrees in bioinformatics in South African universities had to start with short formal bioinformatics training before embarking on their studies. Later, the National Bioinformatics Network (NBN) developed joint courses, compulsory for NBN funded students, that introduced them to a range of bioinformatics topics, programming and other technical skills [18]. In India, similar initiatives were undertaken by the Biotechnology Information System (BTIS) under the Department of Biotechnology (DBT), Government of India [22].
Equally in Tanzania, there is a need to develop relevant skills by extending undergraduate bioinformatics courses to other universities that offer biomedical, life and computer science courses in the country. Students will be exposed to the field early on, which may spark their interest. It will also prepare them with basic knowledge and skills for postgraduate research and education specializing in bioinformatics [25]. In addition, we advocate for the establishment of short programs for professionals who may not have the time to do a full-fledged degree. This can go hand in hand with existing programs and infrastructure but also in collaboration with other organizations in Tanzania, Africa and worldwide. EANBiT, for example, offers a residential training course on bioinformatics for East African students and early career researchers (http://eanbit.icipe.org/content/2018-trainees). Other successful training models were in Sudan [26].
In the era of digital technologies, bioinformatics capacity in Tanzania could greatly benefit from online learning, which has to be prioritized. It is less costly, often self-paced and accessible to many people at the same time. Online learning may be more suitable for professionals who cannot spend time in physical classes. Although a multitude of online learning platforms for bioinformatics exist, relevant organizations and institutions have a critical role in developing appropriate curricula and mobilizing resources to facilitate the learning process and ensure that online learning is effective. Curating the vast range of online courses and resources and providing guidelines to learners is also essential [27,28]. Before reaching full capacity in bioinformatics, Tanzania needs to work closely with existing bioinformatics networks to build its capacity through training. The H3ABioNet help desk can help African countries to quickly obtain the assistance needed to get going with bioinformatics tasks [29].
The Government has a pivotal role to play by supporting basic infrastructure for education and training as well as for research and application. The Government also plays a crucial role in promoting human capacity building in bioinformatics and computational biology by ensuring that graduates are recognized by the government scheme and get job opportunities. The collaborative approach will help to guarantee the sustainability of the initiatives, training, infrastructure and research activities. Tanzania can emulate examples from other countries where government funding has facilitated the growth of bioinformatics [18,27,28]. In South Africa, the leader of bioinformatics in Africa, the very early phases of bioinformatics at the South African National Bioinformatics Institute (SANBI) on the University of the Western Cape (UWC) campus were co-funded by the Government through South Africa's National Research Foundation (NRF) [18]. Tanzania and other African countries need to emulate the funding models of SANBI to improve bioinformatics skills and research in their institutions.
The respondents agreed to participate in the bioinformatics network and genomics initiative in Tanzania. The bioinformatics community needs to work with the Government to support a national forum that brings together bioinformaticians and genomics practitioners to discuss issues of common interest. Such a forum can build on existing platforms such as the TGN and the TSHG to facilitate joint meetings and promote the bioinformatics agenda. Similar national platforms have been shown to help build capacity in bioinformatics in South Africa, India and Australia [18,24,27].

Conclusion
In this study, we found that the majority of the respondents engaging in bioinformatics research in Tanzania were at the early stages of their careers. Although there is a high level of interest in the field of bioinformatics in Tanzania, skilled human resources and infrastructure pertinent to research in the field are limited. The use of bioinformatics tools for data analysis is still at a low level, even for basic analysis tools such as BLAST, GenBank, sequence alignment software, Swiss-prot and TrEMBL. This may be partly because most of the respondents also lacked access to basic tools and resources for bioinformatics research.
Investment in human capacity building through both undergraduate and postgraduate training, as well as encouraging and promoting digital learning, may help to improve the situation. Provision of infrastructure, mentorship and networking is needed to improve bioinformatics capacity in Tanzania. We recommend building strong collaborations among institutions in Tanzania to promote the effective utilization of shared resources and expertise. Moreover, regional and global network partners and stakeholders may be crucial in the development of infrastructure and research activities as well as ensuring sustainability. Support from the Government by setting the groundwork and funding basic teaching and research infrastructure is also essential to the growth and success of the field. The launch of a community of practice such as the TSHG of the TGN may be useful in continuing the Pan-African efforts to promote the use of bioinformatics for the betterment of humankind.

Availability of supporting data
The data of this study are available from the corresponding author on reasonable request.