Raw fastq data for hotspot regions of cancer-related 50 genes using fresh frozen breast carcinoma tissues obtained from IMERI-FMUI biobank collections

Master’s Programme in Biomedical Sciences, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia, Department of Anatomical Pathology, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia, Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia, Bioinformatics Core Facilities-IMERI, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia, Department of Biology, Institut Teknologi Sumatera, Lampung, Indonesia, Surgical Oncology Division, Department of Surgery, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia


Introduction
The creation of biobanks has steadily expanded research platforms and, over the years, created opportunities to learn more about how living systems function under both acute and chronic physiological and pathological conditions. Biobanking involves gathering, preserving, distributing, and using biological samples for prospective research studies. Most hospitals and biomedical research facilities in many countries, including Indonesia, participate in this activity, which is essential to the development of a successful, effective, and cutting-edge research system. In the new biology era, biobanks could create fascinating possibilities for understanding the physiological and pathological mechanisms behind human health by unravelling more intricate processes (Caenazzo and Tozzo, 2020).
Early population-based biobanks generally focused on finding genetic variations linked to disease without taking into account how the information might be relayed back to participants for their own health management. The genetic basis of disease susceptibility varies between ethnicities because many disease-causing variants are uncommon and population-specific, which has contributed to the global creation of biobanks (Wei et al., 2021). Cancer is one of the disorders linked to an accumulation of somatic mutations, structural variants, epigenetic alterations, and copy number changes, and it frequently arises from a genetic background where hereditary cancer is more prevalent. The application of genomic sequencing in clinical settings has been made possible by advances in sequencing technology and the development of computational tools, supporting the therapeutic relevance of genomics to cancer treatment (Rossing et al., 2020). We report raw FASTQ data for hotspot regions of 50 cancer-related genes from fresh frozen breast carcinoma tissues retrieved from the IMERI-FMUI Biobank collection. The data from this study will help clarify how breast cancer develops and help predict appropriate treatments based on the somatic gene alterations associated with it.

Sample collection and DNA purification
Sixteen fresh frozen breast cancer samples were collected from the IMERI-FMUI Biobank, the majority of which were invasive carcinoma (Supplementary Table S1). Purified DNA was extracted from the tissues using the QIAamp® DNA Mini Kit in accordance with the manufacturer's instructions (Qiagen Sciences). The DNA input for library preparation was 10 ng. The NanoDrop 2000 instrument (Thermo Fisher Scientific) was used to assess DNA purity at a 260/ The second amplification step was conducted to ensure sufficient quantity for sequencing on Illumina systems. This step used 7 cycles of 98°C for 15 s and 64°C for 1 min, followed by a hold at 10°C for up to 24 h in a thermal cycler. The second cleanup was performed twice to remove high-molecular-weight DNA and excess primers, using 25 μl and 60 μl of Agencourt AMPure XP beads (Beckman Coulter™, United States), respectively. The libraries were diluted to a final loading concentration of 7-9 pM and sequenced on the Illumina MiSeq platform.

Descriptive analysis
The sequencing run generated paired-end libraries (2 × 150 bp) in FASTQ format. The sequence data were submitted to the SRA under BioProject accession number PRJNA820526. FastQC software was used to evaluate the quality of each sample's paired-end raw reads (Andrews, 2010), and the q30 Python scripts were used to determine the total number of raw bases and the percentage of Q30 bases (Chen, 2016). Mosdepth software was used to calculate amplicon mean coverage depth, coverage uniformity, and on-target rate (Pedersen and Quinlan, 2018).
Frontiers in Genetics frontiersin.org
Illumina sequencing raw read data are stored as a text file in the FASTQ format. Each sequencing read occupies four lines of text: 1) an identifier, 2) the nucleotide sequence, 3) a "+" separator, and 4) the base quality scores, one per nucleotide. The identifier line includes useful data, such as the machine name, run ID, lane ID, and flow cell ID, that can be used to identify batch effects. The total number of reads sequenced, the GC content, and the overall base quality score are the metrics most frequently examined at the raw data level, and all are routinely calculated by typical raw data QC programs (Sheng et al., 2017). Table 1 provides the descriptive details of the raw data. It indicates that all samples had a Q30 score above 90%; the sequence quality from the "per base sequence quality" module of FastQC is presented in Figure 1. For example, in the "2016/083/mammae" sample, all bases in the reads had a Q score above 32 (error probability below 0.00063), indicating high-quality data produced by the Illumina instrument.
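The four-line FASTQ layout and the Q30 and Phred calculations described above can be illustrated with a minimal Python sketch. It is not the q30 scripts or FastQC used in this study. It assumes uncompressed input, Phred+33 quality encoding, and the colon-separated Illumina identifier layout (instrument, run, flow cell, lane, tile, x, y). The function names and the example read are hypothetical.

```python
import io
from typing import Dict, Iterable, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2
    qualities: str   # line 4, one Phred+33 character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of an uncompressed FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        qualities = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, qualities)


def parse_illumina_header(identifier: str) -> Dict[str, str]:
    """Extract the fields useful for spotting batch effects.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y before the
    first space; other header styles will raise ValueError.
    """
    fields = identifier.split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    return {
        "instrument": fields[0],
        "run": fields[1],
        "flowcell": fields[2],
        "lane": fields[3],
    }


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def q30_fraction(records: Iterable[FastqRecord],
                 threshold: int = 30, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is at least `threshold`."""
    total = passing = 0
    for record in records:
        for char in record.qualities:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Hypothetical single-read example (not taken from the deposited data).
EXAMPLE = (
    "@M00001:12:000000000-ABCDE:1:1101:1000:2000 1:N:0:1\n"
    "ACGT\n"
    "+\n"
    "II5?\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
print(parse_illumina_header(records[0].identifier))
print(q30_fraction(records))
print(phred_to_error(32))
```

For gzip-compressed files, the same reader could be passed a handle opened with `gzip.open(path, "rt")` from the standard library.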

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm.nih.gov/, PRJNA820526.

Ethics statement
The studies involving human participants were reviewed and approved by the Faculty of Medicine Universitas Indonesia Ethical Committee (approval number: 867/UN2.F1/ETIK/PPM.00.02/2020). The patients/participants provided their written informed consent to participate in this study.