Whole Genome Sequencing of Date Palm ( Phoenix dactylifera L.) Cultivars Using NGS.

.


INTRODUCTION
Date palm (Phoenix dactylifera L.) is one of the most greatest economically important plants which possess a lot of useful products (Ahmed et al., 1995). It plays an important role in the desert ecological system. It acts as a shelter for ground crops, and also, plays a great role in land reclamation and desertification control (Gotch et al., 2006).Date palm (2n = 36) is dioecious, from the family Arecaceae (Palmae) monocotyledonous, perennial fruit tree (Barrow, 1998). It is widely cultivated in the arid regions of the Arabian Peninsula, the Middle East and North Africa (Chao and Krueger, 2007).
In Egypt, date palm is considered as very important crop. There is a great national interest in increasing date palm production area to cover the local market and for exportation. In 2014, Egypt has reached a maximum production according to the last FAO statistics (Bekheet and El-Sharabasy, 2015). Although there are about 52 identified cultivars, there is only 20 date palm cultivars are commercially distributed in the Egyptian market. However, the precise number of Egyptian date palm cultivars is argued. This may be due to the difference in identification criteria and the dependence on local experience (Rizk et al., 2004;Bekheet and El-Sharabasy, 2015). In 2008Sharma et al. found that DNA sequencing is a clear-cut approach for identifying genomic variations. Now, there are improvement techniques that led to the enhancement of molecular marker systems which contributed to increasing the sensitivity and resolution of these techniques in genetic diversity studies and cultivar identification purposes (Agarwal et al., 2008). Earlier studies usually paid more attention only to markers in crucial genomic regions to follow important traits. However, with the advent of NGS, markers are used to review as many loci as possible throughout the entire genome and with nucleotide-level precision (Varshney et al., 2014). The development of DNA markers using nextgeneration sequencing (NGS) technology became the most effective way in improving date palm valuable resources (Cullis, 2011;Faqir et al., 2017). In addition, a major step in understanding the genetic basis of an organism is the complete sequences studies (Sterky and Lundeberg, 2000). Finally, we found that, NGS application is very important in many fields related to agriculture in order to find targets for genetic manipulation, evolution studies, exploring the bionetworks, and understanding the fundamental principles of functional genomics (Ashraf et al., 2022) Only, in the last decade deeply studies of the genome sequence of date palm in a much more comprehensive view. Different research teams tried to discover date palm genome sequencing. In 2009, Qatar,s team from Weil Cornell Medical College sequenced the entire date palm genome using an illumina sequencer. They announced the first draft genome map in 2009 for a Khalas variety female date palm. They estimated genome size of 658Mb, assembled 58% of the genome (382Mb) and predicted 25,059 gene models. They sequenced eight other cultivars, including females of the Medjool and Deglet Noor varieties and their backcrossed males. Finally, they postulated a sex-linked region of the date palm genome (Al-Dous et al., 2011).
King Abdulaziz City for Science and Technology (KACST) in collaboration with the Beijing Institute of Genomics, Chinese Academy of Science (BIG/CAS) estabilished a date palm genome project by using both Roche 454 and ABI SOLiD. They presented a more complete genome assembly for the cultivar Khalas and compared it with three other cultivars (Agwa, Sukry and Fahal) to determine sequence variations among them. They estimated a genome size of 605.4 Mb, covered approximately 90% of the genome (671Mb) and predicted 41,660 gene models. Also, they applied transcriptomic studies on genes related to fruit development and abiotic resistance. (Zhang et al., 2011;Al-Mssallem et al., 2013). In A whole genome re-sequencing study took place at the Center for Genomics and Systems Biology, New York University, Abu Dhabi using an illumina sequencer in 2015. They wanted to assume the origin of date palms and determine the genomic diversity in 62 different date palm cultivars collected from various areas,Finally, they discovered genes controlling traits of interest. In this study, four Egyptian cultivars were presented Zaidi, Samany, Zagloul and Hayani (Hazzouri et al., 2015).
Since the diversity within a species can't be studied by the sequencing of a single reference genome, re-sequencing of different cultivars plays a key role in any study (Bolger et al., 2014).
In the present study, whole genome sequencing for date palm cultivars was carried out using NGS. A comparative genomic investigation for Egyptian date palm genome variations among four female cultivars which include, Amhat, Sewi, Zagloul and Hayani and develop novel DNA markers (SNPs/indels) for date palms using next-generation sequencing.

MATERIALS AND METHODS Plant Material:
Four female date palm (Phoenix dactylifera L.) cultivars were shown in Table  (1).

DNA Extraction and Quantification:
Healthy green young leaves were chosen from mature palm trees, thoroughly rinsed with tap water, washed with 70% ethyl alcohol and dried with clean tissues. Then the leaves were cut into small pieces to be preserved at -80 O C or directly applied to the DNA isolation protocol. Genomic DNA was extracted for the four cultivars using DNeasy Plant Mini Kit (Qiagen Inc., cat. no. 69104) and quantified using a Qubit 2.0 fluorometer (supplied by Invitrogen).
The whole genome sequencing workflow includes 4 steps: (1) Fragment library preparation. Throughout library preparation, forward and reverse adaptors are attached to the ends of sheared DNA fragments. The four-date palm cultivars were multiplexed as each one is characterized by a specific barcode. (2) Templated bead preparation (Emulsion, amplification and enrichment). In the SOLiD system, the DNA library to be sequenced needs amplification on P1 beads during emulsion PCR (ePCR), then enrichment of the templated beads takes place. (3) 5500xl SOLiD sequencing. The beads are then deposited and attached to a sequencing flow chip. The flow chip was mounted on the 5500xl genetic analyzer. The clonally amplified DNA fragments linked to beads were sequenced parallelly using sequential ligation of dye-labeled, two-base encoded, oligonucleotide probes. The four libraries were sequenced utilizing the FWD1 sequencing chemistry. (4) Data analysis Bioinformatics tools were used to analyze the raw data of the four date palm cultivars from SOLiD 5500xl to 1) to calculate the genome coverage. 2) align the sequenced genome to a reference one. 3) compare the variations found among cultivars through indels and SNPs polymorphism.
After run completion, the system exported the data in the eXtensible SeQuence (XSQ) file format. There is one XSQ file for each lane, this file was then converted to a color space fasta (csfasta) and a quality file that contains the Phred quality scores (.qual) using NGS plumbing tool. The theoretical redundancy of coverage (c) was calculated for each of the four cultivars separately as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length (Lander and Waterman, 1988). The reads passed the quality control of SOLiD were aligned by Bfast tool (Version bfast-0.7.0a) using the reference genome of cv. Khalas (GCF_000413155.1_DPV01) with a total length of 556.481 Mb and 80,317 scaffolds. The reference genome was converted to BRG format by the fasta2brg command then an index of the reference genome was created using the index of Bfast. The match command of the Bfast was used for seeding alignment locations, local align was used for local alignments of the reads and finally post process command was used for filtering the alignments.The Samtools was used to convert the same files from the Bfast to bam files using the view command. The bam files were merged using the Samtools merge command. The bam file was sorted by the sort command and an index of the aligned reads was created by the index command. Finally, an index of the reference genome was created by faidx command. The picardtools were used for marking and removing duplicate reads. The insertion/deletion (indel) and SNP calling were conducted by SNVer-0.4.1 using the SNVerIndividual command which creates files with vcf format for both indels and SNPs. SNPs and indels were visualized using Integrative Genomics Viewer (IGV) program. Dendrograms were drawn using the SNP variations using the program MegaX (Kumar et al., 2018).

RESULTS AND DISCUSSION Whole Genome Sequencing and Sequence-
Based Markers: 1. Library Preparation Results: 1.a Electrophoresis of the Sheared DNA Library.
Sheared genomic DNA was run on a 2% agarose gel to ensure the success of the shearing process. Figure (1) shows that the four date palm cultivars were successfully sheared as the majority of the sheared DNA produced lies between sizes of 100-250 bp.

b Quantization of the Size-Selected DNA:
After the size selection of the sheared, end-polished DNA fragments by Agencourt AMPure XP reagent beads, the quantity of the size selected DNA yield was measured by the Qubit 2.0 fluorometer using the Qubit dsDNA HS (High Sensitivity) assay kit ( Table 2). The average yield of size-selected DNA is about 12% of the input quantity among the four libraries.

1.c Quantization of the Ligated DNA:
The size-selected DNA was subjected to dA tailing, adaptor ligation and purification steps, then the quantity of the ligated DNA was measured by the Qubit 2.0 fluorometer using the Qubit® dsDNA HS (High Sensitivity) assay kit ( Table 2). The quantity of the ligated DNA after purification for the four libraries was good enough to proceed into further steps.

1.d Quantization of the Amplified Libraries:
Using Qubit After libraries PCR amplification and purification, the quantity of the amplified DNA was measured by the Qubit 2.0 fluorometer using the Qubit dsDNA HS (High Sensitivity) assay kit (Table 3).

e Checking the Size Distribution of The Libraries:
Using gel electrophoresis, the amplified libraries were run on 2% agarose gel to ensure size distribution. Figure (2) shows the 4 fragment libraries after amplification with an average size of 250 bp.
The aamplified DNA libraries concentration of date palm cultivars according to quantitative real-time PCR readings were shown in Table 5.

Using QIAxcel Advanced:
The four amplified libraries were run on the multi-capillary QIAxcel Advanced machine to determine size distribution and concentration. Figure

Templated Beads Quantitation Using Nanodrop:
Three readings for optical densities were recorded at A600 nm for the P2 enriched beads sample using the nanodrop spectrophotometer. Then the concentration of the beads was calculated according to a previously drawn standard curve. A total of 4.5 million beads were obtained after the enrichment process representing 16% of the input beads (Table 7). After considering dilutions, it was found that the sample has 11.5 million beads in each µL. This was deposited on 6 lanes of the 5500 SOLiD flow chip. Thus, 26 µL of the enriched beads needed to be deposited in each lane.

Monitoring the Run:
The sequencing run was monitored to assess the quantity and quality of beads deposited in each lane. The run was monitored for each separate lane in realtime. Run data is shown for one lane for demonstration. Primarily, the 5500xl genetic analyzer records a focal map for the beads on the flow chip, this is done only at the start of the run to establish the position of each covalently attached bead (X, Y).

Bead Assessment:
After the focal map and before the first sequencing ligation, the analyzer generates a bead assessment report for each lane. Figure (5) shows the bead assessment results for one lane as an example, in which P2 records 142.25 million beads. P2 value gives an indication of the total number of enriched beads deposited in a lane. An optimum P2 value spans around 177M beads (every lane has 708 imaged panels and the bead density target is 250,000 beads/panel). One panel is one image area in a lane. N2S (noise to signal) number gives an indication of the sample's clonality which is recorded here as 17.8%. A low number means that the beads are likely monoclonal (having a single template on each bead), and a higher number reflects the beads' polyclonality (beads with multiple templates, so multiple colors). An optimum N2S value shouldn't exceed 15%. The example results also showed 53% as the percentage of beads on the axis. The percent on the axis shows the frequency with which enriched templated beads lie within 10 degrees of each color channel. Spectrally purer beads (closer to the axis) indicate more monoclonality. A higher number of beads on the axis is better. The Median P2/panel value indicates how many beads on average per panel were deposited and detected when the probe annealing to the P2 tag was interrogated. Results showed 201K in this example lane. 250 K beads/panel is the optimum bead density target. The color balance or satay plot (shown in figure 5-b) is an indicator of the spectral purity and signal intensity of the beads.
Each quadrant represents a nucleotide and the color noise associated with it. The figure shows dense colors on axes referring to high spectral purity.
A satay report is shown for each of the five sequencing primers used during the 15 ligation cycles. Figure (6) demonstrates the difference in bead distribution around axes in the satay plots among the 5 sequencing primers in the first and last ligation cycles in one lane as an example. Figure (7) shows the fraction of Good/Best beads with an average of 50%. Best beads are those beads present on the axis in the satay plot (monoclonal ones), while good beads are beads that lie within 10 degrees of each color channel. This value decreases every ligation cycle because a number of beads are blocked every cycle to avoid dephasing.

Bioinformatic Results: 4.1 Genome Coverage Analysis:
The average sequencing coverage slightly varied across the four date palm cultivars (Table 8). The redundancy or depth of coverage is the average number of times that a base in the reference is covered by randomly distributed raw sequencing reads across a genome (Sims et al., 2014). 'Zagloul' had the highest depth at 25x and 'Hyani' recorded the lowest (18x) while 'Amhat' and 'Sewi' reported 22.6x and 21x respectively. Higher

Single Nucleotide Polymorphisms and Insertion/Deletion Mutations:
Some genomic variations from the genome sequence were documented to understand the genetic diversity within the date palms. SNPs and indels were obtained for each cultivar compared to the genome of the Saudi cultivar 'Khalas' as a reference. The Integrative Genomics Viewer (IGV) program was used to illustrate the position of SNPs and indels in the sequencing reads relative to the reference genome. Full data analysis obtained with the aligned sequence results for the four date palm cultivars will be uploaded to the NCBI database. Figure  (   Parts of the SNP and indel output data sheets for cultivar 'Hayani' as an example are shown in Tables (9 and 10) for SNPs, insertions and deletions respectively as the whole data were represented in over 1000 pages. Thus, various numbers for SNPs and indels were estimated from raw data (Table 11). However, variations at a specific level of coverage are only accepted. An SNP was called when it has at least 20x and not more than 70x depth of coverage and indel depth ranged from 10x to 70x. Unusually high coverage sequences (produced from the incorrect mapping of reads from duplicated regions) and low coverage (due to lowquality sequences) are avoided in calling SNPs and indels.
Finally, after filtration 'Zagloul' recorded the highest number of SNPs (1,101,303), insertions (63,648) and deletions (75,709) while 'Amhat' had the lowest SNP (10,427), insertion (7257) and deletion (8782) values among the four cultivars (Table 11 and figure 10). The produced SNPs and indels could be used in genotyping assays to clearly identify date palm cultivars. In this respect, 1,748,109 SNPs were discovered in the Saudi cultivar 'Khalas' indicating a heterozygosity rate of 0.46% (Al-Dous et al., 2011). A number of 7,176,238 SNPs across 62 date palm cultivars were reported by Hazzouri et al. (2015), this elevated value may be because of the increased numbers of tested cultivars or higher coverage estimates.   By examining indel loci, it was found that the numbers of indels greatly varied across four cultivars as well as in relation to their length (Tables 11,12 and Fig. 10). The more the length of the indel is, the fewer indels are called among the four cultivars. Insertions started from a single base to a fragment of 46 bases in length, while deletions ranged from one to 23 bases. These variations may have great effects on each cultivar's identity. This is because insertions and deletions are directly related to significant genomic variations that may reach 10-20% among plant cultivars (Britten et al., 2003;Morgante et al., 2005;Ding et al., 2007).

Comparative Genomic Analyses Using Single Nucleotide Polymorphisms:
Common SNPs among the four date palm cultivars were collected in a comparative way to show genomic variation at a single nucleotide precision (Table13). A total of 2848 SNP loci were examined among the four cultivars, 2823 of which were monomorphic (calling for the same base) while only 25 loci were polymorphic representing a low percentage of polymorphism (0.88%) (Table 14 and fig11). Polymorphic SNPs across the four cultivars are listed in table (15) with their specific scaffold ID and position. These novel DNA markers (SNPs) allow much more precise genotyping for cultivars. By inspecting the 2848 common SNP loci, it was found that transition and transversion SNPs are represented in the four date palm genomes in different quantities, however, the transition/transversion ratio of the SNPs didn't much vary between cultivars recording an average of 1.65 (Table 16). Base substitution mutation is the base of single nucleotide polymorphism which involves either a transition (purine/purine or pyrimidine/pyrimidine) or transversion (purine against pyrimidine or vice versa) exchange. However, transitions are more frequent in nature than transversions (Weising et al., 2005) as reported in this study (Table 16).

Genetic Relationships and Cluster Analysis Among the Four Date Palm Cultivars Using Different Marker Types:
SNPs gained great interest as they are the smallest element of genetic variation and the origin of most genetic variation among individuals (Johnson et al., 2001). The dendrogram constructed for the four Egyptian date palm cultivars from SNP data in comparison with the Saudi cultivar Khalas (reference genome) emphasized the generally narrow genetic background among the Egyptian cultivars (Fig. 12) as concluded earlier from PCR-based and morphological markers (Aly et al.,2019); while showing divergence between the Egyptian date palms and the Saudi cultivar. Two dendrograms were constructed for the four date palm cultivars from PCR-based markers (SCoT and SSR) collectively ( Fig. 13) (Aly et al.,2019) and SNP data (Fig. 14). Results clearly showed the same pattern of relations in both trees, although, PCR-based markers recorded higher PIC values (Table 17). This reflects the power of SNP markers in revealing genomic variation precisely in fewer DNA loci. Similarly, a study aimed to detect the genetic diversity of wild almonds using traditional and new-generation markers, reported higher PIC values for SSR markers over SNPs (Sorkheh et al., 2017).   In this study, the whole genome sequence has increased our knowledge about the palm genome and opened new opportunities to further expand the identification of genetic variation. The advancement of new sequencing technologies had made it possible to highlight SNPs/indel that could differentially distinguish genotypes with distinct traits. However, many other biological questions can be answered through polymorphisms that are assayed in a subset of genomic regions. A massive flow of data from the entire genome sequence of the date palm is to be expected and available, which will help in understanding the genetic bases of this crop and its interaction with the environment, as well as for identification of a molecular marker linked to sex and insights into improving yield, quality and disease resistance. The entire genome will yield the discovery of functions for various genes and establish strategies to manipulate them in order to improve cultivars of interest and create new ones. Therefore, a lot of information still needs to be collected and the genome of the date palm must be studied in more detail.