Sugarcane de novo genome assembly


Sugarcane is one of the most important crops in the world, not only for food but also for biofuel. The world population is growing and is expected to be 50% larger by 2050. Adverse weather conditions and speculation in agricultural markets, combined with another 2.5 billion people, will push demand still higher. A growing population also means growing energy demand: global energy needs will double, and so will carbon dioxide emissions unless we adopt low-carbon energy solutions. Sugarcane ethanol is a clean, renewable fuel that produces on average 90 percent less carbon dioxide than oil, and it can be an important tool in the fight against climate change.

The sugarcane genome is one of the hardest de novo genome assembly problems. It has a very complicated genome structure because of its complicated breeding history. A century ago, breeders wanted to develop sweeter and hardier lines, so they crossed S. spontaneum, which contributes robustness, with S. officinarum, which contributes sweetness, and then backcrossed the F1 to S. officinarum to fortify sweetness. As a result, modern sugarcane cultivars are highly polyploid and aneuploid, heterozygous, and carry large-scale recombination. The haploid genome size is 1 Gbp; with an expected 8-12 copies per chromosome, the total genome reaches roughly 10 Gbp spread over 100-130 chromosomes.
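These figures fit together with simple arithmetic; a quick sanity check is sketched below. The basic chromosome number of ~10 is an assumption made here to match the quoted 100-130 chromosome count, not a figure stated in the text.

```python
# Back-of-the-envelope check of the sugarcane genome-size figures.
# Numbers are taken from the text; the basic chromosome number is assumed.

HAPLOID_SIZE_GBP = 1.0                 # monoploid genome size, ~1 Gbp
COPIES_PER_CHROMOSOME = range(8, 13)   # expected 8-12 copies per chromosome
BASIC_CHROMOSOME_NUMBER = 10           # assumed basic number (x ~ 10)

for ploidy in COPIES_PER_CHROMOSOME:
    total_gbp = HAPLOID_SIZE_GBP * ploidy
    total_chromosomes = BASIC_CHROMOSOME_NUMBER * ploidy
    print(f"{ploidy} copies -> ~{total_gbp:.0f} Gbp, ~{total_chromosomes} chromosomes")
```

At 10 copies this recovers the ~10 Gbp total, and the 8-12 copy range spans the observed 100-130 chromosomes.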

The sugarcane genome introduces several novel algorithmic challenges to computational biology:

(1) Polyploidy/aneuploidy inference: how many copies of each chromosome are there? About 80% of the sugarcane genome is thought to be inherited from S. officinarum and 10% from S. spontaneum.

(2) Large-scale recombination: roughly 10% of the sugarcane genome appears to be mosaic, and the locations of these recombinant segments are unknown.

(3) Heterozygosity: the most heterozygous regions carry a variant at 1 in every 20 positions, which means overlap computation must tolerate about 10% divergence between reads from different haplotypes. This is far beyond the common settings of assembly programs and can lead to false-positive linking.

(4) Repeats: polyploidy multiplies repeats, and aneuploidy makes their copy numbers irregular.
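The arithmetic behind challenge (3) can be made concrete. Assuming, for illustration, that each haplotype carries variants at the quoted 1-in-20 density relative to a shared ancestral sequence and that variant positions rarely coincide, two reads drawn from different haplotypes disagree at close to double that rate:

```python
# Why "1 in 20" heterozygosity implies ~10% divergence in overlaps.
# Two reads from two different haplotypes disagree wherever either one
# carries a variant, i.e. at roughly 2p - p^2 of positions (assuming
# independent variant placement, so coincident variants are rare).

p = 1 / 20                    # per-haplotype variant density (from the text)
pairwise = 2 * p - p ** 2     # expected divergence between two haplotypes
print(f"expected pairwise divergence: {pairwise:.2%}")  # 9.75%
```

An overlapper whose identity cutoff is tuned for sequencing error alone would reject these true overlaps; naively loosening the cutoff instead starts chaining reads across haplotypes, which is the false-positive linking noted above.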

The resurgence of reference-quality genomes


Since the first DNA genome, phage ΦX174, was sequenced by Fred Sanger in 1977, Sanger sequencing dominated the market for roughly 25-30 years, paired with BAC-by-BAC sequencing, until Next-Gen sequencing took its place. Because Sanger sequencing provided quite long reads (500-1,000 bp), it yielded megabase-scale contigs and led genome sequencing projects to very high-quality reference genomes for human, mouse, fly, rice, Arabidopsis, and others. Nevertheless, it was very costly, so only a few especially important model species were selected for de novo sequencing.

Next-Gen sequencing supplanted Sanger sequencing with far lower cost and higher throughput. Since it became feasible to sequence a genome at deep coverage cheaply, a great many species were sequenced, along with individuals and even single cell types, and population genomics and comparative genomics took off. Contigs, however, remained exon-sized. Genome finishing was abandoned, and many genome projects ended with draft-quality genomes. A substantial portion of each genome was disregarded, and with it regulatory elements, genes, and synteny blocks.

Now a new biotechnology era begins with long-read sequencing. Single-molecule read sequencing from PacBio (15 Kbp), Moleculo long-read sequencing (5 Kbp), Oxford Nanopore (5-10 Kbp), and 10x Genomics (50 Kbp) deliver much longer reads than Next-Gen sequencing. Even longer-range technologies, such as BioNano optical mapping (100-150 Kbp) and the Hi-C/Chicago protocols (25-100 Kbp), have been developed and are in use, along with related algorithms such as MHAP and LACHESIS.
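MHAP, mentioned above, makes overlap detection among long, noisy reads tractable by comparing compact MinHash sketches of each read's k-mer set instead of aligning all read pairs. The toy sketch below illustrates only the core MinHash idea, not MHAP's actual implementation; all function names, parameters, and sequences are illustrative.

```python
import hashlib

def kmers(seq, k=12):
    """All k-mers of a read, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=12, num_hashes=64):
    """One minimum hash value per salted hash function; reads sharing
    many k-mers share many sketch slots."""
    sketch = []
    for salt in range(num_hashes):
        best = min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{km}".encode(), digest_size=8).digest(),
                "big")
            for km in kmers(seq, k))
        sketch.append(best)
    return sketch

def sketch_similarity(a, b):
    """Fraction of matching slots estimates the Jaccard similarity
    of the two reads' k-mer sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def make_seq(seed, n):
    """Deterministic pseudo-random DNA for the demo."""
    bases, out, state = "ACGT", [], seed.encode()
    while len(out) < n:
        state = hashlib.blake2b(state, digest_size=8).digest()
        out.extend(bases[b % 4] for b in state)
    return "".join(out[:n])

# Two reads sharing an 80 bp overlap, plus one unrelated read.
common = make_seq("common", 80)
read1 = make_seq("left", 20) + common
read2 = common + make_seq("right", 20)
read3 = make_seq("other", 100)

s1, s2, s3 = (minhash_sketch(r) for r in (read1, read2, read3))
print(sketch_similarity(s1, s2), sketch_similarity(s1, s3))
```

The overlapping pair agrees in far more sketch slots than the unrelated pair, and thresholding this cheap comparison is how candidate overlaps can be nominated before any expensive alignment.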