Sugarcane de novo genome assembly

Sugarcane is one of the most important crops in the world, not only for food but also for biofuel. The world population is growing and is expected to be 50% larger by 2050. Adverse weather conditions and speculation in agricultural markets, combined with another 2.5 billion people, will increase demand. A growing population also brings greater demand for energy: global energy needs will double, as will carbon dioxide emissions. We need low-carbon energy solutions. Sugarcane ethanol is a clean, renewable fuel that produces on average 90 percent less carbon dioxide than oil and can be an important tool in the fight against climate change.

The sugarcane genome poses one of the hardest de novo genome assembly problems. Its structure is very complicated because of its breeding history. A century ago, breeders wanted to develop a sweeter and more robust line, so they crossed S. spontaneum, which contributes robustness, with S. officinarum, which contributes sweetness, and then crossed the F1 back to S. officinarum to reinforce sweetness. As a result, the genome of current sugarcane cultivars is highly polyploid, aneuploid, and heterozygous, with large-scale recombination. The haploid genome size is 1 Gbp and we expect 8-12 copies of each chromosome, for a total size of about 10 Gbp in 100-130 chromosomes.
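As a quick sanity check on these figures, the total size follows from multiplying the haploid size by the expected copy number. The short Python sketch below only restates the numbers given above; the variable names are illustrative.

```python
# Figures taken from the text above; names are illustrative only.
HAPLOID_SIZE_GBP = 1.0           # haploid genome size
COPIES_PER_CHROMOSOME = (8, 12)  # expected copy-number range

# Total size = haploid size x number of copies, at both ends of the range.
low, high = (HAPLOID_SIZE_GBP * copies for copies in COPIES_PER_CHROMOSOME)
print(f"expected total genome size: {low:.0f}-{high:.0f} Gbp")
```

The range of 8-12 Gbp brackets the figure of roughly 10 Gbp quoted above.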

The sugarcane genome introduces many novel algorithmic challenges to computational biology:

(1) Polyploidy/aneuploidy inference: how many copies of each chromosome are there? About 80% of the sugarcane genome is thought to be inherited from S. officinarum and 10% from S. spontaneum.

(2) Large-scale recombination: about 10% of the sugarcane genome appears to be a recombinant mosaic, and where these regions lie is unknown.

(3) Heterozygosity: the most heterozygous regions have a variant at 1 in 20 positions, which means we have to allow for about 10% variation in overlap computation. This is well above the common settings of assembly programs and can lead to false-positive links.

(4) Repeats: polyploidy multiplies repeats, and aneuploidy makes their copy numbers irregular.
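
To make the trade-off in (3) concrete, here is a minimal Python sketch of suffix-prefix overlap detection with a mismatch tolerance. It is a toy illustration under stated assumptions, not the method of any particular assembler: it counts substitutions only (no indels), and the function name, reads, and thresholds are all hypothetical.

```python
def overlap_length(a: str, b: str, min_len: int, max_mismatch_rate: float) -> int:
    """Longest overlap between a suffix of `a` and a prefix of `b` whose
    mismatch rate is at most `max_mismatch_rate`; 0 if none reaches `min_len`.
    Substitutions only: a deliberate simplification (no indels)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * length:
            return length
    return 0


# Hypothetical toy reads, invented for illustration.
read_a = "TTGACCGATAGCTAGGCTAA"
# Same locus on another haplotype: one variant inside the 10 bp overlap.
read_b = "GCTAGACTAACCGTTAGC"
# A diverged repeat copy from elsewhere: also one difference in 10 bp.
read_c = "GCTAGGCTTAGGATCCAT"

strict = overlap_length(read_a, read_b, min_len=8, max_mismatch_rate=0.0)
tolerant = overlap_length(read_a, read_b, min_len=8, max_mismatch_rate=0.10)
repeat_link = overlap_length(read_a, read_c, min_len=8, max_mismatch_rate=0.10)

print(strict, tolerant, repeat_link)
```

With exact matching, the overlap between the two haplotypes is missed. Raising the tolerance to 10% recovers it, but the same setting also links read_a to the diverged repeat copy, which is the false-positive linking described in (3) and aggravated by the repeats in (4).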