Breaking through the unknowns of the human reference genome

The release of drafts of the human genome in 2001 was a landmark achievement1,2. Scientists could, for the first time, study long stretches of each human chromosome, base by base. As such, researchers could begin to understand how individual genes were ordered, and how the surrounding non-protein-coding DNA was structured and organized. Despite this amazing progress, the draft genomes were still incomplete, with more than 150 million bases missing3. Technological advances in the intervening years have allowed researchers to add to the draft, with the complete sequencing of a chromosome finally being achieved4,5 in 2020. As a result, new and uncharacterized parts of the human genome are beginning to surface, ushering in another exciting period of biological discovery.

What exactly was included in the draft genomes? The original draft contained many previously unexplored intergenic regions. It also encompassed the vast majority of genes. The International Human Genome Sequencing Consortium1 initially estimated that the genome contained 30,000–40,000 protein-coding genes, although the publication of an updated genome6 in 2004, along with improved gene-prediction approaches7, led the figure to be revised to about 20,000. The 2004 genome gave a high-resolution map of 2.85 billion nucleotides from euchromatin — the more loosely packaged regions of DNA, which are enriched in genes and make up roughly 92% of the human genome.

The reference genome launched the scientific community into an era of genome exploration, shifting the focus from single genes to more-complete, genome-wide studies. However, gaps remained on each of the 23 pairs of human chromosomes, estimated to contain more than 150 megabases of unknown sequence3 (Fig. 1). The largest gaps were at locations enriched with highly repetitive DNA or sequences for which there are many near-identical copies. These sections were originally difficult to clone, sequence and correctly assemble. As a result, the Human Genome Project intentionally under-represented these repetitive sequences. Although researchers had a very basic idea of the nature of sequences in these regions, the regions’ high-resolution genomic organization remained elusive.

Figure 1 | Filling in the missing sequence in the human genome. a, The 2001 draft human genome1,2 covered most of the gene-rich DNA, which is loosely packaged in the nucleus. But many gaps remained in tightly packaged regions rich in repetitive DNA sequences, which are often untranscribed (the overall extent of the gaps is exaggerated here, for ease of interpretation). b, Thanks to advances in sequencing and bioinformatics, researchers can now study all of these missing sequences. These include the telomere and subtelomere regions that cap chromosomes; centromere structures that are essential for cell division; and particularly short and highly repetitive chromosome arms known as acrocentric arms. Regions in which DNA is duplicated, either in one location or in a segmented way, can also now be analysed.

Early attempts to close the gaps used long sequence reads to span the repetitive sequences — but such reads were initially highly error-prone. In the 2010s, new opportunities arose, thanks to advances in the ability to read longer stretches of sequence (outlined in refs. 8 and 9, for instance), along with the development of scalable bioinformatic tools. Sequence reads of tens to hundreds of kilobases allowed the study of the genomic organization of many moderately sized gaps. This provided insights into some subtelomeric regions9 — repeat-rich DNA adjacent to the telomere structures that cap the ends of chromosomes. It also enabled the study of the first centromeric satellite array10, in which short sequences are repeated in tandem for about 300 kilobases. A subset of segmental duplications (stretches of sequence that share 90–100% of their bases and are found in multiple locations) was also resolved, many containing genes previously missing from the reference genome9,11. However, many of the largest, multi-megabase-sized repeat-rich regions remained intractable.

Over the past few years, the combination of both ultra-long reads9 and highly accurate long-read data12 has proved a game-changer for resolving these regions13,14, revealing, for the first time, extremely long tracts of tandem repeats and regions enriched in segmental duplications. By breaking down these technological barriers, scientists are now discovering extensive repeat-rich regions that can span millions of bases, and make up the entire short arms of chromosomes.

Researchers do not yet fully understand why parts of the human genome are organized in this way. But gaining such an understanding will undoubtedly be valuable, because these repeat-rich sequences are often positioned at sites that are crucial for life. For example, long tracts of ribosomal DNA (rDNA) repeats encode RNA components of the cell’s protein-synthesizing machinery and have an important role in nuclear organization15. And the repetitive DNA of structures called centromeres is essential for proper chromosome segregation during cell division16.

These large swathes of repetitive DNA come with different sets of rules, in terms of their genomic organization and evolution. They are also subject to different epigenetic regulation (molecular modifications to DNA and associated proteins that do not alter the underlying DNA sequence), which leads repetitive DNA to differ from euchromatin in terms of its organization, replication timing and transcriptional activity17–19. Many genome-wide tools and data sets cannot yet fully capture all this information from extremely repetitive DNA regions, and so scientists do not yet have a complete picture of what transcription factors bind to them, how these regions are spatially organized in the nucleus, or how regulation of these parts of our genome changes during development and in disease states. Now, much like the initial release of the genome decades ago, researchers are faced with a new, unexplored functional landscape in the human genome. Access to this information will drive technology and innovation to be inclusive of these repeat regions, once again broadening our understanding of genome biology.

In the past year, scientists have used extremely long and highly accurate sequence reads to reconstruct entire human chromosomes from telomere to telomere4,5. Last year also saw the release of a near-complete human reference genome from an effectively ‘haploid’ human cell line, with only five remaining gaps that mark the sites of rDNA arrays. In this line, cells have two identical copies of each chromosome, simplifying the challenge of repeat assembly compared with typical human cells (which are diploid, with different chromosomes inherited from the mother and father). These maps together offer the first high-resolution glimpse of centromeric regions, segmental duplications, subtelomeric repeats and each of the five acrocentric chromosomes, each of which has a very short arm, made up almost entirely of highly repetitive DNA, at one end.

It is tempting to think scientists are finally approaching the finish line. However, a single genome assembly, even if complete with near-perfect sequence accuracy, is an insufficient reference from which to study the sequence variation that exists across the human population. Existing maps that chart the diversity across the euchromatic parts of the genome must be extended to fully capture repetitive regions, where copy number and repeat organization vary between individuals. Doing so will require the development of strategies for routine production and analysis of complete human diploid genomes. The aspirational goal of reaching a more-complete and comprehensive reference of humanity will undoubtedly improve our understanding of genome structure and its role in human disease, and align with the promise and legacy of the Human Genome Project.
