Assembly of a pan-genome from deep sequencing of 910 humans of African descent


Rachel M. Sherman, Johns Hopkins Medicine
Juliet Forman, Johns Hopkins Medicine
Valentin Antonescu, Johns Hopkins Medicine
Daniela Puiu, Johns Hopkins Medicine
Michelle Daya, University of Colorado Anschutz Medical Campus
Nicholas Rafaels, University of Colorado Anschutz Medical Campus
Meher Preethi Boorgula, University of Colorado Anschutz Medical Campus
Sameer Chavan, University of Colorado Anschutz Medical Campus
Candelaria Vergara, Johns Hopkins University
Victor E. Ortega, Wake Forest School of Medicine
Albert M. Levin, Henry Ford Health System
Celeste Eng, University of California, San Francisco
Maria Yazdanbakhsh, Leiden University Medical Center - LUMC
James G. Wilson, University of Mississippi Medical Center
Javier Marrugo, Universidad de Cartagena
Leslie A. Lange, University of Colorado Anschutz Medical Campus
L. Keoki Williams, Henry Ford Health System
Harold Watson, The University of the West Indies
Lorraine B. Ware, Vanderbilt University
Christopher O. Olopade, The University of Chicago
Olufunmilayo Olopade, The University of Chicago
Ricardo R. Oliveira, Fundacao Oswaldo Cruz
Carole Ober, The University of Chicago
Dan L. Nicolae, The University of Chicago
Deborah A. Meyers, University of Arizona College of Medicine – Tucson
Alvaro Mayorga, Centro de Neumologia y Alergias
Jennifer Knight-Madden, Caribbean Institute for Health Research
Tina Hartert, Vanderbilt University
Nadia N. Hansel, Johns Hopkins University
Marilyn G. Foreman, Morehouse School of Medicine
Jean G. Ford, Albert Einstein Healthcare Network
Mezbah U. Faruque, Howard University College of Medicine
Georgia M. Dunston, Howard University College of Medicine
Luis Caraballo, Universidad de Cartagena

Document Type

Letter to the Editor

Publication Date



We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
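
The step of collecting all reads that failed to align can be illustrated with a minimal sketch. It assumes the alignments are available as SAM-format text, which the abstract does not state. In SAM, column 2 of each record is a bitwise FLAG, and bit 0x4 marks an unmapped segment. The function names and the toy records below are hypothetical and are not taken from the study's pipeline.

```python
# Hypothetical sketch: pick out reads that failed to align from SAM-format text.
# Assumption: input is plain SAM text; the abstract does not name the tools used.

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_line: str) -> bool:
    """Return True if a SAM alignment record has the unmapped bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
    return bool(flag & UNMAPPED_FLAG)


def collect_unaligned(sam_lines):
    """Yield (read_name, sequence) for each unmapped record, skipping headers."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # blank or header line
            continue
        if is_unmapped(line):
            fields = line.rstrip("\n").split("\t")
            yield fields[0], fields[9]  # QNAME is column 1, SEQ is column 10


# Toy records (invented for illustration): one aligned read, two unaligned.
TOY_SAM = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]

if __name__ == "__main__":
    for name, seq in collect_unaligned(TOY_SAM):
        print(name, seq)
```

The reads gathered this way would then go to an assembler to form contigs. That step is not sketched here because the abstract gives no detail on how it was done.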
