Abstract: The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight's half brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference has accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions have become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects' and Twilight's genome or due to errors in the reference. EquCab2 is regarded as "The Twilight Assembly." The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset comprised of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo's BAC end sequences. 
Our analysis identifies 720,843 homozygous discrepancies between new, high throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low coverage alignments.
This research summary has been generated with artificial intelligence and may contain errors and omissions. Refer to the original study to confirm details provided.
This research article detects and assesses inconsistencies in the EquCab2 assembly, the reference genome for the domestic horse. Its aim is to determine whether observed differences between mapped datasets and the reference reflect genuine sequence variation or errors in the reference.
Objectives of the Study
The main goal of this research was to identify differences between the EquCab2 assembly, which is the reference genome for the domestic horse, and the Twilight Sanger data that contributed to its construction. The study specifically sought to evaluate whether discrepancies were due to real sequence variations or were artefacts.
Another objective was to identify those areas that had low Sanger read coverage and to evaluate genomic content variations inconsistent with either the original Twilight Sanger data or the new genomic sequence data.
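Lastly, it aimed to separate heterozygous from homozygous variation in the mapped data. Because the haploid reference was built largely from Twilight's own Sanger reads, variation seen when those reads are mapped back should be mostly heterozygous; homozygous differences point to either errors in the reference or contributions from Bravo's BAC end sequences.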
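The heterozygous versus homozygous reasoning above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `SiteCounts` record, the `classify_site` function, and every threshold (`min_depth`, `hom_alt_fraction`, `het_low`) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read support at one reference position (hypothetical record layout)."""
    chrom: str
    pos: int        # 1-based position on the reference
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def classify_site(site, min_depth=8, hom_alt_fraction=0.9, het_low=0.25):
    """Label a site by the fraction of reads that carry the alternate allele.

    All thresholds are illustrative assumptions, not values from the study.
    """
    depth = site.ref_reads + site.alt_reads
    if depth < min_depth:
        return "low_depth"  # too few reads to make a call
    alt_fraction = site.alt_reads / depth
    if alt_fraction >= hom_alt_fraction:
        # Nearly every read from the source animal disagrees with the
        # reference: a candidate reference error or second-animal contribution.
        return "homozygous_alt"
    if alt_fraction >= het_low:
        # Both alleles are present: expected for a diploid source animal
        # mapped against a haploid consensus.
        return "heterozygous"
    return "reference"


sites = [
    SiteCounts("chr1", 100, ref_reads=0, alt_reads=30),
    SiteCounts("chr1", 250, ref_reads=15, alt_reads=15),
    SiteCounts("chr1", 400, ref_reads=30, alt_reads=0),
]
labels = [classify_site(s) for s in sites]
```

In practice a variant caller would make this decision using base and mapping qualities rather than raw counts; the sketch only shows why a homozygous call against the animal's own reference is suspicious.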
Approach and Methodology
The researchers mapped the original Sanger and BAC end reads back to the EquCab2 reference and assessed them alongside approximately 40X coverage of new Illumina paired-end sequence data generated from Twilight.
From the mapped datasets, they identified genomic regions with low Sanger read coverage, as well as positions where the reference is consistent with neither the original Twilight Sanger data nor the new Illumina data.
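As a rough illustration of how low-coverage regions can be derived from mapped reads, the following Python sketch merges consecutive low-depth positions into BED-style intervals. It assumes per-base depths are already available as a list; the function names and the `min_depth=3` cutoff are hypothetical and are not taken from the study.

```python
def low_coverage_regions(chrom, depths, min_depth=3):
    """Merge runs of positions with depth < min_depth into intervals.

    Returns (chrom, start, end) tuples using BED coordinates:
    0-based start, end-exclusive. The cutoff is an illustrative assumption.
    """
    regions = []
    start = None
    for i, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = i  # open a new low-coverage run
        elif start is not None:
            regions.append((chrom, start, i))  # close the current run
            start = None
    if start is not None:
        # The run extends to the end of the sequence.
        regions.append((chrom, start, len(depths)))
    return regions


def to_bed_lines(regions):
    """Format intervals as tab-separated three-column BED lines."""
    return ["\t".join((chrom, str(start), str(end))) for chrom, start, end in regions]


depths = [5, 5, 1, 0, 2, 6, 7, 0]
regions = low_coverage_regions("chr1", depths)
bed_lines = to_bed_lines(regions)
```

Per-base depths of this kind could come from a tool such as `samtools depth` run on the alignment file; the toy list here stands in for that output.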
Key Findings
The study identified 720,843 homozygous discrepancies between the new high-throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly.
Most of these discrepancies represent errors in the assembly, while approximately 10,000 are shown to be contributions from another horse.
The study also provided a binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and the resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low coverage alignments.
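One way such a BED annotation might be used downstream is to flag variant calls that fall in regions where the reference consensus rests on low coverage. The Python sketch below is an assumption-laden illustration: `parse_bed` and `in_low_coverage` are hypothetical helpers, the example intervals are invented, and the lookup assumes the intervals on each chromosome do not overlap.

```python
import bisect
from collections import defaultdict


def parse_bed(lines):
    """Read three-column BED lines into {chrom: sorted [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        by_chrom[chrom].append((int(start), int(end)))
    for intervals in by_chrom.values():
        intervals.sort()
    return dict(by_chrom)


def in_low_coverage(regions, chrom, pos_1based):
    """Return True if a 1-based position lies inside any interval.

    BED intervals are 0-based and end-exclusive, so the position is
    converted first. Assumes intervals per chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    pos0 = pos_1based - 1
    # Index of the last interval whose start is <= pos0.
    idx = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= pos0 < intervals[idx][1]


# Invented example intervals, not data from the study.
regions = parse_bed(["chr1\t2\t5", "chr1\t7\t8", "chr2\t0\t10"])
flags = [in_low_coverage(regions, "chr1", p) for p in (1, 3, 6, 8)]
```

A variant flagged this way is not necessarily wrong; the flag only signals that the reference base it is compared against was weakly supported.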
Implications of the Research
This study clarifies which of the observed discrepancies between mapped datasets and the EquCab2 reference stem from errors in the reference rather than from true sequence variation.
These findings can guide future genetic studies of horses and potentially other equid species. Knowing where the reference is erroneous or poorly supported allows variant calls made against it to be interpreted with more confidence, which in turn supports studies of genetic structure and diversity in horses.
Cite This Article
Rebolledo-Mendez J, Hestand MS, Coleman SJ, Zeng Z, Orlando L, MacLeod JN, Kalbfleisch T. (2015). Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads. PLoS One, 10(6), e0126852. https://doi.org/10.1371/journal.pone.0126852
Rebolledo-Mendez, J
Department of Biochemistry and Molecular Biology, School of Medicine, University of Louisville, Louisville, Kentucky, United States of America.
Hestand, Matthew S
Maxwell H. Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Lexington, Kentucky, United States of America.
Coleman, Stephen J
Maxwell H. Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Lexington, Kentucky, United States of America.
Zeng, Zheng
Department of Computer Science, University of Kentucky, Lexington, Kentucky, United States of America.
Orlando, Ludovic
Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark.
MacLeod, James N
Maxwell H. Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Lexington, Kentucky, United States of America.
Kalbfleisch, Ted
Department of Biochemistry and Molecular Biology, School of Medicine, University of Louisville, Louisville, Kentucky, United States of America; Intrepid Bioinformatics, Louisville, Kentucky, United States of America.
MeSH Terms
Animals
Genome
High-Throughput Nucleotide Sequencing
Horses / genetics
Sequence Analysis, DNA
Grant Funding
P20 GM103436 / NIGMS NIH HHS
5P20GM103436-13 / NIGMS NIH HHS
Conflict of Interest Statement
Ted Kalbfleisch is the CEO of Intrepid Bioinformatics. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.