On the High Value of Low Standards

Review; Open access; Peer reviewed

2002; American Society for Microbiology; Volume: 184; Issue: 23; Language: English

10.1128/jb.184.23.6406-6409.2002

ISSN

1098-5530

Authors

Elbert Branscomb, Paul Predki

Topic(s)

Molecular Biology Techniques and Applications

Abstract

Is there a case to be made for draft sequencing? First, we need to get a fix on how much less it costs than complete genome sequencing, how much faster and/or easier it is to do, and how much and what types of scientific utility are sacrificed. But this is not a straightforward issue. No accepted standard for draft sequence data exists; in current practice it ranges from ∼3-fold coverage in short ( 600-bp), “paired-end” (PE) reads (sequencing reads are taken from both ends of the insert in a double-stranded vector and therefore come in oppositely directed pairs separated by an approximately known distance) of mixed separation lengths. Quality differences over that spectrum are relatively great, as are, though to a much smaller extent, cost differences. The “draft-or-finish” alternatives are hardly exclusive; mixed, staged, or context-dependent strategies may also make sense. All the parameters are evolving rapidly. And finally, there is as yet too little experience to support definitive answers, although clearly enough to get an argument going in the better genome bars.
First, we address the production side of the question; consider the hypothetical case of sequencing factory X. This exemplary facility can produce over 30 Mb of high-quality (PE) bases per day at a fully loaded marginal cost of 0.3¢/base. Factory X has concluded that for most DNA, 8× PE coverage is usually optimal, both for producing draft data that are not intended for subsequent finishing and as a substrate for finishing. With this choice, finish-ready draft data have, at factory X, a current marginal cost of ∼2.5¢/base and can be produced at a rate of 3.6 Mb/day with a delay from time of DNA receipt to draft product on the order of 2 weeks. The quality of this sequence is discussed below, but the general nature of its coverage integrity should be noted here. In ∼8-fold PE draft data, the overall coverage is typically high (>95% of the sequence represented).
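As a rough point of comparison for the coverage figure, the sketch below uses the standard Poisson approximation for random shotgun sampling: if reads land uniformly at random, a given base is missed with probability about e^(-c) at c-fold coverage. This model is an assumption introduced here for illustration and is not the authors' calculation; the function name is hypothetical. Real libraries are not perfectly uniform, so observed coverage can fall below this idealized estimate.

```python
import math


def expected_fraction_covered(fold_coverage: float) -> float:
    """Idealized fraction of bases hit by at least one read.

    Assumes reads are placed uniformly at random (Poisson model),
    which is a simplification and not a claim from the article.
    """
    if fold_coverage < 0:
        raise ValueError("fold_coverage must be non-negative")
    return 1.0 - math.exp(-fold_coverage)


# Compare the low end of the draft spectrum (~3x) with factory X's 8x choice.
for coverage in (3, 8):
    fraction = expected_fraction_covered(coverage)
    print(f"{coverage}x coverage: {fraction:.4%} of bases expected to be sampled")
```

Under this simplified model, the step from 3-fold to 8-fold coverage moves the expected unsampled fraction from a few percent to well under one percent, in line with the ">95%" figure quoted for 8-fold draft data.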
Most importantly, and especially so if a judicious mix of large and small inserts is used in the sequencing, “almost all” points in the sequence, including gaps between the contigs (contigs are contiguous stretches of sequence produced by assembling overlapping individual reads), are bridged, or spanned, by multiple plasmid clones. This permits the automatic production of relatively high-quality, internally verified assembly and makes it possible to order and orient most of the contigs relative to each other to form large “scaffolds,” or sequence islands of valid order and high coverage. In such data, the expected error rate across genes is often better than 1/10⁴, and a good estimate of the accuracy of each base can be made available.
Factory X can also finish such data to full “Bermuda” standards, i.e., an expected base-calling error rate of <1/10⁴ and no gaps or other errors that mortal efforts could remove (these standards were established at meetings of the international Human Genome Project community), for an average additional cost of 7¢/base (and thus for a total cost of ∼10¢/base). Somewhat typically, however, factory X's finishing capacity is manyfold below its drafting capacity. Furthermore, the time needed to finish a segment of draft sequence can average several months and is highly variable. In this landscape, “full Bermuda” data are about four times as expensive, and very much slower to produce, than “high-quality” draft data. For the extra cost of finishing a bacterial genome, three additional ones could be drafted. While factory X is finishing a bacterial genome, it could draft, in the sense described, upwards of a hundred more. To our necessarily imperfect knowledge, no sequencing facility is currently producing either PE raw data or “fully finished” sequence data for true costs significantly below those quoted.
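The cost comparison for factory X can be retraced with simple arithmetic. The following is a minimal sketch using only the figures quoted in the text (0.3¢ per raw base, 30 Mb of raw bases per day, 8× coverage, 7¢ per base for finishing); the variable and function names are ours. The raw products come out slightly different from the rounded values the text reports (∼2.5¢ per base, 3.6 Mb/day, ∼10¢ per base), so treat the output as an approximation of the authors' numbers rather than a reproduction of them.

```python
# Figures quoted in the text for the hypothetical factory X.
RAW_COST_CENTS_PER_BASE = 0.3      # fully loaded marginal cost per raw base
RAW_THROUGHPUT_MB_PER_DAY = 30.0   # raw paired-end bases per day ("over 30 Mb")
FOLD_COVERAGE = 8                  # chosen paired-end coverage
FINISHING_CENTS_PER_BASE = 7.0     # average additional cost to finish


def draft_cost_per_base(raw_cost: float, coverage: float) -> float:
    """Each finished-genome base is read `coverage` times on average."""
    return raw_cost * coverage


def draft_rate_mb_per_day(raw_throughput: float, coverage: float) -> float:
    """Raw throughput is spread over `coverage` reads per genome base."""
    return raw_throughput / coverage


draft_cost = draft_cost_per_base(RAW_COST_CENTS_PER_BASE, FOLD_COVERAGE)
draft_rate = draft_rate_mb_per_day(RAW_THROUGHPUT_MB_PER_DAY, FOLD_COVERAGE)
finished_cost = draft_cost + FINISHING_CENTS_PER_BASE

# How many times more expensive finished sequence is than draft sequence.
finished_to_draft_ratio = finished_cost / draft_cost
# How many extra genomes could be drafted for the cost of finishing one.
extra_drafts_per_finish = FINISHING_CENTS_PER_BASE / draft_cost

print(f"draft cost:      {draft_cost:.2f} cents/base")
print(f"draft rate:      {draft_rate:.2f} Mb/day")
print(f"finished cost:   {finished_cost:.2f} cents/base")
print(f"finished/draft:  {finished_to_draft_ratio:.2f}x")
print(f"drafts per finishing budget: {extra_drafts_per_finish:.2f}")
```

Rounded to whole numbers, the last two quantities match the text's statements that finished data are about four times as expensive as draft data and that finishing one genome costs about as much as drafting three more.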
But the relative advantage in cost and project completion time of draft versus finished sequence data at factory X might well not be the same in other facilities. And of course, the differences in steady-state production capacity for draft versus finished sequence used in the example are in large measure merely an arbitrary matter of resource commitment. Also, there are some, at least potential, hidden costs in producing draft data that should be considered. (i) Draft sequence errors and imperfections may mislead users and thereby entail costs in wasted effort and delay. (ii) It may be substantially more expensive on average to finish draft sequence data later, should it prove desirable, than to do so at the start and in the same laboratory. (iii) Many have seen a risk that the will (at either the funding or bench level) to ever fully finish sequence data will be lost should we permit ourselves the cheap and easy pleasures of draft sequencing. We comment a little on these questions at the end. The next issue is the quality and utility of draft sequence data, focusing in particular on what we know about (i) sequence coverage, (ii) gene recovery and quality, and (iii) chromosome integrity and long-range order.
