What is a Sequencing Read?

Created on November 12, 2023 at 11:07 am

Probably the most common form of genetic sequencing these days is "paired-end" sequencing. It’s very impressive: the sequencing machine can process the same nucleic acid fragment from both ends! This means that each observation looks like:

+--------------+---------+--------------+
| forward read |   gap   | reverse read |
+--------------+---------+--------------+

Because accuracy ("quality") tends to drop off as you sequence further into a fragment, sequencing from both ends gives you much more accurate data than trying to sequence the whole thing from one end. And because we build up larger sequences ("contigs") by piecing together overlapping ones ("assembly"), two sequences of bases separated by a gap are actually usually more helpful than the same number of bases without a gap.

It’s common to refer to paired-end sequencing with designations like "2×150", where the "2×" tells us it’s paired-end, and the "150" tells us it reads for 150 bases from each end, for a total of 300 bases per fragment.
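As a small illustration of how such a designation breaks down, here is a minimal sketch that parses a string like "2×150" and computes the bases read per fragment. The function names `parse_layout` and `bases_per_fragment` are hypothetical, and accepting either "×" or a plain "x" as the separator is an assumption made for convenience.

```python
import re


def parse_layout(designation):
    """Parse a designation like '2×150' or '2x150' into (ends, read_length)."""
    match = re.fullmatch(r"\s*(\d+)\s*[x×]\s*(\d+)\s*", designation)
    if match is None:
        raise ValueError(f"unrecognized designation: {designation!r}")
    return int(match.group(1)), int(match.group(2))


def bases_per_fragment(designation):
    """Total bases read from one fragment: number of ends times read length."""
    ends, read_length = parse_layout(designation)
    return ends * read_length


print(bases_per_fragment("2×150"))  # 300
```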

But this introduces a terminology question: what is a read? When we only had "single-end" sequencing it was clear: each sequenced fragment, each contiguous sequence of bases, was a read. With paired-end sequencing, however, these are no longer the same thing! There are two things a "read" could mean:

Read: a continuous series of bases.

Read: the bases from a sequenced fragment.

For example, say we have:

>SRR14530724.2 2/1



This is a forward read (SRR14530724.2 2/1) and a reverse read (SRR14530724.2 2/2) that together comprise a single observation of a fragment from the sample and would generally be analyzed together. Does this count as one read or two?

Turns out people do both, and it leads to a lot of misunderstandings!
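To make the two conventions concrete, here is a rough sketch that counts the same pair of headers both ways. It assumes mates are distinguished only by a trailing "/1" or "/2", as in the headers above; the function name `count_both_ways` is hypothetical, and real files may label mates differently.

```python
def count_both_ways(headers):
    """Return (contiguous_sequences, fragments) for a list of read headers.

    Assumes the two mates of a fragment share a name and differ only in a
    trailing '/1' or '/2' suffix.
    """
    # Dropping the mate suffix collapses both ends onto one fragment name.
    fragments = {header.rsplit("/", 1)[0] for header in headers}
    return len(headers), len(fragments)


headers = ["SRR14530724.2 2/1", "SRR14530724.2 2/2"]
by_sequence, by_fragment = count_both_ways(headers)
print(by_sequence, by_fragment)  # 2 1
```

The same data is "2 reads" under the first definition and "1 read" under the second.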

Some examples:

Illumina counts them as two. They say the "25B" flow cell on the NovaSeq X will produce 52B paired end reads, or ~8Tb ("terabases", or trillion bases) of 2×150. Since 52B * 150b = 7.8Tb (what they call ~8Tb), that tells us they’re counting both the forward and the reverse read.

Element counts them as one. They say a 2×150 high output flow cell produces 300 Gb and 1B reads. Since 1B * 150b * 2 = 300 Gb, that tells us they’re counting the forward and reverse read together as one.

Singular is not clear but I’m pretty sure they’re counting each fragment’s observations as a single read.

The European Nucleotide Archive counts them as one. For example, if you visit ERR1470825, which is Illumina MiSeq paired end sequencing at 2×250, you’ll see it says 2.2M reads and if you download the fastq.gz files you’ll find 2.2M reads in each of the forward and reverse files.

Rothman et al. 2021 counts them as one. They say "paired reads" on first use and then "reads" later, and you can tell that they are counting them as one because (a) they often give odd numbers for things like "there were only 337 SARS-CoV-2 reads" and (b) if you reanalyze the data their numbers only make sense if they’re counting pairs.

An academic group I was recently talking to about a potential partnership counted them as one in a recent paper I reanalyzed and as two when talking over email.

A commercial sequencing company I was recently talking to counted them as two.

Asking ChatGPT and Claude, both count them as two. Ex: "In short-read paired-end sequencing, a forward-reverse pair is typically considered as two reads".
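The arithmetic used for the Illumina and Element figures above can be turned into a quick check: given a stated read count, read length, and total output, see which convention reproduces the total. This is only a sketch; the function name `infer_convention` and the 5% tolerance for rounded marketing numbers are assumptions.

```python
def infer_convention(read_count, read_length, stated_bases, tolerance=0.05):
    """Guess whether a spec counts each end or each pair as one read.

    Compares the stated output with read_count * read_length (each end
    counted separately) and with twice that (each pair counted once).
    """
    candidates = (
        ("two reads per pair", read_count * read_length),
        ("one read per pair", read_count * read_length * 2),
    )
    for label, implied_bases in candidates:
        if abs(implied_bases - stated_bases) <= tolerance * stated_bases:
            return label
    return "unclear"


# Illumina: 52B reads at 2x150, stated as ~8Tb.
print(infer_convention(52_000_000_000, 150, 8_000_000_000_000))
# Element: 1B reads at 2x150, stated as 300 Gb.
print(infer_convention(1_000_000_000, 150, 300_000_000_000))
```

With the numbers quoted above, the first call gives "two reads per pair" and the second gives "one read per pair".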

This is a mess! And, to make it worse, as far as I can tell there’s no standard term other than "read" for either "what both the forward and reverse read are examples of" or "what the forward and reverse read are when considered together".

I’ve been using "read" to mean "read pair", but given the ambiguity I think I should switch to another term. The NCBI SRA uses "spots", but no one else seems to use this terminology. You can just say "read pair", which is pretty good, but a bit long. Possibly "pairs" or "mates" would be good? Thoughts?
