What is a Sequencing Read?

By admin
Probably the most common form of genetic sequencing these days is "paired-end" sequencing. It’s very impressive: the sequencing machine can process the same nucleic acid fragment from both ends! This means that each observation looks like:

+--------------+---------+--------------+
| forward read |   gap   | reverse read |
+--------------+---------+--------------+

Because accuracy ("quality") tends to drop off as you sequence further into a fragment, sequencing from both ends gives you much more accurate data than trying to sequence the whole thing from one end. And because we build up larger sequences ("contigs") by piecing together overlapping ones ("assembly"), two sequences of bases separated by a gap are actually usually more helpful than the same number of bases without a gap.

It’s common to refer to paired-end sequencing with designations like "2×150", where the "2×" tells us it’s paired-end, and the "150" tells us it reads for 150 bases from each end, for a total of 300 bases per fragment.
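
To make the designation concrete, here is a minimal sketch that turns a string like "2×150" into bases per fragment. The function names are hypothetical, made up for illustration, and it assumes the designation is always of the form "ends × length":

```python
import re


def parse_read_layout(designation: str) -> tuple[int, int]:
    """Parse a designation like '2x150' or '2×150' into (ends, read_length).

    Hypothetical helper: assumes the "<ends>x<length>" form described above.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[x×]\s*(\d+)\s*", designation)
    if match is None:
        raise ValueError(f"unrecognized designation: {designation!r}")
    return int(match.group(1)), int(match.group(2))


def bases_per_fragment(designation: str) -> int:
    """Total bases read from one fragment: ends times bases per end."""
    ends, read_length = parse_read_layout(designation)
    return ends * read_length


print(bases_per_fragment("2×150"))  # 300
```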

But this introduces a terminology question: what is a read? When we only had "single-end" sequencing it was clear: each sequenced fragment, each contiguous sequence of bases, was a read. With paired-end sequencing, however, these are no longer the same thing! There are two things a "read" could mean:

Read: a continuous series of bases.

Read: the bases from a sequenced fragment.

For example, say we have:

>SRR14530724.2 2/1
CATTTTCGACGGCGTCGATGTACAAAGGTTATACCATAGTAAGTCCGAAGC
TACAGGCTTATGACACCGCAGAGTCAATGTATTCCGGTGACAATGTACTGA
TGTACAGTGGGACTGACACTGTCTCTTATACACATCTCCGAGCCCACGA
>SRR14530724.2 2/2
TGTCAGTCCCACTGTACATCAGTACATACACACCGGAATACATTGACTCTG
CGGTGTCATAAGCCTGTAGCTTCGGACTTACTATGGTATAACCTTTGTACA
TCGACGCCGTCGAAAATGCTGTCTCTTATACACATCTGACGCTGCCGAC

This is a forward read (SRR14530724.2 2/1) and a reverse read (SRR14530724.2 2/2) that together comprise a single observation of a fragment from the sample and would generally be analyzed together. Does this count as one read or two?
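
Here is a rough sketch of how the two definitions give different counts for the same data. It assumes FASTA headers shaped like the ones above, where the first whitespace-separated token identifies the fragment and the mate marker follows it. The function name is made up for illustration, and the sequences are shortened:

```python
def count_sequences_and_fragments(fasta_text: str) -> tuple[int, int]:
    """Return (contiguous sequences, distinct fragments) for FASTA text.

    Assumption: the fragment ID is the first whitespace-delimited token of
    each header line, so forward and reverse mates share the same ID.
    """
    sequences = 0
    fragment_ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            sequences += 1
            tokens = line[1:].split()
            fragment_ids.add(tokens[0] if tokens else "")
    return sequences, len(fragment_ids)


# Sequences truncated for brevity; headers follow the example above.
example = """\
>SRR14530724.2 2/1
CATTTTCGACGGCGTCGATG
>SRR14530724.2 2/2
TGTCAGTCCCACTGTACATC
"""

print(count_sequences_and_fragments(example))  # (2, 1)
```

Under the "continuous series of bases" definition this is two reads; under the "bases from a sequenced fragment" definition it is one.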

Turns out people do both, and it leads to a lot of misunderstandings!

Some examples:

Illumina counts them as two. They say the "25B" flow cell on the NovaSeq X will produce 52B paired end reads, or ~8Tb ("terabases", or trillion bases) of 2×150. Since 52B * 150b = 7.8Tb (what they call ~8Tb), that tells us they’re counting both the forward and the reverse read.

Element counts them as one. They say a 2×150 high output flow cell produces 300 Gb and 1B reads. Since 1B * 150b * 2 = 300 Gb, that tells us they’re counting the forward and reverse read together as one.

Singular is not clear but I’m pretty sure they’re counting each fragment’s observations as a single read.


The European Nucleotide Archive counts them as one. For example, if you visit ERR1470825, which is Illumina MiSeq paired end sequencing at 2×250, you’ll see it says 2.2M reads and if you download the fastq.gz files you’ll find 2.2M reads in each of the forward and reverse files.


Rothman et al. 2021 counts them as one. They say "paired reads" on first use and then "reads" later, and you can tell that they are counting them as one because (a) they often give odd numbers for things like "there were only 337 SARS-CoV-2 reads" and (b) if you reanalyze the data their numbers only make sense if they’re counting pairs.

An academic group I was recently talking to about a potential partnership counted them as one in a recent paper I reanalyzed and as two when talking over email.

A commercial sequencing company I was recently talking to counted them as two.

ChatGPT and Claude, when asked, both count them as two. Ex: "In short-read paired-end sequencing, a forward-reverse pair is typically considered as two reads".
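
The arithmetic in the Illumina and Element examples above can be sketched as a small check: divide the advertised output in bases by (advertised reads × read length) and see whether each counted "read" covers one end or both. The function name is hypothetical, and the inputs are the rounded figures quoted above:

```python
def ends_per_counted_read(total_bases: float, reads: float, read_length: int) -> float:
    """How many read_length-base sequences each advertised "read" must contain.

    About 1 means each end is counted as its own read (a pair is two reads);
    about 2 means a forward/reverse pair is counted as one read.
    """
    return total_bases / (reads * read_length)


# Illumina NovaSeq X "25B" flow cell: 52B reads, ~8Tb, 2×150
illumina = ends_per_counted_read(8e12, 52e9, 150)

# Element 2×150 high output flow cell: 1B reads, 300 Gb
element = ends_per_counted_read(300e9, 1e9, 150)

print(round(illumina), round(element))  # 1 2
```

The Illumina ratio comes out slightly above 1 only because ~8Tb is a rounded version of 7.8Tb.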

This is a mess! And, to make it worse, as far as I can tell there’s no standard term other than "read" for either "what both the forward and reverse read are examples of" or "what the forward and reverse read are when considered together".

I’ve been using "read" to mean "read pair", but given the ambiguity I think I should switch to another term. The NCBI SRA uses "spots", but no one else seems to use this terminology. You can just say "read pair", which is pretty good, but a bit long. Possibly "pairs" or "mates" would be good? Thoughts?