What Short-Read Sequencing Is Costing Your Breeding Program, and What the Data Shows

May 5

Based on: Lee K, Korani W, Bentz PC, Pokhrel S, Ozias-Akins P, Harkess A, Vaughn J, Clevenger J. Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping. bioRxiv preprint. January 2025. doi: 10.1101/2025.01.09.632261. This manuscript has not been peer-reviewed.

Global food production needs to increase 60% by 2050. Climate change has already cut global agricultural productivity by an estimated 21% since the 1960s. Plant breeders are under pressure to move faster, with less margin for error, and with fewer resources.

Genomic selection was supposed to help. And it has. Marker-assisted breeding, GWAS, and QTL-seq have accelerated cultivar development across dozens of crops. But most of these approaches still depend on short-read sequencing. And short-read sequencing has a problem that most breeding programs have learned to work around rather than solve.

It is missing a significant portion of your genome.

Not because of budget constraints. Not because of user error. Because of read length. Short reads are approximately 150 base pairs long. Many regions of complex plant genomes, especially polyploids, require longer reads to resolve reliably. When short reads cannot align with confidence, they get filtered. When they get filtered, the variants in those regions disappear from your analysis.

This post presents what the data shows when you replace short-read low-pass sequencing with long-read low-pass sequencing at the same scale and cost range. The findings come from a study currently available as a bioRxiv preprint, using 127 samples from an active peanut breeding program.

Why Short Reads Lose Coverage Before You Even See the Results

The alignment filter is where short-read data quietly degrades.

When a 150-base-pair read maps equally well to multiple locations in a reference genome, the aligner assigns it a low mapping quality, and the read is filtered out as a probable misalignment. That filtering is necessary. Ambiguous alignments produce false variant calls. The problem is that in complex, repetitive, or polyploid genomes, a very large share of short reads cannot map uniquely. They are not low-quality reads. They are reads that are too short to be placed with confidence.

In the Lee et al. 2025 study, researchers compared alignment confidence between long-read and short-read data at matched depth across 127 samples. Short reads retained an average of 54.9% of reads after misalignment filtering. Long reads retained 92.9%.

That difference is not a rounding error. It means that before any downstream analysis begins, short-read data is already working with roughly half of what was sequenced. The other half was removed at the filter.
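As a rough illustration of how that retention figure is computed, the sketch below counts the share of mapped primary alignments that pass a mapping-quality (MAPQ) cutoff in SAM-formatted text. The cutoff of 20, the function name, and the toy records are assumptions for illustration; the study's actual filter settings are not described in this post.

```python
def retained_fraction(sam_lines, min_mapq=20):
    """Fraction of mapped primary alignments with MAPQ >= min_mapq.

    min_mapq=20 is an assumed cutoff, not the study's setting.
    """
    total = 0
    kept = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & 0x4:         # read is unmapped
            continue
        if flag & 0x900:       # secondary (0x100) or supplementary (0x800)
            continue
        total += 1
        if mapq >= min_mapq:
            kept += 1
    return kept / total if total else 0.0


# Toy records, not real data: 11 mandatory SAM columns, sequence omitted.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t150M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t500\t0\t150M\t*\t0\t0\t*\t*",
    "r3\t16\tchr2\t900\t3\t150M\t*\t0\t0\t*\t*",
    "r4\t0\tchr2\t40\t42\t150M\t*\t0\t0\t*\t*",
]
print(retained_fraction(toy_sam))  # 0.5
```

Reads r2 and r3 are not low-quality sequence. They simply carry a low mapping quality because the aligner could not place them uniquely, which is the situation described above.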

The Coverage Numbers That Result

Depth and coverage measure different things. Depth is how many times a position was sequenced. Coverage is what percentage of the genome was sequenced at all. A sample can have adequate depth and still leave large portions of the genome untouched.
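A minimal sketch of that distinction, using two made-up per-position depth profiles (the values and function names are illustrative, not from the study): both have the same mean depth, but very different coverage.

```python
def mean_depth(depths):
    """Average number of reads covering each position."""
    return sum(depths) / len(depths)


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Two toy 10-position "genomes" with the same total number of sequenced bases.
even = [2, 2, 1, 2, 1, 2, 2, 1, 2, 1]   # reads spread across every position
piled = [8, 8, 0, 0, 0, 0, 0, 0, 0, 0]  # reads stacked on two positions

print(mean_depth(even), breadth_of_coverage(even))    # 1.6 1.0
print(mean_depth(piled), breadth_of_coverage(piled))  # 1.6 0.2
```

Reporting depth alone would make these two samples look identical.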

The Lee et al. study designed the comparison to isolate this distinction. Researchers sequenced 127 individuals from an 18-parent MAGIC peanut population using both long-read low-pass and short-read low-pass approaches. Average sequencing depth was statistically equivalent across both methods. Long reads averaged 1.63x depth. Short reads averaged 1.68x. No significant difference between the two (p = 0.8237).

The coverage results were not equivalent.

After filtering for probable misalignments:

Long-read low-pass sequencing covered 55% of the genome
Short-read low-pass sequencing covered 17% of the genome

In gene space specifically:

Long reads covered 57.9% of gene space
Short reads covered 11% of gene space

At the same sequencing depth, long-read low-pass sequencing produced more than three times the usable genome coverage. In gene space — the portion of the genome directly relevant to trait selection — long reads delivered more than five times the coverage.
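The fold differences quoted above follow directly from the reported percentages. A short check, with variable names chosen here for illustration:

```python
# Filtered coverage percentages as reported in the post.
genome_cov = {"long": 55.0, "short": 17.0}
gene_cov = {"long": 57.9, "short": 11.0}


def fold_difference(cov):
    """Long-read coverage divided by short-read coverage."""
    return cov["long"] / cov["short"]


print(round(fold_difference(genome_cov), 2))  # 3.24
print(round(fold_difference(gene_cov), 2))    # 5.26
```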

This is not a marginal improvement. It is a structural difference in how much of your genome you are actually analyzing.

Peanut is a highly complex allotetraploid. But the misalignment problem that drives this gap is not unique to peanut. Any species with significant repetitive regions, duplicated subgenomes, or high structural variation will produce similar alignment filtering dynamics with short reads. Wheat. Potato. Sugarcane. Salmon. The specific numbers will vary by species. The direction will not.

What Disappears When Coverage Falls Short: Structural Variants

Single nucleotide polymorphisms receive most of the attention in genomic selection. But structural variants — insertions, deletions, inversions, translocations, and duplications — are increasingly recognized as key drivers of agronomic traits.

Research has linked structural variants to disease resistance phenotypes in melon, lettuce, peach, and rapeseed. They influence yield, stress tolerance, quality characteristics, and adaptation. For many traits of breeding interest, structural variants are the causative mechanism.

Short reads at 150 base pairs cannot span most structural variants. The read simply is not long enough to cross the boundary of an insertion or deletion and map on both sides. As a result, short-read data systematically undercounts structural variants, not because the variants are rare, but because the technology cannot see them.

In the Lee et al. study, both data types were aligned to a 16-genome pangraph built from the parental lines of the breeding population. The pangraph comparison produced:

18.7x more SNPs with long-read low-pass than short-read
45.9x more indels (2 to 1,000 bp)
51.6x more large structural variants (greater than 1,000 bp)

These are not incremental gains. The variant landscape visible to long-read sequencing and the variant landscape visible to short-read sequencing are substantively different.

The study includes a direct demonstration of what this means in practice. A 19-kilobase insertion conferring resistance to tomato spotted wilt virus (TSWV) in peanut was detectable with a single long read spanning the full region. Short reads mapped to the edges of the insertion. They did not capture the resistance variant. In a selection decision based only on short-read data, that disease resistance locus would be missed.
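The length argument can be sketched as a simple check: a single read can only span an insertion if it covers the inserted sequence plus some anchoring sequence on each side. The 500 bp anchor and the 25,000 bp read length below are illustrative assumptions, not values from the study.

```python
def can_span_insertion(read_len, insertion_len, min_anchor=500):
    """True if one read could cover the whole insertion plus min_anchor
    bases of flanking sequence on each side.

    min_anchor=500 is an assumed value for illustration only.
    """
    return read_len >= insertion_len + 2 * min_anchor


insertion_len = 19_000  # the 19-kilobase TSWV resistance insertion

print(can_span_insertion(150, insertion_len))     # False
print(can_span_insertion(25_000, insertion_len))  # True
```

Real structural variant callers also use split and clipped alignments, so this is only the simplest form of the argument.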

Does More Variant Data Actually Improve Selection Accuracy?

A larger variant set is only useful if it produces better decisions. The Lee et al. study tested this directly.

Researchers calculated similarity scores for known QTL regions, comparing breeding population lines against donor parent lines known to carry disease resistance loci. They examined three loci: late leaf spot (LLS), a second late leaf spot locus (LLS2), and tomato spotted wilt virus (TSWV) resistance.

Across all three loci, long-read low-pass sequencing produced significantly higher locus similarity scores than short-read sequencing, regardless of depth.

The mechanism is straightforward. Short reads had higher rates of missing data in those regions. The variant density was lower. The similarity calculation could not accurately reflect the relationship between breeding lines and donor parents because the data was not there. Long reads, with higher alignment confidence and broader coverage of complex regions, produced selection scores that more accurately identified lines carrying the QTL of interest.
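To make the effect of missing data concrete, here is a hypothetical similarity measure: the fraction of sites in a locus where a breeding line matches the donor parent, counted only over sites called in both. This is not the scoring method used in the study, which is not described in this post, and the allele calls are invented.

```python
def locus_similarity(line_calls, donor_calls):
    """Return (similarity, n_compared) for one locus.

    Each list holds one allele call per site; None marks missing data.
    Hypothetical measure for illustration, not the study's score.
    """
    compared = 0
    matches = 0
    for line_allele, donor_allele in zip(line_calls, donor_calls):
        if line_allele is None or donor_allele is None:
            continue  # site cannot be compared
        compared += 1
        if line_allele == donor_allele:
            matches += 1
    similarity = matches / compared if compared else None
    return similarity, compared


donor = ["A", "T", "G", "C", "A", "T", "G", "C"]
# The same breeding line, genotyped densely and sparsely.
dense = ["A", "T", "G", "C", "A", "T", "G", "T"]
sparse = [None, "T", None, None, None, None, None, "T"]

print(locus_similarity(dense, donor))   # (0.875, 8)
print(locus_similarity(sparse, donor))  # (0.5, 2)
```

With only two comparable sites, one mismatch halves the score, so the same line looks far less like the donor than it really is.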

More variants did not produce more noise. They produced more accurate selection.

For a breeder choosing which lines to advance, that distinction is the difference between keeping a resistant line and discarding it.

The Cost Argument: Why "Cheaper" Sequencing May Not Be

Long-read sequencing has historically carried a cost premium. That reputation is accurate when comparing high-coverage whole-genome sequencing approaches. It does not hold at low-pass scale across large populations.

The Lee et al. study included a cost per insight analysis using 2024 kit and consumable pricing. Researchers assigned weighted scores to different variant types based on biological impact: SNPs scored 1 point, short indels scored 2 points, and large structural variants scored 3 points. They then calculated the cost required to generate one unit of genomic insight for both sequencing approaches.

Across all 127 samples, long-read low-pass sequencing was 8.53 times more cost-efficient per unit of genomic insight than short-read low-pass sequencing. This calculation compares long-read low-pass against short-read low-pass specifically, using Clevenger Lab costs in Huntsville, Alabama, in 2024.
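The weighting scheme can be sketched as follows. The weights are those described above. The per-sample costs and variant counts are placeholders invented for illustration, not the study's figures, so the resulting ratio is not the reported 8.53.

```python
# Weights as described in the post: SNP = 1, indel = 2, large SV = 3.
WEIGHTS = {"snp": 1, "indel": 2, "sv": 3}


def insight_score(counts):
    """Weighted sum of variant counts."""
    return sum(WEIGHTS[kind] * counts.get(kind, 0) for kind in WEIGHTS)


def cost_per_insight(cost_per_sample, counts):
    """Cost required to generate one unit of weighted insight."""
    return cost_per_sample / insight_score(counts)


# Placeholder inputs, NOT study values.
long_read = {"cost": 100.0, "counts": {"snp": 1000, "indel": 200, "sv": 50}}
short_read = {"cost": 50.0, "counts": {"snp": 100, "indel": 10, "sv": 2}}

long_cpi = cost_per_insight(long_read["cost"], long_read["counts"])
short_cpi = cost_per_insight(short_read["cost"], short_read["counts"])

# Relative efficiency: how many times cheaper one unit of insight is.
relative_efficiency = short_cpi / long_cpi
print(insight_score(long_read["counts"]))   # 1550
print(insight_score(short_read["counts"]))  # 126
print(relative_efficiency > 1)              # True
```

In this toy case the long-read sample costs twice as much but still comes out ahead, because the weighted variant yield grows faster than the cost.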

The logic behind the number is direct. Short-read data filters out roughly half its reads at alignment. The reads that survive are too short to span most structural variants. A large share of the biology driving the traits you are selecting for is invisible to the method. Paying less per sample for data that cannot see much of the relevant variation is not cost-efficient. It only looks that way.

Long-read low-pass sequencing as a short-read alternative is not just technically superior for complex genomes. At population scale, when cost is measured by what the data actually delivers rather than what it costs to generate, it is more economical.

What We Measured and How

Study: Lee K, Korani W, Bentz PC, Pokhrel S, Ozias-Akins P, Harkess A, Vaughn J, Clevenger J. Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping. bioRxiv preprint. January 2025. doi: 10.1101/2025.01.09.632261. Not yet peer-reviewed.

Population: 127 individuals from an 18-parent, 16-way MAGIC peanut (Arachis hypogaea) breeding population. The population segregates for six known QTL including disease resistance loci for late leaf spot and tomato spotted wilt virus.

Sequencing platform: PacBio HiFi on the Revio sequencer. Libraries prepared using PacBio HiFi SMRTbell Prep Kit 96. Samples pooled in groups of 10-13 per run. Sequencing performed at SeqCenter, Pittsburgh, PA.

Data quality: Average yield of 4.25 Gb (gigabases) per sample. Average of 98.2% of base pairs above Phred score Q20. Mean read length averaged 7,927 bp across samples.

Depth achieved: Average 1.63x for long reads, 1.68x for short reads. Difference not statistically significant (p = 0.8237, paired t-test).

Reference structures: TifRunner assembly version 2 (TRv2) as single linear reference. A 16-genome pangraph constructed from 18 parental genomes for pangenome alignment analysis.

Analysis pipelines: Khufu and KhufuPAN (proprietary, HudsonAlpha Institute for Biotechnology) for SNP calling and pangraph alignment. pbsv version 2.9.0 for structural variant detection. snpEff version 5.2c for variant impact assessment. Minimap2 and samtools for coverage calculations.

Short-read comparison: Illumina NovaSeq 6000. Libraries prepared with Twist 96-Plex Library Prep Kit. Each sample sequenced twice to approximate long-read depth. Long-read and short-read depths left un-normalized, to avoid introducing randomization bias.

What was excluded: 9 of 136 samples removed due to low sequencing yield prior to analysis.

Cost methodology: Weighted variant scoring used for cost per insight calculation. SNPs = 1 point, indels (2-1,000 bp) = 2 points, large structural variants (greater than 1,000 bp) = 3 points. Cost per sample calculated from 2024 kit, consumable, and sequencing service pricing for the Clevenger Lab in Huntsville, Alabama. Full cost details available in study supplemental file.
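As a sanity check on the yield and depth figures listed in this section, average depth can be approximated as total sequenced bases divided by genome size. The sketch below reads the reported 4.25 as gigabases and assumes a genome size of roughly 2.6 billion base pairs for cultivated peanut. That genome size is an assumption made here, not a value taken from the study, and the study's own depth calculation may differ.

```python
def expected_depth(total_bases, genome_size_bp):
    """Approximate average depth: sequenced bases / genome size."""
    return total_bases / genome_size_bp


ASSUMED_GENOME_SIZE_BP = 2.6e9  # assumption, not from the study
avg_yield_bases = 4.25e9        # reported average yield, read as gigabases

depth = expected_depth(avg_yield_bases, ASSUMED_GENOME_SIZE_BP)
print(f"{depth:.2f}x")  # 1.63x
```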

What This Means for Your Breeding Program

The Lee et al. study used peanut, a challenging polyploid crop. The advantage it demonstrates in coverage, variant detection, and selection accuracy is expected to extend, in direction if not in exact magnitude, to any species where short-read misalignment is a constraint. That includes most complex plant genomes, polyploid species in particular, as well as animal species with large or highly repetitive genomes.

Long-read low-pass sequencing does not have to replace your entire short-read workflow at once. The data supports several targeted integration points:

Late-generation selection. As population sizes narrow, selection accuracy becomes critical. Short-read data routinely misses complex QTL because of coverage gaps in structural variant regions. Replacing short-read sequencing with long-read low-pass at late-generation testing stages captures SVs and improves selection accuracy at the stage where errors are most costly.

Founder line characterization. Running long-read low-pass on founder lines before initial crossing captures the full structural variant landscape of your germplasm. This creates a high-resolution reference for downstream selection across an entire breeding cycle without the expense of genome assembly.

SNP panel development. Panels built from long-read data capture variants that short-read panels miss entirely. A single long-read low-pass run of key founder lines can anchor a higher-resolution marker set for the full duration of a breeding program.

Population-scale genomic selection. For programs already running short-read low-pass across large populations, transitioning to long-read low-pass at comparable depth delivers more usable coverage, more variants, and more accurate selection scores per dollar spent on sequencing.

Long-read low-pass sequencing is also being explored beyond plant breeding. In animal breeding programs, conservation genomics, and human population research, the same core advantage applies. Short reads cannot reliably resolve complex genomic regions. Long reads, at low-pass depth, can.

Frequently Asked Questions

What is long-read low-pass sequencing? Long-read low-pass sequencing is a method that uses PacBio HiFi technology to sequence many samples at low depth, typically 1-3x, pooling multiple samples per sequencing run to reduce cost per sample. It preserves the variant detection advantages of long reads, including structural variant calling and complex genome resolution, while making population-scale genotyping economically feasible.

How does long-read low-pass sequencing compare to short-read sequencing for plant breeding? At matched sequencing depth, long-read low-pass sequencing consistently outperforms short-read approaches in genome coverage, gene space coverage, structural variant detection, and selection accuracy for complex QTL regions. In the Lee et al. 2025 study using 127 peanut samples, long-read low-pass achieved 55% filtered genome coverage vs. 17% for short reads at equivalent depth, and 18.7x more SNPs on a pangraph reference.

Can long-read low-pass sequencing detect structural variants? Yes. Structural variant detection is one of the primary technical advantages of long-read sequencing over short-read approaches. Short reads at approximately 150 base pairs cannot span most structural variants. In the Lee et al. 2025 study, long-read low-pass sequencing identified 51.6x more large structural variants than short-read low-pass at equivalent sequencing depth, on a pangenome graph reference.

Is long-read low-pass sequencing cost-effective for large breeding populations? At the 127-sample scale tested in the Lee et al. 2025 study, long-read low-pass sequencing delivered 8.53x more genomic insight per dollar than short-read low-pass sequencing, based on 2024 kit and sequencing service costs. The higher absolute sequencing cost per sample is offset by significantly more usable data, including structural variants that short reads cannot detect at all.

What level of genome complexity benefits most from long-read low-pass sequencing? The advantage is most pronounced in polyploid and repetitive genomes where short-read misalignment rates are high. The Lee et al. study validated the approach in tetraploid peanut, one of the most complex crop genomes. Any species with significant structural variation, duplicated subgenomes, or high repeat content will see meaningful differences in coverage and variant detection between long-read and short-read low-pass approaches.

What species has long-read low-pass sequencing been validated in for breeding applications? The foundational large-scale methodology paper (Lee et al. 2025) validated long-read low-pass sequencing in tetraploid peanut (Arachis hypogaea) across a 127-sample MAGIC breeding population. The method is applicable to any species where PacBio HiFi sequencing is feasible. Earlier small-scale studies have explored similar approaches in Brassica napus using Oxford Nanopore sequencing. Preliminary results also suggest that the method works well in blueberry, cannabis, and finger millet.

How does long-read low-pass sequencing affect bioinformatics workflows? Analysis pipelines differ from standard short-read workflows. The Lee et al. study used Khufu and KhufuPAN (HudsonAlpha) for variant calling and pangraph alignment, alongside open-source tools including pbsv for structural variant detection. Programs transitioning from short-read pipelines will need to adapt existing workflows or partner with a lab experienced in long-read analysis.

Conclusion

The problem with short-read sequencing in complex genome work is not cost. It is what disappears before your analysis begins.

When half your reads are discarded at the alignment filter because they are too short to map uniquely, the coverage that follows is built on a fraction of your data. The variants in repetitive regions, polyploid subgenomes, and structural variant loci do not appear in your output. They are not filtered because they are biologically irrelevant. They are filtered because the reads cannot see them.

Long-read low-pass sequencing removes that constraint. At matched sequencing depth, it produces more genome coverage, more gene space coverage, more structural variants, and more accurate selection scores for complex QTL. At population scale, it delivers more genomic insight per dollar.

The Lee et al. 2025 study is the first large-scale head-to-head validation of this approach using PacBio HiFi in an active breeding population. The results demonstrate that long-read low-pass sequencing is ready for routine integration into plant breeding programs today, as a direct short-read alternative for programs where coverage and structural variant detection matter.

Citation: Lee K, Korani W, Bentz PC, Pokhrel S, Ozias-Akins P, Harkess A, Vaughn J, Clevenger J. Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping. bioRxiv preprint. January 2025. doi: 10.1101/2025.01.09.632261. This manuscript has not been peer-reviewed.

Running GWAS, QTL-seq, or genomic selection on a complex genome? Talk to our science team about whether long-read low-pass sequencing is right for your population.


Veil Genomics