Key takeaways
Understand the fundamentals of analyzing SBX-D data, including visualization with the Integrative Genomics Viewer (IGV)
Explore the model behind the SBX-optimized open-source (XOOS) small-variant caller and its performance
Walk through Google Genomics' DeepVariant small-variant caller for SBX-D and its performance
Sequencing by expansion duplex data analysis
Join our first ever in-depth exploration of SBX duplex (SBX-D) data and whole-genome germline small variant calling.
In this exclusive webinar, experts will walk you through the underlying algorithms used to generate SBX-D consensus reads. You’ll also gain detailed intel into the resulting read characteristics and the calculation of key metrics, such as accuracy and coverage.
Expand your knowledge, and soon your science, just in time for the release of a publicly available whole-genome sequencing (WGS) Genome in a Bottle (GIAB) SBX-D data set.
A 56-minute in-depth exploration of sequencing by expansion duplex data and whole-genome germline small variant calling to support the release of a publicly available data set. Watch now.
Chapter segments
- (1:32) - Intro to SBX-D Webinar and Analysis Strategy
- (5:12) - SBX-D Read Characteristics and Metrics
- (16:31) - SBX-D workflow for germline applications
- (26:00) - XOOS germline small variant caller
- (40:33) - DeepVariant for SBX
- (56:59) - Q & A
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
Good afternoon, everyone, and welcome to this GenomeWeb webinar. I'm Ben Butkus, Editorial Director for GenomeWeb, and I'll be your moderator today. The title of today's webinar is "Germline Small Variant Calling Workflow for SBX Duplex Data," and is sponsored by Roche.
Our speakers today will be Jaedon Scott, Senior International Product Manager for Roche Diagnostics Solutions; John Mannion, Head of Computational Science, Molecular Lab Systems at Roche Diagnostics Solutions; Chen Zhao, VP of Computational Biology, Molecular Lab Applications for Roche Diagnostics Solutions; Mahdi Golkaram, Director of Bioinformatics, SBX Algorithm Development at Roche Diagnostics Solutions; and Andrew Carroll, Product Lead for Deep Variant at Google Genomics.
Attendees may type in a question at any time during the webinar. You can do this through the Q&A panel, which appears on the right side of the webinar presentation. If you look to the bottom tray of your window, there are a series of widgets to enhance your webinar experience.
With that, I'll turn it over to Jaedon Scott of Roche. Thanks for the introduction, Ben, and hello, everyone, and welcome again to our Germline Small Variant Calling Workflow for SBXD Data webinar.
We're so glad you could join us for this key SBX milestone, where we'll take a closer look at SBXD reads in the context of whole genome germline small variant calling to support the much-anticipated and highly requested release of SBX data.
As a quick reminder, we have multiple SBX resources available online, all of which can be found on our Sequencing by Expansion technology web page at the QR code you see on your screen, including our introduction to SBX webinar featuring Head of Chemistry and Inventor of SBX, Mark Kokoris, and John Mannion, Head of Computational Sciences.
We also have the Sequencing by Expansion seminal preprint available on the bioRxiv preprint server and presentations from this year's AGBT and ESHG conferences. If you haven't had a chance yet to check these out, we highly recommend them as a great place to learn more about the significant progress of SBX in just a matter of months.
We've got a packed agenda today, all centered around what you need to know about SBXD data before you dive in. We'll start off with an overview of our data analysis strategy and highlight the resources that will be made available after the webinar. We'll transition from there to SBXD read characteristics and metrics to set the stage for an overview of processing SBXD data for germline applications.
And then we'll move into a deep dive into the SBX-optimized open-source small variant caller and look ahead at what's next for the Zeus analysis tools. From there, Andrew Carroll will provide an in-depth walkthrough of the Deep Variant team's experience in working with SBXD data, optimizing Deep Variant for SBX, and the performance achieved to date. And then we'll close out today's presentation with a brief Q&A session.
Now, before we dive into the technical side today, I want to take a moment to highlight our analysis strategy along with the resources we'll be making available. Our SBX data analysis strategy is a two-pronged approach characterized by integrated and standalone analysis.
Our aim for integrated analysis, led by John Mannion and his team, is to accelerate relatively commoditized, compute-intensive analysis steps to ensure the inherent value of SBX data generation rates and re-generation modality is maximized in a seamless yet flexible way.
For standalone analysis, our internal teams, led by Chen and Mahdi, are developing a suite of SBX-optimized open-source analysis tools, or Zeus analysis tools for short. Zeus analysis tools will offer a comprehensive suite of free, open-source, efficient, and accurate analysis tools that will serve as a foundation for the SBX analysis ecosystem to be built upon.
We'd also like to note that the Zeus analysis tools, including the small variant caller that we'll be making available in support of this data release, are primarily written in C++, and they'll be made available under an Apache 2 license, providing transparency, accessibility, and adaptability.
In previous venues, we've illustrated our analysis approach through the workflow you see here, where we've locally accelerated multiple steps in the workflow and executed them in an online fashion, wherein full-length reads are generated and analyzed as sequencing is in progress.
Within the context of SBXD, we've demonstrated base calling, demultiplexing, consensus, and mapping and alignment via the STADI streaming functionality, allowing aligned reads to be generated in parallel with sequencing.
You'll also note that there are multiple data off-ramps within the workflow to provide maximum flexibility in meeting a diverse set of sequencing workflow and application development needs. And specifically, within the context of SBXD, this includes the three data paths shown here, consisting of generating raw reads, consensus reads, or mapped and aligned reads on instrument.
And each of these data outputs can be ingested by our suite of Zeus analysis tools where applicable.
In support of the data release, we'll make multiple SBXD analysis resources available, including the Zeus small variant caller, the supporting Zeus documentation, a whole genome SBXD Genome in a Bottle data set, including BAMs and VCFs, the Deep Variant model trained for SBXD, and a white paper covering an in-depth review of germline small variant calling for SBXD with Deep Variant. And with that, I'll hand it over to John to provide more background on the SBXD read characteristics and metrics. Thank you, Jaedon. As outlined, the focus today is on SBXD data and germline small variant calling.
As a brief reminder, just to put this in context, the SBX technology currently does support both duplex and simplex library structures and workflows. You can see that indicated on this slide, which Mark Kokoris covered in more detail in the intro webinar to SBX earlier this year.
The ultra-high throughput and longer read lengths of SBX simplex can be useful in a number of application areas where roughly Q20 accuracy is acceptable. To name a few that were previously mentioned, for probe counting applications, over 14 billion reads in an hour have been demonstrated.
And for applications which could benefit from longer reads, we've looked in particular at counts of reads above a certain length and observed, for example, greater than 200 million reads of 1 kb or longer in a one-hour technology demonstration run. Duplex, on the other hand, is targeted for applications demanding higher accuracies.
The duplex approach provides reads with insert lengths in the 200 to 300 base pair range and still high throughput, greater than 15 billion high-accuracy reads in a four-hour run or over 5 billion reads in an hour of sequencing. Today, we explore these duplex read characteristics with a particular focus on whole genome small variant calling.
Quick refresh on the workflow. You start with your sample, whether it's tissue, blood, or cell-free DNA, and run it through your library prep process. That includes ligation of a Y adapter and a hairpin adapter.
A linear amplification of this library is done here, as opposed to PCR, as we found this helps our accuracy, particularly homopolymer and indel accuracy, as well as our coverage uniformity.
After that, the amplified library is loaded onto the synthesis instrument shown in the lower right, where XP synthesis occurs, and this is followed by sequencing itself on the sequencing instrument.
SBX FAST, our amplification-free workflow, is very similar to the standard duplex approach with just a few modifications to optimize for speed. There's no linear amplification step shown after adapter ligation here, which does mean sample input amounts will be a little higher than for SBXD.
With the FAST protocol, we've so far seen comparable results to SBXD from a small variant calling perspective. Andrew will highlight some of these later today. And the latest time demonstrations reported at ESHG in May include a solo sample run, which went from the start of library prep to VCF in under five hours.
It is worth noting that between standard duplex and FAST workflows, both are easily accommodated by the same flexible technology and set of instruments. No special setup needed for FAST.
As we move here toward diving into the data, let's begin by taking a quick look at the expander sequencing process itself and the generation of high accuracy intramolecular consensus reads.
During a run, which may be anywhere from minutes to hours, expander molecules are drawn continuously into pores on the 8-million-sensor array. Each captured molecule processes through the pore in a rapid but highly controlled manner.
Here you see a depiction of a single pore in a bilayer and an expander with its tag codes and translocation control elements undergoing a measurement process. As raw data streams off the sensor, bases are called instantly. And for production workflow, high accuracy consensus reads are formed in real time as well.
The process starts with raw reads. The cartoon diagram of one is in the upper right. Hairpin, SID, and Y adapters are segmented and classified, and these stretches of sequence are labeled in the image. The sequence segments in blue represent two passes on the insert, one for each of the parent plus and minus strands.
Next, these two passes are aligned to one another using a reference-free pairwise alignment. And finally, consensus calls are made. Not all raw reads make it through a full second pass on the insert and to the end adapter. And these result in what we call partial duplex reads.
You can see those in the diagrams furthest to the right. Let's take a closer look now at two classes of reads, both full and partial duplex, as well as the base classes within each. In the upper left, you see a full-length read in which the entire genomic insert sequence is covered with duplex bases.
Given our raw read error rates, the vast majority of duplex base positions are concordant, meaning the base call from the plus and minus strands match. For these positions, we call a base and mark it as high quality. A small percentage of duplex positions will be discordant.
For these positions, bases are also called, though they are marked with our lowest categorical quality score. The partial duplex reads on the right also contain a stretch of duplex sequence for which concordant and discordant positions are treated in the same way.
However, this class of reads additionally contains simplex stretches of base calls for segments of the insert that were covered by the first pass but not by the second. And for those positions, we do once again call a base, but give them a medium quality Q score.
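The categorical logic just described (concordant duplex at Q39, simplex tail at Q22, discordant at Q5) can be sketched roughly as follows. The function and its input representation are invented for illustration and are not the production consensus caller, which also has to handle pairwise alignment and indels; at discordant positions this sketch keeps the first-pass call, as described later in the talk.

```python
# Illustrative sketch of the categorical quality logic, not Roche's implementation.
# Inputs are the two aligned passes over the insert; None marks positions
# the second pass did not cover (the simplex tail of a partial duplex read).

Q_CONCORDANT, Q_SIMPLEX, Q_DISCORDANT = 39, 22, 5

def consensus(plus_pass, minus_pass):
    """Return a (base, quality) pair for each insert position."""
    out = []
    for plus, minus in zip(plus_pass, minus_pass):
        if minus is None:                      # simplex: first pass only
            out.append((plus, Q_SIMPLEX))
        elif plus == minus:                    # concordant duplex base
            out.append((plus, Q_CONCORDANT))
        else:                                  # discordant: keep first-pass call,
            out.append((plus, Q_DISCORDANT))   # mark with the lowest quality bin
    return out

calls = consensus("ACGT", ["A", "C", "T", None])
# concordant A/Q39, concordant C/Q39, discordant G/Q5, simplex T/Q22
```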
How do we use each of these categories of base calls when providing high-level summary metrics for a particular run? For accuracy, we focus on the concordant duplex bases, the Q39 bases that have been referred to elsewhere.
To be consistent, we also anchor throughput metrics such as coverage, base throughput, or sample throughput around concordant duplex bases.
As Chen, Mahdi, and Andrew will show, these other base categories certainly can have value in downstream variant calling steps, but simplex and discordant bases would not be included in our own estimates of the number of 30x genomes that could be run per hour, for example.
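For intuition, here is a back-of-the-envelope sketch of how a genomes-per-hour estimate can be anchored on concordant duplex bases only. The reads-per-hour figure comes from the talk; the mean concordant duplex bases per read is an assumed round number for illustration, not a specification.

```python
# Rough illustration only: anchoring throughput on concordant duplex bases.

reads_per_hour = 5e9          # >5 billion duplex reads/hour (figure from the talk)
duplex_bases_per_read = 200   # assumed mean concordant duplex bases per read
genome_size = 3.1e9           # approximate human genome size in bases
target_depth = 30             # 30x coverage

duplex_bases_per_hour = reads_per_hour * duplex_bases_per_read
genomes_per_hour = duplex_bases_per_hour / (genome_size * target_depth)
print(f"~{genomes_per_hour:.1f} 30x genomes per hour")  # ~10.8 with these assumptions
```

Simplex and discordant bases would add further coverage on top of this, but are deliberately excluded from the estimate, as stated above.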
Finally, the insert length metric is exactly as it sounds and includes all duplex and simplex bases in the consensus read. We believe this to be the most relevant length metric for SBXD, given that one of the main benefits of insert length is in correctly mapping the high accuracy bases to difficult-to-map regions of the genome.
In the lower left of this screen is an image constructed from 100 randomly selected duplex reads from an SBXD run. Each row represents one read, where again, the three adapter types can be seen in different colors on the left, middle, and the right sides, and passes on the genomic insert sequence are in the two shades of blue.
The full and partial duplex reads are segregated and also sorted by raw read length within each of those groups. You can quickly get a sense for the raw read length distribution by looking at that image. You also see there is a distribution of insert lengths, which reflects that of the fragment lengths that were input into the library prep process.
The plot on the upper right shows the insert length distribution for all reads, both full and partial duplex, from a recent one-hour run. This has a mean well above 200 base pairs and a long tail out to the right. Is there any difference in insert lengths between the full and partial duplex reads? Yes.
In fact, we do see differences between those populations. If you break those down, as is done in the lower right, you'll see that while the full duplex distribution is still above a 200 base pair mean insert length, the partial duplex insert lengths are even longer. We believe this is driven mostly by selection bias.
For longer inserts, a raw read is simply less likely to cover the insert and reach the end of the Y adapter on the second pass. Now let's look at accuracy.
As has been shown before, empirical concordant duplex base accuracy is in the high Q30s, with a breakdown between insertions, deletions, and substitutions shown in the lower left plot of this screen. The simplex base accuracy is typically in the low Q20s, as it was here.
The simplex error profile is shown in the lower right to a degree mirroring the relative breakdown of errors observed for concordant duplex bases, but obviously at higher absolute rates.
As you may be considering downloading and inspecting the SBX data, let's now take a look at how SBXD reads and the different base categories appear in standard file formats such as FASTQ and BAM. Here, a hypothetical read appears as it would in a FASTQ.
Information is arranged as you would expect, with the header present on the first line containing a read name, as well as the read string itself and a quality score vector on line four. You'll notice characters corresponding to specific Phred values in that Q score vector.
We look first at the concordant duplex bases, which are colored here in orange in the read, as are the corresponding categorical, discretized quality values, represented with an ASCII character that corresponds to a Phred score of Q39. We're often asked about the distribution of quality values within the highest category bin.
Internal quality predictor models do provide us with a view of these distributions, indicating that greater than 85% of concordant duplex bases are predicted to be above Q30.
Bases in the simplex tail of this hypothetical read are now highlighted in orange, as are the quality values which were chosen to represent them at Q22. Finally, several discordant bases can be seen in this read, and these are marked as a Q5 for our lowest quality bin.
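Assuming the standard Phred+33 FASTQ encoding, the three categorical bins correspond to the ASCII characters 'H' (Q39), '7' (Q22), and '&' (Q5), so the base categories can be recovered directly from the quality string. The helper below is a hypothetical sketch, not part of the Zeus tools:

```python
# Sketch: classify each base of an SBX-D read from its FASTQ quality string,
# assuming standard Phred+33 encoding and the three categorical bins above.

CATEGORY = {39: "concordant duplex", 22: "simplex", 5: "discordant"}

def classify_quals(qual_string):
    """Map each quality character to its SBX-D base category."""
    return [CATEGORY.get(ord(c) - 33, "other") for c in qual_string]

# Q39 -> 'H', Q22 -> '7', Q5 -> '&' under Phred+33
print(classify_quals("HH&H77"))
# ['concordant duplex', 'concordant duplex', 'discordant',
#  'concordant duplex', 'simplex', 'simplex']
```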
To maximize the information propagated downstream and available for use in secondary analysis, for discordant bases, deltas are stored in what we refer to as a lossless encoding. You can see that present here in the header, and it's structured in such a way that it will populate the YC tag in a BAM.
The information in this encoding, which is further described on the right, allows a user or downstream variant caller to leverage evidence originally present in both strands at discordant positions and weigh that evidence appropriately, should the caller be designed to do so.
In this particular YC tag, you can see information on whether a simplex base existed on the other end. Here we see 57 bases of simplex, as well as the three discordant bases in the read, with the codebook shown there in the lower right-hand corner.
Though we're not showing a BAM here explicitly in the interest of time, this information will also be present in BAMs in the YC tag. With today's data share, we do hope potential users will be able to explore that data for themselves.
And to that end, for those interested in doing so, BAMs from an example one-hour Genome in a Bottle SBX-D run will be accessible as part of our data share event, along with VCFs generated by the variant callers, which will be discussed shortly.
From that run, you can expect to find BAM file sizes a little over 50 gigabytes for each of the seven multiplex samples, and each sample containing approximately 800 million SBXD reads. This is consistent with throughput numbers provided earlier in the year of the 15 billion reads in four hours or over 5 billion in an hour.
And with those data generation rates, it is worth reiterating what Jaedon mentioned. In the product scenario, integrated analysis will include each of those steps you see here in dark blue.
As a latest example of that work under development, for the SBX FAST solo runs reported this May, sorted BAMs were available for downstream secondary processing approximately one minute after completion of the sequencing waveform itself.
Before moving on to the next section on open-source tools and variant calling, we would like to briefly and once again acknowledge our collaborators, including NVIDIA, particularly the Parabricks team, for their effort in optimizing select Parabricks software components and libraries. For the remainder of today, we'll focus on the open-source secondary analysis tools.
To that end, it is my pleasure to hand it off to Chen Zhao, who will introduce the Zeus tools. Thank you, John, for the nice introduction on SBX reads.
As Jaedon alluded to earlier, when it comes to analysis tools, we're taking the open-source strategy to empower the community and to lower the barriers for adoption. Here, we illustrate two example analysis workflows, one on WGS germline and one on WGS somatic tumor normal.
On the germline side, today we'll be deep diving into the Deep Variant and Zeus small variant callers. But on the Roche team, we're also actively working on developing callers for repeat expansion, copy number detection, ASCII detection, and others.
On the somatic side, our team is taking a very similar approach by leveraging Mutect2 and downstream machine learning models for SNV and indel calling. In addition, our team will also be releasing an allele-specific copy number caller, ASCII caller, and biomarkers such as TMB and MSI in the future.
To further understand why we're developing SBX-specific analysis tools, first, SBX generates an unprecedented amount of data, more than 15 billion duplex reads in just a few hours. Many existing tools do not scale to this data volume.
Second, SBX produces single-end, variable-length reads. The data set shown in this webinar ranges from about 100 base pairs all the way to above 500. Therefore, assumptions about paired-end or fixed read lengths do not apply here.
Third, SBX has a relatively uniform error rate across the entire read, and the reads are a mixture of full-length and partial duplex reads. Therefore, it's important to take all these factors into consideration when building downstream variant callers.
Not surprisingly, the SBX platform will provide industry-standard file formats, namely FASTQ.gz, BAM/CRAM, and VCF. At the end of the webinar, we'll provide a QR code on how to access SBX BAMs and Zeus small variant calling VCFs.
It is a common question to ask what off-the-shelf tools can be applied to SBX. Broadly speaking, they fall into three categories. In the first category, they are directly compatible with SBX with zero or very little modification.
For example, the DNA aligners BWA-MEM and Giraffe and the RNA aligner minimap2 all work directly on SBX. In the second category, we have tools that need optimization. The GATK suite's HaplotypeCaller and Mutect2 do work on SBX. However, specific parameter settings are needed.
Mahdi will cover more in his section. Same with Deep Variant, which Andrew will cover in his section. In the last category, there are examples that should not be used for SBX. Picard MarkDuplicates is popular, but when applied to single-end reads, it will inflate the duplication rate.
Samtools mpileup does not handle insertion base qualities correctly and will produce elevated insertion errors. While SBX data are fully compatible with BWA-MEM, we have achieved some of our best results using the pan-genome read aligner Giraffe from the vg toolkit.
The pan-genome uses genomic variation from the population to augment the linear reference genome, creating a graph in which different haplotype sequences can be spelled out by taking different paths through the graph.
The variation information is useful during read mapping because it can provide additional clues for the correct placement of reads containing variation, including some reads that might otherwise go unmapped due to the high divergence from the reference.
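As a toy illustration of that idea, a pan-genome graph can be thought of as sequence-labeled nodes with branches at variant sites, where each path through the graph spells one haplotype. The structure below is invented for the example and is far simpler than the encodings real tools such as vg use.

```python
# Toy pan-genome graph: node -> (sequence, successor nodes).
# The branch at n1 encodes a SNP site with reference allele A (n2)
# and alternate allele G (n3).
graph = {
    "n1": ("ACGT", ["n2", "n3"]),
    "n2": ("A",    ["n4"]),
    "n3": ("G",    ["n4"]),
    "n4": ("TTCA", []),
}

def spell_paths(node, prefix=""):
    """Enumerate the haplotype sequences spelled by all paths from `node`."""
    seq, nexts = graph[node]
    if not nexts:
        return [prefix + seq]
    paths = []
    for nxt in nexts:
        paths.extend(spell_paths(nxt, prefix + seq))
    return paths

print(spell_paths("n1"))   # ['ACGTATTCA', 'ACGTGTTCA']
```

A read carrying the alternate G allele matches the second path exactly, which is why graph mapping can place such reads where a linear reference would show only mismatches.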
Having said that, for empirical base accuracy calculation, we still recommend mapping against a linear genome using BWA, as Giraffe currently does not left-align indels, which artificially inflates the error rate calculated against the linear genome.
We have also optimized our pan-genome alignment pipeline to improve variant calling accuracy, including, first, incorporating recent advances in personalizing the pan-genome to each sample; second, using pan-genome information to aid in duplex consensus base calling.
Finally, optimizing the human pan-genome as a mapping target by correcting errors and augmenting it with missing sequences.
The goal of personalizing the pan-genome is to filter the population-level pan-genome down to the variation that is present in the sample's own genome, which is the most helpful variation for read mapping. We use methods from the vg toolkit to personalize the pan-genome using k-mers from the sample's reads before mapping.
This process makes read alignment more accurate and faster, which compensates for the additional upfront cost of personalization. We can also use the pan-genome to inform duplex consensus base calls.
After alignment and ahead of variant calling, we can check whether either of the discordant duplex base calls is present in the pan-genome. If so, we allow the pan-genome to cast a tie-breaking vote to resolve the discordance.
In the case of discordant SNPs between read one and read two, the default intramolecular consensus always trusts the base in R1, as the empirical accuracy for read one is slightly higher.
We can see that in the top example, we observe a T/A mismatch, and the T is kept for the consensus base and marked as a low Q. However, in the second example, we still have the same T/A mismatch, but in this case, the A base on R2 matches a known haplotype in the pan-genome.
So the A is chosen as the consensus base, and it's marked as Q22 to indicate it's a simplex-level base. In the case of an insertion, it's also possible that this process removes the base from the aligned read.
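A minimal sketch of this tie-breaking vote follows, assuming we already have the set of alleles known to the pan-genome at the position; the function and its signature are hypothetical, not the production logic.

```python
# Sketch of the pan-genome tie-breaking vote at a discordant position:
# if exactly one of the two strand calls matches a known pan-genome allele,
# take that call at simplex quality (Q22); otherwise fall back to the
# R1 base at the lowest quality bin (Q5), per the default consensus rule.

def resolve_discordant(r1_base, r2_base, pangenome_bases):
    """pangenome_bases: set of alleles known at this position."""
    r1_known = r1_base in pangenome_bases
    r2_known = r2_base in pangenome_bases
    if r1_known != r2_known:                  # the pan-genome breaks the tie
        return (r1_base, 22) if r1_known else (r2_base, 22)
    return (r1_base, 5)                       # no vote: trust R1, low quality

print(resolve_discordant("T", "A", {"C", "G"}))   # no vote -> ('T', 5)
print(resolve_discordant("T", "A", {"A", "G"}))   # A is known -> ('A', 22)
```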
The resulting reads are more likely to be accurate at the base level, which can improve the performance of local assembly in downstream variant calling. There are a number of known assembly errors in GRCh38. False duplications are particularly pernicious, since they can mimic low mappability.
We have optimized the reference pan-genome for mapping by masking all these false duplications. We estimate this eliminates about 7% of the SNP errors in the CMRG benchmark. In addition, we have found that even in the pan-genome, variant calling is often hampered by mismapped reads from missing paralogous sequences.
We have ongoing work to construct decoy sequences for the missing paralogs. While not used in this webinar, we expect that adding decoy sequences in a careful manner will further improve variant calling F1 scores.
To finish on the aligner side, to preserve the lossless-encoding YC tag, one would need to use the -C option in BWA or Parabricks fq2bam. To preserve the YC tag in vg Giraffe, we use the comment-set-text option, and for Parabricks Giraffe, the copy-comment option.
The prune-low-complexity option on the right-hand side changes the behavior of Giraffe when it's projecting the graph alignment onto the linear reference. In low-complexity sequences, Deep Variant performs better if Giraffe prioritizes a consistent, parsimonious alignment against the linear reference.
Empirically, this gives Deep Variant a boost in its performance. Another common step in any germline variant calling workflow is read deduplication. For SBX, duplication can occur during linear amplification or during sequencing.
We identify duplicates by comparing both the start and end position of the mapped reads. Note that the duplicates can be either partial duplex, as shown on the first read, or full duplex read.
To accommodate sequencing errors, we optionally allow a positional wiggle room of up to four base pairs during the duplicate marking process. The wiggle room can be empirically derived depending on the sequencing depth and other factors.
Wiggle room should be used with caution in high-depth sequencing data to avoid marking unique molecules as duplicates, a.k.a. collisions. In this webinar, we use a wiggle room of zero.
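As a rough sketch of marking duplicates on single-end reads by comparing start and end coordinates with a wiggle room, one might write something like the following. This is a simplified illustration, not the Zeus duplicate marking tool; a production implementation would work on sorted, indexed alignments rather than scanning linearly.

```python
# Simplified single-end duplicate marking on (start, end) mapped coordinates,
# with an optional positional wiggle room in base pairs.

def mark_duplicates(reads, wiggle=0):
    """reads: list of (start, end) tuples. Returns a parallel list of booleans,
    True where a read duplicates an earlier-seen unique molecule."""
    kept = []          # representative (start, end) per unique molecule
    flags = []
    for start, end in reads:
        dup = any(abs(start - s) <= wiggle and abs(end - e) <= wiggle
                  for s, e in kept)
        flags.append(dup)
        if not dup:
            kept.append((start, end))
    return flags

# wiggle 0: only exact coordinate matches are duplicates
print(mark_duplicates([(100, 350), (100, 350), (102, 351)], wiggle=0))
# [False, True, False]
# wiggle 4: the near-identical third read is now also marked
print(mark_duplicates([(100, 350), (100, 350), (102, 351)], wiggle=4))
# [False, True, True]
```

Note that both coordinates must agree, which is the key difference from paired-end-oriented tools that effectively consider only one end of a single-end read.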
To recap earlier, Picard MarkDuplicates was developed for paired-end reads, and when applied to SBX, it will generally inflate the duplication rate and only consider one end of the reads.
The purple read in the above example falls outside the maximum allowed wiggle room and should be considered a separate unique molecule from the blue reads. The Zeus duplicate marking tool will be made available closer to launch.
In addition to duplicate marking capability, we also support positional intermolecular consensus. In the case of high-sensitivity applications, such as whole genome tissue-informed MRD, we believe positional consensus could further boost the final read accuracy and enhance sensitivity.
Finally, I want to emphasize again that for SBXD, coverage is reported on concordant duplex bases. In this case, simplex bases do come free in our data. And even with the lower accuracy at Q22, they are still extremely useful for mapping and alignment, as well as for boosting the total coverage.
In the webinar data, a 30x duplex genome does have an additional 10x contribution from the simplex bases. This additional coverage can be effectively leveraged downstream by appropriately weighting the simplex evidence in the variant calling step. With that, I will hand over to Mahdi for the Zeus small variant calling part.
All right, thank you, Chen. Hi, everyone. My name is Mahdi Golkaram, and I lead the SBX secondary analysis algorithm development team here at Roche. Today, I'm excited to walk you through a demo of the SBX germline small variant caller, a free and open-source tool designed to help customers make a smooth transition to SBX.
With a great introduction from Chen and John, we are now ready to dive into some of the more advanced topics in SBX data analysis, starting with a closer look at the Zeus germline small variant caller.
Looking back at the SBX duplex data, as John mentioned earlier, we can break it down into three distinct base categories: duplex concordant bases, simplex bases, and duplex discordant bases. The proportion of each category varies depending on the sample type and the specific chemistry conditions.
But for the representative data set we'll be reviewing today, roughly 75% of bases are duplex concordant. These bases are highly reliable, capable of achieving Q40 accuracy, and are well-suited for both somatic and germline applications. In addition to duplex bases, about 25% of the bases are typically simplex.
While these have a higher error rate, they're still quite usable in many germline applications, especially when per-base accuracy is less critical or when the true variant is expected to appear at a higher allele frequency. Finally, around 1% to 2% of the bases fall into the discordant category.
These are the least accurate, as we are confident that either the forward or reverse strand contains an error at that position. That said, when used thoughtfully, discordant bases can contribute value, particularly in boosting coverage and supporting variants in challenging regions such as homopolymers and tandem repeats.
A natural question that comes up is, how do we measure SBX accuracy? Well, there are several ways, of course, we can measure this. We are providing two options. Option one would be the case where a diploid genome assembly is not available for that particular sample, which usually is the case for most of the samples we work with, including the results we're going to report today.
So here, the reads can be mapped to the human reference genome using the BWA-MEM aligner. Next, we ignore discordant and simplex bases and apply a minimum mapping quality of four to ignore mapping artifacts.
In this case, it's also important to remove known germline variants from the analysis to avoid penalizing duplex concordant accuracy due to the presence of a true variant. That's why we usually use samples with well-curated, high-confidence germline variant calls to obtain reliable accuracy measurements.
However, if the genome assembly for that sample is available, we can map directly to that genome and not worry about the presence of germline variants. An example would be the HG002 cell line, where the telomere-to-telomere diploid assembly is available, which can inform us of SBX-D accuracy as well.
BEST is a widely used tool for measuring sequencing accuracy, but we are also planning to release a new open-source tool as part of the Zeus software stack as we move closer to our launch date. There are a few key details to keep in mind when measuring accuracy in SBX whole genome sequencing of Genome in a Bottle samples.
In this schematic, I am showing duplex concordant bases with blue squares, as highlighted on top left, discordant bases with red squares, and simplex bases with white squares. Likewise, mismatches are shown with the orange letters while matching bases with black letters.
First, as mentioned on the previous slide, we focus strictly on high-confidence regions and avoid regions that contain germline variants. Second, if reads are aligned to the human genome reference rather than a personalized genome assembly, it is essential to account for potential mapping errors.
To reduce this risk, we filter out reads with low mapping quality using a MAPQ threshold greater than four. Third, errors from simplex and discordant bases should be excluded from accuracy calculations.
For example, in this schematic shown here, we evaluate 34 duplex bases and observe two concordant duplex substitutions and one deletion. That gives us an error rate of three out of 34, which is then converted to a Phred-scale quality score.
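The error-rate-to-quality conversion is the standard Phred formula, Q = -10 * log10(error rate); a quick check of the 3-in-34 example:

```python
import math

def phred(error_rate: float) -> float:
    """Convert an observed error rate to a Phred-scale quality score."""
    return -10.0 * math.log10(error_rate)

# The schematic's example: 3 duplex errors observed over 34 evaluated duplex bases.
q = phred(3 / 34)
print(round(q, 1))  # ~Q10.5 for this tiny toy window
```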
Here you see also simplex and discordant substitutions, but we ignore those when measuring duplex concordant accuracy. Finally, it is especially important to be cautious about homopolymer regions.
Any read that doesn't span the full length of the homopolymer or contains a discordant base within that region is excluded from our accuracy measurements to avoid inflating the error rate. In this example, we are seeing two reads with all bases being concordant and duplex, and they span the homopolymer region.
One has an error, and one has no error. The error here in particular is a deletion. So if you add these up, the accuracy works out to 0.5. Visualization of NGS data is also a very powerful way to understand both the data and the behavior of a variant caller.
And one of the widely used tools for this purpose would be the Integrative Genomics Viewer, or IGV for short, which many of you are already familiar with. SBX data in BAM format can be visualized in IGV without any special requirement. However, to enhance the interpretation of specific features, we have developed an upgraded version of IGV.
This version allows users to color-code different base categories, making it easier to distinguish between them. For instance, as shown on the left, the default IGV view often displays numerous insertions in SBX data, but many of these are not supported by duplex concordant bases, and this can be misleading.
In contrast, with SBX mode enabled on the right, users can clearly identify features such as the simplex tail, as shown here, and distinguish high-quality variants from low-quality insertions that are only supported by discordant bases.
You can see an example highlighted on the left in purple: a low-quality discordant insertion. After enabling SBX mode, you can see that particular insertion is actually shaded light gray. This SBX version of IGV is now available and can be accessed in IGV version 2.19.6 and newer.
Alright, now we are finally ready to dive into how to perform germline small variant calling in SBX data. As described earlier in the webinar, the overall workflow for germline variant calling includes intramolecular consensus generation, followed by reference-based alignment, duplicate removal, and finally variant calling.
When it comes to calling variants in SBX data specifically, our initial goal is to evaluate the performance of popular off-the-shelf tools and understand how they handle unique characteristics of SBX. As Chen mentioned earlier, some tools like GATK may require tuning to account for SBX's specific error profile.
To address this, the XOOS germline small variant caller builds on the GATK core engine. This includes key steps such as active region detection, de novo local assembly, per-read likelihood calculation using a pair-HMM method, and genotyping. On top of that, we've developed a machine learning layer that enhances the accuracy.
In this step, we extract a wide range of genomic features, both from the GATK HaplotypeCaller VCF and the reassembled, or evidence, BAM file. These features are then fed into a LightGBM model, which performs variant genotyping and filters out potential false positives.
One example of how GATK can behave incorrectly on SBX data is in estimating variant allele frequency. Because SBX includes discordant and simplex bases, inaccurate handling of these can actually lead to inflated variant allele frequency measurements. In this example, we are looking at a heterozygous germline insertion.
The issue arises when discordant insertions are mistakenly counted alongside duplex-supported bases, inflating the apparent VAF and leading to incorrect interpretation. Throughout our benchmarking, we identified a set of modifications to GATK HaplotypeCaller to better suit SBX data.
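To make the inflation effect concrete before moving on, here is a toy comparison (all counts hypothetical) of a VAF computed from duplex-concordant support only versus one that naively counts discordant insertions as alt support:

```python
def vaf(alt: int, ref: int) -> float:
    """Variant allele frequency = alt support / total support."""
    return alt / (alt + ref)

# Hypothetical pileup at a heterozygous insertion site:
ref_duplex = 20        # duplex-concordant reads supporting the reference
alt_duplex = 18        # duplex-concordant reads supporting the insertion
alt_discordant = 12    # discordant insertions that should NOT count as evidence

clean = vaf(alt_duplex, ref_duplex)                      # ~0.47, consistent with het
inflated = vaf(alt_duplex + alt_discordant, ref_duplex)  # 0.60, misleadingly high
print(round(clean, 2), round(inflated, 2))
```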
Our primary goal was to boost sensitivity, allowing the downstream machine learning step to handle any potential false positives. For example, we set a minimum mapping quality of one, which helps capture more potential variant candidates.
We also set a minimum base quality of six, which effectively excludes discordant bases from the assembly engine while still including both duplex and simplex bases. In addition, we enabled dynamic read disqualification and adaptive pruning, both of which can contribute to improved accuracy and runtime performance.
Finally, we've introduced a new option to increase the sensitivity of active region detection. This is done by multiplying the likelihood function by a constant factor, with a value of five being optimal based on our internal evaluations. This modified version of HaplotypeCaller is also available for download at the link below, and we encourage users to try it out.
I've also included the full command-line options needed to run GATK HaplotypeCaller in sensitive mode for SBX data. This should save you the trouble of having to reference the help page. Alright, now that we have a set of active regions, a set of candidates, the next step is to extract features that will inform our machine learning model.
I won't go over every single feature for the sake of time, but I want to highlight a few key ones that give you a better idea of what's important to distinguish a false positive from a true positive. Some of these features include MAPQ-weighted allele frequency, and the reference and alt support observed in the BAM and reported by GATK in the VCF.
Same with read depth. We also consider the distance of a variant to the end of the read, as well as various contextual features such as haplotype complexity, the number of unique k-mers, the 5-mer base context around the variant, and information about short tandem repeats, homopolymers, and local variant density.
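A few of the contextual features just mentioned can be sketched as plain functions; the names and exact definitions here are mine, for illustration only, not the caller's actual implementation:

```python
def distance_to_read_end(variant_pos: int, read_start: int, read_end: int) -> int:
    """Distance from a variant to the nearer end of the read."""
    return min(variant_pos - read_start, read_end - variant_pos)

def homopolymer_length(seq: str, pos: int) -> int:
    """Length of the homopolymer run containing position pos."""
    base, left, right = seq[pos], pos, pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1

def unique_kmers(seq: str, k: int = 5) -> int:
    """Number of distinct k-mers in a local window, a rough complexity measure."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

print(homopolymer_length("CTAAAAAAG", 4))  # run of 6 A's
print(unique_kmers("AAAAAAAA"))            # low complexity: only 1 distinct 5-mer
```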
These features together help the model make more accurate calls. As mentioned earlier, both simplex and discordant bases can play a key role in germline variant calling. On the left, you can see an example involving a discordant base in a homopolymer region, where the reference allele contains six A's, but the read supports seven A's.
In this case, during the feature extraction, we assign one simplex support to each allele. When combined with other simplex signals from neighboring reads, this allows our machine learning model to evaluate simplex and duplex support for the reference and alternative allele separately, rather than merging them into a single aggregated feature.
The lossless encoding method we previously discussed, and specifically the use of the YC tag, can especially be very helpful here. It allows us to extract both alleles from such reads accurately, preserving the full context needed for downstream analysis.
Let me walk you through a few examples of how we count duplex reference support and duplex alternative allele support in a homopolymer region. In the first example, we have true ref support. The read fully spans the homopolymer region, matches the reference allele, and all the bases are concordant and duplex.
So this read contributes one to the reference duplex count. In the second example, the read does not span the entire homopolymer region. In the third example, the read lacks full duplex support across the entire homopolymer region. In the next example, there is a discordant base at the end of the homopolymer region.
In the following example, there is a discordant base at the beginning of the homopolymer region. And finally, in the last example, we see both a discordant base and a mismatch within the homopolymer region. So again, it doesn't count as valid duplex support. Now let's talk about training our variant caller.
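Before moving on: the spanning cases walked through above reduce to one rule, which can be sketched as follows (the per-base encoding is invented for illustration; a read contributes a duplex count only if it fully spans the region with duplex-concordant bases throughout):

```python
def duplex_support(read_bases, region_start, region_end, read_start):
    """Classify a read's support over a homopolymer region.

    read_bases: list of (category, matches_reference) tuples, one per aligned
    base, where category is "duplex", "simplex", or "discordant".
    Returns "ref", "alt", or None (read contributes no duplex count).
    """
    read_end = read_start + len(read_bases) - 1
    if read_start > region_start or read_end < region_end:
        return None  # read does not span the full homopolymer
    window = read_bases[region_start - read_start: region_end - read_start + 1]
    if any(cat != "duplex" for cat, _ in window):
        return None  # simplex or discordant base inside the region
    return "ref" if all(match for _, match in window) else "alt"

# A read spanning a 4-base homopolymer (positions 10-13) with concordant,
# reference-matching duplex bases contributes one reference duplex count.
print(duplex_support([("duplex", True)] * 6, 10, 13, read_start=9))  # ref
```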
In theory, training a model simply requires a set of high-quality examples that help it learn the difference between a true biological signal and noise. GIAB samples are ideal for this, though you cannot train on all seven of them, in order to avoid overfitting and ensure a clean evaluation.
We leave HG001 out of our training when performing variant calling on HG001. Instead, we train the model on the remaining GIAB samples. The same leave-one-out approach can be applied to any of the other samples depending on the evaluation goals.
If the model is intended to be used at a specific sequencing depth, it's important to include training data at that depth. That said, if you're unsure, we recommend downsampling a high-coverage library to create datasets at various depths so the model is exposed to a broad range of coverages during training.
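The downsampling recommendation amounts to choosing per-depth subsampling fractions from one deep library; the depths below are arbitrary examples:

```python
def downsample_fractions(source_depth: float, target_depths):
    """Fraction of reads to keep to hit each target depth from one deep library."""
    fracs = {}
    for d in target_depths:
        if d > source_depth:
            raise ValueError(f"cannot upsample {source_depth}x to {d}x")
        fracs[d] = d / source_depth
    return fracs

# e.g. a 60x duplex library downsampled to expose the model to several coverages
print(downsample_fractions(60, [15, 30, 45]))  # {15: 0.25, 30: 0.5, 45: 0.75}
```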
The training process itself is straightforward. We run the SBX HaplotypeCaller to generate variant calls, evaluate true positives and false positives against the known truth set, and then train a LightGBM model.
Once the model is trained, we use it to regenotype and filter variants, such as those in HG001, and then benchmark against a truth VCF and calculate the F1 score. Beyond GIAB, we can also expand training and evaluation to the T2T reference if benchmarking data is available.
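The benchmarking arithmetic here is just precision and recall against the truth VCF, combined into an F1 score; the counts below are made up for illustration:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, as reported by benchmarking tools."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts: 9,970 true positives, 20 false positives, 30 missed variants.
print(round(f1_score(9970, 20, 30), 4))  # 0.9975
```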
HG002 is a good example, as it has a curated T2T truth set that supports such evaluations. Alright, here are some results. On the left, you can see the F1 scores for both SNVs and indels for a one-hour SBXD run.
For SNVs, the score ranges from 99.73 to 99.83, and indels range from 99.65 to 99.77, demonstrating strong performance. We also downsampled all the samples to 30x deduplicated duplex depth. And as you can see, our F1 scores remained consistently high across the board.
All of the evaluations shown here are done using RTG software and are based on the GIAB high-confidence benchmark version 4.2.1. In conclusion, we demonstrated that the XOOS germline variant caller delivers accurate germline variant calling on SBX duplex sequencing data.
To further improve performance, we partnered with the Parabricks team at NVIDIA to accelerate GATK. This allows users who need a faster turnaround time to run our variant caller in an accelerated manner, going from BAM to VCF in just 25 minutes.
Beyond accuracy and speed, the XOOS software is fully open source and available to our users for code review. This commitment to open source not only provides transparency, but also paves the way for future collaboration and community-driven contributions to the tools and software we develop. To wrap up my part, I'm excited to share some good news.
We'll soon be releasing much more than just a germline small variant caller. Stay tuned for more updates on SBX and future XOOS products, which will support a broad range of applications across both germline and somatic workflows.
This includes whole-genome tumor-normal sequencing to identify SNVs, indels, MNVs, CNVs, and structural variants; targeted workflows for detection of small variants, CNVs, structural variants, and key biomarkers; and RNA sequencing for gene and isoform detection, as well as fusion detection.
We're also expanding support for additional germline variant types from whole-genome and targeted workflows, including CNVs, structural variants, VNTRs, STRs, phasing, specialized variant callers, joint calling, and much more. And with that, it's my pleasure to hand it over to Andrew for an overview of DeepVariant for SBX duplex germline variant calling.
Hi, I'm Andrew Carroll. I'm a product manager in Google Research and a genomics domain specialist. And I'm honored today to tell you about our experiences working with SBX data, training an optimized version of DeepVariant for SBX, some specific insights with respect to how to operate on SBX data, and finally, a set of benchmark comparisons with other sequencing technologies.
So for some background, our team's mission is to produce scientific tools which enable the scientific community to analyze genomic data. We build methods for every step of the process of analyzing data. I'm highlighting here each of the various steps. All of the methods for our team are open-source software.
So you can go to GitHub, you can check out the code, look at case studies for each of them, file issues if you have questions. Everything here is free and open source with no commercial restrictions. So it's really for anyone in the scientific community. And today, I'm going to be talking about DeepVariant, which is a method that identifies genetic variants from the sequencing data.
So DeepVariant is conceptually built on the idea that when you process data for visualization, it forces you to understand all of the elements that you want to bring together, and to structure that information in the most meaningful way.
And then using deep learning approaches, which can operate on all of that data to identify genetic variants. So importantly, because the deep learning methods don't have specific enumerated features, they learn directly from the data.
Essentially, they're a way of allowing the data to speak for itself about what the information content is within it. And by applying the same sort of training architecture with DeepVariant to different sorts of input sequencing technology, you can essentially let this data speak through the machine learning processes to understand what information is present.
So DeepVariant has two conceptual steps. The first part of it uses a set of heuristics very similar to what you heard about with GATK, which do things like construct candidate haplotypes, identify candidate variants, and structure the information.
Then it creates a multidimensional pile-up image, which includes the bases, qualities, mapping qualities, strand, a set of other information which a human wouldn't usually look at, but which the machine learning algorithm uses to understand the data.
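As a toy sketch of that multidimensional pile-up idea (the channel set and numeric encodings here are invented for illustration, not DeepVariant's actual ones):

```python
# Toy multi-channel pile-up: one row per read, one channel per feature.
# Real DeepVariant encodes these as image-like tensors; this is just the idea.

BASES = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def pileup_channels(reads):
    """reads: list of (sequence, qualities, mapq, strand) covering one window."""
    base_ch = [[BASES[b] for b in seq] for seq, _, _, _ in reads]
    qual_ch = [[q / 40 for q in quals] for _, quals, _, _ in reads]
    mapq_ch = [[mq / 60] * len(seq) for seq, _, mq, _ in reads]
    strand_ch = [[1.0 if s == "+" else 0.0] * len(seq) for seq, _, _, s in reads]
    return {"base": base_ch, "qual": qual_ch, "mapq": mapq_ch, "strand": strand_ch}

window = [("ACGT", [39, 39, 22, 39], 60, "+"),
          ("ACTT", [39, 39, 39, 39], 60, "-")]
channels = pileup_channels(window)
print(sorted(channels))  # ['base', 'mapq', 'qual', 'strand']
```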
And finally, it uses a convolutional neural network to determine which of the potential positions are actually just sequencing errors, and then which of them are variants, and to genotype the variants that exist. So a couple of properties about DeepVariant. One is it's very accurate.
It's won multiple best accuracy awards from multiple different sequencing technologies, both in a 2016 FDA competition and in the 2020 FDA competition.
And the second property of it is that it's very well calibrated, meaning its estimates of variant accuracy are very well correlated with the empirical probability that a call is correct or incorrect. And this is something which separates it from other methods.
And that really goes back to what I mentioned about being able to look at the properties of the data directly to discover insights that maybe a human hasn't figured out the right way to express in a set of heuristics. So DeepVariant is something we've put out multiple versions of.
We do two to three releases a year, expanding it for various applications and various sequencing technologies. We're going to be adding the SBX case studies to the version 1.9 release. And then in upcoming releases, we'll continue to support it, with a full release and everything that comes with that.
In terms of how DeepVariant is trained, there's an architecture which basically starts from an existing model. And then you show it examples that are labeled from genome in a bottle, well-characterized samples, to produce a new model. There's a couple of options.
You can train from scratch, which means around five to seven whole genome samples. And you can also do fine-tuning, which is starting from an existing whole genome model and then adding in an additional sample to tune it to the new properties. And that takes around one whole genome sample.
In this example--sorry, in this case, we're using the training-from-scratch approach because we have sufficient data. And also, in all of these training approaches, there's an automatic downsampling regime that's applied to produce examples from all different coverages.
So before I go into the benchmarks, a few observations about SBX data. These aren't relevant if you're operating from the VCF file, from the variants, because once a method has been developed that understands these properties, really, the downstream error profile doesn't look very different with SBX data at all.
So you mostly don't need to think about these components. But if you're developing methods, there's a few things that you'll need to keep in mind. And you may need to tweak your algorithmic approach as a result. So one is that the SBX reads are typically longer than Illumina paired-end reads, and they are variable in length.
But the SBX reads are not paired-end. So for example, in structural variant applications, which consider the relative position of the reads, the gap between them, that signal isn't available. So for example, certain structural variant approaches will need to be different.
In the case of DeepVariant, we don't use our insert-size channel for SBX data. So the second component is that SBX reads, when they disagree at a position, tend to pick the extra insertion base.
And this gives a slight bias towards insertions, which you saw in previous graphs. And this can interfere with the way that you would construct k-mer-based operations. So you need to take into account the fact that there's going to be this slight preference towards insertions.
In the case of DeepVariant, we make a very simple tweak to the de Bruijn graph process, which is where this caused some minor issues. Basically, we just add an evidence threshold for one or two base pair insertions in order for us to create a candidate haplotype. And the third thing to keep in mind is that the differences in quality are particularly meaningful.
So a Q22 base means it was a simplex base. A Q39 base means it was a duplex base, which is a little bit different from the machine learning estimations of quality that come with some other sequencing methods.
And that means that you want to be intentional around whether you filter at those simplex bases or not, the Q22 point. So for some of the parameters, we shifted them around that Q22 point, depending on whether it was better to include or exclude simplex bases. Okay. So now I'll go through a set of our accuracy evaluations.
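As an aside, the quality convention just described (per the talk, Q22 marks a simplex base and Q39 a duplex base) means a caller can branch on quality directly; the classification below is a sketch of that convention, and the exact cutoffs are the only assumption:

```python
SIMPLEX_Q = 22  # quality assigned to simplex bases in SBX BAMs, per the talk
DUPLEX_Q = 39   # quality assigned to duplex-concordant bases

def base_category(qual: int) -> str:
    """Map an SBX base quality to its category."""
    if qual >= DUPLEX_Q:
        return "duplex"
    if qual >= SIMPLEX_Q:
        return "simplex"
    return "discordant"

def keep_base(qual: int, include_simplex: bool) -> bool:
    """Parameter choice from the talk: shift filters around the Q22 point."""
    cutoff = SIMPLEX_Q if include_simplex else SIMPLEX_Q + 1
    return qual >= cutoff

print(base_category(39), keep_base(22, include_simplex=False))  # duplex False
```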
Here we have 30x, which is the duplex coverage for samples for both the SBXD and the SBX fast approach. So we're typically seeing indel F1s in the range of 0.997 to 0.998, and SNP F1s in the range of 0.998 to as much as 0.999 in some of the samples.
For all of these comparisons, we're doing that same leave-one-out approach. So for the HG001 benchmark, we left out HG001, trained on the other six. And we did the same for each. So the HG002, we leave out HG002, and so forth. One interesting observation we have is that both of these benchmarks are from the same model.
So we found that you could combine the training data for SBXD and SBX fast and train one model that works on both of them.
And this means that although we do see some minor differences in the input data between SBXD and SBX fast, they're not so large that they really require any sort of different algorithmic approach or even really labeling the data differently for the machine learning algorithm.
So it basically just is able to learn from that data the properties of SBX. So in all the benchmarks that we present, there's one SBX model that's just being applied to different data. And when we do our release, we'll release that one SBX model, which will be available for people to use.
So one question that often comes up is, what accuracy do we observe for homopolymers? So for SBX, we typically observe that the homopolymer accuracy is actually very high. In fact, it is potentially a bit higher than what we see in Illumina data.
And this is interesting because we also observe the base error rate to be a little bit higher in SBX data for homopolymers than Illumina.
This might reflect the homopolymer errors potentially being a little bit more random, versus a little bit more systematic in Illumina data, which is one of the ways to understand how the slight increase in the base error rate can result in more accurate variant calls in homopolymers after you've trained a DeepVariant model for it.
Okay. So next, I'm going to compare at the same 30x coverage between SBX data and a set of Illumina NovaSeq data. And this both uses DeepVariant with all of the BAMs mapped in the same way.
So it's really a direct comparison of the variant calling accuracy with the same analytical pipeline. So what we typically see is that the SBXD and Illumina SNP F1s are really neck and neck with each other. There's almost no difference between them.
There might be a very slight advantage to Illumina in certain samples. Interestingly, where we see a substantial difference is in the indel F1s. So we observe a noticeably higher indel F1 accuracy with the SBX data versus Illumina.
And I have a couple of examples, which I'll show, which I think highlight where that might be coming from, as well as some genome stratifications. So this chart is just an aggregation of those three prior tables, the SBX fast, SBXD, and Illumina all at 30x.
And here you can see highlighted that slight advantage in SNP accuracy for Illumina and that more noticeable accuracy advantage for indels that we see with SBX data. When you look at the genome regions, for SNPs, the difference seems to mostly come from the low-mappability components.
And this likely reflects the fact that the SBX data, although it's a bit longer, it's single-end data. So the Illumina paired-end is able to map a little bit better into some parts of the genome.
This is something where including data from, say, SBXS, where you would get a longer but maybe slightly less accurate read, could mitigate or reverse the difference between the two. But for now, the small difference in SNP accuracy seems to be coming from this component of the genome.
For indels, the difference in accuracy is that in the difficult parts of the genome, the more homopolymer-rich parts of the genome, the tandem-repeat-rich parts of the genome, the SBX data seems to be more accurate. Since this is where a lot of the indels are, this small difference is what manifests in the much larger overall performance.
And then you do see the same sort of mappability component, which is present between the SNPs and the indels. So overall, the SBX data handles the tandem-repeat and homopolymer regions a bit better. And then it's still a little bit harder to map some parts of the genome without the paired-end signal.
So next, I have a couple of different examples which highlight this difference. Overall, what I'd like to communicate is that it's very difficult to really see any sort of a systematic difference. You have to look at many, many examples. Most of the cases, there's just places where the sampling is a little bit randomly different between the alleles, and so the call is different.
A couple of areas where I did observe a degree of difference. So one is in a particularly long poly-A homopolymer, where I observed that the SBX tends to be more accurate. So in this case, it's a two-base-pair homopolymer deletion.
And even though this is a difficult region, so even though you do see both in Illumina and in SBX a number of base differences, this might reflect what I was saying about the homopolymer signal being more learnable.
SBX is able to make the right call. And I bring this up because I saw multiple such examples in the course of looking at pile-ups. The next example highlights a couple of the really key elements. So in this case, we have SBX data passing through a tandem-repeat, and the same for Illumina.
So what you see is there's a four base pair insertion, which with realignment, gets all moved to the same place. It's actually quite easy to make this call with the SBX. In Illumina, on the lower part, you see the top reads, which extend into the left. They have these lighter colored bases. These are the low-quality bases.
This is because when Illumina reads pass through a tandem-repeat, there is a higher error rate in those reads. They can be clipped. They can then incorporate errors. And this is because the Illumina data has much more positional dependency. Later parts in the read, especially after hard regions, the accuracy drops.
Whereas with SBX data, there's much more consistency about the read accuracy across the length of the read. And then in the lower part of the Illumina reads, you see there's this sort of like gap where the reads start to clip. This could be because the reads don't fully span that tandem-repeat.
It could be because the quality drops, and then the mapper is forced to soft-clip the bases. So this also goes into the read length of the Illumina read. In order to call an insertion, you have to fully span that repeat element with the inserted bases.
And this signal comes much cleaner in this example on the SBX side than it does on the NovaSeq. So this is the example of how the different properties of that SBX data are contributing to the different aspects of the call. But again, to emphasize, these are the examples that I found where they're discordant.
In the vast majority of cases, they agree. And basically, you can use downstream analyses for the same sort of way you would typically. So finally, I have a comparison on the challenging medically relevant parts of the genome.
Here, we observe the same property where the SBX data is more accurate for indels, a bit less accurate for SNPs. This region is fully held out from training. And second, I have an evaluation on the telomere-to-telomere (T2T) benchmark, which is a much broader benchmark for HG002.
And here, we use chromosome 20 because we have only one sample for that truth. So we use all of that for training apart from chromosome 20. And here, we actually observe both SNP and indel accuracy being higher.
I think one factor for this may be that the T2T benchmark includes more tandem-repeat regions, which, like I said, have less of that positional dependency for the SBX data. So finally, we're releasing this model. It'll be in the DeepVariant 1.9 release, as well as case studies.
And we will support it in the typical way via GitHub. In addition, there'll be a Parabricks-accelerated version, which is capable of analyzing samples in just minutes. If you are interested in this, please reach out to either ourselves or NVIDIA for access to that model. We expect the fully supported release to come soon.
Looking forward, we're excited about expanding DeepVariant for SBXS applications, bringing in some recent improvements to runtime, and excited about the potential to develop somatic models for the data. So it's been my pleasure to be able to present our analysis of SBX data.
I especially want to acknowledge our full team, all of whom were very excited to be able to work with both the great scientists and engineers at Roche, as well as to get this early peek. And everybody worked extremely hard at all of the methods and analyses that you've seen here. So with that, thank you very much. And I will hand it to Jaedon. Thanks, Andrew.
And big thanks again to you and your team for all the work you've done on SBXD thus far. It's been a tremendously collaborative and valuable effort, and we very much look forward to continuing that progress. Before we move on to the Q&A, we want to thank you again for joining us to learn more about SBX data analysis.
And as mentioned, we're making multiple resources available, which can be found on our SBX data analysis web page via the QR code on your screen. We look forward to continuing to bring SBX to life and sharing our progress with you. And to that effect, you can catch us at ASHG for more exciting updates on SBX. And with that, I'll hand it back to Ben for the Q&A.
Thank you to our panelists for that interesting and informative discussion. We'll now begin the Q&A portion of the webinar. One moment, please, while we gather questions.
Our first question.
Since most of the NGS devices produce a FASTQ file to analyze, is there any reason to perform the variant calling and alignment yourself? Is the FASTQ produced different from a FASTQ from, for instance, Illumina or MGI?
Yeah, good question, Ben. I'll take this one. So on one of our first slides, we showed a couple of different data outputs that we could generate on station. And so it's either consensus reads, raw reads, or aligned reads. So the raw reads and consensus reads would be in a FASTQ file format.
And the aligned reads would be, of course, BAMs or CRAMs. So we won't be generating VCFs on station. The VCFs are generated by our open-source tools that would be deployable on the customer's compute of choice, either local or cloud. And just to answer the question, we'll follow the standard file formats of FASTQs, BAMs, and VCFs.
Thank you. Our next question.
Can you set the mapping quality to zero to allow for variant calling in homologous regions? That's a good question. Yeah, I can take that one. Yeah, this particular variant caller we presented today is not really designed to call variants in regions with high homology.
With that said, we are planning on releasing some specialized callers that can call variants in such regions, for example segdups, paralogs, and regions with high homology. So not this particular variant caller, but yes, we plan to release other variant callers that can perform that particular type of variant calling.
Thank you. One moment, please. Our next question. Is the GBM in the SBX GATK germline calling pipeline just being used to filter out likely false positives from the candidate list? Another great question.
Yeah, so we first choose a set of candidate variants by running the SBX version of the sensitive GATK HaplotypeCaller. The machine learning step can decide on the genotype, for example, whether the variant is homozygous or heterozygous, or is actually a false positive.
So it's actually genotyping as well as filtering some of the false positives.
Thank you. Our next question.
Can DeepVariant work with consumer GPUs like NVIDIA RTX 5080, 5090 with 16 gigs and 32 gigs of VRAM? Or how about other vendors' TPU, APU, for instance?
So the main parameter is the amount of memory. I believe the answer to that would be yes. I think that's a sufficient amount of memory. Generally, the Parabricks software will accelerate across potentially every make of GPU. And then the question is just, do you have a sufficient amount of RAM?
But for those, the answer should be yes.
Thank you. Our next question.
Do you plan to share data on SBX fast and SBXS as well? Yeah, of course. Sharing data ends up with more requests to share more data. Of course, it's a big milestone to have an initial SBX dataset broadly available.
But we absolutely intend to make additional datasets available to demonstrate the performance of some of the methods that we've shown in previous venues. So yeah, including SBX fast, including SBXS as well, as we move towards commercial launch and, of course, beyond. Thank you. Our next question.
You did mention that you'd be providing BAMs, but not FASTQs, for the data release. Do you plan to provide the FASTQs as well?
I can take that one too. Yeah, go ahead, Mike. Yeah, so I think we provide the BAM files, but we also plan to provide instructions for users to directly convert the BAM file to a FASTQ file. It's going to be a lossless conversion, so it should be very easy to actually reconstruct the FASTQs without any issues.
So for now, just the BAM files, but customers can easily convert to FASTQs with the instructions we're providing.
Thank you. And with that, we'll have to wrap up. Everyone, thank you for the excellent presentation and discussion. That's all the time we have for today. We'd really like to thank our panelists and our sponsor, Roche. If we didn't have time to get to your question, we will try to follow up with our panelists and our sponsor after the webinar.
Thank you for joining us today for this GenomeWeb webinar.
Explore more
The SBX technology and analysis tools are in development and not commercially available. The content of this material reflects current study results and/or design goals.