Roche SBX Webinar: Introduction to sequencing by expansion
Key takeaways
- Sequencing by expansion (SBX) uses a novel strategy to convert DNA into Xpandomer molecules, enhancing signal-to-noise ratios; combined with high-throughput arrays and real-time processing, this pore-based technology delivers fast, flexible, and accurate single-molecule sequencing at scale.
- SBX offers two distinct workflows: SBX-D (duplex sequencing), which provides high accuracy for applications like WGS and oncology, and SBX-S (simplex sequencing), which is optimized for ultra-high throughput and suited for applications like transcriptome analysis and identifying novel isoforms.
- The SBX-D workflow demonstrates uniform coverage and improved performance compared to existing technologies in traditionally challenging areas, including high-GC regions and homopolymer stretches, confirming its high quality for advanced genomic studies.
The Roche SBX webinar introduces a new sequencing technology, covering duplex and simplex sequencing, rapid whole genome sequencing, and flexible platform performance.
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
Brien Mahoney: Welcome and thank you for joining us for this exciting introduction to Roche's novel sequencing by expansion technology. My name is Brien and I'll be your moderator for today. Before we begin, I'd like to tell you how you can pose your questions.
On your screen, you should see a QR code that, when scanned, will automatically open the online submission form. You can also open a browser and type pollev.com/rochewebcast. All one word, no spaces. You can pose a question at any time during the presentation. I'll be back at the end of the presentation to help get answers.
Now appearing on your screen, you should see our esteemed panel of experts who over the next hour will guide you through this breakthrough approach to sequencing.
With that, it is my pleasure to introduce the global head of Roche Diagnostics Solutions, Palani Kumaresan. Welcome, Palani.
Palani Kumaresan: Thank you, Brien. A very warm welcome from my side as well. Today is indeed a momentous day and many years in the making and we are really happy that you could join us for this webinar.
Now, some of you may not be as familiar with Roche, so here I have some highlighted facts and figures about us. As you look at this slide, I would like to highlight two important points: one, we have a deep history of commitment to innovation, and two, we deeply care about having an impact on the societies we live in.
I would like to call out three figures here. We have three Nobel Prize recipients and over 40 Prix Galien innovation awards over the course of the past 50 years. If we look at the WHO list of essential medicines and diagnostic tests, we have over 40 Roche medicines and close to 100 Roche diagnostic tests that are part of this list. And last year alone, we delivered over 30 billion diagnostic tests to our customers around the world.
Now if you zoom into our sequencing portfolio as it stands today, we have our AVENIO assay kits. These are research-use only kits that can run on third-party sequencers. On the front end, we have our AVENIO Edge system which automates sample preparation. We also have our KAPA library prep and KAPA target enrichment reagents that can run on this instrument but can also be used independent of the instrument. And downstream we have navify® mutation profiler which is a tertiary analysis software for generating actionable insights in a research use only setting.
What has been missing from this picture is our sequencing platform. That's why we are here today to introduce the technology behind our future sequencing platform, and I would like to make two important points as we introduce this platform.
In the future we will introduce it as an open standalone sequencing solution that both academic researchers and translational researchers can use to advance their work. At the same time, we will have it as part of an end-to-end sequencing ecosystem that we can bring into a clinical setting along with our assay kits. So without further ado, let's get into it. And for this, it's my absolute pleasure and privilege to introduce Mark Kokoris, the inventor of sequencing by expansion chemistry.
Now, Mark is one of the most brilliant minds I have come across. In 2007, Mark came up with this very elegant but hard to realize idea to address the fundamental problem of threading a DNA molecule through a nanopore at very high speeds and being able to decipher the four bases with very high accuracy. Over the course of the past 15 plus years, five of which have been with Roche, Mark and the Roche sequencing team have just realized this.
So without further ado, let me hand it over to Mark to talk to you about the chemistry and some of the exciting results.
Mark Kokoris: Thank you so much, Palani, for the kind words in the introduction. I'd like to welcome everyone joining us today from so many different time zones for what is a very exciting day for Roche Sequencing, and it's truly an honor to present with my colleagues today introducing the technology. As Palani said, this has been a many-year journey to get to this point; for me, it's been 18 years. All that is to say, we've got a lot to cover, so let's get started.
The SBX technology was designed for flexibility and performance, with headroom to efficiently scale into the future, so let's dive into that a little bit. If you're developing a new technology, whether it was 1999 or 2029, you have to think about accuracy, throughput, read length, and cost efficiency. Those are a given for any sequencing technology. You also have to think about making sure that you have the headroom to scale into the future.
These are all important given considerations that we certainly thought about when starting this technology. But another one that was key for us was flexible operation. We envisioned a technology that can sequence up and down the throughput spectrum with the same instrument, and do that efficiently. So, for example, if you wanted to sequence for four minutes, 40 minutes, or four hours, we wanted a sequencer that can do that. And in fact, I'll show data today from 20-minute, 1-hour, and 4-hour runs just as an example of that.
Of course, accuracy is critical. I show an example here of our duplex whole genome sequencing for reference sample HG001, just to give an idea of what our duplex sequencing accuracy is looking like, and we'll go into that detail later on. We'll also cover throughput: a whole genome sequencing run for seven Genome in a Bottle samples where we achieved over 30x coverage in one hour of sequencing. Just to put a number to it, that was over five billion duplex reads in one hour. Read length: we'll cover that and make the distinction between duplex and simplex reads in terms of length, accuracy, and throughput. And we'll show an exciting sample-to-VCF whole genome sequencing result in less than seven hours that really highlights some of the flexibility we're talking about. And of course the big one, cost efficiency.
We hear you on this. I don't think you can develop a sequencing technology without being very tuned in to the importance of cost efficiency. We consider chemistry efficiencies, measurement efficiencies, as I point out there, and the reusable sensor module, which is our sequencing chip. All of those things come together: data analysis, algorithm efficiency, and even the quality of the signal that you pull off your sequencer. All very important to efficiency.
So if you someday aspire, as I do, to do a sequencing run where you get a trillion reads, you really have to be tuned into that. And in fact, it has to be wired into your DNA sequence, or, as I'll explain later on, perhaps wired into your Xpandomer sequence. I couldn't help myself. Sorry about that.
So, what does this look like? What are we doing? This is the coming together of the Stratos Genomics sequencing by expansion chemistry and the Genia high-throughput sensor array technology: two very powerful technologies that really needed each other to bring out the best in performance. On the instrumentation side, we have an instrument that does the SBX chemistry and a separate instrument that does the sequencing, and again, we believe this maximizes the flexibility of the technology. Now, in the future, if the workflows dictated pulling both of these systems together, that would be something we would certainly consider.
So, a little bit about the agenda for today. I've heard a lot of conversations about the SBX technology and what it is. Some were pretty spot-on and some not so spot-on. So I think it's important that we cover and outline the technology, just to make sure everyone's clear on what it does. We'll then go into some of our whole genome sequencing performance for the SBX duplex sequencing that we're doing with our early access collaborators at Hartwig Medical Foundation and the Broad Institute.
We'll also hear from John Mannion, who will come on to talk about some of the duplex data structures, go into detail there, and dive into the quality of the data that we're seeing. I'll come back to cover some of the RNA applications, again with our early access collaborator at the Broad Institute. We'll finish off with some open-source pipeline tools, and Gustav will come on to talk about our commercial road map, and then hopefully we'll have time for some Q&A.
So, the SBX technology overview. Our approach to efficiently sequencing DNA was to not sequence DNA. Pretty much that simple. SBX is a biochemical process that uses modified nucleotides and enzymes to convert DNA into an expanded surrogate molecule, which we call the Xpandomer, the idea being that we would rescale the signal-to-noise challenges of direct DNA measurement. So, such a simple picture, how hard could it possibly be? I reminded myself of this quote many times over the years.
If you want to improve, be content to be thought foolish. So, I don't think we were ever content to be thought foolish, but I'm really glad that we were foolish enough to try to develop this technology because when you get on the other side of it, it really does enable some pretty powerful measurement capabilities and it's very exciting and these are the things that we're going to show you today.
A little bit about our preprints. We're actually releasing a preprint, and it should be available as we speak, but certainly within the next couple of days, that covers the SBX chemistry all the way through early 2020. We wanted this foundational paper to talk about how we did it and the important elements of the technology. So check that out; it'll be a nice 60-to-70-page read, with tens of thousands of experiments going into it. Enjoy that.
So how do we do this? I alluded to this modified nucleotide: it's essentially an expandable nucleotide triphosphate, or XNTP. Our original thinking was to attach a tether to a nucleotide at two specific points, and I'm showing there the alpha phosphate and the heterocycle linkage, separated by cleavable bonds. Once you incorporate this XNTP using a polymerase, you can selectively cleave and expand the structure, and now you have a new reporter code taking the place of the nucleotide.
As we went on with the technology over the years and were more directed towards nanopore measurement, we engineered a translocation control element. As it says, this was to allow us to get a clean, reproducible, synchronized modulation of the Xpandomer as it moved through the nanopore. This was a key innovation. And also over the years we learned that we needed to add enhancer chemistries and engineer those into the structure as well, because, not surprisingly, most polymerases don't like XNTPs. So we had to add that feature too. And just a little nomenclature here: SSRTs are symmetrically synthesized reporter tethers; that's that whole structure there, as we call it.
So, diving a little bit deeper, how do you do this? You do it with innovative molecular engineering and by developing novel chemistries, because there was no roadmap for how to do any of this way back when we started the technology. We literally made and engineered thousands of custom SSRTs to get to the point where we are now, and we did that using novel phosphoramidite building blocks, dozens and dozens of those. On the dNTP-2c side, the nucleotide core structure that we use as one part to build the XNTP, it was very similar: dozens of synthetic routes. That amidate diester structure did not behave like normal nucleotide chemistry, so we had to learn how to do that as well. And I show that coming together in a convergent synthesis at over 50% purified yield, to point out that this is an efficient process; there's no room for inefficiency in this type of sequencing technology.
So I really wanted to point out that we're able to make these very complex molecules incredibly efficiently and I always like to show this picture where I've got the green box highlighting a nucleotide to show the chemical structure of a full XNTP just to get an idea of how crazy this idea is, what 50x expansion really looks like.
So now on to the polymerase. Given that structure, you can imagine that you would need a polymerase with an open architecture; in other words, at the point where I'm showing the arrows, there has to be somewhere for that very large tether to go. Most polymerases don't have that opening, but a specific class of polymerases, Dpo4 and the Y-family lesion-bypass polymerases, actually has that capability, and so that directed us to go in a lesion-bypass direction.
In terms of our engineering approach, we would come up with a specific set of designs we wanted to survey more broadly, refine the search space using rational design and high-throughput screening, combine interesting regions of the polymerase together, and iterate on that over and over and over again. Thousands and thousands of polymerases to get to the point I show below, where we don't have a DNA polymerase anymore. We call it an “XP” or “Xpandomer synthase”. We've literally changed it into a different enzyme, because it actually doesn't really like regular nucleotides anymore. It's an interesting little side note.
Over 10% of the amino acids are mutated. Our raw read simplex accuracy is 99.3%, and we'll contrast that later with duplex accuracy, which reads both strands. We've turned the polymerase from a distributive polymerase into a processive polymerase, and that's part of how we get read lengths of over a thousand bases. And we've also turned it into a strand-displacing polymerase, which it normally isn't; that's very important for being able to do duplex sequencing.
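To put figures like the 99.3% raw simplex accuracy in context, per-base accuracy in sequencing is conventionally expressed on the Phred (Q) scale. The sketch below is just the standard conversion arithmetic, not anything specific to the SBX pipeline:

```python
import math

def phred(accuracy: float) -> float:
    """Convert a per-base accuracy (e.g. 0.993) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)

def error_rate(q: float) -> float:
    """Convert a Phred quality score back to a per-base error probability."""
    return 10.0 ** (-q / 10.0)

# 99.3% raw simplex accuracy corresponds to roughly Q21.5
print(round(phred(0.993), 1))   # → 21.5

# A Q39 consensus score corresponds to about 1.3 errors per 10,000 bases
print(f"{error_rate(39):.2e}")  # → 1.26e-04
```

This is why duplex consensus scores like the Q39 mentioned later in the talk represent a large jump over raw simplex reads: each 10 Q points is a 10-fold drop in error rate.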
A little bit of the tools. We won't cover much here. Just to say we threw everything at this to figure out how to make these new polymerases. Whether it was machine learning, structure prediction, display technologies or even fusions. We pretty much threw everything at it to get the polymerase that works.
Okay, so now you have the XPs, you have the XNTP with these enhancer elements, and what happens if it still doesn't work, or at least not well enough? You create a new molecule. And that's the polymerase-enhancing moiety that I've shown there. Think of this as the glue that holds the whole system together. I like to show this one: this is an old gel, from probably 2018, something that was really that aha moment when we finally figured out the glue to the whole system. That red box is what you get if you leave any one of those three key elements out of the reaction. And if you add all of them, including the newest one in the series, the polymerase-enhancing moiety, you get an idea of how robust the extension is. Now we have gels that show thousand-mers looking even better than this, just to give a perspective on the innovation there. So, I showed the synthesis instrument earlier.
It's essentially a fluid-handling system that directs reagents into this XP chip, as we call it, in a fixed solid-phase synthesis workflow. I'll go through the reagents step-wise, but it is a pretty intuitive and straightforward process. We have an extension oligo, as we call it, attached to the surface. Hybridize the library of interest to the extension oligo. Add your reagents to do the Xpandomer synthesis. Cleave to expand the structure. Wash the chip, and then use a photocleavage, an orthogonal cleavage, to release the Xpandomer from the support. Pretty straightforward, pretty intuitive.
Now, elaborating a little on the translocation control I talked about before. We added this innovation as we realized it's really important to have control over the movement of the Xpandomer through the nanopore. I show that orange triangle pausing the Xpandomer as it goes barrel-side first through the nanopore; again, we run backwards, because we get cleaner results that way. And you can get an idea of the signal you would see from that pausing for that specific period of time. We then do a very quick voltage pulse to advance the Xpandomer one reporter code at a time, and you get an idea of what the signal now looks like, and so on and so forth. So this is a way of modulating the throughput of the Xpandomer as we sequence.
So, a pretty important advancement in the technology. I like to say that the box there, where we show the translated XP base calls, was a drawing that Bob and I did in 2012 to show what we were trying to do, what we hoped to be able to do with the technology, and this is what we actually did a mere seven years later. This is a result from our single-pore approach system with 1.5-millisecond pulsing. Really great separation. You get an idea, as you zoom in and look at the homopolymer bases, of what those look like when you get this deterministic control over the movement of the molecule through the nanopore. And in the level histograms there, we can see the separation of all four of the bases, exactly how we'd hoped to draw this up in terms of solving the signal-to-noise and signal-resolution challenge of directly measuring DNA. So, still a single-molecule measurement, but you see the signal resolution we were able to achieve here.
So now take that one pore and put it on an 8 million array: an array that combines microwells, electrodes, detection circuits, and analog-to-digital conversion. On the left side you see the 8 million XP chip mounted on a PCB, and then you can zoom in on the SEM (scanning electron microscopy) image on the right, where we've drawn in the bilayer, the pore, and the Xpandomer translocating through it. So now you can picture what this looks like as you scale to an 8 million chip.
Okay, a little bit more on the sequencing instrument. Similarly to the synthesizer, it provides fluid control and reagent delivery to the sensor module that we show in the middle of the screen, and of course it has all the electronics to move the signal and the sequencing data off the system. So, obviously a little more complicated than the synthesizer. And a little bit about the sequencing process. This is the simplified view, but essentially, to initialize a run, you set up by forming the bilayer and inserting the pores, which we do for every single run, just to be clear: every run, we form the bilayers and bring up the pores for every sequencing cycle. We then add the Xpandomer that you would have synthesized, flow that across, do your sequencing, again 4 minutes, 40 minutes, or 4 hours, clean the system, and then you move on to reuse.
So the reuse part of this, as I like to say, significant reuse: going back to that efficiency discussion we had before, a key part of the technology is being able to reuse the sequencing sensor module, over and over again. And a little bit about the active pores on the system. As I said, there are 8 million total sensors. Typically, we'll get 7 to 7.5 million functioning sequencing pores that can sequence Xpandomer.
And you get an idea over time, you know, 20 minutes, 60 minutes, 4 hours, of the pore lifetimes over that period. The way I look at it is that all the data we're showing today would have gone through this type of pore-lifetime profile, so there's a great opportunity to improve our throughput even more as we continue to optimize, because it's not a foregone conclusion that pores won't last for four hours or even way beyond that. So I think that's another opportunity for optimization with the technology.
A little bit more about the capture efficiency. I show the leader structure threading through the nanopore, as well as a concentrator. I can remember a conversation we had many, many years ago with a prospective science adviser who said, "I don't know if you're ever going to be able to make these molecules, but I really don't see you being able to efficiently thread them through a nanopore." I won't name any names. Anyway, this was something that had been top of mind from day one: how do we get these big molecules to thread through a nanopore? And I show this result to show how we figured it out, with a leader-concentrator approach, concentrating the Xpandomer and really lining it up for measurement.
I also show the trace on the right, which has four Xpandomers sequenced in a row, and you get an idea: that little spot being open channel, the gap between one molecule and the next, and the next, and the next. This just illustrates how efficient our capture can be. Now, this is a great example, four in a row; they don't always look like that, but it illustrates that we're able to really line these things up and sequence very efficiently. Again, efficiency is the word that always comes to mind with this type of measurement.
So now you go from that one pore to an 8 million array, and we're looking at a single pore on that array. I point out, at the top, the open channel in the image, as well as the bracket showing the Xpandomer signal traces for the four different bases. So you get an idea of how much Xpandomer is going through in a 1-hour run, just to give you a sense of scale. Now zoom in a little bit and look at one full Xpandomer.
This is probably about an 800 MR [million reads] to 1000 MR run; I actually don't know. But it gives you an idea of what that zoom-in looks like for that tiny little fraction across all of those reads in one hour. Then you zoom in a little more and you get an idea of those translated XP base calls and what they look like. Again, flexibility; we'll reiterate this over and over. We wanted a technology that could support low throughput, very high throughput, and everything in between. Whether you have a single library, multiplex deeply or shallowly, or want to do one synthesis or four, we wanted to be able to support that entire scale of sequencing.
An important part of the technology, one of the things that really makes it distinct, is that massive throughput flexibility. Those are really critical elements of the SBX technology.
Now, jumping in a little on the SBX duplex approach, just to outline what it is, and it's pretty straightforward. You have a library with a hairpin Y-adapter structure, in contrast to simplex, which is pretty much just a single-stranded structure. What this allows you to do is make an Xpandomer where you have both parent strands on the same Xpandomer molecule, allowing for intramolecular consensus and thus the ability to get really high accuracy.
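The intuition behind intramolecular consensus can be illustrated with a deliberately simplified error model. Suppose the two strand reads err independently, and a wrong call is uniform over the three incorrect bases; then both strands agree on the same wrong base with probability p²/3. This is a toy sketch for intuition only, not the SBX consensus algorithm, and it ignores correlated chemistry or synthesis errors, so it overstates real-world gains:

```python
def duplex_agree_error(p: float) -> float:
    """Toy model: probability that two independent strand reads make the
    SAME wrong base call, with errors uniform over the 3 incorrect bases."""
    return (p ** 2) / 3.0

p_simplex = 0.007  # ~99.3% raw simplex accuracy, per the talk
print(f"{duplex_agree_error(p_simplex):.1e}")  # → 1.6e-05
```

Even this crude model shows why requiring both parent strands to agree drives the residual error rate down by orders of magnitude compared to a single read.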
Okay. And so the two topics we'll cover today are somatic oncology as well as a rapid whole genome sequencing and the rest we'll cover over time and certainly we'll talk about simplex later on in the presentation, but for the next few slides we'll talk about those applications.
As far as the workflow goes, you have your sample, whether it's tissue, blood, or cell-free DNA, running through your library process. Again, it's pretty straightforward: you have a Y adapter and a hairpin that you ligate to create your Y-adapter structure. Currently we use 20 to 50 nanograms of unfragmented genomic DNA as your start point.
I think there's room to whittle that down and use even less, and for cell-free DNA, 2.5 to 10 nanograms is what we've used up to this point. The one unique thing here is that we actually do a linear amplification of this library; as opposed to PCR, we found that this actually helps our accuracy quite dramatically, and you'll see, when we show our homopolymer and indel accuracy, what that looks like for an amplified library. It is pretty impressive. After that, we pretty much just run through our normal Xpandomer synthesis and sequencing process.
So, a little bit about the data I alluded to before. This is a 1-hour run, a seven-plex using seven Genome in a Bottle samples, and as I mentioned, in this 1-hour run we got 5.3 billion duplex reads. This is the breakdown of the duplex reads per sample; you see pretty uniform yield on all of them. The mean insert read length breakdown, John will cover in his slides in a little more detail. Coverage was 34-38x for that 1-hour run. And then the F1 scores: what we're seeing with GATK (Genome Analysis Toolkit) and the Roche machine-learning approach, very respectable scores for both SNV (single-nucleotide variant) and indel. At the same time, a shout-out to the Google DeepVariant team: we sent the data over, and they were able to run their system on it and get very similar results, with F1 scores in fact a little better than ours. So we were really happy to get that external look at our data and have it come back finding the same results, if not a little better than what we had.
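The stated figures can be cross-checked with simple coverage arithmetic: coverage ≈ reads × mean read length / genome size. Solving that for read length using the 5.3 billion duplex reads over seven samples at 34-38x gives the implied mean duplex insert length. This is back-of-the-envelope only; the ~3.1 Gb human genome size is my assumption, and the actual insert-length distribution is deferred to John's slides:

```python
total_reads = 5.3e9    # duplex reads in the 1-hour seven-plex run (from the talk)
samples = 7
genome_size = 3.1e9    # approximate human genome length in bases (assumption)

reads_per_sample = total_reads / samples

def implied_read_length(coverage: float) -> float:
    # coverage = reads * length / genome_size  =>  length = coverage * genome_size / reads
    return coverage * genome_size / reads_per_sample

for cov in (34, 38):
    print(f"{cov}x -> ~{implied_read_length(cov):.0f} bp")
# prints ~139 bp at 34x and ~156 bp at 38x
```

That range is consistent with short duplex inserts rather than the thousand-base simplex reads mentioned earlier, which is why the talk keeps the duplex and simplex length discussions separate.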
As far as the percent of genome at greater than 10x coverage, we were at 99.7%, with, as I mentioned earlier, Q39 scores. And the next slide just shows what happens at a 30x down-sampling: very little change in the quality of the results for the F1 scores.
So moving on to Hartwig Medical Foundation. Edwin Cuppen, the scientific director there, is directing that team and I have this picture of Ewart and Joris in front of our systems that were placed in their labs in the fall.
A little bit more about that. We uncrated the instruments on a Monday. By the end of the day, the sequencer was sequencing, and the synthesizer had run through its liquid-handling tests. By the end of the next day, both systems were qualified and ready to go. So the very first systems to leave the Seattle labs flew over there, were installed, and were functioning within two days. I think that's very impressive for the very first time an instrument left the lab and was placed elsewhere. And they're a great team to work with at Hartwig. Of course there was preparation, but to get there so quickly was really quite impressive.
A little bit about Hartwig Medical Foundation. Edwin will cover this when he speaks the following week at AGBT, but their approach is a fully automated lab and IT workflow to perform routine whole-genome-sequencing-based cancer diagnostics for hospitals. Edwin will go into detail about what they're doing, but you can see why Hartwig was a great fit for Roche, and it wasn't a low-bar challenge. We went after this type of sequencing because we wanted to be challenged, to challenge the technology so we can improve.
A little bit about the study we did. This was a demonstration with clinical research sample pairs: we used 30 matched tumor-normal samples and actually added in a couple of cell-line references from the COLO cell line. We targeted at least 100x coverage for the tumor and 40x for the normal sample. And this is where that 15-billion-duplex-reads-in-4-hours number came from, the one you would have seen in the advertisement for this series.
So, just to give you an idea of the amount of output we got in the 4-hour runs, as well as the workflow we used below to generate all of this data: it took us eight sequencing runs to complete the entire series.
A little bit about the data. Again, Edwin will present this in more detail at Roche's silver AGBT workshop on February 25th at noon. What we're showing here is data that was generated in Hartwig's lab by the Hartwig team and analyzed by Hartwig. All of this is their data, just to reinforce that; I may even say it one more time. What they saw was that SNV and indel counts were similar across platforms, SBX and Illumina, and most, if not all, of the differences were explained by stochastic effects involving real low-VAF subclonal variants. And one last thing: the empirical platform SNV error rate for SBX was 1.7-fold lower relative to Illumina.
The thing I like to say about this, and one of the benefits of having been with the technology from the very beginning, is that I've seen the trajectory and the direction we're going, the advances in the technology. I'm very, very pleased with what we're seeing now, knowing how our trajectory has been going over the last many months and years; seeing what's changed even in the last six months, I'm very happy with this result. And again, Edwin will cover this in a lot more detail when he speaks at the workshop, which will be recorded so people will be able to see it.
A little bit about the SBX Fast protocol, which is, as we say, an amplification-free workflow, so it looks very similar to the SBX approach I was showing before. I'll highlight that, of course, if you're going to go amplification-free, you would expect to put a little more sample in. Really, the 2-microgram input is because we're only running a single plex; for the Fast protocol that I'll show later, we're only doing one sample, as you'll see when I show that example.
The rest is pretty straightforward: it's amplification-free, the same protocol and workflow we showed before, the idea being that we would tighten up some of the timings for the library prep and the synthesis step to get an even tighter workflow. We just started working on this a few months ago, in the late fall.
Okay, a little bit about the data, looking a little different from what we talked about before, because we ran these as trios. In the case of HG002, HG003, and HG004, which I have highlighted there, we ran all three samples for one hour each, and you get an idea of the median coverage we got, which is very similar to the SBX-D amplification approach we showed before.
Nearly five billion duplex reads in one hour, so very consistent there, and even better F1 scores when you look at them, especially on the indel side. So it's looking fantastic for an amplification-free protocol, with very consistent yields. Diving in a little deeper: a 30x whole genome for the HG002 sample truth set, again with F1 scores on the Y-axis, gives you the breakdown of small variants, SNVs and indels, from the previous slide, but then also a bit on repeat expansions, structural variants, and of course copy number variants. And as I said before, please look at the data and compare it as you want; these are all just the starting point for where we are now. We're very excited to look at these numbers and continue to improve upon them, and we already have a really great idea of exactly what we need to do to keep pushing these numbers higher.
But as a starting point, I think we're very happy with what we're seeing here. Again, people can look at the data and draw their own conclusions. So the vision for SBX Fast is that we can take a sample through to variant calling, a VCF, in about 5.5 hours, again with a system that is the same sequencer we use for any other application we're talking about. It's not specialized in any way; it's just a different protocol where we run the steps a little faster than we otherwise would. So, expanding that to our early access with Broad Clinical Labs and Sean Hoffher and his team there.
So why Broad Clinical Labs? Well I think we all know they are a leader in the field of genomics. For technology assessment, it makes a lot of sense—clinical translation, genomic disease associations, and building resources. And Sean will cover that when he speaks as well at the AGBT workshop.
So, a little bit about this increasing need for scalable, rapid clinical whole genome sequencing. The Kingsmore publication really does a great job of outlining that, and I encourage you to read it; it's a great read. If you look at the right here, in terms of some of the conclusions: genetic disorders are a leading contributor to mortality in NICUs. Every hour saved in diagnosis has a significant impact on improving outcomes. Shorter NICU stays open up access for other critical patients, and shorter stays lead to cost savings for the organization. So there are a lot of good reasons, and Kingsmore outlines the traditional neonatology approach, highlighted in green there, versus the precision neonatology approach. A great read; I encourage everyone to look at it.
A little bit about the current world record time, from Stanford in 2022. We tried to pull out of that what we believe is the sample-to-diagnosis time of 7 hours and 18 minutes, as well as the sample-to-sequence-completion time of 5 hours and 2 minutes. We think we got that right; we certainly tried to. They reported 94% read alignment, with bases greater than Q7 and an average of Q11.5, just to give a comparison of what they saw with that technology.
So, a little bit more about the initial training runs that we did at Broad. Again, this was in January, not long ago at all. Our approach on the training runs was really just to get people familiar with the technology; we weren't trying to break land speed records. We were just trying to get a frame of reference to see whether we saw the same results in the Broad lab as we were seeing in our own labs.
And the answer is yes. We saw very similar throughputs, lengths, and F1 scores with our benchmarking samples. Then we took it a step further and ran some Coriell samples with expected variant types, just to see whether we get the same answers in the runs we did at Roche labs versus the Broad labs. And the answer was yes. It looks great. So, let's move on and try some Fast sequencing and see what we can do here.
So, we outlined the workflow at the top there. I'll highlight that we did 20 minutes of sequencing to get this result, and the times shown are the exact times as the Broad team ran through the protocol on February 11th, less than two weeks ago. In the results below, you get the idea: library prep through VCF in 6 hours and 25 minutes. We didn't include the 20-minute sample prep portion because we wanted to use HG001 so we could get good quality measures with the data. You can see almost 30x coverage and very respectable F1 scores. And again, this is only the second or third time we've done this, so there's a lot of optimization ahead. We will pump up the yield and shorten the sequencing time, no doubt, even from 20 minutes, and those F1 scores will go up as we tighten up the protocol and do a bit more optimization.
So, very exciting. Again, it illustrates the flexibility of the technology to be able to do this run in 6 hours and 25 minutes, and we just started trying to do this. So I'm very excited for Sean to present the data at AGBT. And now it's my honor to hand over to John Mannion, who heads our computational science group and a fantastically talented team of people. He's going to dive into a little more detail on the duplex read structure and format, and a little more on the data that we're seeing.
John Mannion: Thank you, Mark. What an honor it is to be here today, and thanks to everyone who has taken an interest in joining us on this webinar. In this first section on data analysis, we'll focus on two main topics: first, SBX-D read characteristics and properties, and second, a short, cursory introduction to our accelerated local data processing strategy. We will use the example WGS run that Mark started with, the seven-plex whole genome sequencing run, and dig further into it, doing a deeper dive on those read characteristics.
To look at those read properties, we start with the Xpandomer molecule itself; a cartoon diagram of the SBX-D read construct is shown in the upper left. Here you can see the genomic DNA insert in blue, as well as adapter segments in the other colors.
If you take an Xpandomer molecule, thread it through the pore, and measure it using the sensor module, you'll generate reads such as the one shown in the lower left of this slide. In this read, you can see the blue genomic segment as well as the annotated adapter segments: the Y adapter on the left is the start adapter, the hairpin adapter is in the middle, and there's an end Y adapter sequence as well. The genomic sequence data is in the middle, in blue, between those adapter segments.
If we take a stack of those reads, randomly sampled, and pile them on top of one another, we can generate an image like the one you see here, in which the hairpin adapters are always present in the middle and the various adapters at the ends appear in different colors.
You can see the first and second passes over the genomic DNA in different shades of blue. We can take those linear reads and informatically fold them back over on themselves, producing what we call a full-length duplex SBX-D read. In this case, we refer to it as full-length since both a full first pass and a full second pass were achieved on the genomic DNA insert.
We know this because we were able to reach the end adapter sequence. Not all SBX-D reads necessarily reach that end adapter sequence. There are various processes upstream which could cause that to be the case. These reads are still valuable, however. And a pileup of a random sampling of these is shown just above. Here you can see that the hairpin adapter is always present in the middle. And the full first pass is present in the light blue. But for many of these reads they have a shorter or partial second pass.
These are also useful; as we say, we can perform the same bioinformatic analysis, informatically folding these reads back over on themselves to generate constructs like the one you see on the right. We refer to these as partial duplex reads. Note that from the image on the left, you get a sense of two things regarding read length.
One, the raw read length distribution, and two, the distribution of input DNA lengths that went into Xpandomer synthesis itself; you can get that sense from the full-length duplex read insert size distribution. Now, if we look in more detail at these two categories of reads, full-length duplex and partial duplex, we can look at the different classes of bases generated, starting with the full-length duplex reads on the upper left.
You see, of course, that the whole genomic segment was covered with duplex. The vast majority of these bases will match one another across the two passes; we call these concordant duplex positions, and for them we call a base and mark it as high quality. A small percentage of bases, given our SBX raw error profile, will result in a discordant duplex position, where the two passes do not match. In this case, we also call a base but mark it as low quality. For the partial duplex reads, as seen on the right, there is again a duplex segment, but in addition there are simplex bases: parts of the insert for which we only had a single pass. For these simplex positions, bases are called but marked as medium quality. All of these bases are useful in downstream variant calling, and we pass all of this information along to the variant caller, including the different quality scores.
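The per-base tiering described here can be pictured with a few lines of Python. This is an illustrative toy only: the function name, the tier labels, and the assumption that the second pass arrives pre-aligned and reverse-complemented are ours, not Roche's actual pipeline.

```python
def duplex_consensus(first_pass, second_pass):
    """Toy per-base quality tiering for an SBX-D-style duplex read.

    Assumes the second pass has already been reverse-complemented and
    aligned so that position i of both strings covers the same insert
    base; the second pass may be shorter (a partial duplex read).
    Tier labels (illustrative only):
      'H' concordant duplex (high), 'L' discordant duplex (low),
      'M' simplex-only (medium).
    """
    bases, tiers = [], []
    for i, b1 in enumerate(first_pass):
        if i < len(second_pass):          # position covered by both passes
            tiers.append('H' if b1 == second_pass[i] else 'L')
        else:                             # only the first pass reached here
            tiers.append('M')
        bases.append(b1)                  # always report the first-pass base
    return ''.join(bases), ''.join(tiers)

# A partial duplex read: 8-base first pass, 6-base second pass, one mismatch
seq, tiers = duplex_consensus("ACGTACGT", "ACGAAC")
# seq == "ACGTACGT", tiers == "HHHLHHMM"
```

A real consensus step would of course work on alignments rather than assuming positional correspondence, but the tiering logic is the same idea.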
Note that the two cartoon diagrams here are temporary informatic constructs alone; the high-accuracy consensus reads that are produced are themselves linear reads with an associated quality vector, compatible with standard file formats such as FASTQ or BAM.
How do these different base types affect our overall metrics? When it comes to base accuracy, the Q39 number that Mark mentioned earlier, for example, we focus on concordant duplex bases only; these are the bases in our highest-quality bin for SBX-D. To be consistent, coverage metrics, whether coverage over the genome, base throughput, sample throughput, and so on, are also anchored around concordant duplex bases only. The other bases, such as the simplex bases, are still there; think of them as coming along for free, and they do provide valuable information. But when we talk about our sample throughput numbers, that will be coverage of 30x genomes with concordant duplex bases only.
As for length metrics, we think the most relevant metric for SBX-D reads is in fact the insert length, which is exactly what it sounds like: a representation of the original length of the genomic insert. For this, we count both the duplex and the simplex bases. When it comes to insert length, we believe the most important aspect is our ability to map confidently to the genome, and that is assisted by both the duplex and simplex bases.
Having defined these read categories and metrics, we can now take a closer look at that example SBX-D Genome in a Bottle run which Mark covered above. In the lower left, you can see a portion of that table: all seven multiplexed samples are shown, along with the absolute number of duplex reads generated for each.
If you sum all of these together, you see that over the course of the 1-hour run, 5.3 billion SBX duplex reads were produced. Of those, 2.6 billion, or 49% of the total, were full-length duplex reads, and the other 51% were partial duplex reads. In terms of base mass, a total of 1.2 terabases of SBX-D genomic base mass was produced, and of that, 70%, or 840 gigabases, was in the concordant duplex category.
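As a quick sanity check, the quoted percentages follow directly from the raw totals stated in the talk (a minimal sketch; the values are the talk's, the variable names are ours):

```python
# Run totals as stated in the talk
total_reads = 5.3e9      # SBX duplex reads in the 1-hour run
full_length = 2.6e9      # full-length duplex reads
total_bases = 1.2e12     # 1.2 terabases of SBX-D genomic base mass
concordant  = 840e9      # 840 gigabases of concordant duplex bases

full_length_frac = full_length / total_reads   # ~0.49, the quoted 49%
concordant_frac  = concordant / total_bases    # 0.70, the quoted 70%
```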
With respect to the insert length distribution, the same number you see in the table is represented here: for HG001, a mean insert length of 235 base pairs was achieved. The distribution itself is certainly interesting to see. Yes, the mean is 235 base pairs, and some molecules are shorter. However, it has a long tail on the right side, and a good fraction of these longer molecules are above 300, 350, even 400 base pairs in SBX-D insert length. These are highly valuable for mapping to difficult-to-map locations in the genome.
With respect to accuracy, Mark has already covered at a high level that we're at approximately Q39. We'd like to show how that accuracy breaks down into the different error categories. In the bar chart below, you can see the accuracy represented with substitutions, insertions, and deletions called out separately. Substitutions and insertions are, as you can see, roughly the same, both at about 5 × 10⁻⁵ in frequency for the concordant duplex bases. We're often asked whether we have the ability to predict base quality with even higher accuracy.
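For readers less familiar with Phred scaling, the quoted error frequencies and the approximately-Q39 figure relate through Q = −10·log₁₀(p). A small sketch, assuming for illustration that all three error classes sit near 5 × 10⁻⁵ (the talk only states this for substitutions and insertions):

```python
import math

def phred(p):
    """Phred quality score from an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

# Illustrative assumption: substitutions, insertions, and deletions each
# occur at roughly 5e-5 per concordant duplex base.
combined_error = 3 * 5e-5        # 1.5e-4 total error rate
q = phred(combined_error)        # ~38.2, consistent with the ~Q39 figure
```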
Indeed, we do have internal models that have been developed. These are still in test and development, but so far they suggest that greater than 85% of concordant duplex bases are predicted to be over Q30. We'll continue to refine and calibrate these models, and it's possible we'll push that number up even further.
One thing that is commonly asked about SBX, since we are not measuring an ensemble of clonally amplified molecules: is it possible for our accuracy to hold up even as we go to very long read lengths?
Indeed, we believe that to be the case, based on our understanding of the physical processes that lead to the different sources of error, and we can also see evidence of it in the data. In fact, since this run has a distribution of insert lengths, we can look at slices of that distribution and understand from those slices whether accuracy is holding stable.
We've done that for this run. We've taken slices that are exactly one base pair in width. So if we look at the next slide, we can see populations of reads that are exactly the lengths indicated on the left, and you see how many absolute reads were generated for HG001 that match each exact insert length.
We start with 150-base-pair inserts at the top and work our way down to 200, 250, and even 350-base inserts at the bottom. You can see that across the different insert lengths, accuracy holds up well. And again, this is empirical accuracy of concordant duplex bases, our highest-quality bin. You can also see, going from left to right in any one of these plots, that accuracy holds up with position in the read. As we go toward the ends of the reads, we do not see any evidence of significant degradation of accuracy. It appears to be quite stable.
One additional note about the bottom group of reads, the 350-base-pair inserts: in order to produce those, significantly longer raw reads were required. For 350-base-pair inserts plus adapter segments, the raw reads had to be on the order of 750 or 800 bases. And if you look at the absolute number of reads for that one multiplexed sample, that is not a small number of reads of that length to generate within a single 1-hour run.
Looking at the next slide, we address another very common question for any sequencing technology: how does accuracy hold up for homopolymers? There are a number of different ways to view homopolymer accuracy versus length. We've chosen one here that we think is meaningful for a good number of people and easily understandable: we've looked at the impact that homopolymers have on variant calling.
So here we've taken the variant caller's calls within the Genome in a Bottle high-confidence region, on which we've assessed those F1 scores, and we've looked at all variants that interact with homopolymers of different lengths. These are specifically indel variants interacting with homopolymers from length 2 all the way through length 30 and even beyond. And we've looked at this data through the lens of two different variant callers.
These are the two Mark mentioned previously. The first is the Roche internally developed and optimized variant caller, which is based on GATK plus Roche ML. The second, on the right, is DeepVariant. We can see that for all homopolymer lengths, F1 scores are very good.
In particular, for all homopolymers of length 15 or less, F1 scores are greater than 99%, and we believe this is very good performance for SBX. We're certainly very proud of this number.
We also would like to acknowledge of course the work of the genomics team at Google. We've really been working with them for a fairly short period of time, but the collaboration has been great and they've put in great energy and we look forward to more collaborative work to come.
Another set of very important metrics for any sequencing technology is coverage uniformity, and whether there are any biases as a function of genomic content or specific types of sequences. In the upper left, you can see a very common plot, a depth histogram, as it's called, which shows the coverage of different positions in the genome.
For a 30x subsampled run, which again is our Genome in a Bottle (GIAB) HG001 sample from that demonstration run, we've subsampled to 30x, that is, 30x of concordant duplex base coverage alone, and we look at the properties of that distribution, which are also summarized in the table below. It achieves both a median and a mean of roughly 30.
If we look at the F90 and F95 scores, we believe these are also good numbers. For those who aren't familiar, the way we're defining them here is that F90 is the median over the 10th percentile, and F95 is the median over the 5th percentile. We have also plotted, as you can see in the upper left, the coverage depth achieved by all bases, which includes the simplex bases as well. That coverage is a little higher than concordant-duplex-only; in a sense, again, it comes along for free.
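Given those definitions, F90 and F95 can be computed directly from a vector of per-position depths. This sketch uses a simple nearest-rank percentile and made-up depths; real pipelines use genome-wide depth arrays and their own percentile conventions:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile (a simple convention chosen for this sketch)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def fold_metric(depths, p):
    """Median depth divided by the p-th percentile depth.

    Per the definitions in the talk, F90 uses p=10 and F95 uses p=5.
    Values near 1.0 indicate uniform coverage; larger values mean the
    low-coverage tail lags well behind the median.
    """
    return statistics.median(depths) / percentile(depths, p)

# Hypothetical per-position depths, for illustration only
depths = [28, 30, 31, 29, 30, 33, 27, 30, 24, 32, 30]
f90 = fold_metric(depths, 10)   # median / 10th percentile
f95 = fold_metric(depths, 5)    # median / 5th percentile
```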
We're going to anchor our sample throughput around the concordant duplex bases, but we want to provide that information as well, and the numbers associated with that distribution are below.
With respect to biases, one view is the one we've chosen here in the upper right: we've taken the Genome in a Bottle consortium's genomic stratifications and shown relative SBX coverage over those different stratified regions. It looks very good over all of them, in our opinion, and we're certainly going to continue to push to increase performance wherever there are even minor gaps.
If you look at this highlighted section here, the low GC content bin, less than 15%: that's obviously under the 1.0 line with respect to normalized coverage, and of course we'd like to push that up in the future. But another set of regions we'd like to draw your attention to is the high GC content bins, and it has not always been the case that we could show relative coverage values of 1.0 there.
In fact, in the early days of SBX duplex, this was a challenging area for us to cover well, and it's really a testament to Mark and the chemistry team that they were able to drive this up by focusing on some select knobs in SBX synthesis. It's also a testament to something Mark has often talked about, the separation between chemistry and measurement: they were able to turn those knobs and improve the high GC content coverage without negatively impacting any other aspect of the system or the measurement process.
As we go on to the next section, again just a short section on local data analysis, I want to pull up once again the numbers associated with the instrument's base call production rate: greater than 500 million bases per second for simplex reads.
These are our Q20-accuracy-level reads, and then greater than 15 billion reads in a 4-hour run for the SBX duplex scenario, or greater than five billion reads in a 1-hour run. Obviously, the teams are extraordinarily proud of this, and we think it's a big advantage, but we also acknowledge that it could present practical challenges for those who are trying to run the instrument and generate results as fast as possible, or run the instrument continuously in a very high-throughput scenario.
So what have we done to accelerate data processing and help handle all of this data generated at such a high rate? We can start by briefly looking at the example SBX Fast demonstration that Mark mentioned toward the end of his first section, in which the analysis for that solo genome occurred in less than two hours of post-run, post-sequencing time. On this slide, we see an indication of the major algorithmic steps that were executed and the manner in which they were executed.
Base calling, on the left, was done in real time, as it always is: raw data streams off the sensor and is called instantly. Demux, intramolecular consensus, mapping and alignment, and even germline variant calling were all accelerated through various means.
In this first demonstration, those accelerated steps were performed in what we call an offline fashion, which just means we did not start that data processing until the sequencing itself completed. As we think about opportunities for further accelerating this workflow, we will of course focus on making each of those algorithmic steps more efficient, but our next deployed version of the accelerated pipeline for SBX-D will be even faster.
We are taking some of the steps that have so far run offline and moving them into execution in an online fashion, meaning they will still be done locally but concurrently with the generation of sequencing data. The key characteristics of SBX that make this possible are twofold. First, it does not take multiple hours, or a day or even more, to generate full intact reads; rather, complete, fully intact reads are generated on the order of seconds.
Second, when it comes to SBX-D, since we physically link information from both the parent plus and minus strands, that information is localized in time. So, with our strategy of forming higher-accuracy consensus reads in which we've localized our uncertainty to specific bases, we have the opportunity to execute demux and intramolecular consensus in real time, and mapping and alignment in near real time, because all the information necessary to execute those steps is localized in time and is available from the first sets of base calls coming off the instrument.
Great work here by our accelerated compute and algorithm teams, in collaboration with our instrument teams, to bring this integrated set of capabilities together. We'd also like to acknowledge, as always, our external collaborators; in this case, in particular, the NVIDIA Parabricks team.
Roche's GPU and bioinformatics teams have been working hand-in-hand with NVIDIA over multiple years to, among other things, optimize the performance of select Parabricks software components and libraries so that they may be leveraged within parts of the Roche-developed SBX pipeline. The collaboration has been extremely positive, and we certainly look forward to continued work together.
Finally, as you think about aspects of the technology being presented here today, you can quickly imagine a great number of useful features that will be made possible by such real-time integrated analysis in combination with the SBX chemistry and sensor module.
Overall, our intent, our strategy, is for the instrument's local compute to be as fast, as flexible, and, importantly, as efficient as the SBX technology itself. And with that, we'll end this first section on analysis. I'll hand it over to Mark, who will talk about SBX simplex reads, some of the novel applications that have been pursued in that area so far, and what more could come.
Mark Kokoris: Thank you. Okay, so we're shifting gears a little bit to go in the direction of simplex, as I mentioned before. A little background on this, and we'll go through it quickly, as I think everyone knows what we mean by simplex: basically just those single-stranded reads.
As before with duplex, we focus on two particular types of sequencing, probe-based and some RNA isoform work, both of which we're doing with our Broad collaborators. So here is a foundational background slide, just to give an idea of the yield we're talking about with simplex reads, using 1-hour sequencing runs as a benchmark example.
On the left, we're showing 10x Flex probes: what does that look like in terms of sequence output in a 1-hour run? Again, these are reads that Cell Ranger calls valid-barcode reads; in other words, they have valid barcodes and confidently mapped probe sets.
Okay, so it's a subset of what we actually output in one hour: nearly 14 billion reads in one hour of sequencing for those types of Flex reads. Pretty impressive. That gives you an idea of one end of the spectrum for simplex reads; now to the other side.
We've created HG001 templates, again just so we can get our bearings on quality and accuracy: a library of about 800 bases in length. Looking at a 1-hour run, what's the average read length if you just consider full-length reads? We're getting nearly 800 bases, a raw read accuracy of 99.36%, about 1.8 billion reads per hour, and about 1.4 terabases per hour.
If you look at all reads greater than 400 bases in this data set, you're close to 670 bases average length at the same accuracy, with 4 billion reads of that length and about 2.7 terabases in one hour. So for these types of simplex applications, you get an idea of the output of the technology.
I really wanted to frame that so people understood the trade-offs between this and duplex sequencing. Quickly on to the RNA, because we want to leave time for the Q&A session at the end. A little bit about the 10x Flex kit: this is a probe-capture workflow, pretty standard in this particular case.
SBX takes over at the pre-amplification step: we do a strand enrichment and drop straight into SBX sequencing, just like we covered before. It's similar for the 10x 3' and 5' gene expression workflows, only this time we can enter the workflow early on and skip a bunch of steps, such as fragmentation and repair; we don't need to do any of that. We just go straight into strand enrichment and SBX sequencing, more along the lines of those longer reads I was showing on the previous slide. So, pretty clear.
One of the other things we have to do to feed into the Cell Ranger pipeline is actually mimic the read one/read two format. For a typical Xpandomer read, as shown above, we just break it up into split read one and read two formats so it can feed into the pipeline. As time goes on, pipelines will become more customized to using SBX sequencing reads directly, and we won't have to do that.
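The splitting described here can be pictured with a toy function. The 28-base read-one length follows the common 10x convention of a 16 bp cell barcode plus a 12 bp UMI; the function itself and the omission of adapter trimming are simplifications of ours, not the actual Roche tooling:

```python
def split_xpandomer_read(seq, r1_len=28):
    """Split one long simplex read into Cell Ranger-style R1/R2.

    Hypothetical layout: the first r1_len bases carry the cell barcode
    plus UMI (16 + 12 = 28 bp in common 10x chemistry) and become read
    one; the remainder, the transcript-facing sequence, becomes read
    two. Adapter trimming, which a real pipeline performs first, is
    omitted here.
    """
    return seq[:r1_len], seq[r1_len:]

r1, r2 = split_xpandomer_read("ACGT" * 20)   # an 80-base toy read
# len(r1) == 28, len(r2) == 52
```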
A little bit about our early access Broad Institute collaborator: another first-class team there, Aziz Al’Khafaji and his group, who, as with all of our collaborators here, are really world-class in terms of the quality of the work they're able to do. So we're really excited to work with them.
The two focus areas for Aziz's team are a 1-million-cell multiomic drug response profiling experiment and a single-cell 5' gene expression experiment, as outlined here. Aziz will go into a lot more detail and speak to this again at the AGBT workshop, covering the details of the PRISM setup he has as well as what we did for the single-cell work.
So I'm really just going to focus on the output and let Aziz dive into the data. There are a lot of specific details here, but the thing I loved when we first sat down and talked about what we wanted to do: I said, let's challenge the system. Let's go big in terms of the measurement. What's a read-heavy, measurement-heavy approach? And this is what he came up with: this large-scale multiomic experiment.
Again, he'll go through the details of the experiment, but it's something where you really need a significant amount of output to do these types of experiments. A little bit about the format: I won't go into the details too much, but essentially a whole transcriptome, a protein panel, and a PRISM barcode panel all come together into multiple pools in a workflow that then feeds into Xpandomer synthesis and sequencing. So what's the output when we do this? 30 billion reads in a 4-hour run. And I'll say that I think there's a heck of a lot left in the tank to go much bigger than that in terms of the read count.
This is early data, before optimization. We can certainly push that number up, but 30 billion reads in that time frame is pretty significant. So we were excited to see that result and then to get the picture of what it looks like with this UMAP of all those different cell lines (480, I think), and essentially this conclusion: SBX is able to meet the immense sequencing needs of highly multiplexed multiomic experiments.
This is the kind of stuff where we really have that workhorse capability to drive with this technology, and Aziz will present it at AGBT in much more detail and talk about why we're excited about this high-throughput approach. Moving on to the 10x single-cell longer 5' reads: again, the workflow is able to cut out quite a few steps, which simplifies things. Any step you can cut out is great, and in this case you actually get longer reads, so it's a double benefit for the technology.
As I always say, one of my goals for the technology is to get as close to the sample as you can, and I very much look forward to doing that more and more with the applications we explore over time with collaborators and as this goes to market. Here's a little picture of what we saw with this particular experiment. This is just one run; we haven't really spent much time trying to drive yield at length with this approach, but just to give you an idea, in one hour of sequencing we saw over 200 million reads greater than a thousand bases.
So again, this is without really optimizing the system much. And to put a number on reads greater than 400 bases: two and a half billion reads for this polydisperse single-cell RNA library. I'm really happy with that. It's something we'll continue to optimize to really bring this approach to workhorse status for the technology: for these types of reads you get more than enough output with Q20-plus reads, and then the throughput and the flexibility of the workflow are really what come into play. I also threw this on as a nice little snapshot of what a 2,000-mer Xpandomer looks like in a nanopore, just to give a picture and show that pushing these lengths is something we can do.
Again, it's not something we've really focused on, but I always get the question: how far can you go? Well, we're going to find out. We're going to keep pushing lengths up. There are always trade-offs between length and yield to consider, but as you get the idea, this is a very high-yielding technology, so there are a lot of possibilities there. A little bit about the results Aziz saw: the PBMC cell types were clearly resolved on the left, with clear biomarker expression as well, pretty much as expected for the particular sequence reads. And lastly, SBX can resolve novel transcript isoforms, shown in this nice picture where, on the top, you have a novel isoform they were able to pick up relative to the known isoform in the sample set.
So this is just one of the examples that came out of the study, and again, Aziz and his team will cover it in more detail at the conference. I'm looking forward to hearing from him on that. And now we go back to John to elaborate a little more on the simplex read properties and analysis. There you go, John.
John: Great. Thank you, and thanks, Mark. We'll do a quick section here on SBX simplex read properties, the raw read error profile, and also comment on the optimized open-source tools which we plan as part of our informatics strategy. Let's look again at this image in the upper left.
Mark showed this just a few minutes ago. It shows how our simplex reads can fit quite easily into existing workflows for single cell, both for the 10x 5' and the 10x Flex constructs shown here with SBX. Because a single long read spans enough length, these reads can be split into read one and read two, which is the traditional format in which this data would be presented to downstream analysis tools.
So in the short term, we took a sort of expedient route to do just that. We perform read trimming and splitting, split those into read one and read two FASTQs, and then we can send those into existing on-market pipelines such as Cell Ranger.
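As a rough illustration of that trimming-and-splitting step, the sketch below cuts a long simplex read into R1/R2 FASTQ records. The fixed R1 length (a 16 bp cell barcode plus 12 bp UMI, 10x-style) and the record naming are assumptions made for illustration, not details of the actual Roche pipeline.

```python
# Illustrative sketch only: split a long simplex read into R1/R2 FASTQ
# records so short-read single-cell tools (e.g. Cell Ranger) can ingest them.
# The R1 length (16 bp cell barcode + 12 bp UMI) is an assumed 10x-style
# layout, not taken from the actual SBX pipeline.

R1_LEN = 28  # assumed: 16 bp barcode + 12 bp UMI

def split_read(name, seq, qual, r1_len=R1_LEN):
    """Return (R1, R2) FASTQ record strings for one long read."""
    r1 = f"@{name}/1\n{seq[:r1_len]}\n+\n{qual[:r1_len]}\n"
    r2 = f"@{name}/2\n{seq[r1_len:]}\n+\n{qual[r1_len:]}\n"
    return r1, r2

def split_fastq(records, r1_len=R1_LEN):
    """records: iterable of (name, seq, qual) tuples.
    Yields (R1, R2) record pairs, skipping reads too short to split."""
    for name, seq, qual in records:
        if len(seq) > r1_len:
            yield split_read(name, seq, qual, r1_len)
```

The resulting R1/R2 pairs can then be written to paired FASTQ files in the conventional short-read layout that downstream tools expect.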
In this particular case, it should be noted that neither the reagents themselves nor the algorithms of the existing on-market products and tools have been optimized for the SBX error profile. To give a bit of a view as to how that raw error profile looks, and why optimizations might still be made in this area of applications, let's take a look at some example DNA runs which we will use to probe the raw error profile and read characteristics.
This is a set of experiments which we've run really just for demonstration purposes. We've taken size-selected input DNA of the lengths that you see on the left; the target input DNA lengths were 370, 500, 750, and 1,000 base pairs. If we do that with the 500-base-pair size-selected input DNA, we get the raw read length distribution shown in the box above, and you can see the majority of these reads have made it to the end of the molecule. Some of them are partial reads for various reasons.
We know that we can identify the reads we call full-length reads as those which have our known adapter segment at the very end of the read. In the read-length histogram shown below, the reads have been colored by whether or not an adapter segment was found at the end. For the rest of the analysis on these next few slides, we'll look at the accuracy and error profile characteristics of just that full-length distribution, for simplicity; although when we look at the partial reads, we really see the same results with respect to accuracy and error profile.
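The full-length check described here, looking for the known adapter segment at the read end, can be sketched as below. The adapter sequence, mismatch tolerance, and search window are all invented for illustration; the real classifier presumably differs.

```python
# Illustrative full-length classifier: a read counts as full-length if the
# known end-adapter is found (within a mismatch budget) near the 3' end.
# The adapter sequence, tolerance, and window are invented for this sketch.

ADAPTER = "AGATCGGAAGAG"  # hypothetical end-adapter segment

def hamming(a, b):
    """Mismatch count between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_full_length(seq, adapter=ADAPTER, max_mismatches=2, window=5):
    """True if the adapter matches within the last `window` start positions."""
    n, k = len(seq), len(adapter)
    if n < k:
        return False
    for start in range(max(0, n - k - window), n - k + 1):
        if hamming(seq[start:start + k], adapter) <= max_mismatches:
            return True
    return False
```

A flag like this is enough to color a read-length histogram by full-length versus partial status, as in the slide.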
At a high level, the accuracy for each of those sample experiments is shown in the table below: 99.3% was achieved with all of them. We can go to the next slide to look at this in a little more detail and see how that breaks down.
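To help interpret these numbers, per-base accuracy and Phred quality convert via Q = -10 log10(error rate), so a 99.3% per-base accuracy corresponds to roughly Q21.5. A minimal converter:

```python
# Convert between per-base accuracy and Phred quality score.
# Q = -10 * log10(error rate); e.g. 99.3% accuracy is about Q21.5.
import math

def accuracy_to_phred(accuracy):
    """Phred quality from per-base accuracy (0 < accuracy < 1)."""
    return -10.0 * math.log10(1.0 - accuracy)

def phred_to_accuracy(q):
    """Per-base accuracy from Phred quality."""
    return 1.0 - 10.0 ** (-q / 10.0)
```

This is the same scale used in the accuracy-versus-position plots a couple of slides later.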
Those four experiments are broken down into a raw read error profile in the four tables below on the left: 370 is up top, then 500, 750, and 1,000 as we go down the page. All the error rates are shown there, as is the breakdown of substitutions, insertions, and deletions, which are roughly the same order of magnitude. And we can ask a very similar question for raw reads to the one we asked for SBX-D: how does accuracy depend on insert length and position in the read?
In the upper right, we can see that the 500-base-pair run has very stable accuracy for all positions in the read, and the error profile, broken down there as well, is fairly stable. Note that as with the SBX-D plots of accuracy versus position, we've trimmed slightly here to avoid edge artifacts, and trimmed adapters as well.
A preview of our raw read QC score predictors can be seen in the lower right, showing fairly good performance. Again, these are still being calibrated internally and there's much more work to do, but we're absolutely able to predict some of the low-quality reads, and we can leverage that to identify high-quality reads, or those passing our filters.
Going beyond just that 500-base-pair sample, we can look at accuracy versus insert length and position for all of those different size-selected input DNA sizes. You can see the 370-base-pair template shown on top, with a very stable Phred score with respect to position in the read for those full-length reads, and the same for the 500-base-pair template, as we just saw a moment ago, as well as the 750-base-pair and 1,000-base-pair templates.
Just to circle back very quickly to this view of the accelerated pipeline, the local data analysis pipeline: it is our intent to accelerate this pipeline, particularly to focus on the common algorithmic steps, which will facilitate customers' ability to work with SBX data in their own secondary pipelines, and to do so efficiently on-prem, analyzing in real time, reducing and compressing data so that it may efficiently be moved downstream to the user's cloud. It is not the intent to close off users' ability to work with the data. In fact, quite the opposite.
What we've demonstrated here with this image is that in addition to providing that accelerated workflow, shown in the middle, we also want to enable customers to pull data off of that pipeline at various points: to directly retrieve BAM or CRAM files, SBX duplex read files in the case of SBX-D, or, for applications where simplex reads are needed, to directly access the SBX simplex reads and move those up to the customer's own location of analysis. In addition, we aim to provide a set of open-source analysis tools which can be used as reference pipelines.
Customers could take a look at them, choose to ignore them, or integrate them into their own cloud pipelines. In any case, we aim to do whatever we can to facilitate customers' adoption of the technology and make it as easy as possible to adapt both the assays and the informatics algorithms to the SBX read characteristics. And with that, I'll hand it over to Mark to speak a bit about the future application areas and the headroom of the technology. All right.
Mark Kokoris: Thank you, John. Okay, so we're in the home stretch, almost there. A little from Gustav after that, and then we get to the Q&A. So, innovative headroom. I guess we're kind of ending where we started the presentation, in terms of the SBX technology into the future. It's interesting, after 18 years with the technology.
I look at it as though we're at the beginning of the technology, not at some culmination; we're really at the beginning, getting excited about the possibilities for what we can do, having learned so much over the years bringing these two technologies together.
So as we think about some of the things we'll add in the future in terms of our flexible operation, there's this run-until-done idea: if you do a sequencing run and you know how much you need, you don't need to sequence further; again, another form of efficiency. When we think about accuracy, there are chemistry, measurement, and algorithm optimizations. We understand this technology very, very well, and there are a lot of buttons to push, both with respect to the polymerase side and the chemistry side, to keep pushing that raw read accuracy up and up and up.
We understand it very, very well, and we will continue to do that over time to keep pushing accuracy and throughput efficiencies. I mentioned faster pulsing before; that's a significant button to push, as well as increasing the density of the sensor module: again, more pores than what we even have at this point. You're starting to get an idea of the type of possibilities we have into the future, of course thinking about all of these in the context of cost efficiency.
So again, I like to think of this as a beginning for the technology; we're just getting started with what's possible. Looking toward the future application space, I like to say people are going to need to relearn what's possible with this technology, given so much throughput and given the quality of our duplex reads. That's an exciting thing for me, the team, and all of Roche to see: how many other things we'll be able to do with this kind of technology capability.
So rethink what's possible with the technology across the whole range of applications, and even think of things that you otherwise would never have wanted to do that are now possible when you consider the read-heavy output that you have with this technology. It's very exciting for us to think about these possibilities, and we really look forward to driving this Formula 1 car around the racetrack and seeing what we can do with the technology. It's an exciting time.
Thanks, everybody, for coming today. We'll hand off to Gustav, our lifecycle leader, who will go into a little bit of detail on the commercial side. So Gustav, over to you.
Gustav Karlberg: Okay. Thank you Mark and thank you to both Mark and John for a great presentation. So much interesting information to share with you today. We're very excited about where we are with this technology, as you can tell.
I get the opportunity and the privilege to talk a little bit about our commercialization strategy and what we're looking to do as we move forward here. Mark talked a lot about the different workflows and showed data from some of those that we've generated already. I think the big takeaway here is that this technology has a lot of versatility. It can be used across all the existing workflows that we have in sequencing, and also for things that maybe we're still just imagining, in a way that makes it ideal for a lot of these different applications. And we as a company, Roche, will think about some of these applications where we would like to develop an end-to-end solution.
We're going to focus on that over time, but I do want to talk a little bit about the fact that we want to make this an open, available solution for you, the industry and the researchers, to do the work that you want to do. We are going to release this as an RUO product. We have a future vision to take this to the clinical setting, and SBX will be our backbone for all the technology development that we want to do in sequencing.
But I want to make it very clear that in these first initial releases, we are making this a very open technology for everyone to use. We also heard a lot in the webinar about the extremely high throughput of this technology and the speed that it has. Both of these factors are super important, as they enable us to do all these different applications of interest.
But what we're focusing a lot of effort on and what Mark talked about as well is this flexibility that we've built into the system. We are really trying to make sure that we can drive for flexibility, enable you to run sequencing in different ways than what you're currently doing, and have an efficient way to meet your lab operation needs.
Sequencing has been driven by the cost reduction that has happened over time. However, with regards to uptake, the challenge is that a lot of those cost reductions are realized only by very large studies. What we are focusing our design efforts on is seeing how we can minimize that price penalty between the very large studies and the smaller studies, make the technology accessible for more labs, and have a technology that can actually answer some of those questions that all of us have.
Finally, I want to address a little bit about our commercialization and our timelines here. You've heard Mark talk about the early access work that we have done with our collaborators. We are very excited about that. We will continue to work with those collaborators in 2025 here. We also want to expand that program in a selective way with some additional collaborators looking at some of these applications and workflows of interest.
At the same time, we are working hard to get this to a commercialization stage and we are targeting to have that available in the market in 2026. We are extremely excited about where this technology is, what we have been able to do and what we've been able to show you today and we are really gearing ourselves up for something new and really groundbreaking with regards to how sequencing is done.
And before we go on to our Q&A session, I want to hand it back over to Mark to give acknowledgments to all the teams and all the work that has been done here over the last couple of years. So Mark, please.
Mark Kokoris: Thank you, Gustav. So first I want to thank everyone for coming today, and I hope you're getting a sense of our enthusiasm and excitement for this technology, where we are and where we're going with it. This excitement across Roche is pretty significant. I look forward to the Q&A session. As far as the Roche team in general, this has been a global collaboration, from Seattle to Santa Clara, Penzberg, Pleasanton, Cape Town, and many other places. Hundreds of people across the organization have been involved in this technology over the years, so it is a huge thank you from all of us to all the teams for their work over many, many years to bring the technology to this point, and for the many years to come as we really start rolling out this technology and showing what it can do.
So I'm very excited for that. One other acknowledgement I wanted to make here: I want to call out my counterpart at Genia, Stefan Roever, who unfortunately passed away a few years ago. He wasn't able to see all of this come together for the Genia technology, and what happens when we bring these two technologies together. I know Stefan would have been thrilled to see the results that we're showing today, and I wanted to honor him with that.
I also want to reach out to my team, my Seattle team at Stratos Genomics, for all the years of support behind the technology: believing in this when it seemed like a crazy idea that was never going to get there, and yet still maintaining the rigor, the persistence, and the grit to make this technology come together, innovating, innovation after innovation, to do that.
So I want to thank them for that, and all the Roche employees again for all of the innovation that we've done and are going to do into the future. And lastly, I want to thank Severin Schwan, Thomas Schinecker, Matt Sause, and the entire executive team for their unwavering support for this technology over the years.
We really take innovation seriously: innovation that can impact all of science and innovation that can impact patients' lives. This is something we take very seriously at Roche, and I want to thank them for that vision and that support and commitment over so many years to this great technology. I really look forward to seeing what it can do into the future.
Explore more
Explore articles from our community
The AXELIOS 1 sequencing platform and sequencing by expansion (SBX) technology are in development and not commercially available. The content of this material reflects current research study results and/or design goals. The AXELIOS 1 sequencing platform based on SBX technology will be launched for Research Use Only. Not for use in diagnostic procedures.