Overview of sequencing by expansion (SBX) technology
Key takeaways
Sequencing by expansion (SBX) uses a novel strategy to convert DNA information into Xpandomer molecules, helping to overcome signal-to-noise limitations of current nanopore-based sequencing technologies
When combined with high-throughput arrays and real-time processing, this pore-based technology delivers fast, flexible, and accurate single-molecule sequencing at scale
SBX offers two distinct workflows: SBX-D (duplex sequencing), which provides high accuracy for applications like WGS and oncology, and SBX-S (simplex sequencing), which is optimized for ultra-high throughput and is suited for applications like transcriptome analysis and identifying novel isoforms
A new approach to NGS
Roche has responded to the demand for improved performance by developing a new category of next-generation sequencing (NGS) technology, called sequencing by expansion (SBX). This powerful approach to NGS has been designed for flexibility and performance, with headroom to scale into the future.
Fundamentally, SBX technology converts DNA information into a longer, “expanded” molecule, overcoming the spatial challenges of current nanopore technology and enabling higher signal-to-noise for improved accuracy.1 This expanded molecule, or Xpandomer, is then fed through Roche’s proprietary nanopore, driving single-molecule sequencing at incredibly high rates of speed and facilitating rapid access to usable sequencing data.
Demonstrated research methods
The versatility of the AXELIOS 1 platform has been successfully demonstrated across a broad range of research methods. Its dual sequencing modes provide maximum flexibility to meet your specific project requirements. The SBX-simplex workflow delivers the high throughput required for data-intensive applications, while the SBX-duplex workflow leverages intramolecular consensus to eliminate artifacts and maximize accuracy. Explore the resources below and contact a representative to learn more.
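To make the idea of intramolecular consensus more concrete, here is a minimal Python sketch. It is an illustration only and not Roche's consensus algorithm: it assumes the reads of the two linked strands are already the same length and perfectly aligned, it ignores indels and quality values, and the function names are hypothetical. Bases on which both strands agree are kept, and disagreements are masked as N.

```python
# Illustrative sketch only: a toy duplex consensus, not the SBX-D algorithm.
# Assumes both strand reads are gap-free and cover the same fragment.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def duplex_consensus(strand_read: str, partner_read: str) -> str:
    """Combine the reads of two complementary strands of one molecule.

    strand_read  -- read of the first strand, 5'->3'
    partner_read -- read of the complementary strand, 5'->3'

    A base is kept only when both strands support it; otherwise it is
    masked as 'N' (a real pipeline would adjust base qualities instead).
    """
    partner_as_first = reverse_complement(partner_read)
    if len(partner_as_first) != len(strand_read):
        raise ValueError("toy example requires reads of equal length")
    return "".join(
        a if a == b else "N" for a, b in zip(strand_read, partner_as_first)
    )


if __name__ == "__main__":
    # Both strands agree everywhere.
    print(duplex_consensus("ACGTT", "AACGT"))
    # The partner read carries one error, so that position is masked.
    print(duplex_consensus("ACGTT", "AACCT"))
```

An error that appears on only one strand is unlikely to be mirrored on its partner, which is why requiring agreement between the two linked strands removes many single-strand artifacts.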
Related information
Resources
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
[Music: The Black Keys - Beautiful Day]
I could have listened to this music a bit longer. Okay. Johanna, thank you very much for this excellent overview of the technology. The question now is how this works out in a customer’s hands, so I’ll try to tell the story of the technology from our perspective. At Hartwig Medical Foundation, we do whole genome sequencing of tumor samples with the goal of using it for cancer diagnostics in routine clinical settings, as well as building up a database so that we can improve care for future patients. That database is available to researchers. These are my disclaimers. The key question is: can we use the Roche AXELIOS system for this same purpose? From our perspective, there are five factors that play a role in assessing that and in comparing technologies. This of course holds true for every technology, including any alternative to what we’re currently using in routine. First, there is lab logistics complexity: if you do diagnostics, you want things to be as simple as possible. Then turnaround time: patients do not want to wait, so it should be done as fast as possible.
Throughput should be scalable and flexible, so that’s an important factor. And of course, without the right accuracy, no technology qualifies. And at the end of the day, we have to be able to pay for the test, so cost plays an important role as well.
So let me go through these five factors, starting with what a procedure looks like in a routine clinical setting. There is more to it than just sequencing. A four-hour run time is great, but from a test perspective you start with the samples coming in, which in our case are a biopsy and a blood sample, from which we have to isolate the DNA and do the QC. All of that happens on the first day, and at the end of that day we start the library prep. We do that overnight because we automate everything, which is what you want to do in a routine setting. The next part, on the next day, is the Xpandomer synthesis and pooling, and then starting the sequencer; as Johanna just mentioned, you can start multiple runs in one cycle.
How does this compare to what we’re currently doing? There are a few steps that differ from our current Illumina-based procedure. The library prep has a few more steps, so you have more cleanups in there. On the other hand, we have automated all of this on a Biomek platform, so we can deal with that. The next day we have the Xpandomer generation as an additional step. It takes four hours with 50 minutes of hands-on time, so that is not too complex. Sorry, the pooling is done before the Xpandomer synthesis; there was an error in there. After that, we prepare the sequencer and start the run.
And here is an interesting part: the sequencer run lasts at most four hours, and the data is already being mapped while the sequencer is running, as the data is produced. So 20 minutes after the run is finished, we have mapped BAM and CRAM files and we can kick off our bioinformatics. Looking at these procedures, we can shave at least a day off our current fastest time from sample to reporting, and in principle we can do this in less than three days on this setup.
Then let’s look at throughput. What can be pooled in a single run is 16 genomes. We do tumor-normal whole genome sequencing, so we sequence the tumor three times and the normal once, which means we can fit four patients in a single run. A run takes four hours and we can start four runs, so we can do 16 patients in one day. If you run the machine full time and repeat this cycle four times a week, you end up with about 256 genomes, or 64 patients in our case, per week. That’s approximately the same as what we’re currently achieving on a NovaSeq X Plus setup using 25B flow cells. So the throughput of this machine is quite similar for the application that we’re currently running in routine.
This is an example of whether this really works out. We randomly selected 10 patient samples, and although you see variation in primary characteristics such as duplication rates in the raw data, we get the same median de-duplicated genome coverage for the tumor and the normal of all 10 patients when you put six tumor-normal pairs on an Illumina NovaSeq S4 flow cell, which runs for two days, as you would get from four tumor-normal pairs on a Roche AXELIOS duplex run. But you can run more in the same time.
So all of that works well. So in conclusion, I would say if you look at the first three steps: lab logistics complexity slightly more complex but easily mitigated by automation; turnaround time is faster; throughput is similar, but we have a bit more flexible batch size because we can start with four patients in a batch up to 16 in one day without any consequences of not having runs fully filled up, while for an Illumina setup, we have to fill the whole flow cell to get to a lowest cost price level. Okay, now the real question because accuracy, I think, is key if you go to diagnostics. So the rest of my presentation will focus on those aspects.
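The throughput arithmetic quoted in the talk can be checked with a short Python sketch. The numbers are the ones stated by the speaker; the variable names are illustrative, and treating one "genome" as a 30X-equivalent (so that a 90X tumor counts as three) is an assumption inferred from the coverage figures given later in the talk.

```python
# Figures as quoted in the talk; names are illustrative, not vendor terms.
GENOMES_PER_RUN = 16     # genomes (assumed 30X-equivalents) pooled per run
RUNS_PER_DAY = 4         # runs started together in one daily cycle
CYCLES_PER_WEEK = 4      # how often that daily cycle is repeated per week
TUMOR_EQUIVALENTS = 3    # tumor sequenced "three times" (90X vs 30X)
NORMAL_EQUIVALENTS = 1   # normal sequenced once (30X)


def weekly_capacity() -> dict:
    """Derive per-run, per-day and per-week capacity from the quoted figures."""
    per_patient = TUMOR_EQUIVALENTS + NORMAL_EQUIVALENTS
    patients_per_run = GENOMES_PER_RUN // per_patient
    patients_per_day = patients_per_run * RUNS_PER_DAY
    return {
        "patients_per_run": patients_per_run,
        "patients_per_day": patients_per_day,
        "genomes_per_week": GENOMES_PER_RUN * RUNS_PER_DAY * CYCLES_PER_WEEK,
        "patients_per_week": patients_per_day * CYCLES_PER_WEEK,
    }


if __name__ == "__main__":
    for name, value in weekly_capacity().items():
        print(f"{name}: {value}")
```

Because the smallest unit is a single run of four patients, a partially filled day only lowers the number of runs started, which is the batch-size flexibility the speaker contrasts with filling a whole flow cell.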
In roughly the past half year, we processed almost 120 patient samples, tumor-normal pairs, that were sequenced and analyzed in a routine diagnostic setting for a comprehensive cancer center in the Netherlands. So we have the base data and use that as the reference point for assessing the performance of Roche’s AXELIOS system. We sequence at 90X coverage for the tumor and 30X for the blood control, and all steps are performed at Hartwig Medical Foundation. All the data generation, all the sequencing, all the library prep, everything has been done in-house by Hartwig personnel. So this is really the customer experience that I’m presenting here. We use duplex sequencing for high-accuracy variant calling for this application. The tumor types were more or less randomly selected, so they’re quite average for what we see. They’re all solid metastatic cancer patients, but as a random selection of tumor types they give a representative picture of how the platform performs across different complexities in the genomic landscape.
This is the data over time. There are almost 500 genomes in this graph, and I think it confirms what Johanna just showed as well. We see very consistent performance over time and across runs, always meeting the specs that we need to see on the relevant parameters for primary data generation. That means we get highly consistent performance, also in an automated library preparation setup and when scaling sequencing.
The next step is a bit more complex, because once you start using software, you can see a lot of differences, and they can depend on the settings that you choose in the software. Especially if you use different tools, it’s very hard to get to the same parameters. This is why we further developed our software suite, which is called Hartwig WiGiTS.
It’s a set of tools for comprehensive cancer characterization that goes beyond single nucleotide variant calling, structural variant calling, copy numbers, etc. It also includes things like HLA typing, neo-epitope prediction, homologous recombination deficiency, and tissue of origin prediction. So there’s a whole suite of tools that we use in our diagnostic setting and also for building up our database as a rich resource for researchers. In the past year we worked very hard on making this package platform agnostic. That’s the work done by Thomas, Charles and Peter, and I think they did a brilliant job building this first component, which is called REDUX. It does de-duplication and that type of thing, but it also makes the whole pipeline agnostic to input material and sequencer. It does that by an empirical re-calibration of the error values that you get off the machines. We look at what the real error rate is for SNVs in a certain context, for indels in a certain context, etc. By doing that, not only does the called base have the same meaning, but the quality values that go into the pipeline also get the same meaning, because the way they are originally emitted by the sequencer is very technology specific. That’s the principle behind this tool.
This is one example, because the tool not only makes the whole pipeline platform agnostic, it also allows you to compare the raw quality values of platforms in an empirical setup. This is just for small nucleotide variants, comparing single nucleotide variant accuracy for the duplex bases on Roche versus the highest quality bin on Illumina. We took the highest quality bins for both, but the same exercise is done for every quality bin. A few things are notable. One is that error rates on every platform differ quite a lot based on the sequence context.
So the type of change that you see is quite variable, what the error rates of the different changes are, but also the sequence context is very impactful, the trinucleotide context. So in certain contexts, error rates are much higher than in other contexts.
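The principle of an empirical, context-specific quality value can be sketched in a few lines of Python. This is a simplified illustration and not the REDUX implementation: the function names, the pseudocount, and the example counts are all hypothetical. It enumerates the 96 substitution-in-trinucleotide contexts (using the common convention of a pyrimidine reference base) and converts observed error counts into Phred-scaled values with Q = -10 * log10(error rate).

```python
import math
from itertools import product

BASES = "ACGT"


def trinucleotide_contexts() -> list:
    """List the 96 contexts, e.g. 'A[C>T]G' (pyrimidine reference base)."""
    contexts = []
    for ref in "CT":
        for alt in BASES:
            if alt == ref:
                continue
            for five_prime, three_prime in product(BASES, repeat=2):
                contexts.append(f"{five_prime}[{ref}>{alt}]{three_prime}")
    return contexts


def empirical_phred(errors: int, observations: int, pseudocount: float = 0.5) -> float:
    """Phred-scaled quality from observed counts: Q = -10 * log10(rate).

    The pseudocount (an assumption of this sketch) avoids log(0) when a
    context shows no errors at all.
    """
    rate = (errors + pseudocount) / (observations + 2 * pseudocount)
    return -10 * math.log10(rate)


def recalibrate(counts: dict) -> dict:
    """Map each context to an empirical Q value.

    counts -- {context: (errors, observations)} measured on the platform
    """
    return {ctx: empirical_phred(e, n) for ctx, (e, n) in counts.items()}


if __name__ == "__main__":
    print(len(trinucleotide_contexts()), "contexts")
    # Made-up counts purely for illustration.
    example = {"A[C>T]G": (120, 1_000_000), "T[T>A]A": (3, 1_000_000)}
    for ctx, q in recalibrate(example).items():
        print(f"{ctx}: Q{q:.1f}")
```

Because the resulting Q values are derived from observed error rates rather than taken from the instrument, a given value means the same thing regardless of which sequencer produced the read.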
The left and the right graphs represent the same data. The right one is maybe a bit easier to understand: it shows every change that you can have in the 96 trinucleotide contexts. There you see that the Phred scores range between 40 and 60 for the Roche duplex technology and between 35 and 45 for Illumina. So on average the error rate in the raw reads is much lower on Roche than on the Illumina platform. Of course, you have to do all of this for all the bins. For completeness, the numbers are at the bottom; I will not explain them. Does that matter? Yes and no. For standard somatic variant calling, Q values above 30 are typically good enough. It’s only in specific applications that higher quality values start playing a role. I want to give that disclaimer as well, so that we do not exaggerate what we’re seeing here. What is important is that the quality values are very high and appear good enough for what we want to do with them.
How does it then work out? This is small variant and indel detection in 115 patients. We stripped out the MSI samples because otherwise they would dominate every data point in the first graph. We see a very high correlation of somatic variant allele frequencies for both SNVs and indels. If you look at the tumor mutational burden or indel burden, we get an almost perfect correlation. The number of private variants is also what you would expect, and most of the platform-unique variants are stochastic effects at the cut-off levels. This is the noise that you also see if you sequence a sample twice on the same platform.
The same holds for structural variants; there are different tools in the package that deal with these. You get break-ends, deletions (small, big, or biggest), duplications, etc. And for all of these categories, we see an almost perfect correlation.
And if you add up all the structural variants, we get to the same counts for all the different categories in this case. Then basically other types of things that we’re doing—so it’s important to come to the right conclusions about variants to know the tumor purity and average ploidy. We get—well, there is one example where we came to a different conclusion whether there was a whole genome duplication event. That does not impact any of the downstream work, but meaning this is also fully concordant. Next graph, copy numbers. If you look at chromosome arms, aneuploidies, or high-resolution 100kb bin copy number alterations of every event in all these patients, we again get an almost perfect correlation for both measures in terms of copy number. This is summarizing data for a single patient: point mutations, copy numbers, and structural variants in a Circos plot, and by eye you actually will not be able to spot a difference between the larger, high-resolution picture of a cancer genome between the two technologies. So we can look at derived parameters and measures which are clinically relevant biomarkers: microsatellite instability, tumor mutational burden classification, but also more complex homologous recombination deficiency which underlies a machine learning algorithm that uses input from different characteristics from the tumor. We get to an almost perfect correlation. There is one discrepant homologous recombination deficiency classification that is deficient in Illumina but not in Roche. There are no known mutations in known homologous recombination deficiency genes, so we actually cannot really tell if this is a false positive or a false negative. If you continue this, tissue of origin prediction, it basically uses information from basically everything that we know of the tumor molecularly and using a machine learning algorithm for that as well to come to a prediction. We get 100% concordance. 
We come to the same conclusion about what tumor type it would be based on the genomic information. HLA typing, also important for certain tumor types, also for predicting which neo-epitopes are presented. Our toolset has a specific tool for this, for which we can do four-digit high-resolution HLA typing. And basically for HLA for class one, which is used mostly, we basically get an almost fully concordant classification. There are three discordants: one we basically are sure that our software made the wrong call on the Illumina data; the other two we cannot conclude on that. So if you then add this up and summarize, we see that the reporting concordance for somatic variants is almost 99%. If you add in all the other parameters etc., which are clinically relevant, we go up to 1,200 events that we checked and we get to 99% concordance. And that from a clinical perspective and validation perspective is extremely high from our experience. So accuracy, I would conclude: this is also clinical accuracy, clinical reported accuracy, it is the same. So then costs, moment you’ve all been waiting for. It was just announced by Mitu we’re going to be talking about costs. So have all your cameras ready, I got the privilege to do this. Nah, not. Sorry about that.
I just want to end by mentioning two things. Based on the clinical data and our experience, we do see that Roche AXELIOS is suited for doing the same thing, at least according to the first four parameters. We’ll have to see whether the fifth parameter is going to change my conclusion.
The final slide is this one. I also want to mention that we’re a non-profit organization that tries to work as openly as possible. None of the software that we used for the analysis is Roche-based; it is completely open-source and available in different formats. The data that I presented, everything in this presentation, is available too. It’s privacy-sensitive data, so you can request it through an access-controlled mechanism; the barcode and the link are there. So in principle, as of next month, you should be able to reproduce the results that I presented today.
With that, I would like to acknowledge the people involved: the lab work under the leadership of Ewart de Bruijn, sitting here in front of me; Joep de Ligt, data analysis lead; and, very importantly, the algorithm development group in Australia led by Peter Priestley. And of course the team at Roche that supported this journey. Thank you very much.
Presentation by Edwin Cuppen (Hartwig Medical Foundation) at AGBT 2026: "Roche-SBX for paired tumor-normal whole genome sequencing."
I hope everybody has had a good AGBT thus far, and we really thank you for choosing to spend the next hour with us. I’ll start us off, and then I will be joined with my colleagues, Johanna Whitaker, who leads our sequencing research and development team; Professor Edwin Cuppen from the Hartwig Medical Foundation, an early collaborator on the AXELIOS 1 platform. He will be followed by Gustaf Calberg, who leads our sequencing systems lifecycle, and then last but not least, Mark Kokoris—he’s the head of SBX technology, the founder of the technology, and by no means a stranger to this stage. I also hope that everybody has had a chance to look at the platform in our suite. AXELIOS 1 is a two-system platform, and the separation of chemistry and sequencing measurement has really enabled it to be designed for flexibility, scalability, and efficiency all the way to analysis. The first of these systems is the synthesis instrument that in parallel can process up to four independent library pools. You can also process them together in parallel fashion up to four, or you can do one, two, three, or four depending upon your throughput needs. The reagents and consumables associated with this platform are modular that actually allows batching flexibility based on your specific needs. Optimized protocols will be available at launch to support various methods as well as various sample types, and then we will continue to build from there. The Xpandomer synthesis process happens on a fixed surface of the Xpandomer chip. The input is your multiplex library pool, and the output is an Xpandomer per pool of those libraries that can be then sequenced downstream on the sequencer, and only four hours are required to generate up to four Xpandomers on the platform. The sequencing system can then process these Xpandomer pools in a simultaneous fashion. 
So each of those Xpandomer tubes that I showed on the previous slide is one run, and that is tunable from a few minutes to up to four hours on the sequencer, and up to four of those Xpandomer tubes can be then set up on the sequencer to run in a simultaneous fashion to allow multiple or high-throughput sequencing if your lab needs that. The analysis module that is integrated with the sequencer allows for flexible and accelerated analysis, and we’ll talk about that as well. When Johanna comes up, she’ll talk a little bit more about the various sequencing run modes in terms of those individual runs as well as queued runs. The sequencing system uses a sensor module with 8 million microwells, and as we have shown previously, a large number of these microwells are available for the Xpandomers to pass through and a readout that results in high throughput. And the structure of Xpandomers include these specific translocation control elements that very precisely control the movement of Xpandomer through the pores. This results in a very clean signal and at a very high rate of more than 500 million bases per second. The combination of this high speed with the clean signal as well as the large number of microwells on the chip allows for both high throughput as well as the fast speed that we have come to establish with the SBX technology. So in summary, each instrument has its consumables and reagents. Again as a reminder, when we talk about the AXELIOS 1 platform, it consists of the chemistry and the measurement module independently. Each of them have their own consumables and reagents for intuitive UI for loading. However, regardless of the throughput or application needs, you will have the same set of consumables. So there are no different kits based on cycles or output that you need—you need essentially the same reagents and consumables. And I know many of you have been waiting for pricing. We will talk about pricing in this presentation. 
What I need you to remember is when we talk about pricing and we talk about sequencing consumables, we are talking about consumables across chemistry as well as measurement. There are no other fees to be paid; this is an all-inclusive price that we will talk about. And towards the end of the presentation, we will also give you a little bit of a heads-up on the commercial timeline. So stay tuned.
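As a rough back-of-the-envelope check on the throughput figures quoted earlier in this talk, the Python sketch below multiplies the stated read-out rate by the run length. It takes the "more than 500 million bases per second" figure at face value as a sustained aggregate rate, which is an assumption; the function name is hypothetical, and the result says nothing about how many of those raw bases end up in usable consensus reads.

```python
QUOTED_BASES_PER_SECOND = 500_000_000  # "more than 500 million bases per second"


def raw_bases(run_minutes: int, bases_per_second: int = QUOTED_BASES_PER_SECOND) -> int:
    """Raw bases read during a run, assuming a constant aggregate rate."""
    return bases_per_second * run_minutes * 60


if __name__ == "__main__":
    # Runs are described as tunable from a few minutes up to four hours.
    for minutes in (5, 60, 240):
        print(f"{minutes:>3} min run: {raw_bases(minutes):.2e} raw bases")
```

Since the quoted rate is stated as a minimum, these totals are best read as order-of-magnitude estimates rather than specifications.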
One distinctive feature of SBX technology is the sequencing modes. So we have talked about this before—the duplex mode and the simplex mode. In the duplex mode, the parent strands are physically linked in the library prep process by this hairpin index that you see on the left-hand picture. This allows us to do intramolecular consensus to improve accuracy for applications where you need that high accuracy, for example, whole genome sequencing. But in several other applications, for example, counting applications, you don’t need such high accuracy; what you need is a very high throughput. And that’s where we can use the simplex mode, which is also capable of longer reads up to 1500 base pairs under appropriate sample and library prep conditions. The simplex mode is also very versatile, so you can use it for single-cell, spatial, bulk RNA, as well as target enrichment and proteomics, and we’ll talk a little bit about that today. The input requirements of these workflows are pretty consistent with the NGS methods that you already use in your labs today. As with other methods, the input is typically a function of your sample quality and quantity as well as the needs of your specific application, and it is no different for AXELIOS 1. So what we are showing you here today on the slide is what we have demonstrated so far and kind of represents an optimized workflow. But as AXELIOS 1 gets into the community and the community adopts the platform, we are also looking forward to seeing how you stretch the limits and show what else may be possible with the technology. So as we are talking about library prep, I want to give a little bit of a view into the duplex library prep in which you are adding that hairpin adapter. So the DNA preparation is actually quite similar to what you would expect in other DNA sequencing methods, for example. So you fragment your DNA, A-tailing, and clean up. 
The difference here is the two-step ligation process, in which you add the hairpin adapter, which contains your sample index and links the two strands, and then the Y-adapters at the ends of those strands, which contain the sequences that allow the library to be attached to the XP chip. Depending on your input, you would also have an amplification step after you have created that molecule. Instead of traditional PCR, we use linear amplification, and the time for linear amplification, much like the way you determine PCR cycles, depends on the input quantity that you started with. For the standard SBX duplex workflow that we typically use in our lab, for which Johanna will show you data, we use about 50 nanograms of input and approximately six hours of amplification time. However, this amplification time can be adjusted based on your input, and you can also do amplification-free workflows if you start with a much higher input, going straight into quantification and pooling.
So for AXELIOS, we will be providing library prep to enable both duplex and simplex workflows. Keep in mind that simplex is a very broad category, and in some cases the ecosystem vendors will be providing some or all portions of this workflow, for example for single-cell RNA; we’ll talk a little bit about that in the context of a collaboration we’re driving with 10x Genomics. The various components will include the core library prep kit, which is based on our proven KAPA technology, as well as the adapters, the amplification kit, and a quantification kit that will be compatible with Qubit systems. You have seen the workflow; it’s pretty similar to how other DNA library prep methods work. So, as with our KAPA portfolio, we will be automating the SBX library prep workflow on our walk-away automation platform, AVENIO Edge, as well as with third-party ecosystem providers like Beckman, Hamilton, and Tecan.
And then Volta Labs is also looking into integrating SBX on the Callisto platform. Speaking of analysis, the data generation rates we talked about are quite fast; we are doing real-time base calling on the station that is included with your sequencer module. For counting applications, you would then be able to get those FASTQs pretty much as soon as your run is finished and start your analysis.
For SBX duplex methods, you can also perform demux and intramolecular consensus and then feed those consensus-read FASTQs into your standalone analysis. We will also provide an option to perform mapping and alignment on the station. The value of on-station analysis is that you get these files pretty much as soon as the run is finished, and as a reminder, the run can last from a few minutes to up to four hours. All these files are available in open formats, so they can be used with ecosystem analysis tools. But we will also be providing XOOS analysis tools, starting with demux and consensus, and extending to a variety of variant types. All these tools will be freely available to the community, and many of them will also be open-sourced and hosted on GitHub. Last September, we released the first of our datasets, a seven-plex Genome in a Bottle sample, along with the small variant calling with XOOS; these are all available on our website, so you can download them from there. We also have a collaboration with Google Life Sciences, and there is a DeepVariant SBX model that is also available, along with a white paper showing how germline small variant calling can be performed using SBX duplex.
This was the start. Over the coming months, we will be releasing additional datasets across different whole genome sequencing and RNA applications with different types of samples, as well as application notes, and we’ll continue to iterate from there. Using the various methods and protocols on the synthesis module, as I mentioned, we have been working very closely with the community.
There’s a lot of internal work that has gone into showing the duplex workflow in the context of different sample types, and Mark presented quite a bit of this data last year at various conferences. We have also been driving collaborations with our early access partners, such as Hartwig Medical Foundation (Edwin will speak more about his experiences later today), the Broad Institute, on both rapid whole genome sequencing and single-cell RNA based on 10x Genomics products, and the Wellcome Sanger Institute, on bulk RNA. And we do expect to continue to work with the ecosystem.
An example of that is proteomics workflows on AXELIOS 1. This is a workflow where the simplex method really comes into play, with the large number of reads that can be generated in a pretty short time. As proof of principle, we collaborated with three of the vendors in this area: the proximity extension assay with Olink, the proximity network assay with Pixelgen, and the NULISA technology from Alamar Biosciences. What we did here was take pre-made libraries prepared with the commercial kits from these vendors and optimize them for AXELIOS 1 using a prototype workflow. We then took that data, down-sampled it to the same number of reads as on the existing platform, and plotted the two against each other, and you can see very high concordance on each of those platforms. Now, this is a proof of principle, and what we are really excited to see is how we continue to develop these assays as well as working with other ecosystem vendors. With that, I am going to invite Johanna Whitaker to give us a little bit more information on SBX duplex.
Presentation by Mitu Chaudhary at AGBT 2026: “AXELIOS 1 Overview”
Thanks for the intro, Mitu. Thanks also for walking through my whole CV. That was great. So I'm going to talk to you about work that pretty much adds on to what Mark talked about. We've been focusing a lot at Broad on trying to have not just proof of principle that this technology can work, but on how we actually start thinking about this in practice, especially in the clinical space.
So, Broad Clinical Labs, we're a part of the Broad Institute, also known as the genomics platform within the Broad Institute, really large sequencing center in the U.S. based in Cambridge, Massachusetts. We serve quite a lot of different groups that span across different types of partners. So, we have some groups that are like really trying to push the boundaries of genomics.
So, how do you do, identify new types of events, add new technologies, really like the people who are kind of creating the groundbreaking research. We then also have a lot of groups that are doing like resource building, so a lot of the tools that a lot of people in this room end up using actually come out of the Broad Institute.
And also we're a part of like large consortia, so we generate a lot of data, and we make a lot of that data publicly available. And then the last group are really clinical translators. So this is kind of a newer arm of Broad Institute. And that's where our group primarily focuses. And how do we take this technology and these large data sets and our scale and actually apply them to clinical problems?
So, here's a couple of the different partners that we have in the different groups, and some of these may be familiar to you. Most of them are based in the US, and I recognize that this is an international community, so some of them may not be familiar. So we are a large-scale sequencing center, so this is just looking at our current fleet. So we have a large setup of NovaSeq X Plus systems that are pretty much running non-stop around the clock.
We have several of the Ultima systems and then 10 PacBio Revios. On the right-hand side, you can see some of our midplex and smaller sequencers as well. So we're pretty much platform agnostic, and we actually really like to have the opportunity to engage with vendors as they're first working on their technology. So we tend to be like a proving ground for a lot of new technologies, which is a really great opportunity.
And that's actually kind of how the conversation with Roche started, I guess about like pretty much a year and a couple of months ago was when the first conversations were taking place. And in those conversations, the Roche team, like Mark kind of went through a lot of the initial data, had some pretty impressive claims on the ability of the sequencer to have high accuracy, high speed, and high scale.
So basically, having the ability to have a lot of runs, smaller runs and smaller batch sizes leads to actually a huge amount of scale that could come out of one sequencer. The nanopore and Xpandomer sequencer has a lot of potential with accuracy, as Mark talked about.
It was great to hear about all this, and so we were kind of thinking about what the applications are for this technology. So what are those different aspects and different things that the sequencer is really good and the SBX chemistry is really good for? What are the best applications for this? And so immediately started thinking about rapid genome in the neonatal intensive care unit.
This is a figure from a publication that came out of Stephen Kingsmore's group. But in general, the ability to have early diagnostic testing by rapid whole genome ends up saving a significant amount of burden on the patient, the healthcare system, and the hospitals.
Around 60% of the patients end up with positive findings. So in the NICU, even for patients that you don't expect to have monogenic genetic conditions, once you actually start testing you end up seeing that there's a large percentage of these patients that have one. As we all know, for monogenic conditions, there's not always a therapeutic intervention.
Now there's more and more that are hitting the market but even just having an answer helps a lot with the management and management decisions both for like the neonatal team and for the parents of this child who's severely sick, and so being able to get an answer fast is critical.
And people have been doing this, so rapid NICU genome testing has been around since, I don't know, probably around 2015, 2014. And Rady Children's has really been kind of championing this in the U.S.
And other groups have as well, and now there are commercial offerings for rapid testing. However, that rapid testing, it still takes a while. So rapid could be anywhere from a week. It could be four days. Now I've seen an advertisement for like an ultra-rapid at 72 hours, which is great. Every single day that you don't have a diagnosis or every hour that you don't have a diagnosis in these patients, actually, it's a pretty big burden on the family and the health care team.
So, we got the system in January, and we presented some of this work at AGBT. Initially when we first got it, it was shortly after holiday break. We got set up in our lab really quickly. Within a week, we were generating data, and in some of the initial training runs, we were trying out the workflow.
We weren't really focusing too much on speed, although we wanted to try to keep it as quick as possible. And we were able to get it down to 12 hours and 10 minutes, which is great. That was great. So this is three samples at a time being run.
Here you can see just like looking at the quality scores between HG001 through HG004. Mark presented a lot of this. I'm not going to go into details, but like the overall performance is really good. Mark also showed this in kind of a different way, but just looking at performance, this is focused on SNVs, but the same thing holds true for indels. If you look at the homopolymers and high GC, low GC, it performs really well.
We took some of this same approach, and we started testing first patient cell lines. So we were looking at patients with different conditions, oftentimes that would be diagnosed in the neonatal period. So there are quite a few inborn errors of metabolism that are here.
And we were able in all the cases to detect the causative variant or variants. And so you'd see there's a list of small variant tests that we did, large copy number event cases, and then repeat expansions, also using some of the Roche developed bioinformatics solutions for repeat expansions. We were able to detect all of these causative findings.
We then took that same approach and tested some previously diagnosed patients in our laboratory, in our clinical lab. And this is just like patients that were tested and run by our standard 30X genome. And we were able to find the causative variant in all those cases.
And so all this work, like I said, was presented at AGBT and it was all generated in January. Mark kind of showed a little bit of this. So this was kind of like the starting point of you know, we have a really fast assay, but how fast can it be? And like this was, this was like in February right before the conference, we got it down to six hours and 25 minutes, which was absolutely amazing. This is just one sample.
Okay. So, for a single sample going through the workflow, Mark showed that we can now do it in two hours less: four hours and 25 minutes, which is nuts. That is still one sample.
As I said, our goal is not how do we show the fastest genome. We are interested in showing the fastest genome, so I don't mean to say that this is not of interest.
But when we think about productionization and actually running on a very routine basis, we want to run something that's going to be able to actually handle the needs of the clinic. And so for rapid genome, the most useful approach would be testing a three-sample set because you could have mom, dad, and proband. And that could be really, really informative when you're thinking about variants and inheritance. Is it a de novo variant? Was it inherited from a healthy parent?
All these things are really, really useful when determining the significance of a variant. And so when we started, after the conference, we spent some time, a lot of the work was done in the Seattle group from Roche, but then, taking this technology and actually applying it in our hands, we were able to get three samples, so three whole genomes going from DNA through variants in a really aggressive time.
So, it was like eight hours and two minutes for three genomes to be sequenced, which is great. We even got one to be below the eight-hour time point.
We were ecstatic, fantastic. And then we kind of added some more tweaks in. So this is a constant iterative process. And now we've got it down to seven hours and eight minutes for three samples being sequenced at the same time. So this is two separate runs of trios just to show that consistently we're getting over 30x. Consistently, we're getting our time to be in this really, really amazing time frame to be able to sequence genomes.
We wanted to see, you know, it's great to have a fast genome, but can we actually detect positive variants in patients? So we tested another group of cell lines. Some of these are the same ones, but a lot of them are different.
Also, looking at small variants, a lot of these are inborn errors of metabolism and other conditions that present in the neonatal period. Not all of them are, but most of them are. We looked at large copy number events, and we also looked at repeat expansion.
For all these, we were able to detect the causative variant again. So we're not just getting really fast sequencing and really accurate sequencing, we're actually able to find diagnostic hits in cases. So these are a set of cell lines, and this is just like looking at some of the data.
So, the data looks really clean and really nice. So this is a single-nucleotide variant for Usher syndrome patients. This is looking at cystic fibrosis, looking at a heterozygous call. Looks really nice. This is a deletion, a three-base deletion in cystic fibrosis. So this is the classic delta F508 variant, which is super common in Northern Europe. Then a homozygous single base pair deletion.
So, you'd notice a lot of the propionic acidemia cases, like Mark showed too. This is great because it's like my kind of pet disease of interest, and so I keep plugging it, and so that's why there's a lot of PCCA and PCCB samples getting tested.
This is just another example of a complex insertion/deletion. So this is in the second subunit of the propionic acidemia gene, the propionyl-CoA carboxylase gene, PCCB. In all these cases, the data is really clean, and it looks great. And these are not just like looking for events in IGV, although that's what I'm showing. These are actually calls that are being made in the VCF.
This is looking at a copy number event in the Duchenne muscular dystrophy gene DMD. This is a patient who has Becker muscular dystrophy and we're able to find the causative event here too.
There are a bunch of other examples too, I'm not going to go into many more, just to show we were able to identify repeat expansions, so this is looking at a patient that has Friedreich's ataxia. It's an autosomal recessive repeat expansion condition, which is different than a lot of the other repeat expansion conditions. We were able to see two expansions at different sizes that are in the full expansion range. So really cool to be able to see all these different use cases.
When we're testing patients, we don't really go into the VCF file, nor do we typically go searching around for things in IGV. We oftentimes will use tertiary analysis software. And so it's great to see that things are detected in the VCF. Great to see that we're able to see the variant, but are we actually able to identify and prioritize variants in cases?
So, looking at a sample, once again, propionic acidemia sample, this is a trio where you have both parents are heterozygous carriers, and then the proband is homozygous for this single base pair deletion. This is using the platform that we use clinically at Broad Clinical Labs, which is Fabric Genomics, Fabric Enterprise System, and we're able to see the causative variant that was prioritized all the way at the top, very high score.
And once again, this is not being used for clinical use, but if it was being used for clinical use, the next step would be basically the creation of the clinical report, which is also able to be done. This was more like proof of principle. Can we take a sample and the VCF and load it into a tertiary analysis platform? We did this kind of like under the radar, without even telling Fabric or really anybody else that we were going to do it, and it worked, which is great.
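The trio reasoning described in the talk (carrier parents, homozygous proband, de novo versus inherited) can be illustrated with a minimal sketch. This is not Fabric's prioritization algorithm or the Broad pipeline; the function name `classify_trio` and the genotype labels `ref`, `het`, and `hom` are hypothetical and chosen only for illustration.

```python
VALID = {"ref", "het", "hom"}  # hypothetical genotype labels for a single variant


def classify_trio(proband: str, mother: str, father: str) -> str:
    """Describe how a proband's variant relates to the parents' genotypes."""
    for genotype in (proband, mother, father):
        if genotype not in VALID:
            raise ValueError(f"unknown genotype: {genotype}")
    if proband == "ref":
        return "variant not present in proband"
    mother_carries = mother != "ref"
    father_carries = father != "ref"
    if not mother_carries and not father_carries:
        # Neither parent carries the variant.
        return "de novo candidate"
    if proband == "hom":
        if mother_carries and father_carries:
            return "homozygous, one allele inherited from each carrier parent"
        return "homozygous with only one carrier parent, needs review"
    # Proband is heterozygous and at least one parent carries the variant.
    if mother_carries and father_carries:
        return "heterozygous, inherited from either parent"
    return "heterozygous, inherited from " + ("mother" if mother_carries else "father")


# The propionic acidemia trio from the talk: both parents het, proband hom.
print(classify_trio("hom", "het", "het"))
```

A real tertiary analysis platform weighs far more evidence (phenotype, population frequency, predicted effect); the sketch only shows why sequencing mom, dad, and proband together makes the inheritance question answerable.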
So, it performs very similar to other short-read technologies like Mark showed. Especially when we're looking at the fast, the read lengths are comparable. They're a little bit longer in this duplex format, but overall, it's able to kind of come into different tertiary platforms.
So, obviously, we've done a lot in the last, I don't know, two or three months. And in the last five or six months, we've done quite a lot, but there's still a lot of work to do. And so Mark kind of alluded to this earlier. We have this collaboration agreement that was announced between our group at Broad Clinical Labs and the Roche sequencing team.
How do we take this from concept? How do we continue to optimize the bioinformatics for these complex genes and variant types, which is really important? We really have to streamline this. So it's fast and easy, but it still takes a person to do it right now. And so we'd like to get to automation whenever it's beneficial.
We showed one example of tertiary analysis. But as you can see at this conference, there's a gazillion different tertiary analysis vendors. And it'd be great to kind of see how this performs in lots of different tertiary analysis platforms, as well as giving vendors the opportunity to potentially tune their AI and algorithms so that it could detect the causative variants.
Because like I said, we did this with Fabric without them even being aware that we were doing it. And then validating an end-to-end workflow. And then really the exciting part for me is taking this idea and starting to implement it for actual patient care.
So, pairing up with a partner who's already set up and really interested in this, how do we get the rapid genome to be done in one day and one shift, which is pretty awesome and kind of unheard of at this point? So of this work, very little was actually done by me.
This is really the team that was responsible for all of the work that you saw here. The picture is the Broad team standing in front of our systems. They don't look as pretty as the ones at the booth. They're still kind of early alpha release, and obviously, this is not just work that we're doing independently.
The Roche team has been hugely helpful, and really, it's a great partnership. It's kind of rare to interact with a group and a large corporation like Roche with a team that operates very much like a startup.
So, I'm pretty sure Mark's whole team is sleeping in the laboratory and not doing anything except for this, but it's great. So, we're constantly pushing the envelope of what's possible. So thank you.
Presentation by Sean Hofherr (Broad Clinical Labs) at ESHG 2025: “Enhancing Rapid WGS for Trios by Roche SBX Technology.”
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
Mitu Chaudhary: My name is Mitu Chaudhary. I will be your moderator, and I will also be introducing our key speakers. I'm a Senior Director, International Business Leader at Roche Sequencing Systems, and my team is responsible for managing our sequencing systems products, which include the AXELIOS platform that we'll be talking about today, as well as our automation platforms.
But essentially very, very quickly, we do have a rich portfolio of sequencing technologies, primarily in the sample prep space, with our KAPA library prep and target enrichment, and also with our Navify Insight portfolio. Then, over a period of time, we have been developing products on other sequencing technologies, really building these tools together into effective solutions. For example, our AVENIO assay kits, like our comprehensive genomic profiling kit.
But going forward, this is where our platform comes in. We have talked about the AXELIOS platform, and the product that we will be launching next year will be called AXELIOS 1. This will be the name of the platform that we will be launching next year. And what this really indicates is our commitment to continue to drive innovation on SBX technology, which is already showing a lot of potential. We are actually going to talk a lot more about the new applications that we have been demonstrating on the backbone of SBX, but this is really the first in its class.
I'm also very happy to announce the pricing of the AXELIOS system. Never heard that question from anybody, I know you're not wondering at all! But I thought I'd just tell you anyway. So, in the United States, the platform will be priced at USD 750,000. We believe that this pricing will enable adoption across a wide variety of labs, for a variety of applications.
The second question nobody ever asks me, what about operational cost? That's where you will still have to wait just a little bit longer. However, what I can say is this; the platform performs sequencing runs in a much shorter time, and so what we are envisioning here is a price per gigabase that is very, very competitive to the high throughput systems, at market standard accuracy.
Now, we did talk a little bit about duplex and simplex, and we'll actually showcase a lot more on this in the upcoming presentation. So, what you will see there, especially with our simplex approaches, there we can enable a wider range of use cases and leveraging that very, very high throughput we actually see that that pricing will enable you to do deeper and broader studies at a scale that has not been possible until now.
With this system in place, AXELIOS 1, and our analytical portfolio with the XOOS analysis platform, what we envision is really across each of these verticals in the sequencing workflow, to have these individual tools and technologies that are very, very capable. But then also what you will start to see from us are, at the bottom as I say, Roche assay kits, which are these solutions for the future. So, for example, in the germline space, as well as in the oncology space with things like MRD, genomic profiling and cancer detection.
With that, it is my pleasure to actually introduce the key speakers for today. I'll first introduce Mark and then Katie Larkin.
Mark has nearly 30 years of experience in biotechnology, co-inventing the SBX technology in 2007. He formed Stratos Genomics that same year with co-founders Allan Stephan and Robert McRuer. First, he served as the company's President and Chief Science Officer and then as Chief Executive Officer. He led Stratos through its acquisition by Roche in 2020. Before Stratos, Mark founded BioCaptus, a biotechnology consulting firm, in 2003. He also served as the Director of Technology for QIAGEN Genomics and was a founding member of Rapigene. He holds a BS in Biochemistry from the University of California, Davis and is an inventor on over 20 US patents.
I'll introduce Katie Larkin as well. Katie is the Director of Clinical Product Development and Strategy at the Broad Clinical Labs, where she helps drive product innovation and shapes the strategic direction to advance BCL's mission. She has been at the institute since 2009 with experience spanning lab operations, clinical sequencing, and next-generation sequencing technologies. Mark, over to you.
Mark Kokoris: So, the problem with me with a presentation like this is, where do you stop? Because for everybody who knows me, I kind of want to relentlessly push in every direction all the time. So, for my team in Seattle, I just want to say thank you for putting up with me, I think, getting ready for this presentation.
But it is all that. So many years of pushing on the technology. I just think about seeing the things that we can do, which is very exciting. And so, what we came up with here today is several new applications: talking about the longer RNA sequencing. We're introducing this bidirectional longer RNA sequencing project that we're doing with Sanger and the melioidosis sample set. Also extending some work with Aziz Al'Khafaji's team at Broad. We're actually going to introduce some multi-omics work here as well, including some methylation data that we generated, and then show that with some tumor-normal samples. We'll introduce target enrichment. I get a lot of questions about that, and so we'll just kind of show a little snapshot of some data there. Then we'll finish off with some vignettes for whole genome sequencing, FFPE, MRD and SBX-Fast.
And then you'll notice the little symbol on there, I don't know if this has gotten out there yet, but we've actually broken the Guinness World Record for the fastest DNA sequencing technique. I'll show the time at the end of the presentation and Katie is going to spend her part talking about why that's important to be able to do. And so again, I just really want to thank my team in Seattle again for all the hard work. I want to thank the CSI team, and the related teams in Roche, for pushing so hard to put all this material together. I think it's quite a bit of stuff to cover here and it's just an amazing team that I have a lot of respect and admiration for. OK. And if it wasn't clear before, yeah, I got the world record, so this is a little bit bigger here!
OK, technology overview. We've had two webinars now, several presentations and that seminal pre-print. If you're chemistry oriented, read the pre-print. It's pretty crazy! I read through it the other day and I was there for every second of the work we did, and I can tell you I'm still scratching my head with some of the stuff. So, it's a nice read to get some background on the technology prior to the Roche acquisition.
So, we talk a lot about flexibility and performance throughout and that's just for everything that we talk about, you're going to want to be thinking about that flexible operation. It's just kind of implied with everything. We'll show examples of the higher accuracy. The high throughput is pretty much there for everything that we're doing. And then today, we're going to show some of that longer read work that we do and of course showing that world record.
So, the foundation of the technology is Stratos Genomics’ SBX chemistry coming together with the Genia high-throughput array. And, you know, again, you can get a lot of that background information from some of the previous materials.
The AXELIOS system, now we have the pricing, so we know that. What that is, is that synthesis instrument and sequencer, and of course library preparation which Jagdeesh Chandrasekar went through quite a bit of that earlier at the CoLabs. So, we'll cover a little bit of that here. And just a little bit of background for those of you who didn't see this before.
We break up the sequencing library prep structures into SBX-D, which creates that hairpin Y-adapter structure that allows us to make Xpandomers that we can get intramolecular consensus reads on. So that's where the high-accuracy applications sit, and we're going to talk about most of these today.
But then you've got the simplex side. So, for example, target enrichment, some of the short read flex stuff, but then also the longer read stuff, which we're going to spend quite a bit of time on today. So, you can look at structural variants, phasing, isoforms, things like that. Again, all of that being leveraged because of the throughput and the capability of the technology. This kind of shows again another way of looking at our scale from low to high. And the read lengths as well, but then being dissected centrally from single omics, to now some of the multi-omics things that we're talking about here.
OK. And then the wheel here, I think we've done most of these things. There are things that are not on the wheel that should be and will be. And I think we're just going to keep going around the wheel filling things in and showing what's possible with the technology.
So, shifting to early access. So, we did work, started this project. End of August, I believe, we installed the system. I see Mike Quail down there in front of the system looking quite happy. So, we did the install, it worked perfectly, and we decided, OK, well, let's do a data set. So, Emma Davenport had some samples that we could run and were a good fit for SBX-SLR, which is what we're going to cover here with this project. And so, we said, well, let's go for it. Let's see if we can get some data to show at the conference.
So the workflow again, SBX Simplex Longer RNA (SBX-SLR), is pretty straightforward. You can see the poly A portion, so you get the directionality of the structure there. We do a Y adapter ligation, and these are kind of custom Y adapters that bring in the sample ID. And if you were to want UMIs you could bring UMIs for other applications there as well, all compatible with our linear amplification protocol. And so after that you can see, directionally, what the Xpandomer or sequencing reads would look like. And just to put a number on it, we're looking at 2.5 to four billion reads per hour in that 400-600bp size range. But again, there's a pretty big histogram there. Those are average read lengths. So you'll see, because of the throughput, we get a lot of thousand-mers and longer reads out of that.
OK, so melioidosis is a bacterial infection with a mortality rate of 26% where individual response to treatment varies. And the project uses SBX-SLR to examine differential gene expression, splicing signatures and disease endotypes. So, for the cohort, I think her cohort was about 1000 in size. We took a subset of that, 90 samples broken apart into healthy, survivors, and non survivors for this particular study. As we were short on time, we just wanted to give a look at it.
And again, this is the first time we actually tried to run this in the lab at Sanger, and actually anywhere. So, pretty straightforward, intuitive protocol, 200 nanograms bulk RNA, we used an NEBNext kit for the library preparation, brought cDNA into that Y adapter ligation and then went on to the SBX-SLR workflow to generate reads. We over sequenced, of course, on purpose. We wanted to over sequence just to kind of get a sense of what that would look like. So, we got 138 billion reads in 36 hours. I think that's probably on the lower side. I think we've got even more room to expand that quite a bit. Average read length around 400. You can get an idea of the histogram right there. There were 90 samples in total, and for each sample we had 1.5 billion reads, or 3.8 billion reads per hour.
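The run figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function name `run_throughput` is hypothetical, and the inputs are the rounded numbers from the talk (138 billion reads, 36 hours, 90 samples).

```python
def run_throughput(total_reads: float, hours: float, samples: int) -> tuple[float, float]:
    """Return (reads per hour, reads per sample) for one sequencing run."""
    if hours <= 0 or samples <= 0:
        raise ValueError("hours and samples must be positive")
    return total_reads / hours, total_reads / samples


# Rounded figures from the talk: 138 billion reads, 36 hours, 90 samples.
reads_per_hour, reads_per_sample = run_throughput(138e9, 36, 90)
print(f"{reads_per_hour / 1e9:.1f} billion reads per hour")      # 3.8
print(f"{reads_per_sample / 1e9:.1f} billion reads per sample")  # 1.5
```

Both results agree with the figures stated in the talk, about 3.8 billion reads per hour and about 1.5 billion reads per sample.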
So what can you do with that? Let's dive in a little on the biology and Emma will cover this in a lot more detail, but I just want to touch on a few things. You can give a sense from the picture there looking at differentially expressed neutrophil markers for the non survivors, you can start to differentiate, some of the expression there. Just kind of a little snapshot picture. And again, she'll cover a lot of this.
Same thing on genes with differential splicing. Nothing surprising here, you're seeing genes associated with melioidosis, more immune function and regulatory genes. So, really as expected here, including some of the IG, IGH and BCL genes as well. So really as expected.
Similarly, Gene Ontology, the revealed pathways are associated with immune response and pathogen interactions. So, all making sense. Emma will dive into that, not just here. I imagine she's got quite a bit more analysis to do given how many reads that we were able to generate for this project.
And then taking a quick look at BCL6 and some of the isoforms, we had two here in particular, we looked at a 700 and a 3000 base isoform here. And just looking at transcript usage for this particular one, looking at the group of healthy that yield those yellow boxes up there, you're seeing a different transcript usage there. So again, a lot more to come on this, a lot more detail. Just wanted to touch on it and say that this is something that we were doing and looking at, and these are the kind of slides for me that get into the detail that I'm really looking for. It’s OK, well, so do longer reads enable more and ubiquitous isoform identification? I think so.
When you do the comparison to the run that was done for the original 1000 sample cohort, the Illumina run, it was using 37.5 million paired-end reads per sample. When you compare that to the longer reads, the longer single-ended reads at 37.5 million, you can see the difference in the fraction of isoforms detected. When you bring that up to 75 million, which is actually probably a fairer comparison in terms of total reads, you see an even higher number. And if you go to the saturation curve, looking at 300 million reads, you're getting that 99%, so you kind of bottom line it here.
From 138 billion reads that we produce for the study, we could have sequenced the entire 1000 sample cohort in 36 hours at a depth about 125 million per sample. Probably around 98% saturation. First time we ever did it. You know, really happy to see this. And then a lot more to come along these lines. And there's a lot of buttons to push. Everyone's going to want to push that length out a little bit more here. These are blood samples so they're already a little bit on the shorter side, but very typical. So I think there's a lot of room for us to kind of move that length a little bit further, move the throughput, but those are going to be things that we continue to work on.
So, staying in the same space as the SBX-SLR with Aziz Al'Khafaji's lab. We added another project, and Brian Haas will be speaking about this tomorrow. Aziz is at the meeting in Australia, so couldn't be here, but offered to provide a couple of slides here, to kind of give some of his perspective on this.
Alternative splicing is a key regulatory axis of biology that must be resolved by RNA seq assays. And then transcript splicing modulates protein isoform, translation efficiency, etcetera. And there are just numerous examples of splicing found to be essential in development and as drivers of disease, so, you know, basically driving to that conclusion, splicing variants are deeply under resolved. That's kind of his position on this and why he's really excited. We talk all the time about the kind of things we want to do together in the collaboration.
So, looking at this a little bit more, standard RNA seq is blind to the rich diversity of proteins emanating from the splice forms of these genes, and the question is, can SBX-SLR longer read lengths enable measurement of transcript isoforms at scale? So that's the question. And the study we did for this is using the DepMap cell lines and you get that picture on the right there to kind of show the spread of the different cell lines used here. The overall project again, SBX-SLR measuring differential gene, isoform, and transcript fusion expression across the DepMap cell lines.
Similar to the previous project, around 96 samples, but in this case they had a modified template switch protocol that they developed. These were a bit longer reads than what we did for the previous, and it shows when we look at the yields here. So, again the throughput over 100 billion reads, around 500 in length, and you can see a noticeable shift to the right in the read lengths there. Then the total reads per hour about 2.8 billion, so kind of right in that range of what we'd expect. Again, the first time we ever ran this project.
Just looking a little bit deeper into it, just an example, GAPDH isoforms about 1.3kb, below you have the reference isoforms by the green box there. And SBX-SLR is able to unambiguously identify that with one contiguous read there as an example. But then the shorter reads would obviously have a very hard time doing that.
Similar to the previous presentation, looking at the unambiguous isoforms identified, looking at both full splice matches and identifiable isoforms, 95% saturation was achieved with about 58 million reads, or 51 million reads for the ISMs.
Again, the main point really here is the workflow flexibility. We talk about it all the time. I think it's a very real thing, and I'm looking forward to seeing how people leverage that flexibility, the massive throughput, the ability to reach into longer read lengths as well as occupy the shorter read lengths at massive throughput. Flexible throughput I think is really the advantage here. And then one more statement here. With the 2.8 billion reads per hour, the 96 sample study could be run in three hours at a depth of over 100 million reads per sample.
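The relationship between throughput, sample count, and per-sample depth in that last statement can be sketched as follows. The function names `hours_for_depth` and `depth_after` are hypothetical, and the inputs are the rounded figures from the talk, so the results are approximate.

```python
def hours_for_depth(samples: int, reads_per_sample: float, reads_per_hour: float) -> float:
    """Hours of sequencing needed to reach a target read depth for every sample."""
    if reads_per_hour <= 0:
        raise ValueError("reads_per_hour must be positive")
    return samples * reads_per_sample / reads_per_hour


def depth_after(hours: float, reads_per_hour: float, samples: int) -> float:
    """Reads per sample after a given run time, assuming even pooling."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return hours * reads_per_hour / samples


# Rounded figures from the talk: 96 samples at 2.8 billion reads per hour.
print(f"{hours_for_depth(96, 100e6, 2.8e9):.1f} hours for 100 million reads per sample")
print(f"{depth_after(3, 2.8e9, 96) / 1e6:.1f} million reads per sample after 3 hours")
```

With these rounded inputs, 100 million reads per sample works out to roughly three and a half hours, so the "three hours" quoted in the talk is best read as an approximation. Even pooling across samples is an assumption of the sketch, not something stated in the talk.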
So, we'll see, and I think Brian is obviously going to keep diving into the data and Aziz's team will keep looking at that. We'll keep making adjustments to the chemistry and tune the workflow around. But I think out of the gate, you know, really happy with what we're seeing there.
Now shifting to multi-omics, looking at the Simplex Longer DNA (SBX-SLD) side of things, very similar. I mean, I can't see a whole lot of difference between some of these pictures here. Again, focusing on DNA inputs, 20 to 50 nanograms; I think we can obviously go lower than that, but that's just the range we've tested so far. Same Y adapter ligation, compatible with the linear amplification protocol, so very similar to what we showed before. And just one data slide here, looking at structural variant detection in cancer cell lines.
And again the same hypothesis here, that longer reads can identify more supporting evidence for the structural variants. We note here the true positive criteria for both XOOS, the Roche XOOS suite, and DRAGEN, which was used for the study, and then the read counts for both. So, again, we were actually using fewer reads than the comparison here, really in both cases. The one thing that jumps out on the slide would be the insertion impact here. And this is pretty much as expected based on some of the prior papers with longer read comparisons.
So, shifting a little bit back to the RNA casing slide I showed before in terms of the RNA seq, same thing, same workflow. This is now more for benchmarking, same numbers there we showed before, and just kind of getting an idea with the Genome in a Bottle cell lines: what are we seeing? Is gene expression concordant? The answer was yes. Transcript expression, the same thing, yes. The comparison looks very good, identifying some of the more challenging regions to sequence, those with low mapping quality. With the longer SBX reads, we're able to map better there, whereas with the other technology very few reads were mapping at MAPQ greater than five, and that showed in the TPM difference on the left here between the two technologies.
OK. And there's Chen down at the bottom here. He has a poster. I would encourage people to go and walk by and keep Chen company. Ask him some questions about this. He's going to cover several different topics, but again, looking at somatic SV expression and the measurable expression, perceivable DNA variants for the 1395 cell line.
OK. And then the read count again, about 350 million paired end versus 350 million single end. So again, a very fair comparison here. Then, 32 were missed essentially by Illumina versus 12 here. And, of course, I asked Chen last night. I said, OK, well what are we seeing with the 12? Let's fix that, and of course we're going to go right at it and understand that a little bit more. So again, just data that we just generated very recently. Pretty excited about that.
Now shifting to multi-omics, a little bit more on the multi-omic side. Introducing this idea of SBX duplex with methylation, SBX-DM. And for this, our first attempt here, we're using the Watchmaker Genomics TAPS+ Methyl-seq kit to convert 5mC. So just the high-level overview of SBX-D again, using the hairpin Y adapter structure to make the structure that goes into our linear amplification, and then produces the Xpandomer reads as we've shown and talked about many different ways.
OK, but now if you add in this kit, and again, we just got the kit straight up from them, didn't do any modifications to the kit, and apply this to the SBX chemistry. To give a little background on the TAPS chemistry, it's converting 5mC or 5hmC to T, essentially. There are many key steps, but I think one of the key steps is that reduction step that they were able to optimize. And I can say, I mean, having done really challenging chemistries, they did a great job of pulling this together. 98% direct conversion with low false positive rates. So, I think it's always tricky finding that balance with chemistry, and they did a really good job of that, and it shows in the data.
So thinking about the traditional Methyl-seq, the lower complexity, decreasing sequencing complexity, and the challenges that come with that. Versus the TAPS conversion where you're seeing more of a maintaining of the sequencing complexity with only 1-2% change. And then what does that allow us? It allows us to improve the alignment, and deliver epigenomic and genomic variant information simultaneously.
OK. So what does that look like? The TAPS conversion protocol is shown after the library prep, and again, this is where we would apply any other method. We will of course look at all the other methods and let people decide what they want to use. So in this case, we applied the TAPS chemistry at that point, and then it goes through just the normal workflow after that.
OK. So SBX-DM as we've said can concurrently detect DNA variants and methylation signal from a single library. OK, so that's, you make the one library, you convert it, and you get all that information in one sequencing run. So after the base calling, we do demux, intramolecular consensus, and then do reference free methylation detection using the power of the duplex read. After the methylation status is recorded, the converted bases are reverted for mapping and alignment and for that you can actually just use any mapping tool because you're bioinformatically reverting those back.
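The "detect, record, then revert" logic described above can be illustrated with a toy sketch. This is not the Roche XOOS implementation; it only assumes that TAPS reads a methylated C as T on the strand carrying it, and that the two strands of a duplex read have already been brought into the same orientation and aligned base for base. The function name and inputs are hypothetical.

```python
def detect_and_revert(top, bottom_rc):
    """Toy reference-free methylation call from one duplex read.

    top:       top-strand read after TAPS conversion.
    bottom_rc: bottom-strand read, reverse-complemented into top-strand
               orientation so that position i lines up with top[i].

    Assumed signatures (illustrative, not the vendor's rules):
      top T over bottom_rc C -> converted 5mC on the top strand
      top G over bottom_rc A -> converted 5mC on the bottom strand
    Any other disagreement is left as N.
    """
    if len(top) != len(bottom_rc):
        raise ValueError("strands must be aligned to the same length")
    reverted, meth_top, meth_bottom = [], [], []
    for i, (t, b) in enumerate(zip(top, bottom_rc)):
        if t == b:
            reverted.append(t)          # concordant duplex base
        elif t == "T" and b == "C":
            meth_top.append(i)          # record status, then revert T -> C
            reverted.append("C")
        elif t == "G" and b == "A":
            meth_bottom.append(i)       # record status, then revert A -> G
            reverted.append("G")
        else:
            reverted.append("N")        # discordant for another reason
    return "".join(reverted), meth_top, meth_bottom


# Original duplex ACGTA with a symmetrically methylated CpG at positions 1-2.
sequence, top_sites, bottom_sites = detect_and_revert("ATGTA", "ACATA")
print(sequence, top_sites, bottom_sites)
```

Because the reverted sequence carries ordinary bases again, it can be passed to a standard aligner, which matches the point made in the talk. A genuine C-to-T variant would appear as T on the top strand and A on the other, so it stays concordant and is not mistaken for methylation in this toy model.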
And then you use the Roche XOOS suite for both DNA variant calling and methylation calling status. So really cool, really efficient. And we'll get into a little bit of the benchmarking that we did here.
Multi-Omics - benchmarking SBX-DM with GIAB & Cell lines
OK, so comparing straight up SBX-D versus SBX-DM, looking at the length side, quite comparable. The SBX-DM is a little shorter in this particular comparison, but actually had higher coverage. So OK, and it's not like it was so much shorter that it's a big deal there. We see a little difference in F1 scores for both SNV and indel, and this could be bioinformatics, some tweaks we may want to make, or chemistry, or a little bit of both. But we were actually giddy, I think was the word we used in the meeting, seeing how well we did right off the bat with this comparison.
Similar to what we've shown before, about five billion reads in one hour sequencing, giving us way over obviously 30x coverage. And that's using concordant duplex bases only for the coverage, just to be clear on that.
OK, comparison to the AF values between both are you know, on par. Nothing really surprising. This is looking at the two different cell lines shown below there, so really as expected.
Looking at the methylation status, or the level of methylation that we're seeing against several other technologies. Again pretty much in agreement with what we're seeing with, with other technologies. In particular, looking at TruSeq, the histograms look very, very similar there. So that's good.
And then focusing on methylation patterns. So, on the left looking at, you know, the methylation patterns near transcriptional start. We actually used this as an opportunity to test SBX-SLR, to generate, do the gene expression and then generate the high and low categories of expression there. And then took SBX-DM to assess the methylation levels. And then of course applied that to show that the difference that we're seeing between the low and high actually makes sense. And it does.
And similarly, looking at the methylation across different genomics regions pretty much overlapping there. So all good stuff with that.
Now pulling it all together to kind of leverage the power of all three of these approaches here using SBX-SLD (DNA) for the haplotype phasing. Then SBX-DM for co-detection of DNA and methylation variants, and then SBX-SLR (RNA) for gene expression phasing. So what does that look like? And this was our favorite slide. And I can just tell you we all really love this slide. So, focusing on SBX-SLD here. You see the haplotype phasing using the heterozygous SNPs. You can, it's kind of hard for me to see for the picture here, but you can get an idea of the two, the two haplotypes there as they’re hopefully coming through on your screen there.
And then looking at SBX-DM, you see the methylated haplotypes for haplotype one, and then moving on to haplotype two, you can see the unmethylated G to A SNP. You can see the unmethylated haplotype, and you can see the methylated on both haplotypes. So again, you're getting that with the SBX-DM.
And then lastly, SBX-SLR, you're seeing the allele specific expression of the unmethylated copy there that we point out, as well as the Exon-Intron boundary, so really cool. Again, just brand new stuff. Some of this data was just generated within the last few days. So really exciting and we can't wait to carry this on further.
And on point here. So Mahdi will be presenting at AMP. We're going to take this and amp it up a little bit more as we get close to AMP and he'll be presenting on that. I'm actually going to come back to Boston just to see Mahdi present this. So I'm excited for that.
So again, I wanted to get a little bit of a test looking at some FFPE, buffy coat and tumor normal samples, and get a sense of what that looks like. So we just had five samples of breast cancer, bladder, CRC, for all three sample types and we ran them through just to kind of see what we're seeing here. So we did standard SBX-D, looking at tumor informed MRD in the 60 to 90x range. This is something we've done before showing great data on at ESHG. So we were able to pick five out of five subjects, including a very low TMB sample. So that looks great.
Then looking at SBX-DM, we looked at 30x coverage, and the reason why we did 30x was that, the one time we ran SBX-DM, the yield on a couple of samples was a little bit low. So we decided to actually down sample both to 30x for the comparison, and that's what we did here. And we're able to see four out of five subjects for both, so it wasn't like we saw five out of five in SBX-D. So we saw very similar performance from both at the similar coverage. So we'll work on making sure that we understand if that coverage was just a born drop off or not, but out of the gate, really exciting stuff.
And again the SBX-DM preserves methylation signal as complementary information for MRD detection. And so we wanted to carry that a little bit further looking at the 30x SBX-DM, looking specifically for cancer specific methylation signals. So we started off with the left looking at differentially methylated sites. And to do that we analyzed paired FFPE blood DMS using SBX-DM.
So we got the count number at that, then we intersected that to narrow the population down a bit with the previously identified cancer specific methylation reporters. So it narrowed it down, and then intersected one more time with our cfDNA to get a cancer specific DMS detection in plasma. So again, this was very recent data, but very exciting to see that we could see that, and Mahdi will cover quite a bit more of this at AMP in a few weeks. Some pretty good stuff and again, the complements of the SNP information improve MRD detection, especially in low-TMB is something that we're, you know, really excited to bring both of these things together with the SBX technology.
So, target enrichment. I get a lot of questions about this. I just wanted to cover it here because I think people deserve to understand what we're doing. So this is actually an SBX simplex approach, using a lot of the same Y adapters I've shown before that would then go through a pretty typical Probe Hybridization, PCR and actually we do use PCR pre and post target enrichment for the steps, and then bringing it through Xpandomer sequencing as normal on the sequencer.
But these read lengths are in a range that is just a workhorse range. So, you'll see the number of reads we're able to generate here. But as with most of these target enrichment applications, you're going to have families in clusters that you then use intermolecular consensus to collapse, both with duplex clusters we show on the left, and simplex clusters on the right to get a consensus read for both types. So that's the basic approach and general approach most people use for that.
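As a rough illustration of collapsing read families by intermolecular consensus, the sketch below groups reads by a family key and takes a per-position majority vote. It assumes reads within a family are already aligned and of equal length. The family keys and function names are hypothetical, and production pipelines also weigh base qualities, which this sketch ignores.

```python
from collections import Counter, defaultdict


def consensus(reads):
    """Majority-vote consensus of equal-length, pre-aligned reads."""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads in a family must have equal length")
    bases = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        # Require a strict majority; otherwise emit N.
        bases.append(base if 2 * count > len(reads) else "N")
    return "".join(bases)


def collapse_families(tagged_reads):
    """Group (family_key, sequence) pairs and collapse each family."""
    families = defaultdict(list)
    for key, sequence in tagged_reads:
        families[key].append(sequence)
    return {key: consensus(seqs) for key, seqs in families.items()}


reads = [("fam1", "ACGT"), ("fam1", "ACGT"), ("fam1", "ACTT"), ("fam2", "GGCA")]
print(collapse_families(reads))
```

The same vote works for simplex and duplex clusters; the duplex case would additionally require agreement between the two strands.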
And then focusing on the right hand there, from two nanograms of cfDNA using the KAPA HyperExome V2 kits, we generated, we would generate in four hours, we'd be able to do about 48 samples at 340x unique coverage.
And the Phred scores on that would be about 45 to 50 for all clusters versus duplex clusters. So quite respectable there. And we're looking at over 184 billion reads in 24 hours. If you did the six runs as I indicate there, or roughly 288 samples in 24 hours at 340x. So, if you were more interested in germline 30x, you'd be almost 3000 you'd be able to do in 24 hours. And the actual read counts is probably a quarter trillion and I had to get that in just so I could say the word trillion! So just I'm just giving that one up.
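The throughput figures above follow from simple arithmetic, sketched here under the assumption of back-to-back four-hour runs and strictly linear scaling of sample count with coverage. The helper names are hypothetical. The Phred conversion uses the standard definition Q = -10 log10(p).

```python
def samples_per_day(samples_per_run, run_hours, hours=24):
    """Samples processed in `hours`, assuming back-to-back runs."""
    return (hours // run_hours) * samples_per_run


def rescale_by_coverage(samples, coverage_from, coverage_to):
    """Naive linear rescaling of sample count when coverage changes."""
    return samples * coverage_from / coverage_to


def phred_to_error(q):
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)


daily = samples_per_day(48, 4)                   # six runs of 48 samples
germline = rescale_by_coverage(daily, 340, 30)   # 340x -> 30x, linear assumption
print(daily, round(germline))
print(phred_to_error(45), phred_to_error(50))
```

Six runs of 48 samples give the 288 quoted in the talk. The linear rescaling to 30x comes out at a few thousand samples per day, the same range as the "almost 3000" mentioned, and any gap would come from overheads this sketch does not model. Phred 45 to 50 corresponds to roughly one error in 30,000 to one in 100,000 bases.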
Which I'm very excited to keep pushing towards that kind of number of throughput. So anyway, that's target enrichment. I'm really happy to get this into some hands with early access, and see where you go on that.
So just a little jaunt to FFPE and Jagdeesh covered this in his as well, but we do FFPE with a little DNA repair step, and we’re finding that that really helps with the quality of our sequencing there. And then run it through the process very similar to how we’ve covered before. In this particular study we did 18 matched tumor normal samples, across a range of qualities, and what we saw, and the experimental details are on the right, but essentially, normalized 100 nanograms for each, across both technologies, only greater than 70x. They’re all pretty well matched in terms of the coverage, so there was no advantage there either direction, so around 70x were shown, and what we see is really good per base accuracy, really good error rate by substitution type, and really good homopolymer accuracy for the blue SBX, so really happy with that.
This is a poster, and Mahdi will go into more details looking at concordance again against Illumina. Basically the take home is, highly concordant. He'll go through this in more detail to go through the examples if you visit him at the poster, he’s quite exceptional. So I would recommend people go and sit and talk with Mahdi as well.
OK, so MRD, I showed a version of this at ESHG. We've just added more samples to the data set here. It's actually a 96 sample set, so expanded a little bit. Maybe a little bit more bottom of the tube samples here. So, a little bit more challenging, but carrying this through an MRD, SBX-D workflow, and again, 96 samples.
What we saw here was we were able to detect 41 out of 47 MRD samples called correctly. I think this is actually quite an impressive outcome here for this result. And I think it'll be a lot more of this coming up in the coming months. And Kendall there on the right, she'll be covering this in her poster, so I invite people to go see Kendall. Kendall's a key member of our biochem team and does a lot of the things that make us able to make these Xpandomers. So, I encourage people to go visit Kendall. She'll do an overview of the technology there as well and answer questions.
OK, the big finish and we'll hand off to Katie soon after this. So, doing SBX-Fast is essentially a PCR amplification, linear amplification-free workflow, as we say here, running through the SBX-D protocol. And we've talked about this at several meetings already, but essentially the idea is, how can you quickly go through and get a single genome or a trio genome? And so just a little bit of a snapshot here. We've done quite a bit of work on different samples over many months and essentially were able to identify a number of different types of variants, in a number of different samples. This is previous data, we added a bunch more. All of them we’re able to identify correctly.
OK. So, the big number, so we now have sequenced a genome from sample, and this is an HG002 sample from DNA sample through to VCF in three hours and 59 minutes. And we've done this many, many times as Katie will show. So really, really excited. And we're not just doing this as a vanity thing. There's a lot of really good reasons to want to be able to do this as Katie will cover, and so really exciting to be able to see this type of result, and the impact that's going to have.
And I think there's a lot of other things, again, focusing on the flexibility of SBX that we're going to be able to do. This is a project that we started talking about last summer, working on in earnest in November, and the teams really came together. It's a fantastic group of people that we work with to make this result, to be able to demonstrate this. And so, I think pretty exciting there. So, three hours and 59 minutes.
This kind of breaks down a little bit of the processing steps there, so you can get an idea of the timing and Jagdeesh will have a slide, a poster that he'll go through and talk about some of the SBX-Fast work there. And as I mentioned before, throughout, we’ve got a nice, a great presentation tomorrow. I encourage everybody to go to look at some of this SBX-SLR work. Both Brian and Emma will be covering, and Yutaka Suzuki will also be covering some of the spatial work. And I think it's going to be a great demonstration of the things you can do with SBX. I encourage people to go to that.
And so, the last thing here. So, I can remember when I got excited by seeing a single X-NTP extend off the end of a primer. And I actually, you know, that was 2014. So, it was seven years just to get to that point. And so, I think the one last thing I’d like to finish on, the message is, essentially this is pretty hard work that we do here. And you know, everybody has their fears, fear of failure, fear of not getting money, not being able to fund your work or run your projects. And I think the thing I've learned over all the years of what we've done for SBX is you've got to be able to turn that into creativity and find the grit to keep going.
Because I can think about so many different reasons why it would have been so much easier just to kind of not try and solve all these problems. But I've had the great fortune to work with wonderful people, and we figured things out, and we've carried on, and we've pushed forward to keep inventing and using that energy to drive things forward. And I think that’s the message is, we've got a lot of challenges in front of us for all of our work that we have to do. And I think we have to persist and keep pushing through and keep innovating in all the work we're doing. And I hope that SBX can help contribute to that and help people with their projects and move sequencing into a new era. And I believe firmly that we owe that to the people whose shoulders that we are standing on now, the brilliant people that we have had the opportunity to learn from, as well as the people who are the next generation, who we are modelling and showing the way for.
So again, I'm really excited about being able to be here, the things we've been able to do and just to be able to talk to the team here. So, with that, I thank you very much.
Presentation by Mark Kokoris at ASHG 2025: "Advances in sequencing by expansion (SBX). Multiomics, methylation mapping, oncology research and building a sustainable framework for ultra-rapid genome sequencing."
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
Thank you for having me. I'm totally honored. I guess the only reason I'm here is to demonstrate that this technology has spread, finally, to the other side of the planet. I'll talk about the application of SBX to spatial analysis. There are some nuances, but this is mostly for Japanese researchers, to make sure they utilize our core facility.
I'm from the core facility at the University of Tokyo. These days we are doing spatial analysis, mainly using Xenium as a platform. To my understanding, we have experienced a rapid shift from sequence-based spatial analysis to hybridization-based methods. This is a kind of photo album. We started with Visium, and we spent two or three quite happy years with it.
And that, as an imaging approach, is quite powerful for identifying cells and their interactions, such as cancer cell and immune cell interactions, and the data is free from the patients' sequence information. So we find it easier to use for data sharing. The procedure is also quite robust, so we could start a cancer genome cataloging project.
The same thing may be starting in the US, the UK and many other countries, but we are finally analyzing cancers in collaboration with Australian, Thai, Taiwanese and Korean colleagues. So anyway, that's a good thing. But we have also started to learn the limits of Xenium analysis, that is, of hybridization-based spatial analysis.
It's good for confirming, in a sense, what's happening in already known networks or gene expression, in the spatial context. In other words, we are just putting the pieces into the spatial layer. So it's not always easy to say anything new about totally unknown things, like novel cancers,
by drawing data-driven networks and landscape-type things. The sequence-based approach has its own advantage: it's sequence-based, after all, so it's unbiased. If we can do whole-transcriptome sequencing, that should include TCR and BCR sequencing, and, as we heard, splice variation and allelic expression may be captured. At the same time, in addition to human transcripts, we may be able to analyze bacterial and other pathogen information.
And so it's applicable to other organisms, where probe design is not always possible. But one of the highest barriers to this sequence-based application is that sequencing platforms have not been sufficient in terms of sequencing depth and read length. This is a version of a saturation curve, showing how many genes are represented in a single cell.
They recommended the use of, at minimum, 200 or 300 million reads. But obviously we tried to increase the sequencing depth, up to five... 0.5 billion... okay, so anyway, five-fold. And we found that the sequencing is coming to saturation at this sequencing depth.
So that's about ten times more than 10x Genomics recommends. So first we tried to make better use of the sequencing platform, starting from the Xenium, actually, simply because the Xenium is our own main platform these days and is commercially available.
We started with the sequencing analysis of this HBV-induced liver cancer. Before going into this: the slice contains 200,000 cells of liver cancer. Instead of using the fragmented library, the final product that we usually use for spatial analysis, we tried to make use of the first amplicon, representing the full-length cDNA, even though its read length is just, say, 1 kb or slightly less than 1 kb, because of the limit of amplicon amplification in the standard Visium protocol.
We first checked whether the expression levels are precisely represented in the long library. Comparing the short-read and long-read profiles, the consistency was almost perfect. So that's that. And this is the distribution of the human transcripts. And the KB and the PB are here.
This is liver, and we also found that in the same library we could detect virus transcripts. They are preferentially found in zone three, and they are almost full length. So that's showing that in HBV only 15 predominantly are this region. But when we looked at some specific genes, or minor expressed genes, in single cells,
it's kind of unexpected, kind of expected: the coverage seems to be too shallow, even with a sequencing depth of about 500 million. We really needed a better sequencing platform. Then I was introduced to SBX, which was launching as a new sequencing platform, and we sent out the libraries.
First a Visium library, the probe-based one, and then 3' libraries, a couple of libraries, went to Roche. And the data was returned. And from that we heard that's not so data DNA sequences were included there. They said everything was done in Seattle. And each library generated 15 billion reads.
So that's 100 times, or even more than 100 times, higher than the usual number of Illumina reads expected for a 10x analysis. We compared the spatial distribution. This is the Illumina 5 billion reads, which, as I said, is quite a big number. And this is the downsampled 5 billion reads from Roche. The results are comparable.
This is the real one, at 5 billion reads, in comparison with, say, 12 billion reads. For the aggregated bulk coverage, the consistency was almost 100%. That's great. And this is the sequencing saturation. Of course, the sequencing saturation levels depended on the complexity of the library.
But finally it reached a plateau after, say, 6 or 8 billion reads. So that deep a sequencing depth is needed for the analysis. At the same time, we also checked the sequencing fidelity, and I found that it is comparable to Illumina: 99.8% for Illumina and 99.7% for SBX. Slightly worse, but they are almost at the same level.
Then there's the 3' end library, with the same results. We also had to sequence the adapters, so the read length was slightly shorter than 1 kb. But anyway, essentially the same results in terms of sequencing quality and sequencing reads. We compared the results between Illumina and SBX, and the spatial distributions are almost the same: all spatial expression analysis could be done using SBX as well.
And we reached the same conclusion as the previous speaker. The first barrier we encountered was that we had to deal with billions of reads. There we ran into software errors. And we found that Minimap2 was the only reasonable choice.
Giraffe is quicker, but it could not separate internal exons in a precise manner. Anyway, we could do the initial mapping for 10 billion reads or even more by putting all our computing resources into this analysis for a couple of weeks. And this is a result for the HRAS splice variants.
Just note that HRAS is a relatively short messenger RNA, 1.2 kb in length, and there are several splicing variants, as the previous speaker mentioned. This is the number one variant, and the others are there. We could start with 360,000 reads per splicing variant, which is a big number.
Usually we had to start with a 1,000 times smaller number of reads, so it's likely that we would miss those minor transcripts. Anyway, we could map the spatial location of each transcript onto the spatial layer.
I'm not sure whether there's any biological or clinical relevance in this detection of the splicing variants. But anyway, we can now analyze these kinds of splicing variants in cancer cells. And this is an alternative promoter case, in the same gene, and we could detect almost 1 million reads for this gene.
And there is another alternative splice variant, which represents a minor transcript. If we had to start with a library three orders of magnitude smaller, we may well have missed this transcript. The spatial mapping showed up like this.
Another thing, obviously as expected, is that we were able to detect transcripts. And that we were able to detect best price and resource and piece. For example, we did whole genome sequencing as well. G was the dominant allele when we looked at the genomic-level mutation, and in some cells, not a large number of cells, the minor allele was dominantly transcribed, in a certain layer.
The same is true for SNVs in the 5' UTR and 3' UTR, and also for some deletions. There are some limitations: we have to focus on relatively short messenger RNAs. But still, we were able to detect
those kinds of mutations. Perhaps one of the biggest issues is the TCR and the BCR: the B-cell receptor, or immunoglobulin, and the T-cell receptors. The good thing is that the messenger RNA length is just 1.1 kb or somewhere around there. This is the TCR. This is the BCR.
In the same library, we looked for the TCR and BCR sequences, and we found that we could detect the exact sequences from quite a large number of cells. This is, say, 12,000 TCR and BCR.
And this is the other the cell at the Caso, the two genes. These are the TCR and BCR, and the distribution of TCR alpha and beta, and the immunoglobulins IgG and IgA, and the most frequent clonotypes showed up like this.
We found there are still lots of caveats like this. We examined Visium HD and found that cell segmentation is still problematic. I don't like this, because they don't make use of cell surface markers for precise cell segmentation. That's one thing.
Another thing is that when we go to billions or tens of billions of reads, we detect some RNA leaks, so-called ambient RNA, which was not a problem when we were, you know, scratching in the playground, in the shallow end of the field. So that's a problem. But anyway, we could detect these T-cell and B-cell things, and so the TCR and BCR is a kind of
usual, expected thing. But when we try to go further into details, like the transcription factors in immune cells, minor cells, we found that the sequencing complexity itself is not enough to do this ultra-deep sequencing. Say this is the Visium library and this is a 3' Visium library, the sequence-based and probe-based.
And we compare the complexity and the found. That's the Visium and the three three ply makes this difference ten times more complexity than even, from Visium. And when we think about the complexity of the library and the perhaps the best performing platform procedure is the Chromium class single cell cell dissociation based ones. And the actually, we asked for the additional sequencing to Roche to do the usual conditional, 12 hold on cell Chromium blood cell library.
And we I was obviously almost lost are we were able to detect the TCR BCR things within the library and we our labs are rather interested in the sequence library complexity. And we found this the source complexity was this is the level of the three prime and ten times higher. Right. And the each cell representing say one 1500 or the 2100 genes.
So ten times higher. So and the correlate we started with the 10x Chromium because that's the easiest platform. And the currently we learn that the five temp medium level, medium cell, millions of cells level in a single cell isolation is possible for on the cell. It's the association based like a process. And the has scale bigrams or things.
And, we and this is, from the usual secret single cell sequencing library that they, we can, you know, we can expect. So, the style that's, of southern gene level folder complexity parcel. I mean, so, there are similar plot Tom was still sticking to, I can say, sticking to the cell dissociation. Perhaps the representative one is there, Curio Seeker just by Takara Japanese company.
I want to request that they do the sequencing for the Chromium-based ones and the cell dissociation-based ones. And the good thing is that spatial ATAC-seq is enabled for this Curio Seeker platform, so we can expect both RNA sequencing and ATAC-seq at the ultra-deep sequencing level.
So I'm sorry for the messy way of the speaking, but the we are so much excited about the launch of the book. They're looking at the initial and or that batch of the data produced by, SBX Roche, and everything is so new and there are lots of things, we have to do with silk.
And unlike the giant places of the previous speakers, Broad or Sanger, my place is, you know, mid-sized, but can be more flexible. And my place is open for any kind of international collaboration. So this is my last slide. First of all, I would like to thank all the staff from Roche for their dedicated support.
It. Yeah I'm not at the position to say thank you. And we I believe I hope at least they are feeling like we are already a team, right. Heading for the same direction and sail and the internet without the you know. Yeah. I really appreciate their support and having a, you know, allow me to better access to the Roche and I think that's it from my talk.
And if any of you are interested in any parts of my presentation, please feel free to contact me at this email address. So that is my talk. I'll stop here, and I'm happy to take any questions. Thank you very much for your kind attention.
Presentation by Yutaka Suzuki (University of Tokyo) at ASHG 2025: "SBX technology for single cell RNA and spatial analysis."
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
All right. Thanks. First, I want to thank Roche for giving us the opportunity to be one of the first to look at this exciting technology for transcriptomes. It's just incredible. So I've worked in Broad Clinical Labs, and there we leverage sequencing technologies for biomedical applications. This includes cancer research and cancer clinical diagnostics.
And for this, we find that RNA-seq is an incredibly powerful tool. It gives us a variety of different genetic and functional readouts, and we decided to sequence bulk transcriptome data from cancer cell lines as a way of exploring SBX RNA-seq. We worked in collaboration with the DepMap group at the Broad Institute.
We selected 96 different cell lines, and we prepped the cDNAs and targeted the instrument for sequencing. We copied the cDNA into the Xpandomers and then sequenced the Xpandomers through the pores, reading out the reporters attached to each of the Xpandomer nucleotides. One of the things I'll mention here is that during the library prep, we had a step where we had to ligate SBX-specific Y-adapters to the cDNA prior to doing this linear amplification step, and then do the Xpandomer synthesis and the sequencing.
And the reason why this is important is because we're actually sequencing both strands of the cDNA, and we're sequencing it from the very ends of the molecules. All together, we sequenced 102 billion reads. This is about 1 billion reads per cell line, and it was done in the course of 36 hours. That is 2.8 billion reads per hour, which is just pretty amazing.
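As a quick check on the throughput figures quoted above, the arithmetic can be sketched in a few lines of Python. The only inputs are the three numbers stated in the talk (102 billion reads, 96 cell lines, 36 hours); the variable names are illustrative.

```python
# Figures quoted in the talk.
total_reads = 102e9   # reads across the whole run
cell_lines = 96       # cell lines sequenced
run_hours = 36        # duration of the run in hours

# Derived rates.
reads_per_cell_line = total_reads / cell_lines
reads_per_hour = total_reads / run_hours

print(f"{reads_per_cell_line / 1e9:.2f} billion reads per cell line")
print(f"{reads_per_hour / 1e9:.2f} billion reads per hour")
```

Both derived values agree with the rounded figures given in the talk: roughly 1 billion reads per cell line and roughly 2.8 billion reads per hour.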
96% of the reads aligned to the genome quite well. Shown at the top here is the distribution of alignment lengths across all 96 cell lines. You can see that they're very consistent. The mean and the median alignment length was around 500 base pairs. And you can see from this distribution that most of the reads we're getting can be considered medium-length reads.
There are quite a few reads that we'd consider going into longer-read territory, so more than 1 kb. 4% of them were more than 1 kb, peaking out at around 2 kb. Now, all these reads are useful. You could use them all for gene expression analysis, but the longer reads are super useful for doing isoform-specific expression and also annotation, and we'll see how we use those.
One thing to mention here is that we made this decision to sequence, initially, from the ends of the molecule. As you can see down below, the coverage distribution reflects that. For the shorter to medium-range transcripts, we're getting really good coverage across the entire molecule. But for the really long transcripts, we're really just going to see sequencing coverage towards the ends, sort of bleeding into the center.
It still has some impacts, as we'll see. As far as the read accuracy is concerned, these are simplex reads, so the expectation is that they're going to be around Q20, or a little bit higher than Q20. Here we're looking at just the alignment differences. This is not sequencing error alone: the alignment differences are a combination of sequencing error and the actual true variation that we see in these cancer cell lines.
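For readers less familiar with the Q notation, the standard Phred relationship between a quality score and a per-base error probability is Q = -10 log10(p). The sketch below applies it to the Q20 and 500-base figures mentioned in the talk; the function names are illustrative, and the expected-mismatch estimate assumes independent errors, which is a simplification.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


# Q20 corresponds to a 1-in-100 chance of error per base.
q20_error = phred_to_error_prob(20)

# Rough expectation for a 500-base alignment at Q20, assuming independent errors.
expected_errors_500 = 500 * q20_error

print(q20_error, expected_errors_500)
```

Observed alignment differences would sit above this sequencing-error baseline wherever the cell line carries true variants relative to the reference.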
And it's consistent with the reads being Q20 or a little bit higher. The first application that we examined for RNA-seq using SBX data was looking at gene expression and isoform identification. Here's just a quick bird's-eye view of a region of the genome, where we have the SBX-SLR reads.
They're aligned to the genome, with the reference isoform structures below, and the data are strand specific. So the reads that are colored pink are on the top strand, and the reads that are colored blue are on the bottom strand. Now, the data itself isn't generated as strand specific. We don't have a strand-specific library prep, but bioinformatically
we can actually convert it to strand specific, based on looking at the adapter sequences in the reads that we're sequencing. So if we notice a cDNA adapter that corresponds to the bottom strand, then we can recognize that, reverse complement the read, and make it a top-strand read. That's how we get strand specificity. And that's actually very effective: it's over 99% effective.
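The adapter-based strand assignment described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's pipeline: the adapter sequence `TOP_ADAPTER` is made up, the model assumes a single known adapter at the 5' end of top-strand reads, matching is exact, and the `window` parameter is hypothetical. Real adapter detection would tolerate mismatches.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Hypothetical adapter sequence, used for illustration only.
TOP_ADAPTER = "GATTACAGGCTT"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_read(read: str, top_adapter: str = TOP_ADAPTER, window: int = 30):
    """Return (read in top-strand orientation, original strand).

    A top-strand read is assumed to start with the adapter. A bottom-strand
    read is then expected to end with the adapter's reverse complement.
    Reads with no recognizable adapter are returned unchanged with strand None.
    """
    bottom_signature = reverse_complement(top_adapter)
    if top_adapter in read[:window]:
        return read, "+"
    if bottom_signature in read[-window:]:
        return reverse_complement(read), "-"
    return read, None


# Usage: a bottom-strand read is flipped back to the top-strand orientation.
top_read = TOP_ADAPTER + "ATGGCCAAATTT"
bottom_read = reverse_complement(top_read)
print(orient_read(top_read))
print(orient_read(bottom_read))
```

Reads for which neither signature is found are left unassigned rather than guessed, which is one simple way to keep the strand calls conservative.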
In this work, now that we have the reads aligned to the genome, we can assign reads to genes and get gene expression values. That's shown here for all 96 cell lines. We clustered the cell lines according to gene expression correlation, and we see at the very top here how the cell lines are clustering, mostly according to the different lineages and cancer types.
Which is reassuring. We do have biological replicates included, and we have high correlation in gene expression for the biological replicates. There's another pair of cell lines where one cell line was historically derived from the other, and we find that the expression values there are also very highly correlated. Now, back to the bird's-eye view of the genome.
Looking at the alignments, we can see that there's evidence for intron and exon structures. And when we compare that to the reference annotations, if we zoom in on one of our favorite genes here, which is the housekeeping gene GAPDH, we can see the reads actually do look quite nice. Most of the reads shown here are full length.
You can match up the reads with the reference isoform structures below, of which there are about a dozen different isoforms. At this locus, we're looking at almost a million aligned reads. So I'm really only showing you the top 50 or so reads out of that million. We're really just looking at the very top of this coverage landscape.
Now, just for comparison, here we have Illumina data from a publicly available data set, and you see the reads are a lot shorter. One of the key challenges when we're doing isoform-specific analysis is being able to unambiguously assign individual reads to individual isoforms. And that can be really challenging when the reads are really short.
Because you have a lot of read-mapping uncertainty there. But in this case, we have plenty of reads that are full length and can be unambiguously mapped to the isoform from which they're derived. As one example, in this one cell line that I'm showing you, there's actually a handful of different reference isoforms for which we find full-length reads. The one at the very top here is the most highly expressed.
It accounts for more than 98% of the expression of GAPDH in this cell line, and we find over 63,000 full-length reads. For the others, we find fewer reads. The next one has about 3,000 full-length reads. And then for the one at the very bottom here, we actually only find one full-length read, and that's out of a total of 742 million reads
for the sample. So it's quite rare. To get some more insight into how we can use these reads for resolving isoform structures, we assigned each read a structural category based on comparing it to the reference annotations. This just shows the different SQANTI-based categories, with the reference isoform on the top. We have the category of full splice match, which means that the read aligns
and has the full splicing pattern that matches up perfectly with the reference transcript. We can have partial matches, where not all the splicing patterns are there, but they agree; there are no conflicts. And then we have the novel categories: novel in catalog and novel not in catalog. Novel in catalog is just using the existing splice sites, but in different combinations to give you novel products, and novel not in catalog is providing new, novel splicing.
One of the first things we looked at... oh, I have another category for you that is a useful one: isoform identifiable. These are basically partial reads, but they're long enough to be unambiguously assigned to a specific isoform. One of the first things we looked at was saturation of unambiguously identifiable isoforms.
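The structural categories described above can be illustrated with a simplified classifier that compares a read's splice-junction chain to the reference isoforms. This is a sketch in the spirit of the SQANTI categories, not the SQANTI implementation: it ignores transcript ends, strand, and unspliced special cases, and every name in it (`classify_read`, the category labels, the toy coordinates) is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Junction = Tuple[int, int]      # (donor position, acceptor position)
Chain = Tuple[Junction, ...]    # ordered splice junctions of a read or isoform


def is_contiguous_subchain(sub: Chain, full: Chain) -> bool:
    """True if `sub` occurs as a consecutive run of junctions inside `full`."""
    n = len(sub)
    return any(full[i:i + n] == sub for i in range(len(full) - n + 1))


def classify_read(read: Chain, reference: Dict[str, Chain]) -> Tuple[str, List[str]]:
    """Assign a simplified structural category and list compatible isoforms."""
    if not read:
        return "unspliced", []
    # Full splice match: the junction chain equals a reference chain.
    full = [name for name, chain in reference.items() if chain == read]
    if full:
        return "full_splice_match", full
    # Partial match: consistent with a reference chain but missing junctions.
    partial = [name for name, chain in reference.items()
               if is_contiguous_subchain(read, chain)]
    if len(partial) == 1:
        return "isoform_identifiable", partial
    if partial:
        return "incomplete_splice_match", partial
    # Novel categories: known splice sites in new combinations, or new sites.
    donors = {d for chain in reference.values() for d, _ in chain}
    acceptors = {a for chain in reference.values() for _, a in chain}
    if all(d in donors and a in acceptors for d, a in read):
        return "novel_in_catalog", []
    return "novel_not_in_catalog", []


# Toy reference with two isoforms that differ in one acceptor site.
REFERENCE = {
    "iso1": ((100, 200), (300, 400), (500, 600)),
    "iso2": ((100, 200), (300, 450), (500, 600)),
}

print(classify_read(((300, 450), (500, 600)), REFERENCE))
```

In this toy example a partial read becomes isoform identifiable only when it covers the junction that distinguishes the two isoforms, which mirrors the point that longer partial reads are more likely to be assignable.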
And this is as a function of sequencing depth. We can see that at around 50 million reads, we reach around 95% saturation of identifiable isoforms, and you have diminishing returns going out to 100 million reads. Now, the saturation is going to be a result of not just the sequencing depth; it's also going to be impacted by the length of the transcript and how highly expressed it is.
So if a transcript is on the shorter side and it's highly expressed, you have a better chance of getting a full splice match. But if it's longer and less expressed, you may not get a full splice match, but you might get isoform-identifiable reads. So that's sort of the next best thing. We looked for novel isoforms, and we were actually able to discover novel isoforms.
For each of these 96 cell lines, we assembled the transcripts de novo, using a tool that we developed called FLAIR. We compared those isoform structures to GENCODE in order to get these same kinds of classifications. What I'm showing here is a plot of the number of cell lines that had evidence of each isoform being independently reconstructed.
Altogether, we have well over 100,000 different isoform structures that we were able to assemble, but few of them are considered core isoforms. We have approximately 2,700 known isoforms that were found in all 96 of the cell lines. We only find fewer than 100 of each of these novel types in all 96 cell lines.
I just want to show you a couple of examples of the kinds of novel isoforms that we're finding from the SBX data. This first example is for TMEM14A, and in one cell line, which is one of my favorite cell lines, we find three different isoforms that are reconstructed at the top.
The red regions correspond to the coding regions of these structures. At the bottom you can see we've got plenty of full-length SBX reads and the known isoforms. We have a core known isoform on top, which represents over 90% of the expression for this gene at this locus in the cell line.
We do have a novel one that we found that is also core, found in all 96, and it's expressed at 5.8%. Notice the differences here at the five prime end. We actually have two different novel isoforms; one is core and one is not core. They differ at the five prime end, and these differences actually impact the coding regions.
So we actually have different N-termini in these protein sequences. Another example is this YRDC gene. It's a very similar story here. This gene is actually on the opposite strand, so your five prime is now going to be over on this side, and your three prime is going to be over on this side. But you see we have another three isoforms, and they differ here at the five prime end.
We have many reads that appear to be full-length SBX reads to support them. The big difference here is that the dominant form in this case actually turns out to be one of the novel core isoforms. The known core form is only found at around 18%, versus the 74%. So those are kind of interesting.
Now, it's nice to have full-length reads. We love our full-length reads. But you don't necessarily need to have full-length reads in order to do isoform profiling. If we have isoform-identifiable reads, that's long enough. And here are some cases where you have really long transcripts; these transcripts are around 5.7 kb.
So there's no chance we're going to find a full-length SBX read for this. But we do find isoform-identifiable reads for these isoforms. I've got a couple of examples, and both are really relevant to cancer biology. This one example corresponds to the integrin ITGA6. There are two isoforms.
One of those isoforms is considered an oncogenic isoform that promotes tumor growth. You can see that it has an exon here that is actually skipped. In the other isoform, which you can make out in yellow, this exon is included, and adding this exon incorporates a stop codon, which actually truncates the C-terminus.
And if we look at the different cell lines and which isoforms are expressed, we find an interesting pattern of different isoforms having dominant expression across different cell lines. In this case, most of the cell lines that are expressing this gene are actually expressing the oncogenic version. Another example has to do with CD44, which is a transmembrane protein.
In this case, you have this yellow region. There's a series of cassette exons that can be differentially spliced, and depending upon which exons are selected, they impact the stem region of the extracellular structure of this protein. Here there are three different forms. The top isoform actually skips over the yellow region, producing this proto form.
And we actually have full-length reads for this one. For the others we don't; we have partial reads. But again, that's good enough to do our expression profiling. We can find cell lines where one is really dominant, and the other ones are dominant in other cell lines. Just kind of interesting. Another application that we're very interested in is fusion transcript detection.
And that's highly relevant to cancer. We can infer structural rearrangements based on that. There are fusion transcripts in cancer that are drivers of cancer, and there are very good examples of this. Usually it's chromosome rearrangements that happen in tumors that generate these. One of the best examples involves a translocation between the bottom arms of chromosome 22 and chromosome 9,
which generates these chimeric chromosomes and puts the BCR gene fused with the ABL1 gene. This actually causes 95% of chronic myelogenous leukemia. So this is the hallmark fusion. It's the best-known fusion that we have for cancer. You can detect it through whole genome sequencing, but it's much easier to detect it through transcriptome sequencing.
It's cheaper, and you actually get at the functional products that are being generated. A lot of these fusion transcripts are relevant to cancer: the COSMIC database has over 300 of them. Some of them are hallmarks of disease. Others you find at lower frequencies. In some cases, they're treatable. So if you have a kinase fusion, you can actually treat it with kinase inhibitors,
and it can improve the patient outcome. So it's important to identify these. We developed a tool, published earlier this year, called CTAT-LR-fusion. It uses long reads to identify fusion transcripts. There are a couple of different phases. In the first phase, we identify fusion candidates by finding chimeric reads, where part of the read aligns to one gene and part of it aligns to a different gene, which could be on a different chromosome.
Once we have those fusion gene candidates, we model them as fusion contigs: we essentially take the stretches of DNA sequence corresponding to the genes, make them into a contig with the genes in the right order, and then realign the reads to it in order to get our breakpoint information and expression information.
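The first phase described above, finding chimeric reads whose parts align to different genes, can be sketched in a few lines. This is an illustrative toy and not the CTAT-LR-fusion implementation: the `Segment` record, the `fusion_candidates` function, and the `min_reads` threshold are all hypothetical, and the sketch assumes each read has already been split into aligned segments annotated with a gene and oriented 5' to 3' along the transcript.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Segment(NamedTuple):
    """One aligned piece of a read (hypothetical record for illustration)."""
    read_id: str
    gene: str
    read_start: int   # start coordinate on the read
    read_end: int     # end coordinate on the read


def fusion_candidates(segments: Iterable[Segment],
                      min_reads: int = 2) -> Dict[Tuple[str, str], int]:
    """Count chimeric reads per ordered gene pair (5' partner, 3' partner)."""
    by_read = defaultdict(list)
    for seg in segments:
        by_read[seg.read_id].append(seg)

    support = Counter()
    for segs in by_read.values():
        # Order segments along the read, then collapse runs of the same gene.
        segs.sort(key=lambda s: s.read_start)
        genes = []
        for s in segs:
            if not genes or genes[-1] != s.gene:
                genes.append(s.gene)
        # Each adjacent pair of different genes is one candidate junction;
        # the set ensures a read supports a given pair at most once.
        for pair in set(zip(genes, genes[1:])):
            support[pair] += 1
    return {pair: n for pair, n in support.items() if n >= min_reads}


# Toy data: two reads support BCR->ABL1, one supports the reciprocal fusion.
SEGMENTS = [
    Segment("r1", "BCR", 0, 300), Segment("r1", "ABL1", 300, 700),
    Segment("r2", "BCR", 0, 250), Segment("r2", "ABL1", 250, 600),
    Segment("r3", "ABL1", 0, 200), Segment("r3", "BCR", 200, 500),
    Segment("r4", "GAPDH", 0, 500),
]

print(fusion_candidates(SEGMENTS))
```

Candidates that pass such a filter would then go to the second phase described in the talk, where reads are realigned to a fusion contig to refine breakpoints and expression estimates.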
So we took this and adapted it to work with the SBX reads. It didn't take too much work, but the challenge is working with like a billion reads instead of the 10 million reads we're usually working with. We screened the cancer cell lines we're working with for these COSMIC fusions. We have a list of known COSMIC fusions that were made available by the DepMap group.
They're based on Illumina data. Shown here on the left are the ones that we were actually able to find using CTAT-LR-fusion with the SBX reads. We found most of them, and the expression levels compared between SBX and Illumina for these fusions were significantly correlated. But there were some key differences here that we wanted to examine further.
And there were a few that we didn't find. That's not a fault of SBX in any way; it's because of our choice to initially sequence from just the ends of the cDNA, so we're really limited in terms of what kind of coverage we're getting from the transcripts that we're sequencing. I'll just highlight one of the examples that we have from the COSMIC fusion search in these cell lines.
It's the one that I showed earlier as an example, the BCR-ABL1 fusion. We found this in the MEG-01 cell line. Here we have the fusion contig with the BCR gene and the ABL1 gene put together, and here we have the SBX reads, and I'm pointing out where the breakpoints are supported by the SBX reads.
Not only do we find the BCR-ABL1 fusion, we actually find the reciprocal fusion from the reciprocal translocation, where we have the ABL1 gene and the BCR gene. You can see the SBX evidence for that. We've got reads here with a breakpoint here, splicing into the BCR gene over here. So when you have both, you can basically line them up on top of each other and figure out where on the genome the translocation might actually have occurred.
And here we can sort of narrow it down to the regions between where the splicing breakpoints were. In addition to the COSMIC fusions, there are a few cell lines included that have been very well studied for fusions. We actually surveyed over 20 different tools for Illumina RNA-seq
for finding these fusions. They're validated fusions, and these are the few cell lines where we find them. We targeted SBX for that. Here we have the SBX findings versus the 20 or so other programs based on Illumina data. The good news is that we're finding most of them. There was one that was particularly concerning here, LAMP1-MCF2L. You can see that with SBX we actually don't find it, but there's a bunch of cases where it was found with Illumina data, and there's a bunch of different programs that were able to find it.
So we know it's real, and we should be able to find it. But when we dug into it a little bit more, we could see why. Here's our fusion contig, LAMP1--MCF2L. Here's our Illumina evidence for that fusion, so we can see where the breakpoint is.
And then we look at the SBX data, and we can see there's basically not much coverage here. This is because MCF2L is a really long gene. You can see we have coverage, but the coverage is really restricted to the very ends. So we wanted to address this. All this work was done over the course of like 4 or 5 weeks.
Two weeks ago we said, let's try to fragment the cDNAs and see if we do any better when we just run the fragments of cDNA through. So this is the original coverage plot I showed you earlier, where we had the high bias on the ends, and then we did the fragmentation.
We basically cleaned up a lot of the end bias that we were having. It's not entirely gone at this point, but again, this is one experiment and there's still a lot of optimization to do here. But lo and behold, we now actually find the fusion evidence with SBX for this fusion, and we get better coverage within this region.
So it turned out to be fruitful. Overall, looking back at this, this checks all the boxes. This was a preliminary analysis, and there's a ton more work that we can do. But again, this was done over the course of just the last few weeks. Does it work? We say yes. We think this is going to be a very, very powerful platform for future reference work.
There are some notable advantages. The read length is obviously one of them: we get a lot of medium-length reads. I call them longish. With the longish reads, we start getting up to the 2 kb mark. And we can quickly yield billions of reads in a very, very short period of time, which is pretty exciting.
We do have a wishlist. There are things that would really benefit future transcriptomics applications with this. One of the obvious ones is getting a broader transcript coverage distribution. One issue now is that when you do this fragmentation, you actually lose that strand specificity that we had earlier for the internal fragments.
But this is all stuff that is solvable, so nothing to worry about there. More reads: we want more reads that are in the 1 kb and longer-read territory, pushing the limits there, which presumably is also going to be doable. And then improving the base-calling accuracy for simplex reads,
because we'd like to have high-accuracy reads. But even at Q20 plus, it wasn't holding us back in any way. So overall, we're very excited about this. I'd like to acknowledge the Broad Clinical Labs methods development group that I worked in, headed by Issac Kohane, the DepMap group and the cancer data science group for collaborating with us on the cell lines, and of course our partners at Roche for giving us this early access to the sequence data.
Exciting stuff.
Presentation by Brian Haas (Methods Development Laboratory, Broad Institute) at ASHG 2025: "Characterization of cancer transcriptomes and fusion isoforms using Roche SBX technology."
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
Brien Mahoney: Welcome and thank you for joining us for this exciting introduction to Roche's novel sequencing by expansion technology. My name is Brien and I'll be your moderator for today. Before we begin, I'd like to tell you how you can pose your questions.
On your screen, you should see a QR code that when scanned will automatically open the online submission form. You can also open a browser and type pollev.com/rochewebcast. All one word, no spaces. You can pose a question at any time during the presentation. I'll be back at the end of the presentation to help get answers.
Now appearing on your screen, you should see our esteemed panel of experts who over the next hour will guide you through this breakthrough approach to sequencing.
With that, it is my pleasure to introduce the global head of Roche Diagnostics Solutions, Palani Kumaresan. Welcome, Palani.
Palani Kumaresan: Thank you, Brien. A very warm welcome from my side as well. Today is indeed a momentous day and many years in the making and we are really happy that you could join us for this webinar.
Now some of you may not be as familiar with Roche. So here I have some highlighted facts and figures about us. As you look at this slide I would like to highlight two important points. One, we have a deep history of commitment to innovation and we deeply care about having an impact in the societies that we live in.
I would like to call out three figures here. We have three Nobel Prize recipients and over 40 Prix Galien innovation awards over the course of the past 50 years. If we look at the WHO list of essential medicines and diagnostic tests, we have over 40 Roche medicines and close to 100 Roche diagnostic tests that are part of this list. And last year alone, we delivered over 30 billion diagnostic tests to our customers around the world.
Now if you zoom into our sequencing portfolio as it stands today, we have our AVENIO assay kits. These are research-use only kits that can run on third-party sequencers. On the front end, we have our AVENIO Edge system which automates sample preparation. We also have our KAPA library prep and KAPA target enrichment reagents that can run on this instrument but can also be used independent of the instrument. And downstream we have navify® mutation profiler which is a tertiary analysis software for generating actionable insights in a research use only setting.
What has been missing from this picture is our sequencing platform. That's why we are here today to introduce the technology behind our future sequencing platform, and I would like to make two important points as we introduce this platform.
In the future we will introduce it as an open standalone sequencing solution that both academic researchers and translational researchers can use to advance their work. At the same time, we will have it as part of an end-to-end sequencing ecosystem that we can bring into a clinical setting along with our assay kits. So without further ado, let's get into it. And for this, it's my absolute pleasure and privilege to introduce Mark Kokoris, the inventor of sequencing by expansion chemistry.
Now, Mark is one of the most brilliant minds I have come across. In 2007, Mark came up with this very elegant but hard to realize idea to address the fundamental problem of threading a DNA molecule through a nanopore at very high speeds and being able to decipher the four bases with very high accuracy. Over the course of the past 15 plus years, five of which have been with Roche, Mark and the Roche sequencing team have just realized this.
So without further ado, let me hand it over to Mark to talk to you about the chemistry and some of the exciting results.
Mark Kokoris: Thank you so much, Palani, for the kind words in the introduction. I'd like to welcome everyone joining us today from so many different time zones for what is a very exciting day for Roche Sequencing. It's truly an honor to present with my colleagues today introducing the technology. As Palani said, this has been a many-year journey to get to this point; for me it's been 18 years. All that is to say we've got a lot to cover, so let's get started.
The SBX technology was designed for flexibility and performance with headroom to efficiently scale into the future so let's dive into that a little bit. If you're developing a new technology whether it was 1999 or 2029 you have to think about accuracy, throughput, read length, cost efficiency. Those are a given for any sequencing technology. You also have to think about making sure that you have the headroom to scale into the future.
These are all important given considerations that we certainly thought about when starting this technology. But one of the other ones that was key for us was flexible operation. We envisioned a technology that can sequence up and down the throughput spectrum with the same instrument, and doing that efficiently. So for example, if you wanted to sequence for four minutes, 40 minutes or four hours, we wanted a sequencer that can do that. And in fact, I'll show data today from 20 minute, 1-hour and 4-hour runs just as an example of that.
Of course, accuracy is critical. I show an example here for our duplex whole genome sequencing for reference sample HG00001, just to give an idea of what our duplex sequencing accuracy is looking like, and we'll go into that detail later on. We'll also cover throughput: a whole genome sequencing run for seven Genome in a Bottle samples where we achieved over 30x coverage in one hour of sequencing. Just to put a number to it, that was over five billion duplex reads in one hour of sequencing. Read length: we'll cover that and make the distinction between duplex and simplex reads in terms of length, accuracy, and throughput. And we'll show an exciting sample-to-VCF whole genome sequencing result in less than seven hours that really highlights some of the flexibility we're talking about. And of course the big one, cost efficiency.
We hear you on this. I don't think you can develop a sequencing technology without being very tuned in to the importance of cost efficiency. We consider it chemistry efficiencies, measurement efficiencies as I point out there, the reusable sensor module which is our sequencing chip. All of those things come together - data analysis, algorithm efficiency, and even the quality of the signal that you pull off your sequencer. All very important to efficiency.
So if you want to be able to someday, as I do, aspire to do a sequencing run where you get a trillion reads. You really have to be tuned into that. And in fact, it has to be wired into your DNA sequence or as I'll explain later on, perhaps wired into your Xpandomer sequence. I couldn't help myself. Sorry about that.
So, what does this look like? What are we doing? This is the coming together of Stratos Genomics sequencing by expansion chemistry and the Genia high throughput sensor array technology. Two very powerful technologies that really needed each other to bring out the best in performance. So, on the instrumentation side, we have an instrument that does the SBX chemistry and a separate instrument that does the sequencing. And again, we believe this actually maximizes the flexibility of the technology. Now, into the future if the workflows dictated to pull both of these systems together that would be something that we would certainly consider.
So, a little bit about the agenda for today. I've heard a lot of conversations about the SBX technology and what it is. Some were pretty spot-on and some not so spot-on. So, I think it's important that we cover and outline what the technology is just to make sure everyone's clear on what the technology does. We'll then go into some of our whole genome sequencing performance for our SBX duplex sequencing that we're doing with our early access collaborators at Hartwig Medical Foundation and Broad Institute.
We'll also hear from John Mannion who will come on and talk about some of the duplex data structures, go into detail there, and dive into some of the quality of the data that we're seeing. I'll come back to cover some of the RNA applications again with our early access collaborator at Broad Institute. We’ll finish off with some pipeline open-source tools and Gustav will come on to talk about our road map, commercial road map and then hopefully time for some Q&A.
So SBX technology overview. Our approach to efficiently sequencing DNA was to not sequence DNA. Pretty much that simple. SBX was a process, a biochemical process using modified nucleotides and enzymes to convert DNA into an expanded surrogate molecule. We call that the Xpandomer. Well, the idea being that we would rescale the signal to noise challenges of direct DNA measurement. So, such a simple picture, how hard could it possibly be? So, I reminded myself of this quote many times over the years.
If you want to improve, be content to be thought foolish. So, I don't think we were ever content to be thought foolish, but I'm really glad that we were foolish enough to try to develop this technology because when you get on the other side of it, it really does enable some pretty powerful measurement capabilities and it's very exciting and these are the things that we're going to show you today.
A little bit about our preprints. We're actually making a preprint and it should be available as we speak, but certainly within the next couple days that covers the SBX chemistry all the way through early 2020. We wanted this foundational paper to kind of talk about how we did it and the important elements of the technology. So, check that out. It'll be a nice 60 to 70 page read. So again, tens of thousands of experiments going into this particular preprint. So enjoy that.
So how do we do this? So essentially I alluded to this modified nucleotide. It's essentially an expandable nucleotide triphosphate or XNTP. So our original thinking was to use a tether, attach a tether to a nucleotide at two specific points. And I'm showing there the alpha phosphate and the heterocycle linkage separated by cleavable bonds. So once you incorporated this XNTP using a polymerase, you could selectively cleave and expand the structure and now you have a new reporter code taking the place of the nucleotide.
As we went on with the technology over years and were more directed towards nanopore measurement, we engineered a translocation control element. So as it says this was to allow us to get a clean reproducible synchronized modulation of the Xpandomer as it moved through the nanopore. So this was a key very important innovation. And also over the years we learned that we needed to add enhancer chemistries and engineer those into the structure as well because, not surprisingly, most polymerases don't like XNTPs. So we had to add that feature as well. And just a little nomenclature here, SSRTs are symmetrically synthesized reporter tethers. That's that whole structure there as we call it.
So diving a little bit deeper, how do you do this? You do it with innovative molecular engineering and by developing novel chemistries, because there was no roadmap for how to do any of this way back when we started the technology. We literally made and engineered thousands of custom SSRTs to get to the point where we are now, and we did that using novel phosphoramidite building blocks, dozens and dozens of those. On the dNTP-2c side, that nucleotide core structure that we use as one part to build the XNTP, it was very similar: dozens of synthetic routes. That amidate diester structure did not behave like normal nucleotide chemistry, and so we had to learn how to do that as well. And I show that coming together in this convergent synthesis at over 50% purified yield to point out that this is an efficient process, and there's no room for inefficiency in this type of sequencing technology.
So I really wanted to point out that we're able to make these very complex molecules incredibly efficiently and I always like to show this picture where I've got the green box highlighting a nucleotide to show the chemical structure of a full XNTP just to get an idea of how crazy this idea is, what 50x expansion really looks like.
So now on to the polymerase. So given that structure, you can imagine that you would need to have a polymerase with an open architecture. In other words, that point where I'm showing the arrows there, having somewhere for that very large tether to go. Most polymerases don't have that opening but the specific class of polymerase Dpo4 or the Y family lesion bypass polymerases actually have that capability and so that kind of directed us to go in a lesion bypass direction.
In terms of our engineering approach, we would come up with a specific set of designs we would want to survey more broadly, refine the search space using rational design, high throughput screening, combine interesting regions of the polymerase together, and iterate on that over and over and over again. Thousands and thousands of polymerases to get to the point where I show below where now we don't have a DNA polymerase anymore. We call it an “XP” or “Xpandomer synthase”. We've literally changed it into a different enzyme because it actually doesn't really like regular nucleotides anymore. It's an interesting little side note.
Over 10% of the amino acids are mutated. Our raw read simplex accuracy is 99.3%. And again, contrast that with duplex accuracy, which comes from reading both strands; we'll talk about that later. We've turned the polymerase from a distributive polymerase into a processive polymerase, and that's part of how we get read lengths of over a thousand bases. We've also turned it into a strand-displacing polymerase, which it normally isn't, and that's very important to be able to do duplex sequencing.
A little bit of the tools. We won't cover much here. Just to say we threw everything at this to figure out how to make these new polymerases. Whether it was machine learning, structure prediction, display technologies or even fusions. We pretty much threw everything at it to get the polymerase that works.
Okay, so now you have the XPs, you have the XNTP with these enhancer elements, and what happens if it still doesn't work? Okay, well, at least doesn't work well enough. You create a new molecule. And so that's the polymerase enhancing moiety that I've shown there. Think of this as the glue that holds the whole system together. And so I like to show this one. This is an old gel, something that was really that aha moment when we finally figured out the glue to the whole system. That red box is what you get if you leave any one of those three key elements out of the reaction. And then if you add all of them, including the newest one in the series, the polymerase enhancing moiety, you get an idea of how robust the extension is. Again, this gel is from probably 2018. This is an old gel. Now we have gels that show thousandmers looking even better than this, just to give some perspective on the innovation there. So I showed the synthesis instrument earlier.
It's essentially a fluid handling system that directs reagents into this XP chip, as we call it, for a fixed solid-phase synthesis workflow. I'll go through that step-wise, but it is a pretty intuitive and straightforward process. We have an extension oligo, as we call it, attached to the surface. Hybridize the library of interest to the extension oligo. Add your reagents to do the Xpandomer synthesis. Cleave to expand the structure. Wash the chip, and then use a photocleavage, an orthogonal cleavage, to release the Xpandomer from the support. Pretty straightforward, pretty intuitive process.
Now elaborating a little bit on this translocation control I talked about before. We added this innovation as we realized it's really important to have that control over the movement of the Xpandomer through the nanopore. And so I show that orange triangle pausing the Xpandomer as it goes through the barrel side first of the nanopore. And again, we run backwards. We get cleaner results that way. And you can get an idea of the signal you would see from that pausing for that specific period of time. We then do a very quick voltage pulse to advance the Xpandomer one reporter code at a time. You get an idea of what that signal now looks like, and so on and so forth, and you get the idea that this is a way of modulating the throughput of the Xpandomer as we sequence.
So pretty important advancement in the technology and I like to say that the box there where we show the translated XP base calls, that was a drawing that Bob and I did in 2012 to show what we were trying to do, what we'd hoped to be able to do with the technology. And this is what we actually did a mere seven years later. So, and this is a result of our single pore approach system with 1 and a half millisecond pulsing. So, really great separation. And you get an idea as you zoom in and look at the homopolymer bases what those look like when you get this deterministic control over the movement of the molecule through the nanopore. And as well the level histograms there we can see the separation of all four of the bases was exactly how we'd hoped to draw this up in terms of solving that signal-to-noise and signal resolution challenge of directly measuring DNA. So still a single molecule measurement but you see the signal resolution we were able to see here.
So now take that one pore and add it to an 8 million array, an array that combines microwells, electrodes, detection circuits, and analog-to-digital conversion. On the left side you see the 8 million XP chip mounted on a PCB, and then you can zoom in a little bit at that SEM (scanning electron microscopy) image on the right, where we've drawn in the bilayer, the pore, and the Xpandomer translocating through it. So now you can get the picture of what this can look like as you scale to an 8 million chip.
Okay, a little bit more on the sequencing system instrument. Similarly to the synthesizer, it has fluid control and reagent delivery to that sensor module that we show in the middle of the screen, and of course it also has all the electronics to move the signal and the sequencing data off the system. So it's obviously a little bit more complicated than the synthesizer. And a little bit about the sequencing process. This is the simplified view of it, but essentially this is how you initialize a run. You set up by forming the bilayer and inserting the pores, which we do for every single run, just to be clear on this. Every run we form the bilayers and bring up the pores for every sequencing cycle. We then add the Xpandomer that you would have synthesized, flow that across, do your sequencing (again 4 minutes, 40 minutes, 4 hours), clean the system, and then you move on to reuse.
So the reuse part of this is, as I like to say, significant reuse. Again, going to that efficiency discussion that we had before, a key part of the technology is to be able to reuse the sequencing sensor module there. So again, you can do that over and over again. And a little bit about the active pores on the system. As I said, there's 8 million total sensors. Typically, we'll get 7-7.5 million pores that are functioning sequencing pores, that can sequence Xpandomer.
And you get an idea over time, you know, 20 minutes, 60 minutes, 4 hours, of the pore lifetimes over that time period. And the way I look at that is, all the data we're showing today would have gone through this type of pore lifetime profile, so it's a great opportunity to improve our throughput even more as we continue to optimize this, because it's not a foregone conclusion that pores aren't going to last for four hours or even way beyond that. So I think that's another opportunity for optimization with the technology.
A little bit more about the capture efficiency. So I show the leader structure threading through the nanopore as well as a concentrator, and I can remember a conversation we had many, many years ago with a prospective science adviser who said, I don't know if you're ever going to be able to make these molecules, but I really don't see you being able to efficiently thread them through a nanopore. I won't say any names. Anyway, this was something that had been at the top of our mind from day one: how do we get these big molecules to thread through a nanopore? And I show this result to show how we figured it out with a leader concentrator approach, concentrating the Xpandomer and really lining it up for measurement.
I also show the trace on the right, which has four Xpandomers sequenced in a row, and you get an idea, that little spot being open channel, of the gap in between the next molecule coming and the next and the next. So this just illustrates how efficient our capture can be. Now this is a great example, four in a row. They don't always look like that, but it illustrates that we're able to really line these things up and sequence very efficiently. Again, efficiency is the word that always comes to mind with this type of measurement.
So now you go from that one pore to an 8 million array, and we're looking at a single pore on the 8 million array. I point out at the top there the open channel in the image, as well as the bracket showing the Xpandomer signal traces for the four different bases. So you get an idea of how much Xpandomer is going through in a 1-hour run, just to give you an idea of scale. Now you zoom in a little bit. Look at one full Xpandomer.
This is probably about an 800-mer to a 1000-mer. I actually don't know. But just to give you an idea of what that zoom-in looks like for that tiny little fraction across all of those reads in one hour, and then you zoom in a little bit more and you get an idea of those translated XP base calls, what that looks like. Again, flexibility. We'll reiterate this over and over again. We wanted to have a technology that could support low throughput, very high throughput, and everything in between. Whether you had a single library, multiplexed deeply or shallowly, or you wanted to do one synthesis or four, we wanted to be able to support that entire scale of sequencing.
Important part of the technology, one of the things that really makes the technology distinct is we have that massive throughput flexibility. Those are really critical elements for the SBX technology.
Now, jumping in a little bit on the SBX duplex approach, just to outline what it is, and it's pretty straightforward. You have a library with a hairpin and Y adapter structure, in contrast to the simplex, which is pretty much just a single-stranded structure there. And what this allows you to do is make an Xpandomer where you have both parent strands on the same Xpandomer molecule, allowing for intramolecular consensus and thus the ability to get really high accuracy.
Okay. And so the two topics we'll cover today are somatic oncology as well as a rapid whole genome sequencing and the rest we'll cover over time and certainly we'll talk about simplex later on in the presentation, but for the next few slides we'll talk about those applications.
As far as the workflow goes, you have your sample, whether it's tissue, blood, or cell-free DNA, running through your library process. Again, it's pretty straightforward. You have a Y adapter and a hairpin that you would ligate to create your Y adapter structure. I show next that currently we use 20 to 50 nanograms of genomic DNA, unfragmented genomic DNA, as your start point.
I think there's room to whittle that down and use even less, and for cell-free DNA, 2 and a half to 10 nanograms is what we've used up to this point. The one unique thing here is we actually do a linear amplification of this library, as opposed to PCR. We found that that actually helps our accuracy quite dramatically, and you'll see when we show our homopolymer and our indel accuracy what that looks like; for an amplified library, it is pretty impressive. After that we pretty much just run through our normal Xpandomer synthesis and sequencing process.
So a little bit about the data I alluded to before. This is a 1-hour run, a sevenplex using seven Genome in a Bottle samples, and as I mentioned, in this 1-hour run we got 5.3 billion duplex reads. This is the breakdown of the duplex reads per sample; you see pretty uniform yield on all of them. The mean insert read length breakdown on that, John will cover in his slides in a little more detail. Coverage was 34-38x for that 1-hour run. And then the F1 scores: what we're seeing with RGATK (Genome Analysis Toolkit), the Roche machine learning approach, is very respectable scores for both SNV (single-nucleotide variant) and indel. And then also, at the same time, a shout-out to the Google DeepVariant team. We sent the data over to them and they were able to run their system on that and get very, very similar results and F1 scores, in fact a little bit better than ours. So we're really happy to get that external look at our data, coming back and finding the same results, if not a little better than what we had.
As far as percent of genome at greater than 10x coverage, we were at 99.7%, and as I mentioned earlier, Q39 scores. And the next slide just shows what happens at a 30x downsampling: very little change in terms of the quality of the results for the F1 scores.
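For readers less familiar with the F1 scores quoted here, the following is a minimal sketch of how an F1 score is conventionally computed from a variant-calling comparison against a truth set. The function name and the example counts are hypothetical and chosen only to show the arithmetic; they are not figures from the talk.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for a set of variant calls."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(true_positives=9_990, false_positives=10, false_negatives=10)
print(f"F1 = {example:.4f}")
```

Because F1 combines precision and recall, a call set scores well only if it both avoids false calls and recovers most of the true variants.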
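To put the Q39 figure in context, here is a short sketch of the standard Phred relationship between a quality score and an error probability, Q = -10 log10(p). The function names are illustrative; only the formula itself is standard.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability that a base is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


p = phred_to_error_probability(39)
print(f"Q39 corresponds to roughly 1 error in {1 / p:,.0f} bases")
```

Under this convention Q30 is one error in a thousand bases and Q40 is one in ten thousand, so Q39 sits just below one in eight thousand.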
So moving on to Hartwig Medical Foundation. Edwin Cuppen, the scientific director there, is directing that team and I have this picture of Ewart and Joris in front of our systems that were placed in their labs in the fall.
A little bit more about that. We uncrated the instruments on a Monday. By the end of the day, the sequencer was sequencing. The synthesizer had run through its liquid handling tests. By the end of the next day, both systems were qualified and ready to go. So very first time systems came from the Seattle labs, flew over there, installed and were functioning within two days. I think very impressive for the very first time an instrument left the lab and was placed elsewhere. And a great team there at Hartwig’s to work with. So of course there was preparation, but just to get to that so quickly was really quite impressive.
A little bit about Hartwig Medical Foundation and Edwin will cover this when he speaks next week at the following week at AGBT, but their approach is they have a fully automated lab and IT workflow to perform routine whole genome sequencing based cancer diagnostics for hospitals. And Edwin will go into that detail about what they're doing, but you can see why Hartwig was a great fit for Roche and it wasn't a low bar challenge. We went after it with this type of sequencing because we wanted to be challenged, challenge the technology so we can improve.
A little bit about the study we did. So we used a demonstration with clinical research sample pairs and for this we used 30 matched tumor normal samples and actually added in a couple of cell line references from the Colo cell line. We did 100x coverage for tumor, at least targeted 100x for tumor, 40x for the normal sample. And this is where that 15 billion duplex read number came from in 4 hours that we would have seen in the advertisement for this series.
So just to give you an idea of the amount of output we got in the 4-hour run as well as the workflow we used below to generate all of this data. It took us eight sequencing runs to complete the entire series.
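As a rough back-of-the-envelope check on the scale of that study, the sketch below adds up the targeted coverage for the 30 matched pairs and spreads it over the eight runs. It assumes every sample hits its target depth exactly and that the load is spread evenly across runs, and it leaves out the Colo cell line references because the talk does not give their number or depth. The function name is hypothetical.

```python
def targeted_genome_coverage(pairs: int, tumor_depth: int, normal_depth: int) -> int:
    """Total targeted coverage, in 'x' of one human genome, across all sample pairs."""
    return pairs * (tumor_depth + normal_depth)


total = targeted_genome_coverage(pairs=30, tumor_depth=100, normal_depth=40)
per_run = total / 8  # eight sequencing runs, assumed evenly loaded

print(f"total targeted coverage: {total}x")
print(f"average per run: {per_run}x")
```

With those assumptions the series targets 4,200x of genome coverage in total, or about 525x per run on average.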
A little bit about the data. Again, Edwin will present this data in more detail at Roche's silver AGBT workshop on February 25th at noon. What we're showing here is data that was generated in Hartwig's lab by the Hartwig team and analyzed by Hartwig. So all this is their data, just to reinforce that; I may even say it one more time. What they saw was that SNV and indel counts were similar across platforms, being SBX and Illumina, and most, if not all, of the differences were explained by stochastic effects involving real low-VAF subclonal variants. And one last thing, the empirical platform SNV error rate for SBX was 1.7-fold lower relative to Illumina.
The thing I like to say about this, and one of the benefits of having been with the technology from the very beginning, is that I've seen the trajectory and the direction we're going with the technology, the advances in the technology. I'm very, very pleased with what we're seeing now, knowing how our trajectory has been going over the last many, many months and years, and seeing what's changed even in the last six months, I'm very happy to see this result. And again, Edwin will cover this in a lot more detail when he speaks at the workshop, which will be recorded so people will be able to see that.
A little bit about the SBX Fast protocol, which is, as we say, an amplification-free workflow, so it looks very familiar to the SBX approach that I was showing before. I'll highlight that, of course, if you're going to go amplification-free, you would expect to put a little more sample into this. Really, this 2 microgram input is because we're only running as a single plex; for the Fast protocol that I'll show later, we're only doing one sample, as I'll show in that example.
The rest is pretty straightforward. It's amplification-free, with the same protocol and the same workflow we showed before, the idea being that we would tighten up some of the timings for the library prep and the synthesis step to get an even tighter workflow. We just started working on this a few months ago, in the late fall.
Okay, a little bit about the data looking a little different than what we talked about before because we ran these as trios, right? So in the case of HG002, HG003, and HG004 I have highlighted there, we just ran those for one hour on all three samples and you get an idea of the median coverage that we got there and you know, very similar to the SBX-D, the amplification approach we showed before.
Nearly five billion duplex reads in one hour. So very consistent there. Even better F1 scores when you look at them, especially on the indel side of things. So it's looking fantastic for an amplification-free protocol, with yields very consistent. And diving in a little bit deeper, this is a 30x whole genome for the HG002 sample truth set, again with the F1 scores on the Y-axis, getting an idea of the breakdown: small variants, SNVs and indels, from the previous slide, but then also a little bit on repeat expansions and what we're seeing there, structural variants, and of course copy number variants. And as I said before, please look at the data, compare it as you want. These are all just the start point for where we are now. We're very excited to look at these numbers and continue to improve upon them, and we already have a really great idea of exactly the things we need to do to keep pushing these numbers better.
But as a start point, I think we're very happy with what we're seeing here. But again, people can look at the data and draw their own conclusions. So the vision for SBX Fast is the idea that we can take a sample through VCF variant calling in about 5.5 hours again with a system that is the same sequencer that we use for any other applications we're talking about. It's not specialized in any way. It's just a different protocol where we run the steps a little bit faster than what we would otherwise do. So expanding that to our early access with Broad Clinical Labs and Sean Hoffher and his team there.
So why Broad Clinical Labs? Well I think we all know they are a leader in the field of genomics. For technology assessment, it makes a lot of sense—clinical translation, genomic disease associations, and building resources. And Sean will cover that when he speaks as well at the AGBT workshop.
So a little bit about this increasing need for scalable rapid clinical whole genome sequencing. The Kingsmore publication really does a great job of outlining that and I encourage you to read it. It's a great read. Then if you look at the right here, in terms of some of the conclusions, genetic disorders are a leading contributor to mortality in NICUs. Every hour saved in diagnosis has a significant impact on improving outcomes. Shorter NICU stays open up access for other critical patients. Shorter stays lead to cost savings for the organization. So a lot of good reasons, and the Kingsmore paper outlines the traditional neonatology approach, highlighted in the green there, versus the precision neonatology approach. So great read, I encourage everyone to look at that.
A little bit about the current world record time from Stanford 2022. We tried to pull out of that what we believe is the sample-to-diagnosis time of 7 hours and 18 minutes, as well as the sample-to-sequence-completion time of 5 hours and 2 minutes. We think we got that right. Certainly tried to be able to do that. 94% read alignment, bases greater than Q7 with an average of Q11.5, just to kind of get a comparison of what they saw with that technology.
So a little bit more about some training runs that we did at Broad, the initial training runs, and again this was in January this is not a long time ago, this just happened very recently, and our approach on the training runs was really just to get people familiar with the technology, we weren't trying to break land speed records on this. We were just trying to get a frame of reference to see if we saw the same results in the Broad lab as we were seeing in our labs as well.
And the answer is yes. We saw very similar throughputs, lengths, F1 scores with our benchmarking samples. And then we took a step further where we ran some Coriell samples with expected variant types just to see do we get the same answers with the runs we did at Roche labs versus the Broad Labs. And the answer was yes. Looks great. So, let's move on and try some Fast sequencing. Try to see what we can do here.
So, we outlined the workflow at the top there. I'll highlight that we did 20 minutes of sequencing to get this result, and those are the exact times as the Broad team ran through the protocol on February 11th. So it's less than two weeks ago when we actually did this run. In the results below, you get the idea: library prep through VCF in 6 hours and 25 minutes. And again, we didn't have the 20-minute sample prep part because we wanted to use HG1 so we can get good quality measures with the data. You can see almost 30x coverage and very respectable F1 scores, and again, this is the second or third time we've done this, so there's a lot of optimization to come. We will pump up the yield and shorten the sequencing time, no doubt, even from 20 minutes, and those F1 scores will go up as we tighten up the protocol and do a little bit more optimization with that.
So you know, very exciting. Again, it illustrates the flexibility of the technology to be able to do this run in six hours and 25 minutes, and we just started trying to do this. So I'm very excited for Sean to present the data at AGBT. And now it's my honor to hand over to John Mannion, who heads our computational science group and a fantastically talented team of people, and he's going to dive in a little bit more detail on the duplex read structure and format and a little more on the data that we're seeing.
John Mannion: Thank you, Mark. What an honor it is to be here today, and thanks to everyone who's taken an interest in joining us on this webinar. In this first section on data analysis, we'll focus on two main topics. The first will be SBX-D read characteristics and properties, and the second will be a short, cursory introduction to our accelerated local data processing strategy. We will use the example WGS run that Mark started with, the sevenplex whole genome sequencing run, and dig further into that, doing a deeper dive on those read characteristics.
To look at those read properties we start with the Xpandomer molecule itself and a cartoon diagram of that Xpandomer molecule is shown in the upper left, for the SBX-D read constructs. And here you can see indications of genomic DNA insert in the blue as well as adapter segments in the other colors.
If you take an Xpandomer molecule, thread it through the pore, and using the sensor module measure it, produce a read, you'll generate reads such as what is shown in the lower left of this slide. And in this read, you can see blue segments as well as adapter segments that are annotated. The Y adapter on the left is the start adapter. The hairpin adapter is shown in the middle. And there's an end Y adapter sequence as well. The genomic sequence data is present in the middle in the blue between those adapter segments.
If we take a stack of those reads randomly sampled and pile them on top of one another, we can generate an image like the one you see here, in which hairpin adapters are always present in the middle and the various adapters at the ends are present in different colors as well.
You can see the first and second passes on the genomic DNA in different shades of blue. We can take those linear reads and informatically fold them back over on one another, producing what we call our full-length duplex SBX-D read. In this case, we refer to it as full-length since in this read, both a full first and second pass were achieved on the insert genomic DNA segment.
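The informatic folding described here can be pictured with a toy sketch. This is not Roche's pipeline: the hairpin sequence is a made-up placeholder, the Y adapters are assumed to be trimmed already, and the two passes are paired by position, which ignores the insertions and deletions that a real implementation would have to handle by alignment. All names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder; the real hairpin adapter sequence is not given in the talk.
HAIRPIN = "CCCCCCCC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fold_read(read: str) -> List[Tuple[str, Optional[str]]]:
    """Fold a linear read (first pass + hairpin + second pass) at the hairpin.

    Returns one (first_pass_base, second_pass_base) pair per insert position,
    in first-pass orientation. The second element is None where only the
    first pass covered the position.
    """
    first_pass, hairpin, second_pass = read.partition(HAIRPIN)
    if not hairpin:
        raise ValueError("hairpin adapter not found in read")

    # The second pass reads the opposite strand back toward the start of the
    # insert, so its reverse complement lines up with the END of the first pass.
    second_on_first = reverse_complement(second_pass)[-len(first_pass):] if first_pass else ""
    offset = len(first_pass) - len(second_on_first)

    return [
        (base, second_on_first[i - offset] if i >= offset else None)
        for i, base in enumerate(first_pass)
    ]


full = fold_read("GATTACA" + HAIRPIN + "TGTAATC")   # complete second pass
short = fold_read("GATTACA" + HAIRPIN + "TGT")      # second pass stops early
print(full)
print(short)
```

In the first call every position gets a partner base from the second pass. In the second call only the three positions nearest the hairpin do, and the rest are paired with None.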
We know this because we were able to reach the end adapter sequence. Not all SBX-D reads necessarily reach that end adapter sequence. There are various processes upstream which could cause that to be the case. These reads are still valuable, however. And a pileup of a random sampling of these is shown just above. Here you can see that the hairpin adapter is always present in the middle. And the full first pass is present in the light blue. But for many of these reads they have a shorter or partial second pass.
These are also useful, as we say: we can perform the same bioinformatic analysis, where we informatically fold these reads back over on themselves, and we generate constructs such as what you see on the right. We refer to these partial reads as partial duplex reads. Note from this image on the left, you get a sense for two things regarding read length.
One, the raw read length distribution, and two, the input DNA length distribution that went into Xpandomer synthesis itself. You can get that sense from the full-length duplex read insert size distribution. Now if we look in more detail at these two categories of reads, both the full-length duplex read and the partial duplex read, we can look at the different classes of bases which are generated, starting with the full-length duplex reads on the upper left.
You see of course that the whole genomic segment was covered with duplex. The vast majority of these bases will match one another on the two passes. And for those positions, which we call concordant duplex positions, we call a base and mark that base as high quality. A small percentage of bases, given our SBX raw error profile, will result in a discordant duplex position, or a position where the two passes on the read do not match. In this case, we also call a base but mark that as a low quality base. For the partial duplex reads, as is seen on the right, there is again a duplex segment. But in addition to that there are simplex bases, which are parts of the insert for which we only had a single pass. For these simplex positions bases are called, but these are marked as medium quality bases. All of these bases are useful in downstream variant calling, and we do pass along all this information to the variant caller, including the different quality scores.
Note the two cartoon diagrams here are temporary informatic constructs alone and the high accuracy consensus reads that are produced are themselves linear reads with an associated quality vector and these are compatible with encoding in the standard file formats such as FASTQ or BAM.
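The fold-and-classify step described above can be sketched in a few lines. This is a minimal illustration, not the actual SBX pipeline: it assumes adapters are already trimmed, that neither pass contains insertions or deletions (so positions line up without alignment), and that the discordant base is simply taken from the first pass. The function name `fold_read` and the one-letter quality labels are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical one-letter labels for the three quality bins in the talk.
HIGH, MEDIUM, LOW = "H", "M", "L"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def fold_read(first_pass, second_pass):
    """Fold the second pass back over the first and label each base.

    The second pass starts at the hairpin, so after reverse
    complementing it lines up with the tail of the first pass.
    Returns (consensus, labels), both linear strings of equal length.
    """
    folded = reverse_complement(second_pass)
    overlap = min(len(first_pass), len(folded))
    simplex_len = len(first_pass) - overlap
    folded_tail = folded[len(folded) - overlap:]

    labels = []
    for i, base in enumerate(first_pass):
        if i < simplex_len:
            labels.append(MEDIUM)   # simplex: only one pass covers it
        elif base == folded_tail[i - simplex_len]:
            labels.append(HIGH)     # concordant duplex
        else:
            labels.append(LOW)      # discordant duplex
    # Assumption: the consensus base is taken from the first pass.
    return first_pass, "".join(labels)


if __name__ == "__main__":
    print(fold_read("AACCGGTA", "TACCGGTT"))  # full-length duplex
    print(fold_read("AACCGGTA", "TACC"))      # partial duplex
    print(fold_read("AACCGGTA", "TACG"))      # partial, one mismatch
```

The output is a linear read plus a per-base label string, which mirrors the point that the folded construct is only a temporary informatic step and the result still fits a FASTQ-style record.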
How do these different base types affect our overall metrics? When it comes to base accuracy, the Q39 number for example that Mark mentioned earlier, we focus on concordant duplex bases only. These are bases in our highest quality bin for SBX-D. At the same time, in order to be consistent, our coverage metrics, whether that is coverage over the genome, base throughput, sample throughput, etc., are also anchored around concordant duplex bases only. The other bases, such as the simplex bases, are there. Think of them as coming along for free, and they do provide valuable information, but when we talk ultimately about our sample throughput numbers, that will be on coverage of 30x genomes with concordant duplex bases only.
As for length metrics, we think the most relevant metric for SBX-D reads is in fact the insert length, which is exactly as it sounds: a representation of the original length of the genomic insert. And for this, we count both the duplex and the simplex bases. When it comes to insert length, we believe the most important aspect is our ability to map confidently to the genome, and that is assisted by both the duplex and simplex bases.
We can now take these read categories and metrics and, having defined them, take a closer look at that example SBX-D Genome in a Bottle run which Mark covered above. If you look in the lower left you can see a portion of that table. All seven samples that were multiplexed are shown there and you can see the absolute number of duplex reads generated for each.
If you sum all of these together, you see that in the course of the 1-hour run, 5.3 billion SBX duplex reads were produced. Of those, 2.6 billion, or 49% of the reads in total, were full-length duplex and the other 51% were partial duplex reads. In terms of base mass, a total of 1.2 terabases of SBX-D genomic base mass was produced, and of those, 70% or 840 gigabases were of the concordant duplex base category.
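As a quick sanity check, the run totals quoted here are internally consistent. The snippet below only re-does the arithmetic on the rounded figures from the talk; the variable names are illustrative.

```python
total_reads = 5.3e9          # SBX duplex reads in the 1-hour run
full_length_reads = 2.6e9    # reads with a complete second pass
total_bases = 1.2e12         # total SBX-D genomic base mass (1.2 terabases)
concordant_fraction = 0.70   # share of bases that are concordant duplex

full_length_pct = 100 * full_length_reads / total_reads
partial_pct = 100 - full_length_pct
concordant_gigabases = total_bases * concordant_fraction / 1e9

print(f"full-length duplex: {full_length_pct:.0f}%")
print(f"partial duplex:     {partial_pct:.0f}%")
print(f"concordant duplex:  {concordant_gigabases:.0f} gigabases")
```

Rounded to whole numbers, this gives 49%, 51%, and 840 gigabases, matching the figures stated above.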
With respect to that insert length distribution, again the same number that you see in the table is represented here: for HG001, an insert length mean of 235 base pairs was achieved. And if we want to look at the distribution itself, it is certainly interesting to see. Yes, that mean of 235 base pairs is achieved for that distribution and some molecules are shorter. However, it does have a long tail on the right side, and of these longer molecules there is a good fraction above 300, 350, even 400 base pairs in SBX-D insert length. These are highly valuable in mapping to difficult-to-map locations in the genome.
With respect to accuracy, Mark has already covered that at a high level: we're approximately Q39. We'd like to show how that accuracy breaks down into the different error categories. You can see in the bar chart below the accuracy represented there and the substitutions, insertions, and deletions called out separately. For substitutions and insertions, these are, as you can see, roughly the same, and they are both about 5 × 10⁻⁵ in terms of their frequency for the concordant duplex bases. We're often asked if we have the ability to predict base quality with even higher accuracy.
Indeed we do have internal models that have been developed. These are still in test and development but those models so far suggest that of the concordant duplex bases greater than 85% of them are predicted to be over Q30. We'll continue to refine and calibrate these models and it's possible that we would push that number up even further.
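For readers less used to Q-scores, the standard Phred relation Q = -10 · log10(p) converts between a quality value and an error probability. The short sketch below applies it to the numbers mentioned in this part of the talk; the helper names are illustrative.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (20, 30, 39):
        print(f"Q{q}: about 1 error in {1 / phred_to_error(q):,.0f} bases")
    # A per-category error frequency of 5e-5, taken on its own:
    print(f"5e-5 corresponds to about Q{error_to_phred(5e-5):.0f}")
```

Q39 works out to roughly one error in 7,900 bases, and a single error category at 5 × 10⁻⁵ corresponds to about Q43 on its own, which is why the combined figure sits a little lower.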
One thing that is commonly asked about SBX is, since we are not measuring an ensemble of clonally amplified molecules, is it possible that our accuracy can hold up even as we go to very long read lengths?
Indeed we believe that to be the case based on understanding of the physical processes that lead to different sources of error and we also can see that evidence in the data. In fact we can take this run and take the fact that we have a distribution of insert lengths and look at slices of that distribution and understand from looking at those slices whether accuracy is holding stable.
We've done that for this run. We've taken slices that are exactly one base pair in width. And so if we look at the next slide, we can see populations of reads that are exactly of the lengths indicated on the left. And you see how many absolute reads were generated for HG001 that match this exact insert length.
We start with 150 base pair inserts on the top and work our way down to 200, 250, and even 350 base inserts on the bottom. You can see that over the course of different insert lengths, accuracy holds up well. And again, this is empirical accuracy of concordant duplex bases, or our highest quality bin. You can also see as you go from left to right for any one of these plots that accuracy does hold up with position in the read. And so as we go toward the ends of the reads, we do not see any evidence of a significant degradation of accuracy. It appears to be quite stable.
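The slicing idea, taking only reads of one exact insert length and measuring accuracy at each position, can be illustrated as follows. This is a simplified sketch under stated assumptions: each called read comes paired with a truth sequence of the same length, and only substitutions are counted. A real analysis would work from alignments and handle insertions and deletions. The function name is hypothetical.

```python
def accuracy_by_position(pairs, insert_length):
    """Per-position accuracy for reads of exactly `insert_length`.

    `pairs` is an iterable of (called, truth) strings. Pairs whose
    lengths differ from `insert_length` are skipped, which is the
    one-base-wide slice described in the talk.
    """
    matches = [0] * insert_length
    used = 0
    for called, truth in pairs:
        if len(called) != insert_length or len(truth) != insert_length:
            continue
        used += 1
        for i, (c, t) in enumerate(zip(called, truth)):
            if c == t:
                matches[i] += 1
    if used == 0:
        return []
    return [m / used for m in matches]


if __name__ == "__main__":
    toy = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACG", "ACG")]
    print(accuracy_by_position(toy, 4))
    print(accuracy_by_position(toy, 3))
```

A flat profile from left to right is what "accuracy holds up with position in the read" looks like in this representation.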
One additional note about this bottom group of reads, the 350 base pair inserts. In order to produce that, significantly longer raw reads were required. So for 350 base pair inserts plus adapter segments the raw reads were required to be on the order of 750 or 800 base pairs. And if you look at that absolute number of reads for that one multiplex sample that is not a small number of reads of that length to generate within a single 1-hour run.
Looking here at the next slide, we look at another very, very common question for any sequencing technology, which is how does the accuracy hold up for homopolymers? There are a number of different ways to view homopolymer accuracy versus length. We've chosen one here that we think can be meaningful for a good number of people and is easily understandable: we've looked at the impact that homopolymers have on variant calling.
So here we've taken the variant caller's calls that are part of that Genome in a Bottle high-confidence region on which we've assessed those F1 scores, and we've looked at all variants that interact with homopolymers of different lengths. These are specifically indel variants interacting with homopolymers from length 2 all the way through length 30 and even beyond length 30. And we've looked at this data through the lens of two different variant callers.
These are the two Mark mentioned previously. The first is the Roche internally developed and optimized variant caller, which is based on GATK plus Roche ML. And the second, on the right, is a DeepVariant caller. And we can see that for all homopolymer lengths F1 scores are very good.
In particular, for all of those homopolymers that are of length 15 or less, F1 scores are greater than 99%, and we believe this is very good performance for SBX. We're certainly very proud of this number.
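For reference, the F1 score is the harmonic mean of precision and recall, and stratifying it by homopolymer length just means computing it separately per length bin. The sketch below uses invented counts purely to show the calculation; the helper names and the numbers are not from the talk.

```python
def homopolymer_length(seq, pos):
    """Length of the run of identical bases that contains seq[pos]."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up (tp, fp, fn) counts per homopolymer length, for illustration only.
    counts_by_length = {5: (990, 5, 5), 20: (95, 3, 2)}
    for length, (tp, fp, fn) in sorted(counts_by_length.items()):
        print(f"homopolymer length {length}: F1 = {f1_score(tp, fp, fn):.4f}")
    print(homopolymer_length("ACGTTTTA", 4))
```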
We also would like to acknowledge of course the work of the genomics team at Google. We've really been working with them for a fairly short period of time, but the collaboration has been great and they've put in great energy and we look forward to more collaborative work to come.
Another set of very important metrics for any sequencing technology are coverage uniformity and whether there are any biases as a function of genomic content or specific types of sequences, types of content within a sequence. And so on the upper left we can see a very common depth histogram, as it's called. And this shows the coverage of different positions in the genome.
And so for a 30x subsampled run, which again is our Genome in a Bottle (GIAB) HG001 sample from that demonstration run, we've subsampled to 30x and then we look at properties of that distribution, and those properties are also summarized in the table below. So when we subsample to 30x, that is 30x of concordant duplex base coverage alone, and that achieves both a median and mean of roughly 30.
And then if we look at the F90 and F95 scores, these we believe are also good numbers. For those that aren't familiar, the way we're defining them here is that the F90 is the median over the 10th percentile and the F95 is the median over the fifth percentile. We also have plotted, as you can see in the upper left, the coverage depth achieved by all bases, and that includes the simplex bases as well. That coverage is a little bit higher than concordant duplex only, in the sense again that it comes along for free.
We're going to anchor our sample throughput around the concordant duplex bases but want to provide that information where possible, and the numbers associated with that distribution are below.
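Using the definition given here (median depth divided by the 10th or 5th percentile depth), the F90 and F95 scores can be computed from a list of per-position depths. The sketch below assumes a nearest-rank percentile, which is one of several common conventions and may differ from the one used for the slide; the function names are illustrative.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fold_score(depths, pct):
    """Median depth divided by the depth at the given low percentile.

    fold_score(depths, 10) is F90 and fold_score(depths, 5) is F95
    under the definition used in the talk. Values closer to 1.0 mean
    more uniform coverage.
    """
    low = percentile(depths, pct)
    if low == 0:
        return math.inf
    return statistics.median(depths) / low


if __name__ == "__main__":
    depths = list(range(21, 41))  # toy per-position depths, 21x to 40x
    print(f"F90 = {fold_score(depths, 10):.3f}")
    print(f"F95 = {fold_score(depths, 5):.3f}")
```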
With respect to biases, one view of that is the view we've chosen to take here in the upper right. We've taken the Genome in a Bottle consortium's genomic stratifications and shown relative SBX coverage over those different stratified regions. It looks very good over all of them in our opinion. Certainly we're going to continue to push to increase performance wherever there are even minor gaps.
If you look at this highlighted section here, the low GC content, less than 15%, that's obviously under the line of 1.0 with respect to normalized coverage. Of course we'd like to push that up in the future, but another set of regions we'd like to draw your attention to are the high GC content range bins. It was not always the case that we were showing relative coverage values of 1.0 in these bins.
In fact, in the early days of SBX duplex, this was a challenging area for us to cover well, and it's really a testament to Mark and the chemistry team that they were able to target driving this up by focusing on some select knobs in the SBX synthesis. It also is a testament to some of what Mark has often talked about, which is a separation between chemistry and measurement: they were able to turn those knobs and improve the high GC content coverage without negatively impacting any other aspects of the system or the measurement process.
As we go on to the next section, and again just a short section on local data analysis, I want to pull up once again the numbers associated with the instrument's base call production rates: greater than 500 million bases per second for simplex reads.
These are our Q20 accuracy level reads. And then also greater than 15 billion reads for four-hour runs for the SBX duplex scenario, or greater than five billion reads in a 1-hour run. Obviously the teams are extraordinarily proud of this and we think this is a big advantage, but we also acknowledge that this could present some practical challenges for those who are trying to run the instrument and generate results as fast as possible, or run the instrument continuously in a very, very high throughput scenario.
So what have we done in terms of our work to accelerate data processing and assist with handling all of this data that's generated at such a high rate? We can start just by briefly looking at the example SBX fast demonstration that Mark mentioned toward the end of his first section, in which the analysis for that solo genome occurred in less than two hours of post-run, post-sequencing time. And here on this slide we see an indication of the different major algorithmic steps that were executed, and the manner in which they were executed.
Base calling on the left was done in real time, as it always is. Raw data of course streams off the sensor and it's called instantly. Also demux, intramolecular consensus, mapping and alignment, and even germline variant calling were all accelerated through various means.
In this first demonstration those accelerated steps were performed in what we call an offline fashion, and all that means is we did not start that data processing until the sequencing itself completed. If we think about opportunities for further accelerating this workflow, we of course will be focusing on making each of those algorithmic steps more efficient. But also, our next deployed version of the accelerated pipeline for SBX-D will be even faster.
We are taking some of those steps that were so far run offline and we are moving them into execution in an online fashion. And again that means this will be done locally but concurrently with generation of sequencing data. The key characteristics of SBX that make this possible are twofold. Number one, it does not take multiple hours or maybe a day or even more to generate full intact reads. Rather, complete reads, fully intact, are generated on the order of seconds.
Second, when it comes to SBX-D, since we are physically linking information from both the parent plus and minus strands, that information is then localized in time. So, as we take our strategy to form higher accuracy consensus reads in which we've localized our uncertainty to specific bases, we can see that we have the opportunity to execute demux and intramolecular consensus in real time, and mapping and alignment in near real time, because all the information necessary to execute those steps is localized in time and is available for the first sets of base calls as they're coming off of the instrument.
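The difference between the offline and online modes described above can be shown with a toy example. Because each read carries everything a per-read step needs, the step can run as soon as the read arrives instead of waiting for the run to finish. Everything below is an illustrative stand-in: `run_offline`, `run_online`, and the fake instrument are hypothetical and say nothing about how the actual pipeline is built.

```python
from typing import Callable, Iterable, Iterator, List


def run_offline(reads: Iterable[str], step: Callable[[str], str]) -> List[str]:
    """Wait for the whole run, then process every read."""
    collected = list(reads)          # blocks until sequencing is complete
    return [step(read) for read in collected]


def run_online(reads: Iterable[str], step: Callable[[str], str]) -> Iterator[str]:
    """Process each read as soon as it comes off the instrument."""
    for read in reads:
        yield step(read)


if __name__ == "__main__":
    events = []

    def instrument():
        # Stand-in for base calls streaming off the sensor.
        for name in ("r1", "r2"):
            events.append(f"sequenced {name}")
            yield name

    def per_read_step(read):
        # Stand-in for a per-read step such as demux or consensus.
        events.append(f"processed {read}")
        return read.upper()

    results = list(run_online(instrument(), per_read_step))
    print(results)
    print(events)  # sequencing and processing interleave
```

In the online case the event log alternates between sequencing and processing, whereas `run_offline` would record all sequencing events before any processing event.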
Great work here by our accelerated compute and algorithm teams, in collaboration with our instrument teams, to bring this integrated set of capabilities together. Also, we'd like to acknowledge as always our external collaborators, in this case, in particular, the NVIDIA Parabricks team.
Roche’s GPU and bioinformatics teams are working hand-in-hand with NVIDIA, and have been over multiple years actually, to among other things optimize performance of select Parabricks software components and libraries such that they may be leveraged within parts of the Roche-developed SBX pipeline. The collaboration has been extremely positive and we certainly do look forward to continued work together.
Finally, as you think about aspects of the technology being presented here today, you can quickly imagine a great number of useful features that will be made possible by such real-time integrated analysis in combination with the SBX chemistry and sensor module.
Overall, our intent, our strategy, is for the instrument's local compute to be as fast, as flexible, and importantly, as efficient as the SBX technology itself. And with that, we'll end this first section on analysis. And I'll hand it over to Mark, who will talk about SBX simplex reads and some of the novel applications that have been pursued in that area so far and what more could come.
Mark Kokoris: Thank you. Okay, so shifting gears a little bit to go towards the direction of simplex, as I mentioned before. A little background on this, and we'll go quickly through it as I think everyone knows what we're saying with simplex: basically just those single-stranded reads.
And as before with the duplex, we focus on two particular types of sequencing, probe-based and then some RNA isoform work, both of which we're doing with our Broad collaborators. So a little background foundational slide just to get an idea of the yield we're talking about with simplex reads, and we're just doing one-hour sequencing runs as a kind of benchmark example.
On the left we're showing 10x Flex probes. What does that look like in terms of the sequence output in a 1-hour run? And again, these are reads that we call Cell Ranger valid barcode reads. In other words, they have valid barcodes and confidently mapped probe sets.
Okay, so it's a subset of what we actually output in one hour. Nearly 14 billion reads in one hour of sequencing for those types of Flex reads. Pretty impressive. That's just to give you an idea of that end of the spectrum for simplex reads. Now to the other side.
We've created, again, HG001 templates just so we can get an idea, get our bearings on quality and accuracy: a library of about 800 bases in length. And looking again at a 1-hour run, what's the average read length if you just look at full-length reads? Well, we're getting nearly 800, raw read accuracy 99.36%, about 1.8 billion reads per hour and about 1.4 terabases per hour.
If you look at all reads greater than 400 for this data set, you're close to 670 bases average length, same accuracy, 4 billion reads of that length and about 2.7 terabases in one hour. So for these types of simplex applications, you can get an idea of the output of the technology.
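The throughput figures here follow from read count times average read length. The snippet below repeats that multiplication on the rounded numbers quoted in the talk; `yield_terabases` is an illustrative helper name.

```python
def yield_terabases(read_count, mean_length):
    """Approximate base yield in terabases from read count and mean read length."""
    return read_count * mean_length / 1e12


# Full-length reads: about 1.8 billion reads at nearly 800 bases.
full_length_tb = yield_terabases(1.8e9, 800)
# All reads longer than 400: about 4 billion reads at close to 670 bases.
over_400_tb = yield_terabases(4e9, 670)

print(f"full-length reads: {full_length_tb:.2f} terabases per hour")
print(f"reads over 400:    {over_400_tb:.2f} terabases per hour")
```

The products come to 1.44 and 2.68 terabases, consistent with the "about 1.4" and "about 2.7" terabase figures.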
And I really wanted to frame that so people understood those trade-offs between that and duplex sequencing. Quickly on the RNA, because we want to leave time for the Q&A session at the end. Just a little bit about the 10x Flex kit. This being a kind of probe capture workflow, pretty standard in this particular case.
SBX takes off at the pre-amplification step: do a strand enrichment, drop straight into SBX sequencing just like we covered before. Similar for the 10x three prime, five prime gene expression. Only this time we can actually enter the workflow early on and skip a bunch of steps, fragmentation, repair, etc. We don't need to do any of that. We just go straight into SBX sequencing, strand enrichment, and again sequencing more along the lines of those longer reads I was showing in the previous slide. So pretty clear.
One of the other things we have to do to feed into the Cell Ranger pipeline here is actually mimic the read one, read two formats for an Xpandomer read. As I show above for a typical Xpandomer read, we just break it up into these split read one and read two formats, so you can get an idea of how we can feed it into the pipeline. As time goes on, pipelines will be more customized to using the type of SBX sequencing reads, and so we won't have to do that.
A little bit about our early access Broad Institute collaborator. Another real class team there: Aziz Al’Khafaji and his team that we're working with, as with all of our collaborators here, are really world class in terms of the quality and the work that they're able to do. So we're really excited to work with them.
The two focus areas for Aziz's team are a 1 million cell multiomic drug response profiling experiment as well as a single cell five prime gene expression experiment. So this is kind of outlined here. Aziz will go into a lot more detail and speak to this again at the AGBT workshop, going into details of the PRISM setup that he has as well as what we did for the single cell.
So I'm really just going to focus on the output of this and let Aziz really dive into the data. Again, a lot of specific details here, but the one thing I loved was when we sat down the first time and talked about what we wanted to do, I said, let's challenge the system. Let's go big in terms of the measurement. What's a read-heavy, a measurement-heavy approach? And so, this is what he came up with: this large-scale multiomic experiment.
Again, he'll go through the details of the experiment here, but this is something where you really need to have a significant amount of output if you're going to be able to do these types of experiments. A little bit about the format here. I won't really go into the details too much, but essentially it is a whole transcriptome, a protein panel and a PRISM barcode panel all coming together into multiple pools in a workflow that then feeds into Xpandomer synthesis and sequencing. So what's the output when we do this? 30 billion reads in a 4-hour run. I'll say that I think there's a heck of a lot left in the tank to go much bigger than that in terms of the read count number.
This is early data optimization. We can certainly push that number up, but 30 billion reads in that time frame is pretty significant. So, we were excited to see that result and then get the picture of what that looks like with this UMAP result of all those different, 480 I think, cell lines, and essentially this conclusion, which is that SBX is able to meet the immense sequencing needs for highly multiplexed multiomic experiments.
This is the kind of stuff that we really have that workhorse capability to drive with this technology, and Aziz and his team will present this at AGBT in much more detail to talk about why we're excited about this high throughput approach. Moving on to the 10x single cell longer five prime reads: again, the workflow is able to cut out quite a few of the steps, so it simplifies the workflow. Any steps you can cut out is great if you can do that, and in this case you actually get longer reads, so it's a double benefit for the technology.
As I always say, one of my goals for the technology is to get as close to the sample as you can, and I very much look forward to being able to do that more and more with the applications that we explore over time with collaborators and as this goes on to market. Just a little picture of what we saw with this particular experiment. This is just one run. Again, I haven't really spent that much time trying to drive yield at length with this approach, but just to give you an idea, in one hour of sequencing, we saw over 200 million reads greater than a thousand.
So again, this is without really optimizing the system much. And again, to put a number on reads greater than 400: two and a half billion reads for this polydisperse single cell RNA library. So I'm really happy with that. It's something that we'll continue to optimize, and really bring this approach into the kind of workhorse status for the technology for these types of reads, where you get more than enough out when you have Q20 plus reads, but then the throughput and the flexibility of the workflow is really what comes into play here. So I just kind of threw this on as a nice little snapshot of what a 2000-mer Xpandomer looks like in a nanopore, just to give a picture and to show that this is something we can do, which is push these lengths.
Again, not something we've really focused on, but I've always gotten the question, how far can you go? Well, we're going to find out. We're going to keep pushing lengths up. Again, there are always trade-offs with length and yield that you have to consider, but as you get the idea, this is a very high yielding technology, so there are a lot of possibilities there. A little bit about the results that Aziz saw: the PBMC cell types were clearly resolved on the left, and the clear biomarker expression as well was pretty much as expected for the particular sequence reads. And then lastly, SBX can resolve novel transcript isoforms, and so here we show this nice picture where on the top you have this novel isoform that they were able to pick up relative to the known isoform in the sample set.
So this is just one of the examples that came out of the study that we did, and again Aziz and his team will cover this in more detail at the conference. Looking forward to hearing from him on that. And now we go back to John to elaborate a little bit more on the simplex read properties and analysis. There you go, John.
John: Great. Thank you. And thanks Mark. We'll do here just a quick section on SBX simplex read properties and the raw read error profile, and also comment on the optimized open-source tools which we plan as part of our informatics strategy. Looking again at this image in the upper left.
Mark showed this just a few minutes ago. This shows how our simplex reads can fit quite easily into existing workflows for single cell, both for the 10x five prime and for the 10x Flex constructs shown here. With SBX, of course, a single long read would span enough length that it could be split into read one and read two, which would be the traditional format in which this data would be presented to downstream analysis tools.
So in the short term we took sort of an expedient route to do just that. We perform read trimming and splitting. We then split those into read one and read two FASTQs, and then we can send those into existing on-market pipelines such as Cell Ranger.
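To make the splitting step concrete, here is a minimal sketch of turning one long trimmed read into a read one and read two FASTQ record. The split position, the `/1` and `/2` naming, and the absence of any reverse complementing are all assumptions for illustration; the talk does not describe how the actual split is defined.

```python
def split_read(name, seq, qual, r1_len):
    """Split one trimmed long read into read one / read two FASTQ records.

    `r1_len` is a hypothetical split position (for example, the length
    of the barcode-bearing portion expected by the downstream tool).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")
    if not 0 < r1_len < len(seq):
        raise ValueError("split position must fall inside the read")
    read_one = f"@{name}/1\n{seq[:r1_len]}\n+\n{qual[:r1_len]}\n"
    read_two = f"@{name}/2\n{seq[r1_len:]}\n+\n{qual[r1_len:]}\n"
    return read_one, read_two


if __name__ == "__main__":
    r1, r2 = split_read("read1", "ACGTACGTAA", "IIIIIIIIII", 4)
    print(r1, end="")
    print(r2, end="")
```

Each record could then be appended to separate read one and read two FASTQ files for an existing paired-read pipeline.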
In this particular case it should be noted that neither the reagents themselves nor the algorithms of the existing on-market products and tools have been optimized for the SBX error profile. To give a little bit of a view as to how that raw error profile looks, and why there might be optimizations that could still be made in this area of applications, let's take a look at some example DNA runs which we will use to probe the raw error profile and read characteristics.
This is a set of experiments which we've run really just for demonstration purposes, and we've taken size-selected input DNA of the lengths that you see on the left. The target input DNA lengths were 370, 500, 750, and 1,000 base pairs. So, if we do that with the 500 base pair size-selected input DNA, we have a raw read length distribution that is shown in this box above. And you can see the majority of these reads have made it to the end of the molecule. Some of them are partial reads for various reasons.
And we do know that we are able to identify those reads which are, as we call them, full-length reads by the presence of our known adapter segment at the very end of the read. And so in the read length histogram shown below, these have now been colored by those which have or do not have an adapter segment found at the end. For the rest of the analysis on these next few slides, we'll just look at the accuracy and error profile characteristics of that full-length distribution, just for simplicity. Although when we look at the partial reads, really we see the same results with respect to accuracy and error profile.
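The full-length test described above, checking for a known adapter segment at the very end of the read, could look roughly like this. The adapter sequence below is a made-up placeholder, and the exact-match search within a fixed end window is an assumption; a production classifier would likely tolerate sequencing errors in the adapter.

```python
END_ADAPTER = "TTAGGCCA"  # placeholder sequence, not a real SBX adapter


def is_full_length(read, end_adapter=END_ADAPTER, window=12):
    """True if the end adapter occurs within the last `window` bases."""
    return end_adapter in read[-window:]


def split_by_completeness(reads, end_adapter=END_ADAPTER, window=12):
    """Partition reads into (full_length, partial) lists."""
    full_length, partial = [], []
    for read in reads:
        target = full_length if is_full_length(read, end_adapter, window) else partial
        target.append(read)
    return full_length, partial


if __name__ == "__main__":
    reads = ["ACGTACGT" + END_ADAPTER, "ACGTACGTACGTACGT"]
    full, partial = split_by_completeness(reads)
    print(len(full), "full-length,", len(partial), "partial")
```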
And so at a high level, the accuracy for each of those sample experiments is shown in the table below. 99.3% was achieved with all of them. And we can go to the next slide to look at this in a little more detail to see how that breaks down.
Those four experiments are broken down into a raw read error profile in the four tables below on the left, and you can see that 370 is up top, then 500, 750 and a thousand as we go down the page. All the error rates are shown there, as is the breakdown of substitutions, insertions and deletions, which are roughly the same order of magnitude. Just as we asked with SBX-D about the dependence of accuracy on insert length and position in the read, we can ask a very similar question for raw reads.
And we can see in the upper right the 500 base pair run has very stable accuracy for all positions in the read, and the error profile as well is broken down there and that is fairly stable. And note as with the SBX-D plots on accuracy versus position, we've also trimmed slightly here just to avoid edge artifacts and also trimmed adapters as well.
A preview of our raw read QC score predictors can be seen in the lower right, with fairly good performance. Again, these are still being calibrated internally. There's much more work to do, but absolutely we're able to predict some of the low quality reads and can leverage that to identify high quality reads or those passing our filters.
Going beyond just that 500-base pair sample, we can look at accuracy versus length of insert and position for all of those different size selected input DNA sizes. And so you can see the 370-base pair template shown on the top, very stable Phred score with respect to position in the read for those full-length reads and same with the 500-base pair as we just saw a moment ago as well as the 750-base pair template and 1000-base pair template.
Just to circle back very quickly to this view of the accelerated pipeline, the local data analysis pipeline. Our intent is to accelerate this pipeline, particularly to focus on the common algorithmic steps, which will facilitate customers' ability to work with SBX data in their own secondary pipelines and to do so efficiently: analyzing on-prem in real time, and reducing and compressing data so that it may efficiently be moved downstream to the user's cloud. It is not the intent to completely close off users' ability to work with the data. In fact, quite the opposite.
And so what we've demonstrated here with this image is that in addition to providing that accelerated workflow, which is shown in the middle, we also want to enable customers to pull data off of that pipeline at various points: to directly retrieve BAM or CRAM files in the case of SBX-D duplex reads, or, for applications where simplex reads are needed, to directly access the SBX simplex reads and move those up to the customer's own location of analysis. In addition to that, we aim to provide a set of open-source analysis tools which can be used as reference pipelines.
Customers could take a look at them or choose to ignore them, or integrate them into their own cloud pipeline. In either case, we aim to do whatever we can to facilitate the customer's adoption of technology and make it as easy as possible to adapt both the assays and the informatics algorithms to the SBX read characteristics. And with that, I'll hand it over to Mark to speak a bit about the future application areas and the headroom of the technology. All right.
Mark Kokoris: Thank you, John. Okay, so we're in the home stretch, almost there. A little from Gustav after that and then we get to the Q&A. So, innovative headroom. I guess we're kind of ending where we started the presentation here, in terms of the SBX technology into the future. It's interesting after 18 years with the technology.
I look at it as we're at the beginning of the technology, not at some culmination for the technology, but really at the beginning, getting really excited about the possibilities for what we can do, as we've learned so much over the years bringing these two technologies together.
So as we think about some of the stuff into the future, we'll add in, in terms of our flexible operation, this run-until-done idea: if you do a sequencing run and you know how much you need, you don't need to sequence further. Again, another form of efficiency. When we think about accuracy, with chemistry, measurement, and algo optimizations, I mean we understand this technology very, very well, and there are a lot of buttons to push, both with respect to the polymerase side and the chemistry side, to keep pushing that raw read accuracy up and up and up.
We understand it very very well and we will continue to do that over time to keep pushing accuracy, keep pushing throughput efficiencies. I mentioned faster pulsing before that's a significant button to push as well as increasing the density of the sensor module. Again, more pores than what we even have at this point. Starting to get an idea of the type of possibilities we have into the future. Of course, thinking about all of these in the context of cost efficiency.
So again, I like to think of this as a beginning for the technology. We're just getting started with what's possible with this technology and then looking towards the future application space. I like to say people are going to need to kind of relearn what's possible with this technology given so much throughput and given the quality of our duplex reads and so that's an exciting thing for me to see and for the team to see and all of Roche to see is how many other things we'll be able to do with this kind of technology capability.
So rethink what's possible with the technology across the whole range of applications and even think of things that you otherwise would have never wanted to do that were now possible when you think about the read and the output, the read heavy output that you have with this technology. So very exciting for us to think about these possibilities into the future and really look forward to see us kind of driving this Formula 1 car around the racetrack and seeing what we can do with the technology. It's an exciting time.
Thank you, everybody, for coming today, and we'll hand off to Gustav, our life cycle leader, who'll kind of go into a little bit of detail on the commercial side. So Gustav, you go.
Gustav Karlberg: Okay. Thank you Mark and thank you to both Mark and John for a great presentation. So much interesting information to share with you today. We're very excited about where we are with this technology, as you can tell.
I get the opportunity and the privilege to talk a little bit about our commercialization strategy and what we're looking to do as we move forward here. Mark talked a lot about the different workflows and showed data from some of those that we've generated already. I think the big takeaway here is that this technology has a lot of versatility. It can be used across all the existing workflows that we have in sequencing, and also for things that maybe we're still just imagining, in a way that makes it ideal for a lot of these different applications. And we as a company, Roche, will think about some of these applications where we would like to develop an end-to-end solution.
We're going to focus on that over time but I do want to talk a little bit about the fact that we want to make this an open available solution for you, the industry and the researchers to do the work that you want to do. We are going to release this as an RUO product. We have a future vision to take this to the clinical setting. SBX will be our backbone for all the technology development that we want to do in sequencing.
But I want to make it very clear that in these first initial releases, we are making this a very open technology for everyone to use. We also heard a lot in the webinar about the extremely high throughput of this technology and the speed that it has. Both of these factors are super important as they enable us to do all these different applications of interest.
But what we're focusing a lot of effort on and what Mark talked about as well is this flexibility that we've built into the system. We are really trying to make sure that we can drive for flexibility, enable you to run sequencing in different ways than what you're currently doing, and have an efficient way to meet your lab operation needs.
Sequencing has been driven by this cost reduction that has happened over time. However, with regards to uptake, the challenge is that a lot of those cost reductions make it accessible for very large studies. What we are focusing our design efforts on is seeing how can we minimize that price penalty between the very large studies and the smaller studies, and make sure that we can make it more accessible for more labs and have a technology that can actually answer some of those questions that all of us have.
Finally, I want to address a little bit about our commercialization and our timelines here. You've heard Mark talk about the early access work that we have done with our collaborators. We are very excited about that. We will continue to work with those collaborators in 2025 here. We also want to expand that program in a selective way with some additional collaborators looking at some of these applications and workflows of interest.
At the same time, we are working hard to get this to a commercialization stage and we are targeting to have that available in the market in 2026. We are extremely excited about where this technology is, what we have been able to do and what we've been able to show you today and we are really gearing ourselves up for something new and really groundbreaking with regards to how sequencing is done.
And before we go on to our Q&A session, I want to hand it back over to Mark to give acknowledgments to all the teams and all the work that has been done here over the last couple of years. So Mark, please.
Mark Kokoris: Thank you Gustav. Yeah, great. So first I want to thank everyone for coming today and I hope you're getting a sense of our enthusiasm and excitement for this technology where we are and where we're going with this technology. This excitement across Roche is pretty significant. I want to thank you all for coming. I look forward to the Q&A session. As far as the Roche team in general, this has been a global collaboration from Seattle to Santa Clara, Penzberg, Pleasanton, Cape Town, and many other places. Hundreds of people across the organization over the years have been involved in this technology, so it is a huge thank you from all of us for all the teams working on this for their work over many many years to bring the technology to this point and then the many years to come, as we really start running out this technology and showing what it can do.
So I'm very excited for that. Some other acknowledgements that I wanted to make here is I wanted to call out an acknowledgement to my counterpart at Genia, Stefan Roever, who unfortunately passed away a few years ago. He wasn't able to see all of this come together for the Genia Technology and what happens when we bring these two technologies together. And so I know Stefan would have been thrilled to see the results that we're showing today. And I did want to honor him with that.
I also want to reach out to my team, my Seattle team in Stratos Genomics, for all the years of support behind the technology, for believing in this when it seemed like a crazy idea that was never going to get there, and for still maintaining that rigor and that persistence and that grit to make this technology come together, believing and delivering innovation after innovation after innovation to do that.
So I want to thank them for that and all the Roche employees again for all of that innovation that we've done and are going to do into the future. And lastly, I want to thank Severin Schwan. I want to thank Thomas Schinecker, and I want to thank Matt Sause and the entire executive team for their support, unwavering support for this technology, over the years.
We really take seriously innovation and innovation that can impact all of science and innovation that can impact patients lives. This is something we take very seriously at Roche and I want to thank them for that vision and that support and commitment over so many years to this great technology and really look forward to seeing what it can do into the future.
The Roche SBX webinar reveals new sequencing technology covering duplex and simplex sequencing, rapid whole genome sequencing, and flexible platform performance.
*This transcript was generated using an AI-based transcription tool and may contain errors.
Next generation sequencing has revolutionized our understanding of biology. For years, sequencing by synthesis, or SBS, has been the most widely used technology. However, the invention of sequencing by expansion, or SBX, promises a new approach to shape the future of genomics. But what's the difference? And why does it matter? Let's start by looking at how DNA sequencing works. Sequencing a DNA sample generates its unique sequence of bases: the order of A's, T's, C's, and G's. The faster you can spot the differences that make a sequence unique when compared to a reference, the sooner you can gain insights. This could mean driving quicker progress in oncology-related research and making faster breakthroughs that shape healthcare for the next generation.
Now, imagine DNA sequencing as a game of spot the difference. Your reference sequence is the original picture, the DNA from our sample is our difference picture. Our goal? To spot the differences between them. When you have both your reference and sample pictures together side by side, comparing them is very straightforward, but the key differences lie in how that sample picture is created, namely, how sequencing reads are generated.
Let's start with the SBS sample picture. SBS sequences the sample in iterative cycles, building the sequence piece by piece. Imagine this like a printer which prints the image line by line. You have to wait until the image sufficiently appears before you can begin to see the full sample picture. Only then can you begin to spot all of the differences against the reference picture.
SBX works in a dramatically different way. First, samples are converted into molecules called Xpandomers. Then, after loading the Xpandomers onto the sequencer, full length reads begin to be measured within seconds of sequencing the converted sample. SBX achieves this speed by separating synthesis and sequencing steps. Instead of printing out our sample picture line by line, SBX begins to generate the full picture right from the start, with rapidly increasing clarity and that means you can compare against the reference picture with confidence almost immediately.
By looking at SBS and SBX side by side over the same period of time, we can see the differences in action. On the left is our SBS sample picture: we can only see a small portion of the image, and we're still waiting for the printer to finish its job. In contrast, on the right, the SBX sample picture is visible in its entirety. You can compare the sample picture to the reference from the beginning, even as the picture gains increasing detail.
But what does this mean in the real world of genomics? With SBX, as the initial view of the full picture is available within minutes, or even seconds, downstream analysis can begin almost immediately. This is in comparison to SBS, which must wait until sequencing has sufficiently progressed to begin downstream analysis. This ability to generate data and move from sample to data analysis in a fraction of the time, while delivering high accuracy and flexibility, marks a fundamental shift in the sequencing workflow.
SBX will ultimately accelerate scientific discovery, unlocking possibilities in sequencing that were previously unattainable. We're advancing towards a time where fast, accurate genomic understanding sets new standards in personalized medicine, health, and our collective well-being. So make space, because the future of genomics is here and it's faster than ever before.
Watch a high-level comparison between sequencing by expansion (SBX) and sequencing by synthesis (SBS) in simple terms. Using a printer analogy and a "spot the difference" game, this video illustrates how each method generates sequencing reads.
*This transcript was generated using an AI-based transcription tool and may contain errors. It reflects the spoken content of a live webinar and has been lightly edited for clarity.
DNA bases are separated by only 3.4 angstroms, the width of a few atoms. DNA sequencing methods cannot easily resolve this dimension. Sequencing by expansion technology, or SBX, overcomes this limitation by creating a surrogate molecule called an Xpandomer. This molecule is more than 50 times longer than its target DNA and encodes the DNA sequence information in large, high signal-to-noise reporters.
The building blocks for the Xpandomer are novel nucleotides called X-NTPs. A modified DNA replication process synthesizes the Xpandomer by sequentially linking the X-NTPs along a target DNA template using a proprietary enzymatic process. This forms a double-stranded helix. Each of the four X-NTP types depicted as A, C, G, and T has a tether that is linked between the base and the alpha phosphate. The tether supports a high signal-to-noise reporter that mirrors the base sequence identity.
Located in the center of each reporter is the translocation control element, or TCE, that modulates Xpandomer movement during sequencing. As the extension proceeds, only an X-NTP that correctly complements the next base on the DNA template will be incorporated.
The sequence information of the DNA is now expressed in the reporter sequence of the X-NTP. This process of linking X-NTPs is continued sequentially and extends along the entire length of the DNA template. After synthesis, a reagent is added, which degrades the DNA template, and the cleavable bond between each of the X-NTP bases is broken, allowing the backbone to expand. This product is called the Xpandomer. The Xpandomer is then threaded electrically through a biological nanopore in a deterministic manner. The translocation control element holds the reporter in the nanopore, altering the electrical resistance of the pore to provide one of four signals that uniquely identify the corresponding base in the original DNA sequence.
After measurement, a short high-voltage pulse is applied, and the Xpandomer is reliably advanced to the next reporter. This process is performed in a highly parallel manner across millions of wells on a single complementary metal-oxide-semiconductor (CMOS)-based sensor module and can provide remarkable real-time sequencing rates of hundreds of millions of bases per second.
Roche's innovative SBX technology can flexibly scale to satisfy a broad range of applications, setting a new paradigm for next-generation sequencing technology.
SBX transforms tightly packed DNA information into a longer molecule with distinct, high-signal reporters. This allows rapid, highly accurate sequencing that can flexibly scale to support a wide range of applications. Watch the video to learn more!
*This transcript was generated using an AI-based transcription tool and may contain errors.
All right, thanks for coming. I'm Alex, director in computational biology at Foundation Medicine, speaking on behalf of everyone at Foundation who's working on the AXELIOS platform. The title of my talk is “Whole genome sequencing on AXELIOS 1, detecting low levels of cfDNA with high accuracy.” And I've got to be a little bit transparent with you guys: when I was typing this out, with my muscle memory, I definitely typed “Alexios”. All right, so here are my disclosures.
The talk's going to be in four parts. First, just a little brief about installation and onboarding of the prototype. Then, setting up the tumor informed molecular residual disease MRD technical study design, what workflows we're using, samples using, computational analysis methods, then the results, and then finally a little bit more of a deep dive into background error rate estimations. So, first the installation onboarding. We got our machine in August of last year.
We've had it for a few months. We've been working kind of towards this event. We have top men and women. That's Matt and Rashida unboxing it down in the bottom left. And up on the right we've got our synthesis machine and our sequencer.
We got some good hands-on training for everyone who's going to be using the sequencer. And everyone was able to, you know, work on it independently by the end of the week after we got it. All right. So, tumor informed MRD technical study design.
Molecular residual disease or minimal residual disease (MRD) generally mean more or less the same thing: detection and quantification of cfDNA in plasma at especially low levels. And if you've been at this conference for more than a couple days, you'll know that there's a lot of different ways that you can try and do this sort of assay. Ours, the one we're analyzing for this readout, is a tissue-informed co-analysis of whole genome sequencing data for FFPE and plasma pairs. So basically, at a high level, what we do is whole genome sequencing on the AXELIOS, and we're also going to be referring to a similar workflow on the NovaSeq X. We determine the tumor-specific genomic signature, a list of single nucleotide substitutions to track, though in principle we could be using other kinds of mutations.
And then we do whole genome sequencing on matched plasma using, you know, the same two platforms that I was talking about before, and we hunt for the variants that we had found in the FFPE. And in principle this could be, you know, repeated on subsequent plasmas, but for this study we're only looking at one plasma per FFPE. So this has, you know, applications in a research setting for assessing ctDNA levels in, like, completed clinical trials. All right, let me just go a little bit deeper into that variant list that I was talking about. So, at a slightly more specific level, we're going to get kind of a long list of potential variants from the FFPE whole genome sequencing.
And then using the plasma and tissue together, we'll do a classification of the potentially somatic variants to determine if we're, you know, confident that they're somatic or if they could be CHIP, or if they could potentially even be germline. And then at the same time, for specificity, we're also tracking that signature that we got from the FFPE in a set of unrelated healthy plasmas, just to make sure that, you know, we have good specificity for the assay. All right. So for this application, as you can imagine, accuracy is king. We weren't really attempting to set any Guinness Book of World Records for speed here.
We used the SBX duplex chemistry for both the FFPE and for the plasma samples. And this is going to be pretty similar to what Justin just described. We used up to 24 samples derived from either plasma or FFPE, and they were manually processed through library prep on 96-well plates after an overnight linear amp. Then we did some cleanup, some quant and pooling into four six-plex pools for synthesis. After four hours of synthesis, the pools are loaded into the sequencer where each pool is run through four hours of nanopore sequencing.
And after each run, there's, like, a one-hour sensor module cleaning followed by pore regeneration prior to starting the next run in the run queue. BAM file generation is kind of happening at the same time, and we started getting them a few hours after each run completed. Okay, so as I mentioned a couple slides ago, there were two highly similar TI-MRD workflows using AXELIOS and NovaSeq X. Efforts were made to make them as similar as we could, but there were a few differences, as you can imagine. For example, there was a Covaris fragmentation for the NovaSeq X workflow versus an enzymatic fragmentation for the AXELIOS workflow.
The adapter ligation is different because we have the two-step adapter ligation on the AXELIOS workflow and there was the linear amplification for the AXELIOS workflow that Justin mentioned versus the exponential PCR for NovaSeq X. And then there was sequencing. The DNA input amounts were the same between workflows though. All right, so a cohort of matched pairs FFPE and plasma was procured. And these represented a pretty broad range of solid tumor types. I believe, if you count out there's nine solid tumor types stages one through four. So we're not intending here to learn anything about any like specific clinical setting but we're just getting a broad range of samples that are going to cover what, you know, might be observed.
The TMB distribution I'm showing it for the AXELIOS data. It was quite broadly ranging from, you know, a median of 3.2 TMB. Some of them were quite low and one of them was over 60. That's a melanoma one. These are all vendor procured pre-treatment samples.
So there's an expectation that they should all have some cfDNA, though it could be quite high or it could be quite low or in between. So for specificity, we got a close near age-matched healthy cohort of 40 samples, 40 plasma samples. For assessing specificity, I'm showing the tumor cohort ages and the healthy cohort ages. And each healthy plasma sample was analyzed using the mutation list from each of the cancer tissue samples for both workflows and I'll show that again in a second. Here's like a highlight picture of the processing pipeline that we were using for the AXELIOS workflow.
This is a mixture of some of the code that, you know, comes with the sequencer that Justin covered in the last presentation and then there's also some, you know, application specific code that we developed at FMI. So at the top we have a consensus reads, demultiplexing, variant calling, QC, CNA calling to make sure that we're getting good tumor purity for those FFPE samples. And then we get a de novo set of substitution variants from the FFPE. And that output of substitution variants is going to be an input for the plasma cfDNA samples on the bottom where we do something similar, but there's an additional intermolecular consensus and realignment that it kind of accounts for, you know, special aspects of how the SBX-D hairpins are sequenced.
Then we take the list of variants from the FFPE and hunt for them in the cfDNA, and we do further QC, classification, and counting, and make sure that the FFPE and the plasma came from the same person, or at least that they're genotype matched. And finally we get tumor fraction and MRD status and what the tracked mutation hits are that we found. So the processing pipeline for the NovaSeq X workflow was of course quite similar, but there were a couple of differences, mostly to do with, you know, special treatment for handling paired ends in the NovaSeq X reads, as well as an additional filter for shorter homopolymers in the AXELIOS workflow.
So those are basically the main differences. All right, now let's get to some QC and performance results and the comparison. So there were some sample failures. There were also some procured samples that were not actually genotype matches and were excluded.
So we got down to 46 fully passing across both workflows, FFPE plasma pairs that were considered for comparison and 40 healthy cfDNA samples. We are showing here the duplex concordant coverage, sorry, unique duplex concordant coverage for the AXELIOS workflow and unique coverage for the NovaSeq X workflow. And so the duplex concordant is from that SBX-D hairpin where, you know, if we only got one half of the hairpin that's not being counted in it, only the part that was concordant between the top and the bottom and that's fair, because that's what we're using. And you'll see that there's a slight difference.
There's a broader range in distribution for the AXELIOS workflow and they’re a little bit higher. So one thing that we did was for the FFPE, we chose the sample, sorry, the workflow that had the lower unique coverage on this plot basically and down sampled the other ones so that the FFPE are equal. We didn't do that for the tumor plasma or the healthy plasma because they were a little bit closer though technically the AXELIOS workflow is slightly higher and may favor the AXELIOS workflow a little bit. Let's see then, the healthy, so this is just a visualization of what I've kind of already said but a little bit more. So for this application we just kind of try to keep being a little bit simple about how we're doing the analysis, no machine learning here. Here we're just tracking the mutations in each of the healthy plasma samples from each of the FFPE samples.
So each of those combinatorial pairs, we're quantifying it, correcting for how much error we're expecting in each one based on, you know, the sample quality and we have this little logic here. If the tumor fraction that we estimate based on this analysis, is higher than the 40 healthy samples, then we're counting that as positive. Otherwise, we're counting as MRD negative. So, this is where basically this is kind of like way of saying that we're going to have post hoc 100% specificity but, you know, it doesn't necessarily correspond to real world 100% specificity, of course. But it's just a simple way of doing it.
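As an illustration of the decision rule just described, here is a minimal Python sketch. It assumes the error-corrected tumor fraction estimates have already been computed; the function name `call_mrd` and the numbers are hypothetical and are not taken from the study.

```python
from typing import Sequence


def call_mrd(matched_tf: float, healthy_tfs: Sequence[float]) -> bool:
    """Return True (MRD positive) if the matched plasma's estimated tumor
    fraction is higher than every healthy-plasma estimate obtained with the
    same tissue-derived mutation list; otherwise return False."""
    if not healthy_tfs:
        raise ValueError("need at least one healthy background estimate")
    return matched_tf > max(healthy_tfs)


# Made-up values for illustration only (the talk used 40 healthy samples).
healthy = [0.0, 1e-6, 3e-6, 2e-6]
print(call_mrd(5e-5, healthy))  # above every healthy estimate
print(call_mrd(2e-6, healthy))  # within the healthy range
```

Because the threshold is the maximum of the healthy estimates, no healthy sample can be called positive for that signature, which is the post hoc 100% specificity mentioned above.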
And both workflows were done this way, you know, the whole way through. Okay, here's just a little bit more QC metrics for robust AXELIOS workflow performance. So we have the coefficient of variation of the coverage. So, lower means it's smoother: you know, about 0.25 for the plasma on median for AXELIOS, a little bit higher in the FFPE, understandably. The read lengths are, you know, 1.75 approximately for the plasma, and a little bit more variation, a tiny bit shorter, for the FFPE, also kind of understandably; for the plasma at least, that's probably due to the biology of the fragmentation. We had relatively low AT dropout for both plasma and FFPE, and extremely low, very favorable GC dropout for the plasma, and a slightly higher, or at least broader, distribution for the FFPE, probably contributing to that higher CV in the coverage for the FFPE.
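For readers unfamiliar with the metric, the coefficient of variation of coverage is the standard deviation of depth divided by its mean. The sketch below assumes depth is summarized per genomic bin; the talk does not say how bins were defined, so `coverage_cv` and its input are illustrative only.

```python
import statistics


def coverage_cv(depths):
    """Coefficient of variation of coverage: population standard deviation
    of per-bin depth divided by the mean depth. Lower means smoother."""
    mean = statistics.fmean(depths)
    if mean == 0:
        raise ValueError("mean depth is zero; CV is undefined")
    return statistics.pstdev(depths) / mean


# Made-up per-bin depths for illustration only.
print(coverage_cv([30, 30, 30, 30]))  # perfectly even coverage
print(coverage_cv([5, 15]))           # uneven coverage
```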
All right, so here's kind of where the results come. So the plotted values, the stars, are the matched samples, so the plasma sample that matched to the tumor, and those stars indicate that they're MRD positive. And then the box plots along the bottom for each workflow are kind of the error-corrected signal that we're seeing in those matched healthy samples. And they're error corrected.
So we're kind of expecting that the median of those distributions should be right at zero. So that's why, and what we're really seeing here is just the distribution above the median. And that's what those boxes are. So for most of the samples that we happen to procure in this, you know, of mixed cohort, both workflows called MRD positive, see at the left side. On the right side there were a few samples where neither workflow called MRD positive, some of those look like they're a little bit on the edge, some of them not so much on the edge.
These are a little bit biased towards stages one and two, both MRD negative. But in the lower middle right, where the tumor fraction is low but not undetectable, there were four samples that were detected by the AXELIOS workflow but were not detected by the NovaSeq X workflow. And there was a sample that was detected by the NovaSeq X workflow but not detected by the AXELIOS workflow. So to summarize a little bit more, the fraction of all samples called MRD positive was 83% for the AXELIOS workflow and 76% for the NovaSeq X workflow. All right, and of course it seems to be a little bit dependent on stage, though not strictly. There's plenty of exceptions. Going into stage a bit more.
So stages one and two are where we saw higher detection rates in the AXELIOS workflow. And then the sample that was detected in the NovaSeq X but wasn't detected in the AXELIOS workflow was in stage four, but you know, that's only out of five. So somewhat expected. So okay, there's kind of three kinds of results. There's the qualitative: Did we detect it or not?
That's what I've been showing you so far. But there's also a quantitative: what is our best estimate of the tumor fraction between the two workflows? And as you can see, there's a very high correlation between the two workflows. Especially when the tumor fraction is high, we're getting great correlation. There are a few blue and orange dots getting down towards the origin; at lower tumor fractions, there's a little bit less correlation.
And then there's some statistics on the slide about what the low-level ones are that I won't read. Okay. So the third aspect of the results is, do we get the same variants? Like we said, we did the whole thing start to finish on each workflow. So we're not tracking, you know, the Illumina, sorry, the NovaSeq X workflow's variants on AXELIOS. We're tracking the AXELIOS workflow's variants on AXELIOS, right? So are they the same variants?
In the top right I'm showing a Venn diagram for a somewhat typical sample that happens to be maybe a little bit higher tumor purity. The tumor purity is probably the factor that's going to drive the, you know, overlap of variants that we find between workflows. This guy has a Jaccard index of 0.7, meaning that the intersection has 70% of the union, as you can see. And then looking at the actual quantitative VAFs, after we've done some filtering for germline and potential CHIP, we're seeing, I would say, very high correlation in the VAFs within the FFPE tissue for the AXELIOS workflow versus the NovaSeq X workflow.
Then on the left it's a little bit more complicated. We're showing a kind of density plot for each workflow and for each sample: how many of the variants in the union we detected in that workflow, right, so it's a little bit like the Jaccard index but asymmetric. And so, the median was that the AXELIOS workflow detected 87% of the union of variants from both workflows, and the median for the NovaSeq X was a little bit lower at 76%, but still overall quite high.
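The two overlap measures mentioned here can be sketched in a few lines of Python. The variant representation (a tuple of chromosome, position, reference base, alternate base) and the function names are assumptions made for illustration; the example sets are made up and chosen so that the Jaccard index comes out to 0.7.

```python
def jaccard(a: set, b: set) -> float:
    """Symmetric overlap: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fraction_of_union(a: set, b: set) -> float:
    """Asymmetric overlap: share of the union that workflow `a` detected."""
    union = a | b
    return len(a) / len(union) if union else 0.0


# Hypothetical variant calls: 7 shared, 1 unique to one workflow, 2 to the other.
shared = {("chr1", pos, "C", "T") for pos in range(1000, 1007)}
workflow_a = shared | {("chr2", 500, "T", "A")}
workflow_b = shared | {("chr3", 10, "C", "A"), ("chr3", 20, "T", "G")}

print(jaccard(workflow_a, workflow_b))            # 7 of 10
print(fraction_of_union(workflow_a, workflow_b))  # 8 of 10
print(fraction_of_union(workflow_b, workflow_a))  # 9 of 10
```

The asymmetric measure differs per workflow even when the Jaccard index is the same, which is why it is reported separately for each sequencer.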
This is approximately in line with what you would expect from just stochastic sampling of tumors of various purities, and we're not necessarily claiming that there's a difference between what kinds of variants one is able to detect, especially for substitution variants, in one workflow using one sequencer versus another.
All right, so let's go into the final section, a slightly deeper dive on the background error rate estimations that we made. So the kind of question that we're trying to get at here is: what's driving the differences in the fraction of samples called MRD positive for ctDNA? As you saw, they were 83% and 76%. But, you know, for many samples the tumor fraction is extremely high and for many samples it's very low, but in the sweet spot, kind of in the middle, there's some samples that are differentially detected between them. We matched the coverage for the FFPE, and the coverage for the plasma was, you know, close.
So what is driving it? The detection of ctDNA is going to be driven by seeing higher observed tracked mutation hits in the matched sample than in the healthy background samples. We've got to understand what the distribution of background hits in healthy samples is, because that's going to really tell you how low you can go. And again, we're not using machine learning here. We're just being a little bit transparent about how we're doing it. So we devised a way of making empirical measurements of the rates of substitutions in cfDNA at non-tracked mutation sites. So we're excluding anything that could be classified as a mutation by algorithms. We're not counting germline. We're not counting low base quality. We're not counting, you know, terrible mapping quality.
Everything that could count. We're kind of averaging that out to see what kinds of rates of background we would see in a sample so that we know, you know, if we count at the tracked sites, how much of those we would expect to be coming from potential just background. So these kind of background error rates, I think they're appropriate for modeling noise in a TI-MRD setting because they include all the sources of error but exclude the things that are, you know, going to be excluded by the process anyways. But they're not necessarily going to be concordant with, like, a perfect estimation of what the sequencer's performance is specifically. It's all about the whole workflow, including the computers. All right.
So, this is a plot of the variants including the trinucleotide context. As you know, the base before and the base after have a large influence on what the error rates are for sequencing across platforms. And so what I'm showing from left to right is the 96 trinucleotide classes: the trinucleotide plus what other base the middle base is being mutated to. So as an example, the one at the very far left is an ACA going to AAA. And there's two things that kind of stick out.
One is that there is of course a high dependence on what the kind of mutation is. Is it C to T? Is it T to C? Is it T to A? And the context does seem to matter as well.
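As a rough illustration of where the 96 classes come from, here is a minimal sketch. It assumes the common convention of writing each substitution with a pyrimidine (C or T) as the reference base, which gives 6 substitution types times 16 flanking-base combinations; the talk does not state which convention or ordering the plot uses, and the function name is hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Six substitution types, assuming the reference base is written as a pyrimidine.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]


def trinucleotide_classes():
    """Return (reference context, mutated context) pairs such as ("ACA", "AAA")."""
    classes = []
    for (ref, alt), (before, after) in product(SUBSTITUTIONS, product(BASES, repeat=2)):
        classes.append((before + ref + after, before + alt + after))
    return classes


classes = trinucleotide_classes()
print(len(classes))  # 96
print(classes[0])    # ('ACA', 'AAA')
```

With this ordering the first class is ACA going to AAA, the same example given for the far left of the plot.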
This is a representative plasma sample that we're showing, but a lot of them look pretty similar. The other thing that sticks out is that the AXELIOS workflow for almost all of these 96 mutation classes has a lower background error rate than the NovaSeq X workflow. And this sticks out a little bit more if we do a weighted average estimation. So we say, take these 96. Figure out what fraction each of those classes is in the signature that we're tracking for each individual sample. Then go do that.
Use that fraction to weight them to do an average. That's what I'm showing on the right, where each dot is a sample. And we've done that weighted averaging. And the AXELIOS workflow samples all have a lower background error rate than the NovaSeq X workflow samples for those plasma samples that are matched to a tumor. All right.
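The weighted averaging described above can be sketched as follows. This is only an illustration under assumptions: the function name, the class labels, and all numbers are hypothetical, and the talk does not give the exact formula that was used.

```python
def weighted_background_rate(class_rates, tracked_class_counts):
    """Average per-class background error rates, weighted by the fraction of
    one sample's tracked mutations that fall in each class."""
    total = sum(tracked_class_counts.values())
    if total == 0:
        raise ValueError("sample has no tracked mutations")
    return sum(class_rates[c] * n / total for c, n in tracked_class_counts.items())


# Hypothetical per-class background rates and tracked-mutation counts.
rates = {"ACA>AAA": 1e-6, "TCG>TTG": 4e-6}
tracked = {"ACA>AAA": 30, "TCG>TTG": 10}
rate = weighted_background_rate(rates, tracked)
print(rate)
```

Because the weights come from each sample's own tracked signature, two samples measured with identical per-class rates can still end up with different weighted averages.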
So in conclusion, background error rates are highly dependent on sequence context but are lower for AXELIOS for most substitution types. Okay, so we did a little bit of computational modeling of this, just simple Poisson models using the empirical error rates that we're measuring, the coverages, and the mutation types. What we're interested in looking at is the relationship between the observed mutation hits and the expected distribution of background hits. We're trying to keep this simple, but it's going to get a little complicated. So for each of the two workflows, what we're doing is looking at the cumulative number of hits we see as we look at increasingly large numbers of variants. At the far left we're looking at just one variant, which is the lowest error rate variant, and as we go to the right we're looking at a larger and larger set.
Then how many of that larger and larger set did we detect? So it should go up monotonically, and the blue line is the actual observation of how many hits there were in that sample. And then the red line is the expectation of how many hits there would be if the tumor fraction were zero. So that's going up monotonically too. And there's also a background 99% threshold.
So that's the 99th percentile of the distribution of how many we expect to see. And I just want to point out one other thing that might be a little confusing, which is that for the NovaSeq X workflow we're doing a little platform-specific modeling of the portion of the paired-end reads that overlap. Those are treated separately. And so that's why we're seeing twice as many variants on the x axis of the NovaSeq X workflow plot, because we're counting within that overlapping portion and outside of that overlapping portion separately.
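A minimal sketch of this kind of cumulative Poisson background model is shown below. It assumes that background hits at each tracked site are Poisson with mean equal to error rate times depth, so the running total is Poisson with the summed mean. The function names and the toy numbers are hypothetical (the rates are exaggerated so the thresholds are visible), and this is not the speaker's actual implementation.

```python
import math


def poisson_quantile(lam, q):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(lam)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if lam < 0 or lam > 700:
        # exp(-lam) underflows for very large lam; out of scope for this sketch.
        raise ValueError("lam must be between 0 and 700")
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def cumulative_background(error_rates, depths, q=0.99):
    """Walk variants from lowest to highest error rate and return the running
    expected background hits plus the q-quantile threshold at each step."""
    order = sorted(range(len(error_rates)), key=lambda i: error_rates[i])
    expected, thresholds, lam = [], [], 0.0
    for i in order:
        lam += error_rates[i] * depths[i]
        expected.append(lam)
        thresholds.append(poisson_quantile(lam, q))
    return expected, thresholds


# Toy inputs: three variants at depth 100 with exaggerated error rates.
expected, thresholds = cumulative_background([0.05, 0.01, 0.02], [100, 100, 100])
observed_cumulative = [3, 9, 20]  # hypothetical observed hits
above = [obs > thr for obs, thr in zip(observed_cumulative, thresholds)]
print(expected, thresholds, above)
```

In the plots described here, `expected` would correspond to the red line, `thresholds` to the 99% background threshold, and `observed_cumulative` to the blue line.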
There are approximately the same number of hits detected in either. The thing that sticks out to me is that, for the AXELIOS workflow, the observed hits are just way above the background, above the 99th percentile threshold. For the NovaSeq X workflow, it's above the expectation, but it's not necessarily at that 99% threshold. So those are the models on the left, and on the right is just a comparison, basically going back to what the data were on the previous slide.
For the AXELIOS workflow, the star is sticking outside of the distribution of real samples, and for the NovaSeq X it's at the top too, but it just doesn't stick out quite as much. We also looked at the healthy background directly, so not doing modeling anymore, but tracking the rate of background mutation hits that we're seeing in the AXELIOS workflow and in the NovaSeq X workflow for those, you know, combinatorial pairwise 46 by 40 pairs. And what we observed is that there are several samples at the far left where there are essentially no detections, but overall, for the distributions, there's a lower median for the AXELIOS workflow than for the NovaSeq X workflow. Okay, so here's the summary.
Foundation Medicine has conducted a ctDNA quantification study using an early access AXELIOS sequencer and SBX-D library preparation for WGS on FFPE and cfDNA sample types. The workflows for this assay are convenient, and they are up and running in our lab. The throughput and QC were basically on par with our expectations based on the data that the Roche team has put out there, and the sequencing accuracy, for SNVs at least (single nucleotide variants), exceeded our expectations. We did a comparison with a pretty similar workflow using NovaSeq X on the same matched sample pairs. The results were overall highly concordant, but we saw that the AXELIOS workflow showed higher sequencing accuracy and a higher fraction of cancer plasma samples called MRD positive.
So that's the summary. And I want to put in some acknowledgements from both the Foundation Medicine team and the Roche team. On the Foundation Medicine side everybody contributed significantly to this project, but I would like to highlight Akshay Kakumanu, and Matt Walsh and Eileen were leading things and developing some of the great algorithms and putting together some of the stuff that I showed. And on the Roche side, Mark of course is leading the project overall. And I would say Mahdi Golkaram gave us a huge amount of help on our algorithms and making sure we're understanding the new data once we got them in our hands. But thank you to everybody. We really appreciate everyone who worked on this!
Presentation by Alex Robertson (Foundation Medicine) at AACR 2026: “Whole Genome Sequencing (WGS) on AXELIOS 1: Detecting Low-Levels of ctDNA with High Accuracy”
Download our white paper
Germline variant calling using SBX with DeepVariant
Explore how SBX, when coupled with advanced bioinformatics tools like a pangenome aligner and DeepVariant, significantly improves the accuracy of germline variant identification from whole-genome sequencing (WGS) data.
Contributors
Roche Diagnostics
Roche Diagnostics is a division of Roche, developing and integrating diagnostic solutions that address today’s healthcare challenges while anticipating tomorrow’s needs. In more than 100 countries, we provide one of the industry’s most comprehensive in vitro diagnostics portfolios spanning molecular diagnostics, clinical chemistry and immunoassays, tissue diagnostics, Point of Care testing, patient self-testing, next-generation sequencing, laboratory automation and IT, as well as digital health and decision-support solutions.
Our articles are authored by Roche Diagnostics subject matter experts, drawing on collective expertise across multiple disciplines to provide reliable insights for healthcare professionals worldwide.
For Research Use Only. Not for use in diagnostic procedures. AXELIOS is a trademark of Roche.
References
Wang Y et al. The evolution of nanopore sequencing. Frontiers in Genetics. 2015;5:449. Available from: doi:10.3389/fgene.2014.00449