Genetic and genomic databases
We have a lot of information at our fingertips, so how do we make sense of it all to transform human health? Conor and Dodi speak to two experts who are making sense of this information overload by creating genetic and genomic databases.
Dr Artem Babaian, a computational biologist and now Assistant Professor leading The Laboratory for RNA-Based Lifeforms at the University of Toronto, explains how he and his team uncovered 100 000 novel viruses in old genetic data that could help us predict future pandemics.
Professor Jinchuan Xing, Associate Professor at Rutgers University in the Department of Genetics doing research on genomic variation, walks us through his study on using genomic data to predict infertility from aneuploid egg production.
Let's dive into the data!
- Edgar, R.C., Taylor, J., Lin, V. et al. Petabase-scale sequence alignment catalyzes viral discovery. Nature 602, 142–147 (2022).
- Sun, S., Miller, M., Wang, Y. et al. Predicting embryonic aneuploidy rate in IVF patients using whole-exome sequencing. Hum Genet 141, 1615–1627 (2022).
DODI: Conor, we live in a super-fast-paced industry. There's so much new information every day.
CONOR: And you're not just talking about the 18 500 unread emails in my inbox, but we're talking about the data that's being generated and the new science. But sometimes there's so much overwhelming information, we just don't know how to sort the golden nuggets from the chaff.
DODI: Exactly. Thank you for saying it nicely and politely. We're going to talk about big data and growing families and how those might be connected. We're going to start with genomic data, how computational biology can help us sift through petabytes of information to find what is useful, building databases, or building families.
CONOR: Oh, that sounds like an odd pairing. But I'm very interested because on one of the genealogy sites, my mother's family is one of the top largest family trees. So, let's find out more about big data and big families because that's what matters on Discovery Matters.
ARTEM: My name is Dr. Artem Babaian. I'm a Banting research fellow at the University of Cambridge, and I am a computational virologist.
DODI: Artem and his colleague Jeff Taylor started uncovering 100 000 novel viruses in old genetic data that could help us predict future pandemics, including nine coronaviruses. I mean, for me, this must be the equivalent of when you remodel a house, and you peel back wallpaper, and there are layers and layers of old wallpaper. Anyway, so we read about this on science.org. The link to their study is in the show notes. But you can listen to how this project came about with us. Now, Conor, I think you're really going to love this.
ARTEM: It was purely a volunteer team when COVID-19 happened. We were just a bunch of scientists; we didn't really know what to do. I was between work. We decided that it was our job as scientists to give back and try to fight this pandemic in any way we can. So, just kind of by luck, I ended up assembling this kind of dream team of bioinformaticians. The best and smartest people that you know, I read their papers and I'd known all about them. I shot them an email, and I'm like, "I'm going to try this ridiculous project. Do you want to help?" Or often I'd be like, "I have a question about your software." Once they got the idea, they were like, "How about, instead of just answering a question, I come on and I'll help?" It was literally like the big guns came out. Then we went from like a really good idea with decent implementation to just everything becoming so optimized by the people who developed the software to begin with. In that way, you know, we brought together this team. No one had paid us to do this, we didn't have any grants, we just got a donation for computing from Amazon via UBC. So, we always had this goal that we're creating a public domain resource, where no one came into this with an idea of making a bunch of money, or anything like that, or that we're going to commercialize it. We were trying to help like it was a war effort, right? A lot of scientists switched their labs to COVID-19. It was like, this is a huge societal problem. It's a war effort to try to fight this thing. This was our part.
CONOR: This is absolutely brilliant. I do love this. It's really poignant because we've seen so many COVID-19 variants come out of one virus.
DODI: And you've had all of them.
CONOR: I've had them all, yeah! The resilience of viruses and humans as well. The scientists at Cytiva have said that there's no better life form than a virus. They're just so good at perpetuating themselves, right?
ARTEM: It's not so much, I think, that the viruses are smart; we are fighting evolution. And that is going to always be a losing battle. You have to be humble and appreciate that this process of evolution has been going on for billions of years. It's gotten really, really good. Viruses, RNA viruses specifically, are kind of the most advanced and most rapidly evolving kind of organisms, so they are the hardest to stamp out because they're constantly going to be changing. We need to figure out ways where we can kind of adapt to change.
CONOR: So, the search for these variants and novel viruses has got to be a lot of work.
DODI: And that is why Artem used computational biology.
ARTEM: The way to frame this is we did about 2 000 years of computing in around 11 days. It took the world 13 years, and somewhere between 3.9 and 14.9 billion USD, to generate that sequencing data, which covers our entire planet: laboratory samples, environmental samples in the forest, there are even anal swabs of penguins in Antarctica. All that data gets centralized and is freely available for all of us scientists to use and reuse. Then we said, "Okay, we're going to look for coronaviruses, and we're going to analyze everything." We built this big computer, and it was able to analyze the data super, super fast. That’s why we could go from 15 000 known viruses in 2020 to, one year later, over 130 000 new species.
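As a back-of-the-envelope check on the figures Artem quotes, the sketch below converts "2 000 years of computing in around 11 days" into the number of processors that would have to run side by side. It assumes the 2 000 years are single-CPU years and that the work divides perfectly across machines; the function name and the 365.25-day year are illustrative choices, not details from the interview.

```python
# Rough arithmetic only: how many CPUs must run in parallel to squeeze
# a given number of CPU-years into a given number of wall-clock days?
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def implied_parallel_cpus(cpu_years: float, wall_clock_days: float) -> float:
    """Return the number of CPUs needed, assuming perfect parallel scaling."""
    total_cpu_days = cpu_years * DAYS_PER_YEAR
    return total_cpu_days / wall_clock_days


# Figures quoted in the interview: about 2 000 years in around 11 days.
cpus = implied_parallel_cpus(cpu_years=2_000, wall_clock_days=11)
print(f"Roughly {cpus:,.0f} CPUs running at once")
```

Under those assumptions the answer is in the tens of thousands of CPUs, which is the kind of scale that rented cloud computing makes available for a short burst.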
CONOR: That's a lot of viruses. And I guess the issue is really that that constant transformation makes it very hard to predict what the variants are and what the relevant treatments could be.
DODI: Completely. It's similar with antibiotic resistance in bacteria, we have this beautiful technology, and through its misuse and misapplication, all of a sudden, we're on the other side of this battle with evolution again.
ARTEM: This is an arms race that our immune systems have been fighting for millions of years. And our technologies have only been fighting it for like a few decades really.
DODI: You know, despite this revelation of novel viruses, we've probably only sampled about 0.1% of the Earth's virome.
CONOR: Virome - I love that we're not even scratching the surface, so much work to do.
DODI: Yeah, because we're vastly undersampling what the viral diversity is in nature, we're not even close to sampling the earth’s virome. This is a huge problem much bigger than what we are able to do.
ARTEM: The database itself is growing exponentially, which means that every 18 months, the amount of data that's being released and shared in the global community is doubled. So, by 2023, there will be double the amount of data than what was analyzed in 2021.
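The doubling claim can be turned into a small calculation. This sketch assumes clean exponential growth with an 18-month doubling time, as Artem describes; the function name is made up for illustration. Under that assumption, the 24 months between 2021 and 2023 give a factor of about 2.5, so "double" is, if anything, a cautious round figure.

```python
# Exponential growth with a fixed doubling time:
# size(t) = size(0) * 2 ** (t / doubling_time)
def growth_factor(elapsed_months: float, doubling_months: float = 18.0) -> float:
    """Return how many times larger the data is after elapsed_months."""
    return 2.0 ** (elapsed_months / doubling_months)


# 2021 to 2023 is 24 months; the doubling time quoted is 18 months.
factor_2021_to_2023 = growth_factor(elapsed_months=24)
print(f"Data in 2023 is about {factor_2021_to_2023:.2f}x the 2021 amount")
```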
DODI: In the data he's already analyzed, Artem acknowledges that they've only found familiar stuff, the viruses that look like the ones we already know of. I think this is fascinating. It's when you have a red car, you see so many more red cars.
CONOR: And what we maybe haven't seen are like buses. That kind of worries me a little bit like there's a whole bunch of viral things going on out there that we've not even seen. But then are we assuming that we actually have a good understanding of the number of viruses out there? It can't be true, right?
DODI: I think we're coming around to like, we don't know what we don't know, we know what we know. We don't know what we don't know.
CONOR: So, there are a lot of unknown knowns. It’s like the issue of dark matter. Like there's so much out there. Where's the missing viral load, where is it?
ARTEM: So, you do the math and we're on track to characterize over 100 million RNA viruses by 2030. This is what hyper-exponential growth looks like: it was 15 000 in 2020, 145 000 in 2021, and then by 2030 I want to hit 100 million. That might make a real dent in our virome. But at the moment, we're both sensitivity-scarce and data-scarce.
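To put "you do the math" into numbers, the sketch below works out the steady year-on-year multiplier needed to get from the 2021 count to the 2030 target. It assumes a constant growth rate between those two points, which is a simplification; the function name is illustrative. The result is a little over 2, so the known catalogue would need to roughly double every year.

```python
# Constant annual multiplier r such that start * r ** years == target.
def required_annual_factor(start: float, target: float, years: int) -> float:
    """Return the yearly growth multiplier needed to reach target from start."""
    return (target / start) ** (1.0 / years)


# Figures quoted in the interview: 145 000 viruses in 2021,
# with a goal of 100 million by 2030 (nine years later).
factor = required_annual_factor(start=145_000, target=100_000_000, years=2030 - 2021)
print(f"Required growth: about {factor:.2f}x per year")
```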
CONOR: This is just a vast trove of genomic data that could help us find treatments, predict pandemics, and understand rare diseases, but how are we wading through all of that?
DODI: That is why Artem is effectively beginning to create a search engine for these viruses.
ARTEM: That's what we're also developing, a way of functionalizing the data very quickly and efficiently. So, I have a proof of concept now where you can take an RNA virus, say it's a virus that shows up in the serum of a child in a Cambridge hospital, and it would take two minutes and cost less than two cents to connect that virus to a virus that was sampled in a camel in sub-Saharan Africa in 2007.
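The interview does not say how the proof of concept works, so the following is only a toy illustration of what "connecting" a new sequence to a catalogued one can mean. It uses k-mer overlap (Jaccard similarity), a common general-purpose technique that is swapped in here for illustration and is not presented as the team's method. The catalogue entries, sample names, and sequences are all invented.

```python
# Toy sequence lookup: compare a query against a small catalogue by the
# overlap of their k-mers (all substrings of length k).
def kmers(sequence: str, k: int = 5) -> set[str]:
    """Return the set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by all distinct k-mers (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(query: str, catalogue: dict[str, str], k: int = 5) -> tuple[str, float]:
    """Return the name and similarity score of the most similar catalogue entry."""
    query_kmers = kmers(query, k)
    scores = {name: jaccard(query_kmers, kmers(seq, k)) for name, seq in catalogue.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical catalogue: names and sequences are made up for this example.
CATALOGUE = {
    "camel_2007_sample": "ATGGCTAGCTAGGATCCGATCG",
    "penguin_swab_sample": "GGGCCCAAATTTGGGCCCAAAT",
}
# Hypothetical query: differs from the camel entry only in its last base.
QUERY = "ATGGCTAGCTAGGATCCGATCA"

name, score = closest_match(QUERY, CATALOGUE)
print(f"Closest match: {name} (similarity {score:.2f})")
```

A real search over petabytes of data would need indexing and far more careful statistics; the point here is only that a new sample can be linked to an old one through shared sequence content.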
DODI: The idea is not only to create a catalogue of the data but to also do it in such a way that makes it easy to interconnect the information. So, they intend to try to make sense of all that known unknown and unknown known.
CONOR: So, all this metadata that can be pulled out from the samples collected by the global community, they aggregate it and then functionalize it in a way that helps clinicians and researchers, right?
ARTEM: This is not just for humans; this should apply also to our animals. So, keeping our livestock and pets healthy, and keeping our food sources secure. That'll also go into agriculture, and this is going to be a problem that affects many domains of human society. We're trying to create a general tool that can apply to all of these main domains and endangered species. We want to know if there's an endangered species and if it has like a rare infection that's keeping its numbers down, so it can be diagnosed very quickly and help conserve those species.
DODI: I think this is such an amazing discovery. I think it highlights the potential of computational biology in all of its data driven glory. So, for me, it illustrates that scientific adage that the more you know the more you realize you don't know.
CONOR: So true. Artem and his colleagues seem to have begun an encyclopedia of viruses that is just going to grow, and grow, and grow. It's like wiki virus, right? Think about all the undiscovered data in the virus we're yet to exploit.
DODI: Yeah, it is an exciting thought. For example, understanding genomic data using computational biology has led researchers to begin to understand how to predict the likelihood of infertility in certain women. So, here's that connection between big data and building families. Conor, imagine that you're in hospital…
CONOR: Why have you put me here?
DODI: We've put you here so that we can talk about the notes on the clipboard at the end of your bed.
CONOR: Ah, the ones that say all my vitals, statistics, blood pressure, blood type, and so on.
DODI: What if your notes could also detail your specific genome? And this would help doctors know if you're predisposed to a certain illness or genetic disorder.
CONOR: Oh now, that is an interesting thought. I mean, we've talked about the human genome before on Discovery Matters. The Human Genome Project was an international project to chart the entire genetic makeup of the human being, it was completed in 2003. You can go into the Wellcome Library in London, and you can read it in a large book of letters. So, it's still a really recent innovation in the world of biology.
DODI: Exactly. And it leads on very nicely to the genomic study we're about to dive into. We read about this one in Springer Nature, which you can read via our show notes, but we'll talk about it here with one of the authors who tells us how it all started almost 20 years ago.
JINCHUAN: My name is Jinchuan Xing. I'm associate professor at Rutgers University in the Department of Genetics, my lab does research on genomic variation. So, we focus on the part of the genome that is different among each of us and makes each of us special.
DODI: Professor Xing was so taken by the new area of discovery that he chose to do his PhD on the human genome.
JINCHUAN: At the time, my advisor told me "You have to go watch; this is a big deal since the human genome is being released." Then we heard about the whole genome sequence and that of the 3 billion base pairs we know less than a fraction of a fraction of what they do. For the vast majority of the genome – upwards of 99% – we have no clue what they do. Everything we get to learn there is going to be new. That became a very different method of discovery. Other than generating a hypothesis and doing experiments to test it, it becomes essentially a discovery mission. What can you learn from this vast unknown? I'm still doing it and still fascinated every day by the discovery aspect of it.
CONOR: So, it's just one of those like huge landmark discoveries in the life sciences like penicillin? The Human Genome Project is one of those moments, right?
DODI: Absolutely. And what's amazing is that the human genome offers something new to look for every day, there are still so many unknowns.
CONOR: What was the focus of Jin's study?
DODI: It focused on infertility from aneuploid egg production. Aneuploidy is when one or more chromosomes are missing or extra. The usual number in an egg is 23. It presents complications to have either too few or too many chromosomes.
JINCHUAN: So, the overall goal of this study is trying to understand the risk of egg aneuploidy in female IVF patients. So, aneuploidy is the abnormal number of chromosomes in eggs in 'mom'. We are trying to both understand what are the genetic risks of this phenotype, and ultimately, how can we use what we have learned and apply it to clinic to help patients?
DODI: The very first line in Professor Xing's paper cites a startling statistic: 12% of women of reproductive age in the US suffer infertility, that's equivalent to about 20 million individuals.
CONOR: That really brings it home. I mean, it's almost one in 10, right? So, biology is just making an incredibly important contribution here if we can get something right. And that's just the US. So globally, the number is going to be much higher.
JINCHUAN: So, there is a statistic for infertility, but there's also a statistic for difficulty getting pregnant. There's an even larger proportion of people affected by the latter. These are reproductive age patients, or a woman, or a couple who can eventually have healthy kids, but it took them a lot more effort than other people. It may take them over a year, or it might take them going to the IVF clinic to have a healthy pregnancy and childbirth. And that statistic, if you count all those people together, is really much higher for couples of reproductive age in the US. It is a really common problem. It is much more common than people typically imagine.
CONOR: So, the better we can understand infertility the better we can help people, and to go through all the information and the human genome to understand and kind of predict the likelihood of aneuploid egg production and miscarriages, that's really ambitious. How did you sort through the genomes?
DODI: And this is the connection to the interview a few minutes ago. So, besides whole exome sequencing, they are using machine learning and manipulating the data.
CONOR: Okay, so let's start with the whole exome sequencing, understanding the mutations in the genome that give way to this risk.
DODI: It's differences in chromosomes, right?
CONOR: Exactly. So, you can't sequence the whole genome with all of its 3 billion base pairs, you've got to find that tiny fraction of the genome that produces this phenotype expression in the individual.
JINCHUAN: There's only a tiny fraction of the genome which is coded for proteins or components to build the body. That is about 1%. Most of the genome is not coding for proteins, and the vast majority of severe mutations are in this coding region. So, whole exome sequencing is a balance between the cost and effect because whole exome sequencing is only sequencing the coding part of the genome. But in the meantime, we will capture mutations in most of the coding regions that are likely to have a functional impact.
DODI: Essentially, this helps find the specific mutations behind aneuploidy, which is that abnormal number of chromosomes, and machine learning helps to indicate the patients who are most likely to be at risk.
JINCHUAN: But the issue there is, most of the mutations we identified are very rare in the population. So, you cannot use that mutation as a clinical test because, in most people, you will test for the mutation, but it will not be there. It doesn't help you to predict who is at risk. The way we do find those with the mutation is with the use of a machine learning algorithm. The goal is to look at the whole exome with all the mutations in all the coding regions and ask the question ‘If we know a group of patients, and their exome looks this way and has this profile of mutations, then compare them to a group of patients that have low risk, can we distinguish the two groups based on all the mutations in the genome rather than a single mutation in the gene?’ That's where machine learning really helps because it's not a yes or no answer. This combination of mutations in different genes across the genome or the exome will predict if a person has a high risk or low risk based on what the algorithm knows from the dataset that we have.
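The idea of scoring a whole profile of mutations, rather than testing for one, can be sketched with a very small classifier. The example below uses a Bernoulli naive Bayes model, chosen only because it is easy to read; the interview does not name the study's algorithm, and this is not it. The gene names, patient profiles, and group labels are all invented toy data.

```python
import math

# Hypothetical genes; 1 means the patient carries a mutation in that gene.
GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Invented training profiles for two groups of patients.
HIGH_RISK = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 1, 0]]
LOW_RISK = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1]]


def mutation_rates(profiles: list[list[int]]) -> list[float]:
    """Per-gene mutation frequency in a group, with add-one smoothing."""
    n = len(profiles)
    return [(sum(p[g] for p in profiles) + 1) / (n + 2) for g in range(len(GENES))]


HIGH_RATES = mutation_rates(HIGH_RISK)
LOW_RATES = mutation_rates(LOW_RISK)


def risk_score(profile: list[int]) -> float:
    """Log-odds of high versus low risk, summed over every gene in the profile."""
    score = math.log(len(HIGH_RISK) / len(LOW_RISK))  # prior from group sizes
    for mutated, p_high, p_low in zip(profile, HIGH_RATES, LOW_RATES):
        if mutated:
            score += math.log(p_high / p_low)
        else:
            score += math.log((1 - p_high) / (1 - p_low))
    return score


def predict(profile: list[int]) -> str:
    """Label a profile 'high' when the combined evidence favours high risk."""
    return "high" if risk_score(profile) > 0 else "low"


print(predict([1, 0, 0, 0]))  # a single mutation in GENE_A
print(predict([1, 1, 0, 0]))  # mutations in GENE_A and GENE_B together
```

In this toy data a mutation in GENE_A alone is not enough to tip the prediction, but GENE_A and GENE_B together are, which mirrors the point that the combination across genes carries the signal.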
CONOR: So, machine learning is becoming increasingly prominent in studies like this, obviously, for a good reason.
DODI: Definitely. So, what we've got here is just one study, but one of the larger implications could be being able to predict if an IVF treatment - in vitro fertilization - could be successful, or not. The family planning journey could benefit from this kind of big data manipulation and insight.
CONOR: It could and we should look further than this study alone, because there are kind of interesting implications with respect to what kind of things you want to see and what kind of things are best left alone for predicting risk, right? So, the future goal is to use this for the good, especially considering the really emotional topic of fertility.
DODI: That's right.
JINCHUAN: The IVF procedure is very expensive. It has a very strong mental burden on the woman or the couple. For some of these patients that we looked at, because the patient has such a high aneuploidy risk, they ended up having to go through multiple IVF cycles without generating a viable embryo. That's just putting so much stress and burden on the patient. The ultimate goal is if we can predict that a patient will have a much higher risk than you would predict – based on age and other risk factors – either they will have to try to retrieve more eggs in the procedure to increase the odds of getting a viable one, or maybe the couple should think about an alternative approach such as adoption or getting a donor. If the chances of getting a healthy baby or pregnancy are slim, based on your genome and your risk, it may help just to know if the alternatives work better. I think it will help patients with their mental stress.
CONOR: So, we're working towards real precision medicine.
JINCHUAN: For obstetrician-gynecologists, based on the predicted risk from the mom's genome, they could provide a prognosis before the treatment. For example, if somebody is predicted to have high risk, even at a younger age, they might advise the patient to either have IVF early or be prepared to have multiple cycles of IVF. There are things you can do, depending on the prediction of a success rate for the procedure, which will be really helpful to individual patients if we can achieve that goal.
DODI: Moving towards precision medicine, the next step of the study is to include more diverse genetic knowledge. Human populations are diverse, their genetics are diverse, so the risk factor may differ in different populations. The ideal approach is to have an incredibly diverse cohort in this study, so you don't miss those risk factors in different populations.
JINCHUAN: I think 10 years ago most of the studies were focused on European ancestry, but this is not the entirety of human genetics. The field is increasingly realizing the importance of diversity in the human population for any kind of disease, because the diverse ancestry of the human population gave us how we are different, including the disease risk.
CONOR: I can see how this contributes to the future hope of precision medicine because we need to understand the differences across the genome for every individual. As Jin's findings show, ancestry does impact the genome.
JINCHUAN: Precision medicine using genomic sequencing or other information that is unique to you will help you in healthcare. I think one thing that is going to happen pretty soon is that whole genome sequencing will become part of the health record. We only need to sequence the genome once, then you get the profile for your entire genome that can go into the electronic health record, and the patient will have access to it. So, when someone goes into the clinic, regardless of whether it is because of heart disease or infertility or any kind of disease that has a genetic component, the doctor can look up the genomic profile and see if there's any risk of mutations or genes. I hope what we are doing will be contributing to that, especially to the IVF patients.
DODI: So, in this episode Conor, we've looked way, way back, really dug into those old layers of wallpaper in our house. And we're trying to look ahead too about families of viruses and families in humans.
CONOR: These are just huge topics that we just can't simply do the calculations on in our brains.
DODI: Yeah, this math is not like memorizing the multiplication table, which doesn't help you here.
CONOR: Yeah, I didn't even do that. So, with that computational machine learning, the whole enterprise here is making branches of scientific examination possible that would never have been possible before. Our executive producer is Andrea Kilin. This podcast is produced with the help of Bethany Grace Armitt-Brewster. Editing, mixing, music, and incredible artificial intelligence by Tom Henley and Banda Produktions. My name is Conor McKechnie, I'm not an artificial intelligence.
DODI: And I am Dodi Axelson as I live and breathe. Make sure you rate us on Spotify or whichever platform you listen to us on. If you are listening on Spotify, by the way, please answer our poll. It's under the episode description and it helps make us better. We'll see you when we come back with another episode of Discovery Matters.
CONOR: Outstanding. I just need to brew a coffee now.