This man is trying to build Google Maps for the genome
The Human Genome Project was supposed to unlock all of life’s secrets. Once we had a genetic roadmap, we’d be able to pinpoint why we got ill and figure out how to fix our maladies.
That didn’t pan out. Ten years and more than $4 billion later, we got the equivalent of a medieval hand-drawn map when what we needed was Google Maps.
“Even though we had the text of the genome, people didn’t know how to interpret it, and that’s really puzzled scientists for the last decade,” said Brendan Frey, a computer scientist and medical researcher at the University of Toronto. “They have no idea what it means.”
For the past decade, Frey has been on a quest to build scientists a sort of genetic step-by-step navigation system for the genome, powered by some of the same artificial-intelligence systems that are now being used by big tech companies like Google, Facebook, Microsoft, IBM and Baidu for auto-tagging images, processing language, and showing consumers more relevant online ads.
Today Frey and his team are unveiling a new artificial intelligence system in the top-tier academic journal Science that’s capable of predicting how mutations in the DNA affect something called gene splicing in humans. That’s important because many genetic diseases—including cancers and spinal muscular atrophy, a leading cause of infant mortality—are the result of gene splicing gone wrong.
“It’s a turning point in the field,” said Terry Sejnowski, a computational neurobiologist at the Salk Institute in San Diego and a long-time machine learning researcher. “It’s bringing to bear a completely new set of techniques, and that’s when you really make advances.”
Those leaps could include better personalized medicine. Imagine you have a rare disease doctors suspect might be genetic but that they’ve never seen before. They could sequence your genome, feed the algorithm your data, and, in theory, it would give doctors insights into what’s gone awry with your genes—maybe even how to fix things.
For now, the system can only detect one minor genetic pathway for diseases, but the platform can be generalized to other areas, says Frey, and his team is already working on that.
Splicing: It’s complicated
Frey first turned the machine onto solving the mystery of gene splicing because it underpins many diseases. Splicing is a way for genes to pack a lot of information into a limited amount of space. The same gene can make multiple versions of the same protein depending on how its exons—the parts of a DNA strand that actually code for proteins—are stitched together. These so-called alternatively spliced proteins can have different functions, in the same types of cells, but also in different types of tissues.
Successful splicing hinges on a lot of factors, which means it can be a pretty complicated process to understand—and one that would benefit from machine learning.
That was Frey’s thinking.
He’d take the databases other scientists had compiled over the years and build machine-learning systems to analyze them, applying the concepts of deep learning — a subfield of artificial intelligence inspired by how the brain processes information — to decipher all the genetic data scientists had amassed in the last decade. His program looked at DNA sequences and the proteins coded from those sequences, and set about recognizing when that process went right and when it went wrong, and why.
Frey’s research is part of a movement to take the artificial intelligence that tech giants use to place the creepily accurate ads you see online and apply it to clinical research. Researchers now have at their disposal cheaper and more powerful computers and the mountains of data on which deep-learning systems thrive. (The models, though, are smaller in scale than the ones used at Google or Facebook because there’s just less data available.)
In biomedicine, deep-learning techniques are being put to work for classifying X-rays and predicting whether patients in hospitals might have a high risk of coming down with an infection. But it’s early days. Deep learning’s application in the life sciences, especially genetics, is more embryonic still, even though the potential benefits for medicine are substantial. Frey’s team is one of the few groups out there using deep learning for genetics.
Last year, he published a paper describing a deep-learning algorithm that could predict how genes were spliced together in mice. This was a first step towards applying the types of tools that are making artificial intelligence smarter for vision and speech to genetics. The Science paper is a more advanced version of that work.
To train the system, Frey and his team fed the new algorithm almost 11,000 chunks of DNA along with the proteins they produced, say in the gut or in the brain. After seeing thousands of these examples, the computer could assess to what extent genes — even some that it had never seen before — had been stitched together in weird ways due to mutations. The system was 95 percent accurate, compared to measurements scientists made by hand in the lab for known sequences, according to Frey.
Even more exciting, the program determined that mutations in the part of the genetic code that scientists thought was irrelevant and extraneous — which they’d termed “junk DNA” — can affect splicing. If a mutation throws splicing off, you can end up with a faulty protein. The introns, or not-so-junky-after-all DNA, were actually part of what scientists are calling the genome’s “regulatory code.”
Frey’s work gives geneticists a new way to measure what that regulatory code is doing in complicated genetic conditions like autism, for which the one-gene theory of disease doesn’t hold up.
For this study, for example, the researchers decided to look at the genomes of five people with autism spectrum disorder (and twelve controls, for comparison). Using the new system, they were able to pick out new genes that might be misregulated at the intron level in the brains of patients with autism. Scientists didn’t know about these previously.
This is why “biologists are really going to pay attention,” says the Salk Institute’s Sejnowski. The work gives biologists concrete proof that the types of machine-learning algorithms Frey’s using can advance science because “they have superhuman power to pick out information from very complex patterns that aren’t obvious to humans.”
Down the line—once these findings are validated in animal and human studies—these insights could help doctors explain to patients how genetic conditions might affect them and their families — even give them potential clues as to possible treatments.
“The way genetics…is used in the clinic to influence patient care is informed primarily by ‘the parts’ list,” said Mike Lin, the director of R&D at genetics-analysis startup DNAnexus, in reference to the genome’s exons. Exons only make up about 1 percent of our genetic code, so that means scientists have traditionally had a really myopic view of how our genes make us who we are. Frey’s work expands our field of view a bit.
“The ability to leverage what we know of the regulatory program is very limited. That’s what’s so exciting and novel,” Lin added. “The method opens a kind of new, and most importantly, useful way of interpreting genetic variation in the regulatory program of the genome.”
A new wave
Frey’s work signals a new wave of computational biology, reminiscent in some ways of the revolution the field of computer vision saw just a couple of years ago thanks to deep learning. That revolution let Facebook quickly tag your friends’ faces in the photos you uploaded; Frey hopes this one will allow for the causes of diseases to be tagged.
Frey has a unique background. In the 90s, he got his start, like so many of today’s leading minds in this space, with Geoff Hinton, the man Google recently hired to supercharge the search giant’s artificial intelligence efforts.
Then, about 12 years ago, he and his then-wife were faced with an awful predicament. Doctors told the couple their unborn baby had an unknown genetic condition. They didn’t know how it might affect their child. “They couldn’t say how bad things might be,” he recalls. “It was completely unknown. It was very stressful.”
That’s when he decided to apply some of the techniques he’d learned with Hinton and others to genetics.
But at the time, the promise of deep learning hadn’t yet been realized. Computers weren’t fast enough, and scientists didn’t have access to the massive amounts of data these systems need to learn.
That started to change thanks to cloud computing and the data explosion made possible by the web and mobile devices. Around 2012, Hinton and two of his students showed that deep learning algorithms were much better at recognizing images than more traditional machine-learning methods. Around the same time, Google showed computers running deep neural networks could teach themselves to recognize cats.
Suddenly, deep learning morphed into the darling of the machine-learning community.
In the genomics space, things started to look brighter too. In 2008, just a few years before the deep-learning revolution was in full swing, the cost of whole-genome sequencing started to drop off dramatically as faster methods for analyzing DNA became available. In 2003 — the year the Human Genome Project was completed — sequencing a single genome cost roughly $100 million. In 2008, the cost was roughly $1 million, and by 2012, it was hovering around $10,000. Now, it’s somewhere around $1,000.
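To put those figures in perspective, here is a rough back-of-the-envelope calculation using only the approximate costs quoted above. The function names are illustrative, and the "average yearly decline" assumes the price fell at a steady exponential rate between the quoted years, which is a simplification rather than a description of how prices actually moved.

```python
# Approximate per-genome sequencing costs quoted in the article (US dollars).
COSTS = {2003: 100_000_000, 2008: 1_000_000, 2012: 10_000}
CURRENT_COST = 1_000  # the article's "now" figure; no exact year is given


def fold_drop(earlier_cost, later_cost):
    """How many times cheaper the later price is than the earlier one."""
    return earlier_cost / later_cost


def average_yearly_decline(earlier_cost, later_cost, years):
    """Fraction of the price shed per year, ASSUMING a steady exponential decline."""
    return 1 - (later_cost / earlier_cost) ** (1 / years)


drop_2003_2008 = fold_drop(COSTS[2003], COSTS[2008])
drop_2008_2012 = fold_drop(COSTS[2008], COSTS[2012])
drop_overall = fold_drop(COSTS[2003], CURRENT_COST)

rate_2003_2008 = average_yearly_decline(COSTS[2003], COSTS[2008], 2008 - 2003)
rate_2008_2012 = average_yearly_decline(COSTS[2008], COSTS[2012], 2012 - 2008)

print(f"2003 to 2008: {drop_2003_2008:,.0f}x cheaper, about {rate_2003_2008:.0%} per year")
print(f"2008 to 2012: {drop_2008_2012:,.0f}x cheaper, about {rate_2008_2012:.0%} per year")
print(f"2003 to now:  {drop_overall:,.0f}x cheaper")
```

By these numbers, the price fell a hundredfold in each of the two periods, but the second drop happened in four years instead of five, so the average yearly decline was steeper after 2008.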
These new DNA analysis tools also made it possible for scientists to study how things like epigenetics — the way molecules attach themselves to DNA and impact its function — affect cells. Suddenly, these fields were experiencing their own surge in data.
People started experimenting with machine learning to try to tease out patterns. Frey’s work seems to be the next generation of that. His algorithm is doing much better at classifying DNA than previous methods, says Sejnowski.
But unlike in computer vision or speech, where experimental results can go from the lab to the technology in our pockets with relative ease, things in medicine move much more slowly.
For Frey’s algorithm to change medicine the way he’s hoping it will, scientists will have to put his findings in a wider context. For instance, the current work “doesn’t lay out the whole topology of how genes are connected to the environment,” says Joel Dudley, a geneticist at the Mount Sinai School of Medicine in New York. What’s needed to capitalize on the findings is to integrate them into larger models that incorporate things like biometric data, daily habits, and medications we take. Graphical models — the types Facebook uses to map out your social network — could come in handy.
Some of the data isn’t readily available, but that’s starting to change too. And luckily, computer scientists don’t have to start from scratch. Frey’s new algorithm serves as a “general framework” for assessing all sorts of things, says Dudley. Plus, it’s currently freely available online for geneticists to use.
A commercial version might be in the works down the road, as Frey has already been talking to venture capital firms and medical-diagnostic companies interested in partnering.
Agencies that oversee medical diagnostics, like the U.S. Food and Drug Administration, will likely get involved, too.
Once all that’s done, maybe we’ll see a beefed-up version of Frey’s algorithm at our doctor’s office. It came too late for Frey’s own family.
“If we had the tools that we’re starting to develop now, it would enable people to make a much more knowledgeable decision. I think they would be able to understand better what’s going on,” Frey said. “Depending on that, people can make different choices.”
Top illustration courtesy of Graham Johnson and Andrew Delong.
Daniela Hernandez is a senior writer at Fusion. She likes science, robots, pugs, and coffee.