AI Revolutionizes Genome Editing: Open-Source Gene Editor Comes from Models That Expand CRISPR Protein Diversity Nearly 5-Fold

[Introduction] The molecular biology community has just received bombshell news: AI can now design tools that rewrite human DNA. The startup Profluent announced that it has open-sourced the world's first AI-designed gene editor, which successfully edited DNA in human cells. It sounds like science fiction. If you had the chance, would you choose to “transform” your own DNA?

Can AI rewrite the human genome?

The startup Profluent has just announced that a gene editor designed entirely by AI has successfully edited DNA in human cells.

In other words, the world's first precise, molecular-level gene editor designed from scratch by AI has arrived.

Much as ChatGPT generates poetry, Profluent's new AI system generates blueprints for the microscopic mechanisms that edit DNA.

The researchers trained large language models (LLMs) on the most extensive dataset of CRISPR-based gene editing systems to date. The proteins these models generated represent a 4.8-fold expansion in diversity over the naturally occurring proteins across CRISPR-Cas families.

Furthermore, the gene editor showed activity and specificity comparable to or better than SpCas9 (the prototypical gene editing effector) in human cells, while differing from it by more than 400 mutations in sequence.

This suggests that researchers are gaining finer control over the tools that edit our genome, and that scientists may eventually fight diseases more precisely and quickly than they can today. Moreover, the company has decided to release these molecules freely under the OpenCRISPR license.

▲ The physical structure of OpenCRISPR-1, which is a gene editor created by Profluent’s AI technology

Ali Madani, co-founder of Profluent, said, “Trying to use AI-designed biological systems to edit human DNA is a scientific moonshot.”

“Our success shows that in the future, AI can accurately design a series of customized disease treatment plans.”

Some netizens said, “Is it time to reprogram humans? The advancement of AI-driven CRISPR technology is challenging the boundaries of genetic ethics.”

If you could change your DNA, would you do it?

Genes behind anemia and blindness can now be edited

Profluent describes the technology in detail in a newly published paper.

Paper address: https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1.full.pdf

The paper is expected to be presented next month at the annual meeting of the American Society for Gene and Cell Therapy.

The technology is based on the same approach that drives ChatGPT: it creates new gene editors after analyzing large amounts of biological data, including the microscopic mechanisms scientists already use to edit human DNA.

These gene editors are based on a Nobel Prize-winning method involving a biological mechanism called CRISPR.

CRISPR-based technology caused a sensation when it first appeared, and it changed the way scientists study disease.

In the past, if we unfortunately got genetic diseases such as sickle cell anemia and blindness, we were often helpless. But now, CRISPR technology allows us to directly modify the genes that cause these diseases!

The CRISPR method uses a mechanism found in nature: biological material gleaned from bacteria, which gives these microorganisms the ability to fight off viruses.

James Fraser, professor and chair of the Department of Bioengineering and Therapeutic Sciences at the University of California, San Francisco, said that the AI-generated editors, by contrast, have never existed on Earth: Profluent's AI system learned from nature how to create these brand-new molecules.

If these technologies continue to develop, the resulting gene editors may be more flexible and powerful than those honed by billions of years of evolution.

Now, Profluent says it is open-sourcing the OpenCRISPR-1 editor, which means that individuals, academic laboratories, and companies can use it for free.

Open source, which is common in the AI industry, can accelerate the creation of new technologies. For biolabs and pharmaceutical companies, however, open-source releases like OpenCRISPR-1 are unusual.

Note that Profluent open-sourced only the gene editor generated by its AI technology, not the AI technology itself.

▲ Time-lapse photography of human cells edited by OpenCRISPR-1

Why AI-designed proteins matter so much

Today, protein engineers usually still have to copy a functional protein from nature, whether they use it as is or modify it iteratively through “directed evolution.”

Many proteins of great significance to humans were discovered by accident: insulin in dogs, Cas9 in yogurt making, and botulinum toxin as a cause of food poisoning.

Large generative protein language models aim to capture the underlying blueprint that makes natural proteins function. They offer a shortcut that bypasses the stochastic processes of evolution and lets researchers deliberately design proteins for specific purposes.

The Cas9 protein is the core component of the CRISPR-Cas9 gene editing system. It is an RNA-guided nuclease that can search all 3 billion nucleotides in the human genome and cut at a specific site.

The nuclease is complexed with a single guide RNA (sgRNA), which consists of a scaffold that structurally interacts with the protein and a spacer sequence that can be programmed to target any site in the genome.

The tricky part is that most Cas9 proteins are over 1,000 amino acids long. A 1,000-residue protein has 20^1000 (roughly 10^1301) possible sequences, more than 1,200 orders of magnitude beyond the commonly estimated 10^80 atoms in the observable universe!
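The size of this design space is easy to check with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: the 10^80 figure for atoms in the observable universe is a commonly cited estimate rather than a number from the paper, and the function name is made up for this example.

```python
import math

SEQ_LEN = 1000            # approximate Cas9 length in amino acids (from the text)
ALPHABET_SIZE = 20        # standard amino acids
ATOMS_LOG10 = 80          # assumption: ~10^80 atoms in the observable universe


def design_space_log10(length: int, alphabet: int = ALPHABET_SIZE) -> float:
    """Return log10 of alphabet**length, computed in log space."""
    return length * math.log10(alphabet)


log10_space = design_space_log10(SEQ_LEN)
excess_orders = log10_space - ATOMS_LOG10

print(f"20^{SEQ_LEN} is about 10^{log10_space:.0f}")
print(f"That is about {excess_orders:.0f} orders of magnitude more than 10^{ATOMS_LOG10}")
```

Working in log space keeps the numbers readable and makes it obvious that exhaustive experimental search is out of the question.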

And, because these proteins must coordinate many interactions in a precise order to achieve precise cleavage, even a single misplaced mutation may completely eliminate a protein's function.

Exhausting all possible sequence variations through experiments would take many lifetimes. An AI system, by contrast, can sample promising regions of the search space and propose functional gene editors within just a few hours!

The world's first open source gene editor rewrites human DNA

The gene editor OpenCRISPR-1 consists of a Cas9-like protein and guide RNA.

As mentioned above, it was designed entirely by Profluent's large language models.

Specifically, the researchers mined 26 TB of assembled genome and metagenome databases and compiled a dataset of more than 1 million CRISPR operons.

Trained on this data, the AI learns from large-scale sequence and biological context to generate millions of CRISPR-like proteins that do not exist in nature.

The researchers report that the generated proteins span 4.8 times as many protein clusters as the CRISPR-Cas families found in nature, a substantial expansion.

Furthermore, the language model also customized single guide RNA sequences for Cas9-like effector proteins.

Compared with the prototypical gene editing effector SpCas9, several generated gene editors showed comparable or improved activity and specificity while differing from it by about 400 mutations in sequence.

Finally, the researchers also demonstrated that the AI-generated gene editor OpenCRISPR-1 is compatible with base editing.

The key results from this study are as follows.

AI expands the “CRISPR-Cas” protein universe 4.8-fold

Generative protein language models are usually pre-trained on large datasets of natural protein sequences covering a wide range of phylogenies and functions.

These models can generate realistic protein sequences that reflect the distribution and properties of native proteins.

However, for specific applications, such as the generation of novel gene editors, it is necessary to direct the generation process toward specific subsets of protein families of interest.

In this regard, the researchers conducted exhaustive data mining to build the database.

They searched 26.2TB of assembled microbial genomes and metagenomes and found 1,246,163 CRISPR-Cas operons.

The newly created database shows greater diversity when compared to curated databases such as CRISPRCasDB and CasPDB, as well as UniProt, the world's largest protein resource.

Drawing on what these systems have in common, the researchers trained a single model on all CRISPR-Cas proteins, one that can generate diverse sequences across families.

To generate novel CRISPR-Cas proteins, the authors fine-tuned a ProGen2-based language model on the CRISPR-Cas Atlas, balancing for protein family representation and sequence cluster size.

From this model, the researchers generated 4 million sequences. Half of these are generated directly from the model, and the other half are prompted by up to 50 residues from the N or C terminus of the native protein to guide the generation of a specific protein.

To evaluate their novelty and diversity, the authors used MMseqs2 to cluster the generated and natural sequences of each family at 70% identity.

The generated sequences achieved a 4.8-fold expansion in diversity compared with the natural proteins in the CRISPR-Cas Atlas.

For families with few native proteins, such as Cas13 and Cas12a, the diversity of generated sequences increased by 8.4-fold and 6.2-fold, respectively.
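The idea of measuring diversity as a count of sequence clusters can be illustrated with a toy example. The sketch below is not MMseqs2 and not the authors' pipeline: it assumes equal-length sequences, uses a simple position-by-position identity in place of alignment, and defines fold expansion as clusters(natural + generated) divided by clusters(natural), which is an assumption made for illustration. All function names and sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy stand-in for alignment identity)."""
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.7):
    """Greedy clustering: a sequence joins the first representative it
    matches at >= threshold identity, otherwise it founds a new cluster.
    Returns the list of cluster representatives."""
    representatives = []
    for seq in seqs:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


def fold_expansion(natural, generated, threshold=0.7):
    """Cluster count of natural + generated relative to natural alone."""
    n_natural = len(greedy_cluster(natural, threshold))
    n_combined = len(greedy_cluster(list(natural) + list(generated), threshold))
    return n_combined / n_natural


# Hypothetical toy sequences, not real proteins.
natural_toy = ["AAAAAAAAAA", "AAAAAAAAAC"]                  # one cluster
generated_toy = ["CCCCCCCCCC", "GGGGGGGGGG", "AAAAAAAACC"]  # two new clusters
toy_fold = fold_expansion(natural_toy, generated_toy)
print(toy_fold)
```

In this toy case the third generated sequence falls into the existing natural cluster, so only two of the three generated sequences add diversity.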

In addition, only minimal context, 50 residues or fewer, is needed to steer generation toward a specific family of interest.

One million Cas9-like proteins generated

Although many CRISPR-Cas proteins have been used for genome editing, Cas9 remains the most widely used.

To generate new Cas9-like sequences, the researchers first prompted the CRISPR-Cas model with 50 residues sampled from the N or C terminus of Cas9 proteins in the CRISPR-Cas Atlas.

They then fine-tuned another language model on the 238,917 Cas9 sequences in the CRISPR-Cas Atlas.

Without requiring any prompt, this model generated viable Cas9-like sequences at about twice the rate of the CRISPR-Cas model (54.2% of its generations were viable).

To explore the potential sequence distribution of type II effectors, the researchers used a Cas9 model to generate 1 million Cas9 proteins.

The viable generated sequences (n=542,042) were clustered with natural Cas9 proteins at 40% identity and used as input to construct a maximum likelihood phylogenetic tree (Fig. 2a).

Strikingly, the generated proteins dominated the tree, accounting for 94.1% of the total phylogenetic diversity.

Compared with the entire CRISPR-Cas Atlas, diversity increased 10.3-fold (Fig. 2b).

The new phylogenetic groups are distributed throughout the tree, indicating that the model captures the full diversity of Cas9 and does not overfit any specific lineage.

The generated sequences differed substantially from those in the CRISPR-Cas Atlas, with an average identity of only 56.8% to the closest natural sequence (Fig. 2c).

Overall, the generated sequences closely matched the length of native proteins in the same protein cluster, with a Pearson correlation of 0.97 (Fig. 2d).
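For readers unfamiliar with the statistic, a Pearson correlation between the lengths of generated proteins and the natural proteins in their cluster can be computed as sketched below. The length values are invented purely to show the calculation and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical lengths (amino acids): natural cluster representative vs. generated protein.
natural_lengths = [1053, 1082, 1130, 1368, 1409]
generated_lengths = [1049, 1090, 1122, 1371, 1398]
r = pearson(natural_lengths, generated_lengths)
print(f"r = {r:.3f}")
```

A value close to 1 means generated proteins tend to be long where their natural neighbors are long and short where they are short.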

In addition, Figure 2e shows the on- and off-target editing efficiencies of native Cas9, ancestral sequence reconstruction, and 48 generated proteins. Figure 2f shows a comparison of native Cas9, ancestral sequence reconstruction, and generated proteins in terms of targeted editing efficiency and specificity.

Generated gene editors work in human cells

The researchers then narrowed their focus to the CRISPR-Cas9 system and trained a protein language model on the 238,917 Cas9 proteins in the CRISPR-Cas Atlas.

Using this model, the researchers generated Cas9-like proteins that are interoperable with SpCas9. That is, they recognize the same short motif next to the target site (the PAM, or protospacer adjacent motif) and are compatible with the same sgRNA, so they can be used in the same applications.
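To make the targeting rule concrete, here is a minimal sketch of how a programmed spacer plus a PAM requirement picks out a site in a DNA string. It assumes the well-known SpCas9 rule of an NGG motif (any base followed by two G's) immediately after the 20-nucleotide target, searches the forward strand only, and writes the spacer in DNA letters for simplicity. The function name and the sequences are hypothetical.

```python
def find_target_sites(genome: str, spacer: str) -> list[int]:
    """Return 0-based start positions where `spacer` occurs on the forward
    strand and is immediately followed by an NGG PAM."""
    genome = genome.upper()
    spacer = spacer.upper()
    n = len(spacer)
    hits = []
    start = genome.find(spacer)
    while start != -1:
        pam = genome[start + n : start + n + 3]
        # NGG: first base is anything, the next two must be G.
        if len(pam) == 3 and pam[1:] == "GG":
            hits.append(start)
        start = genome.find(spacer, start + 1)
    return hits


# Hypothetical 20-nt spacer and a toy "genome" containing it twice:
# once followed by AGG (a valid PAM) and once followed by ATC (no PAM).
toy_spacer = "GACGTTACCGGATCAAGCTT"
toy_genome = "TTT" + toy_spacer + "AGG" + "CCC" + toy_spacer + "ATC"
print(find_target_sites(toy_genome, toy_spacer))
```

Only the first occurrence is reported, because the second copy of the spacer lacks the adjacent PAM. A real search would also scan the reverse strand and tolerate mismatches when looking for off-target sites.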

The researchers selected 48 of the generated sequences for rigorous functional characterization in human cells.

The top candidate, OpenCRISPR-1, showed on-target activity comparable to SpCas9 (editing rates of 55.7% for OpenCRISPR-1 and 48.3% for SpCas9) but, surprisingly, a 95% reduction in editing at off-target sites (0.32% for OpenCRISPR-1 versus 6.1% for SpCas9).

Furthermore, OpenCRISPR-1 is a highly novel protein: it is 403 mutations away from SpCas9 and 182 mutations away from any natural protein in the CRISPR-Cas Atlas.
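The 95% figure follows directly from the two off-target rates quoted above. The short calculation below reproduces it and also computes an on-target to off-target ratio for each editor; that ratio is an illustrative metric chosen for this example, not one reported in the paper.

```python
# Editing rates in percent, as quoted in the text.
on_target = {"OpenCRISPR-1": 55.7, "SpCas9": 48.3}
off_target = {"OpenCRISPR-1": 0.32, "SpCas9": 6.1}


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


off_target_reduction = relative_reduction(off_target["SpCas9"],
                                          off_target["OpenCRISPR-1"])
print(f"Off-target reduction: {off_target_reduction:.1%}")

# Illustrative specificity metric (an assumption): on-target / off-target rate.
ratios = {name: on_target[name] / off_target[name] for name in on_target}
for name, ratio in ratios.items():
    print(f"{name}: on/off-target ratio of about {ratio:.0f}")
```

By this rough measure, OpenCRISPR-1 edits the intended site about 174 times as often as off-target sites, compared with about 8 times for SpCas9.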

▲ Multiple generated nucleases (green), including OpenCRISPR-1 (dark green), have comparable or higher on-target activity than SpCas9 (blue), but much lower off-target activity

The researchers also found that, when paired with deaminases, OpenCRISPR-1 and SpCas9 had similar activity and specificity in precisely editing single bases in target genomes.

They were also able to maintain base-editing activity while increasing specificity by using a deaminase generated with another Profluent-trained protein language model.

▲ When using ABE8.20, a highly active engineered deaminase, and the generated deaminases PF-DEAM-1 and PF-DEAM-2 for base editing, OpenCRISPR-1 functions very similarly to SpCas9

Finally, to further optimize the activity of the generated nucleases, the researchers also trained a model to generate compatible sgRNAs for any given Cas9-like protein.

Compared with the sgRNA of SpCas9, these generated sgRNAs increased the activity of four of the five generated nucleases tested.

▲ For 4 of the 5 generated nucleases tested, using model-generated sgRNA improved editing efficiency

AI is improving healthcare

Now, there are many projects around the world using AI technology to improve medical care.

For example, scientists at the University of Washington are using the methods behind ChatGPT and Midjourney to create entirely new proteins and are working to accelerate the development of new vaccines and drugs.

Many generative AIs that are popular today are driven by neural networks. By analyzing large amounts of data, neural networks learn certain skills.

For example, Midjourney is based on neural networks and analyzes millions of digital images, as well as the captions that describe each image. In this way, the system learns to recognize the connection between images and text, and can draw pictures such as “Rhino jumping off the Golden Gate Bridge.”

Profluent's technology is also driven by a similar AI model.

The model learns from amino acid and nucleic acid sequences, the compounds that define the microscopic biological mechanisms scientists use to edit genes.

Essentially, it analyzed the behavior of CRISPR gene editors extracted from nature and learned how to generate entirely new gene editors.

Ali Madani, CEO of Profluent, said that these AI models learn from sequences, whether those sequences are characters, words, computer code, or amino acids.

Mr. Madani works at Profluent's lab in Berkeley, Calif., and previously worked in the artificial intelligence lab of software giant Salesforce.

How far are AI-designed gene editors from the clinic?

Profluent has not yet conducted clinical trials of these synthetic gene editors, so it's unclear whether they can match or even exceed the performance of existing CRISPR editors.

But their research shows that AI models can produce something capable of editing the human genome.

Still, the results are unlikely to impact health care in the short term.

Fyodor Urnov, a gene-editing pioneer and scientific director at the Innovative Genomics Institute at UC Berkeley, said scientists have no shortage of naturally occurring gene editors to use to fight disease.

The real bottleneck, he said, is the extremely high cost of the safety, manufacturing, and regulatory reviews an editor must pass before it can be used for clinical treatment.

However, the potential of generative AI systems should not be underestimated, since they keep improving as they learn from more and more data.

If Profluent's technology continues to improve, scientists may one day be able to edit genes in a more precise way. By then, we may be in a world where many drugs and treatments can be quickly tailored to the individual. This is something people today dare not think about.

“I dream of a world where we can deliver CRISPR on demand within weeks,” Dr. Urnov said.

Another big question: is CRISPR risky?

Scientists have long warned against using CRISPR for human enhancement. Because it is a relatively new technology, it may have undesirable side effects, such as triggering cancer. It can also be put to unethical uses, such as genetically modifying human embryos.

Synthetic gene editors raise the same concerns, although scientists already have everything they need to edit embryos.

Dr. Fraser said that anyone who really wanted to misuse gene editing would simply use the existing editors, not AI-created ones.
