Early diagnosis and treatment of certain cancers can be the difference between life and death, says Matthew Callstrom, professor of radiology and head of the generative AI program at the Mayo Clinic. However, the human genome consists of over 3 billion base pairs, making the search for disease-causing mutations an enormous needle-in-a-haystack problem.
The researchers worked with Evo 2—an open-source “genomic foundation model” trained by the Arc Institute—to predict which DNA mutations cause disease, and to understand which biological features might be responsible. Evo 2 is trained to predict the next “letter” in a DNA sequence, in the same way that large language models (LLMs) such as ChatGPT are trained to predict the next word in a passage of text. For ChatGPT, training on most of the text on the internet teaches it the structure of language and facts about the world. Trained on 128,000 genomes spanning all domains of life—each written in just four letters (G, T, C, and A), representing the nucleotide bases that make up DNA—Evo 2 learns which genetic sequences are “conducive to life,” says Nicholas Wang, one of the paper’s authors.
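Evo 2 itself is a large neural network, but the core idea of next-letter prediction over a four-base alphabet can be sketched with a much simpler stand-in. The toy bigram model below (an illustrative assumption, not Evo 2's architecture; the training strings and sequence names are hypothetical) counts which letter tends to follow which, then scores sequences: a sequence resembling the training data gets a higher likelihood than one with an unfamiliar mutation, which is roughly how such a model can flag "surprising" DNA.

```python
from collections import defaultdict

# Toy illustration only: Evo 2 is a large model trained on 128,000 genomes;
# here a simple bigram model over the four DNA letters (G, T, C, A) shows
# what "predict the next letter" means in practice.

def train_bigram(sequences):
    """Count next-letter frequencies and convert them to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        probs[prev] = {base: c / total for base, c in nxts.items()}
    return probs

def sequence_likelihood(seq, probs, floor=1e-6):
    """Multiply next-letter probabilities; unseen transitions get a tiny floor."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= probs.get(prev, {}).get(nxt, floor)
    return p

# Hypothetical training strings standing in for real genome sequences.
training = ["GATTACAGATTACA", "GATCAGATCA", "GATTACCA"]
model = train_bigram(training)

reference = "GATTACA"
mutated = "GGTTACA"  # single-letter change to the reference

# A model trained on viable sequences assigns the familiar one a higher score.
print(sequence_likelihood(reference, model) > sequence_likelihood(mutated, model))
```

Real genomic models replace the bigram counts with a deep network and a context of thousands of bases, but the scoring logic—how "expected" is each next letter given what came before—is the same.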
