Poster in Workshop: AI for Nucleic Acids (AI4NA)
Long-range gene expression prediction with token alignment of large language model
Edouardo Honig · Huixin Zhan · Zijun Frank Zhang · Yingnian Wu
Gene expression is a cellular process that plays a fundamental role in human phenotypical variations and diseases. Despite advances of deep learning models for gene expression prediction, recent benchmarks have revealed their inability to learn distal regulatory grammar. We address this challenge by leveraging a frozen pretrained language model to enhance gene expression prediction. Our method, Genetic sequence Token Alignment (GTA), aligns genetic sequence features with natural language tokens, using the frozen language model to perform symbolic reasoning. This cross-modal adaptation learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts, enabling in-context learning that is not possible with existing models. GTA offers improved predictive power and better interpretation of long-range interactions through the identification of the most meaningful sections of the input genetic context.
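
The abstract does not spell out how sequence features are aligned with language tokens. As a rough illustration only, one common way to do this kind of cross-modal alignment is to let each sequence feature attend over a frozen token embedding table, so that every position becomes a weighted mix of existing token embeddings. The sketch below assumes that mechanism; the function name `align_to_tokens`, the `projection` matrix, and all dimensions are hypothetical and are not taken from the GTA paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def align_to_tokens(features, projection, token_embeddings):
    """Map each sequence feature to a convex combination of token embeddings.

    Assumption: alignment is done with scaled dot-product attention, where
    `projection` is the only trainable part and `token_embeddings` stays frozen.
    """
    dim = len(token_embeddings[0])
    aligned, weights = [], []
    for feat in features:
        query = matvec(projection, feat)  # project into the embedding space
        scores = [
            sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
            for tok in token_embeddings
        ]
        attn = softmax(scores)  # one weight per vocabulary token
        mixed = [
            sum(a * tok[j] for a, tok in zip(attn, token_embeddings))
            for j in range(dim)
        ]
        aligned.append(mixed)
        weights.append(attn)
    return aligned, weights


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
feat_dim, embed_dim, vocab_size, n_positions = 4, 3, 5, 6
token_embeddings = [
    [rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(vocab_size)
]
projection = [
    [rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(embed_dim)
]
features = [
    [rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(n_positions)
]

aligned, weights = align_to_tokens(features, projection, token_embeddings)
```

Under this assumption, the attention weights also give a simple view of which tokens each genomic position is mapped to, which is one way the interpretability mentioned in the abstract could be examined.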