Poster in Workshop: AI for Nucleic Acids (AI4NA)
Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models
Adrian Morales-Pastor · Raquel Vázquez-Reza · Miłosz Wieczór · Clàudia Valverde · Manel Gil-Sorribes · Bertran Miquel-Oliver · Alvaro Ciudad Serrano · Alexis Molina
RNA plays a critical role in cellular functions and is increasingly targeted for therapeutics, yet its structural complexity poses challenges for computational modeling. While foundational models have transformed protein representation learning, achieving similar success for RNA remains elusive. We introduce ChaRNABERT, a suite of sample- and parameter-efficient RNA foundational models that leverage a learnable tokenization process to achieve superior performance across established benchmarks. We further validate its capabilities on downstream tasks, including RNA-protein and aptamer-protein interaction prediction. The ChaRNABERT-8M model, along with inference code, will be publicly available for academic research, with additional models provided upon request.