AI Framework APOLLO Brings Structure to Multimodal Single-Cell Analysis
Modern single-cell measurement technologies can now capture multiple layers of information from the same cell, such as gene expression, chromatin accessibility, and protein abundance. Each modality offers a different view of cellular state, producing high-dimensional datasets that must be integrated into a shared representation.
Many multimodal machine learning methods compress these inputs into a single latent space. This improves clustering and prediction, but it can make it difficult to tell which information is shared across modalities and which is specific to a particular assay. In a paper published in Nature Computational Science, researchers at MIT, the Broad Institute, and ETH Zurich have introduced a new AI framework to address this problem: APOLLO, short for Autoencoder with a Partially Overlapping Latent space learned through Latent Optimization. Instead of forcing all modalities into a single unified embedding, the framework allocates separate regions of latent space: one shared across modalities and others reserved for modality-specific information. The idea resembles overlapping sets (or a Venn diagram, as an MIT News report noted): some aspects of cell state appear in more than one dataset, while others remain unique to a particular measurement technology.
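In code, the partial-overlap idea amounts to partitioning each cell's latent vector into one shared block and one private block per modality. The sketch below is purely illustrative: the dimensions, variable names, and the choice of RNA and ATAC as the two modalities are assumptions for demonstration, not details from the paper.

```python
import numpy as np

# Hypothetical dimensions for illustration: a per-cell latent vector is
# partitioned into a block shared by both modalities plus one private
# block per modality.
D_SHARED, D_RNA, D_ATAC = 8, 4, 4

rng = np.random.default_rng(0)
z = rng.normal(size=D_SHARED + D_RNA + D_ATAC)  # full latent vector for one cell

# Each modality's decoder sees the shared block plus its own private block.
z_shared = z[:D_SHARED]
z_rna = np.concatenate([z_shared, z[D_SHARED:D_SHARED + D_RNA]])
z_atac = np.concatenate([z_shared, z[D_SHARED + D_RNA:]])

# Both modality views contain the same shared block; the private blocks differ.
assert np.allclose(z_rna[:D_SHARED], z_atac[:D_SHARED])
```

Because the overlap is a fixed slice of the latent vector, "shared versus modality-specific" is a structural property of the model rather than something to be disentangled after training.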
APOLLO learns three latent spaces to disentangle information captured by each modality using a two-step training procedure (Credit: Paper Authors)
APOLLO encodes that idea directly into the architecture, translating partial overlap into explicit structure in latent space. To implement this design, the system trains one autoencoder per modality but constrains them to share only part of their latent representation. During an initial training phase, the model directly optimizes latent variables alongside modality-specific decoders, learning which dimensions represent shared versus modality-specific information. In a second phase, encoders are trained to map new data into this structured latent space, enabling generalization and cross-modality prediction.
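The two-phase procedure can be sketched on a toy linear model: phase one directly optimizes per-cell latent variables and decoders by gradient descent on the reconstruction losses, and phase two fits encoders that map raw data into the learned latent space. Everything below (dimensions, linear decoders, least-squares encoders, synthetic data) is a simplified stand-in chosen for this sketch, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names and dimensions hypothetical): n cells measured in two
# modalities, with latents split into shared / RNA-private / ATAC-private blocks.
n, d_x, d_y = 200, 20, 15          # cells, RNA features, ATAC features
ds, dp = 4, 2                      # shared dim, private dim per modality

# Synthetic paired data driven by a common signal plus modality-private signals.
s = rng.normal(size=(n, ds))
px, py = rng.normal(size=(n, dp)), rng.normal(size=(n, dp))
X = np.hstack([s, px]) @ rng.normal(size=(ds + dp, d_x))
Y = np.hstack([s, py]) @ rng.normal(size=(ds + dp, d_y))

def views(Z):
    """Each modality's decoder sees the shared block plus its own private block."""
    zx = np.hstack([Z[:, :ds], Z[:, ds:ds + dp]])
    zy = np.hstack([Z[:, :ds], Z[:, ds + dp:]])
    return zx, zy

def recon_loss(Z, Wx, Wy):
    zx, zy = views(Z)
    return np.mean((zx @ Wx - X) ** 2) + np.mean((zy @ Wy - Y) ** 2)

# Phase 1: directly optimize per-cell latents Z and linear decoders Wx, Wy
# by gradient descent on the summed reconstruction losses.
Z = rng.normal(size=(n, ds + 2 * dp)) * 0.1
Wx = rng.normal(size=(ds + dp, d_x)) * 0.1
Wy = rng.normal(size=(ds + dp, d_y)) * 0.1

init_loss, lr = recon_loss(Z, Wx, Wy), 1e-3
for _ in range(2000):
    zx, zy = views(Z)
    rx, ry = zx @ Wx - X, zy @ Wy - Y            # residuals per modality
    gx, gy = rx @ Wx.T, ry @ Wy.T                # gradients w.r.t. zx, zy
    gZ = np.zeros_like(Z)
    gZ[:, :ds] = gx[:, :ds] + gy[:, :ds]         # shared block hears both losses
    gZ[:, ds:ds + dp] = gx[:, ds:]               # RNA-private block
    gZ[:, ds + dp:] = gy[:, ds:]                 # ATAC-private block
    Wx -= lr * zx.T @ rx / n
    Wy -= lr * zy.T @ ry / n
    Z -= lr * gZ
final_loss = recon_loss(Z, Wx, Wy)

# Phase 2: fit encoders mapping raw data into the learned latent space
# (here a simple linear least-squares fit per modality).
zx, zy = views(Z)
Ex = np.linalg.lstsq(X, zx, rcond=None)[0]
Ey = np.linalg.lstsq(Y, zy, rcond=None)[0]
```

The key structural point the sketch preserves is that only the shared block of each latent receives gradients from both reconstruction losses, which is what pushes jointly captured signal into the overlap.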
Applied to real datasets, the framework makes the distinction between shared and modality-specific information concrete. In paired RNA and chromatin accessibility data, it automatically distinguished gene activity captured jointly by both assays from signals that appeared in only one. Instead of flattening measurements into a single embedding, the model separated them according to how they relate to the underlying cell state.
The researchers also extended the method to paired RNA-protein datasets and multiplexed imaging experiments. In one case, the model identified which measurement modality captured γH2AX, a protein marker associated with DNA damage in cancer cells. Tracing a disease-relevant signal to a specific assay like this can help researchers decide which measurements are essential and which may be predicted computationally.
That same ability to trace signals to specific modalities also enables something more ambitious: predicting measurements that were never taken. APOLLO can infer unmeasured modalities because the shared latent space captures information common to multiple assays. In imaging experiments, for example, APOLLO was able to predict protein localization patterns from chromatin images alone. For large-scale studies, this could reduce experimental burden by predicting certain measurements rather than collecting them directly.
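Cross-modality prediction follows naturally from the shared block: encode the measured modality, keep the shared coordinates, fill the target modality's private block with a neutral value, and decode. A hypothetical sketch, with plain matrices standing in for trained encoder and decoder networks:

```python
import numpy as np

# Illustrative shapes and names only; these matrices stand in for
# APOLLO's trained per-modality encoders and decoders.
ds, dp, d_a, d_b = 4, 2, 20, 15
rng = np.random.default_rng(1)
enc_a = rng.normal(size=(d_a, ds + dp))     # modality A -> shared + A-private
dec_b = rng.normal(size=(ds + dp, d_b))     # shared + B-private -> modality B

def predict_b_from_a(x_a):
    """Predict an unmeasured modality from a measured one: encode A, keep the
    shared block, impute B's private block with zeros (a prior mean), decode."""
    z_a = x_a @ enc_a
    z_b = np.concatenate([z_a[:ds], np.zeros(dp)])
    return z_b @ dec_b

x_a = rng.normal(size=d_a)
y_hat = predict_b_from_a(x_a)
```

By construction, only information in the shared block can cross modalities, so the quality of such predictions directly reflects how much of the target signal is genuinely shared.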
As multimodal assays continue to expand in scope and resolution, the computational challenge is shifting from collecting data to integrating it in an organized way. Researchers must understand how measurements relate, where each signal comes from, and which assays are truly necessary. By separating shared biological structure from modality-specific information, APOLLO moves multimodal analysis beyond simple integration toward structured representation learning. In scientific computing workflows, frameworks that expose the internal structure of complex datasets can guide experimental design, reduce redundancy, and make large-scale studies more feasible. As the number of measurable cellular features keeps growing, tools that untangle the data instead of flattening it may become essential infrastructure for AI-driven biology. Read more about APOLLO in the scientific paper.
This article first appeared on HPCwire.