scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks

Li Huang¹, Weikang Gong¹,², Dongsheng Chen¹

¹State Key Laboratory of Common Mechanism Research for Major Diseases, Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences and Peking Union Medical College, 100 Chongwen Road, Suzhou Industrial Park, Suzhou, Jiangsu Province 215123, China.

Briefings in Bioinformatics | June 14, 2025

Summary

scValue is a new method that efficiently subsamples large single-cell RNA sequencing (scRNA-seq) data for machine learning tasks. It prioritizes high-value cells, preserving biological signals and outperforming existing methods.

Area of Science:

Computational Biology
Genomics
Machine Learning

Background:

Large single-cell RNA sequencing (scRNA-seq) datasets offer deep biological insights but pose significant computational challenges.
Existing subsampling techniques may compromise performance in downstream machine learning and deep learning (ML/DL) analyses.

Purpose of the Study:

To introduce scValue, a novel cell-ranking approach for efficient and effective subsampling of large scRNA-seq datasets.
To demonstrate scValue's superiority over existing methods in preserving biological signals for ML/DL tasks.

Main Methods:

scValue ranks cells based on 'data value' derived from random forest out-of-bag estimates.
It prioritizes high-value cells and oversamples cell types whose data values vary more.
It was benchmarked against existing subsampling methods on cell-type annotation, label transfer, label harmonization, and deconvolution tasks.
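
The ranking and allocation steps above can be illustrated with a minimal sketch. It is not the scValue package's API or its exact algorithm: the function names `oob_data_values` and `value_based_subsample`, the use of scikit-learn's out-of-bag class probabilities as a stand-in for "data value", and the size times (1 + standard deviation) quota rule are all assumptions made for illustration, since the summary does not specify these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_data_values(X, y, n_trees=200, seed=0):
    """Score each cell by the out-of-bag probability of its true label.

    Assumption: this is a simple proxy for a random-forest OOB data value,
    not necessarily the estimator used by scValue.
    """
    rf = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, bootstrap=True, random_state=seed
    )
    rf.fit(X, y)
    # Rows are cells; columns follow the (sorted) order of rf.classes_.
    # A cell that was never out-of-bag would yield NaN, mapped here to 0.
    proba = np.nan_to_num(rf.oob_decision_function_)
    col = np.searchsorted(rf.classes_, y)
    return proba[np.arange(len(y)), col]


def value_based_subsample(values, y, n_keep):
    """Return indices of roughly n_keep cells, highest values first per type."""
    types = np.unique(y)
    counts = np.array([(y == t).sum() for t in types])
    spread = np.array([values[y == t].std() for t in types])
    # Assumed quota rule: cell-type size scaled up by value variability.
    weights = counts * (1.0 + spread)
    quotas = np.floor(n_keep * weights / weights.sum()).astype(int)
    # Keep at least one cell per type and never more than the type contains.
    quotas = np.clip(quotas, 1, counts)
    keep = []
    for t, q in zip(types, quotas):
        idx = np.flatnonzero(y == t)
        top = idx[np.argsort(values[idx])[::-1][:q]]
        keep.extend(top.tolist())
    return np.sort(np.array(keep))


# Synthetic stand-in for an expression matrix with three labelled cell types.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 5)) for c in (0.0, 2.0, 4.0)])
y = np.repeat(np.array(["B", "NK", "T"]), 100)

values = oob_data_values(X, y)
subset = value_based_subsample(values, y, n_keep=60)
```

Here `subset` holds row indices into `X`; in a real workflow the same indices would select cells from the full dataset before model training. Because quotas are rounded down and then clipped, the returned size is close to, but not always exactly, `n_keep`.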

Main Results:

scValue consistently outperformed existing subsampling methods in automatic cell-type annotation tasks.
It achieved performance comparable to full-data analysis across various ML/DL models and datasets.
It effectively preserved T-cell annotations, reproduced T-cell subtype relationships, and constructed reliable single-cell references for deconvolution.

Conclusions:

scValue offers a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows.
It demonstrates fast execution, balanced cell-type representation, and distributional properties similar to uniform sampling.
It is available as an open-source Python package.

Keywords:

cell type analysis
data valuation
machine and deep learning
single-cell transcriptomics
subsampling