scanpy.pp.highly_variable_genes
- scanpy.pp.highly_variable_genes(adata, *, layer=None, n_top_genes=None, min_disp=0.5, max_disp=inf, min_mean=0.0125, max_mean=3, span=0.3, n_bins=20, flavor=None, subset=False, inplace=True, batch_key=None, check_values=True)
Annotate highly variable genes [Satija et al., 2015, Stuart et al., 2019, Zheng et al., 2017].
Expects logarithmized data, except when flavor='seurat_v3'/'seurat_v3_paper', in which case count data is expected.
Depending on flavor, this reproduces the R implementations of Seurat [Satija et al., 2015], Cell Ranger [Zheng et al., 2017], and Seurat v3 [Stuart et al., 2019].
'seurat_v3'/'seurat_v3_paper' requires the scikit-misc package. If you plan to use this flavor, consider installing scanpy with this optional dependency: scanpy[skmisc].
For the dispersion-based methods (flavor='seurat' Satija et al. [2015] and flavor='cell_ranger' Zheng et al. [2017]), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.
For flavor='seurat_v3'/'seurat_v3_paper' [Stuart et al., 2019], a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance. The two flavors differ only if batch_key is not None: for flavor='seurat_v3', genes are first sorted by the median (across batches) rank, with ties broken by the number of batches in which a gene is an HVG. For flavor='seurat_v3_paper', genes are first sorted by the number of batches in which a gene is an HVG, with ties broken by the median (across batches) rank.
The following may help when comparing to Seurat’s naming: If batch_key=None and flavor='seurat', this mimics Seurat’s FindVariableFeatures(…, method='mean.var.plot'). If batch_key=None and flavor='seurat_v3'/flavor='seurat_v3_paper', this mimics Seurat’s FindVariableFeatures(..., method='vst'). If batch_key is not None and flavor='seurat_v3_paper', this mimics Seurat’s SelectIntegrationFeatures.
See also scanpy.experimental.pp._highly_variable_genes for additional flavors (e.g. Pearson residuals).
- Parameters:
- adata
AnnData The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- layer
str | None (default: None) If provided, use adata.layers[layer] for expression values instead of adata.X.
- n_top_genes
int | None (default: None) Number of highly variable genes to keep. Mandatory if flavor='seurat_v3'.
- min_mean
float (default: 0.0125) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- max_mean
float (default: 3) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- min_disp
float (default: 0.5) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- max_disp
float (default: inf) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- span
float (default: 0.3) The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.
- n_bins
int (default: 20) Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You’ll be informed about this if you set settings.verbosity = 4.
- flavor
tuple | None (default: None) Choose the flavor for identifying highly variable genes (default depends on scanpy.settings.preset property highly_variable_genes). For the dispersion-based methods in their default workflows, 'seurat' passes the cutoffs whereas 'cell_ranger' passes n_top_genes.
- subset
bool (default: False) If True, subset in place to highly variable genes; otherwise merely flag highly variable genes.
- inplace
bool (default: True) Whether to place calculated metrics in .var or return them.
- batch_key
str | None (default: None) If specified, highly variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors except seurat_v3, genes are first sorted by the number of batches in which they are an HVG. For dispersion-based flavors, ties are broken by normalized dispersion. For flavor='seurat_v3_paper', ties are broken by the median (across batches) rank based on within-batch normalized variance.
- check_values
bool (default: True) Check if counts in the selected layer are integers. If True, a warning is issued when non-integer values are found. Only used if flavor='seurat_v3'/'seurat_v3_paper'.
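The binned normalization behind the dispersion-based flavors, and the way the cutoffs interact with n_top_genes, can be illustrated with a small sketch. This is not scanpy's implementation: the helper names normalized_dispersion and select_genes are hypothetical, equal-width bins via pandas.cut and strict inequalities for the cutoffs are assumptions, and flavor-specific details are left out.

```python
import numpy as np
import pandas as pd


def normalized_dispersion(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its mean-expression bin.

    Hypothetical helper for illustration only.
    """
    df = pd.DataFrame({"means": means, "dispersions": dispersions}, dtype=float)
    # Assumption: equal-width bins over the range of mean expression.
    df["mean_bin"] = pd.cut(df["means"], bins=n_bins)
    grouped = df.groupby("mean_bin", observed=True)["dispersions"]
    bin_mean = grouped.transform("mean")
    bin_std = grouped.transform("std")  # sample std; NaN when a bin holds one gene
    norm = (df["dispersions"] - bin_mean) / bin_std
    # A bin with a single gene has no spread; its value is set to 1.
    norm[bin_std.isna()] = 1.0
    return norm.to_numpy()


def select_genes(means, disp_norm, n_top_genes=None,
                 min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=np.inf):
    """Return a boolean mask of selected genes (hypothetical helper)."""
    means = np.asarray(means, dtype=float)
    disp_norm = np.asarray(disp_norm, dtype=float)
    if n_top_genes is not None:
        # Cutoffs are ignored: keep the largest normalized dispersions.
        top = np.argsort(-disp_norm, kind="stable")[:n_top_genes]
        mask = np.zeros(disp_norm.shape, dtype=bool)
        mask[top] = True
        return mask
    # Assumption: cutoffs are applied as strict inequalities.
    return ((means > min_mean) & (means < max_mean)
            & (disp_norm > min_disp) & (disp_norm < max_disp))


# Toy data: three mean-expression bins, the last holding a single gene.
means = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 10.0]
dispersions = [1, 2, 3, 10, 20, 30, 7]
disp_norm = normalized_dispersion(means, dispersions, n_bins=3)
print(disp_norm)
print(select_genes(means, disp_norm))
print(select_genes(means, disp_norm, n_top_genes=2))
```

Because the z-score is taken per bin, a gene is compared only against genes of similar mean expression, which is why highly variable genes end up being selected within each bin.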
- Return type:
- Returns:
Returns a pandas.DataFrame with calculated metrics if inplace=False; otherwise updates adata in place, setting the following fields:
- adata.var['highly_variable']
pandas.Series (dtype bool) Boolean indicator of highly variable genes.
- adata.var['means']
pandas.Series (dtype float) Means per gene.
- adata.var['dispersions']
pandas.Series (dtype float) For dispersion-based flavors, dispersions per gene.
- adata.var['dispersions_norm']
pandas.Series (dtype float) For dispersion-based flavors, normalized dispersions per gene.
- adata.var['variances']
pandas.Series (dtype float) For flavor='seurat_v3'/'seurat_v3_paper', variance per gene.
- adata.var['variances_norm']
pandas.Series (dtype float) For flavor='seurat_v3'/'seurat_v3_paper', normalized variance per gene, averaged in the case of multiple batches.
- adata.var['highly_variable_rank']
pandas.Series (dtype float) For flavor='seurat_v3'/'seurat_v3_paper', rank of the gene according to normalized variance; in the case of multiple batches, see the description above.
- adata.var['highly_variable_nbatches']
pandas.Series (dtype int) If batch_key is given, this denotes in how many batches genes are detected as HVG.
- adata.var['highly_variable_intersection']
pandas.Series (dtype bool) If batch_key is given, this denotes the genes that are highly variable in all batches.
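The batch-related fields and the two Seurat v3 orderings can be sketched with plain Python. This is an illustration under assumptions rather than scanpy's code: the function name merge_batches and the input layout (one dict per batch mapping each selected gene to its within-batch rank, with 0 as the most variable) are hypothetical, and the median rank is taken only over the batches in which a gene was selected.

```python
from statistics import median


def merge_batches(ranks_per_batch, flavor="seurat_v3"):
    """Order genes across batches (hypothetical helper).

    ranks_per_batch: list of dicts, one per batch, mapping
    gene -> rank among that batch's HVGs (0 = most variable).
    """
    n_batches = len(ranks_per_batch)
    genes = sorted({gene for ranks in ranks_per_batch for gene in ranks})
    summary = {}
    for gene in genes:
        ranks = [r[gene] for r in ranks_per_batch if gene in r]
        summary[gene] = {
            "highly_variable_nbatches": len(ranks),
            "highly_variable_intersection": len(ranks) == n_batches,
            # Assumption: median over the batches that selected the gene.
            "median_rank": median(ranks),
        }

    def nbatches(gene):
        return summary[gene]["highly_variable_nbatches"]

    def med_rank(gene):
        return summary[gene]["median_rank"]

    if flavor == "seurat_v3":
        # Median rank first, ties broken by number of batches (more is better).
        ordered = sorted(genes, key=lambda g: (med_rank(g), -nbatches(g)))
    elif flavor == "seurat_v3_paper":
        # Number of batches first, ties broken by median rank.
        ordered = sorted(genes, key=lambda g: (-nbatches(g), med_rank(g)))
    else:
        raise ValueError(f"unsupported flavor for this sketch: {flavor!r}")
    return ordered, summary


# Toy input: gene B is selected in both batches, A and C in one each.
batches = [{"A": 0, "B": 2}, {"B": 2, "C": 1}]
order_v3, summary = merge_batches(batches, flavor="seurat_v3")
order_paper, _ = merge_batches(batches, flavor="seurat_v3_paper")
print(order_v3)
print(order_paper)
```

In this toy input the two orderings disagree: ranking by median rank first favors A, while ranking by batch count first favors B, the only gene found in every batch.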
Notes
This function replaces filter_genes_dispersion().
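For the 'seurat_v3' flavors, the two steps named in the description (standardize each gene with a regularized standard deviation, then take the per-gene variance) can be illustrated with NumPy. This is a sketch only: the regularized standard deviations are passed in as a given input, whereas the span parameter indicates they come from a loess fit; the function name normalized_variance is hypothetical, the use of the sample variance (ddof=1) is an assumption, and any further processing scanpy applies is omitted.

```python
import numpy as np


def normalized_variance(counts, reg_std):
    """Variance of each gene after z-scoring with a regularized std.

    Hypothetical helper: counts is a cells x genes array and reg_std holds
    one regularized standard deviation per gene (assumed to be given).
    """
    counts = np.asarray(counts, dtype=float)
    reg_std = np.asarray(reg_std, dtype=float)
    gene_mean = counts.mean(axis=0)
    standardized = (counts - gene_mean) / reg_std
    # Assumption: sample variance across cells.
    return standardized.var(axis=0, ddof=1)


# Toy counts: 3 cells x 2 genes. The observed sample std is 2 for gene 0
# and sqrt(12) for gene 1.
counts = [[0, 2], [2, 2], [4, 8]]
# Gene 0 is more variable than its regularized std of 1 would predict.
norm_var = normalized_variance(counts, reg_std=[1.0, np.sqrt(12)])
print(norm_var)
```

A gene whose observed spread matches its regularized standard deviation gets a normalized variance near 1, while a gene that varies more than expected for its mean gets a larger value and therefore ranks higher.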