scanpy.pp.highly_variable_genes
- scanpy.pp.highly_variable_genes(adata, *, layer=None, n_top_genes=None, min_disp=0.5, max_disp=inf, min_mean=0.0125, max_mean=3, span=0.3, n_bins=20, flavor=None, subset=False, inplace=True, batch_key=None, check_values=True)
- Annotate highly variable genes [Satija et al., 2015, Stuart et al., 2019, Zheng et al., 2017].
- Expects logarithmized data, except when flavor='seurat_v3'/'seurat_v3_paper', in which case count data is expected.
- Depending on flavor, this reproduces the R implementations of Seurat [Satija et al., 2015], Cell Ranger [Zheng et al., 2017], and Seurat v3 [Stuart et al., 2019].
- flavor='seurat_v3'/'seurat_v3_paper' requires the scikit-misc package. If you plan to use this flavor, consider installing scanpy with this optional dependency: scanpy[skmisc].
- For the dispersion-based methods (flavor='seurat', Satija et al. [2015], and flavor='cell_ranger', Zheng et al. [2017]), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.
- For flavor='seurat_v3'/'seurat_v3_paper' [Stuart et al., 2019], a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance. The two flavors differ only if batch_key is not None: for flavor='seurat_v3', genes are first sorted by the median (across batches) rank, with ties broken by the number of batches in which a gene is an HVG. For flavor='seurat_v3_paper', genes are first sorted by the number of batches in which a gene is an HVG, with ties broken by the median (across batches) rank.
- The following may help when comparing to Seurat’s naming: if batch_key=None and flavor='seurat', this mimics Seurat’s FindVariableFeatures(…, method='mean.var.plot'). If batch_key=None and flavor='seurat_v3'/flavor='seurat_v3_paper', this mimics Seurat’s FindVariableFeatures(..., method='vst'). If batch_key is not None and flavor='seurat_v3_paper', this mimics Seurat’s SelectIntegrationFeatures.
- See also scanpy.experimental.pp._highly_variable_genes for additional flavors (e.g. Pearson residuals).
- Parameters:
- adata AnnData
- The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- layer str | None (default: None)
- If provided, use adata.layers[layer] for expression values instead of adata.X.
- n_top_genes int | None (default: None)
- Number of highly variable genes to keep. Mandatory if flavor='seurat_v3'.
- min_mean float (default: 0.0125)
- Lower cutoff for the mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- max_mean float (default: 3)
- Upper cutoff for the mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- min_disp float (default: 0.5)
- Lower cutoff for the normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- max_disp float (default: inf)
- Upper cutoff for the normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
- span float (default: 0.3)
- The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.
- n_bins int (default: 20)
- Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You’ll be informed about this if you set settings.verbosity = 4.
- flavor tuple | None (default: None)
- Choose the flavor for identifying highly variable genes (default depends on the scanpy.settings.preset property highly_variable_genes). For the dispersion-based methods in their default workflows, 'seurat' passes the cutoffs whereas 'cell_ranger' passes n_top_genes.
- subset bool (default: False)
- If True, subset adata in place to the highly variable genes; otherwise merely indicate which genes are highly variable.
- inplace bool (default: True)
- Whether to place calculated metrics in .var or return them.
- batch_key str | None (default: None)
- If specified, highly variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors except seurat_v3, genes are first sorted by the number of batches in which they are an HVG. For dispersion-based flavors, ties are broken by normalized dispersion. For flavor='seurat_v3_paper', ties are broken by the median (across batches) rank based on within-batch normalized variance.
- check_values bool (default: True)
- Check whether the counts in the selected layer are integers. If set to True, a warning is emitted when they are not. Only used if flavor='seurat_v3'/'seurat_v3_paper'.
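The dispersion-based parameters above (n_bins, the four cutoffs, and n_top_genes) can be illustrated with a minimal pure-Python sketch. This is not scanpy's implementation: it assumes equal-width bins over the mean expression, the sample standard deviation within each bin, and strict inequalities for the cutoffs. The helper names normalized_dispersions and select_hvgs are hypothetical.

```python
from statistics import mean, stdev


def normalized_dispersions(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its bin of mean expression."""
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width
    # Assumption: equal-width bins; the largest mean falls into the last bin.
    bin_of = [min(int((m - lo) / width), n_bins - 1) for m in means]
    groups = {}
    for b, d in zip(bin_of, dispersions):
        groups.setdefault(b, []).append(d)
    result = []
    for b, d in zip(bin_of, dispersions):
        members = groups[b]
        if len(members) == 1:
            # A single gene in a bin: normalized dispersion is set to 1.
            result.append(1.0)
        else:
            sd = stdev(members)
            result.append((d - mean(members)) / sd if sd > 0 else 0.0)
    return result


def select_hvgs(means, disp_norm, n_top_genes=None, min_mean=0.0125,
                max_mean=3, min_disp=0.5, max_disp=float("inf")):
    """Return one boolean per gene marking it as highly variable."""
    if n_top_genes is not None:
        # All cutoffs are ignored; keep the genes with the largest values.
        order = sorted(range(len(disp_norm)),
                       key=lambda i: disp_norm[i], reverse=True)
        top = set(order[:n_top_genes])
        return [i in top for i in range(len(disp_norm))]
    # Assumption: strict inequalities on both sides of each cutoff.
    return [min_mean < m < max_mean and min_disp < d < max_disp
            for m, d in zip(means, disp_norm)]


gene_means = [0.1, 0.2, 0.3, 5.0]
gene_disps = [1.0, 2.0, 3.0, 7.0]
disp_norm = normalized_dispersions(gene_means, gene_disps, n_bins=2)
by_cutoffs = select_hvgs(gene_means, disp_norm)
by_top_n = select_hvgs(gene_means, disp_norm, n_top_genes=2)
```

With cutoffs, the fourth gene is rejected because its mean exceeds max_mean even though its normalized dispersion is high; with n_top_genes=2 the same gene is kept, since the cutoffs no longer apply.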
 
- Return type:
- Returns:
- Returns a pandas.DataFrame with calculated metrics if inplace=False, else returns an AnnData object where it sets the following fields:
- adata.var['highly_variable'] pandas.Series (dtype bool)
- boolean indicator of highly variable genes
- adata.var['means'] pandas.Series (dtype float)
- means per gene
- adata.var['dispersions'] pandas.Series (dtype float)
- For dispersion-based flavors, dispersions per gene
- adata.var['dispersions_norm'] pandas.Series (dtype float)
- For dispersion-based flavors, normalized dispersions per gene
- adata.var['variances'] pandas.Series (dtype float)
- For flavor='seurat_v3'/'seurat_v3_paper', variance per gene
- adata.var['variances_norm'] pandas.Series (dtype float)
- For flavor='seurat_v3'/'seurat_v3_paper', normalized variance per gene, averaged in the case of multiple batches
- adata.var['highly_variable_rank'] pandas.Series (dtype float)
- For flavor='seurat_v3'/'seurat_v3_paper', rank of the gene according to normalized variance; in the case of multiple batches, see the description above
- adata.var['highly_variable_nbatches'] pandas.Series (dtype int)
- If batch_key is given, this denotes in how many batches genes are detected as HVG
- adata.var['highly_variable_intersection'] pandas.Series (dtype bool)
- If batch_key is given, this denotes the genes that are highly variable in all batches
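The batch-related fields and the two seurat_v3 orderings described above can be sketched in pure Python as follows. This is an illustration of the stated sorting rules, not scanpy's code. The function merge_batches and its input layout (one rank per batch, None when the gene is not an HVG in that batch) are hypothetical, and taking the median only over batches in which the gene is an HVG is an assumption.

```python
from statistics import median


def merge_batches(ranks, flavor="seurat_v3"):
    """Summarize per-batch HVG ranks and order genes across batches.

    ranks maps gene -> list of per-batch ranks (0 = best);
    None means the gene is not an HVG in that batch.
    """
    n_batches = len(next(iter(ranks.values())))
    summary = {}
    for gene, per_batch in ranks.items():
        hits = [r for r in per_batch if r is not None]
        summary[gene] = {
            # Analogue of highly_variable_nbatches
            "nbatches": len(hits),
            # Analogue of highly_variable_intersection
            "intersection": len(hits) == n_batches,
            # Assumption: median over the batches where the gene is an HVG
            "median_rank": median(hits) if hits else float("inf"),
        }
    if flavor == "seurat_v3":
        # Median rank first, ties broken by number of batches (more is better)
        def key(g):
            return (summary[g]["median_rank"], -summary[g]["nbatches"])
    elif flavor == "seurat_v3_paper":
        # Number of batches first, ties broken by median rank
        def key(g):
            return (-summary[g]["nbatches"], summary[g]["median_rank"])
    else:
        raise ValueError(f"unsupported flavor: {flavor!r}")
    return summary, sorted(summary, key=key)


example_ranks = {"A": [0, None], "B": [1, 1], "C": [3, 0], "D": [None, None]}
summary_v3, order_v3 = merge_batches(example_ranks, flavor="seurat_v3")
summary_paper, order_paper = merge_batches(example_ranks, flavor="seurat_v3_paper")
```

Gene A has the best median rank but is an HVG in only one batch, so it leads the 'seurat_v3' ordering and falls behind B and C in the 'seurat_v3_paper' ordering.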
 
- Notes
- This function replaces filter_genes_dispersion().
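The normalized variance used by flavor='seurat_v3'/'seurat_v3_paper' (standardize with a regularized standard deviation, then take the variance) can be sketched for a single gene as follows. This is a simplified illustration, not scanpy's implementation. The regularized standard deviation reg_std is taken as a given input rather than derived from a loess fit, and clipping values at sqrt(N) regularized standard deviations above the mean is an assumption that the description above does not state. The function names are hypothetical.

```python
from math import sqrt


def normalized_variance(counts, reg_std):
    """Variance of one gene after standardizing with a regularized std.

    counts: raw counts of the gene across N cells.
    reg_std: regularized standard deviation (assumed to be given).
    """
    n = len(counts)
    mu = sum(counts) / n
    # Assumption: clip large values at sqrt(N) regularized stds above the mean.
    clip_val = mu + reg_std * sqrt(n)
    z = [(min(x, clip_val) - mu) / reg_std for x in counts]
    return sum(v * v for v in z) / (n - 1)


def rank_genes(norm_vars):
    """Return gene indices ordered from highest to lowest normalized variance."""
    return sorted(range(len(norm_vars)), key=lambda g: norm_vars[g], reverse=True)


unclipped = normalized_variance([1, 3], reg_std=1.0)
clipped = normalized_variance([0, 0, 0, 8], reg_std=1.0)
ranking = rank_genes([0.5, 2.0, 1.0])
```

In the second call the single large count is clipped before standardization, which limits how much one outlier cell can inflate a gene's normalized variance.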