scanpy.tl.marker_gene_overlap
- scanpy.tl.marker_gene_overlap(adata, reference_markers, *, key='rank_genes_groups', method='overlap_count', normalize=None, top_n_markers=None, adj_pval_threshold=None, key_added='marker_gene_overlap', inplace=False)
Calculate an overlap score between data-derived marker genes and provided markers.
Marker gene overlap scores can be reported as overlap counts, overlap coefficients, or Jaccard indices. The function returns a pandas dataframe that can be used to annotate clusters based on marker gene overlaps.
This function was written by Malte Luecken.
- Parameters:
- adata
AnnData The annotated data matrix.
- reference_markers
dict[str, set] | dict[str, list] A marker gene dictionary. Keys are strings giving the cell identity name; values are sets or lists of strings that match the format of adata.var_names.
- key
str (default: 'rank_genes_groups') The key in adata.uns where the rank_genes_groups output is stored.
- method
Literal['overlap_count', 'overlap_coef', 'jaccard'] (default: 'overlap_count') Method used to calculate the marker gene overlap. 'overlap_count' uses the size of the intersection of the gene sets, 'overlap_coef' uses the overlap coefficient, and 'jaccard' uses the Jaccard index.
- normalize
Optional[Literal['reference', 'data']] (default: None) Normalization option for the marker gene overlap output. This parameter can only be set when method is set to 'overlap_count'. 'reference' normalizes the counts by the total number of marker genes given in the reference annotation per group. 'data' normalizes the counts by the total number of marker genes used for each cluster.
- top_n_markers
int | None (default: None) The number of top data-derived marker genes to use. By default the top 100 marker genes are used. If adj_pval_threshold is set along with top_n_markers, then adj_pval_threshold is ignored.
- adj_pval_threshold
float | None (default: None) A significance threshold on the adjusted p-values to select marker genes. This can only be used when adjusted p-values are calculated by sc.tl.rank_genes_groups(). If adj_pval_threshold is set along with top_n_markers, then adj_pval_threshold is ignored.
- key_added
str (default: 'marker_gene_overlap') Name of the .uns field that will contain the marker overlap scores.
- inplace
bool (default: False) If False, return the marker gene dataframe; if True, store it in adata.uns.
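The three scoring methods and the two normalization options can be illustrated on plain Python sets. This is a minimal sketch using the standard textbook definitions of the overlap coefficient and the Jaccard index; it is not scanpy's internal code, and the helper name overlap_scores and the example gene lists are hypothetical.

```python
def overlap_scores(data_markers, reference_markers):
    """Score one cluster's marker genes against one reference gene set.

    Hypothetical helper for illustration only; not part of scanpy.
    """
    data_markers = set(data_markers)
    reference_markers = set(reference_markers)

    shared = len(data_markers & reference_markers)   # overlap count
    smaller = min(len(data_markers), len(reference_markers))
    union = len(data_markers | reference_markers)

    return {
        "overlap_count": shared,
        # Overlap coefficient: intersection size over the smaller set size.
        "overlap_coef": shared / smaller if smaller else 0.0,
        # Jaccard index: intersection size over union size.
        "jaccard": shared / union if union else 0.0,
        # Count divided by the number of reference markers for the group.
        "count_norm_reference": shared / len(reference_markers) if reference_markers else 0.0,
        # Count divided by the number of data-derived markers for the cluster.
        "count_norm_data": shared / len(data_markers) if data_markers else 0.0,
    }


# Made-up marker lists, chosen only to make the arithmetic easy to follow.
cluster_genes = ["CD14", "LYZ", "S100A8", "FCN1"]
reference = {"CD14", "LYZ", "CST3"}
scores = overlap_scores(cluster_genes, reference)
print(scores)
```

Here two genes are shared, so the count is 2, the overlap coefficient is 2/3 (the smaller set has three genes), and the Jaccard index is 2/5 (the union has five genes).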
- Returns:
Returns a pandas.DataFrame if inplace=False; otherwise sets the following field on the AnnData object:
adata.uns[key_added]
pandas.DataFrame (dtype float) Marker gene overlap scores. Default for key_added is 'marker_gene_overlap'.
Examples
>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> sc.pp.pca(adata, svd_solver="arpack")
>>> sc.pp.neighbors(adata)
>>> sc.tl.leiden(adata)
>>> sc.tl.rank_genes_groups(adata, groupby="leiden")
>>> marker_genes = {
...     "CD4 T cells": {"IL7R"},
...     "CD14+ Monocytes": {"CD14", "LYZ"},
...     "B cells": {"MS4A1"},
...     "CD8 T cells": {"CD8A"},
...     "NK cells": {"GNLY", "NKG7"},
...     "FCGR3A+ Monocytes": {"FCGR3A", "MS4A7"},
...     "Dendritic Cells": {"FCER1A", "CST3"},
...     "Megakaryocytes": {"PPBP"},
... }
>>> marker_matches = sc.tl.marker_gene_overlap(adata, marker_genes)
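
One possible way to turn the returned dataframe into cluster annotations is to pick, for each cluster, the reference cell type with the highest score. The sketch below assumes that rows are the reference cell types and columns are the clusters; the small dataframe is hand-made with hypothetical values (it is not output from a real run), and the "Unknown" fallback label is an arbitrary choice.

```python
import pandas as pd

# Hypothetical overlap counts standing in for the function's output.
# Assumed layout: rows = reference cell types, columns = cluster labels.
marker_matches = pd.DataFrame(
    {"0": [1.0, 0.0, 0.0], "1": [0.0, 2.0, 0.0], "2": [0.0, 0.0, 0.0]},
    index=["CD4 T cells", "CD14+ Monocytes", "B cells"],
)

# Best-scoring reference cell type and its score, per cluster (column).
best_label = marker_matches.idxmax(axis=0)
best_score = marker_matches.max(axis=0)

# Clusters with no overlapping markers get a placeholder label instead of
# whichever row happens to come first in an all-zero column.
cluster_labels = best_label.where(best_score > 0, "Unknown").to_dict()
print(cluster_labels)
```

A mapping like cluster_labels could then be applied to the per-cell cluster assignments, for example with adata.obs["leiden"].map(cluster_labels). Taking the maximum ignores ties and near-ties, so ambiguous clusters are worth inspecting by hand.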