scanpy.pp.pca
- scanpy.pp.pca(data, n_comps=None, *, layer=None, zero_center=True, svd_solver=None, chunked=False, chunk_size=None, random_state=0, return_info=False, mask_var=_empty, use_highly_variable=None, dtype='float32', key_added=None, copy=False)
Principal component analysis [Pedregosa et al., 2011].
Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):

| Options | Implementation (default svd_solver) |
| --- | --- |
| chunked=False, zero_center=True | sklearn PCA ('arpack') |
| chunked=False, zero_center=False | sklearn TruncatedSVD ('randomized'); dask-ml TruncatedSVD ('tsqr') |
| chunked=True (zero_center ignored) | sklearn IncrementalPCA ('auto'); dask-ml IncrementalPCA ('auto') |
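A minimal usage sketch: the pbmc3k() example dataset and the normalization steps are illustrative choices for this sketch, not part of this function.

```python
import scanpy as sc

# Small example dataset shipped with scanpy (downloaded or read from cache).
adata = sc.datasets.pbmc3k()

# Typical preprocessing before PCA: library-size normalization and log-transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Compute 50 principal components; results are written into `adata` in place.
sc.pp.pca(adata, n_comps=50)

print(adata.obsm["X_pca"].shape)               # (n_obs, 50) embedding
print(adata.varm["PCs"].shape)                 # (n_vars, 50) loadings
print(adata.uns["pca"]["variance_ratio"][:5])  # explained variance ratios
```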
- Parameters:
- data
  AnnData | ndarray | csr_array | csc_array | csr_matrix | csc_matrix
  The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- n_comps
  int | None (default: None)
  Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1, whichever is smaller.
- layer
  str | None (default: None)
  If provided, which element of layers to use for PCA.
- zero_center
  bool (default: True)
  If True, compute (or approximate) PCA from the covariance matrix. If False, perform a truncated SVD instead of PCA. Our default PCA algorithms (see svd_solver) support implicit zero-centering and can therefore operate efficiently on sparse data.
- svd_solver
  Literal['auto', 'full', 'tsqr', 'randomized', 'arpack', 'covariance_eigh'] | None (default: None); the valid subset depends on the array type and the chunked/zero_center settings.
  SVD solver to use. See the table above for which solver class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None. Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver (see the example after this parameter list).
  - None: Choose automatically based on the solver class (see table above).
  - 'arpack': ARPACK wrapper in SciPy (svds()). Not available for dask arrays.
  - 'covariance_eigh': Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, the array must be CSR or dense and chunked as (N, adata.shape[1]).
  - 'randomized': Randomized algorithm from Halko et al. [2009]. For dask arrays, this will use svd_compressed().
  - 'auto': Choose automatically depending on the size of the problem: uses 'full' for small shapes and 'randomized' for large shapes.
  - 'tsqr': "tall-and-skinny QR" algorithm from Benson et al. [2013]. Only available for dense dask arrays.
  Changed in version 1.9.3: Default value changed from 'arpack' to None.
  Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.
- chunked
  bool (default: False)
  If True, perform an incremental PCA on segments of chunk_size. Automatically zero-centers and ignores the settings of zero_center, random_state and svd_solver. If False, perform a full PCA / truncated SVD (see svd_solver and zero_center). See the table above for which solver class is used.
- chunk_size
  int | None (default: None)
  Number of observations to include in each chunk. Required if chunked=True was passed.
- random_state
  int | RandomState | None (default: 0)
  Change to use different initial states for the optimization.
- return_info
  bool (default: False)
  Only relevant when not passing an AnnData: see "Returns".
- mask_var
  ndarray[tuple[int, ...], dtype[bool]] | str | None | Empty (default: _empty)
  To run only on a certain set of genes given by a boolean array or a string referring to an array in .var. By default, uses .var['highly_variable'] if available, else everything.
- use_highly_variable
  bool | None (default: None)
  Whether to use highly variable genes only, stored in .var['highly_variable']. By default, uses them if they have been determined beforehand.
  Deprecated since version 1.10.0: use mask_var instead.
- dtype
  DTypeLike (default: 'float32')
  NumPy data type string to which to convert the result.
- key_added
  str | None (default: None)
  If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].
- copy
  bool (default: False)
  If an AnnData is passed, determines whether a copy is returned. Ignored otherwise.
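The sketch below illustrates a few of the parameters above. The pbmc3k() dataset, the "expressed" mask column, the "pca_expressed" key, and all parameter values are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
import scanpy as sc

adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Sparse-friendly solver that avoids densifying .X while still zero-centering implicitly.
sc.pp.pca(adata, n_comps=30, svd_solver="covariance_eigh")

# Restrict the PCA to a custom gene mask and store results under a separate key,
# leaving the default 'X_pca' / 'PCs' / 'pca' slots untouched.
mask = np.asarray(adata.X.sum(axis=0)).ravel() > 0   # illustrative mask: genes with nonzero counts
adata.var["expressed"] = mask
sc.pp.pca(adata, n_comps=30, mask_var="expressed", key_added="pca_expressed")

# Incremental PCA on chunks of 1000 observations
# (zero_center and svd_solver are ignored in this mode).
sc.pp.pca(adata, n_comps=30, chunked=True, chunk_size=1000)
```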
- Return type:
  AnnData | ndarray | csr_array | csc_array | csr_matrix | csc_matrix | None
- Returns:
  If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.
  Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
  - .obsm['X_pca' | key_added] : csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps))
    PCA representation of data.
  - .varm['PCs' | key_added] : ndarray (shape (adata.n_vars, n_comps))
    The principal components containing the loadings.
  - .uns['pca' | key_added]['variance_ratio'] : ndarray (shape (n_comps,))
    Ratio of explained variance.
  - .uns['pca' | key_added]['variance'] : ndarray (shape (n_comps,))
    Explained variance, equivalent to the eigenvalues of the covariance matrix.
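A short sketch of working with the return values described above. The random array, the pbmc3k() dataset, and the 90%-of-captured-variance heuristic are illustrative assumptions, not scanpy defaults.

```python
import numpy as np
import scanpy as sc

# Array-like input: the PCA representation is returned directly
# (return_info=False is the default).
X = np.random.default_rng(0).normal(size=(100, 20)).astype("float32")
X_pca = sc.pp.pca(X, n_comps=10)
print(X_pca.shape)  # (100, 10)

# AnnData input: results are written into the object instead.
adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)

var_ratio = adata.uns["pca"]["variance_ratio"]
# Cumulative fraction of the variance captured by the 50 computed PCs.
frac_of_captured = np.cumsum(var_ratio) / var_ratio.sum()
# Illustrative heuristic: keep the smallest number of PCs explaining
# at least 90% of the captured variance.
n_keep = int(np.searchsorted(frac_of_captured, 0.90) + 1)
print(f"{n_keep} PCs explain {frac_of_captured[n_keep - 1]:.1%} of the captured variance")

print(adata.obsm["X_pca"].shape)  # (adata.n_obs, 50) embedding
print(adata.varm["PCs"].shape)    # (adata.n_vars, 50) loadings
```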