feats package

Submodules

feats.feats_batchcorrection module

feats.feats_batchcorrection.AdjustShiftVariance(X_ref, X, Corr, sigma)
feats.feats_batchcorrection.ComputeCorrectionVectors(V, X, tar_idx, sigma)
feats.feats_batchcorrection.ComputeMNNPairs(X_ref, X, k)
feats.feats_batchcorrection.ComputeSpan(X, idx, svd_dim)
feats.feats_batchcorrection.CorrectBatches(batches, correct_order, k=20, sigma=10, svd_dim=0, adj_var=True)

Implements the Mutual Nearest Neighbour (MNN) approach to correcting batch effects in the integrated and merged datasets.

Parameters
  • batches (SingleCell) – An integrated and merged SingleCell object (dataset) to correct. This dataset should have a ‘batch’ column in the celldata dataframe with batch information.

  • correct_order (list) – A Python list of batch numbers or names giving the order in which to correct the batches. The first batch in the list is the reference batch.

  • k (int, optional) – The number of Mutual Nearest Neighbours to compute between the datasets to be corrected. Default 20.

  • sigma (float, optional) – A parameter controlling the width of the Gaussian kernel smoothing function. Default 10.

  • svd_dim (int, optional) – The dimensionality of U when computing the SVD of the data, where U, S, V^T = svd(X). Default 0.

  • adj_var (bool, optional) – Whether to adjust the variance of the computed batch vectors. If True (default), the variance is adjusted; if False, it is not.

Returns

The corrected dataset.

Return type

SingleCell

Raises

ValueError – If the length of ‘correct_order’ is less than 2, or if no batch information is found in the input parameter ‘batches’.
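The idea of a mutual nearest neighbour pair can be illustrated with a small NumPy sketch. This is not the code behind ComputeMNNPairs; the function name mutual_nearest_pairs and the brute-force distance computation are assumptions made for illustration, and the data layout (d x n, one column per cell) follows the convention used elsewhere in this reference.

```python
import numpy as np

def mutual_nearest_pairs(X_ref, X_tar, k):
    """Return (i, j) pairs where reference cell i and target cell j
    are each among the other's k nearest neighbours.

    X_ref is (d x n_ref) and X_tar is (d x n_tar); columns are cells.
    Hypothetical helper, not part of feats.
    """
    # Squared Euclidean distance between every reference/target pair.
    diff = X_ref[:, :, None] - X_tar[:, None, :]
    dist = (diff ** 2).sum(axis=0)                    # shape (n_ref, n_tar)

    # k nearest target cells for each reference cell, and vice versa.
    nn_of_ref = np.argsort(dist, axis=1)[:, :k]       # shape (n_ref, k)
    nn_of_tar = np.argsort(dist, axis=0)[:k, :].T     # shape (n_tar, k)

    pairs = []
    for i in range(X_ref.shape[1]):
        for j in nn_of_ref[i]:
            # Keep the pair only if the relationship holds in both directions.
            if i in nn_of_tar[j]:
                pairs.append((i, int(j)))
    return pairs
```

A target cell whose nearest reference cell prefers a different target cell produces no pair, which is what makes the pairing "mutual".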

feats.feats_batchcorrection.IntegrateBatches(batches, name_by)

Integrates a list of SingleCell objects storing single-cell data. This function first finds common genes across all the datasets in the list and then merges all the datasets together in one SingleCell object. If no common genes are found between the datasets, then an error is generated with a message.

Parameters
  • batches (list) – A Python list of SingleCell objects (datasets) to integrate.

  • name_by (list) – A list of str giving the column names in the SingleCell objects where gene names are stored. The order of this list is the same as the order of the SingleCell objects in the batches list.

Returns

A merged dataset where the number of rows (d) is equal to the number of genes common to all the datasets and the number of columns (n) is the sum of the number of cells in all the datasets.

Return type

SingleCell

Raises

ValueError – If no common genes are found between the SingleCell datasets in the batches list.
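The gene-matching step described above can be sketched with plain Python sets. The helper name common_genes is hypothetical, and the sketch works on lists of gene names rather than on SingleCell objects.

```python
def common_genes(gene_lists):
    """Return the genes present in every list, in the order of the first list.

    Hypothetical helper, not part of feats.
    """
    shared = set(gene_lists[0]).intersection(*gene_lists[1:])
    if not shared:
        # Mirrors the documented behaviour: no overlap is an error.
        raise ValueError("No common genes found between the datasets.")
    return [gene for gene in gene_lists[0] if gene in shared]
```

For example, common_genes([["A", "B", "C"], ["C", "A", "D"]]) keeps "A" and "C", which would become the d rows of the merged dataset.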

feats.feats_batchcorrection.LogspaceAdd(log_array)
feats.feats_batchcorrection.MergeBatches(batches)

Merges a list of SingleCell objects storing single-cell data. This function assumes that the datasets are already integrated and the number and type of genes in all the datasets are the same. If this is not the case, then an error is generated.

Parameters

batches (list) – A Python list of SingleCell objects (datasets) to merge. It is assumed that the SingleCell datasets are already integrated.

Returns

A merged dataset where the number of rows (d) is equal to the number of genes in the datasets (which is the same for all the datasets) and the number of columns (n) is the sum of the number of cells in all the datasets.

Return type

SingleCell

Raises

ValueError – If the number and type of genes are not the same in all the datasets.
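A minimal sketch of the merge, assuming each dataset is available as a (d x n_i) NumPy array with an accompanying list of gene names. The function merge_counts is hypothetical and only illustrates the shape of the result and the documented error condition.

```python
import numpy as np

def merge_counts(matrices, gene_lists):
    """Concatenate (d x n_i) matrices column-wise after checking the genes match.

    Hypothetical helper, not part of feats.
    """
    reference = list(gene_lists[0])
    for genes in gene_lists[1:]:
        if list(genes) != reference:
            raise ValueError("The number and type of genes differ between datasets.")
    # Rows (genes) are shared; columns (cells) are stacked side by side.
    return np.hstack(matrices)
```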

feats.feats_batchcorrection.ReducedMean(X, idx)
feats.feats_batchcorrection.RemoveBatchEffects(ref_batch, tar_batch, k, sigma, svd_dim, adj_var)
feats.feats_batchcorrection.SqDistToLine(ref, grad, point)

feats.feats_clustering module

feats.feats_clustering.AnovaHierarchical(sc, k, normalization, q)
feats.feats_clustering.Cluster(sc, k='gap', k_max=10, normalization='mean', q=None)

Clusters gene expression data stored in SingleCell object. Stores the cluster information (labels) in the celldata assay of the SingleCell object. The column name under which it stores the cluster information is ‘FEATS_k_Clusters’, where k is the number of clusters, which the method can estimate or is defined by the user as an int or an int list.

Parameters
  • sc (SingleCell) – The SingleCell dataset containing gene expression data to be clustered.

  • k (int, list or str, optional) – This is an optional input for the number of clusters in the dataset. It is either an int, a Python list of ints or a str. If it is an int, the method will cluster the data into k groups. If it is a Python list, the method will cluster the data into the number of groups specified in the list and store the cluster information for all the values in the list. If it is the string ‘gap’, the method will use gap statistic to estimate the number of clusters and cluster the data into the number of groups estimated.

  • k_max (int, optional) – The upper limit of the number of clusters when using gap statistic to estimate the number of clusters. Ignored if k is not ‘gap’. Default 10.

  • normalization (str, optional) – The normalization method to use when normalizing the gene expression data. By default the normalized data is not saved in the SingleCell object after using the Cluster function. If you want to save the normalized counts in the SingleCell object, you can use the FeatureNormalize() function. Type help(FeatureNormalize) for more information. Options for the normalization are ‘mean’ for z-score normalization (default), ‘l2’ for L2 normalization, ‘cosine’ for Cosine normalization and None for no normalization.

  • q (list, optional) – A list of ints specifying the numbers of top features to select. For each q_i in this list, the clustering algorithm selects the top q_i features, performs clustering and computes the clustering score, and then determines the number of features that gives the best clustering. By default q = [1, 2, …, 5% of d], where d is the number of features.

Returns

  • SingleCell – The SingleCell object with the cluster information stored in the celldata assay.

  • int – The number of clusters in the dataset, k. This output is useful when k is not known and estimated by the algorithm.

Raises

ValueError – If k_max < 2. If k > n or k < 2, where n is the number of samples in the dataset. If an unknown input type is detected.
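The three accepted forms of k and the resulting column names can be summarised in a short sketch. The helper resolve_cluster_counts and its estimate_k callback are hypothetical; the validation follows the reading that k must satisfy 2 <= k <= n, which is an assumption about the Raises entry above.

```python
def resolve_cluster_counts(k, n, estimate_k=None):
    """Map the k argument to {number of clusters: celldata column name}.

    Hypothetical helper, not part of feats. estimate_k stands in for a
    gap-statistic estimator and is only called when k == 'gap'.
    """
    if isinstance(k, str):
        if k != "gap" or estimate_k is None:
            raise ValueError("k must be an int, a list of ints or 'gap'.")
        counts = [estimate_k()]
    elif isinstance(k, int):
        counts = [k]
    elif isinstance(k, list):
        counts = list(k)
    else:
        raise ValueError("k must be an int, a list of ints or 'gap'.")

    for c in counts:
        if c < 2 or c > n:
            raise ValueError("Each k must satisfy 2 <= k <= n.")
    return {c: f"FEATS_{c}_Clusters" for c in counts}
```

With k = [2, 4], this yields the column names ‘FEATS_2_Clusters’ and ‘FEATS_4_Clusters’, one per requested clustering.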

feats.feats_clustering.ClusteringScore(X, labels)
feats.feats_clustering.SelectTopNScores(clustering_score, n)

feats.feats_detectoutliers module

feats.feats_detectoutliers.DetectOutliers(sc, cluster_label, red_dim=2, outlier_prob_thres=0.0001)

This function implements the outlier detection scheme of FEATS.

Parameters
  • sc (SingleCell) – The SingleCell object which contains the data and metadata of genes and cells

  • cluster_label (str) – The name of the column in celldata assay of sc which stores the cluster labels of the cells

  • red_dim (int, optional) – The reduced dimensionality in which the outliers are computed. Default 2.

  • outlier_prob_thres (float, optional) – The probability threshold for samples to be classified as outliers. Default 10^-4.

Returns

The SingleCell object containing the outlier analysis information in the celldata assay. The following columns are added to the celldata assay:

  • ‘FEATS_Outliers’ – True if the respective cell is an outlier, False otherwise.

  • ‘FEATS_Msd’ – The computed Mahalanobis squared distance for the respective cell.

  • ‘FEATS_Outlier_Score’ – The outlier score for the respective cell.

  • ‘FEATS_Oos’ – True if the respective cell was not used by the Minimum Covariance Determinant (MCD) algorithm in computing the robust mean and covariance matrix.

Return type

SingleCell
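The relationship between a Mahalanobis squared distance and a probability threshold can be sketched as follows. This is a simplification under stated assumptions: it uses the ordinary sample mean and covariance instead of the MCD estimates, it is restricted to a 2-dimensional reduced space, and it assumes the distances are compared against a chi-square tail probability (for 2 degrees of freedom that tail is exactly exp(-x / 2)). The function flag_outliers is hypothetical.

```python
import numpy as np

def flag_outliers(Z, prob_thres=1e-4):
    """Flag rows of Z, an (n x 2) array with one row per cell, as outliers.

    Hypothetical helper, not part of feats. Uses the plain sample mean and
    covariance (not MCD) and the chi-square tail with 2 degrees of freedom.
    """
    centre = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    diff = Z - centre
    # Mahalanobis squared distance of every row from the centre.
    msd = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    # P(chi2 with 2 dof > msd) = exp(-msd / 2).
    tail_prob = np.exp(-msd / 2.0)
    return msd, tail_prob < prob_thres
```

A robust estimator such as MCD matters in practice because a plain covariance estimate is itself inflated by the outliers it is meant to detect.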

feats.feats_filtering module

feats.feats_filtering.CellFilter(sc, min_genes, max_genes, expr_thres=0)

Filters the cells in which the genes are lowly and/or highly expressed. Returns the sliced SingleCell object with the cells selected through the filtering criteria. This function skips filtering and warns the user if the filtering criteria filter out all the samples/cells.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • min_genes (int) – The minimum number of genes that must be expressed in the cell above the expression threshold. The value should be >= 0 and <= d, where d is the number of genes in the dataset.

  • max_genes (int) – The maximum number of genes that may be expressed in the cell above the expression threshold. The value should be >= 0 and <= d, where d is the number of genes in the dataset.

  • expr_thres (int, optional) – The expression threshold. Default 0. The gene is considered not expressed if the expression count is <= this threshold.

Returns

sc – A sliced SingleCell object containing the filtered cells.

Return type

SingleCell
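The filtering criterion can be sketched on a raw (d x n) count matrix. The helper cell_mask is hypothetical, and treating both bounds as inclusive is an assumption; the fallback to keeping every cell mirrors the documented behaviour of skipping the filter with a warning.

```python
import warnings
import numpy as np

def cell_mask(counts, min_genes, max_genes, expr_thres=0):
    """Boolean mask over cells (columns) of a (d x n) count matrix.

    Hypothetical helper, not part of feats.
    """
    # A gene counts as expressed in a cell when its count exceeds the threshold.
    expressed_per_cell = (counts > expr_thres).sum(axis=0)
    keep = (expressed_per_cell >= min_genes) & (expressed_per_cell <= max_genes)
    if not keep.any():
        warnings.warn("Criteria would remove every cell; skipping the filter.")
        return np.ones(counts.shape[1], dtype=bool)
    return keep
```

The same pattern applies to gene filtering by summing over axis=1 and bounding the number of cells instead.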

feats.feats_filtering.GeneFilter(sc, min_cells, max_cells, expr_thres=0)

Filters the lowly and/or highly expressed genes stored in the SingleCell object based on the expression counts. Returns the sliced SingleCell object with the genes selected through the filtering criteria. This function skips filtering and warns the user if the filtering criteria filter out all the genes.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • min_cells (int) – The minimum number of cells in which the gene must be expressed above the expression threshold. The value should be >= 0 and <= n, where n is the number of cells in the dataset.

  • max_cells (int) – The maximum number of cells in which the gene may be expressed above the expression threshold. The value should be >= 0 and <= n, where n is the number of cells in the dataset.

  • expr_thres (int, optional) – The expression threshold. Default 0. The gene is considered not expressed if the expression count is <= this threshold.

Returns

sc – A sliced SingleCell object containing the filtered genes.

Return type

SingleCell

feats.feats_filtering.HVGFilter(sc, num_genes, name_by='gene_names')

Selects the Highly Variable Genes in the SingleCell object by computing the dispersion of each gene. Returns a sliced SingleCell object containing the top Highly Variable Genes. Stores additional information such as dispersion (‘FEATS_dispersion’), mean (‘FEATS_mean’) and variance (‘FEATS_variance’) in the genedata assay of the SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • num_genes (int) – The number of Highly Variable Genes to select.

  • name_by (str) – The name of the column in SingleCell object that stores the gene names. This is used to print the top Highly Variable Genes.

Returns

sc – A sliced SingleCell object containing the top Highly Variable Genes.

Return type

SingleCell
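A minimal sketch of dispersion-based selection, assuming dispersion is defined as variance divided by mean for each gene. That definition is a common choice but is an assumption here, as is the helper name top_dispersion_genes.

```python
import numpy as np

def top_dispersion_genes(counts, num_genes):
    """Return the row indices of the num_genes most dispersed genes.

    counts is a (d x n) matrix with one row per gene.
    Hypothetical helper, not part of feats; dispersion = variance / mean.
    """
    mean = counts.mean(axis=1)
    variance = counts.var(axis=1)
    # Genes with zero mean get dispersion 0 instead of a division error.
    dispersion = np.divide(variance, mean,
                           out=np.zeros_like(variance), where=mean > 0)
    order = np.argsort(dispersion)[::-1]
    return order[:num_genes], dispersion
```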

feats.feats_filtering.LogFilter(sc)

Computes and stores the log-transformed gene expression data in the SingleCell object. The log2 transform is applied after adding a pseudocount of 1.

Parameters

sc (SingleCell) – The SingleCell object containing gene expression data.

Returns

sc – The SingleCell object containing the log-transformed gene expression data.

Return type

SingleCell
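On a raw count matrix, the transform described above amounts to the following one-liner. The function name log_transform is hypothetical; it operates on a NumPy array rather than on a SingleCell object.

```python
import numpy as np

def log_transform(counts):
    """Apply log2(x + 1) element-wise; zero counts stay at zero."""
    return np.log2(np.asarray(counts, dtype=float) + 1.0)
```

The pseudocount keeps zero counts finite, since log2(0 + 1) = 0.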

feats.feats_gapstatistics module

feats.feats_gapstatistics.ComputeWk(X, labels, classes)

X is a (d x n) data matrix, where n is the number of samples and d is the dimensionality. labels is an n-dimensional vector of class labels.
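In the gap statistic literature, W_k is the pooled within-cluster dispersion, which for Euclidean distance equals the sum of squared distances from each sample to its cluster centroid. The sketch below computes that quantity; whether ComputeWk uses exactly this form is an assumption, and the helper name is hypothetical.

```python
import numpy as np

def within_cluster_dispersion(X, labels):
    """Sum of squared distances of samples to their cluster centroids.

    X is (d x n) with one column per sample; labels has length n.
    Hypothetical helper, not part of feats.
    """
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[:, labels == c]
        centroid = members.mean(axis=1, keepdims=True)
        total += ((members - centroid) ** 2).sum()
    return total
```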

feats.feats_gapstatistics.GapStatistics(sc_obj, k_max, B)

Computes the gap statistic and estimates the number of clusters in the gene expression dataset contained in SingleCell object.

Parameters
  • sc_obj (SingleCell) – The SingleCell object containing gene expression data and the metadata.

  • k_max (int) – The upper limit of the number of clusters.

  • B (int) – The number of reference datasets to generate to compute the gap quantities.

Returns

  • est_clusters (int) – The estimate of the number of clusters in the dataset.

  • Gap_k (list) – The gap statistic quantity for gap. The list contains gap values for each k, where k = 1, 2, …, k_max.

  • s_k (list) – The gap statistic quantity for standard deviation. The list contains the standard deviation for each k, where k = 1, 2, …, k_max.

  • W_k (list) – A gap statistic quantity for each k, where k = 1, 2, …, k_max.

  • w_bar (list) – A gap statistic quantity for each k, where k = 1, 2, …, k_max.
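Given the Gap_k and s_k lists, the standard selection rule from the gap statistic method picks the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). The sketch below applies that rule; it is an assumption that GapStatistics derives est_clusters this way, and the helper name is hypothetical.

```python
def estimate_num_clusters(gap, s):
    """Smallest k with gap[k] >= gap[k+1] - s[k+1], for k = 1, 2, ...

    gap and s are lists indexed from k = 1 (so gap[0] is Gap(1)).
    Hypothetical helper, not part of feats.
    """
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1
    # No k satisfied the rule; fall back to the largest k tried.
    return len(gap)
```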

feats.feats_transformations module

feats.feats_transformations.GramMatrix(K)
feats.feats_transformations.PCA(sc, n_comp=1, dist_or_kernel='linear')

Computes and stores the Principal Components of the gene expression data stored in the SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • n_comp (int, optional) – The number of Principal Components to compute. Default 1.

  • dist_or_kernel (str, optional) – The distance metric or the kernel to compute. If a distance metric is passed, it computes the pairwise distance between the cells and then converts the distance metrics to kernels. If a kernel is passed, it computes the kernel. Valid values are ‘linear’ (default) for linear kernel, ‘spearmans’ for Spearmans distance, ‘euclidean’ for Euclidean distance and ‘pearsons’ for Pearsons distance.

Returns

sc – The SingleCell object containing the dimensionality-reduced gene expression data. The reduced dimensionality is n_comp. The gene names are removed and the features in the reduced space are named ‘PC1’, ‘PC2’ and so on.

Return type

SingleCell
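For the ‘linear’ option, computing principal components through a kernel can be sketched as follows: build the linear kernel, centre it, and take the leading eigenvectors. This is a generic kernel PCA sketch, not the code of feats.PCA; the function name and the (n x d) input layout (rows are samples, as for linear_kernel) are assumptions.

```python
import numpy as np

def linear_kernel_pca(X, n_comp=1):
    """Project the rows of X, an (n x d) array, onto the top n_comp components.

    Hypothetical helper, not part of feats.
    """
    n = X.shape[0]
    K = X @ X.T                                  # (n x n) linear kernel
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    Kc = H @ K @ H                               # centred kernel
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_comp]
    # Scale each eigenvector by sqrt(eigenvalue) to obtain component scores.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
```

The sign of each component is arbitrary, so results should be compared up to a sign flip.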

feats.feats_transformations.dist_to_kernel(D)
feats.feats_transformations.euclidean(X)

X is an (n x d) matrix where rows are samples. Returns D, an (n x n) matrix of distances between samples.

feats.feats_transformations.linear_kernel(X)

X is an (n x d) matrix where rows are samples. Returns K, an (n x n) kernel matrix.

feats.feats_transformations.pearsons(X)

X is an (n x d) matrix where rows are samples. Returns D, an (n x n) matrix of distances between samples.

feats.feats_transformations.spearmans(X)

X is an (n x d) matrix where rows are samples. Returns D, an (n x n) matrix of distances between samples.
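The three pairwise distances can be sketched in NumPy as below. These are generic definitions, not the library's code: the correlation-based distances are taken as 1 minus the correlation coefficient, which is an assumption, and the Spearman sketch ranks values with a double argsort that does not average tied ranks.

```python
import numpy as np

def euclidean_distances(X):
    """(n x n) Euclidean distances between the rows of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(d2, 0.0, None))

def pearson_distances(X):
    """(n x n) distances defined as 1 - Pearson correlation between rows."""
    return 1.0 - np.corrcoef(X)

def spearman_distances(X):
    """(n x n) distances defined as 1 - Spearman correlation between rows.

    Ranks come from a double argsort, so ties are not averaged.
    """
    ranks = X.argsort(axis=1).argsort(axis=1).astype(float)
    return 1.0 - np.corrcoef(ranks)
```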

feats.feats_utils module

feats.feats_utils.FeatureNormalize(sc, norm)

Computes and stores the normalized gene expression data stored in SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • norm (str) – The normalization to perform. Accepted values are ‘l2’, ‘mean’, ‘norm6’ and ‘cosine’.

Returns

sc – The SingleCell object containing the normalized gene expression data.

Return type

SingleCell
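Two of the listed normalizations can be sketched on a raw (d x n) matrix. The sketch assumes that each gene (row) is normalized across cells, which this reference does not state, and it covers only ‘mean’ and ‘l2’; the helper name normalize_features is hypothetical.

```python
import numpy as np

def normalize_features(X, norm):
    """Normalize each row (gene) of a (d x n) matrix.

    Hypothetical helper, not part of feats. Row-wise normalization is an
    assumption; constant or all-zero rows are left at zero.
    """
    X = np.asarray(X, dtype=float)
    if norm == "mean":
        # z-score: subtract the row mean and divide by the row std.
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True)
        sd[sd == 0] = 1.0
        return (X - mu) / sd
    if norm == "l2":
        # Scale each row to unit Euclidean length.
        length = np.linalg.norm(X, axis=1, keepdims=True)
        length[length == 0] = 1.0
        return X / length
    raise ValueError("This sketch only covers 'mean' and 'l2'.")
```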

Module contents

FEATS

FEATS is a new Python tool for performing the following downstream analyses on single-cell RNA-seq datasets:

1. Clustering
2. Estimating the number of clusters
3. Outlier detection
4. Batch correction and integration of data from multiple experiments

See https://github.com/edwinv87/feats for more information and documentation.

feats.CellFilter(sc, min_genes, max_genes, expr_thres=0)

Filters the cells in which the genes are lowly and/or highly expressed. Returns the sliced SingleCell object with the cells selected through the filtering criteria. This function skips filtering and warns the user if the filtering criteria filter out all the samples/cells.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • min_genes (int) – The minimum number of genes that must be expressed in the cell above the expression threshold. The value should be >= 0 and <= d, where d is the number of genes in the dataset.

  • max_genes (int) – The maximum number of genes that may be expressed in the cell above the expression threshold. The value should be >= 0 and <= d, where d is the number of genes in the dataset.

  • expr_thres (int, optional) – The expression threshold. Default 0. The gene is considered not expressed if the expression count is <= this threshold.

Returns

sc – A sliced SingleCell object containing the filtered cells.

Return type

SingleCell

feats.Cluster(sc, k='gap', k_max=10, normalization='mean', q=None)

Clusters gene expression data stored in SingleCell object. Stores the cluster information (labels) in the celldata assay of the SingleCell object. The column name under which it stores the cluster information is ‘FEATS_k_Clusters’, where k is the number of clusters, which the method can estimate or is defined by the user as an int or an int list.

Parameters
  • sc (SingleCell) – The SingleCell dataset containing gene expression data to be clustered.

  • k (int, list or str, optional) – This is an optional input for the number of clusters in the dataset. It is either an int, a Python list of ints or a str. If it is an int, the method will cluster the data into k groups. If it is a Python list, the method will cluster the data into the number of groups specified in the list and store the cluster information for all the values in the list. If it is the string ‘gap’, the method will use gap statistic to estimate the number of clusters and cluster the data into the number of groups estimated.

  • k_max (int, optional) – The upper limit of the number of clusters when using gap statistic to estimate the number of clusters. Ignored if k is not ‘gap’. Default 10.

  • normalization (str, optional) – The normalization method to use when normalizing the gene expression data. By default the normalized data is not saved in the SingleCell object after using the Cluster function. If you want to save the normalized counts in the SingleCell object, you can use the FeatureNormalize() function. Type help(FeatureNormalize) for more information. Options for the normalization are ‘mean’ for z-score normalization (default), ‘l2’ for L2 normalization, ‘cosine’ for Cosine normalization and None for no normalization.

  • q (list, optional) – A list of ints specifying the numbers of top features to select. For each q_i in this list, the clustering algorithm selects the top q_i features, performs clustering and computes the clustering score, and then determines the number of features that gives the best clustering. By default q = [1, 2, …, 5% of d], where d is the number of features.

Returns

  • SingleCell – The SingleCell object with the cluster information stored in the celldata assay.

  • int – The number of clusters in the dataset, k. This output is useful when k is not known and estimated by the algorithm.

Raises

ValueError – If k_max < 2. If k > n or k < 2, where n is the number of samples in the dataset. If an unknown input type is detected.

feats.CorrectBatches(batches, correct_order, k=20, sigma=10, svd_dim=0, adj_var=True)

Implements the Mutual Nearest Neighbour (MNN) approach to correcting batch effects in the integrated and merged datasets.

Parameters
  • batches (SingleCell) – An integrated and merged SingleCell object (dataset) to correct. This dataset should have a ‘batch’ column in the celldata dataframe with batch information.

  • correct_order (list) – A Python list of batch numbers or names giving the order in which to correct the batches. The first batch in the list is the reference batch.

  • k (int, optional) – The number of Mutual Nearest Neighbours to compute between the datasets to be corrected. Default 20.

  • sigma (float, optional) – A parameter controlling the width of the Gaussian kernel smoothing function. Default 10.

  • svd_dim (int, optional) – The dimensionality of U when computing the SVD of the data, where U, S, V^T = svd(X). Default 0.

  • adj_var (bool, optional) – Whether to adjust the variance of the computed batch vectors. If True (default), the variance is adjusted; if False, it is not.

Returns

The corrected dataset.

Return type

SingleCell

Raises

ValueError – If the length of ‘correct_order’ is less than 2, or if no batch information is found in the input parameter ‘batches’.

feats.DetectOutliers(sc, cluster_label, red_dim=2, outlier_prob_thres=0.0001)

This function implements the outlier detection scheme of FEATS.

Parameters
  • sc (SingleCell) – The SingleCell object which contains the data and metadata of genes and cells

  • cluster_label (str) – The name of the column in celldata assay of sc which stores the cluster labels of the cells

  • red_dim (int, optional) – The reduced dimensionality in which the outliers are computed. Default 2.

  • outlier_prob_thres (float, optional) – The probability threshold for samples to be classified as outliers. Default 10^-4.

Returns

The SingleCell object containing the outlier analysis information in the celldata assay. The following columns are added to the celldata assay:

  • ‘FEATS_Outliers’ – True if the respective cell is an outlier, False otherwise.

  • ‘FEATS_Msd’ – The computed Mahalanobis squared distance for the respective cell.

  • ‘FEATS_Outlier_Score’ – The outlier score for the respective cell.

  • ‘FEATS_Oos’ – True if the respective cell was not used by the Minimum Covariance Determinant (MCD) algorithm in computing the robust mean and covariance matrix.

Return type

SingleCell

feats.FeatureNormalize(sc, norm)

Computes and stores the normalized gene expression data stored in SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • norm (str) – The normalization to perform. Accepted values are ‘l2’, ‘mean’, ‘norm6’ and ‘cosine’.

Returns

sc – The SingleCell object containing the normalized gene expression data.

Return type

SingleCell

feats.GapStatistics(sc_obj, k_max, B)

Computes the gap statistic and estimates the number of clusters in the gene expression dataset contained in SingleCell object.

Parameters
  • sc_obj (SingleCell) – The SingleCell object containing gene expression data and the metadata.

  • k_max (int) – The upper limit of the number of clusters.

  • B (int) – The number of reference datasets to generate to compute the gap quantities.

Returns

  • est_clusters (int) – The estimate of the number of clusters in the dataset.

  • Gap_k (list) – The gap statistic quantity for gap. The list contains gap values for each k, where k = 1, 2, …, k_max.

  • s_k (list) – The gap statistic quantity for standard deviation. The list contains the standard deviation for each k, where k = 1, 2, …, k_max.

  • W_k (list) – A gap statistic quantity for each k, where k = 1, 2, …, k_max.

  • w_bar (list) – A gap statistic quantity for each k, where k = 1, 2, …, k_max.

feats.GeneFilter(sc, min_cells, max_cells, expr_thres=0)

Filters the lowly and/or highly expressed genes stored in the SingleCell object based on the expression counts. Returns the sliced SingleCell object with the genes selected through the filtering criteria. This function skips filtering and warns the user if the filtering criteria filter out all the genes.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • min_cells (int) – The minimum number of cells in which the gene must be expressed above the expression threshold. The value should be >= 0 and <= n, where n is the number of cells in the dataset.

  • max_cells (int) – The maximum number of cells in which the gene may be expressed above the expression threshold. The value should be >= 0 and <= n, where n is the number of cells in the dataset.

  • expr_thres (int, optional) – The expression threshold. Default 0. The gene is considered not expressed if the expression count is <= this threshold.

Returns

sc – A sliced SingleCell object containing the filtered genes.

Return type

SingleCell

feats.HVGFilter(sc, num_genes, name_by='gene_names')

Selects the Highly Variable Genes in the SingleCell object by computing the dispersion of each gene. Returns a sliced SingleCell object containing the top Highly Variable Genes. Stores additional information such as dispersion (‘FEATS_dispersion’), mean (‘FEATS_mean’) and variance (‘FEATS_variance’) in the genedata assay of the SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • num_genes (int) – The number of Highly Variable Genes to select.

  • name_by (str) – The name of the column in SingleCell object that stores the gene names. This is used to print the top Highly Variable Genes.

Returns

sc – A sliced SingleCell object containing the top Highly Variable Genes.

Return type

SingleCell

feats.IntegrateBatches(batches, name_by)

Integrates a list of SingleCell objects storing single-cell data. This function first finds common genes across all the datasets in the list and then merges all the datasets together in one SingleCell object. If no common genes are found between the datasets, then an error is generated with a message.

Parameters
  • batches (list) – A Python list of SingleCell objects (datasets) to integrate.

  • name_by (list) – A list of str giving the column names in the SingleCell objects where gene names are stored. The order of this list is the same as the order of the SingleCell objects in the batches list.

Returns

A merged dataset where the number of rows (d) is equal to the number of genes common to all the datasets and the number of columns (n) is the sum of the number of cells in all the datasets.

Return type

SingleCell

Raises

ValueError – If no common genes are found between the SingleCell datasets in the batches list.

feats.LogFilter(sc)

Computes and stores the log-transformed gene expression data in the SingleCell object. The log2 transform is applied after adding a pseudocount of 1.

Parameters

sc (SingleCell) – The SingleCell object containing gene expression data.

Returns

sc – The SingleCell object containing the log-transformed gene expression data.

Return type

SingleCell

feats.MergeBatches(batches)

Merges a list of SingleCell objects storing single-cell data. This function assumes that the datasets are already integrated and the number and type of genes in all the datasets are the same. If this is not the case, then an error is generated.

Parameters

batches (list) – A Python list of SingleCell objects (datasets) to merge. It is assumed that the SingleCell datasets are already integrated.

Returns

A merged dataset where the number of rows (d) is equal to the number of genes in the datasets (which is the same for all the datasets) and the number of columns (n) is the sum of the number of cells in all the datasets.

Return type

SingleCell

Raises

ValueError – If the number and type of genes are not the same in all the datasets.

feats.PCA(sc, n_comp=1, dist_or_kernel='linear')

Computes and stores the Principal Components of the gene expression data stored in the SingleCell object.

Parameters
  • sc (SingleCell) – The SingleCell object containing gene expression data.

  • n_comp (int, optional) – The number of Principal Components to compute. Default 1.

  • dist_or_kernel (str, optional) – The distance metric or the kernel to compute. If a distance metric is passed, it computes the pairwise distance between the cells and then converts the distance metrics to kernels. If a kernel is passed, it computes the kernel. Valid values are ‘linear’ (default) for linear kernel, ‘spearmans’ for Spearmans distance, ‘euclidean’ for Euclidean distance and ‘pearsons’ for Pearsons distance.

Returns

sc – The SingleCell object containing the dimensionality-reduced gene expression data. The reduced dimensionality is n_comp. The gene names are removed and the features in the reduced space are named ‘PC1’, ‘PC2’ and so on.

Return type

SingleCell