
8. metaQC (optional): This module helps correct for inaccuracies in ROI gating and data cutoff selection by performing unsupervised, density-based clustering with HDBSCAN on equal-sized batches of clean and noisy (previously redacted) single-cell data. Noisy cells falling within predominantly clean clusters are returned to the analysis, while clean cells falling within predominantly noisy clusters are dropped from the analysis. Users are presented with tSNE or UMAP embeddings of cells colored by 1) HDBSCAN cluster, 2) QC status, 3) reclassification status, and 4) sample. Clustering is optimized by testing different minimum cluster size (min_cluster_size) values. To aid in identifying stable clustering solutions, a range of min_cluster_size values may be passed, and the number of clusters associated with each value is printed to the terminal window. Cells in the HDBSCAN plot may be lassoed by clicking and holding the mouse button. After the name of a tissue of interest is entered in the provided text box, the selected cells appear as scatter points in their corresponding image, colored by the stage at which they were filtered from the analysis. Using the clean and noisy reclassification cutoff selectors, users specify tolerance limits on the proportions of clean and noisy data within a cluster, which determine whether cells in that cluster are reclassified; ambiguous (or unclustered) cells always maintain their original QC status annotations. Clicking the “save” button causes the program to reclassify data according to the current clustering solution and reclassification cutoffs. This module can be bypassed by setting the metaQC parameter to False (see YAML configurations below). Regardless of the metaQC parameter setting, a pie chart showing the fraction of data redacted by each of the prior QC filters is saved to <output_dir/metaQC/censored_by_stage.pdf>.
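
The reclassification rule described above can be illustrated with a minimal Python sketch. It is not the module's actual implementation: the function name `reclassify`, its arguments, and the exact cutoff semantics (a cluster counts as "predominantly clean" when its clean fraction is at least `clean_cutoff`, and as "predominantly noisy" when its noisy fraction is at least `noisy_cutoff`) are assumptions made for illustration. Cluster labels are taken as given, with -1 marking unclustered cells, which is the convention HDBSCAN uses for noise points.

```python
from collections import Counter


def reclassify(labels, is_clean, clean_cutoff=0.9, noisy_cutoff=0.9):
    """Return updated clean/noisy flags for each cell (hypothetical helper).

    labels       -- cluster label per cell; -1 means unclustered
    is_clean     -- True if the cell passed the earlier QC filters
    clean_cutoff -- assumed: minimum clean fraction for a "clean" cluster
    noisy_cutoff -- assumed: minimum noisy fraction for a "noisy" cluster
    """
    totals = Counter(labels)
    clean_counts = Counter(lab for lab, clean in zip(labels, is_clean) if clean)

    updated = []
    for lab, clean in zip(labels, is_clean):
        if lab == -1:
            # Unclustered cells keep their original QC status.
            updated.append(clean)
            continue
        clean_frac = clean_counts[lab] / totals[lab]
        if clean_frac >= clean_cutoff:
            # Predominantly clean cluster: noisy members are returned.
            updated.append(True)
        elif (1 - clean_frac) >= noisy_cutoff:
            # Predominantly noisy cluster: clean members are dropped.
            updated.append(False)
        else:
            # Ambiguous cluster: original status is preserved.
            updated.append(clean)
    return updated


# Toy data: cluster 0 is mostly clean, cluster 1 mostly noisy,
# cluster 2 is evenly mixed, and the last cell is unclustered.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, -1]
is_clean = [True, True, True, False,
            False, False, False, True,
            True, False, False]

result = reclassify(labels, is_clean, clean_cutoff=0.75, noisy_cutoff=0.75)
print(result)
```

In this toy example the single noisy cell in cluster 0 is returned to the analysis, the single clean cell in cluster 1 is dropped, and cells in the mixed cluster and the unclustered cell keep their original status. If both cutoffs are set at or below 0.5, a cluster can satisfy both conditions; this sketch then gives the clean rule precedence, which is an arbitrary choice.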

YAML configurations (config.yml)

| Parameter | Default | Description |
| --- | --- | --- |
| metaQC | True | (bool) Whether to perform data reclassification based on unsupervised clustering of combinations of clean and noisy (previously redacted) data. |
| embeddingAlgorithmQC | “UMAP” | (str) Embedding algorithm used for clustering (options: “TSNE” or “UMAP”). |
| channelExclusionsClusteringQC | [ ] | (list of strs) Immunomarkers to exclude from clustering. |
| samplesToRemoveClusteringQC | [ ] | (list of strs) Samples to exclude from clustering. |
| fracForEmbeddingQC | 1.0 | (float) Fraction of cells to be embedded (range: 0.0-1.0). Limits the amount of data passed to downstream modules. |
| dimensionEmbeddingQC | 2 | (int) Dimension of the embedding (fixed to 2 in the current version). |
| topMarkersQC | “channels” | (str) Normalization axis (“channels” or “clusters”) used to define the highest expressed markers per cluster. |
| metricQC | “euclidean” | (str) Distance metric for computing the embedding. Choose from the valid metrics used by scipy.spatial.distance.pdist: “braycurtis”, “canberra”, “chebyshev”, “cityblock”, “correlation”, “cosine”, “dice”, “euclidean”, “hamming”, “jaccard”, “jensenshannon”, “kulsinski”, “mahalanobis”, “matching”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”. |
| perplexityQC | 50.0 | (float) tSNE-specific parameter (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Perplexity is related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Different values can produce significantly different results. |
| earlyExaggerationQC | 12.0 | (float) tSNE-specific parameter (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). For larger values, the space between natural clusters will be larger in the embedded space. |
| learningRateTSNEQC | 200.0 | (float) tSNE-specific parameter (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). tSNE learning rate (typically between 10.0 and 1000.0). |
| randomStateQC | 5 | (int) tSNE-specific parameter (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Seeds the random number generator for reproducible results across multiple function calls. |
| nNeighborsQC | 5 | (int) UMAP-specific parameter (https://umap-learn.readthedocs.io/en/latest/api.html). Determines the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. |
| learningRateUMAPQC | 1.0 | (float) UMAP-specific parameter (https://umap-learn.readthedocs.io/en/latest/api.html). Determines the initial learning rate for the embedding optimization. |
| minDistQC | 0.1 | (float) UMAP-specific parameter (https://umap-learn.readthedocs.io/en/latest/api.html). Determines the effective minimum distance between embedded points. Smaller values result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. |
| repulsionStrengthQC | 5.0 | (float) UMAP-specific parameter (https://umap-learn.readthedocs.io/en/latest/api.html). Determines the weighting applied to negative samples in the low-dimensional embedding optimization. Values higher than one result in greater weight being given to negative samples. |
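
To make the split between tSNE-specific and UMAP-specific settings concrete, the following Python sketch shows one way the configuration keys above could be translated into keyword arguments for `sklearn.manifold.TSNE` or `umap.UMAP`. The helper `embedding_kwargs` and the `DEFAULTS` dictionary are hypothetical and are not part of the program; the keyword names on the right-hand side are constructor arguments documented by scikit-learn and umap-learn. How the program itself passes these values is not stated here, so the mapping is an assumption.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "embeddingAlgorithmQC": "UMAP",
    "dimensionEmbeddingQC": 2,
    "metricQC": "euclidean",
    "perplexityQC": 50.0,
    "earlyExaggerationQC": 12.0,
    "learningRateTSNEQC": 200.0,
    "randomStateQC": 5,
    "nNeighborsQC": 5,
    "learningRateUMAPQC": 1.0,
    "minDistQC": 0.1,
    "repulsionStrengthQC": 5.0,
}


def embedding_kwargs(config):
    """Map config.yml-style keys to embedding keyword arguments (hypothetical)."""
    cfg = {**DEFAULTS, **config}  # user settings override the defaults
    common = {
        "n_components": cfg["dimensionEmbeddingQC"],
        "metric": cfg["metricQC"],
    }
    algorithm = cfg["embeddingAlgorithmQC"]
    if algorithm == "TSNE":
        # Only the settings described as tSNE-specific are forwarded.
        return {
            **common,
            "perplexity": cfg["perplexityQC"],
            "early_exaggeration": cfg["earlyExaggerationQC"],
            "learning_rate": cfg["learningRateTSNEQC"],
            "random_state": cfg["randomStateQC"],
        }
    if algorithm == "UMAP":
        # Only the settings described as UMAP-specific are forwarded.
        return {
            **common,
            "n_neighbors": cfg["nNeighborsQC"],
            "learning_rate": cfg["learningRateUMAPQC"],
            "min_dist": cfg["minDistQC"],
            "repulsion_strength": cfg["repulsionStrengthQC"],
        }
    raise ValueError(f"embeddingAlgorithmQC must be 'TSNE' or 'UMAP', got {algorithm!r}")


umap_kwargs = embedding_kwargs({})
tsne_kwargs = embedding_kwargs({"embeddingAlgorithmQC": "TSNE", "perplexityQC": 30.0})
print(umap_kwargs)
print(tsne_kwargs)
```

With scikit-learn or umap-learn installed, the resulting dictionaries could be unpacked as `TSNE(**tsne_kwargs)` or `UMAP(**umap_kwargs)`. Settings belonging to the algorithm that is not selected are simply ignored in this sketch.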