metaQC

8. metaQC (optional): this module helps control for curation bias by correcting inaccuracies in ROI gating and data cutoff placement. It performs unsupervised clustering on equal-sized batches of clean (retained) and noisy (redacted) single-cell data using a combination of UMAP (or t-SNE) and HDBSCAN. Noisy cells clustering with predominantly clean cells are returned to the dataframe, while clean cells clustering with predominantly noisy cells are dropped from the dataframe.

After selecting a Min Cluster Size (MCS) value and clicking the Cluster and Plot button in the Plot Single MCS widget at the top right of the Napari viewer, users are presented with UMAP (or t-SNE) embeddings of cells colored by 1) HDBSCAN cluster, 2) QC status, 3) reclassification status, and 4) sample. Clustering is optimized by testing different MCS values; MCS (min_cluster_size) is an HDBSCAN parameter that significantly affects the clustering result (see the HDBSCAN documentation for details). To assist in identifying a stable clustering solution, a range of min_cluster_size values may be entered into the Sweep MCS Range widget at the right of the Napari viewer; the number of clusters associated with each min_cluster_size is then printed to the terminal window.

Cells in the HDBSCAN plot can be lassoed by pressing and holding the mouse (or track pad) button and drawing around the cells of interest. To visualize the lassoed cells in a given sample, enter the name of the sample into the Sample Name field and click the View Lassoed Points button. Selected cells will appear as scatter points in their corresponding image, colored by the module used to filter them from the analysis. Using the clean and noisy reclassification cutoff selectors, users can specify tolerance limits on the proportion of a cluster composed of clean (Reclass Clean) and noisy (Reclass Noisy) data for the cells in that cluster to be reclassified.
Unclustered cells (i.e., cells with HDBSCAN cluster label -1) whose original QC status is clean are reclassified as noisy.
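
The reclassification rule described above can be sketched in a few lines of Python. This is an illustration only, not CyLinter's implementation: the function name `reclassify` is hypothetical, and the exact meaning of the two cutoffs is an assumption (here, each is treated as the minimum fraction of a cluster that must be clean, or noisy, before cells of the opposite status in that cluster are flipped).

```python
from collections import Counter

UNCLUSTERED = -1  # HDBSCAN's label for cells not assigned to any cluster


def reclassify(labels, statuses, reclass_clean, reclass_noisy):
    """Return a new list of QC statuses ("clean" or "noisy"), one per cell.

    labels        -- HDBSCAN cluster label per cell (-1 means unclustered)
    statuses      -- original QC status per cell
    reclass_clean -- ASSUMED meaning: minimum clean fraction a cluster needs
                     for its noisy cells to be returned as clean
    reclass_noisy -- ASSUMED meaning: minimum noisy fraction a cluster needs
                     for its clean cells to be dropped as noisy
    """
    if len(labels) != len(statuses):
        raise ValueError("labels and statuses must have the same length")

    sizes = Counter(labels)
    clean_counts = Counter(
        label for label, status in zip(labels, statuses) if status == "clean"
    )

    updated = []
    for label, status in zip(labels, statuses):
        if label == UNCLUSTERED:
            # Unclustered clean cells become noisy; unclustered noisy cells
            # simply stay noisy.
            updated.append("noisy")
            continue
        size = sizes[label]
        clean_frac = clean_counts[label] / size
        noisy_frac = (size - clean_counts[label]) / size
        if status == "noisy" and clean_frac >= reclass_clean:
            updated.append("clean")
        elif status == "clean" and noisy_frac >= reclass_noisy:
            updated.append("noisy")
        else:
            updated.append(status)
    return updated


# Toy data: cluster 0 is mostly clean, cluster 1 is mostly noisy,
# and the last two cells are unclustered.
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]
statuses = ["clean", "clean", "clean", "noisy",
            "noisy", "noisy", "noisy", "clean",
            "clean", "noisy"]
updated = reclassify(labels, statuses, reclass_clean=0.75, reclass_noisy=0.75)
```

With these toy inputs, the single noisy cell in cluster 0 is returned as clean, the single clean cell in cluster 1 is dropped as noisy, and both unclustered cells end up noisy.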

Clicking the Save button at the bottom right of the Napari viewer causes the program to reclassify the data according to the current clustering solution and reclassification cutoffs. After the first chunk of clean and noisy data has been reclassified, additional chunks are reclassified using the same UMAP, HDBSCAN, and reclassification parameters. To redefine the clustering or reclassification cutoffs, remove the metadata associated with the metaQC module from `cylinter_report.yml` (located in the CyLinter output directory specified in `cylinter_config.yml`) and re-run the metaQC module with `cylinter --module metaQC cylinter_config.yml`. This module can be bypassed by setting the `metaQC` parameter to False (see YAML configurations below). Regardless of the `metaQC` parameter setting, a pie chart showing the fraction of data redacted by each QC data filtration module (selectROIs, intensityFilter, areaFilter, cycleCorrelation, pruneOutliers) is saved to the output subdirectory for the metaQC module (censored_by_stage.pdf).
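
As a rough illustration of the quantity behind the pie chart, the sketch below tallies the fraction of cells redacted by each filter module. It is not CyLinter code: the function name `redacted_fractions` and its input format are hypothetical, and using the total pre-filtering cell count as the denominator is an assumption, since the text does not state how the fractions are normalized.

```python
from collections import Counter

# Filter modules named in the text, in the order the text lists them.
QC_MODULES = ["selectROIs", "intensityFilter", "areaFilter",
              "cycleCorrelation", "pruneOutliers"]


def redacted_fractions(redacted_by, total_cells):
    """Map each module to the fraction of all cells it redacted.

    redacted_by -- iterable with one module name per redacted cell
    total_cells -- number of cells before any filtering (ASSUMED denominator)
    """
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    counts = Counter(redacted_by)
    unknown = set(counts) - set(QC_MODULES)
    if unknown:
        raise ValueError(f"unknown module(s): {sorted(unknown)}")
    # Modules that redacted nothing are reported with a fraction of 0.0.
    return {module: counts[module] / total_cells for module in QC_MODULES}


# Toy example: 20 of 100 cells were redacted across three modules.
redacted_by = (["selectROIs"] * 10
               + ["intensityFilter"] * 5
               + ["pruneOutliers"] * 5)
fractions = redacted_fractions(redacted_by, total_cells=100)
retained = 1.0 - sum(fractions.values())  # share of cells never redacted
```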

YAML configurations

| Parameter | Default | Description |
| --- | --- | --- |
| `metaQC` | True | (bool) Whether to perform data reclassification based on unsupervised clustering results of combinations of clean and noisy (previously-redacted) data. |
| `embeddingAlgorithmQC` | "UMAP" | (str) Embedding algorithm used for clustering (options: "TSNE" or "UMAP"). |
| `channelExclusionsClusteringQC` | [ ] | (list of strs) Immunomarkers to exclude from clustering. |
| `samplesToRemoveClusteringQC` | [ ] | (list of strs) Samples to exclude from clustering. |
| `percentDataPerChunk` | 0.2 | (float) Fraction of data (range: 0.0-1.0) to undergo embedding and clustering per reclassification cycle. |
| `colormapAnnotationQC` | "Sample" | (str) Metadata annotation to colormap the embedding: `Sample` or `Condition`. |
| `metricQC` | "euclidean" | (str) Distance metric for computing embedding. Choose from valid metrics used by scipy.spatial.distance.pdist: "braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "cosine", "dice", "euclidean", "hamming", "jaccard", "jensenshannon", "kulsinski", "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule". |
| `perplexityQC` | 50.0 | (float) This is a tSNE-specific configuration (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require larger perplexity. Different values can result in significantly different results. |
| `earlyExaggerationQC` | 12.0 | (float) This is a tSNE-specific configuration (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). For larger values, the space between natural clusters will be larger in the embedded space. |
| `learningRateTSNEQC` | 200.0 | (float) This is a tSNE-specific configuration (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). tSNE learning rate (typically between 10.0 and 1000.0). |
| `randomStateQC` | 5 | (int) This is a tSNE-specific configuration (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). It determines the random number generator for reproducible results across multiple function calls. |
| `nNeighborsQC` | 5 | (int) This is a UMAP-specific configuration (https://umap-learn.readthedocs.io/en/latest/api.html). It determines the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. |
| `learningRateUMAPQC` | 1.0 | (float) This is a UMAP-specific configuration (https://umap-learn.readthedocs.io/en/latest/api.html). It determines the initial learning rate for the embedding optimization. |
| `minDistQC` | 0.1 | (float) This is a UMAP-specific configuration (https://umap-learn.readthedocs.io/en/latest/api.html). It determines the effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. |
| `repulsionStrengthQC` | 5.0 | (float) This is a UMAP-specific configuration (https://umap-learn.readthedocs.io/en/latest/api.html). It determines the weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. |
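
The constraints stated in the table can be checked before a run with a small helper such as the one sketched below. This is a hypothetical illustration, not part of CyLinter: the names `METAQC_DEFAULTS` and `check_metaqc_settings` are invented here, the defaults are copied from the table, and the code assumes the settings have already been loaded from `cylinter_config.yml` into a plain Python dict.

```python
# Defaults copied from the parameter table above.
METAQC_DEFAULTS = {
    "metaQC": True,
    "embeddingAlgorithmQC": "UMAP",
    "channelExclusionsClusteringQC": [],
    "samplesToRemoveClusteringQC": [],
    "percentDataPerChunk": 0.2,
    "colormapAnnotationQC": "Sample",
    "metricQC": "euclidean",
    "perplexityQC": 50.0,
    "earlyExaggerationQC": 12.0,
    "learningRateTSNEQC": 200.0,
    "randomStateQC": 5,
    "nNeighborsQC": 5,
    "learningRateUMAPQC": 1.0,
    "minDistQC": 0.1,
    "repulsionStrengthQC": 5.0,
}


def check_metaqc_settings(overrides):
    """Merge overrides onto the defaults and return (settings, problems).

    problems is a list of human-readable strings; it is empty when every
    check derived from the table passes.
    """
    unknown = sorted(set(overrides) - set(METAQC_DEFAULTS))
    settings = {**METAQC_DEFAULTS, **overrides}
    problems = [f"unknown parameter: {name}" for name in unknown]

    if not isinstance(settings["metaQC"], bool):
        problems.append("metaQC must be True or False")
    if settings["embeddingAlgorithmQC"] not in ("TSNE", "UMAP"):
        problems.append('embeddingAlgorithmQC must be "TSNE" or "UMAP"')
    if settings["colormapAnnotationQC"] not in ("Sample", "Condition"):
        problems.append('colormapAnnotationQC must be "Sample" or "Condition"')

    chunk = settings["percentDataPerChunk"]
    if not isinstance(chunk, (int, float)) or not 0.0 <= chunk <= 1.0:
        problems.append("percentDataPerChunk must be a number in 0.0-1.0")

    neighbors = settings["nNeighborsQC"]
    if not isinstance(neighbors, int) or not 2 <= neighbors <= 100:
        # The table gives 2 to 100 as a general guideline, not a hard limit.
        problems.append("nNeighborsQC is outside the suggested range 2 to 100")

    return settings, problems


# Example: switch to t-SNE and process half of the data per cycle.
settings, problems = check_metaqc_settings(
    {"embeddingAlgorithmQC": "TSNE", "percentDataPerChunk": 0.5}
)
```

Returning a list of problems rather than raising on the first one lets all configuration mistakes be reported in a single pass.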