12. `clustering`: this module performs density-based hierarchical clustering with HDBSCAN on UMAP (or t-SNE) embeddings to identify biologically relevant cell states in a dataset. Users enter an integer value for `Min Cluster Size` (MCS), an HDBSCAN parameter that affects the clustering result, into the `Cluster and Plot` field at the top right of the Napari window. The embedding is then shown in the Napari window colored by three different variables for review: **1)** HDBSCAN cluster, **2)** gate-based cell type classification (see gating module for details), and **3)** tissue sample. Clustered cells may be viewed in the context of a given tissue by pressing and holding the mouse (or track pad) button and lassoing data points in the HDBSCAN plot, typing the name of a sample of interest in the `Sample Name` field, and clicking the `View Lassoed Points` button. Selected cells will then appear as scatter points in their corresponding image. After each MCS entry, a separate window showing the results of silhouette analysis is also shown. Cells with positive silhouette coefficients are well matched to their current cluster, while those with negative coefficients would fit better in another cluster, which is indicative of under-clustering. To aid in cluster optimization, a range of MCS values can be entered into the `Sweep MCS Range` field; the number of clusters associated with each MCS value is then printed to the terminal window without the results being plotted in the Napari window. Clicking the `Save` button at the bottom right of the Napari viewer causes the program to append the current cluster labels to the single-cell dataframe and proceed to the next module.

### YAML configurations

Parameter | Default | Description |
---|---|---|
`embeddingAlgorithm` | "UMAP" | (str) Embedding algorithm to use for clustering (options: "TSNE" or "UMAP"). |
`channelExclusionsClustering` | [ ] | (list of strs) Immunomarkers to exclude from clustering. |
`samplesToRemoveClustering` | [ ] | (list of strs) Samples to exclude from clustering. |
`normalizeTissueCounts` | True | (bool) Make the number of cells per tissue for clustering more similar through sample-weighted random sampling. |
`fracForEmbedding` | 1.0 | (float) Fraction of cells to be embedded (range: 0.0-1.0). Limits the amount of data passed to downstream modules. |
`dimensionEmbedding` | 2 | (int) Dimension of the embedding (options: 2 or 3). |
`colormapAnnotationClustering` | "Sample" | (str) Metadata annotation used to colormap the embedding (options: "Sample" or "Condition"). |
`metric` | "euclidean" | (str) Distance metric for computing the embedding. Choose from valid metrics used by scipy.spatial.distance.pdist: "braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "cosine", "dice", "euclidean", "hamming", "jaccard", "jensenshannon", "kulsinski", "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule". |
`perplexity` | 50.0 | (float) tSNE-specific configuration. Related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require larger perplexity. Different values can produce significantly different results. |
`earlyExaggeration` | 12.0 | (float) tSNE-specific configuration. For larger values, the space between natural clusters will be larger in the embedded space. |
`learningRateTSNE` | 200.0 | (float) tSNE-specific configuration. tSNE learning rate (typically between 10.0 and 1000.0). |
`randomStateTSNE` | 5 | (int) tSNE-specific configuration. Seeds the random number generator for reproducible results across multiple function calls. |
`nNeighbors` | 6 | (int) UMAP-specific configuration. Determines the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. |
`learningRateUMAP` | 1.0 | (float) UMAP-specific configuration. Determines the initial learning rate for the embedding optimization. |
`minDist` | 0.1 | (float) UMAP-specific configuration. Determines the effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. |
`repulsionStrength` | 5.0 | (float) UMAP-specific configuration. Determines the weighting applied to negative samples in low-dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. |
`randomStateUMAP` | 5 | (int) UMAP-specific configuration. Seeds the random number generator for reproducible results across multiple function calls. |
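
To make the role of the algorithm-specific parameters concrete, the sketch below shows one plausible way the defaults in the table could be translated into keyword arguments for scikit-learn's `TSNE` and umap-learn's `UMAP`. The mapping and the helper `embedding_kwargs` are assumptions made for illustration, not the module's confirmed internals; only the t-SNE branch is executed here, on random stand-in data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Defaults copied from the table above.
config = {
    "embeddingAlgorithm": "UMAP",
    "dimensionEmbedding": 2,
    "metric": "euclidean",
    "perplexity": 50.0,
    "earlyExaggeration": 12.0,
    "learningRateTSNE": 200.0,
    "randomStateTSNE": 5,
    "nNeighbors": 6,
    "learningRateUMAP": 1.0,
    "minDist": 0.1,
    "repulsionStrength": 5.0,
    "randomStateUMAP": 5,
}


def embedding_kwargs(cfg):
    """Hypothetical helper: map YAML keys to keyword arguments of the chosen embedder."""
    common = {"n_components": cfg["dimensionEmbedding"], "metric": cfg["metric"]}
    if cfg["embeddingAlgorithm"] == "TSNE":
        return {
            **common,
            "perplexity": cfg["perplexity"],
            "early_exaggeration": cfg["earlyExaggeration"],
            "learning_rate": cfg["learningRateTSNE"],
            "random_state": cfg["randomStateTSNE"],
        }
    if cfg["embeddingAlgorithm"] == "UMAP":
        # These names match umap.UMAP's constructor; pass as umap.UMAP(**kwargs).
        return {
            **common,
            "n_neighbors": cfg["nNeighbors"],
            "learning_rate": cfg["learningRateUMAP"],
            "min_dist": cfg["minDist"],
            "repulsion_strength": cfg["repulsionStrength"],
            "random_state": cfg["randomStateUMAP"],
        }
    raise ValueError('embeddingAlgorithm must be "TSNE" or "UMAP"')


umap_kwargs = embedding_kwargs(config)
tsne_kwargs = embedding_kwargs({**config, "embeddingAlgorithm": "TSNE"})

# Random stand-in for a cells x markers matrix (200 cells, 10 channels).
rng = np.random.default_rng(5)
data = rng.normal(size=(200, 10))

# perplexity must be smaller than the number of samples, so 50.0 is valid here.
tsne_embedding = TSNE(**tsne_kwargs).fit_transform(data)
print(tsne_embedding.shape)
```

Keeping the two parameter sets separate in this way mirrors the table: tSNE-specific keys are ignored when `embeddingAlgorithm` is "UMAP", and vice versa.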