Clustering

In CodexMAV, you can view clusters of cells. These clusters may come from CodexMAV itself or from external applications.

Import clustering

CodexMAV allows you to import FCS files that were clustered with an external clustering tool. The expected format is the same as the raw FCS data format (one FCS file per region), but the clustered FCS files should contain an additional column (named ClusterID or similar) which will contain cluster assignment information.

After clicking the 'Import clustering run' item in the 'File' menu, pick the clustering result folder in the file chooser and confirm that the selected clustering run belongs to the currently opened experiment.

The selected clustering folder should contain one FCS file for each region. Each file name should start with the region prefix (for example, reg001) and contain the word 'compensated'. All of these FCS files must contain X, Y, and Z columns. For more information, see FCS files requirements.
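The naming rules above can be checked before starting an import. The following is a minimal sketch and not part of CodexMAV: the function names, the `reg` plus three digits prefix pattern (inferred from the `reg001` example), and the lowercase `.fcs` extension are all assumptions. It inspects file names only and does not open the files to verify the X, Y, and Z columns.

```python
import re
from pathlib import Path

# Assumed prefix pattern, inferred from the "reg001" example above.
REGION_PREFIX = re.compile(r"^(reg\d{3})")

def check_clustering_folder(folder, expected_regions):
    """Return a dict mapping each expected region to its qualifying FCS file names."""
    matches = {region: [] for region in expected_regions}
    for path in sorted(Path(folder).glob("*.fcs")):
        m = REGION_PREFIX.match(path.name)
        # A file qualifies if it starts with a known region prefix
        # and its name contains the word 'compensated'.
        if m and "compensated" in path.name and m.group(1) in matches:
            matches[m.group(1)].append(path.name)
    return matches

def missing_regions(matches):
    """Regions that do not have exactly one qualifying FCS file."""
    return [region for region, files in matches.items() if len(files) != 1]
```

Calling `missing_regions(check_clustering_folder(folder, regions))` with the experiment's region names returns an empty list when every region has exactly one qualifying file.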

After a successful clustering run import, the next step described below will become available.

Open clustering

To open a clustering run, click the 'Open clustering run' item in the 'File' menu.

Then select a clustering run and click 'OK'. The Clustering panel will be updated with the selected clustering run (see below).

Note: you can select the 'No clustering' option, in which case the Clustering panel will disappear.

Run X-Shift clustering

X-Shift is a density-based clustering algorithm for multidimensional single-cell data (see this paper for the algorithm description and citation).

CodexMAV uses a new implementation of X-Shift that, by default, uses the FLANN_java library for fast approximate nearest neighbor search. For compatibility reasons, you can disable it in the Tools > Preferences > Clustering menu. In the same tab you can set the sub-sampling limit. If this limit is lower than the total number of cells selected for clustering, a down-sampling, clustering, up-sampling procedure is performed. This keeps the running time of clustering down, but because down-sampling is stochastic, the results may differ from those obtained without sub-sampling.
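The down-sampling, clustering, up-sampling idea can be illustrated with a short sketch. This is not the CodexMAV or X-Shift implementation: `cluster_fn` is a hypothetical stand-in for the actual clustering step, and the use of a single nearest neighbor with Euclidean distance for the up-sampling step is an assumption.

```python
import math
import random

def cluster_with_subsampling(cells, limit, cluster_fn, seed=0):
    """cells: list of equal-length tuples; cluster_fn: returns one label per cell."""
    if len(cells) <= limit:
        # Few enough cells: cluster everything directly.
        return cluster_fn(cells)
    rng = random.Random(seed)
    core_idx = rng.sample(range(len(cells)), limit)  # random down-sampling
    core = [cells[i] for i in core_idx]
    core_labels = cluster_fn(core)                   # cluster only the core set
    labels = [None] * len(cells)
    for i, label in zip(core_idx, core_labels):
        labels[i] = label
    for i, cell in enumerate(cells):
        if labels[i] is None:
            # Up-sampling: copy the label of the nearest core-set cell (assumed 1-NN).
            nearest = min(range(limit), key=lambda j: math.dist(cell, core[j]))
            labels[i] = core_labels[nearest]
    return labels
```

Changing `seed` changes which cells end up in the core set, which is why two runs with sub-sampling may produce different clusterings.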

To run X-Shift clustering from CodexMAV, click the 'X-Shift clustering' item in the 'Analysis' menu.

A dialog will appear where you can set the clustering parameters.

  1. You can use the FLANN algorithm, which uses a fast approximate nearest neighbor library to speed up the KNN computation (about a 2x speed improvement).

  2. You can specify the subsampling limit, which determines the maximum number of cells from all regions that will be used for clustering. If the actual number of cells in all regions is higher than the subsampling limit, the clustering dataset will be randomly subsampled to match the limit, and the cells that were excluded from the clustering dataset will be assigned to core-set clusters by means of nearest-neighbor classification.

  3. The k-value controls the KNN density estimate and should be a positive number. Smaller k-values create more clusters. A suggested k-value is filled in by default. ⚠️ Setting it below 5 is not recommended.

  4. The distance measure may be Euclidean or Angular. Angular distance is less sensitive to expression intensity than Euclidean distance. Euclidean distance is sensitive to outliers and expression intensity and is more computationally intensive. To see combinations of markers with relatively less consideration of intensity, use Angular distance. To stratify populations by marker intensity, use Euclidean distance. It is particularly important to gate out noise when using Euclidean distance.

  5. You can cluster over all cells or only over cells that belong to selected populations. See Population table for more details.

  6. You can select the 'Normalize data' checkbox if you want to pass normalized values (from 0 to 1) to the clustering algorithm. Angular clustering is normalized by default, so normalization is only needed for Euclidean clustering.
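The difference between the two distance measures, and the effect of normalization, can be seen in a small sketch. The definitions here are assumptions made for illustration: angular distance is taken as the angle between two expression vectors, and normalization as per-marker min-max scaling to the 0 to 1 range. The exact formulas used by CodexMAV may differ.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with differences in overall intensity."""
    return math.dist(a, b)

def angular_distance(a, b):
    """Angle in radians between two non-zero expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def min_max_normalize(column):
    """Scale one marker's values to the 0..1 range (assumed normalization)."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

# Two cells with the same marker proportions but a tenfold intensity difference.
dim = (1.0, 2.0, 0.5)
bright = (10.0, 20.0, 5.0)
```

For `dim` and `bright`, the angular distance is close to zero because the marker proportions are identical, while the Euclidean distance is large because the overall intensities differ.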

In the next dialog you will be able to pick markers which will be used for clustering. By default, the same set of markers that is selected in the Marker manager is selected for clustering.

After clicking the 'OK' button X-Shift clustering will start and a corresponding activity will be created in the Background activities panel. You can stop the activity at any moment by clicking on the 'stop' button.

After the clustering is done you will see the following message and the Clustering panel will be automatically updated.

Note: if you did not save the analysis state earlier, you will be prompted with a dialog asking to name the new analysis state. Then the state FCS files will be created and information about the clustering will be stored there.

Clustering panel

Clusters

Each cluster in the Clustering frame has the following meta columns: Cluster ID, # of cells, and comment. You can add or edit comments by double-clicking the corresponding cells.

Heatmap

Every cluster has a marker expression heatmap for each marker selected in the Marker manager. All heatmap table cells are colored using a user-defined color scheme (see Preferences). If you hover the mouse over the heatmap, a tooltip shows the min/max, median, and standard deviation values for the corresponding cluster/marker combination.

Each cluster also has a cell-frequency-per-region heatmap at the end of the table. These table cells are colored using a white-blue scheme, where blue means that the cluster has more cells in the corresponding region. Hovering over the heatmap shows the exact number of cells in a given cluster/region combination.
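For readers who want to reproduce the tooltip values outside the application, the following sketch computes the same kinds of summaries from raw values. The function names are hypothetical, and the choice of population standard deviation is an assumption; CodexMAV may use the sample form.

```python
import statistics
from collections import Counter

def marker_summary(values):
    """Summary statistics for one cluster/marker combination."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        # Population standard deviation (assumed; the sample form is also common).
        "std": statistics.pstdev(values),
    }

def cells_per_region(cells):
    """cells: iterable of (cluster_id, region) pairs -> cell count per pair."""
    return Counter(cells)
```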

Sorting

All marker columns in the Clustering table are sorted in the same way as in the Marker manager. By default, all clusters are sorted by the 'Cluster ID' column.

You can sort clusters by any column by clicking the corresponding table column header. Clicking again switches the sorting from ascending to descending. Dendrograms also affect the sorting of clusters (see below).

Dendrograms

Dendrograms reflect the hierarchy of relative similarity of median cluster phenotypes (expression profiles).

Clicking the 'Row DND' button displays the row dendrogram for all clusters and moves the first three meta columns to the end of the table. The row dendrogram also sorts the cluster rows according to their phenotypic similarity.

Clicking the 'Column DND' button shows the column dendrogram at the bottom of the Clustering panel and sorts the marker columns according to the marker co-expression pattern across clusters. If there is more than one region, the region frequency columns are sorted by the region column dendrogram.

Clicking the 'Region row DND' button shows a row dendrogram to the right of the Clustering panel and sorts the clusters according to the region frequency dendrogram (cell type co-occurrence across regions). The first three meta columns are placed at the beginning of the table. If the row dendrogram was open, it is closed automatically, and vice versa.

Note: You can toggle the dendrogram buttons to show/hide corresponding dendrograms. Also, clicking on the marker name in the table header will trigger table sorting by that marker and row dendrograms will be hidden automatically.

The 'Save dendrogram' button in the Clustering panel allows you to choose a folder in which to save the heatmap+dendrogram image as a PNG.

Other tools

You can toggle the visibility of the marker cycle information by pressing the 'Cycle' button.

You can select table rows with the mouse and click the 'plus' button. This creates the corresponding populations in the Population table.

If you want to add all clusters as populations, click the 'Add all' button.

You can toggle color scaling. There are two available options:

By dataset:

This option is selected by default. It means that all cluster intensities are normalized using the 1% and 99% values of the whole dataset.

For example, if the whole dataset contains 100 000 cells, all 100 000 values are used for the intensity normalization. In this case, the intensity of a cluster for some marker will be red if the median value of this cluster is higher than the value in 99% of cells.

By clusters:

Select this option to compare clusters with each other. In this case, each marker always has both the lowest (white/blue) and the highest (red) intensities.

For example, if the cluster table contains 30 clusters, the 30 median values of these clusters are used for the intensity normalization. In this case, the intensity of a cluster for some marker will be red if the median value of this cluster is higher than the median value in 99% of clusters.
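The two scaling options differ only in which values supply the 1% and 99% reference points. The sketch below illustrates this for a single marker. It is not the CodexMAV implementation: the function names, the nearest-rank percentile method, and the clipping of results to the 0 to 1 range are assumptions.

```python
import statistics

def percentile(sorted_values, fraction):
    """Nearest-rank percentile on a pre-sorted list (an assumed method)."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]

def scale_intensities(cluster_medians, reference_values):
    """Map each cluster median to 0..1 using the 1% and 99% reference values."""
    ref = sorted(reference_values)
    lo, hi = percentile(ref, 0.01), percentile(ref, 0.99)
    if hi == lo:
        return [0.0 for _ in cluster_medians]
    # Values outside the 1%..99% range are clipped to 0 or 1.
    return [min(1.0, max(0.0, (m - lo) / (hi - lo))) for m in cluster_medians]

def scale_by_dataset(clusters):
    """clusters: one list of per-cell values per cluster, for a single marker."""
    all_cells = [v for cluster in clusters for v in cluster]
    medians = [statistics.median(c) for c in clusters]
    return scale_intensities(medians, all_cells)

def scale_by_clusters(clusters):
    """Use only the cluster medians as the reference distribution."""
    medians = [statistics.median(c) for c in clusters]
    return scale_intensities(medians, medians)
```

With the 'By clusters' variant, the lowest and highest cluster medians always map to the ends of the color scale, which makes differences between clusters easier to see. With the 'By dataset' variant, the medians typically fall inside the range spanned by individual cells and so use less of the scale.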

You can also save the current heatmap intensities as CSV by clicking the corresponding button. All intensities are saved as values from 0 to 1.
