Chinese Journal of Chemical Physics  2018, Vol. 31 Issue (4): 404-420

#### The article information

Jun-hui Peng, Wei Wang, Ye-qing Yu, Han-lin Gu, Xuhui Huang

Clustering Algorithms to Analyze Molecular Dynamics Simulation Trajectories for Complex Chemical and Biological Systems

Chinese Journal of Chemical Physics, 2018, 31(4): 404-420

http://dx.doi.org/10.1063/1674-0068/31/cjcp1806147

### Article history

Accepted on: July 13, 2018
Clustering Algorithms to Analyze Molecular Dynamics Simulation Trajectories for Complex Chemical and Biological Systems
Jun-hui Penga,b, Wei Wanga,b, Ye-qing Yua,b, Han-lin Guc, Xuhui Huanga,b,d,e
Dated: Received on June 20, 2018; Accepted on July 13, 2018
a. HKUST-Shenzhen Research Institute, Hi-Tech Park, Nanshan, Shenzhen 518057, China;
b. Department of Chemistry, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
c. Department of Mathematics, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
d. Center of Systems Biology and Human Health, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
e. State Key Laboratory of Molecular Neuroscience, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
*Author to whom correspondence should be addressed. Xuhui Huang, E-mail:xuhuihuang@ust.hk
Abstract: Molecular dynamics (MD) simulation has become a powerful tool to investigate the structurefunction relationship of proteins and other biological macromolecules at atomic resolution and biologically relevant timescales. MD simulations often produce massive datasets containing millions of snapshots describing proteins in motion. Therefore, clustering algorithms have been in high demand to be developed and applied to classify these MD snapshots and gain biological insights. There mainly exist two categories of clustering algorithms that aim to group protein conformations into clusters based on the similarity of their shape (geometric clustering) and kinetics (kinetic clustering). In this paper, we review a series of frequently used clustering algorithms applied in MD simulations, including divisive algorithms, agglomerative algorithms (single-linkage, complete-linkage, average-linkage, centroid-linkage and ward-linkage), center-based algorithms (K-Means, K-Medoids, K-Centers, and APM), density-based algorithms (neighbor-based, DBSCAN, density-peaks, and Robust-DB), and spectral-based algorithms (PCCA and PCCA+). In particular, differences between geometric and kinetic clustering metrics will be discussed along with the performances of different clustering algorithms. We note that there does not exist a one-size-fits-all algorithm in the classification of MD datasets. For a specific application, the right choice of clustering algorithm should be based on the purpose of clustering, and the intrinsic properties of the MD conformational ensembles. Therefore, a main focus of our review is to describe the merits and limitations of each clustering algorithm. We expect that this review would be helpful to guide researchers to choose appropriate clustering algorithms for their own MD datasets.
Key words: Molecular dynamics simulation    Clustering algorithms    Markov state models    Protein dynamics
Ⅰ. INTRODUCTION

Since the first protein molecular dynamics (MD) simulation was carried out more than four decades ago [1], MD simulation has been proven to be a powerful approach to complement experimental techniques to provide dynamic information of protein conformations at atomic resolution. In recent years, breakthroughs in structural biology, particularly the advancement in the cryo-EM technique, have led to the skyrocketing high-resolution structures of proteins and other biomolecules [2-4]. These structures offer great opportunities for MD simulations to further reveal the dynamics of numerous biological systems. Meanwhile, with recent advancements in both computer hardware and software [5-9], state-of-the-art MD simulations are able to simulate protein conformational dynamics at microseconds or even longer, which open up opportunities to investigate a wide range of important biological processes such as protein folding, functional conformational changes, protein-ligand recognition, and protein aggregation [10-21].

Modern MD simulations often produce massive datasets, which typically contain hundreds of MD trajectories and millions of conformations. One major challenge facing the MD simulation community is thus how to rapidly and accurately extract useful information from these large datasets to comprehend mechanisms underlying biological processes. One valuable approach to addressing this challenge is the conformational clustering, where geometrically or kinetically similar MD conformations are grouped into the same cluster [22-26]. This clustering procedure can reduce millions of MD conformations into a small number of conformational states (or clusters), which enables us to further investigate various thermodynamic or kinetic properties of the studied macromolecules [27-39].

A wide variety of clustering algorithms originally developed in computer science [40-44] have been successfully applied to MD datasets [23, 36, 45-77]. According to their basic principles, the commonly used clustering algorithms can be broadly divided into two major groups: hierarchical and partitional clustering algorithms [42-44]. In hierarchical clustering algorithms, the arrangement of clusters is hierarchical, and each cluster can have sub-clusters, while in partitional clustering algorithms, MD conformations are simply divided into non-overlapping clusters without assuming that all the conformations and clusters are in a hierarchical structure.

In this paper, we review some popular clustering algorithms applied to MD datasets [22-26, 69, 70, 78, 79], including divisive hierarchical clustering, agglomerative hierarchical clustering (single-linkage, complete-linkage, average-linkage, centroid-linkage and Ward-linkage), center-based partitional clustering (K-Means, K-Medoids, K-Centers, and APM), density-based partitional clustering (neighbor-based, DBSCAN, density-peaks and Robust-DB), and spectral-based partitional clustering (PCCA and PCCA+) algorithms. We note that there is no "one-size-fits-all" clustering algorithm, and each algorithm has its own merits and limitations. Therefore, this review is focused on the physical insights of clustering algorithms, their algorithmic implementations to MD datasets, and together with our own experience of applying them. We hope that this review could help readers to choose the appropriate clustering algorithm for specific MD datasets they are working on. In the remaining part of the paper, we will review a number of widely used algorithms in clustering MD datasets as shown in FIG. 1, and also discuss their performances and comparisons between different clustering algorithms.

 FIG. 1 Summary of common clustering algorithms applied to classify MD conformations of complex chemical and biological molecules
Ⅱ. COORDINATES AND FEATURES OF MOLECULAR DYNAMICS SIMULATION TRAJECTORIES

An MD simulation dataset often contains parallel trajectories representing time-evolved snapshots or conformations of the simulated macromolecular systems. Each snapshot or conformation contains three-dimensional Cartesian coordinates of all the atoms of the simulated systems (e.g). proteins, solvents and ions) (FIG. 2). In recent years, typical MD datasets contain multiple trajectories each ranging from hundreds of nanoseconds to microseconds, and thus these datasets are often consisted of millions of configurations stored in parallel MD trajectories [23, 32, 34]. For example, Da et al. [34] collected 480$\times$100 ns MD simulation trajectories (containing $\sim$4.8$\times$10$^6$ conformations) to investigate the conformational dynamics of RNA Polymerase Ⅱ backtracking. With these massive datasets containing many trajectories, it becomes challenging to elucidate the molecular mechanisms via visual inspection or other simple analyses. Under this situation, clustering algorithms become powerful to analyze MD conformations by grouping them into clusters based on distances that can describe similarities between pairs of conformations. To analyze MD trajectories, one often defines the distance between pairs of MD conformations by geometric criterion such as the root-mean-square deviation (RMSD) of protein atoms (heavy atoms or $C\alpha$ atoms). In addition to RMSD, internal coordinates (e.g. torsional angles, inter-residue distances, etc.) [22, 45], substrate-binding cavity [78], and collective coordinates [80] can also be applied to compare the similarities between different MD conformations of biological macromolecules.

 FIG. 2 An illustrated example of a MD dataset. The top panel shows a diagram of different MD trajectories and the bottom panel shows a snapshot of the simulation box from MD trajectory 1. The simulation box contains solvent (water molecules), ions (Na$^+$ and Cl$^-$) and the macromolecule of interest (RNA Polymerase Ⅱ). (Figure adapted with permission from Ref.[32], copyright 2014 National Academy of Sciences, USA)
Ⅲ. HIERARCHICAL CLUSTERING ALGORITHMS

Hierarchical clustering builds a hierarchy of clusters based on a certain linkage criterion (similarities or dissimilarities between different clusters). The clustering process can be either divisive (top-down) or agglomerative (bottom-up). The agglomerative (bottom-up) approach starts with each MD conformation in its own cluster, and merges two closest clusters iteratively. While the divisive (top-down) approach starts with all the MD conformations in one cluster, and splits the largest cluster recursively. A simplified framework of hierarchical clustering is shown in FIG. 3. In this section, we will introduce the basic principles and discuss the performances of some widely used hierarchical algorithms [22, 23, 68-70, 78].

 FIG. 3 The framework of hierarchical clustering algorithms. In particular, the agglomerative hierarchical clustering algorithms follow the bottom-up design, while the divisive ones follow top-down design. (Figure adapted with permission from Ref.[44], copyright 2017 Elsevier)
A. The divisive hierarchical clustering algorithm

The divisive hierarchical clustering algorithm follows the top-down design. General procedures of the divisive hierarchical clustering algorithm are described as the following: (ⅰ) The divisive algorithm assigns all the MD conformations into one large cluster. (ⅱ) The largest cluster from the previous step is further split into two sub-clusters. This bi-partitioning is often achieved by the K-Means and K-Centers clustering algorithms [42, 43] (see Section Ⅳ for details of these two clustering algorithms). (ⅲ) This division procedure is repeated iteratively until the desired number of clusters is reached or until each conformation forms a unique cluster on its own [40]. Similar to other hierarchical clustering algorithms, one advantage of the divisive algorithm is that no prior information regarding the number of clusters is required as input.

The divisive hierarchical clustering algorithm was first adopted in clustering MD trajectories by Torda et al. [22]. In this work, the authors applied the divisive hierarchical algorithm to a single MD trajectory containing 2, 000 conformations of the 64-residue structured domain of the barley serine protease inhibitor. They found that the divisive hierarchical algorithm performed well in clustering a well-sampled MD simulation trajectory. Shao et al. [23] also applied the divisive hierarchical algorithm to cluster the MD datasets of several different biomolecules. Similar to the findings of Torda et al. [22], Shao et al. [23] discovered that the divisive hierarchical clustering is efficient in processing large MD simulation trajectories. However, they also showed that the divisive hierarchical clustering algorithm is very sensitive to outliers and tends to yield clumsy results near the cluster boundaries [23].

B. Agglomerative hierarchical clustering algorithms

The agglomerative hierarchical clustering algorithms follow the bottom-up fashion. The algorithms start from "one element-one cluster" state and then merge small clusters into larger ones by a pre-defined linkage criteria until certain conditions are satisfied to terminate this process (e.g). the number of clusters reaches a desired value or all the MD conformations are finally grouped into one single cluster). According to the linkage criteria applied, the agglomerative hierarchical clustering algorithms can be further divided into several subtypes: the single-linkage algorithm, the complete-linkage algorithm, the average-linkage algorithm, the centroid-linkage algorithm and the Ward-linkage algorithm. The linkage criterions of the agglomerative hierarchical clustering algorithms in clustering MD trajectories are summarized in Table Ⅰ.

Table Ⅰ Summary of agglomerative hierarchical clustering algorithms

The linkage criterion of the single-linkage algorithm is defined as the minimum distance between two clusters. The distance between two clusters in the single-linkage algorithm is computed as the distance between a single pair of MD conformations (one in each cluster) that are closest to each other. At each clustering step of the single-linkage algorithm, the two clusters that have the minimum distance are merged into one larger cluster. The time and space complexities of the original single-linkage clustering are $O$($n^3$) and $O$($n^2$), respectively. After further optimization through R. Sibson's algorithm [81], the time and space complexities can be reduced to $O$($n^2$) and $O$($n$), respectively. In two previous studies, the single-linkage algorithm was implemented to cluster MD datasets [22, 23]. They both showed that the single-linkage algorithm was very fragile and the performance was relatively poor.

In the complete-linkage algorithm, the linkage criterion is also the minimum distance between two clusters as in the single-linkage algorithm. The difference is that, in the complete-linkage algorithm, the distance between two clusters is defined by the maximum distance between any pair of MD conformations (one in each cluster). The complete-linkage clustering algorithm shares similar time and space complexities as the single-linkage method.

In the work of Shao et al. [23], the authors implemented the complete-linkage algorithm to cluster the MD datasets of several different biomolecules. From the clustering results, they found that unlike the single-linkage algorithm, which tends to generate singletons, the complete-linkage algorithm is more likely to produce meaningful sets of clusters. De Paris et al. [78] implemented the complete-linkage algorithm to classify the MD trajectories of the InhA enzyme for docking purposes using the cavity attributes as the clustering metric. The authors found that in their cases, complete-linkage algorithm was among the best choices.

The average-linkage algorithm and the centroid-linkage algorithm are two highly related clustering algorithms. In the two algorithms, the same linkage criterion is adopted, which is defined as the minimum distance between two clusters as in the above agglomerative hierarchical algorithms. However, the definitions of the distance between two clusters are quite different. In the average-linkage algorithm, the distance between two clusters is defined by the average of all the distances between any pair of MD conformations (one in each cluster). While in the centroid-linkage algorithm, the distance between two clusters is defined as the distance between the geometric means of the clusters. The time and space complexities of the two linkage algorithms are also similar to those of the single-linkage algorithm.

As reported by Shao et al. [23], the two algorithms produced clusters of varying sizes and performed well in clustering MD datasets. In the work by De Paris et al. [78], the authors showed that, the performances of the two algorithms were generally good when using cavity attributes as the clustering metric.

The Ward-linkage or the Ward's method, also known as the Ward's minimum variance method, is developed by Ward et al. [82]. The linkage criterion of the Ward-linkage algorithm is different from other linkage clustering algorithms described above. In the Ward-linkage algorithm, a linkage between two clusters is determined when this pair of clusters leads to the minimum increase in total within-cluster variance after merging. The original Ward-linkage algorithm is time consuming with time complexity of $O$($n^3$). Therefore, different approximations can be introduced when implementing the Ward's algorithm to reduce time and space complexity [83, 84].

The Ward-linkage algorithm was reported to perform well when applied to analyze MD datasets of protein folding and conformational changes [27, 69, 70]. For example, Beauchamp et al. [27] applied the Ward-linkage algorithm to cluster the MD trajectories generated by the folding simulations of 14 different proteins. The produced clusters (called microstates) were used to construct Markov State Models (MSMs) (FIG. 4) to further interpret the protein folding dynamics. The predictions of their MSMs were in good agreement with experimental observations. More recently, Husic et al. [70] applied the Ward-linkage clustering algorithm to construct MSMs to analyze the folding dynamics of three fast folding proteins: the villin, trp-cage, and protein G. The authors showed that fewer clusters were needed to obtain MSMs with maximum cross-validation scores by applying the Ward-linkage algorithm in comparison with other hierarchical clustering algorithms.

 FIG. 4 MSMs of HP35 (a), FiP35 (b), and GTT (c) constructed using Ward-linkage algorithm. Simulated structures were shown in rainbow. The experimental determined structures were also shown in: (a) black and gray, (b) black and gray, and (c) black. (Figure adapted with permission from Ref.[27], copyright 2012 National Academy of Sciences, USA)
Ⅳ. PARTITIONAL CLUSTERING ALGORITHMS

Unlike hierarchical clustering algorithms that group conformations into hierarchical clusters, partitional clustering algorithms simply divide the conformations into different non-overlapping clusters without the presumption that all the conformations or clusters are hierarchical. According to their algorithm design, partitional clustering algorithms can be further divided into different subgroups [42]. Partitional clustering algorithms have been widely applied to analyze the MD simulation datasets [23-26, 57, 59, 60, 64, 67, 71, 78, 85]. In this section, we will focus on two groups of popular partitional clustering algorithms: the center-based clustering algorithms, and the density-based clustering algorithms.

A. Center-based clustering algorithms

In the center-based clustering algorithms, $K$ conformations as cluster centers are usually identified to be subject to the optimization of certain objective functions, and then each of the remaining MD conformations is assigned to its closest cluster center. Center-based clustering algorithms have been widely applied in clustering MD datasets to produce states for the MSM construction [35, 87]. In this subsection, we will describe four popular center-based clustering algorithms including the K-Means algorithm, the K-Medoids algorithm, the K-Centers algorithm, and the APM algorithm.

1. The K-Means algorithm and the K-Medoids algorithm

The K-Means clustering algorithm aims to partition the MD datasets into $K$ non-overlapping clusters, $\left\{ {C_1, C_2, {\rm{ }}...{\rm{ }}, C_k } \right\}$, such that the total within-cluster sum of square of distances (WCSS) of the conformations to their corresponding cluster centers (geometric means) is minimized. Its objective function can be formulated as,

 $\begin{eqnarray} \min \sum\limits_{i = 1}^K {\sum\limits_{x \in C_i } {\left\| {x - \mu {}_i} \right\|^2 } } \end{eqnarray}$ (1)

where $x$ denotes to MD conformations in the $i$th cluster $C_i$, and $\mu_i$ is the cluster center (geometric mean). The minimization problem is essentially NP-hard. To obtain an approximate solution, the Lloyd's algorithm [43, 88, 89] is often adopted. A simplified diagram of the algorithm is illustrated in FIG. 5(a). There are three major steps in the K-Means algorithm, which are "initiation", "iteration" and "termination". In the initiation step, we randomly pick $K$ conformations as the initial cluster centers and assign all the other conformations to their closest centers. In the iteration step, we first re-calculate the geometric means of each cluster; then all conformations are re-assigned based on the updated cluster centers. Finally, the K-Means algorithm will be terminated once the WCSS score reaches convergence. The average runtime complexity of the K-Means algorithm is $O$($Knl$), where $n$ is the number of conformations in the MD dataset, $K$ is the desired number of clusters, and $l$ is the number of iteration steps.

 FIG. 5 Illustration of three popular center-based algorithms: the K-Means algorithm, the K-Centers algorithm, and the APM algorithm. (a) In the K-Means algorithm, the initial centers (circles) are randomly selected, and all the data points are subsequently assigned according to their distance to the cluster centers. Then the centers are updated iteratively as the geometric mean of the cluster members, followed by re-assigning of all the data points to the updated centers, until the final results are converged. (Figure adapted with permission from Ref.[43], copyright 2010 Elsevier). (b) In the K-Centers algorithm, the first center (triangle) is randomly selected from the dataset, and the choice of subsequent centers is always the point furthest away from all previously found centers. (Figure adapted with permission from Ref.[60], copyright 2012 John Wiley and Sons). (c) In the APM algorithm, all the MD conformations are partitioned into two clusters using the K-Centers algorithm at the beginning. Then the cluster that has a small escape probability will be further split. Such splitting procedure is performed iteratively until all the clusters (microstates) satisfy certain kinetic constraints (please refer to the main text for details). (Figure adapted with permission from Ref.[86], copyright 2016 Royal Society of Chemistry)

The K-Medoids algorithm is highly related to the K-Means algorithm. The difference between two algorithms lies in the choice of cluster centers. In particular, the K-Medoids algorithm chooses MD conformations as cluster centers, while the K-Means algorithm defines the cluster centers as the geometric mean of points within each cluster, and thus these cluster centers are not necessarily corresponding to actual MD conformations. Accordingly, the objective function of the K-Medoids algorithm is the total WCSS score of all the MD conformations to their corresponding center conformations and can be expressed as,

 $\begin{eqnarray} \min \sum\limits_{i = 1}^K {\sum\limits_{x \in C_i } {\left\| {x - {\rm{\textbf{C}}}(C_i )} \right\|^2 } } \end{eqnarray}$ (2)

where C($C_i$) is the center conformation of the cluster $C_i$. The clustering procedure of the K-Medoids algorithm is the same as that of the K-Means algorithm. One advantage of the K-Medoids algorithm over the K-Means algorithm is that, in the K-Medoids algorithm, the centers of the clusters can directly be used as representative MD conformations.

There exist several limitations of the above two algorithms. One limitation is that the optimization steps when performing the clustering are often trapped into local minima. To alleviate this issue, we suggest running the optimization procedures multiple times with different initial conditions, and then choose the one with the smallest WCSS score as the clustering result. Other limitations of the K-Medoids and K-Means algorithms include that the number of clusters ($K$) needs to be known a priori, and the underlying distributions of the datasets need to be convex. In the work by Shao et al. [23], the authors implemented the K-Means algorithm to cluster the MD trajectories of several different biomolecular systems including a single stranded 10-mer polyA DNA and a DNA-Drug complex. The authors discovered that the K-Means algorithm in general yields good results, while they also reported several limitations as described above.

2. The K-Centers algorithm

The K-Centers algorithm aims to partition the MD datasets into $K$ non-overlapping clusters such that the maximum radius of the centers is minimized. The objective function can be computed as,

 $\begin{eqnarray} \mathop {\textrm{min}}\limits_i \mathop {\max }\limits_{x \in C_i } d(x, \mu _i ) \end{eqnarray}$ (3)

where $d(x, \mu _i)$ is the distance of a conformation $x$ to its corresponding cluster center $\mu_i$. As in the K-Medoids algorithm, the conformations themselves are chosen as cluster centers. A simplified diagram of the K-Centers algorithm is illustrated in FIG. 5(b). Similar to the K-Means algorithm, the K-Centers algorithm also has three major steps. In the initial step, one conformation is randomly selected as the first cluster center. In the iteration step, the conformation that is the furthest from all the existing cluster centers is chosen as a new cluster center. In the termination step, the algorithm is terminated until $K$ cluster centers are identified and all the conformations are assigned to their closest cluster centers. The K-Centers algorithm is computationally efficient with time complexity of $O$($Kn$), where $n$ is the total number of MD conformations and $K$ is the desired number of clusters. A fast implementation of the algorithm [60] (FIG. 6) to cluster MD datasets can be adopted to further improve the efficiency. However, the K-Centers algorithm also shares similar limitations with the K-Means algorithm. For example, the number of clusters needs to be specified in advance and the results are usually not global optimal.

 FIG. 6 The test systems and speedups of the GPU version of the K-Centers algorithm. MD datasets were generated using four different test systems as shown in (a). The speedups were shown for the GTX580 GPU (b) and the Tesla GPU (c), respectively. (Figure adapted with permission from Ref.[60], copyright 2012 John Wiley and Sons)

Due to its low computational costs, the K-Centers algorithm is widely applied in clustering MD datasets [31, 35, 37, 60, 90]. In the work of Zhao et al. [60], the authors implemented a GPU version of the K-Centers algorithm to cluster 195, 000 conformations of the alanine dipeptide (20 ps simulation), $\sim$15.68 million conformations of the $\beta$-hairpin Tryptophan Zipper 2 (78, 305 ns simulation in total), 1.98 million conformations of the Human Islet Amyloid Polypeptide (396 ns simulation in total), and 0.25 million conformations of the maltose binding protein (1.25 ${\rm{ \mathsf{ μ} }}$s simulation in total), respectively. They showed that the GPU version K-Centers algorithm can have about 30-100 fold speedups than the CPU version algorithm, which makes it very suitable for fast geometric clustering to construct MSMs [35, 60]. However, the authors have also shown that the K-Centers algorithm is very sensitive to outliers and prefers picking outliers as cluster centers over the MD conformations at the high-density regions. To address the outlier problem, the authors recommended several alternative approaches, such as disregarding outliers in the MD datasets [37, 91] or using other clustering methods (e.g. the hybrid K-Centers and K-Medoids algorithm [92] and density-based clustering algorithms [26]). Details of the density-based clustering algorithms will be described in the subsection "Density-based clustering algorithms".

3. The automatic state partitioning for multibody systems (APM) algorithm

The APM algorithm is a hybrid method developed by Sheong et al. [64] that incorporates both kinetic information and geometric proximity between different MD conformations. The algorithm aims to group MD datasets into microstates (clusters) such that regions with fast kinetics contain few microstates while regions with slow kinetics are split into more microstates (FIG. 5(c)). This is achieved by iteratively bi-partitioning the microstates (clusters) that have low escape probabilities. The escape probability of a microstate is related to the transition probability of this state to other microstates. For example, if one microstate can hardly transit to other microstates, this microstate will have extremely low escape probability.

The procedure of the APM algorithm is as the following: at first, a minimum escape probability is specified in advance; then we partition all the MD conformations into two clusters by applying the K-Centers algorithm. In the subsequent steps, we iteratively bipartition the clusters that have lower escape probabilities than the pre-specified value. The splitting step is achieved by choosing the conformation in the to-be-split cluster that is the furthest from all the existing cluster centers. Once the new cluster centers are chosen, all the conformations are re-assigned according to the updated Voronoi diagram. The algorithm is terminated until the escape probabilities of all the microstates are larger than a cut-off value, which is a pre-specified value in the escape probability. The number of clusters is automatically determined in the process.

The APM algorithm has been widely applied to investigate multi-body processes with heterogeneous timescales [86, 93], such as the loading of miRNA to the human Argonaute protein [94, 95] and the self-assembly of the star-like block copolymers nano-particle [96] and the $\beta$-sheet-rich amyloid-$\beta$ dimers [97]. For example, in the work by Jiang et al. [94], the authors applied the APM algorithm to analyze the molecular mechanisms of how human Agonaute protein recognizes and loads miRNA. In this work, trajectories from multiple MD simulations ($\sim$8 ${\rm{ \mathsf{ μ} }}$s in total) starting from docking experiments were clustered into 480 microstates (clusters) by the APM algorithm. The microstates were further lumped into seven macrostates ("long-lived" metastable states) to visualize and interpret the conformational dynamics. Based on these results, the authors proposed a two-step mechanism: miRNA reorganizes the protein through selective binding initially, and then the protein undergoes structural re-arrangements to finally reach a stable bound state.

B. Density-based clustering algorithms

The density-based clustering algorithms are designed to cluster the MD conformations according to their densities. These algorithms are intuitively appealing, since for Boltzmann distributed MD conformations of a macromolecular system, regions with high densities are often the free energy minima. Several different density-based clustering algorithms have been implemented to cluster MD trajectories. Here, we only review some popular algorithms, such as the neighbor-based algorithm, the density based spatial clustering of applications with noise (DBSCAN) algorithm, the density-peaks algorithm, and the Robust-DB algorithm.

1. The neighbor-based algorithm

The first successful density-based clustering algorithm applied to MD datasets was developed by Daura et al. [49], which is known as the neighbor-based algorithm. In this algorithm, a cut-off radius to compute the density was specified in advance. Then the conformation with the highest number of neighbors (highest density) within the pre-specified radius was taken as the first cluster center. Subsequently, the members of the entire cluster obtained in the previous step were removed from the conformational pool and the densities of the remaining conformations were recomputed. The authors applied this algorithm to cluster the MD trajectories of a $\beta$-heptapeptide and analyzed the folding dynamics of the system. They showed that the predominant conformations with the highest densities were similar to the experimental determined structure, with backbone RMSD values less than 0.6 Å.

2. The density based spatial clustering of applications with noise (DBSCAN) algorithm

The DBSCAN algorithm was developed by Ester et al. [98]. In this algorithm, MD conformations in high-density regions are grouped into clusters and outliers in low-density regions are regarded as "noise". The two parameters are required in the DBSCAN algorithm, which are the neighboring distance ($\varepsilon$) and the minimum number of neighboring conformations (minPts). According to the two input parameters, we classify all the conformations as core conformations, border conformations, and outliers. Core conformations have at least minPts neighbors within their $\varepsilon$-neighborhood. Border conformations are in the $\varepsilon$-neighborhood of any core points and have less than minPts neighboring conformations. The remaining conformations are considered to be outliers. Generally, there are two steps in the DBSCAN algorithm. The first step is to scan all the conformations to find all the core conformations, border conformations, and outliers. The second step is to group connected core conformations into a single cluster. A connection between two core conformations is determined if their distance is less than $\varepsilon$ or if they share a common border conformation. The border conformations are then assigned to the clusters that contain their corresponding core conformations. The average runtime complexity of the DBSCAN algorithm is $O$($n$log$n$) based on an implementation of the $R^*$-tree database [99].

The DBSCAN algorithm and its variants are widely used in the analysis of MD datasets [24, 66, 100-102]. For example, Keller et al. [24] implemented a similar algorithm named common-nearest-neighbor (CNN) to cluster the MD trajectories of a short peptide. They discovered that this algorithm was superior in identifying metastable conformational states of the short peptide studied. The authors also compared this algorithm with two other clustering algorithms (the K-Medoids algorithm and the neighbor-based algorithm described above) using five different shaped arbitrary 2D datasets (FIG. 7). The clustering results suggest that the CNN algorithm can generate intuitively good clusters compared to the other two algorithms. In a more recent work by Lemke et al. [66], the DBSCAN algorithm based core-set model is applied to the MD datasets of the terminal capped alanine and the hairpin peptide "RGKITVNGKTYEGR". The authors showed that this approach can generate highly accurate models of the conformational dynamics.

 FIG. 7 Comparison of the performances of the neighbor-based algorithm (column 1), the K-Medoids algorithm (column 2) and the common-nearest-neighbor algorithm (CNN, column 3) with five 2D test datasets. Data points in different clusters are colored differently. In general, the CNN algorithm can generate intuitively good results. (Figure adapted with permission from Ref.[24], copyright 2010 AIP Publishing)
3. The density-peaks algorithm

In the density-peaks clustering algorithm, the cluster centers are identified as local density maxima that are separated by regions with lower local densities [103]. To achieve this, two quantities are computed for each conformation $i$: the local density $\rho_i$ and distance $\delta_i$ from $i$ to the nearest conformation that has a higher local density. The cluster centers can be obtained by a decision graph, where the centers are the conformations with $\rho_i$$\geq$$\rho_{\textrm{cut}}$ and $\delta_i$$\geq$$\delta_{\textrm{cut}}$. Alternatively, the first few conformations that have the largest $\gamma_i$=$\rho_i$$\delta_i values can also serve as cluster centers. Once the centers are identified, each of the remaining conformations is assigned to the cluster that contains its nearest neighboring conformation with a higher density. The runtime complexity of the density-peaks algorithm is O(n^2), where n is the number of all the data points. The density-peaks algorithm and its variations have been widely applied in clustering MD datasets [71, 103]. For example, in the work by Liu et al. [71], the APLoD (adaptive partitioning by local density-peaks) algorithm based on the density-peaks algorithm and the K nearest neighbors (KNN) algorithm [104] is implemented to cluster the MD datasets of different biomolecular systems. The authors demonstrated the APLoD algorithm can handle ultra-large MD datasets more efficiently than the K-Centers algorithm and the K-Medoids algorithm. They also showed that the APLoD performed better than the other two algorithms. 4. The Robust density-based (Robust-DB) algorithm The main idea of the Robust-DB algorithm [67] is that one first makes estimations of the local free energies of MD conformations, and then classifies the conformations into local free energy minima using these estimations. The general procedures of the Robust-DB algorithm can be described as the following: firstly, the principle component analysis (PCA) is performed to reduce the dimensionality of the MD datasets. Then, the local population of each MD conformation is estimated by the number of conformations in the hypersphere of radius R. The local free energy of each conformation is then estimated as,  \begin{eqnarray} F_R (x) = - kT\textrm{ln}\left(\frac{P_R (x)}{P_R^{\max }(x)}\right) \end{eqnarray} (4) where P_R(x) is the local population and P_R^{\textrm{max}}(x) is the highest population. Having estimated a local free energy for each MD conformation, clustering is performed at several free energy cutoff values in a stepwise manner. At each free energy cutoff, density based clustering is applied to generate microstates. At last, the resulting clusters or microstates are used to construct MSMs to identify metastable states. Sittel et al. [67] implemented the Robust-DB algorithm to study the conformational dynamics of three peptide systems including the Ala_7 peptide, the villin headpiece (HP-35), and the bovine pancreatic trypsin inhibitor (BPTI). The authors demonstrated that the algorithm is robust, deterministic and computationally efficient with these well-established model systems. Ⅴ. KINETIC DISTANCES AND KINETIC CLUSTERING ALGORITHMS For biomolecules, conformations in the same free energy minimum should be able to interconvert quickly in kinetics. For the purpose of elucidating kinetics, clustering based on kinetic similarities is thus the most accurate way to analyze the MD simulation datasets. However, MD conformations, which are geometrically or structurally similar, are generally kinetically similar. For example, two conformations with sufficiently small RMSD (e.g. within one angstrom) are often an indicator that they belong to the same kinetically metastable states. Therefore, geometrical clustering algorithms described in the previous section are also applied to identify kinetically metastable states such as in identifying microstates for the MSM construction. However, it has also been reported the geometric criterions are sometimes not sufficient to identify kinetically metastable states [24, 62, 74]. In these cases, one can also cluster MD trajectories directly using the kinetic metrics [24]. The kinetic metrics measure the distances between two conformations according to how fast they interconvert. Clustering results based on kinetic metrics can identify metastable or "long-lived" conformational states in a much more effective way than those using geometric metrics. For example, by applying kinetic metrics, Schwantes et al. [74] improved the quality of MSMs constructed and found new intermediates in NTL9 folding (FIG. 8).  FIG. 8 Non-native intermediate structures identified by kinetic clustering using the K-Centers algorithm and tICA (please refer to the main text for details). (a) Native structure (top left) and non-native \beta-sheet structures (top right and bottom). (b) Native structure (left) and non-native structure with extended \beta-strand (right). (c) Native structure (left) and non-native structure (right) that missing several core interactions (colored stick). (Figure adapted with permission from Ref.[74], copyright 2013 American Chemical Society) One popular approach to computing kinetic distance between any two MD conformations is the time-structure based independent component analysis (tICA). tICA, originally used in signal processing [105, 106], was introduced to analyze molecular dynamics by Schwantes et al. [74], Pérez-Hernández et al. [62], and Naritomi et al. [107] for extracting functionally relevant slow dynamics of macromolecules. In tICA, it is assumed that kinetically similar MD conformations have high autocorrelations in the slowest-relaxing motions. The solution to the tICA problem is to solve the generalized eigenvalue problem,  \begin{eqnarray} C_{ij}^{(\tau )} v = \lambda \Sigma v \end{eqnarray} (5) where C_{ij}^{(\tau)} is the time-lagged autocorrelation matrix and \Sigma is the covariance matrix. The above equation can be further expressed as,  \begin{eqnarray} { E}\left[{X_i (t)X_j (t + \tau )} \right]v{\rm{ = }}\lambda { E}\left[{X_i (t)X_j (t)} \right]v \end{eqnarray} (6) where \tau is the lag time and X_i(t) denotes to an input feature vector extracted from conformation i in the MD dataset. The input feature vector can be distances between different atom pairs. By solving Eq.(6), the dimensionality of the system is also reduced by tICA, and further decomposition can be done in the slowest-relaxing subspace using clustering algorithms such as the K-Centers and K-Means algorithms described in the previous section [35, 71]. More details and applications of tICA are reviewed in other previous studies [108, 109]. Kinetic lumping is commonly applied in the second stage of a two-stage procedure of the MSM construction. In this procedure, we first group MD conformations into hundreds or thousands of non-overlapping microstates based on either geometric or kinetic metrics [35]. However, the large number of microstates may hamper the visualization of the conformational dynamics and the understanding of the underlying mechanism. Therefore, the second stage of the "kinetic lumping" is often applied to group kinetically similar microstates into macrostates. The kinetic similarities (or distances) between two microstates are often described by the transition probability between them after a certain lag time [24, 35, 110]. Commonly applied kinetic lumping (or kinetic clustering) algorithms are spectral-based clustering algorithms such as the Perron-cluster cluster analysis (PCCA) algorithm [77], the Robust Perron cluster analysis (PCCA+) algorithm [54], and the Hierarchical Nyström Extension Graph (HNEG) algorithm [61]. Algorithmically, these spectral-based clustering algorithms make use of the top eigenvectors of an affinity matrix derived from similarities between data points (e.g. the transition probability matrix) [111, 112]. The General workflow of the spectral-based clustering algorithms contains the following steps: (ⅰ) Preprocessing step: prepare for input the affinity matrix A$$ \in$${{\textbf{R}}}^{n \times n}$ according to the similarities among the $n$ data points. (ⅱ) Spectral mapping step: construct a normalized Graph Laplacian matrix from the affinity matrix and compute the corresponding eigenvectors. (ⅲ) Post-processing/grouping step: clustering using cutoffs or other clustering algorithms, such as the K-Means algorithm. In the post-processing/grouping step, different approaches can be adopted [112]. For example, Shi and Malik [113] used the second eigenvector to bi-split the dataset and the successive clustering can be achieved recursively on the obtained sub-graphs [114]. Chan et al. applied K-Way ratio-cut to partition datasets into $K$ clusters [115]. Another approach proposed by Ng et al. [111] performs the K-Means algorithm on top $k$ eigenvectors to classify the data points.

PCCA and PCCA+ are two most popular spectral-based algorithms for the MSM construction. In PCCA, the grouping step is achieved by performing bi-partitioning based on the sign structure of top eigenvectors recursively [77]. Different from PCCA, which performs hard partitioning on all the microstates, PCCA+ performs a fuzzy clustering around the transition region [54]. PCCA+ is considered to be a robust improvement version of PCCA. Other variations based on PCCA and PCCA+ are also implemented to perform kinetic lumping when constructing MSMs [26, 35, 61]. Besides their widely applications in MSMs, the spectral-based clustering algorithms have also been implemented to characterize different conformational species of intrinsic disordered proteins with drastic conformational changes [57, 59, 85].

Ⅵ. COMPARISONS OF THE PERFORMANCES OF DIFFERENT CLUSTERING ALGORITHMS

Each clustering algorithm described in the previous section has its own advantages and limitations. In this section, we discuss the comparisons of performances of different geometric and kinetic clustering algorithms based on previous published work [22-26, 36, 45-76, 78, 79]. At the end of this section, we also propose our own recipe on how to choose clustering algorithms for different types of MD datasets.

A. Comparisons of the performances of clustering algorithms using geometric metrics

Clustering algorithms based on geometric metrics on MD datasets have been compared in several studies. In an early work by Torda et al. [22], the authors compared two hierarchical clustering algorithms in grouping a 1-ns MD trajectory and found that the divisive hierarchical clustering algorithm performs better than the single-linkage clustering algorithm, since the single-linkage algorithm tends to generate singleton clusters and the clustering results are sensitive to linkage cut-off. In another study, Keller et al. [24] compared three algorithms applied to perform geometric clustering to study the conformational dynamics of a short peptide. The authors showed that the CNN algorithm outperformed the K-Means and the K-Medoids algorithm. Recently, Zhao et al. [60] implemented a GPU version of the K-Centers algorithm, which achieves tens of times of speedup than the original CPU version. To investigate the conformational dynamics of multi-body systems such as protein-ligand binding, Sheong et al. [64] demonstrated that their APM algorithm is more suitable compared to other center-based algorithms (FIG. 9). More recently, Liu et al. [71] show that the density-based algorithms are also suitable to investigate the conformational dynamics of multi-body systems.

 FIG. 9 The APM algorithm is well suited to cluster MD datasets of multi-body systems. (a) Artificial 2D potential mimicking a two-body system (protein-ligand binding, for example, the bound state is represented by the central minimum). (b) Pseudo-conformations sampled by the 2D potential. (c) With small number of clusters ($K$=117), the K-Centers algorithm could not identify central minimum. (d) With a very large number of clusters ($K$=15000), the K-Centers algorithm could identify the central minimum. However, the resulting clusters are of uniform sizes even in the unbound state. (e) With small number of clusters ($K$=117), the APM algorithm could identify central minimum and clusters are of various sizes. (f) Final MSMs with three metastable states were generated from (e). (Figure adapted with permission from Ref.[64], copyright 2015 American Chemical Society)

In several previous studies, new metrics have been proposed to quantitatively assess the quality of the clustering results. Shao et al. [23] presented a systematic study by applying two metrics, the Davies-Bouldin index (DBI) [116] and the pseudo F-statistic (pSF) [117], to compare eleven different well-known clustering algorithms, including several hierarchical, and center-based clustering algorithms. According to their definitions, good clustering results should have low DBI and high pSF values. In this work, the authors concluded that, in general, the average-linkage algorithm and the K-Means algorithm yield the best performances. The authors also suggested one can use the divisive hierarchical clustering algorithm and the average-linkage algorithm in cases where the number of clusters is unclear in advance. Similar to the study of Shao et al. [23], Phillips and co-workers [25] presented several analytic and dynamic polymer models with well-characterized dynamics (e.g. meta-stable states, transition states, helical structures and stochastic dynamics) to validate the results and performance of the clustering algorithm used in MD simulations of macromolecules. In this work, spectral clustering algorithm was examined and this algorithm was shown to be promising in clustering MD trajectories. In addition, De Paris et al. [78] introduced a new scoring method, the sum of the quartile differences (SQD), which identifies ensembles of medoids that have similar dispersion to the whole MD trajectory. A smaller SQD value indicates higher similarity. In this work, the authors examined the performances of a few hierarchical clustering algorithms (the average-linkage algorithm, the centroid-linkage algorithm, and the complete-linkage algorithm) and two center-based algorithms (the K-Means algorithm and the K-Medoids algorithm) in clustering the MD trajectories of the InhA enzyme for docking purposes. They showed that hierarchical clustering algorithms, especially the Ward-linkage algorithm, performed better than the two center-based algorithms, the K-Means algorithm and the K-Medoids algorithm.

B. Comparisons of the performances of clustering algorithms using kinetic metrics

Performances of some popular clustering algorithms using kinetic metrics have also been compared in several previous studies [26, 70, 79]. For example, Bowman et al. [26] compared several methods for coarse-graining MSMs using a Bayesian model and three test systems: the alanine dipeptide, the Villin headpiece, and the $\beta$-lactamase. In this work, the authors concluded that the HNEG algorithm [61] outperform methods such as PCCA and PCCA+. The authors also found that the performances of these algorithms can be further improved by disregarding low populated states which might be artifacts or outliers from insufficient sampling. More recently, Husic et al. [70] also compared several hierarchical clustering algorithms and center-based algorithms in constructing MSMs, while tICA is applied to calculate the kinetic distances between MD conformations. In this study, they demonstrated that the Ward-linkage and the K-Means clustering algorithm can generate MSMs with higher quality (i.e). larger cross-validated scores) than a number of other hierarchical clustering algorithms (e.g. the single-linkage, average-linkage, and complete-linkage algorithm) and center-based clustering algorithms (e.g. the K-Means, K-Medoids, and K-Centers algorithm).

C. Our advice on the choice of clustering algorithms

The proper choice of a clustering algorithm for a specific MD simulation dataset should be based on the intrinsic properties of datasets as well as the purpose of the clustering. In this subsection, we will first discuss the features of the MD datasets. Then we will provide our advice on the choice of clustering algorithms to serve different purposes.

In MD simulations, we aim at sampling ensemble of conformations that satisfy the equilibrium population, for example, the Boltzmann distribution. Such equilibrium distribution is directly related to the underlying free energy landscape, where the dominant metastable states (high density regions) are usually separated by high free energy barriers (low density regions). In complex biomolecular systems, there often exists a separation of timescales, where transitions between the conformations within one certain metastable state are fast, while the transitions between the conformations from different metastable states are much slower.

Clustering methods using geometric metrics often partition MD datasets into disjoint clusters, such that the conformations from one cluster are structurally similar and the conformations from different clusters have large structural variations. Thus, the results of such clustering are often applied to choose representative MD conformations to facilitate ensemble molecular docking [118], further compare with experiment [119], or simplify the data visualization. In cases where we need to choose dominant conformations locating in the free energy basins, the K-Means (or K-Medoids) and the Ward-linkage algorithm would be good choices, as these algorithms often tend to produce multiple clusters in high-density regions (or free energy basins) in order to minimize the total within-cluster variance. Alternatively, the density based algorithms may also serve as a good choice for this purpose. These algorithms are designed to identify the high-density regions as "cores". One caveat with density based algorithms is that accurate estimations of densities at high dimensional space is often challenging [120]. If the purpose of the clustering is to investigate the transition states, the K-centers algorithm would be a better choice because the K-centers algorithm minimizes the maximum radius for the clusters, and thus clusters with varying densities are produced including the low-density ones at the transition regions. In addition, it has been demonstrated that the K-centers algorithm tends to equally divide the conformational space. Therefore, populations of clusters produced by the K-centers algorithm can well reflect the underlying free energy landscape (FIG. 10(b)), while the K-Medoids algorithm fails to display such a correlation (FIG. 10(c)).

 FIG. 10 2D PMF of alanine dipeptide obtained, (a) directly from equilibrium MD simulations, (b) from the populations of the clusters obtained by the K-Centers algorithm with $K$=4000, and (c) from the populations of the clusters obtained by the K-Medoids algorithm with $K$=4000. The 2D PMF can be better recovered by K-Centers algorithm from the results. (Figure adapted with permission from Ref.[60], copyright 2012 John Wiley and Sons)

In the MSM construction, Kinetic lumping is an important step, where hundreds or thousands of microstates are grouped into a small number of metastable macrostates. Such kinetic lumping can often facilitate the interpretations of molecular mechanisms for protein conformational changes [32, 34, 35]. Among various kinetic clustering algorithms, PCCA and PCCA+ gain popularities due to their generally good performance and wide implementations [35, 54, 77]. However, PCCA and PCCA+ are often suffered from insufficient sampling, where lowly populated outliers are wrongly identified as metastable states. To alleviate this issue, we recommend the HNEG method [61], which can overcome this limitation as clusters are built up from the well-sampled regions in this method. Moreover, the HNEG method provides a hierarchical extension graph containing information of metastable states at different density levels, which can further help us to comprehend the architecture of the free energy landscapes. However, we also note that the HNEG method needs additional input parameters such as the super-density levels in comparison with PCCA and PCCA+. If one is interested in identifying the transition states, the recent method developed by Martini et al. [121] is a good choice as it can automatically identify transition states as well as transition states using a variationally optimization approach in the framework of MSMs.

Attentions should also be paid to multi-body systems [86], which underline a wide range of important biological processes, such as protein-ligand recognition, protein-protein interactions and self-assembly. These multi-body systems can display drastically different dynamic behaviors from single-body systems such as folding of individual protein and RNA molecule. In particular, for these multi-body systems, there often exists a heterogeneous mixture of timescales [86]. For example, in protein-ligand recognition, the ligand dynamics in the free solution is significantly slower than that in bound with the protein. Therefore, uniform geometric clustering algorithms such as the K-Means algorithm and the K-Centers algorithm are inadequate to describe both the bound and the unbound regions of the ligand dynamics [64]. In such cases, algorithms like APM and APLoD are good choices due to their superior performances and efficiencies on multi-body systems [64, 71, 86].

Ⅶ. CONCLUSION

In this paper, the basic principles of a number of clustering algorithms that have been widely applied to MD datasets are reviewed. The applications, performances and comparisons among these algorithms are also discussed. It should be noted that there are no one-size-fits-all algorithms that would be suitable for any MD datasets. The performances of clustering algorithms are closely related to the purpose of clustering, the metrics used in clustering, and the intrinsic properties of the MD datasets. For example, to cluster and analyze multi-body systems, algorithms such as APM and APLoD would be good choices. We note that the development of clustering algorithms for the analysis of MD simulation datasets is a fast-evolving field, and it is not possible for us to cover all the literatures. In this article, we reviewed a number of widely applied clustering algorithms with a focus on discussion merits and limitations of each method, and comparisons between different methods. We hope this review would be helpful for readers who are interested in applying the clustering methods on their own MD simulation trajectories.

Ⅷ. ACKNOWLEDGMENTS

This work was supported by Shenzhen Science and Technology Innovation Committee (JCYJ20170413173837121), the Hong Kong Research Grant Council (HKUST C6009-15G, 14203915, 16302214, 16304215, 16318816, and AoE/P-705/16), King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (OSR-2016-CRG5-3007), Guangzhou Science Technology and Innovation Commission (201704030116), and Innovation and Technology Commission (ITCPD/17-9 and ITC-CNERC14SC01). X. Huang is the Padma Harilela Associate Professor of Science.

Reference