Article information
 Junhui Peng, Wei Wang, Yeqing Yu, Hanlin Gu, Xuhui Huang
 彭俊辉, 王薇, 虞叶卿, 谷翰林, 黄旭辉
 Clustering Algorithms to Analyze Molecular Dynamics Simulation Trajectories for Complex Chemical and Biological Systems^{†}
 Chinese Journal of Chemical Physics, 2018, 31(4): 404-420
 http://dx.doi.org/10.1063/1674-0068/31/cjcp1806147

Article history
 Received on: June 20, 2018
 Accepted on: July 13, 2018
b. Department of Chemistry, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
c. Department of Mathematics, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
d. Center of Systems Biology and Human Health, The Hong Kong University of Science and Technology, Kowloon, Hong Kong;
e. State Key Laboratory of Molecular Neuroscience, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
Since the first protein molecular dynamics (MD) simulation was carried out more than four decades ago [1], MD simulation has proven to be a powerful approach that complements experimental techniques by providing dynamic information on protein conformations at atomic resolution. In recent years, breakthroughs in structural biology, particularly the advancement of the cryo-EM technique, have led to a skyrocketing number of high-resolution structures of proteins and other biomolecules [2-4]. These structures offer great opportunities for MD simulations to further reveal the dynamics of numerous biological systems. Meanwhile, with recent advancements in both computer hardware and software [5-9], state-of-the-art MD simulations are able to simulate protein conformational dynamics at microseconds or even longer timescales, which opens up opportunities to investigate a wide range of important biological processes such as protein folding, functional conformational changes, protein-ligand recognition, and protein aggregation [10-21].
Modern MD simulations often produce massive datasets, which typically contain hundreds of MD trajectories and millions of conformations. One major challenge facing the MD simulation community is thus how to rapidly and accurately extract useful information from these large datasets to comprehend the mechanisms underlying biological processes. One valuable approach to addressing this challenge is conformational clustering, where geometrically or kinetically similar MD conformations are grouped into the same cluster [22-26]. This clustering procedure can reduce millions of MD conformations to a small number of conformational states (or clusters), which enables us to further investigate various thermodynamic and kinetic properties of the studied macromolecules [27-39].
A wide variety of clustering algorithms originally developed in computer science [40-44] have been successfully applied to MD datasets [23, 36, 45-77]. According to their basic principles, the commonly used clustering algorithms can be broadly divided into two major groups: hierarchical and partitional clustering algorithms [42-44]. In hierarchical clustering algorithms, the arrangement of clusters is hierarchical, and each cluster can have sub-clusters, while in partitional clustering algorithms, MD conformations are simply divided into non-overlapping clusters without assuming that all the conformations and clusters are in a hierarchical structure.
In this paper, we review some popular clustering algorithms applied to MD datasets [22-26, 69, 70, 78, 79], including divisive hierarchical clustering, agglomerative hierarchical clustering (single-linkage, complete-linkage, average-linkage, centroid-linkage and Ward-linkage), center-based partitional clustering (K-Means, K-Medoids, K-Centers, and APM), density-based partitional clustering (neighbor-based, DBSCAN, density-peaks and Robust-DB), and spectral-based partitional clustering (PCCA and PCCA+) algorithms. We note that there is no "one-size-fits-all" clustering algorithm, and each algorithm has its own merits and limitations. Therefore, this review focuses on the physical insights behind clustering algorithms and their algorithmic implementations for MD datasets, together with our own experience in applying them. We hope that this review will help readers choose the appropriate clustering algorithm for the specific MD datasets they are working on. In the remainder of the paper, we review a number of widely used algorithms for clustering MD datasets, as shown in FIG. 1, and also discuss and compare the performances of different clustering algorithms.
Ⅱ. COORDINATES AND FEATURES OF MOLECULAR DYNAMICS SIMULATION TRAJECTORIES
An MD simulation dataset often contains parallel trajectories representing time-evolved snapshots or conformations of the simulated macromolecular systems. Each snapshot or conformation contains the three-dimensional Cartesian coordinates of all the atoms of the simulated system (e.g., proteins, solvents, and ions) (FIG. 2). In recent years, typical MD datasets contain multiple trajectories each ranging from hundreds of nanoseconds to microseconds, and thus these datasets often consist of millions of configurations stored in parallel MD trajectories [23, 32, 34]. For example, Da et al. [34] collected 480
Ⅲ. HIERARCHICAL CLUSTERING ALGORITHMS
Hierarchical clustering builds a hierarchy of clusters based on a certain linkage criterion (similarities or dissimilarities between different clusters). The clustering process can be either divisive (top-down) or agglomerative (bottom-up). The agglomerative (bottom-up) approach starts with each MD conformation in its own cluster and merges the two closest clusters iteratively, while the divisive (top-down) approach starts with all the MD conformations in one cluster and splits the largest cluster recursively. A simplified framework of hierarchical clustering is shown in FIG. 3. In this section, we will introduce the basic principles and discuss the performances of some widely used hierarchical algorithms [22, 23, 68-70, 78].
A. The divisive hierarchical clustering algorithm
The divisive hierarchical clustering algorithm follows a top-down design. The general procedure of the divisive hierarchical clustering algorithm is as follows: (ⅰ) The divisive algorithm assigns all the MD conformations to one large cluster. (ⅱ) The largest cluster from the previous step is further split into two sub-clusters. This bipartitioning is often achieved by the K-Means and K-Centers clustering algorithms [42, 43] (see Section Ⅳ for details of these two clustering algorithms). (ⅲ) This division procedure is repeated iteratively until the desired number of clusters is reached or until each conformation forms a unique cluster on its own [40]. Similar to other hierarchical clustering algorithms, one advantage of the divisive algorithm is that no prior information regarding the number of clusters is required as input.
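The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited works: it assumes Euclidean distances on featurized conformations, uses a deterministic farthest-point initialization for the 2-means bipartition, and the function name `divisive_clustering` is ours.

```python
import numpy as np

def divisive_clustering(X, n_clusters, n_iter=20):
    """Top-down divisive clustering sketch: repeatedly bipartition the
    largest cluster with a 2-means split (farthest-point initialization)."""
    clusters = [np.arange(len(X))]                    # (i) one big cluster
    while len(clusters) < n_clusters:
        j = max(range(len(clusters)), key=lambda k: len(clusters[k]))
        idx = clusters.pop(j)                         # (ii) split the largest
        pts = X[idx]
        # deterministic init: two mutually distant conformations
        c0 = pts[np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1))]
        c1 = pts[np.argmax(np.linalg.norm(pts - c0, axis=1))]
        centers = np.array([c0, c1])
        for _ in range(n_iter):                       # Lloyd refinement
            d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
            lab = d.argmin(axis=1)
            if 0 < lab.sum() < len(lab):              # keep both halves non-empty
                centers = np.array([pts[lab == k].mean(axis=0) for k in (0, 1)])
        clusters += [idx[lab == 0], idx[lab == 1]]    # (iii) recurse
    return clusters
```

Note that the recursion stops as soon as the desired number of clusters is reached, so no prior knowledge of the cluster count beyond the stopping value is needed.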
The divisive hierarchical clustering algorithm was first adopted for clustering MD trajectories by Torda et al. [22]. In this work, the authors applied the divisive hierarchical algorithm to a single MD trajectory containing 2,000 conformations of the 64-residue structured domain of the barley serine protease inhibitor. They found that the divisive hierarchical algorithm performed well in clustering a well-sampled MD simulation trajectory. Shao et al. [23] also applied the divisive hierarchical algorithm to cluster the MD datasets of several different biomolecules. Similar to the findings of Torda et al. [22], Shao et al. [23] discovered that divisive hierarchical clustering is efficient in processing large MD simulation trajectories. However, they also showed that the divisive hierarchical clustering algorithm is very sensitive to outliers and tends to yield ambiguous assignments near cluster boundaries [23].
B. Agglomerative hierarchical clustering algorithms
The agglomerative hierarchical clustering algorithms follow a bottom-up fashion. The algorithms start from a "one element, one cluster" state and then merge small clusters into larger ones by a predefined linkage criterion until certain conditions are satisfied to terminate the process (e.g., the number of clusters reaches a desired value, or all the MD conformations are finally grouped into one single cluster). According to the linkage criterion applied, the agglomerative hierarchical clustering algorithms can be further divided into several subtypes: the single-linkage algorithm, the complete-linkage algorithm, the average-linkage algorithm, the centroid-linkage algorithm and the Ward-linkage algorithm. The linkage criteria of the agglomerative hierarchical clustering algorithms used in clustering MD trajectories are summarized in Table Ⅰ.
1. The single-linkage algorithm
The linkage criterion of the single-linkage algorithm is defined as the minimum distance between two clusters. The distance between two clusters in the single-linkage algorithm is computed as the distance between the single pair of MD conformations (one in each cluster) that are closest to each other. At each clustering step of the single-linkage algorithm, the two clusters that have the minimum distance are merged into one larger cluster. The time and space complexities of the original single-linkage clustering are
2. The complete-linkage algorithm
In the complete-linkage algorithm, the linkage criterion is also the minimum distance between two clusters, as in the single-linkage algorithm. The difference is that, in the complete-linkage algorithm, the distance between two clusters is defined by the maximum distance between any pair of MD conformations (one in each cluster). The complete-linkage clustering algorithm shares similar time and space complexities with the single-linkage method.
In the work of Shao et al. [23], the authors implemented the complete-linkage algorithm to cluster the MD datasets of several different biomolecules. From the clustering results, they found that unlike the single-linkage algorithm, which tends to generate singletons, the complete-linkage algorithm is more likely to produce meaningful sets of clusters. De Paris et al. [78] implemented the complete-linkage algorithm to classify the MD trajectories of the InhA enzyme for docking purposes, using cavity attributes as the clustering metric. The authors found that in their cases, the complete-linkage algorithm was among the best choices.
3. The average-linkage algorithm & the centroid-linkage algorithm
The average-linkage algorithm and the centroid-linkage algorithm are two highly related clustering algorithms. Both adopt the same linkage criterion, the minimum distance between two clusters, as in the agglomerative hierarchical algorithms above. However, their definitions of the distance between two clusters are quite different. In the average-linkage algorithm, the distance between two clusters is defined as the average of all the distances between any pair of MD conformations (one in each cluster), while in the centroid-linkage algorithm, it is defined as the distance between the geometric means of the two clusters. The time and space complexities of the two linkage algorithms are also similar to those of the single-linkage algorithm.
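Assuming a Euclidean metric on featurized conformations, the four pairwise-distance-based cluster distances discussed so far can be written compactly. This is a sketch with our own helper names; production codes update these quantities incrementally rather than recomputing the full distance matrix.

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between conformations in clusters A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):
    return pairwise(A, B).min()        # closest pair of conformations

def complete_link(A, B):
    return pairwise(A, B).max()        # farthest pair of conformations

def average_link(A, B):
    return pairwise(A, B).mean()       # average over all pairs

def centroid_link(A, B):
    # distance between the geometric means of the two clusters
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```

At each agglomerative step, the pair of clusters minimizing the chosen function is merged.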
As reported by Shao et al. [23], the two algorithms produced clusters of varying sizes and performed well in clustering MD datasets. In the work by De Paris et al. [78], the authors showed that the performances of the two algorithms were generally good when using cavity attributes as the clustering metric.
4. The Ward-linkage algorithm
The Ward-linkage algorithm, also known as Ward's minimum variance method, was developed by Ward [82]. The linkage criterion of the Ward-linkage algorithm is different from those of the other linkage clustering algorithms described above. In the Ward-linkage algorithm, two clusters are linked when merging this pair of clusters leads to the minimum increase in the total within-cluster variance. The original Ward-linkage algorithm is time consuming, with a time complexity of
The Ward-linkage algorithm was reported to perform well when applied to analyze MD datasets of protein folding and conformational changes [27, 69, 70]. For example, Beauchamp et al. [27] applied the Ward-linkage algorithm to cluster the MD trajectories generated by the folding simulations of 14 different proteins. The produced clusters (called microstates) were used to construct Markov State Models (MSMs) (FIG. 4) to further interpret the protein folding dynamics. The predictions of their MSMs were in good agreement with experimental observations. More recently, Husic et al. [70] applied the Ward-linkage clustering algorithm to construct MSMs to analyze the folding dynamics of three fast-folding proteins: the villin, trp-cage, and protein G. The authors showed that fewer clusters were needed to obtain MSMs with maximum cross-validation scores by applying the Ward-linkage algorithm in comparison with other hierarchical clustering algorithms.
Ⅳ. PARTITIONAL CLUSTERING ALGORITHMS
Unlike hierarchical clustering algorithms that group conformations into hierarchical clusters, partitional clustering algorithms simply divide the conformations into different non-overlapping clusters without the presumption that all the conformations or clusters are hierarchical. According to their algorithm design, partitional clustering algorithms can be further divided into different subgroups [42]. Partitional clustering algorithms have been widely applied to analyze MD simulation datasets [23-26, 57, 59, 60, 64, 67, 71, 78, 85]. In this section, we will focus on two groups of popular partitional clustering algorithms: the center-based clustering algorithms and the density-based clustering algorithms.
A. Center-based clustering algorithms
In the center-based clustering algorithms,
The K-Means clustering algorithm aims to partition the MD datasets into
$ \begin{eqnarray} \min \sum\limits_{i = 1}^K {\sum\limits_{x \in C_i } {\left\| {x - \mu _i } \right\|^2 } } \end{eqnarray} $  (1) 
where
The K-Medoids algorithm is highly related to the K-Means algorithm. The difference between the two algorithms lies in the choice of cluster centers. In particular, the K-Medoids algorithm chooses MD conformations as cluster centers, while the K-Means algorithm defines the cluster centers as the geometric means of the points within each cluster, and thus these cluster centers do not necessarily correspond to actual MD conformations. Accordingly, the objective function of the K-Medoids algorithm is the total WCSS score of all the MD conformations to their corresponding center conformations and can be expressed as,
$ \begin{eqnarray} \min \sum\limits_{i = 1}^K {\sum\limits_{x \in C_i } {\left\| {x - {\rm{\textbf{C}}}(C_i )} \right\|^2 } } \end{eqnarray} $  (2) 
where C(
There exist several limitations of the above two algorithms. One limitation is that the optimization steps performed during the clustering are often trapped in local minima. To alleviate this issue, we suggest running the optimization procedure multiple times with different initial conditions, and then choosing the run with the smallest WCSS score as the clustering result. Other limitations of the K-Medoids and K-Means algorithms include that the number of clusters (
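The multiple-restart strategy suggested above can be sketched as follows, assuming a plain Lloyd's K-Means on featurized conformations minimizing the WCSS of Eq. (1). The function names are ours; in practice an optimized library implementation would be used.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """One Lloyd's K-Means run; returns labels, centers and the WCSS score."""
    centers = X[rng.choice(len(X), k, replace=False)]     # random initialization
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # assignment step
        for j in range(k):                                 # update step
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    wcss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, wcss

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run K-Means from several random initial conditions and keep the
    solution with the smallest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_restarts)),
               key=lambda res: res[2])
```

Because each restart can land in a different local minimum, keeping the lowest-WCSS solution makes the result far less sensitive to initialization.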
The K-Centers algorithm aims to partition the MD datasets into
$ \begin{eqnarray} \mathop {\min }\limits_i \mathop {\max }\limits_{x \in C_i } d(x, \mu _i ) \end{eqnarray} $  (3) 
where
Due to its low computational cost, the K-Centers algorithm is widely applied in clustering MD datasets [31, 35, 37, 60, 90]. In the work of Zhao et al. [60], the authors implemented a GPU version of the K-Centers algorithm to cluster 195,000 conformations of the alanine dipeptide (20 ps simulation),
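The low cost of K-Centers comes from the greedy farthest-point scheme commonly used to implement it (Gonzalez's 2-approximation): each new center is the conformation farthest from all existing centers, which needs only one distance evaluation per conformation per iteration. A minimal numpy sketch, with our own function name and a Euclidean metric assumed:

```python
import numpy as np

def k_centers(X, k):
    """Greedy K-Centers sketch: repeatedly pick the conformation farthest
    from all chosen centers; the maximum cluster radius is then within a
    factor of two of the optimum (Gonzalez's 2-approximation)."""
    centers = [0]                                    # arbitrary first center
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to nearest center
    while len(centers) < k:
        nxt = int(dist.argmax())                     # farthest conformation
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    # assign every conformation to its nearest center (Voronoi partition)
    labels = np.argmin(
        np.linalg.norm(X[:, None, :] - X[np.array(centers)][None, :, :], axis=2),
        axis=1)
    return centers, labels
```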
The APM algorithm is a hybrid method developed by Sheong et al. [64] that incorporates both kinetic information and geometric proximity between different MD conformations. The algorithm aims to group MD datasets into microstates (clusters) such that regions with fast kinetics contain few microstates while regions with slow kinetics are split into more microstates (FIG. 5(c)). This is achieved by iteratively bipartitioning the microstates (clusters) that have low escape probabilities. The escape probability of a microstate is related to the transition probability of this state to other microstates. For example, if one microstate rarely transitions to other microstates, this microstate will have an extremely low escape probability.
The procedure of the APM algorithm is as follows: first, a minimum escape probability is specified in advance; then all the MD conformations are partitioned into two clusters by applying the K-Centers algorithm. In the subsequent steps, the clusters whose escape probabilities are lower than the pre-specified value are iteratively bipartitioned. The splitting step is achieved by choosing the conformation in the to-be-split cluster that is the farthest from all the existing cluster centers as a new center. Once the new cluster centers are chosen, all the conformations are reassigned according to the updated Voronoi diagram. The algorithm terminates when the escape probabilities of all the microstates are larger than the pre-specified cutoff. The number of clusters is automatically determined in this process.
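The escape probability that drives APM's splitting decisions can be estimated from a discretized trajectory as one minus the self-transition probability of each microstate at a chosen lag time. A minimal sketch (the function name is ours; it assumes every microstate is visited at the chosen lag):

```python
import numpy as np

def escape_probabilities(dtraj, n_states, lag=1):
    """Estimate each microstate's escape probability from a discrete
    trajectory: 1 minus the self-transition probability at the given lag."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a, b] += 1                              # count observed transitions
    T = counts / counts.sum(axis=1, keepdims=True)     # row-stochastic matrix
    return 1.0 - np.diag(T)                            # probability of leaving
```

In APM, microstates whose escape probability falls below the pre-specified cutoff (i.e., kinetically slow regions) are the ones selected for further bipartitioning.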
The APM algorithm has been widely applied to investigate multi-body processes with heterogeneous timescales [86, 93], such as the loading of miRNA onto the human Argonaute protein [94, 95] and the self-assembly of star-like block copolymer nanoparticles [96] and the
B. Density-based clustering algorithms
The density-based clustering algorithms are designed to cluster MD conformations according to their densities. These algorithms are intuitively appealing, since for Boltzmann-distributed MD conformations of a macromolecular system, regions with high densities are often the free energy minima. Several different density-based clustering algorithms have been implemented to cluster MD trajectories. Here, we only review some popular algorithms: the neighbor-based algorithm, the density-based spatial clustering of applications with noise (DBSCAN) algorithm, the density-peaks algorithm, and the Robust-DB algorithm.
1. The neighbor-based algorithm
The first successful density-based clustering algorithm applied to MD datasets was developed by Daura et al. [49] and is known as the neighbor-based algorithm. In this algorithm, a cutoff radius for computing the density is specified in advance. The conformation with the highest number of neighbors (highest density) within the pre-specified radius is then taken as the first cluster center. Subsequently, the members of the entire cluster obtained in the previous step are removed from the conformational pool, and the densities of the remaining conformations are recomputed. The authors applied this algorithm to cluster the MD trajectories of a
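A compact sketch of this neighbor-based procedure (the function name is ours; a Euclidean metric on featurized conformations is assumed, whereas the original work used RMSD):

```python
import numpy as np

def neighbor_clustering(X, cutoff):
    """Neighbor-based (Daura-style) clustering sketch: repeatedly take the
    conformation with the most neighbors within `cutoff` as the next cluster
    center, remove the whole cluster from the pool, and recompute densities."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    pool = np.arange(len(X))
    clusters = []
    while len(pool):
        sub = d[np.ix_(pool, pool)] <= cutoff      # neighbor mask inside the pool
        i = sub.sum(axis=1).argmax()               # highest-density conformation
        clusters.append((pool[i], pool[sub[i]]))   # (center, members incl. center)
        pool = pool[~sub[i]]                       # remove the cluster, then repeat
    return clusters
```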
2. The DBSCAN algorithm
The DBSCAN algorithm was developed by Ester et al. [98]. In this algorithm, MD conformations in high-density regions are grouped into clusters, and outliers in low-density regions are regarded as "noise". Two parameters are required by the DBSCAN algorithm: the neighboring distance (
The DBSCAN algorithm and its variants are widely used in the analysis of MD datasets [24, 66, 100-102]. For example, Keller et al. [24] implemented a similar algorithm named common-nearest-neighbor (CNN) to cluster the MD trajectories of a short peptide. They discovered that this algorithm was superior in identifying metastable conformational states of the short peptide studied. The authors also compared this algorithm with two other clustering algorithms (the K-Medoids algorithm and the neighbor-based algorithm described above) using five arbitrarily shaped 2D datasets (FIG. 7). The clustering results suggest that the CNN algorithm can generate intuitively good clusters compared to the other two algorithms. In a more recent work by Lemke et al. [66], a DBSCAN-based core-set model was applied to the MD datasets of terminally capped alanine and the hairpin peptide "RGKITVNGKTYEGR". The authors showed that this approach can generate highly accurate models of the conformational dynamics.
3. The density-peaks algorithm
In the density-peaks clustering algorithm, the cluster centers are identified as local density maxima that are separated by regions with lower local densities [103]. To achieve this, two quantities are computed for each conformation
The density-peaks algorithm and its variations have been widely applied in clustering MD datasets [71, 103]. For example, in the work by Liu et al. [71], the APLoD (adaptive partitioning by local density-peaks) algorithm, based on the density-peaks algorithm and the K-nearest-neighbors (KNN) algorithm [104], was implemented to cluster the MD datasets of different biomolecular systems. The authors demonstrated that the APLoD algorithm can handle ultra-large MD datasets more efficiently than the K-Centers algorithm and the K-Medoids algorithm. They also showed that APLoD performed better than the other two algorithms.
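In the density-peaks scheme of Rodriguez and Laio [103], each conformation gets a local density rho (neighbors within a cutoff d_c) and a delta, the distance to the nearest conformation of higher density; cluster centers are the conformations for which both are large. A minimal sketch ranking candidates by the product of the two quantities (the function name is ours; ties and the final cluster-assignment step are omitted):

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Density-peaks sketch: centers are conformations with both a high
    local density rho and a large distance delta to any denser conformation."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (d < d_c).sum(axis=1) - 1                   # local density, excluding self
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]            # all denser conformations
        delta[i] = d[i, higher].min() if len(higher) else d[i].max()
    return np.argsort(rho * delta)[::-1][:n_centers]  # largest rho*delta first
```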
4. The robust density-based (Robust-DB) algorithm
The main idea of the Robust-DB algorithm [67] is that one first estimates the local free energies of MD conformations, and then classifies the conformations into local free energy minima using these estimates. The general procedure of the Robust-DB algorithm can be described as follows: first, principal component analysis (PCA) is performed to reduce the dimensionality of the MD datasets. Then, the local population of each MD conformation is estimated by the number of conformations in the hypersphere of radius
$ \begin{eqnarray} F_R (x) = - kT\ln\left(\frac{P_R (x)}{P_R^{\max }(x)}\right) \end{eqnarray} $  (4) 
where
Sittel et al. [67] implemented the Robust-DB algorithm to study the conformational dynamics of three peptide systems including the Ala
Ⅴ. CLUSTERING USING KINETIC METRICS
For biomolecules, conformations in the same free energy minimum should be able to interconvert quickly. For the purpose of elucidating kinetics, clustering based on kinetic similarities is thus the most accurate way to analyze MD simulation datasets. However, MD conformations that are geometrically or structurally similar are generally also kinetically similar. For example, a sufficiently small RMSD between two conformations (e.g., within one angstrom) often indicates that they belong to the same kinetically metastable state. Therefore, the geometric clustering algorithms described in the previous section are also applied to identify kinetically metastable states, e.g., in identifying microstates for the MSM construction. However, it has also been reported that geometric criteria are sometimes not sufficient to identify kinetically metastable states [24, 62, 74]. In these cases, one can also cluster MD trajectories directly using kinetic metrics [24]. The kinetic metrics measure the distances between two conformations according to how fast they interconvert. Clustering results based on kinetic metrics can identify metastable or "long-lived" conformational states much more effectively than those using geometric metrics. For example, by applying kinetic metrics, Schwantes et al. [74] improved the quality of the MSMs constructed and found new intermediates in NTL9 folding (FIG. 8).
One popular approach to computing the kinetic distance between any two MD conformations is the time-structure-based independent component analysis (tICA). tICA, originally used in signal processing [105, 106], was introduced to the analysis of molecular dynamics by Schwantes et al. [74], Pérez-Hernández et al. [62], and Naritomi et al. [107] for extracting functionally relevant slow dynamics of macromolecules. In tICA, it is assumed that kinetically similar MD conformations have high autocorrelations in the slowest-relaxing motions. tICA amounts to solving the generalized eigenvalue problem,
$ \begin{eqnarray} C^{(\tau )} v = \lambda \Sigma v \end{eqnarray} $  (5) 
where
$ \begin{eqnarray} { E}\left[{X_i (t)X_j (t + \tau )} \right]v = \lambda { E}\left[{X_i (t)X_j (t)} \right]v \end{eqnarray} $  (6) 
where
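In practice, Eq. (5) is solved from estimates of the time-lagged and instantaneous covariance matrices of mean-free input features. A minimal sketch (the function name is ours; the lagged covariance is symmetrized, and regularization or kinetic-mapping scalings used in production codes are omitted):

```python
import numpy as np

def tica(X, lag):
    """Minimal tICA sketch: estimate the time-lagged covariance C(tau) and
    the instantaneous covariance Sigma from mean-free data, then solve the
    generalized eigenvalue problem C(tau) v = lambda Sigma v."""
    X = X - X.mean(axis=0)                         # mean-free features
    n = len(X) - lag
    C_tau = (X[:-lag].T @ X[lag:]) / n             # time-lagged covariance
    C_tau = 0.5 * (C_tau + C_tau.T)                # symmetrize the estimate
    Sigma = (X.T @ X) / len(X)                     # instantaneous covariance
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, C_tau))
    order = np.argsort(evals.real)[::-1]           # slowest components first
    return evals.real[order], evecs.real[:, order]
```

Projecting conformations onto the top eigenvectors (the slowest independent components) yields coordinates in which kinetically close conformations stay close, which is why clustering is often performed in this reduced space.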
Kinetic lumping is commonly applied in the second stage of a two-stage procedure for the MSM construction. In this procedure, we first group MD conformations into hundreds or thousands of non-overlapping microstates based on either geometric or kinetic metrics [35]. However, the large number of microstates may hamper the visualization of the conformational dynamics and the understanding of the underlying mechanism. Therefore, a second stage of "kinetic lumping" is often applied to group kinetically similar microstates into macrostates. The kinetic similarities (or distances) between two microstates are often described by the transition probability between them after a certain lag time [24, 35, 110]. Commonly applied kinetic lumping (or kinetic clustering) algorithms are spectral-based clustering algorithms such as the Perron cluster cluster analysis (PCCA) algorithm [77], the robust Perron cluster analysis (PCCA+) algorithm [54], and the Hierarchical Nyström Extension Graph (HNEG) algorithm [61]. Algorithmically, these spectral-based clustering algorithms make use of the top eigenvectors of an affinity matrix derived from similarities between data points (e.g., the transition probability matrix) [111, 112]. The general workflow of the spectral-based clustering algorithms contains the following steps: (ⅰ) Preprocessing step: prepare as input the affinity matrix
PCCA and PCCA+ are the two most popular spectral-based algorithms for the MSM construction. In PCCA, the grouping step is achieved by recursively performing bipartitioning based on the sign structure of the top eigenvectors [77]. Different from PCCA, which performs hard partitioning on all the microstates, PCCA+ performs a fuzzy clustering around the transition region [54]. PCCA+ is considered to be a more robust, improved version of PCCA. Other variations based on PCCA and PCCA+ have also been implemented to perform kinetic lumping when constructing MSMs [26, 35, 61]. Besides their wide application in MSMs, the spectral-based clustering algorithms have also been implemented to characterize different conformational species of intrinsically disordered proteins with drastic conformational changes [57, 59, 85].
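The sign-structure idea behind PCCA can be illustrated on a toy microstate transition matrix: the second right eigenvector separates the two most slowly exchanging groups of microstates, and one bipartition follows directly from its signs. This is a sketch of a single PCCA split (our function name), not the full recursive algorithm or PCCA+:

```python
import numpy as np

def pcca_bipartition(T):
    """PCCA-style sketch: split microstates into two macrostates by the sign
    structure of the second right eigenvector of the transition matrix T."""
    evals, evecs = np.linalg.eig(T)
    order = np.argsort(evals.real)[::-1]    # eigenvalue 1 first, then slowest
    phi2 = evecs.real[:, order[1]]          # slowest relaxation mode
    return (phi2 >= 0).astype(int)          # macrostate labels from the signs
```

Repeating such splits on the next eigenvectors yields more macrostates; PCCA+ replaces the hard sign cut with a fuzzy membership assignment near the boundary.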
Ⅵ. COMPARISONS OF THE PERFORMANCES OF DIFFERENT CLUSTERING ALGORITHMS
Each clustering algorithm described in the previous sections has its own advantages and limitations. In this section, we compare the performances of different geometric and kinetic clustering algorithms based on previously published work [22-26, 36, 45-76, 78, 79]. At the end of this section, we also propose our own recipe for how to choose clustering algorithms for different types of MD datasets.
A. Comparisons of the performances of clustering algorithms using geometric metrics
Clustering algorithms based on geometric metrics have been compared on MD datasets in several studies. In an early work by Torda et al. [22], the authors compared two hierarchical clustering algorithms in grouping a 1-ns MD trajectory and found that the divisive hierarchical clustering algorithm performed better than the single-linkage clustering algorithm, since the single-linkage algorithm tends to generate singleton clusters and its clustering results are sensitive to the linkage cutoff. In another study, Keller et al. [24] compared three algorithms applied to perform geometric clustering to study the conformational dynamics of a short peptide. The authors showed that the CNN algorithm outperformed the K-Means and the K-Medoids algorithms. Recently, Zhao et al. [60] implemented a GPU version of the K-Centers algorithm, which achieves a speedup of tens of times over the original CPU version. To investigate the conformational dynamics of multi-body systems such as protein-ligand binding, Sheong et al. [64] demonstrated that their APM algorithm is more suitable than other center-based algorithms (FIG. 9). More recently, Liu et al. [71] showed that the density-based algorithms are also suitable for investigating the conformational dynamics of multi-body systems.
In several previous studies, new metrics have been proposed to quantitatively assess the quality of clustering results. Shao et al. [23] presented a systematic study applying two metrics, the Davies-Bouldin index (DBI) [116] and the pseudo F-statistic (pSF) [117], to compare eleven different well-known clustering algorithms, including several hierarchical and center-based clustering algorithms. According to their definitions, good clustering results should have low DBI and high pSF values. In this work, the authors concluded that, in general, the average-linkage algorithm and the K-Means algorithm yield the best performances. The authors also suggested that one can use the divisive hierarchical clustering algorithm and the average-linkage algorithm in cases where the number of clusters is unclear in advance. Similar to the study of Shao et al. [23], Phillips and coworkers [25] presented several analytic and dynamic polymer models with well-characterized dynamics (e.g., metastable states, transition states, helical structures and stochastic dynamics) to validate the results and performance of the clustering algorithms used in MD simulations of macromolecules. In this work, a spectral clustering algorithm was examined and shown to be promising in clustering MD trajectories. In addition, De Paris et al. [78] introduced a new scoring method, the sum of the quartile differences (SQD), which identifies ensembles of medoids that have a dispersion similar to that of the whole MD trajectory. A smaller SQD value indicates higher similarity. In this work, the authors examined the performances of a few hierarchical clustering algorithms (the average-linkage algorithm, the centroid-linkage algorithm, and the complete-linkage algorithm) and two center-based algorithms (the K-Means algorithm and the K-Medoids algorithm) in clustering the MD trajectories of the InhA enzyme for docking purposes.
They showed that the hierarchical clustering algorithms, especially the Ward-linkage algorithm, performed better than the two center-based algorithms, the K-Means algorithm and the K-Medoids algorithm.
B. Comparisons of the performances of clustering algorithms using kinetic metrics
The performances of some popular clustering algorithms using kinetic metrics have also been compared in several previous studies [26, 70, 79]. For example, Bowman et al. [26] compared several methods for coarse-graining MSMs using a Bayesian model and three test systems: the alanine dipeptide, the Villin headpiece, and the
C. The choice of clustering algorithms
The proper choice of a clustering algorithm for a specific MD simulation dataset should be based on the intrinsic properties of the dataset as well as the purpose of the clustering. In this subsection, we will first discuss the features of MD datasets. Then we will provide our advice on the choice of clustering algorithms to serve different purposes.
In MD simulations, we aim to sample an ensemble of conformations that satisfies the equilibrium population, for example, the Boltzmann distribution. Such an equilibrium distribution is directly related to the underlying free energy landscape, where the dominant metastable states (high-density regions) are usually separated by high free energy barriers (low-density regions). In complex biomolecular systems, there often exists a separation of timescales, where transitions between the conformations within a given metastable state are fast, while transitions between conformations from different metastable states are much slower.
Clustering methods using geometric metrics often partition MD datasets into disjoint clusters, such that the conformations from one cluster are structurally similar and the conformations from different clusters have large structural variations. The results of such clustering are thus often applied to choose representative MD conformations to facilitate ensemble molecular docking [118], to further compare with experiment [119], or to simplify data visualization. In cases where we need to choose dominant conformations located in the free energy basins, the K-Means (or K-Medoids) and the Ward-linkage algorithms would be good choices, as these algorithms tend to produce multiple clusters in high-density regions (or free energy basins) in order to minimize the total within-cluster variance. Alternatively, the density-based algorithms may also serve as a good choice for this purpose, as they are designed to identify the high-density regions as "cores". One caveat with density-based algorithms is that accurately estimating densities in high-dimensional space is often challenging [120]. If the purpose of the clustering is to investigate the transition states, the K-Centers algorithm would be a better choice, because the K-Centers algorithm minimizes the maximum radius of the clusters, and thus clusters with varying densities are produced, including the low-density ones at the transition regions. In addition, it has been demonstrated that the K-Centers algorithm tends to divide the conformational space evenly. Therefore, the populations of the clusters produced by the K-Centers algorithm can well reflect the underlying free energy landscape (FIG. 10(b)), while the K-Medoids algorithm fails to display such a correlation (FIG. 10(c)).
Kinetic lumping is an important step in MSM construction, where hundreds or thousands of microstates are grouped into a small number of metastable macrostates. Such kinetic lumping can often facilitate the interpretation of the molecular mechanisms of protein conformational changes [32, 34, 35]. Among the various kinetic clustering algorithms, PCCA and PCCA+ have gained popularity due to their generally good performance and their wide availability in software implementations [35, 54, 77]. However, PCCA and PCCA+ often suffer from insufficient sampling, whereby lowly populated outliers are wrongly identified as metastable states. To alleviate this issue, we recommend the HNEG method [61], which overcomes this limitation by building up clusters from the well-sampled regions. Moreover, the HNEG method provides a hierarchical extension graph containing information on metastable states at different density levels, which can further help us comprehend the architecture of the free energy landscape. We note, however, that the HNEG method requires additional input parameters, such as the super-density levels, compared with PCCA and PCCA+. If one is interested in identifying transition states, the recent method developed by Martini et al. [121] is a good choice, as it can automatically identify metastable states as well as transition states using a variational optimization approach within the framework of MSMs.
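The spirit of PCCA-style lumping can be sketched in a few lines: for a toy microstate transition matrix (hypothetical numbers, chosen to contain two metastable blocks), the sign structure of the eigenvector belonging to the second-largest eigenvalue splits the microstates into two macrostates. Full PCCA+ handles more macrostates and fuzzy memberships, which this sketch omits:

```python
import numpy as np

# A hypothetical 4-microstate transition probability matrix with two
# metastable blocks ({0,1} and {2,3}); each row sums to one.
T = np.array([[0.90, 0.09, 0.01, 0.00],
              [0.09, 0.90, 0.00, 0.01],
              [0.01, 0.00, 0.90, 0.09],
              [0.00, 0.01, 0.09, 0.90]])

# The eigenvector of the second-largest eigenvalue describes the
# slowest relaxation process; its sign structure partitions the
# microstates into two macrostates, as in the original PCCA idea.
eigvals, eigvecs = np.linalg.eig(T)
order = np.argsort(-eigvals.real)       # largest eigenvalue first
phi2 = eigvecs[:, order[1]].real        # second eigenvector
macrostate = (phi2 > 0).astype(int)
print("macrostate assignment:", macrostate)
```

For this matrix the assignment groups microstates 0 and 1 into one macrostate and 2 and 3 into the other (the overall sign of the eigenvector, and hence the 0/1 labeling, is arbitrary).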
Attention should also be paid to multi-body systems [86], which underlie a wide range of important biological processes, such as protein-ligand recognition, protein-protein interactions, and self-assembly. These multi-body systems can display drastically different dynamic behaviors from single-body systems such as the folding of an individual protein or RNA molecule. In particular, multi-body systems often contain a heterogeneous mixture of timescales [86]. For example, in protein-ligand recognition, the ligand dynamics in free solution is significantly faster than when the ligand is bound to the protein. Therefore, uniform geometric clustering algorithms such as K-Means and K-centers are inadequate to describe both the bound and unbound regions of the ligand dynamics [64]. In such cases, algorithms like APM and APLoD are good choices due to their superior performance and efficiency on multi-body systems [64, 71, 86].
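To illustrate the heterogeneous-timescale issue (this is NOT the actual APM or APLoD algorithm), the toy sketch below discretizes a hypothetical one-dimensional ligand coordinate at a finer resolution in the slow bound regime than in the fast diffusive regime; the data, cutoff, and bin counts are arbitrary choices made only for illustration:

```python
import numpy as np

# Toy 1D ligand coordinate: a hypothetical "distance to binding site".
# Frames under the cutoff are bound (slow, small fluctuations); the rest
# diffuse freely (fast, large motions).
rng = np.random.default_rng(0)
bound = rng.normal(0.3, 0.05, 200)      # narrow bound-state fluctuations
unbound = rng.uniform(1.0, 10.0, 200)   # wide diffusive regime
frames = np.concatenate([bound, unbound])

cutoff = 0.8                            # arbitrary bound/unbound boundary
is_bound = frames < cutoff

# Finer discretization in the slow bound region (5 states over 0.8 units)
# and coarser in the fast diffusive region (2 states over 9.2 units),
# mimicking the spirit of resolution-adapted clustering.
bound_bins = np.linspace(0.0, cutoff, 6)
free_bins = np.linspace(cutoff, 10.0, 3)
bound_labels = np.digitize(frames[is_bound], bound_bins[1:-1])
free_labels = np.digitize(frames[~is_bound], free_bins[1:-1])
print("bound frames:", int(is_bound.sum()), "-> up to 5 fine states")
print("unbound frames:", int((~is_bound).sum()), "-> up to 2 coarse states")
```

A uniform state width over the whole coordinate would either blur the bound basin or shatter the diffusive region into many poorly sampled states; adapting the resolution to the local timescale avoids both failures.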
Ⅶ. CONCLUSION
In this paper, we have reviewed the basic principles of a number of clustering algorithms that have been widely applied to MD datasets, and discussed their applications, performances, and comparisons among them. It should be noted that there is no one-size-fits-all algorithm suitable for every MD dataset. The performance of a clustering algorithm is closely related to the purpose of the clustering, the metrics used, and the intrinsic properties of the MD dataset. For example, to cluster and analyze multi-body systems, algorithms such as APM and APLoD are good choices. The development of clustering algorithms for the analysis of MD simulation datasets is a fast-evolving field, and it is not possible for us to cover all the literature. In this article, we have focused on a number of widely applied clustering algorithms, discussing the merits and limitations of each method and the comparisons between different methods. We hope this review will be helpful for readers who are interested in applying clustering methods to their own MD simulation trajectories.
Ⅷ. ACKNOWLEDGMENTS
This work was supported by Shenzhen Science and Technology Innovation Committee (JCYJ20170413173837121), the Hong Kong Research Grant Council (HKUST C600915G, 14203915, 16302214, 16304215, 16318816, and AoE/P705/16), King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (OSR2016CRG53007), Guangzhou Science Technology and Innovation Commission (201704030116), and Innovation and Technology Commission (ITCPD/179 and ITCCNERC14SC01). X. Huang is the Padma Harilela Associate Professor of Science.
[1]  J. A. McCammon, B. R. Gelin, and M. Karplus, Nature 267 , 585 (1977). DOI:10.1038/267585a0 
[2]  E. F. Garman, Science 343 , 1102 (2014). DOI:10.1126/science.1247829 
[3]  J. A. Marsh, and S. A. Teichmann, Annu. Rev. Biochem. 84 , 551 (2015). DOI:10.1146/annurev-biochem-060614-034142 
[4]  P. W. Rose, A. Prlić, A. Altunkaya, C. X. Bi, A. R. Bradley, C. H. Christie, L. Di Costanzo, J. M. Duarte, S. Dutta, Z. K. Feng, R. K. Green, D. S. Goodsell, B. Hudson, T. Kalro, R. Lowe, E. Peisach, C. Randle, A. S. Rose, C. H. Shao, Y. P. Tao, Y. Valasatava, M. Voigt, J. D. Westbrook, J. Woo, H. W. Yang, J. Y. Young, C. Zardecki, H. M. Berman, and S. K. Burley, Nucleic Acids Res. 45 , D271 (2017). DOI:10.1093/nar/gkw1042 
[5]  S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. Van Der Spoel, B. Hess, and E. Lindahl, Bioinformatics 29 , 845 (2013). DOI:10.1093/bioinformatics/btt055 
[6]  M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, SoftwareX 1-2 , 19 (2015). 
[7]  R. Salomon-Ferrer, D. A. Case, and R. C. Walker, WIREs Comput. Mol. Sci. 3 , 198 (2013). 
[8]  J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, and K. Schulten, J. Comput. Chem. 26 , 1781 (2005). DOI:10.1002/(ISSN)1096-987X 
[9]  D. E. Shaw, M. M. Deneroff, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, K. J. Bowers, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. A. Moraes, R. Mueller, E. C. Priest, Y. B. Shan, J. Spengler, M. Theobald, B. Towles, and S. C. Wang, Commun. ACM 51 , 91 (2008). 
[10]  M. Karplus, and J. A. McCammon, Nat. Struct. Biol. 9 , 646 (2002). DOI:10.1038/nsb0902-646 
[11]  R. O. Dror, R. M. Dirks, J. P. Grossman, H. F. Xu, and D. E. Shaw, Annu. Rev. Biophys. 41 , 429 (2012). DOI:10.1146/annurev-biophys-042910-155245 
[12]  J. L. Klepeis, K. LindorffLarsen, R. O. Dror, and D. E. Shaw, Curr. Opin. Struct. Biol. 19 , 120 (2009). DOI:10.1016/j.sbi.2009.03.004 
[13]  F. R. Salsbury Jr., Curr. Opin. Pharmacol. 10 , 738 (2010). 
[14]  J. D. Durrant, and J. A. McCammon, BMC Biol. 9 , 71 (2011). DOI:10.1186/1741-7007-9-71 
[15]  X. W. Liu, D. F. Shi, S. Y. Zhou, H. L. Liu, H. X. Liu, and X. J. Yao, Expert Opin. Drug Discovery 13 , 23 (2018). DOI:10.1080/17460441.2018.1403419 
[16]  J. R. Perilla, B. C. Goh, C. K. Cassidy, B. Liu, R. C. Bernardi, T. Rudack, H. Yu, Z. Wu, and K. Schulten, Curr. Opin. Struct. Biol. 31 , 64 (2015). DOI:10.1016/j.sbi.2015.03.007 
[17]  M. C. Childers, and V. Daggett, Mol. Syst. Des. Eng. 2 , 9 (2017). DOI:10.1039/C6ME00083E 
[18]  A. Chevalier, D. A. Silva, G. J. Rocklin, D. R. Hicks, R. Vergara, P. Murapa, S. M. Bernard, L. Zhang, K. H. Lam, G. R. Yao, C. D. Bahl, S. I. Miyashita, I. Goreshnik, J. T. Fuller, M. T. Koday, C. M. Jenkins, T. Colvin, L. Carter, A. Bohn, C. M. Bryan, D. A. FernndezVelasco, L. Stewart, M. Dong, X. H. Huang, R. S. Jin, I. A. Wilson, D. H. Fuller, and D. Baker, Nature 550 , 74 (2017). 
[19]  A. Hospital, J. R. Goñi, M. Orozco, and J. L. Gelpí, Adv. Appl. Bioinform. Chem. 8 , 37 (2015). 
[20]  D. E. Shaw, J. P. Grossman, J. A. Bank, B. Batson, J. A. Butts, J. C. Chao, M. M. Deneroff, R. O. Dror, A. Even, C. H. Fenton, A. Forte, J. Gagliardo, G. Gill, B. Greskamp, C. R. Ho, D. J. Ierardi, L. Iserovich, J. S. Kuskin, R. H. Larson, T. Layman, L. S. Lee, A. K. Lerer, C. Li, D. Killebrew, K. M. Mackenzie, S. Y. H. Mok, M. A. Moraes, R. Mueller, L. J. Nociolo, J. L. Peticolas, T. Quan, D. Ramot, J. K. Salmon, D. P. Scarpazza, U. Ben Schafer, N. Siddique, C. W. Snyder, J. Spengler, P. T. P. Tang, M. Theobald, H. Toma, B. Towles, B. Vitale, S. C. Wang, and C. Young, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA: IEEE, 41(2014). 
[21]  S. Doerr, M. J. Harvey, F. Noé, and G. De Fabritiis, J. Chem. Theory Comput. 12 , 1845 (2016). DOI:10.1021/acs.jctc.6b00049 
[22]  A. E. Torda, and W. F. Van Gunsteren, J. Comput. Chem. 15 , 1331 (1994). DOI:10.1002/(ISSN)1096-987X 
[23]  J. Y. Shao, S. W. Tanner, N. Thompson, and T. E. Cheatham, J. Chem. Theory Comput. 3 , 2312 (2007). DOI:10.1021/ct700119m 
[24]  B. Keller, X. Daura, and W. F. Van Gunsteren, J. Chem. Phys. 132 , 074110 (2010). DOI:10.1063/1.3301140 
[25]  J. L. Phillips, M. E. Colvin, and S. Newsam, BMC Bioinformatics 12 , 445 (2011). DOI:10.1186/1471-2105-12-445 
[26]  G. R. Bowman, L. M. Meng, and X. H. Huang, J. Chem. Phys. 139 , 121905 (2013). DOI:10.1063/1.4812768 
[27]  K. A. Beauchamp, R. McGibbon, Y. S. Lin, and V. S. Pande, Proc. Natl. Acad. Sci. USA 109 , 17807 (2012). DOI:10.1073/pnas.1201810109 
[28]  G. Jayachandran, V. Vishal, and V. S. Pande, J. Chem. Phys. 124 , 164902 (2006). DOI:10.1063/1.2186317 
[29]  V. S. Pande, K. Beauchamp, and G. R. Bowman, Methods 52 , 99 (2010). DOI:10.1016/j.ymeth.2010.06.002 
[30]  J. D. Chodera, and F. Noé, Curr. Opin. Struct. Biol. 25 , 135 (2014). DOI:10.1016/j.sbi.2014.04.002 
[31]  L. T. Da, F. K. Sheong, D. A. Silva, and X. H. Huang, Protein Conformational Dynamics, Han K. L., Zhang X., and M. J. Yang Eds., Cham:Springer, 805 , 29 (2014). 
[32]  D. A. Silva, D. R. Weiss, F. Pardo Avila, L. T. Da, M. Levitt, D. Wang, and X. H. Huang, Proc. Natl. Acad. Sci. USA 111 , 7665 (2014). DOI:10.1073/pnas.1315751111 
[33]  S. Gu, D. A. Silva, L. M. Meng, A. Yue, and X. H. Huang, PLoS Comput. Biol. 10 , e1003767 (2014). DOI:10.1371/journal.pcbi.1003767 
[34]  L. T. Da, F. Pardo-Avila, L. Xu, D. A. Silva, L. Zhang, X. Gao, D. Wang, and X. H. Huang, Nat. Commun. 7 , 11244 (2016). DOI:10.1038/ncomms11244 
[35]  W. Wang, S. Q. Cao, L. Z. Zhu, and X. H. Huang, WIREs Comput. Mol. Sci. 8 , e1343 (2018). 
[36]  J. D. Chodera, N. Singhal, V. S. Pande, K. A. Dill, and W. C. Swope, J. Chem. Phys. 126 , 155101 (2007). DOI:10.1063/1.2714538 
[37]  G. R. Bowman, K. A. Beauchamp, G. Boxer, and V. S. Pande, J. Chem. Phys. 131 , 124101 (2009). DOI:10.1063/1.3216567 
[38]  F. Noé, and S. Fischer, Curr. Opin. Struct. Biol. 18 , 154 (2008). DOI:10.1016/j.sbi.2008.01.008 
[39]  D. Shukla, C. X. Hernández, J. K. Weber, and V. S. Pande, Acc. Chem. Res. 48 , 414 (2015). DOI:10.1021/ar5002999 
[40]  E. Godehardt, C.J. E. Ter Braak, M. Roux, R. K. Blashfield, P. Rousseau, P. G. Bryant, and R. J. Hathaway, J. Classif. 8 , 269 (1991). DOI:10.1007/BF02616243 
[41]  P. Arabie, L. J. Hubert, and G. De Soete, Clustering and Classification, Singapore: World Scientific, (1996). 
[42]  A. K. Jain, M. N. Murty, and P. J. Flynn, ACM Comput. Surv. 31 , 264 (1999). DOI:10.1145/331499.331504 
[43]  A. K. Jain, Patt. Recognit. Lett. 31 , 651 (2010). DOI:10.1016/j.patrec.2009.09.011 
[44]  A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, M. J. Er, W. Ding, and C. T. Lin, Neurocomputing 267 , 664 (2017). DOI:10.1016/j.neucom.2017.06.053 
[45]  M. E. Karpen, D. J. Tobias, and C. L. Brooks Ⅲ, Biochemistry 32 , 412 (1993). DOI:10.1021/bi00053a005 
[46]  A. G. Michel, and C. Jeandenans, Comput. Chem. 17 , 49 (1993). DOI:10.1016/0097-8485(93)80028-C 
[47]  P. S. Shenkin, and D. Q. McDonald, J. Comput. Chem. 15 , 899 (1994). DOI:10.1002/(ISSN)1096-987X 
[48]  J. M. Troyer, and F. E. Cohen, Proteins 23 , 97 (1995). DOI:10.1002/(ISSN)1097-0134 
[49]  X. Daura, W. F. Van Gunsteren, and A. E. Mark, Proteins 34 , 269 (1999). DOI:10.1002/(ISSN)1097-0134 
[50]  J. Gabarro-Arpa, and R. Revilla, Comput. Chem. 24 , 693 (2000). 
[51]  M. T. Hyvönen, Y. Hiltunen, W. El-Deredy, T. Ojala, J. Vaara, P. T. Kovanen, and M. Ala-Korpela, J. Am. Chem. Soc. 123 , 810 (2001). DOI:10.1021/ja0025853 
[52]  C. Best, and H. C. Hege, Comput. Sci. Eng. 4 , 68 (2002). DOI:10.1109/5992.998642 
[53]  M. Feher, and J. M. Schmidt, J. Chem. Inf. Comput. Sci. 43 , 810 (2003). DOI:10.1021/ci0200671 
[54]  P. Deuflhard, and M. Weber, Linear Algebra Appl. 398 , 161 (2005). DOI:10.1016/j.laa.2004.10.026 
[55]  Y. Li, J. Chem. Inf. Model. 46 , 1742 (2006). DOI:10.1021/ci050463u 
[56]  F. Noé, I. Horenko, C. Schütte, and J. C. Smith, J. Chem. Phys. 126 , 155102 (2007). DOI:10.1063/1.2714539 
[57]  J. L. Phillips, M. E. Colvin, E. Y. Lau, and S. Newsam, Proceedings of 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, Philadelphia, PA, USA: IEEE, 17(2008). 
[58]  D. Fraccalvieri, A. Pandini, F. Stella, and L. Bonati, BMC Bioinformatics 12 , 158 (2011). DOI:10.1186/1471-2105-12-158 
[59]  F. Haack, K. Fackeldey, S. Röblitz, O. Scharkoi, M. Weber, and B. Schmidt, J. Chem. Phys. 139 , 194110 (2013). DOI:10.1063/1.4830409 
[60]  Y. T. Zhao, F. K. Sheong, J. Sun, P. Sander, and X. H. Huang, J. Comput. Chem. 34 , 95 (2013). DOI:10.1002/jcc.23110 
[61]  Y. Yao, R. Z. Cui, G. R. Bowman, D. A. Silva, J. Sun, and X. H. Huang, J. Chem. Phys. 138 , 174106 (2013). DOI:10.1063/1.4802007 
[62]  G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, J. Chem. Phys. 139 , 015102 (2013). DOI:10.1063/1.4811489 
[63]  G. Bouvier, N. Desdouits, M. Ferber, A. Blondel, and M. Nilges, Bioinformatics 31 , 1490 (2015). DOI:10.1093/bioinformatics/btu849 
[64]  F. K. Sheong, D. A. Silva, L. M. Meng, Y. T. Zhao, and X. H. Huang, J. Chem. Theory Comput. 11 , 17 (2015). DOI:10.1021/ct5007168 
[65]  T. M. Abramyan, J. A. Snyder, A. A. Thyparambil, S. J. Stuart, and R. A. Latour, J. Comput. Chem. 37 , 1973 (2016). DOI:10.1002/jcc.24416 
[66]  O. Lemke, and B. G. Keller, J. Chem. Phys. 145 , 164104 (2016). DOI:10.1063/1.4965440 
[67]  F. Sittel, and G. Stock, J. Chem. Theory Comput. 12 , 2426 (2016). DOI:10.1021/acs.jctc.5b01233 
[68]  H. V. Dang, B. Schmidt, A. Hildebrandt, T. T. Tran, and A. K. Hildebrandt, Int. J. High Perform. Comput. Appl. 30 , 200 (2016). DOI:10.1177/1094342015597988 
[69]  V. C. De Souza, L. Goliatt, and P. V. Z. C. Goliatt, Proceedings of 2017 IEEE Latin American Conference on Computational Intelligence, Arequipa, Peru: IEEE, (2017). 
[70]  B. E. Husic, and V. S. Pande, J. Chem. Theory Comput. 13 , 963 (2017). DOI:10.1021/acs.jctc.6b01238 
[71]  S. Liu, L. Z. Zhu, F. K. Sheong, W. Wang, and X. H. Huang, J. Comput. Chem. 38 , 152 (2017). DOI:10.1002/jcc.24664 
[72]  D. Shortle, K. T. Simons, and D. Baker, Proc. Natl. Acad. Sci. USA 95 , 11158 (1998). DOI:10.1073/pnas.95.19.11158 
[73]  D. Chema, and A. Goldblum, J. Chem. Inf. Comput. Sci. 43 , 208 (2003). DOI:10.1021/ci0255735 
[74]  C. R. Schwantes, and V. S. Pande, J. Chem. Theory Comput. 9 , 2000 (2013). DOI:10.1021/ct300878a 
[75]  S. T. Xu, S. X. Zou, and L. C. Wang, J. Comput. Biol. 22 , 436 (2015). DOI:10.1089/cmb.2014.0162 
[76]  G. R. Bowman, J. Chem. Phys. 137 , 134111 (2012). DOI:10.1063/1.4755751 
[77]  P. Deuflhard, W. Huisinga, A. Fischer, and C. Schütte, Linear Algebra Appl. 315 , 39 (2000). DOI:10.1016/S0024-3795(00)00095-1 
[78]  R. De Paris, C. V. Quevedo, D. D. A. Ruiz, and O. N. De Souza, PLoS One 10 , e0133172 (2015). DOI:10.1371/journal.pone.0133172 
[79]  Y. Li, and Z. G. Dong, J. Chem. Inf. Model. 56 , 1205 (2016). DOI:10.1021/acs.jcim.6b00181 
[80]  A. Wolf, and K. N. Kirschner, J. Mol. Model. 19 , 539 (2013). DOI:10.1007/s00894-012-1563-4 
[81]  R. Sibson, Comput. J. 16 , 30 (1973). DOI:10.1093/comjnl/16.1.30 
[82]  J. H. Ward Jr., J. Am. Stat. Assoc. 58 , 236 (1963). DOI:10.1080/01621459.1963.10500845 
[83]  F. Murtagh, and P. Legendre, J. Classif. 31 , 274 (2014). DOI:10.1007/s00357-014-9161-z 
[84]  D. Müllner, Modern Hierarchical, Agglomerative Clustering Algorithms, arXiv preprint arXiv:1109.2378, (2011). 
[85]  N. G. Sgourakis, M. Merced-Serrano, C. Boutsidis, P. Drineas, Z. M. Du, C. Y. Wang, and A. E. Garcia, J. Mol. Biol. 405 , 570 (2011). DOI:10.1016/j.jmb.2010.10.015 
[86]  L. Z. Zhu, F. K. Sheong, X. Z. Zeng, and X. H. Huang, Phys. Chem. Chem. Phys. 18 , 30228 (2016). DOI:10.1039/C6CP02545E 
[87]  G. R. Bowman, X. H. Huang, and V. S. Pande, Methods 49 , 197 (2009). DOI:10.1016/j.ymeth.2009.04.013 
[88]  S. P. Lloyd, Bell Syst. Tech. J. 36 , 517 (1957). DOI:10.1002/bltj.1957.36.issue2 
[89]  S. P. Lloyd, IEEE Trans. Inf. Theory 28 , 129 (1982). DOI:10.1109/TIT.1982.1056489 
[90]  L. Zhang, F. Pardo-Avila, I. C. Unarta, P. P. H. Cheung, G. Wang, D. Wang, and X. H. Huang, Acc. Chem. Res. 49 , 687 (2016). DOI:10.1021/acs.accounts.5b00536 
[91]  F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl, Proc. Natl. Acad. Sci. USA 106 , 19011 (2009). DOI:10.1073/pnas.0905466106 
[92]  K. A. Beauchamp, G. R. Bowman, T. J. Lane, L. Maibaum, I. S. Haque, and V. S. Pande, J. Chem. Theory Comput. 7 , 3412 (2011). DOI:10.1021/ct200463m 
[93]  L. M. Meng, F. K. Sheong, X. Z. Zeng, L. Z. Zhu, and X. H. Huang, J. Chem. Phys. 147 , 044112 (2017). DOI:10.1063/1.4995558 
[94]  H. L. Jiang, F. K. Sheong, L. Z. Zhu, X. Gao, J. Bernauer, and X. H. Huang, PLoS Comput. Biol. 11 , e1004404 (2015). DOI:10.1371/journal.pcbi.1004404 
[95]  H. Jiang, L. Z. Zhu, A. Héliou, X. Gao, J. Bernauer, and X. H. Huang, Drug Target miRNA: Methods and Protocols, M. F. Schmidt Ed., New York, NY: Humana Press, 251(2017). 
[96]  X. Z. Zeng, B. Li, Q. Qiao, L. Z. Zhu, Z. Y. Lu, and X. H. Huang, Phys. Chem. Chem. Phys. 18 , 23494 (2016). DOI:10.1039/C6CP01808D 
[97]  Y. Cao, X. H. Jiang, and W. Han, J. Chem. Theory Comput. 13 , 5731 (2017). DOI:10.1021/acs.jctc.7b00803 
[98]  M. Ester, H. P. Kriegel, J. Sander, and X. W. Xu, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR (1996). 
[99]  N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger, Proceedings of the 1990 Acm Sigmod International Conference on Management Data, Atlantic City, New Jersey, USA:ACM, 19 , 322 (1990). 
[100]  K. Wang, J. D. Chodera, Y. Z. Yang, and M. R. Shirts, J. Comput. Aided Mol. Des. 27 , 989 (2013). DOI:10.1007/s10822-013-9689-8 
[101]  P. Sfriso, M. Duran-Frigola, R. Mosca, A. Emperador, P. Aloy, and M. Orozco, Structure 24 , 116 (2016). DOI:10.1016/j.str.2015.10.025 
[102]  R. Galindo-Murillo, and T. E. Cheatham, ChemMedChem 9 , 1252 (2014). DOI:10.1002/cmdc.201402020 
[103]  A. Rodriguez, and A. Laio, Science 344 , 1492 (2014). DOI:10.1126/science.1242072 
[104]  E. V. Ruiz, Patt. Recognit. Lett. 4 , 145 (1986). DOI:10.1016/0167-8655(86)90013-9 
[105]  L. Molgedey, and H. G. Schuster, Phys. Rev. Lett. 72 , 3634 (1994). DOI:10.1103/PhysRevLett.72.3634 
[106]  T. Blaschke, P. Berkes, and L. Wiskott, Neural Comput. 18 , 2495 (2006). DOI:10.1162/neco.2006.18.10.2495 
[107]  Y. Naritomi, and S. Fuchigami, J. Chem. Phys. 134 , 065101 (2011). DOI:10.1063/1.3554380 
[108]  M. A. Rohrdanz, W. W. Zheng, and C. Clementi, Annu. Rev. Phys. Chem. 64 , 295 (2013). DOI:10.1146/annurevphyschem040412110006 
[109]  F. Noé, and C. Clementi, Curr. Opin. Struct. Biol. 43 , 141 (2017). DOI:10.1016/j.sbi.2017.02.006 
[110]  F. Noé, and C. Clementi, J. Chem. Theory Comput. 11 , 5002 (2015). DOI:10.1021/acs.jctc.5b00553 
[111]  A. Y. Ng, M. I. Jordan, and Y. Weiss, Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, British Columbia, Canada: MIT Press, 849(2001). 
[112]  D. Verma and M. Meilă, Ph.D. Dissertation, Washington: University of Washington, (2003). 
[113]  J. B. Shi, and J. Malik, IEEE Trans. Pattern Anal. Mach. Intell. 22 , 888 (2000). DOI:10.1109/34.868688 
[114]  M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, Pattern Recognit. 41 , 176 (2008). DOI:10.1016/j.patcog.2007.05.018 
[115]  P. K. Chan, M. Schlag, and J. Y. Zien, Proceedings of the 1993 Symposium on Research on Integrated Systems, Seattle, Washington, USA: MIT Press, 123(1993). 
[116]  D. L. Davies, and D. W. Bouldin, IEEE Trans. Pattern Anal. Mach. Intell. 1 , 224 (1979). 
[117]  T. Caliński, and J. Harabasz, Commun. Stat. 3 , 1 (1974). 
[118]  R. E. Amaro, J. Baudry, J. Chodera, Ö. Demir, J. A. McCammon, Y. L. Miao, J. C. Smith, Biophys. J. 114, 2271(2018). 
[119]  A. B. Ward, A. Sali, and I. A. Wilson, Science 339 , 913 (2013). DOI:10.1126/science.1228565 
[120]  X. H. Huang, Y. A. Yao, G. R. Bowman, J. Sun, L. J. Guibas, G. Carlsson, and V. S. Pande, Pac. Symp. Biocomput. 2010 , 228 (2010). 
[121]  L. Martini, A. Kells, R. Covino, G. Hummer, N. V. Buchete, and E. Rosta, Phys. Rev. X 7 , 031060 (2017). 