Clustering Algorithms to Analyze Molecular Dynamics Simulation Trajectories for Complex Chemical and Biological Systems

Jun-hui Peng (彭俊辉); Wei Wang (王薇); Ye-qing Yu (虞叶卿); Han-lin Gu (谷翰林); Xu-hui Huang (黄旭辉)

doi:10.1063/1674-0068/31/cjcp1806147

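Abstract

Abstract (Chinese version): Molecular dynamics (MD) simulation is well suited to revealing the structure-function relationship of proteins and other biological macromolecules at atomic resolution. MD simulations usually produce massive datasets describing how molecules move during the simulation, comprising many trajectories along with the time-evolving coordinates and velocities of every atom. To extract molecular mechanisms from these data, clustering algorithms must be developed and applied to classify them. Clustering algorithms generally group conformations that are similar by some measure, and these measures fall into two kinds: geometric similarity and kinetic similarity. Accordingly, the clustering algorithms used to analyze MD simulations can be divided into two broad categories: geometric clustering and kinetic clustering. This paper reviews a series of commonly used clustering algorithms for MD simulations, including divisive algorithms, agglomerative algorithms (single-linkage, complete-linkage, average-linkage, centroid-linkage, and Ward-linkage), center-based algorithms (K-Means, K-Medoids, K-Centers, and APM), density-based algorithms (neighbor-based, DBSCAN, density-peaks, and Robust-DB), and spectral algorithms (PCCA and PCCA+). It discusses the differences between geometric and kinetic clustering and the performance of the different algorithms. We also note that no single clustering algorithm suits all MD data. For a given system, the choice of an appropriate algorithm depends on the purpose of the clustering and on the intrinsic properties of the MD conformational ensemble, among other factors. A main point of this paper is therefore to describe the strengths and weaknesses of each clustering algorithm. We hope that this paper can guide readers in choosing an appropriate clustering algorithm for their MD simulations.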

Abstract: Molecular dynamics (MD) simulation has become a powerful tool for investigating the structure-function relationship of proteins and other biological macromolecules at atomic resolution and on biologically relevant timescales. MD simulations often produce massive datasets containing millions of snapshots that describe proteins in motion. Clustering algorithms are therefore in high demand, both in their development and in their application, to classify these MD snapshots and gain biological insight. Two main categories of clustering algorithms exist; they group protein conformations into clusters based on the similarity of their shape (geometric clustering) or of their kinetics (kinetic clustering). In this paper, we review a series of frequently used clustering algorithms applied in MD simulations, including divisive algorithms, agglomerative algorithms (single-linkage, complete-linkage, average-linkage, centroid-linkage, and Ward-linkage), center-based algorithms (K-Means, K-Medoids, K-Centers, and APM), density-based algorithms (neighbor-based, DBSCAN, density-peaks, and Robust-DB), and spectral-based algorithms (PCCA and PCCA+). In particular, we discuss the differences between geometric and kinetic clustering metrics, along with the performance of the different clustering algorithms. We note that there is no one-size-fits-all algorithm for classifying MD datasets. For a specific application, the right choice of clustering algorithm should be based on the purpose of the clustering and on the intrinsic properties of the MD conformational ensembles. A main focus of our review is therefore to describe the merits and limitations of each clustering algorithm. We expect this review to help researchers choose appropriate clustering algorithms for their own MD datasets.
