
Clustering Algorithms to Analyze Molecular Dynamics Simulation Trajectories for Complex Chemical and Biological Systems

    Abstract: Molecular dynamics (MD) simulation has become a powerful tool for investigating the structure-function relationship of proteins and other biological macromolecules at atomic resolution and on biologically relevant timescales. MD simulations often produce massive datasets: many trajectories recording the coordinates and velocities of every atom as they evolve in time, amounting to millions of snapshots of proteins in motion. Clustering algorithms are therefore needed to classify these snapshots and to extract the molecular mechanisms of the system. Such algorithms group conformations by similarity, and the similarity can be of two kinds: similarity of shape (geometric clustering) or similarity of kinetics (kinetic clustering). In this paper, we review a series of clustering algorithms frequently applied to MD simulations, including divisive algorithms, agglomerative algorithms (single-linkage, complete-linkage, average-linkage, centroid-linkage, and Ward-linkage), center-based algorithms (K-Means, K-Medoids, K-Centers, and APM), density-based algorithms (neighbor-based, DBSCAN, density-peaks, and Robust-DB), and spectral-based algorithms (PCCA and PCCA+). In particular, we discuss the differences between geometric and kinetic clustering metrics, along with the performance of the different algorithms. We note that no one-size-fits-all algorithm exists for classifying MD datasets. For a specific application, the right choice of clustering algorithm depends on the purpose of the clustering and on the intrinsic properties of the MD conformational ensemble. A main focus of this review is therefore to describe the merits and limitations of each clustering algorithm. We expect this review to help researchers choose appropriate clustering algorithms for their own MD datasets.
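    As a rough illustration of what a center-based geometric clustering algorithm does, the sketch below implements a greedy K-Centers pass in plain Python. It is not taken from the review: the function name `k_centers`, the toy 2D points, and the use of Euclidean distance as a stand-in for a structural metric such as RMSD between conformations are all assumptions made for the example.

```python
import math

def k_centers(points, k, first=0):
    """Greedy K-Centers: each new center is the point farthest from all
    centers chosen so far. Returns (center indices, cluster label per point)."""
    centers = [first]
    # Distance from every point to its nearest chosen center.
    nearest = [math.dist(p, points[first]) for p in points]
    labels = [0] * len(points)
    while len(centers) < k:
        # The point with the largest distance to its nearest center
        # becomes the next center.
        new = max(range(len(points)), key=nearest.__getitem__)
        centers.append(new)
        for i, p in enumerate(points):
            d = math.dist(p, points[new])
            if d < nearest[i]:
                nearest[i] = d
                labels[i] = len(centers) - 1
    return centers, labels

# Hypothetical "conformations" reduced to 2D coordinates (assumption:
# Euclidean distance here plays the role of a geometric metric like RMSD).
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9),
       (9.0, 0.0), (9.2, 0.1)]

centers, labels = k_centers(toy, k=3)
print(centers)  # indices of the chosen cluster centers
print(labels)   # cluster label for each point
```

    Because each new center is the point farthest from the existing ones, this kind of procedure tends to produce clusters of roughly uniform radius, which makes it a purely geometric grouping; a kinetic clustering would instead group states by how quickly they interconvert.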

     
