The article information
 Yaolong Zhang, Xueyao Zhou, Bin Jiang
 Accelerating the Construction of Neural Network Potential Energy Surfaces: A Fast Hybrid Training Algorithm^{†}
 Chinese Journal of Chemical Physics, 2017, 30(6): 727-734
 http://dx.doi.org/10.1063/1674-0068/30/cjcp1711212

Article history
 Received on: November 14, 2017
 Accepted on: December 12, 2017
As a natural consequence of the Born-Oppenheimer approximation, the adiabatic potential energy surface (PES) is a function of the nuclear coordinates that governs nuclear motion and the associated molecular spectroscopy and reaction dynamics [1]. With advances in electronic structure theory, it has become straightforward to compute electronic energies over a wide range of molecular configurations via density functional theory, or even higher-level molecular wavefunction theory, using sophisticated quantum chemistry packages [2, 3]. The effort of PES construction therefore focuses on representing the large set of ab initio points with a multidimensional analytical function [4]. This offers an enormous opportunity to apply machine learning techniques [5], which are designed to extract patterns from large amounts of data, to map the relationship between the coordinates and the potential energies. Indeed, machine learning based approaches have been used in many fields of physics and chemistry [6-9], and their applications to developing high-dimensional PESs have recently grown rapidly and remain continuously active [10, 11].
Among various machine learning methods, neural networks (NNs) are among the most popular tools because of their simplicity and efficiency. Since the first application in 1995 [12], NN based potentials have achieved great success in the last decade, representing systems such as gas-phase reactions [13-18], molecule-surface interactions [19-28], as well as materials in the condensed phase [29-32]. A variety of studies have concentrated on specific issues related to chemical systems, e.g., symmetry adaptation [33-35], data sampling [36, 37], as well as more efficient NN structures for high-dimensional problems [31, 38-40]. In any case, NNs are essentially a set of chained nonlinear analytical functions termed "neurons", with a number of parameters termed "weights" and "biases", offering high flexibility to approximate an arbitrary function [41]. These parameters are optimized to reproduce the ab initio energies as accurately as possible. This optimization process is termed "training" or "learning" in the language of artificial intelligence, and it is essential for realizing a fast and accurate fit, especially with a large number of data points as in the case of high-dimensional PESs. It is therefore understandable that the standard backpropagation (BP) algorithm [42], which converges slowly and easily gets stuck in local minima, has seldom been used to train NN potentials. Instead, improved gradient descent methods, e.g., resilient backpropagation (Rprop) [43] and conjugate gradient (CG) [44], and second-order learning algorithms, e.g., Levenberg-Marquardt (LM) [44] and the global extended Kalman filter (GEKF) [45], are more useful. While a comprehensive comparison among these methods is rarely available, the LM algorithm has been found to work much better than gradient descent based methods [44] and is recognized as one of the most efficient and accurate training algorithms for small and medium-sized data sets.
However, the LM algorithm becomes less efficient when training on hundreds of thousands of data points, which are nevertheless necessary, e.g., for chemical systems with many degrees of freedom (DOFs) [20, 25, 28].
Recently, Huang and coworkers proposed a new learning scheme called the extreme learning machine (ELM) [46], which is designed for single hidden layer feedforward neural networks (SLFNs). With initially randomized weights, training an SLFN in the ELM framework becomes equivalent to solving a generalized linear least-squares problem, and is thus extremely fast and applicable to big data. However, ELM does not yield sufficiently small fitting errors and requires many more neurons, as it produces the output weights all at once and does not further optimize the weights in the input layer [47]. Chen and Jin later suggested combining the LM and ELM methods for training SLFNs, i.e., using ELM to obtain the weights linking the hidden and output layers and LM for those linking the input and hidden layers, respectively [48]. This hybrid algorithm was found to be more efficient than LM alone for model regression and classification problems using SLFNs with only dozens of weights. It is not yet clear, however, how it performs when training on a large data set and on multilayer NNs with thousands of parameters. In the present work, we generalize this hybrid algorithm to multilayer feedforward neural networks (MFNNs) and carefully examine its performance in fitting multidimensional PESs with up to two hundred thousand ab initio data points and several thousand weights and biases to be determined. Our results demonstrate that this novel algorithm can be very useful in accelerating the training process when developing NN based PESs.
Ⅱ. THEORY
A. Permutation invariant polynomial neural networks
As depicted in FIG. 1, a typical feedforward NN relates a set of input parameters to a scalar output via one or more hidden layers of interconnecting neurons. The interconnecting neurons are expressed as nonlinear activation functions, which provide the flexibility of NNs to approximate any function. Supposing the NN has $m$ hidden layers with $N_k$ neurons in the $k$-th layer, the output of the $i$-th neuron in the $(k+1)$-th layer is given by
$ y_i^{k+1}=f^{k+1}\left[\sum\limits^{N_k}_{j=1}\left(w_{i, j}^{k+1}\times y_j^k\right)+b_i^{k+1}\right] $  (1) 
where $y_j^k$ is the output of the $j$-th neuron in the $k$-th layer, $w_{i,j}^{k+1}$ and $b_i^{k+1}$ are the connecting weight and the bias, respectively, and $f^{k+1}$ is the activation function of the $(k+1)$-th layer.
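Eq.(1) amounts to a chain of affine transforms and nonlinearities, which can be evaluated layer by layer. The minimal NumPy sketch below does exactly that; the layer sizes, the tanh hidden activations, and the linear output layer are illustrative assumptions, not the specific architecture used in this work:

```python
import numpy as np

def mfnn_forward(x, weights, biases):
    """Evaluate Eq.(1) layer by layer:
    y^{k+1} = f^{k+1}(W^{k+1} y^k + b^{k+1}),
    with tanh for the hidden layers and a linear output layer."""
    y = x
    last = len(weights) - 1
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W @ y + b
        y = z if k == last else np.tanh(z)  # f is the identity at the output
    return y

rng = np.random.default_rng(0)
sizes = [3, 10, 10, 1]  # N_0 = 3 inputs, two hidden layers, scalar energy
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
energy = mfnn_forward(np.array([0.5, -0.2, 1.0]), weights, biases)
```

Counting the parameters of such a network recovers Eq.(4) below: each layer contributes $(N_{i-1}+1)\times N_i$ weights and biases.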
For our purpose here, the input layer often consists of molecular coordinates while the output is the potential energy. For a specific chemical system of interest, one has to incorporate the intrinsic permutation symmetry (i.e. the equivalence of these geometries with identical atoms exchanged) in the NN potentials, which is essential to accurately describe nuclear dynamics. To this end, we recently developed a simple and rigorous permutation invariant polynomial neural network (PIPNN) approach [4, 34, 49, 50], in which the input layer is made of the polynomials of functions of internuclear distances that satisfy the permutation symmetry [51]. Since our focus here is on the training algorithm which is universal and independent of the input vector, details of the construction of PIPs will not be given here and can be found in Ref.[4].
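To give a flavor of such symmetry-adapted inputs (full details are in Ref.[4]), the sketch below builds low-order PIPs for a hypothetical A$_3$-type system from Morse-like variables of the three internuclear distances; the choice of variable and the range parameter `lam` are illustrative assumptions:

```python
import numpy as np

def pip_inputs_A3(r12, r13, r23, lam=1.0):
    """Permutationally invariant inputs for three identical atoms:
    Morse-like variables p_ij = exp(-r_ij/lam) combined into the
    elementary symmetric polynomials, which are unchanged under any
    relabeling of the atoms."""
    p = np.exp(-np.array([r12, r13, r23]) / lam)
    g1 = p.sum()
    g2 = p[0] * p[1] + p[0] * p[2] + p[1] * p[2]
    g3 = p.prod()
    return np.array([g1, g2, g3])

# Exchanging identical atoms permutes the distances but leaves
# the NN input vector unchanged.
a = pip_inputs_A3(1.0, 1.5, 2.0)
b = pip_inputs_A3(1.5, 2.0, 1.0)
```

Feeding such invariant polynomials to the input layer guarantees that the fitted PES is rigorously permutation symmetric, regardless of the NN weights.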
B. Levenberg-Marquardt algorithm
The Levenberg-Marquardt algorithm is a powerful approach for solving general nonlinear least-squares problems [52, 53]. It balances the second-order Gauss-Newton method and simple steepest descent via a variable factor, thus taking advantage of both methods, i.e., the fast convergence of the former and the stable convergence of the latter.
The goal of training NNs is to minimize the performance function, i.e., the sum of squared deviations between the NN outputs and the target potential energies (minimizing which is equivalent to minimizing the root mean square error, RMSE),
$ \mathbf{D}(\mathbf{x})=\sum\limits_{i=1}^{N_{\rm{t}}}(e_i)^2 =\sum\limits_{i=1}^{N_{\rm{t}}}(E_i-t_i)^2 $  (2) 
where $e_i$ is the fitting error of the $i$-th training point, with $E_i$ the NN output and $t_i$ the target ab initio energy, and $N_{\rm t}$ is the number of training data. Collecting all weights and biases into a parameter vector $\bf{x}$, Newton's method updates the parameters according to
$ {\boldsymbol \Delta \bf{x}}=-\left[\boldsymbol\nabla^2\bf{D}(\bf{x})\right]^{-1}\boldsymbol\nabla \bf{D}(\bf{x}) $  (3) 
where $\boldsymbol\nabla\bf{D}(\bf{x})$ and $\boldsymbol\nabla^2\bf{D}(\bf{x})$ are the gradient vector and the Hessian matrix of the performance function, respectively. The total number of adjustable parameters is
$ l=\sum\limits_{i=1}^{m+1}(N_{i-1}+1)\times N_i $  (4) 
where $m$ is the number of hidden layers and $N_i$ is the number of neurons in the $i$-th layer ($N_0$ being the dimension of the input vector and $N_{m+1}=1$ for the scalar output). The gradient and the Hessian can be expressed as
$ \boldsymbol\nabla\bf{D}(\bf{x})=\bf{J}^\mathrm{T}(\bf{x})\bf{e}(\bf{x}) $  (5) 
$ \boldsymbol\nabla^2\bf{D}(\bf{x})=\bf{J}^\mathrm{T}(\bf{x})\bf{J}(\bf{x})+\bf{S}(\bf{x}) $  (6) 
with
$ \begin{align} {\bf{J}}\left( {\bf{x}} \right) &=& \left[{\begin{array}{*{20}{c}} {\displaystyle\frac{{\partial {e_1}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_1}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_1}}}{{\partial {x_l}}}}\\ {\displaystyle\frac{{\partial {e_2}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_2}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_2}}}{{\partial {x_l}}}}\\ \vdots&\vdots&\ddots&\vdots \\ {\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_l}}}} \end{array}} \right] \end{align} $  (7) 
$ {\bf{S}}\left( {\bf{x}} \right) = \sum\limits_{i = 1}^{{N_{\rm{t}}}} {{e_i}({\bf{x}})} {\nabla ^2}{e_i}({\bf{x}}) $  (8) 
While the Gauss-Newton algorithm further assumes $\bf{S}(\bf{x})\approx \bf{0}$, so that the Hessian is approximated by $\bf{J}^{\rm T}(\bf{x})\bf{J}(\bf{x})$ alone, the LM algorithm damps this approximation with a variable factor, yielding the update
$ \Delta {\bf{x}} = -{\left[{{{\bf{J}}^{\rm{T}}}\left( {\bf{x}} \right){\bf{J}}\left( {\bf{x}} \right) + \mu {\bf{I}}} \right]^{-1}}{{\bf{J}}^{\rm{T}}}\left( {\bf{x}} \right){\bf{e}}\left( {\bf{x}} \right) $  (9) 
where $\mu$ is the damping factor, adapted during training, and $\bf{I}$ is the identity matrix. A large $\mu$ drives the update toward steepest descent, whereas a small $\mu$ recovers the Gauss-Newton step.
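The update of Eq.(9), together with the usual adaptation of $\mu$, can be sketched as follows for a generic least-squares problem; the factor-of-10 adjustment of $\mu$ and the toy linear model are illustrative assumptions:

```python
import numpy as np

def lm_step(x, residual_fn, jacobian_fn, mu):
    """One Levenberg-Marquardt iteration, Eq.(9):
    dx = -[J^T J + mu I]^{-1} J^T e.  On success mu is decreased
    (toward Gauss-Newton); on failure it is increased (toward
    steepest descent)."""
    e = residual_fn(x)
    J = jacobian_fn(x)
    A = J.T @ J + mu * np.eye(x.size)
    dx = -np.linalg.solve(A, J.T @ e)
    x_new = x + dx
    if np.sum(residual_fn(x_new)**2) < np.sum(e**2):
        return x_new, mu / 10.0   # accept the step
    return x, mu * 10.0           # reject the step

# Toy least-squares problem: fit y = a*t + b
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * t + 1.0
def res(x): return x[0] * t + x[1] - y            # e_i = E_i - t_i
def jac(x): return np.column_stack([t, np.ones_like(t)])

x, mu = np.array([0.0, 0.0]), 1e-2
for _ in range(50):
    x, mu = lm_step(x, res, jac, mu)
```

For this linear toy problem the iteration converges to the exact solution $(a, b) = (2, 1)$ within a few accepted steps; for an NN fit, the residuals and Jacobian are simply those of Eq.(2) and Eq.(7).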
C. Extreme learning machine
ELM is an emerging learning algorithm for generalized single hidden layer feedforward neural networks [46]. Unlike gradient based learning algorithms, the NN parameters are not tuned iteratively but determined by solving a generalized linear equation. Supposing that the activation function of the output layer is linear, the SLFNs can be rewritten mathematically as linear equations, i.e.,
$ \sum\limits_{i = 1}^{{N_1}} {{\beta _i}f} \left( {\sum\limits_{j = 1}^{{N_0}} {w_{i,j}^1 \times y_{n,j}^0 + {b_i}} } \right) = {o_n},\;\;\;\;n = 1,2, \ldots ,N_{\rm{t}} $  (10) 
The task of minimizing the deviation between the targets and the NN outputs then reduces to the linear system
$ {\bf{H}}\boldsymbol\beta = {\bf{T}} $  (11) 
where $\bf{H}$ is the hidden layer output matrix, $\boldsymbol\beta$ the vector of output weights, and $\bf{T}$ the vector of targets, i.e.,
$ \begin{align} {\bf{H}} &=& \left[{\begin{array}{*{20}{c}} {y_{1, 1}^1}&{y_{1, 2}^1}& \ldots &{y_{1, {N_1}}^1}\\ {y_{2, 1}^1}&{y_{2, 2}^1}& \ldots &{\;y_{2, {N_1}}^1}\\ \vdots&\vdots&\ddots&\vdots \\ {y_{N_{\rm{t}}, 1}^1}&{y_{N_{\rm{t}}, 2}^1}& \ldots &{y_{N_{\rm{t}}, {N_1}}^1} \end{array}} \right] \end{align} $  (12) 
$ \begin{align} \boldsymbol\beta &=& \left[{\begin{array}{*{20}{c}} {w_{1, 1}^2}\\ {w_{1, 2}^2}\\ \vdots \\ {w_{1, {N_1}}^2} \end{array}} \right], \quad \bf{T}= \left[{\begin{array}{*{20}{c}} {{t_1}}\\ {{t_2}}\\ \vdots \\ {{t_{N_{\rm{t}}}}} \end{array}} \right] \end{align} $  (13) 
A least-squares solution with the minimum norm is given explicitly by
$ \boldsymbol\beta ^ + = \bf{H}^ + \bf{T} $  (14) 
where $\bf{H}^+$ is the Moore-Penrose generalized inverse of the matrix $\bf{H}$.
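In matrix terms, Eqs.(11)-(14) amount to one random projection plus one linear solve. A minimal sketch follows; the two-dimensional toy target and the network size are illustrative assumptions:

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """ELM for a single-hidden-layer net: input weights are drawn at
    random, then the output weights beta are solved linearly via the
    Moore-Penrose pseudoinverse, Eq.(14): beta = H^+ T."""
    W = rng.normal(size=(n_hidden, X.shape[1]))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W.T + b)          # hidden layer output matrix, Eq.(12)
    beta = np.linalg.pinv(H) @ T      # minimum-norm least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W.T + b) @ beta

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(X[:, 0]) + X[:, 1]**2      # toy target "potential"
W, b, beta = elm_train(X, T, n_hidden=50, rng=rng)
pred = elm_predict(X, W, b, beta)
rmse = np.sqrt(np.mean((pred - T)**2))
```

The entire "training" is a single pseudoinverse, which is why ELM is so fast; the price, as noted above, is that the randomly drawn input weights are never refined.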
D. The hybrid ELM-LM algorithm
Given the advantages and disadvantages of both algorithms, Chen and Jin suggested combining the LM and ELM methods for training SLFNs, which worked well for model systems with just a few dozen parameters. The idea is to use ELM in a forward pass to obtain the parameters between the hidden and output layers, and LM in a backward pass to finely tune the parameters between the input and hidden layers [48]. Although ELM is designed for SLFNs, it is in principle applicable to MFNNs, provided that only the parameters connecting the last hidden layer and the output layer are solved in the ELM step. To this end, our generalized ELM-LM algorithm is illustrated in FIG. 2. Starting with all weights and biases randomly assigned, those linking the last hidden and output layers are immediately obtained by ELM. With these parameters fixed, an LM optimization is then performed to update only the weights and biases between the input layer and the last hidden layer. The ELM and LM steps are repeated iteratively until the RMSE reaches a given criterion. This constitutes the universal ELM-LM hybrid algorithm.
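The cycle of FIG. 2 can be sketched as follows: each pass solves the last-layer weights linearly by ELM and then refines the inner weights. Note that the paper's LM refinement is replaced here by a few plain gradient-descent steps purely to keep the sketch short, and the toy data, network size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
T = np.sin(2.0 * X[:, 0]) * np.cos(X[:, 1])   # toy target "energies"
n_h = 30
W = rng.normal(size=(n_h, 2))
b = rng.normal(size=n_h)

def hidden(X, W, b):
    """Output of the last hidden layer."""
    return np.tanh(X @ W.T + b)

for cycle in range(20):
    # ELM step: with W, b frozen, the output weights follow linearly, Eq.(14).
    beta = np.linalg.pinv(hidden(X, W, b)) @ T
    # Refinement of the inner weights (full LM in the paper; plain
    # gradient descent is substituted here for brevity).
    for _ in range(5):
        H = hidden(X, W, b)
        e = H @ beta - T                       # e_i = E_i - t_i
        dZ = (1.0 - H**2) * np.outer(e, beta)  # back-prop through tanh
        W -= 1e-4 * dZ.T @ X
        b -= 1e-4 * dZ.sum(axis=0)

# Final ELM solve and training RMSE of the hybrid fit
H = hidden(X, W, b)
beta = np.linalg.pinv(H) @ T
rmse = np.sqrt(np.mean((H @ beta - T)**2))
```

Because the linear ELM solve is exact for the frozen inner weights, each cycle can only reduce the error attributable to the last layer, which is why the hybrid starts from a much lower RMSE than a fully random initialization.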
Ⅲ. RESULTS AND DISCUSSION
To compare the efficiency of ELM-LM with the LM algorithm, we fit PIP-NN PESs using both algorithms for two representative systems, i.e., the H+H$_2$ exchange reaction and the dissociative chemisorption of CH$_4$ on Ni(111).
A. H+H$_2$
For the H+H$_2$ system,
As shown in Table Ⅰ, both the ELM-LM and LM algorithms produce a similar RMSE with
Let us now discuss the efficiency of the two algorithms. As listed in Table Ⅰ, the ELM-LM algorithm is much more efficient than LM, taking less than 1/3 of the training time to reach a similar level of accuracy. Here the training time is given in hours, computed in parallel with OpenMP on two Intel Xeon E5-2680 v3 processors comprising 24 cores. Interestingly, FIG. 6(a) displays the reduction of the overall RMSE as a function of the iterative step (epoch in NN language), suggesting a twofold origin of this gain in efficiency. In the beginning, ELM-LM yields a much lower initial RMSE, since the first ELM step acts as a more advanced initialization of the weights between the second hidden layer and the output. On the other hand, ELM-LM terminates much earlier than LM, as it quickly reaches the local minimum in the error space and triggers early stopping. Note that FIG. 6(a) is plotted on a logarithmic scale, and the decrease in both the initial fitting error and the number of training steps induced by ELM is remarkable.
B. CH$_4$/Ni(111)
Recent advances in electronic structure calculations enable one to compute up to hundreds of thousands of single-point energies in PES construction. For example, a chemically accurate PES describing methane dissociative chemisorption on frozen Ni(111) has been developed in our group by a PIP-NN fit with nearly two hundred thousand points [55]. These points were collected initially from ab initio molecular dynamics (AIMD) trajectories, followed by continuously adding more data points from quasi-classical trajectory (QCT) calculations on the iteratively improved PES. A geometric criterion based on the Euclidean distance between two points, and an energy criterion based on the energy difference among three fits, were employed to make sure that the additional points are neither close to any existing datum nor collected in regions that are already well described [4].
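The two screening criteria described above can be sketched as follows; the distance and energy thresholds and the two-dimensional coordinates are illustrative assumptions, not the values used for the actual CH$_4$/Ni(111) data set:

```python
import numpy as np

def select_new_points(candidates, existing, fits, d_min=0.3, e_max=0.02):
    """Screen candidate geometries before adding them to the training set:
    keep a point only if (i) its Euclidean distance to every existing
    datum exceeds d_min (geometric criterion) and (ii) the spread of
    several independent fits at that point exceeds e_max (energy
    criterion, i.e., the region is not yet well described)."""
    selected = []
    for x in candidates:
        far = np.all(np.linalg.norm(existing - x, axis=1) > d_min)
        spread = np.ptp([f(x) for f in fits])   # max - min over the fits
        if far and spread > e_max:
            selected.append(x)
    return np.array(selected)

existing = np.array([[0.0, 0.0], [1.0, 1.0]])
fits = [lambda x: x.sum(),
        lambda x: x.sum() + 0.05,
        lambda x: x.sum() - 0.01]
cands = np.array([[0.05, 0.0],    # too close to an existing datum: rejected
                  [2.0, 2.0]])    # far away and the fits disagree: selected
new = select_new_points(cands, existing, fits)
```

In practice the distances are computed in the full configuration space and the candidate pool comes from the QCT trajectories, but the accept/reject logic is the same.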
In the present work, these data were used to test the performance of the ELM-LM and LM algorithms for the CH$_4$/Ni(111) system.
We next discuss the acceleration achieved by the ELM-LM algorithm. For this complex system, the training process is very time-consuming, and the pure LM algorithm requires about half a month to finish the 50 fits. It is found in Table Ⅰ that ELM-LM is
To summarize, we report in this work a novel hybrid algorithm for training MFNNs, aiming to fit a high-dimensional PES with numerous data. This so-called ELM-LM algorithm combines the ELM algorithm, which linearly determines the weights linking the last hidden layer and the output, with the LM algorithm, which refines the weights elsewhere in the MFNNs. Taking advantage of both, ELM-LM is able to train MFNNs on complicated data with both high efficiency and accuracy. Our tests on the H+H$_2$ and CH$_4$/Ni(111) systems demonstrate that ELM-LM achieves an accuracy comparable to LM at a substantially reduced training cost.
This work was supported by the National Key R&D Program of China (No.2017YFA0303500), the National Natural Science Foundation of China (No.91645202, No.21722306, and No.21573203), and the Fundamental Research Funds for the Central Universities of China (No.WK2060190082 and No.WK2340000078). Calculations were performed at the Supercomputing Center of USTC. Bin Jiang thanks Dr. Jun Chen at USTC for providing the source code for the Levenberg-Marquardt algorithm and for helpful discussions.
[1]  J. N. Murrell, S. Carter, S. C. Farantos, P. Huxley, and A. J. C. Varandas, Molecular Potential Energy Functions. Chichester: Wiley (1984). 
[2]  H.-J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, M. Schütz, P. Celani, W. Györffy, D. Kats, T. Korona, R. Lindh, A. Mitrushenkov, G. Rauhut, K. R. Shamasundar, T. B. Adler, R. D. Amos, A. Bernhardsson, A. Berning, D. L. Cooper, M. J. O. Deegan, A. J. Dobbyn, F. Eckert, E. Goll, C. Hampel, A. Hesselmann, G. Hetzer, T. Hrenar, G. Jansen, C. Köppl, Y. Liu, A. W. Lloyd, R. A. Mata, A. J. May, S. J. McNicholas, W. Meyer, M. E. Mura, A. Nicklaß, D. P. O'Neill, P. Palmieri, D. Peng, K. Pflüger, R. Pitzer, M. Reiher, T. Shiozaki, H. Stoll, A. J. Stone, R. Tarroni, T. Thorsteinsson, and M. Wang, see http://www.molpro.net (2015). 
[3]  G. Kresse, and J. Furthmüller, Comp. Mater. Sci. 6 , 15 (1996). DOI:10.1016/0927-0256(96)00008-0 
[4]  B. Jiang, J. Li, and H. Guo, Int. Rev. Phys. Chem. 35 , 479 (2016). DOI:10.1080/0144235X.2016.1200347 
[5]  M. I. Jordan, and T. M. Mitchell, Science 349 , 255 (2015). DOI:10.1126/science.aaa8415 
[6]  J. Behler, and M. Parrinello, Phys. Rev. Lett. 98 , 146401 (2007). DOI:10.1103/PhysRevLett.98.146401 
[7]  A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett. 104 , 136403 (2010). DOI:10.1103/PhysRevLett.104.136403 
[8]  J. C. Snyder, M. Rupp, K. Hansen, K. R. Müller, and K. Burke, Phys. Rev. Lett. 108 , 253002 (2012). DOI:10.1103/PhysRevLett.108.253002 
[9]  J. Cui, and R. V. Krems, Phys. Rev. Lett. 115 , 073202 (2015). DOI:10.1103/PhysRevLett.115.073202 
[10]  A. P. Bart, and G. Csányi, Int. J. Quant. Chem. 115 , 1051 (2015). DOI:10.1002/qua.24927 
[11]  J. Behler, J. Chem. Phys. 145 , 170901 (2016). DOI:10.1063/1.4966192 
[12]  T. B. Blank, S. D. Brown, A. W. Calhoun, and D. J. Doren, J. Chem. Phys. 103 , 4129 (1995). DOI:10.1063/1.469597 
[13]  J. Chen, X. Xu, and D. H. Zhang, J. Chem. Phys. 138 , 154301 (2013). DOI:10.1063/1.4801658 
[14]  J. Chen, X. Xu, X. Xu, and D. H. Zhang, J. Chem. Phys. 138 , 221104 (2013). DOI:10.1063/1.4811109 
[15]  J. Li, and H. Guo, Phys. Chem. Chem. Phys. 16 , 6753 (2014). DOI:10.1039/C4CP00241E 
[16]  J. Li, and H. Guo, J. Chem. Phys. 143 , 221103 (2015). DOI:10.1063/1.4937570 
[17]  D. D. Lu, and J. Li, J. Chem. Phys. 145 , 014303 (2016). DOI:10.1063/1.4954765 
[18]  J. Li, R. Dawes, and H. Guo, Phys. Chem. Chem. Phys. 18 , 29825 (2016). DOI:10.1039/C6CP06232F 
[19]  B. Jiang, and H. Guo, Phys. Chem. Chem. Phys. 16 , 24704 (2014). DOI:10.1039/C4CP03761H 
[20]  B. Kolb, X. Luo, X. Zhou, B. Jiang, and H. Guo, J. Phys. Chem. Lett. 8 , 666 (2017). DOI:10.1021/acs.jpclett.6b02994 
[21]  B. Jiang, and H. Guo, J. Chem. Phys. 144 , 091101 (2016). DOI:10.1063/1.4943002 
[22]  B. Jiang, and H. Guo, Phys. Rev. Lett. 114 , 166101 (2015). DOI:10.1103/PhysRevLett.114.166101 
[23]  T. H. Liu, B. N. Fu, and D. H. Zhang, Sci. China:Chem. 57 , 147 (2014). DOI:10.1007/s11426-013-5005-7 
[24]  T. Liu, Z. Zhang, B. Fu, X. Yang, and D. H. Zhang, Chem. Sci. 7 , 1840 (2016). DOI:10.1039/C5SC03689E 
[25]  X. J. Shen, J. Chen, Z. J. Zhang, K. J. Shao, and D. H. Zhang, J. Chem. Phys. 143 , 144701 (2015). DOI:10.1063/1.4932226 
[26]  X. Shen, Z. Zhang, and D. H. Zhang, Phys. Chem. Chem. Phys. 17 , 25499 (2015). DOI:10.1039/C5CP04229A 
[27]  S. Lorenz, A. Groß, and M. Scheffler, Chem. Phys. Lett. 395 , 210 (2004). DOI:10.1016/j.cplett.2004.07.076 
[28]  K. Shakouri, J. Behler, J. Meyer, and G. J. Kroes, J. Phys. Chem. Lett. 8 , 2131 (2017). DOI:10.1021/acs.jpclett.7b00784 
[29]  S. Kondati Natarajan, T. Morawietz, and J. Behler, Phys. Chem. Chem. Phys. 17 , 8356 (2015). DOI:10.1039/C4CP04751F 
[30]  T. Morawietz, V. Sharma, and J. Behler, J. Chem. Phys. 136 , 064103 (2012). DOI:10.1063/1.3682557 
[31]  J. Behler, and M. Parrinello, Phys. Rev. Lett. 98 , 146401 (2007). DOI:10.1103/PhysRevLett.98.146401 
[32]  S. K. Natarajan, and J. Behler, Phys. Chem. Chem. Phys. 18 , 28704 (2016). DOI:10.1039/C6CP05711J 
[33]  J. Behler, S. Lorenz, and K. Reuter, J. Chem. Phys. 127 , 014705 (2007). DOI:10.1063/1.2746232 
[34]  B. Jiang, and H. Guo, J. Chem. Phys. 139 , 054112 (2013). DOI:10.1063/1.4817187 
[35]  K. Shao, J. Chen, Z. Zhao, and D. H. Zhang, J. Chem. Phys. 145 , 071101 (2016). DOI:10.1063/1.4961454 
[36]  L. M. Raff, M. Malshe, M. Hagan, D. I. Doughan, M. G. Rockley, and R. Komanduri, J. Chem. Phys. 122 , 084104 (2005). DOI:10.1063/1.1850458 
[37]  S. Manzhos, X. G. Wang, R. Dawes, and T. Carrington, J. Phys. Chem. A 110 , 5295 (2006). DOI:10.1021/jp055253z 
[38]  J. Behler, J. Chem. Phys. 134 , 074106 (2011). DOI:10.1063/1.3553717 
[39]  S. Manzhos, R. Dawes, and T. Carrington, Int. J. Quant. Chem. 115 , 1012 (2015). DOI:10.1002/qua.24795 
[40]  S. Manzhos, T. Carrington Jr., J. Chem. Phys. 125 , 194105 (2006). 
[41]  L. M. Raff, R. Komanduri, M. Hagan, and S. T. S. Bukkapatnam, Neural Networks in Chemical Reaction Dynamics. Oxford: Oxford University Press (2012). 
[42]  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323 , 533 (1986). DOI:10.1038/323533a0 
[43]  C. Igel and M. Hüsken, in Proceedings of the Second International ICSC Symposium on Neural Computation, NC'2000, Berlin, Germany: ICSC Academic Press, 115 (2000). 
[44]  M. T. Hagan, and M. B. Menhaj, IEEE Trans. Neural Networks 5 , 989 (1994). DOI:10.1109/72.329697 
[45]  T. B. Blank, and S. D. Brown, J. Chemom. 8 , 391 (1994). DOI:10.1002/(ISSN)1099-128X 
[46]  G. B. Huang, Q. Y. Zhu, and C. K. Siew, presented at the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), (2004). 
[47]  J. Tang, C. Deng, and G. B. Huang, IEEE Trans. Neural Netw. Learn. Syst. 27 , 809 (2016). DOI:10.1109/TNNLS.2015.2424995 
[48]  H. Chen and F. Jin, Advances in Neural NetworksISNN 2006, J. Wang, Z. Yi, J. M. Zurada, B. L. Lu and H. Yin, Eds, Berlin, Heidelberg: Springer 509(2006). 
[49]  B. Jiang, and H. Guo, J. Chem. Phys. 141 , 034109 (2014). DOI:10.1063/1.4887363 
[50]  J. Li, B. Jiang, and H. Guo, J. Chem. Phys. 139 , 204103 (2013). DOI:10.1063/1.4832697 
[51]  Z. Xie, and J. M. Bowman, J. Chem. Theo. Comp. 6 , 26 (2010). DOI:10.1021/ct9004917 
[52]  K. Levenberg, Q. J. Math. 2 , 164 (1944). 
[53]  D. W. Marquardt, J. Soc. Ind. Appl. Math. 11 , 431 (1963). DOI:10.1137/0111030 
[54]  A. I. Boothroyd, W. J. Keogh, P. G. Martin, and M. R. Peterson, J. Chem. Phys. 104 , 7139 (1996). DOI:10.1063/1.471430 
[55]  X. Zhou, F. Nattino, Y. Zhang, J. Chen, G. J. Kroes, H. Guo, and B. Jiang, Phys. Chem. Chem. Phys. 19 , 30540 (2017). DOI:10.1039/C7CP05993K 
[56]  D. Nguyen and B. Widrow, 1990 IJCNN International Joint Conference on Neural Networks, 21(1990). 
[57]  X. Hu, W. L. Hase, and T. Pirraglia, J. Comp. Chem. 12 , 1014 (1991). DOI:10.1002/(ISSN)1096-987X 