Chinese Journal of Chemical Physics  2017, Vol. 30 Issue (6): 727-734

The article information

Yao-long Zhang, Xue-yao Zhou, Bin Jiang

Accelerating the Construction of Neural Network Potential Energy Surfaces: A Fast Hybrid Training Algorithm

Chinese Journal of Chemical Physics, 2017, 30(6): 727-734

http://dx.doi.org/10.1063/1674-0068/30/cjcp1711212

Article history

Received on: November 14, 2017
Accepted on: December 12, 2017
Department of Chemical Physics, University of Science and Technology of China, Hefei 230026, China
*Author to whom correspondence should be addressed. Bin Jiang, E-mail:bjiangch@ustc.edu.cn
Abstract: Machine learning approaches have been promising in constructing high-dimensional potential energy surfaces (PESs) for molecules and materials. Neural networks (NNs) are one of the most popular such tools because of their simplicity and efficiency. The training algorithm for NNs thus becomes essential to achieving a fast and accurate fit with numerous data. The Levenberg-Marquardt (LM) algorithm has been recognized as one of the fastest and most robust algorithms for training medium-sized NNs and has been widely applied in recent NN based high quality PESs. However, when the number of ab initio data points becomes large, the efficiency of LM is limited, making the training time-consuming. Extreme learning machine (ELM) is a recently proposed algorithm which determines the weights and biases of a single hidden layer NN by a linear solution and is thus extremely fast. It does not, however, produce sufficiently small fitting errors because of its random nature. Taking advantage of both algorithms, we report a generalized hybrid algorithm for training multilayer NNs. Tests on the H+H2 and CH4+Ni(111) systems demonstrate the much higher efficiency of this hybrid algorithm (ELM-LM) over the original LM. We expect that ELM-LM will find widespread applications in building up high-dimensional NN based PESs.
Key words: Potential energy surface    Reaction dynamics    Neural networks
Ⅰ. INTRODUCTION

As a natural consequence of the Born-Oppenheimer approximation, the adiabatic potential energy surface (PES) represents a function of the nuclear coordinates and governs the nuclear motion and the associated molecular spectroscopy and reaction dynamics [1]. With advances in electronic structure theory, it has now become straightforward to compute electronic energies via density functional theory, or even higher level molecular wavefunction theory, spanning a wide range of molecular configurations using sophisticated quantum chemistry packages [2, 3]. The effort of PES construction therefore focuses on representing the large set of ab initio points with a multidimensional analytical function [4]. This offers an enormous opportunity to apply machine learning techniques [5], which are designed for extracting patterns from numerous data, to map the coordinates to the potential energies. Indeed, machine learning based approaches have been used in many fields of physics and chemistry [6-9], and their applications in developing high-dimensional PESs have recently grown rapidly and remain continuously active [10, 11].

Among various machine learning methods, neural networks (NNs) are one of the most popular tools because of their simplicity and efficiency. Since the first application in 1995 [12], NN based potentials have achieved great successes in the last decade, representing, for example, gas phase reactions [13-18], molecule-surface interactions [19-28], as well as materials in the condensed phase [29-32]. A variety of studies have concentrated on addressing specific issues related to chemical systems, e.g., symmetry adaptation [33-35], data sampling [36, 37], as well as developing more efficient NN structures for high-dimensional problems [31, 38-40]. In any case, NNs are essentially a set of chained nonlinear analytical functions termed "neurons" with a number of parameters termed "weights" and "biases", offering high flexibility to approximate an arbitrary function [41]. These parameters are optimized to reproduce the ab initio energies as accurately as possible. This optimization process is termed "training" or "learning" in the language of artificial intelligence, and it is essential for realizing a fast and accurate fit, especially with a large number of data points as in the case of high-dimensional PESs. As a result, it is understandable that the standard backpropagation (BP) algorithm [42], which converges slowly and easily gets stuck in a local minimum, has seldom been used in training NN potentials. Alternatively, other improved gradient descent methods, e.g., resilient backpropagation (Rprop) [43] and conjugate gradient (CG) [44], and second order learning algorithms, e.g., Levenberg-Marquardt (LM) [44] and the global extended Kalman filter (GEKF) [45], are more useful. While a comprehensive comparison among these methods is rarely available, the LM algorithm has been found to work much better than gradient descent based methods [44] and is recognized as one of the most efficient and accurate training algorithms for small and medium-sized data sets. However, the LM algorithm becomes less efficient when training hundreds of thousands of data points, which is nevertheless necessary, e.g., for chemical systems with many degrees of freedom (DOFs) [20, 25, 28].

Recently, Huang and coworkers proposed a new learning scheme called extreme learning machine (ELM) [46], which is designed for single hidden layer feedforward neural networks (SLFNs). With initially randomized weights, training the SLFNs in the ELM framework becomes simply equivalent to solving a generalized linear least squares problem, and is thus extremely fast and applicable to big data. However, ELM does not yield small enough fitting errors and requires many more neurons, as it produces the weights all at once and does not further optimize the weights in the input layer [47]. Chen and Jin later suggested combining the LM and ELM methods for training SLFNs, i.e., using LM and ELM to obtain the weights linking the input and hidden layers and those linking the hidden and output layers, respectively [48]. It was found that this hybrid algorithm can be more efficient than LM alone for model regression and classification problems using SLFNs with only dozens of weights. It is not yet clear, however, how it performs in training multilayer NNs with thousands of parameters on large data sets. In the present work, we generalize this hybrid algorithm to multilayer feedforward neural networks (MFNNs) and carefully check its performance in fitting multidimensional PESs with up to two hundred thousand ab initio data points and several thousand weights and biases to be determined. Our results demonstrate that this novel algorithm can be very useful for accelerating the training process in developing NN based PESs.

Ⅱ. THEORY

A. Permutation invariant polynomial neural networks

As depicted in FIG. 1, a typical feed-forward NN relates a set of input parameters to a scalar output via one or more hidden layers of interconnecting neurons. The interconnecting neurons are expressed as nonlinear activation functions which provide the flexibility of NNs to approximate any function. Supposing the NN has $N_k$ neurons in the $k$th layer, the output of the $i$th neuron in the ($k$+1)th layer is

 FIG. 1 The schematic structure of the multilayer neural networks
 $y_i^{k+1}=f^{k+1}\left[\sum\limits^{N_k}_{j=1}\left(w_{i, j}^{k+1}\times y_j^k\right)+b_i^{k+1}\right]$ (1)

where $w_{i, j}^{k+1}$ is the weight connecting the $i$th neuron in the ($k$+1)th layer and the $j$th neuron in the $k$th layer, $b_i^{k+1}$ is the bias of the $i$th neuron in the ($k$+1)th layer, and $f^{k+1}(x)$ is the activation function of the ($k$+1)th layer.
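As a concrete illustration of Eq. (1), the forward pass of such a network can be sketched in a few lines of NumPy (a hypothetical sketch, not the code used in this work); it assumes tanh hidden activations and a linear output layer, as adopted later for the fits:

```python
import numpy as np

def forward(x, weights, biases, hidden_act=np.tanh):
    """Evaluate a feed-forward NN as in Eq. (1).

    weights[k] has shape (N_{k+1}, N_k) and biases[k] has shape (N_{k+1},).
    Hidden layers use a nonlinear activation; the output layer is linear.
    """
    y = np.asarray(x, dtype=float)
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W @ y + b                        # weighted sum plus bias
        last = (k == len(weights) - 1)
        y = z if last else hidden_act(z)     # linear output layer
    return y

# a hypothetical 3-20-20-1 network with random parameters
rng = np.random.default_rng(0)
sizes = [3, 20, 20, 1]
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.standard_normal(sizes[i + 1]) for i in range(3)]
energy = forward(np.array([0.5, -0.2, 1.0]), Ws, bs)
```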

For our purpose here, the input layer often consists of molecular coordinates while the output is the potential energy. For a specific chemical system of interest, one has to incorporate the intrinsic permutation symmetry (i.e., the equivalence of geometries with identical atoms exchanged) in the NN potentials, which is essential to accurately describe nuclear dynamics. To this end, we recently developed a simple and rigorous permutation invariant polynomial neural network (PIP-NN) approach [4, 34, 49, 50], in which the input layer is made of polynomials of functions of the internuclear distances that satisfy the permutation symmetry [51]. Since our focus here is on the training algorithm, which is universal and independent of the input vector, details of the construction of PIPs are not given here and can be found in Ref.[4].

B. Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm is a powerful approach for solving general nonlinear least squares problems [52, 53]. It balances the second order Gauss-Newton method and the simple steepest descent method via a variable factor, thus taking advantage of both, i.e., the high efficiency of the former and the stable convergence of the latter.

The goal of training NNs is to minimize the performance function, i.e., the sum of squared errors between the target potential energies and the NN outputs,

 $\mathbf{D}(\mathbf{x})=\sum\limits_{i=1}^{N_{\rm{t}}}(e_i)^2 =\sum\limits_{i=1}^{N_{\rm{t}}}(E_i-t_i)^2$ (2)

where $t_i$ and $E_i$ are the target and output for the $i$th data point and $N_{\rm{t}}$ is the total number of data points. For this nonlinear least squares problem, Newton's method for minimizing $\bf{D}(\bf{x})$ with respect to the parameter vector $\bf{x}$ can be expressed as [44],

 ${\boldsymbol \Delta \bf{x}}=-\left[\boldsymbol\nabla^2\bf{D}(\bf{x})\right]^{-1}\boldsymbol\nabla \bf{D}(\bf{x})$ (3)

where $\bf{x}$ is a collection of all weights and biases which is updated after each iteration by $\bf{x}$=$\bf{x}$+$\boldsymbol\Delta \bf{x}$. The dimension of $\bf{x}$ can be calculated by

 $l=\sum\limits_{i=1}^{m+1}(N_{i-1}+1)\times N_i$ (4)

where $N_i$ is the number of neurons in the $i$th layer ($i$=0 corresponds to the input layer) and $m$ is the number of hidden layers. The Hessian matrix $\boldsymbol\nabla^2\bf{D}(\bf{x})$ and the gradient $\boldsymbol\nabla\bf{D}(\bf{x})$ can be written in the form,
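Eq. (4) is easy to verify numerically; the snippet below (an illustrative helper, with `n_parameters` a hypothetical name) counts the weights and biases for the two architectures used later in this work:

```python
def n_parameters(layer_sizes):
    """Number of weights and biases l (Eq. (4)) for layer sizes N_0..N_{m+1}."""
    return sum((layer_sizes[i - 1] + 1) * layer_sizes[i]
               for i in range(1, len(layer_sizes)))

# 3-20-20-1 (H+H2): (3+1)*20 + (20+1)*20 + (20+1)*1 = 521
print(n_parameters([3, 20, 20, 1]))
# 323-12-40-1 (CH4+Ni(111)): (323+1)*12 + (12+1)*40 + (40+1)*1 = 4449
print(n_parameters([323, 12, 40, 1]))
```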

 $\boldsymbol\nabla\bf{D}(\bf{x})=\bf{J}^\mathrm{T}(\bf{x})\bf{e}(\bf{x})$ (5)
 $\boldsymbol\nabla^2\bf{D}(\bf{x})=\bf{J}^\mathrm{T}(\bf{x})\bf{J}(\bf{x})+\bf{S}(\bf{x})$ (6)

with $\bf{J}(\bf{x})$ being the Jacobian matrix

 \begin{align} {\bf{J}}\left( {\bf{x}} \right) &=& \left[{\begin{array}{*{20}{c}} {\displaystyle\frac{{\partial {e_1}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_1}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_1}}}{{\partial {x_l}}}}\\ {\displaystyle\frac{{\partial {e_2}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_2}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_2}}}{{\partial {x_l}}}}\\ \vdots&\vdots&\ddots&\vdots \\ {\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_1}}}}&{\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_2}}}}& \cdots &{\displaystyle\frac{{\partial {e_{{N_{\rm{t}}}}}}}{{\partial {x_l}}}} \end{array}} \right] \end{align} (7)
 ${\bf{S}}\left( {\bf{x}} \right) = \sum\limits_{i = 1}^{{N_{\rm{t}}}} {{e_i}({\bf{x}})} {\nabla ^2}{e_i}({\bf{x}})$ (8)

While the Gauss-Newton algorithm further assumes $\bf{S}(\bf{x})$ to be zero, in the LM algorithm $\bf{S}(\bf{x})$ is approximated by $\mu\bf{I}$, where $\bf{I}$ is the unit matrix. The update step for the parameter vector $\bf{x}$ then becomes [44],

 $\Delta {\bf{x}} = -{\left[{{{\bf{J}}^{\rm{T}}}\left( {\bf{x}} \right){\bf{J}}\left( {\bf{x}} \right) + \mu {\bf{I}}} \right]^{ - 1}}{{\bf{J}}^{\rm{T}}}\left( {\bf{x}} \right){\bf{e}}\left( {\bf{x}} \right)$ (9)

where $\mu$ is a dynamically changing factor that controls the training procedure. Whenever the last step leads to an increase in $\bf{D}(\bf{x})$, $\mu$ is multiplied by a factor $\alpha$; otherwise $\mu$ is divided by $\alpha$, where $\alpha$ is often set to 10. It is immediately seen that the LM algorithm becomes the steepest descent algorithm when $\mu$ is large, and reduces to the Gauss-Newton algorithm when $\mu$ approaches zero.
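The LM update of Eq. (9), together with the $\mu$-adaptation rule just described, can be sketched on a toy one-parameter least squares problem (an illustrative NumPy sketch under the stated rules, not the authors' implementation):

```python
import numpy as np

def lm_step(x, residual, jac, mu):
    """One LM update, Eq. (9): dx = -[J^T J + mu I]^{-1} J^T e."""
    J = jac(x)
    e = residual(x)
    A = J.T @ J + mu * np.eye(x.size)
    return x + np.linalg.solve(A, -J.T @ e)

def levenberg_marquardt(x, residual, jac, mu=1e-2, alpha=10.0, n_iter=50):
    """Minimal LM loop with the mu-adaptation rule described above."""
    cost = np.sum(residual(x) ** 2)
    for _ in range(n_iter):
        x_new = lm_step(x, residual, jac, mu)
        cost_new = np.sum(residual(x_new) ** 2)
        if cost_new < cost:            # accept step, move toward Gauss-Newton
            x, cost, mu = x_new, cost_new, mu / alpha
        else:                          # reject step, move toward steepest descent
            mu *= alpha
    return x

# toy fit: y = exp(a*t) sampled with a = -1.5, recovered from a = 0
t = np.linspace(0.0, 2.0, 20)
y = np.exp(-1.5 * t)
res = lambda p: np.exp(p[0] * t) - y
jac = lambda p: (t * np.exp(p[0] * t)).reshape(-1, 1)
p = levenberg_marquardt(np.array([0.0]), res, jac)
```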

C. Extreme learning machine

ELM is an emerging learning algorithm for generalized single hidden layer feed-forward neural networks [46]. Unlike in gradient based learning algorithms, the parameters in the NNs are not tuned iteratively but determined by solving a generalized linear equation. Supposing that the output layer activation function is linear, the SLFNs can be mathematically rewritten as linear equations, i.e.,

 $\sum\limits_{i = 1}^{{N_1}} {{\beta _i}f} \left( {\sum\limits_{j = 1}^{{N_0}} {w_{i,j}^1 \times y_{n,j}^0 + {b_i}} } \right) = {o_n},\;\;\;\;n = 1,2, \ldots, N_{\rm{t}}$ (10)

The task of minimizing the deviation between the target and NN output, i.e., $\min\bigg(\displaystyle\sum\limits_{n=1}^{N_{\rm{t}}}|o_n-t_n|\bigg)$, is then converted to a linear least squares problem, supposing the parameters connecting the input and hidden layers are already known. In other words, one needs to solve the generalized linear equations,

 ${\bf{H}}\boldsymbol\beta = {\bf{T}}$ (11)

where $\bf{H}$ is called the hidden layer output matrix ($N_{\rm{t}}$$\times$$N_1$), whose $i$th column is the $i$th hidden neuron's output vector, i.e.,

 \begin{align} {\bf{H}} &=& \left[{\begin{array}{*{20}{c}} {y_{1, 1}^1}&{y_{1, 2}^1}& \ldots &{y_{1, {N_1}}^1}\\ {y_{2, 1}^1}&{y_{2, 2}^1}& \ldots &{y_{2, {N_1}}^1}\\ \vdots&\vdots&\ddots&\vdots \\ {y_{N_{\rm{t}}, 1}^1}&{y_{N_{\rm{t}}, 2}^1}& \ldots &{y_{N_{\rm{t}}, {N_1}}^1} \end{array}} \right] \end{align} (12)
 \begin{align} \boldsymbol\beta &=& \left[{\begin{array}{*{20}{c}} {w_{1, 1}^2}\\ {w_{1, 2}^2}\\ \vdots \\ {w_{1, {N_1}}^2} \end{array}} \right], \quad \bf{T}= \left[{\begin{array}{*{20}{c}} {{t_1}}\\ {{t_2}}\\ \vdots \\ {{t_{N_{\rm{t}}}}} \end{array}} \right] \end{align} (13)

A least-squares solution $\boldsymbol\beta^+$ can be found by,

 $\boldsymbol\beta ^ + = \bf{H}^ + \bf{T}$ (14)

where $\bf{H}^+$ is the Moore-Penrose generalized inverse of $\bf{H}$. Huang et al. have proved in theory that SLFNs with randomly assigned hidden neurons and output weights determined in this linear way maintain the universal function approximation capability, provided there are sufficiently many neurons [46]. The scheme is particularly useful in some model problems in classification and regression thanks to its extremely fast learning rate. However, it may be far from optimal due to its random nature, especially when the target function is complicated.
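The ELM procedure of Eqs. (10)-(14) can be sketched for a toy one-dimensional target (an illustrative NumPy sketch; the data set, network size, and tanh activation are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: a smooth 1D target to be fitted by a SLFN
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)   # N_t x N_0 inputs
T = np.sin(3.0 * X[:, 0])                         # targets t_n

# step 1: randomly assign the input-to-hidden weights and biases
n_hidden = 50
W1 = rng.uniform(-1, 1, (n_hidden, X.shape[1]))
b1 = rng.uniform(-1, 1, n_hidden)

# step 2: hidden layer output matrix H (Eq. (12)), tanh activation
H = np.tanh(X @ W1.T + b1)                        # N_t x N_1

# step 3: output weights via the Moore-Penrose pseudoinverse (Eq. (14))
beta = np.linalg.pinv(H) @ T

rmse = np.sqrt(np.mean((H @ beta - T) ** 2))
```

Note that no iteration is involved: the only "training" is the single linear solve for beta, which is why ELM is so fast but leaves the random input-layer weights unoptimized.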

D. Hybrid algorithm

Given the advantages and disadvantages of both algorithms, Chen and Jin suggested combining the LM and ELM methods for training SLFNs, which worked well for model systems with just a few dozen parameters. The idea is to use ELM in the forward direction to obtain the parameters between the hidden and output layers, and LM in the backward direction to finely tune the parameters between the input and hidden layers [48]. Although ELM is designed for SLFNs, it is in principle applicable to MFNNs, since one only needs to adjust the parameters connecting the last hidden and output layers. To this end, our generalized ELM-LM algorithm is sketched in FIG. 2. Starting with all weights and biases randomly assigned, those linking the last hidden and output layers are immediately obtained by ELM. With these parameters fixed, an LM optimization is then performed to update the weights and biases between the input and the last hidden layers only. These ELM and LM steps are repeated iteratively until the RMSE reaches a given criterion. This constitutes the universal ELM-LM hybrid algorithm.

 FIG. 2 The protocol of the ELM-LM hybrid algorithm
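The protocol above can be sketched for a SLFN, alternating an ELM solve for the output weights with an LM step on the input-layer parameters (an illustrative NumPy sketch on a toy problem; the analytic Jacobian below is specific to a single tanh hidden layer and is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy 1D data
X = np.linspace(-1.0, 1.0, 200)
T = np.sin(3.0 * X)

n_h = 20
w = rng.uniform(-1, 1, n_h)       # input-to-hidden weights
b = rng.uniform(-1, 1, n_h)       # hidden biases
mu, alpha = 1e-2, 10.0

def elm_output_weights(w, b):
    """ELM step: with (w, b) fixed, solve linearly for the output weights."""
    H = np.tanh(np.outer(X, w) + b)              # N_t x N_1
    return np.linalg.pinv(H) @ T, H

beta, H = elm_output_weights(w, b)
cost = np.sum((H @ beta - T) ** 2)

for epoch in range(30):
    # LM step on the input-layer parameters, output weights held fixed
    e = H @ beta - T
    dtanh = (1.0 - H ** 2) * beta                # d e_n / d(hidden pre-activation)
    J = np.hstack([dtanh * X[:, None], dtanh])   # d e / d(w, b)
    p = np.concatenate([w, b])
    dp = np.linalg.solve(J.T @ J + mu * np.eye(p.size), -J.T @ e)
    w_new, b_new = (p + dp)[:n_h], (p + dp)[n_h:]
    # re-solve the output layer by ELM with the updated parameters
    beta_new, H_new = elm_output_weights(w_new, b_new)
    cost_new = np.sum((H_new @ beta_new - T) ** 2)
    if cost_new < cost:
        w, b, beta, H, cost, mu = w_new, b_new, beta_new, H_new, cost_new, mu / alpha
    else:
        mu *= alpha

rmse = np.sqrt(cost / X.size)
```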
Ⅲ. RESULTS AND DISCUSSION

To compare the efficiency of ELM-LM with the LM algorithm, we chose to fit PIP-NN PESs using both algorithms for two representative systems, i.e., the H+H$_2$ reaction in the gas phase and the CH$_4$ dissociative chemisorption on a rigid Ni(111) surface. The former is the simplest chemical reaction, containing only three DOFs, while the latter is one of the most extensively studied gas-surface reactions in the past two decades, including fifteen DOFs. These two systems thus correspond to trivial and nontrivial test cases, respectively. Since the purpose here is to validate the new training algorithm, no additional ab initio calculations were performed. Instead, the ab initio based Boothroyd-Keogh-Martin-Peterson (BKMP2) [54] PES was used to generate data for the H+H$_2$ reaction, and density functional theory based data computed in our recent work were used for the CH$_4$+Ni(111) system [55]. In the NN training, the Nguyen-Widrow initialization scheme [56] was used for both algorithms and the initial value of the factor $\mu$ was kept at 10$^{-2}$. The commonly used two-hidden-layer NNs were employed for both systems, although our algorithm is not limited in the number of hidden layers. The hidden layer activation function is the hyperbolic tangent, i.e., $f(x)$=$\tanh(x)$, while the output is linked to the last hidden layer with a linear function. The data were divided randomly into a training (90%) set and a validating (10%) set, and only data in the training set were used to update the weights and biases, while the validating data serve to impose early stopping when the RMSE of the validating set keeps increasing for six iterations. Otherwise, the training is terminated when the desired RMSE is achieved, e.g., 1 meV. Fifty fits with different random initializations were done for each system and the total time was used for comparing the efficiency. Only the best PESs with the smallest overall RMSEs (for both training and validating sets) were taken in the following numerical analysis, consistent with the general procedure in NN fits because of their nonlinear nature.
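The data split and early-stopping rule described above can be sketched as follows (an illustrative sketch; the class and function names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def split_data(n, train_frac=0.9):
    """Random 90/10 split of n points into training and validating indices."""
    idx = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return idx[:n_train], idx[n_train:]

class EarlyStopping:
    """Stop when the validation RMSE has not improved for `patience` epochs."""
    def __init__(self, patience=6):
        self.patience = patience
        self.best = np.inf
        self.bad_epochs = 0

    def should_stop(self, val_rmse):
        if val_rmse < self.best:
            self.best, self.bad_epochs = val_rmse, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A typical loop would call `should_stop` with the validating-set RMSE after each epoch and keep the parameters from the epoch with the best validation error.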

A. H+H$_2$ reactive system

For the H+H$_2$ reaction, following our previous work [34], the reactant channel was sampled in a cubic box defined by $R_{\rm{H1H2}}$ in [0.9, 3.0] bohr, $R_{\rm{H2H3}}$ in [4.5, 10.0] bohr, and $\theta_{\rm{H1H2H3}}$ in [0$^\circ$, 180$^\circ$]. These points also cover the product channel by symmetry. The points in the interaction region were sampled within the box defined by $R_{\rm{H1H2}}$ in [0.9, 4.0] bohr, $R_{\rm{H2H3}}$ in [0.9, 4.0] bohr, and $\theta_{\rm{H1H2H3}}$ in [0$^\circ$, 180$^\circ$]. In order to avoid unphysical behavior in regions that are not dynamically relevant, a few hundred high-energy points spanning a large configuration space were added to the dataset. Approximately 1300 points were finally collected for the H+H$_2$ system. For this simple system, three PIPs are sufficient to preserve the symmetry, and two hidden layers with 20 neurons each were used, yielding a 3-20-20-1 MFNN architecture.

As shown in Table Ⅰ, both the ELM-LM and LM algorithms produce a similar RMSE of $\sim$2 meV. The energy error distributions, as shown in FIG. 3(a), are also similarly narrow, with a maximum error of $\sim$30 meV. In addition, FIG. 4 shows two-dimensional (2D) contour plots of the collinear H+H$_2$ exchange reaction with the energy differences between the fitted PES and the original BKMP2 PES projected on top of them. Again, the two plots for the ELM-LM and LM PESs are quite similar, and the former shows a slightly more uniform error distribution over the entire configuration space. In order to further validate the NN based PESs, $J$=0 total reaction probabilities of H+H$_2$($v$=0, $j$=0)$\rightarrow$H+H$_2$ calculated by a time-dependent wave packet method are compared in FIG. 5(a). The computational details have been given elsewhere [34]. The reaction probabilities on the PESs fitted by the two algorithms are almost indistinguishable from that on the BKMP2 PES as a function of collision energy up to 2.0 eV. It is clear that both algorithms are able to provide a faithful representation of the original data in this simple reactive system.

Table Ⅰ The RMSEs (in meV) of the best NN fits using ELM-LM and LM for H+H2 and CH4+Ni(111) systems, respectively, along with the overall computational time for fifty fits
 FIG. 3 Distributions of the fitting error for (a) H+H$_2$ and (b) CH$_4$+Ni(111) systems as a function of potential energy
 FIG. 4 2D potential energy contour plots (black curves) obtained from the best (a) LM and (b) ELM-LM fits for H+H$_2$ reaction, as a function of two H$-$H bonds with enclosed angle fixed at 180$^\circ$. The colored contours correspond to the fitting errors
 FIG. 5 (a) $J$=0 total reaction probabilities obtained from the LM, ELM-LM and original BKMP2 PESs for the H+H$_2$ ($v$=0, $j$=0) reaction. (b) Dissociation probabilities of the ground state CH$_4$ on rigid Ni(111) computed on the LM and ELM-LM PESs

Let us now discuss the efficiency of the two algorithms. As listed in Table Ⅰ, the ELM-LM algorithm is much more efficient than LM, with the former taking less than one third of the training time of the latter to reach a similar level of accuracy. Here the training time is given in hours, computed in parallel via OpenMP with two Intel Xeon E5-2680 v3 processors (24 cores in total). Interestingly, FIG. 6(a) displays the reduction of the overall RMSE as a function of the iterative step (epoch in NN language), suggesting a twofold origin of this increase of efficiency. In the beginning, ELM-LM gives rise to a much lower initial RMSE, since the first ELM step acts like a more advanced initialization of the weights between the second hidden layer and the output. On the other hand, ELM-LM terminates much earlier than LM, as it quickly reaches the local minimum in the error space and triggers early stopping. Note that FIG. 6(a) is plotted on a logarithmic scale, and the decrease of both the initial fitting error and the number of training steps induced by ELM is remarkable.

 FIG. 6 The evolution of the RMSE as a function of epoch for the LM and ELM-LM algorithms for (a) the H+H$_2$ reaction and (b) the CH$_4$+Ni(111) system, on a logarithmic scale
B. CH$_4$ dissociative chemisorption on Ni(111)

Recent advances in electronic structure calculations enable one to compute up to hundreds of thousands of single-point energies in the PES construction. For example, a chemically accurate PES describing the methane dissociative chemisorption on frozen Ni(111) has been developed in our group by a PIP-NN fit with nearly two hundred thousand points [55]. These points were collected initially from ab initio molecular dynamics (AIMD) trajectories, followed by continuously adding more data points from quasi-classical trajectory (QCT) calculations with the iteratively improved PES. A geometric criterion based on the Euclidean distance between two points and an energy criterion based on the energy difference of three fits were employed to make sure the additional points were neither close to any existing datum nor collected in regions that are already well described [4].

In the present work, these data were used to test the performance of the ELM-LM and LM algorithms for the CH$_4$+Ni(111) system as a complex example. A total of 323 PIPs was adopted to guarantee the system-dependent surface periodicity and the permutation symmetry in the molecule [2], and the two-hidden-layer NNs can be denoted as 323-12-40-1. As listed in Table Ⅰ, the overall RMSEs of the NN PESs for this complicated system are quite similar with both algorithms, but generally larger than those for the H+H$_2$ reaction. On the one hand, this is expected for a much more complicated system; on the other hand, density functional theory calculations for the molecule-metal interaction are more difficult to converge. Both algorithms yield similarly uniform fitting error distributions as well, as shown in FIG. 3(b). The fitting quality is further compared in FIG. 7, where 2D contour plots are shown as a function of the height of the CH$_4$ center of mass ($Z_{\rm{com}}$) and the length of the dissociating C$-$H bond ($r_{\rm{CH}}$) at the bridge site. The two PESs match each other very well. We finally compare the dissociation probability of CH$_4$ in its ground state on Ni(111) computed by QCT with the heavily modified VENUS program [57] on the ELM-LM and LM PESs, with the results presented in FIG. 5(b). It is encouraging that the dissociation probabilities on these two PESs agree extremely well with each other over a wide range of translational energies, suggesting that both the ELM-LM and LM algorithms can fit the large number of data equally well.

 FIG. 7 Comparison of 2D contour plots on the LM (a) and ELM-LM (b) PESs describing CH$_4$ dissociation on Ni(111) over the bridge site, as a function of the height of the CH$_4$ center of mass ($Z_{\rm{com}}$) and the distance of dissociative C$-$H bond ($r_{\rm{CH}}$)

We next discuss the acceleration achieved by the ELM-LM algorithm. For this complex system, the training process is very time-consuming and the pure LM algorithm requires about half a month to finish the 50 fits. As found in Table Ⅰ, ELM-LM is $\sim$9.5 times faster than LM, reducing the time cost to less than two days. Again, it is seen from FIG. 6(b) that the ELM-LM algorithm starts with a much better initial guess with the assistance of ELM, followed by a decrease of the RMSE at a rate similar to that of pure LM. More importantly, the ELM-LM algorithm iterates with many fewer steps (506), roughly one ninth of the 4829 steps in LM. This leads to the remarkable efficiency of the ELM-LM hybridization, given the negligible time cost of ELM itself compared to LM. The improved acceleration from the simple H+H$_2$ system to the more complex CH$_4$+Ni(111) system indicates that ELM-LM is capable of handling the difficulty of fitting a very large number of data points.

Ⅳ. CONCLUDING REMARKS

To summarize, we report in this work a novel hybrid algorithm for training MFNNs, aiming to fit high-dimensional PESs with numerous data. This so-called ELM-LM algorithm combines the ELM algorithm, which linearly determines the weights linking the last hidden layer and the output, with the LM algorithm, which refines the weights elsewhere in the MFNNs. By taking advantage of both, ELM-LM is able to train MFNNs on complicated data with both high efficiency and accuracy. Our tests on the H+H$_2$ and CH$_4$+Ni(111) systems demonstrate that ELM-LM can be nearly ten times faster than LM at the same level of accuracy. The remarkable increase of efficiency is due largely to the better initialization provided by the first ELM step and the faster approach to a local minimum in the error space. The computational cost in each epoch remains essentially unchanged compared to the original LM, since ELM takes a negligible amount of time. We anticipate that the efficiency can be further boosted by using GPUs to speed up what is at present the most costly step, i.e., the matrix inversion in the LM update. It is expected that this hybrid method will be helpful in the construction of more complicated high-dimensional PESs.

Ⅴ. ACKNOWLEDGEMENTS

This work was supported by the National Key R & D Program of China (No.2017YFA0303500), the National Natural Science Foundation of China (No.91645202, No.21722306, and No.21573203), and the Fundamental Research Funds for the Central Universities of China (No.WK2060190082 and No.WK2340000078). Calculations have been done at the Supercomputing Center of USTC. Bin Jiang thanks Dr. Jun Chen at USTC for providing the source code for the Levenberg-Marquardt algorithm and for helpful discussions.

Reference
[1] J. N. Murrell, S. Carter, S. C. Farantos, P. Huxley, and A. J. C. Varandas, Molecular Potential Energy Functions. Chichester: Wiley (1984).
[2] H. J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, M. Schütz, P. Celani, W. Györffy, D. Kats, T. Korona, R. Lindh, A. Mitrushenkov, G. Rauhut, K. R. Shamasundar, T. B. Adler, R. D. Amos, A. Bernhardsson, A. Berning, D. L. Cooper, M. J. O. Deegan, A. J. Dobbyn, F. Eckert, E. Goll, C. Hampel, A. Hesselmann, G. Hetzer, T. Hrenar, G. Jansen, C. Köppl, Y. Liu, A. W. Lloyd, R. A. Mata, A. J. May, S. J. McNicholas, W. Meyer, M. E. Mura, A. Nicklaß, D. P. O'Neill, P. Palmieri, D. Peng, K. Pflüger, R. Pitzer, M. Reiher, T. Shiozaki, H. Stoll, A. J. Stone, R. Tarroni, T. Thorsteinsson, and M. Wang, see http://www.molpro.net (2015).
[3] G. Kresse and J. Furthmüller, Comp. Mater. Sci. 6, 15 (1996). DOI:10.1016/0927-0256(96)00008-0
[4] B. Jiang, J. Li, and H. Guo, Int. Rev. Phys. Chem. 35, 479 (2016). DOI:10.1080/0144235X.2016.1200347
[5] M. I. Jordan and T. M. Mitchell, Science 349, 255 (2015). DOI:10.1126/science.aaa8415
[6] J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007). DOI:10.1103/PhysRevLett.98.146401
[7] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett. 104, 136403 (2010). DOI:10.1103/PhysRevLett.104.136403
[8] J. C. Snyder, M. Rupp, K. Hansen, K. R. Müller, and K. Burke, Phys. Rev. Lett. 108, 253002 (2012). DOI:10.1103/PhysRevLett.108.253002
[9] J. Cui and R. V. Krems, Phys. Rev. Lett. 115, 073202 (2015). DOI:10.1103/PhysRevLett.115.073202
[10] A. P. Bartók and G. Csányi, Int. J. Quant. Chem. 115, 1051 (2015). DOI:10.1002/qua.24927
[11] J. Behler, J. Chem. Phys. 145, 170901 (2016). DOI:10.1063/1.4966192
[12] T. B. Blank, S. D. Brown, A. W. Calhoun, and D. J. Doren, J. Chem. Phys. 103, 4129 (1995). DOI:10.1063/1.469597
[13] J. Chen, X. Xu, and D. H. Zhang, J. Chem. Phys. 138, 154301 (2013). DOI:10.1063/1.4801658
[14] J. Chen, X. Xu, X. Xu, and D. H. Zhang, J. Chem. Phys. 138, 221104 (2013). DOI:10.1063/1.4811109
[15] J. Li and H. Guo, Phys. Chem. Chem. Phys. 16, 6753 (2014). DOI:10.1039/C4CP00241E
[16] J. Li and H. Guo, J. Chem. Phys. 143, 221103 (2015). DOI:10.1063/1.4937570
[17] D. D. Lu and J. Li, J. Chem. Phys. 145, 014303 (2016). DOI:10.1063/1.4954765
[18] J. Li, R. Dawes, and H. Guo, Phys. Chem. Chem. Phys. 18, 29825 (2016). DOI:10.1039/C6CP06232F
[19] B. Jiang and H. Guo, Phys. Chem. Chem. Phys. 16, 24704 (2014). DOI:10.1039/C4CP03761H
[20] B. Kolb, X. Luo, X. Zhou, B. Jiang, and H. Guo, J. Phys. Chem. Lett. 8, 666 (2017). DOI:10.1021/acs.jpclett.6b02994
[21] B. Jiang and H. Guo, J. Chem. Phys. 144, 091101 (2016). DOI:10.1063/1.4943002
[22] B. Jiang and H. Guo, Phys. Rev. Lett. 114, 166101 (2015). DOI:10.1103/PhysRevLett.114.166101
[23] T. H. Liu, B. N. Fu, and D. H. Zhang, Sci. China: Chem. 57, 147 (2014). DOI:10.1007/s11426-013-5005-7
[24] T. Liu, Z. Zhang, B. Fu, X. Yang, and D. H. Zhang, Chem. Sci. 7, 1840 (2016). DOI:10.1039/C5SC03689E
[25] X. J. Shen, J. Chen, Z. J. Zhang, K. J. Shao, and D. H. Zhang, J. Chem. Phys. 143, 144701 (2015). DOI:10.1063/1.4932226
[26] X. Shen, Z. Zhang, and D. H. Zhang, Phys. Chem. Chem. Phys. 17, 25499 (2015). DOI:10.1039/C5CP04229A
[27] S. Lorenz, A. Groß, and M. Scheffler, Chem. Phys. Lett. 395, 210 (2004). DOI:10.1016/j.cplett.2004.07.076
[28] K. Shakouri, J. Behler, J. Meyer, and G. J. Kroes, J. Phys. Chem. Lett. 8, 2131 (2017). DOI:10.1021/acs.jpclett.7b00784
[29] S. Kondati Natarajan, T. Morawietz, and J. Behler, Phys. Chem. Chem. Phys. 17, 8356 (2015). DOI:10.1039/C4CP04751F
[30] T. Morawietz, V. Sharma, and J. Behler, J. Chem. Phys. 136, 064103 (2012). DOI:10.1063/1.3682557
[31] J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007). DOI:10.1103/PhysRevLett.98.146401
[32] S. K. Natarajan and J. Behler, Phys. Chem. Chem. Phys. 18, 28704 (2016). DOI:10.1039/C6CP05711J
[33] J. Behler, S. Lorenz, and K. Reuter, J. Chem. Phys. 127, 014705 (2007). DOI:10.1063/1.2746232
[34] B. Jiang and H. Guo, J. Chem. Phys. 139, 054112 (2013). DOI:10.1063/1.4817187
[35] K. Shao, J. Chen, Z. Zhao, and D. H. Zhang, J. Chem. Phys. 145, 071101 (2016). DOI:10.1063/1.4961454
[36] L. M. Raff, M. Malshe, M. Hagan, D. I. Doughan, M. G. Rockley, and R. Komanduri, J. Chem. Phys. 122, 084104 (2005). DOI:10.1063/1.1850458
[37] S. Manzhos, X. G. Wang, R. Dawes, and T. Carrington, J. Phys. Chem. A 110, 5295 (2006). DOI:10.1021/jp055253z
[38] J. Behler, J. Chem. Phys. 134, 074106 (2011). DOI:10.1063/1.3553717
[39] S. Manzhos, R. Dawes, and T. Carrington, Int. J. Quant. Chem. 115, 1012 (2015). DOI:10.1002/qua.24795
[40] S. Manzhos and T. Carrington Jr., J. Chem. Phys. 125, 194105 (2006).
[41] L. M. Raff, R. Komanduri, M. Hagan, and S. T. S. Bukkapatnam, Neural Networks in Chemical Reaction Dynamics. Oxford: Oxford University Press (2012).
[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986). DOI:10.1038/323533a0
[43] C. Igel and M. Hüsken, in Proceedings of the Second International ICSC Symposium on Neural Computation, NC'2000, Berlin, Germany: ICSC Academic Press, 115 (2000).
[44] M. T. Hagan and M. B. Menhaj, IEEE Trans. Neural Networks 5, 989 (1994). DOI:10.1109/72.329697
[45] T. B. Blank and S. D. Brown, J. Chemom. 8, 391 (1994). DOI:10.1002/(ISSN)1099-128X
[46] G. B. Huang, Q. Y. Zhu, and C. K. Siew, in Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541) (2004).
[47] J. Tang, C. Deng, and G. B. Huang, IEEE Trans. Neural Netw. Learn. Syst. 27, 809 (2016). DOI:10.1109/TNNLS.2015.2424995
[48] H. Chen and F. Jin, in Advances in Neural Networks - ISNN 2006, J. Wang, Z. Yi, J. M. Zurada, B. L. Lu, and H. Yin, Eds., Berlin, Heidelberg: Springer, 509 (2006).
[49] B. Jiang and H. Guo, J. Chem. Phys. 141, 034109 (2014). DOI:10.1063/1.4887363
[50] J. Li, B. Jiang, and H. Guo, J. Chem. Phys. 139, 204103 (2013). DOI:10.1063/1.4832697
[51] Z. Xie and J. M. Bowman, J. Chem. Theo. Comp. 6, 26 (2010). DOI:10.1021/ct9004917
[52] K. Levenberg, Q. Appl. Math. 2, 164 (1944).
[53] D. W. Marquardt, J. Soc. Ind. Appl. Math. 11, 431 (1963). DOI:10.1137/0111030
[54] A. I. Boothroyd, W. J. Keogh, P. G. Martin, and M. R. Peterson, J. Chem. Phys. 104, 7139 (1996). DOI:10.1063/1.471430
[55] X. Zhou, F. Nattino, Y. Zhang, J. Chen, G. J. Kroes, H. Guo, and B. Jiang, Phys. Chem. Chem. Phys. 19, 30540 (2017). DOI:10.1039/C7CP05993K
[56] D. Nguyen and B. Widrow, in 1990 IJCNN International Joint Conference on Neural Networks, 21 (1990).
[57] X. Hu, W. L. Hase, and T. Pirraglia, J. Comp. Chem. 12, 1014 (1991). DOI:10.1002/(ISSN)1096-987X