Chinese Journal of Chemical Physics  2017, Vol. 30 Issue (3): 277-286

#### The article information

Yun-an Yan

Low-Storage Runge-Kutta Method for Simulating Time-Dependent Quantum Dynamics

Chinese Journal of Chemical Physics, 2017, 30(3): 277-286

http://dx.doi.org/10.1063/1674-0068/30/cjcp1703025

### Article history

Accepted on: March 17, 2017
Low-Storage Runge-Kutta Method for Simulating Time-Dependent Quantum Dynamics
Yun-an Yana,b
Dated: Received on March 3, 2017; Accepted on March 17, 2017
a. Guizhou Provincial Key Laboratory of Computational Nano-Material Science, Guizhou Education University, Guizhou 550018, China;
b. Beijing Computational Science Research Center, Beijing 100094, China
*Author to whom correspondence should be addressed. Yun-an Yan, E-mail:yunan@gznc.edu.cn
Abstract: A wide range of quantum systems are time-invariant and the corresponding dynamics is dictated by linear differential equations with constant coefficients.Although simple in mathematical concept, the integration of these equations is usually complicated in practice for complex systems, where both the computational time and the memory storage become limiting factors.For this reason, low-storage Runge-Kutta methods become increasingly popular for the time integration.This work suggests a series of s-stage sth-order explicit RungeKutta methods specific for autonomous linear equations, which only requires two times of the memory storage for the state vector.We also introduce a 13-stage eighth-order scheme for autonomous linear equations, which has optimized stability region and is reduced to a fifth-order method for general equations.These methods exhibit significant performance improvements over the previous general-purpose low-stage schemes.As an example, we apply the integrator to simulate the non-Markovian exciton dynamics in a 15-site linear chain consisting of perylene-bisimide derivatives.
Key words: Low-storage Runge-Kutta    Autonomous linear differential equation    Timedependent dynamics    Time-invariant Hamiltonian
Ⅰ. INTRODUCTION

The physics laws of quantum dynamics, such as the Schrös wave equation [1]and the Heisenberg matrix equation [2]for isolated systems, take the form of linear differential equations with constant coefficients.In reality, a quantum system is inevitably coupled to an environment and thus becomes open. When the coupling is weak and the memory time of the environment is short compared to the intrinsic time scale of the system, a Markovian approximation ignoring the memory time leads to a good description [3-5].Otherwise, the hierarchical approach is a powerful tool to simulate the non-Markovian dynamics by embedding the non-Markovian system in a larger, possibly infinite Markovian system [6-9].Once again, the equations of motion for the open systems take a linear differential form with constant coefficients if external driving fields are not applied.Formally, the dynamics for all of these cases are dictated by the ordinary differential equation

 $\begin{eqnarray}\dot{{\textbf{x}}} = \mathcal{M} {\textbf{x}}\end{eqnarray}$ (1)

together with an initial condition ${\textbf{x}}(t_0) ={\textbf{x}}_0$ at time $t_0$.Here, $\textbf{x}$ is a state vector of length $N$ and $\mathcal{M}$ is an $N× N$ matrix which does not depend on either the time $t$ or the state vector ${\textbf{x}}$.

The numerical integration of such an equation is simple in concept from the mathematical perspective.For systems with small to moderate dimension, one can benefit from the extensive studies on the linear differential equations [10-14].The accumulated results include the Lanczos iteration in Kryslov subspace [15-17], the Chebyshev scheme [18, 19], matrix exponential method [20-22], and the classical Runge-Kutta method [23].

In practice, however, the computations are complicated by the big size of the state vector in applying the quantum principles for complex systems.As the computer memory required to store the state vector grows exponentially with the size of the system, the length $N$ of the state vector can easily reach $10^9$ and even higher in various simulations.For large systems, the quantum dynamical simulation becomes expansive or even prohibited, which is called the curse of dimensionality [24].This curse with respect to the memory issue is enhanced by the numerical integration because the integrator itself needs additional memory to store the state vector.

For this reason, low-storage methods draw a lot of attention.The split operator method, which is a symplectic integrator not requiring additional memory location for the wavefunction, is efficient for quantum mechanical systems without space-momentum mixed terms in the Hamiltonian [25].For other systems, low-storage Runge-Kutta (LSRK) methods, which use less memory locations than the classical Runge-Kutta schemes, are useful in reducing the memory use [26-29].Actually, the LSRK scheme was proposed at the early time of computer simulations [30-32], when the memory storage was small.Later, different groups made schemes using only two memory locations for any stage number (the number of derivative evaluations in each time step of the numerical integration) [27, 29, 33].Kennedy and co-workers systematically studied the low-storage schemes and made the low-storage idea popular [34, 35].

Regardless of how powerful the high performance computing becomes and how big the memory storage is available, we always ask more to simulate bigger systems to reveal new mechanism, to seek new phenomena and to design new materials.A partial, alternative remedy for this problem is to develop efficient algorithm with not only shorter computational time but also less memory demand so that we can simulate bigger systems with available computational resources.Thus, the low-storage methods became increasingly popular during the past two decades although the available computer memory was dramatically improved.

The LSRK scheme assumes the same accuracy and robustness as the traditional explicit Runge-Kutta methods but only needs two memory locations for the state vector regardless of the order.By contrast, the classical explicit $s$-stage Runge-Kutta method normally requires $s$+1 memory locations for the state vector.But, the available LSRK methods were designed for general purpose problems, which suffers from the order barrier condition, namely, there is no $s$-stage $s$th-order explicit Runge-Kutta method for $s$$>4 and no (s+2) -stage sth-order method for s$$>$7 [36].The order barrier is severer for LSRK schemes because there are additional conditions in order to achieve the low-storage integration.For example, for general nonlinear multidimensional differential equation, we have to use five stages to get a fourth-order LSRK method [35].To have an optimal stability region, a 12-stage fourth-order LSRK scheme is also widely used [37].The larger stage number will significantly increase the computational time for the simulations of the dynamics.

Actually, we can develop more efficient LSRK schemes if we take advantage of the linearity of Eq.(1). Zingg and Chisholm [38] suggested a four-stage fourth-order LSRK method when adding an inhomogeneous time-dependent term to Eq.(1). Indeed, more efficient low-storage algorithm exists for linear differential equations with constant coefficients.

In this report, we show that $s$-stage $s$th-order LSRK method exists for Eq.(1), so that the time-dependent dynamics of time-invariant systems can be propagated very efficiently in the use of both computational time and computer memory. The developed methods are applied to the hierarchical simulations of the exciton diffusion in a 15-site linear aggregate.

In this work, we show that $s$-stage $s$-order LSRK methods exist for any positive $s$and suggest the schemes for $s$=4, 6, 8, 10 and 12.A 13-stage method with optimized stability region, which is of eighth-order for linear equation with constant coefficients and of fifth-order for general systems, is also suggested. We also presents a hierarchical simulation of the exciton diffusion in a chain of a $J$-type linear aggregate with the 12-stage 12th-order LSRK method.

Ⅱ. LOW-STORAGE RUNGE-KUTTA METHOD FOR AUTONOMOUS LINEAR A. Classical explicit Runge-Kutta methods

The Runge-Kutta methods are derivative-free schemes to find the numerical solution for the initial value problem of the ordinary differential equations

 $\begin{eqnarray}\dot{{\textbf{x}}}= \textbf{f}(t,\textbf{x}) \end{eqnarray}$ (2)
 $\begin{eqnarray} \textbf{x}(t_0) =\textbf{x}_0\end{eqnarray}$ (3)

for a given function $\textbf{f}$.A general $s$-stage Runge-Kutta method discretizes the time with a grid $\{t_0,t_1,\cdots,$ $t_n,\cdots\}$ and approximates the solution of Eq.(2) step-by-step with $s$ iterations in each step,

 $\begin{eqnarray} {\textbf{k}}_1={\textbf{f}}(t_n,{\textbf{x}}_n)\end{eqnarray}$ (4)
 $\begin{eqnarray}{\textbf{k}}_j={\textbf{f}}\Big(t_n+c_j \Delta t,{\textbf{x}}_n+\Delta t\sum_{l=1}^{j-1}a_{jl}\textbf{k}_l\Big),\hspace{0.3cm} j =2,\cdots,s \hspace{0.5cm}\end{eqnarray}$ (5)
 $\begin{eqnarray} \textbf{x}_{n+1}=\textbf{x}_n + \Delta t\sum_{j=1}^s b_j \textbf{k}_j\end{eqnarray}$ (6)

where $\Delta t=t_{n+1}-t_n$ is the size of the time step, and $\textbf{x}_n$ is an approximation to $\textbf{x}(t_n)$.Here, the coefficients $a_{jk}$, $b_j$ and $c_j$ are the elements of the Butcher tableau [23]. To have a compact presentation, we will treat $\textbf{a}$ as a matrix which only has non-zero matrix elements $a_{jk}$ in the strict lower triangular part. The consistency condition of the Runge-Kutta method imposes the constraints

 $\begin{eqnarray}c_j = \sum_{l=1}^{j-1} a_{j,l},\quad j=2,\cdots,s\end{eqnarray}$ (7)

Therefore, only $a_{jk}$ and $b_j$ are free parameters, which are chosen so that $\textbf{x}_{n+1}$ is the $p$-order Taylor expansion of $\textbf{x}(t)$ at time $t_n$,

 $\begin{eqnarray} \label{eq:taylor}\textbf{x}_{n+1} = \textbf{x}_n + \sum_{j=1}^p \frac{\Delta t^j}{j!}\frac{\textrm{d}^j}{\textrm{d}t^j} \textbf{f}(t,\textbf{x})\Big|_{\textbf{x}=\textbf{x}_n} + O(\Delta t^{p+1})\end{eqnarray}$ (8)

If this condition holds, the Runge-Kutta scheme is called a $p$th-order method.

B. Low-storage Runge-Kutta methods

Eq.(3) clearly shows that the $s$-stage explicit Runge-Kutta scheme normally requires $s+1$ memory locations for integration, which is not favorable for large systems.By putting further constraints on the coefficients$a_{jk}$ and $b_j$, we may achieve a low-storage method which requires less memory locations, even two memory locations regardless of the stage number.In the literature, there exist different schemes to achieve the two-register method [27, 29, 33].Here, we will be sticked to the formulation originally proposed by Williamson [29], which takes the following iteration in each time step

 $\begin{eqnarray} && \textrm{for} \hspace{0.5cm}j=1,\cdots,s,\nonumber\\ && \textrm{do the iteration} \nonumber \\ && \hspace{0.5cm} \Delta \textbf{y} = A[j] \Delta\textbf{y} + \Delta t\textbf{f}(t_n+c_j \Delta t,\textbf{y}) \end{eqnarray}$ (9)
 $\begin{eqnarray} \textbf{y} = \textbf{y} + B_j \Delta \textbf{y} \\ \textrm{end of the iteration} \nonumber\end{eqnarray}$ (10)

Here the register $\textbf{y}$ points to the the memory location storing the state vector $\textbf{x}$, which contains the initial condition $\textbf{x}_n$ before the iteration and directly gives$\textbf{x}_{n+1}$ after the iteration. An explicit scheme implies $A_1$=0, so there are $2s-$1 independent parameters for an $s$-stage scheme, based on which the classical Butcher tableau can be constructed [37]

 $\begin{eqnarray} a_{jk} &=& \left\{ \begin{array}{ll} A_{k+1}a_{j,k+1} + B_k,& \textrm{if~} k $\begin{eqnarray} b_{j} &=& \left\{ \begin{array}{ll} A_{j+1} b_{j+1} + B_j,& \textrm{if~} j

For the autonomous linear equation Eq.(1), all second-order partial derivatives vanish, namely,

 $\begin{eqnarray} \frac{\partial^2 \textbf{f}}{\partial t^2} = \frac{\partial^2 \textbf{f}}{\partial t \partial \textbf{x}} = \frac{\partial^2 \textbf{f}}{\partial \textbf{x}\partial \textbf{x}}=0\end{eqnarray}$ (13)

As a consequence, the order condition Eq.(8) is greatly simplified,

 $\begin{eqnarray}\begin{split} &\sum_{j=1}^s b_j = 1 \\ &\sum_{j,l=1}^sb_j [\textbf{a}^{k-1}]_{jl} c_l = \frac{1}{k!},\quad k=1,\cdots,p-1\end{split}\end{eqnarray}$ (14)

Therefore, there are $p$ constraints in order to have a $p$th-order method, which can normally be satisfied because there are $2p-$1 parameters in a $p$-stage scheme.This implies that $s$-stage $s$-order LSRK methods exist for the autonomous linear differential equations.Although there are multiple solutions of Eq.(14), particular interest is in the choice $A_{j}=-$1 for $j\geq$2.With such a choice, these algebraic equations can be conveniently and analytically solved with a Grö basis decomposition [39].The coefficients $B_j$ for the schemes with 4, 6, 8, 10 and12 stages are given in Table Ⅰ.Hereafter, the method will be named as LSRK$s-p$($q$), where$s$ refers to the number of stages, $p$ denotes the order of the method for autonomous linear equations. When it applies, $q$ means the order of the method for general differential equations.

Table Ⅰ The coefficients $B_j$ of the low-storage Runge-Kutta methods for $s$=4, 6, 8, 10 and 12 for the homogeneous linear differential equations with constant coefficients

In numerical applications, another important issue is the stability of the solver, which requires that small errors in the initial state do not drive the Runge-Kutta approximation off from the real value. The stability analysis of the solver is complicated and case-dependent for general systems.Thus a paradigm for the stability analysis is to use the one-dimensional equation with a complex coefficient $λ$

 $\begin{eqnarray} \dot{x} = λ x\end{eqnarray}$ (15)

to obtain useful guidance. The coefficient $λ$ corresponds to the eigenvalues in multi-dimensional problems.For this one-dimensional case, the explicit Runge-Kutta method approximates $x$($t_0$+$n\Delta t$) with$x_{n}=G(λ \Delta t) x_{n-1}$, where $G(z)$ is the polynomial

 $\begin{eqnarray} G(z) = 1+\sum_{j=1}^s \gamma_j z^j\end{eqnarray}$ (16)

with the coefficients $\gamma_j$ being

 $\begin{eqnarray} \gamma_j = \left\{ \begin{array}{ll} \displaystyle{\sum_{k=1}^s b_k}, & \textrm{ if } j = 1 \\ \displaystyle{\sum_{k,l=1}^s} b_k [\textbf{a}^{j-2}]_{kl} c_l,& \textrm{ if } j > 1 \end{array} \right.\end{eqnarray}$ (17)

The iteration Eqs.(4-6) is robust against small perturbations if the characteristic polynomial $G(λ\Delta t)$ fulfills the criterion$|G(λ\Delta t)|$$\leq1.The stability region of the Runge-Kutta method is then defined as the set of all points z in the complex plane satisfying |G(z)|$$\leq$1.

For schemes with $s=p$, all $\gamma_j$ are determined by the order condition Eq.(14) so that there is no freedom to manipulate the stability region. Therefore, in order to optimize the stability, we have to sacrifice the order of the method to have free parameters.

Following Ref.[37], we adopt a two-stage process to find the LSRK method with optimized stability. For a given stage number $s$and a desired order $p$, we first find $\gamma_j$ ($p$$<$$j$$\leq$$s$)to optimize the area within the contour $|G(z)|$=1.We then determine the coefficients $A_j$ and $B_j$ satisfying the order conditions with the $\gamma_j$.

In principle, we have to optimize the stability with the exact area within the contour $|G(z)|$=1. Here, we will adopt a much simpler, yet reasonable, approximation to this area, namely,

 $\begin{eqnarray} S[G] \approx \frac{x_{\textrm{max}}}{9} \sum_{j=1}^9 y_j\end{eqnarray}$ (18)

where $x_{\textrm{max}}$ is the maximum positive real number satisfying $|G(-x_{\textrm{max}})|$=1, and $y_j$ is the minimum positive number solving $|G(i y_j -j x_{\textrm{max}}/10) |$=1, with $i=\sqrt{-1}$ being the imaginary unit.

Once $\gamma_j$ are obtained, we proceed to find the LSRK parameters $A_j$ and $B_j$.In this work, we are interested in the $13$-stage $8$-order method to have balance between the order and the stability.With this choice, there are five free parameters $\gamma_j$ (9$\leq$$j$$\leq$13) to optimize the stability at the first step.The optimization is carried out by a random search with the Nelder-Mead method [40] to find the local maximum as implemented in the NLopt library [41].The optimized results for $\gamma_j$ are listed in Table Ⅱ.At the second step, there are 25 $A_j$ and $B_j$ parameters satisfying13 order conditions. In order to fix the rest 12 parameters, we require the scheme to be a fifth-order method for general differential equations.Besides the order conditions for linear equations with constant coefficients, there are 12 additional constraints

Table Ⅱ The coefficients of the 13-stage eighth-order low-storage Runge-Kutta method for autonomous linear differential equations. This method is of the fifth-order for time-variant or nonlinear differential equations.
 $\begin{eqnarray} & & \sum_{j=1}^s b_j c_j{^2} = \frac{1}{3}\end{eqnarray}$ (19)
 $\begin{eqnarray} \sum_{j=1}^s b_j c_j{^3} = \frac{1}{4} \end{eqnarray}$ (20)
 $\begin{eqnarray} \sum_{j=1}^s b_j c_j{^4} = \frac{1}{5}\end{eqnarray}$ (21)
 $\begin{eqnarray} \sum_{j,k=1}^s b_j c_j a_{jk} c_k = \frac{1}{8} \end{eqnarray}$ (22)
 $\begin{eqnarray}&& \sum_{j,k=1}^s b_j c_j{^2} a_{jk} c_k = \frac{1}{10}\end{eqnarray}$ (23)
 $\begin{eqnarray}\sum_{j,k=1}^s b_j c_j a_{jk} c_k{^2} = \frac{1}{15}\end{eqnarray}$ (24)
 $\begin{eqnarray} & \sum_{j,k=1}^s b_j c_j [\textbf{a}^2]_{jk} c_k = \frac{1}{30}\end{eqnarray}$ (25)
 $\begin{eqnarray} \sum_{j,k=1}^s b_j a_{jk} c_k{^2} = \frac{1}{12}\end{eqnarray}$ (26)
 $\begin{eqnarray} \sum_{j,k,l=1}^s b_j a_{jk} c_k a_{kl} c_l = \frac{1}{40}\end{eqnarray}$ (27)
 $\begin{eqnarray} &\sum_{j,k=1}^s b_j [\textbf{a}^2]_{jk} c_k{^2} = \frac{1}{60} \end{eqnarray}$ (28)
 $\begin{eqnarray} \sum_{j,k,l=1}^s b_j a_{jk} c_k a_{kl} c_l = \frac{1}{20}\end{eqnarray}$ (29)
 $\begin{eqnarray}\sum_{j,k=1}^s b_j a_{jk} c_k{^3} = \frac{1}{20}\end{eqnarray}$ (30)

The nonlinear Eq.(14) and Eqs.(19-30) are solved with the Levenberg-Marquardt algorithm [42, 43]as implemented in the levmar package [44]with a modified long double precision.Multiple solutions are found in determining $A_j$ and $B_j$.The results for $A_j$, $B_j$ and $c_j$with smallest sum of the absolute values of $A_j$are given in Table Ⅱ, showing 17 effective digits.

C. Stability contours

FIG. 1 presents the stability contours normalized by the inverse of the stage number for all the methods listed in Tables Ⅰ and .As the number of stages coincides the order of the methods, the stability contours of LSRK4-4, LSRK6-6, LSRK8-8, LSRK10-10 and LSRK12-12 schemes are uniquely defined and do not depend on how the parameters $A_j$ and $B_j$ are chosen.For these methods, the stability region is larger when the stage number increases.However, FIG. 1 clearly shows that the renormalized region shrinks for larger stage numbers.When the stability is the limiting factor in the numerical integration, the maximum step size is controlled by the stability region.Then, the efficiency of the un-optimized high order methods is lower because the increase of the stability region is not sufficient to compensate the cost in performing the calculations for the additional stages.In these cases, it is appreciated to sacrifice the order of the method to increase the stability region.As shown, normalized stability region of the LSRK13-8(5) scheme is significantly bigger than that of the un-optimized methods.A dominant feature is that its contour is elongated along the $x$-axis.Correspondingly, LSRK13-8(5) is suitable to simulate the over-damped motion, in which the real part of the eigenvalues is larger than the imaginary part.To have a comparison, we also plot the stability contour for the LSRK13-4(4) method suggested by Niegemann et al. [37].This method have a stability region elongated along $y$-axis and is suitable to simulate the under-damped motion, in which the imaginary part of the eigenvalues is dominant.

 FIG. 1 Stability contours for the methods suggested in Tables Ⅰ and Ⅱ. The contours are symmetric with respect to the $x$-axis, and the lower parts thus are not displayed.For comparison, also shown is the 13-stage fourth-order scheme [LSRK13-4(4)]for general differential equations proposed by Niegemann et al. [37].
D. Accuracy verification

We check the efficiency of all the methods with a symmetric matrix $\mathcal{M}$.Here, the adopted eigenvalues $λ_j$ are

 $\begin{eqnarray}λ_j \hspace{-0.15cm}&=&\hspace{-0.15cm} \frac{5 i (j-1) }{N} - \frac{4(j-1) }{N} \textrm{exp}\left[{1-\frac{4(j-1) }{N}}\right] \nonumber\\&&\quad 1\leq j \leq N\end{eqnarray}$ (31)

where $N$=256 is the dimension of the matrix $\mathcal{M}$.With this settings, $λ_1$=0 is the only zero of the eigenvalues, which implies that there is a unique stationary state of the systems.

For the simplicity of computation, we choose the discrete Fourier transform matrix $F_{jk}$=exp$[\pi i (j-1) (k-1) /N]$ as the eigen-matrix.Then the exact solution at time $t$ is calculated with discrete Fourier transform, namely,

 $\begin{eqnarray} \textbf{x}^{(\textrm{ex})}(t) = F λ(t) F^\dagger \textbf{x}(0) \end{eqnarray}$ (32)

where $∧(t)$ refers to the diagonal matrix taking the eigenvalues $\exp(-λ_j t)$, and the initial vector $\textbf{x}(0)$ at $t$=0is filled with standard normal Gaussian numbers.The errors of the numerical integration are defined as the Euclidean distance between the exact result and the Runge-Kutta approximation $\textbf{x}^{(\textrm{appr})}(t)$

 $\begin{eqnarray}\textrm{Err}(t) = \left|\sum_{j=1}^N\Big[\textbf{x}_j^{(\textrm{ex})}(t) - \textbf{x}_j^{(\textrm{appr})}(t)\Big]^2\right|^{1/2}\end{eqnarray}$ (33)

In FIG. 2, we present the error of all the methods with the step size of $2^n$×$10^{-3}$ (0$\leq$$n$$\leq$12).The results clearly show that there exists a region that the accuracy will be comparable with the machine round-off errors and they cannot be reduced further simply by decreasing the step size.There also exists an unstable region for the each integrator, when the step size is bigger than a critical value, the error increases explosively.The error corresponding to this critical step size is about one, which implies that the step size in this problem is controlled by the accuracy.

 FIG. 2 (Color online)Errors of the methods suggested in Tables Ⅰ and Ⅱ for the linear problem in Section Ⅱ D at time $t$=8.192. For comparison, also shown is the 13-stage fourth-order scheme [LSRK13-4(4)]for general differential equations proposed by Niegemann [37].Dots are calculated results and lines are guide for the eye.

Not shown in FIG. 2, we also carry out the integration with the classical fourth-order Runge-Kutta method, the Fehlberg's 4(5) method [45], the Calvo 6(5) method [46], and the Prince-Dormand 8(7) method [47].The 4th-order LSRK is identical to the classical fourth-order Runge-Kutta method in errors as well as the stability.Higher-order LSRK schemes are faster than the same order classical RK ones, but no more than a factor of two.However, a high-order LSRK ($p$$>6) scheme can be two or three times faster than the corresponding classical Runge-Kutta method which assumes the same number of stages. Provided with a pre-selected error tolerance, higher order methods normally can take a bigger step size.For example, with the error tolerance of1×10^{-5}, the step size can be as large as0.31 for LSRK12-12, but only 0.0083 for LSRK4-4.If the error tolerance deceases by two orders, the corresponding step size becomes 0.21 for LSRK12-12and 0.0026 for LSRK4-4.Roughly speaking, the speedup will be 9 to 21 times if we use a 12-stage 12-order method instead of a fourth-order Runge-Kutta scheme.Other higher s-stage s-order methods also depict an increase of3-20 times in efficiency, depending on the order and the error tolerance.With the same step size in the stable region, the stability-optimized method LSRK13-8(5) always produces smaller errors than the same order method LSRK8-8.When the additional stages of the LSRK13-8(5) scheme are taken into account, the LSRK13-8(5) scheme has the same performance as the LSRK8-8 method. Again, to have comparison, we consider the LSRK13-4(4) scheme which is designed for general differential equations.With a given step size, it systematically produces smaller errors than that of the LSRK4-4 method.However, provided the same accuracy (1×10^{-5} or 1×10^{-7}), the increase in the step size is less than three times.We thus conclude that the LSRK13-4(4) scheme is less efficient than the LSRK4-4 method. Ⅲ. APPLICATIONS TO SIMULATION OF QUANTUM COHERENCE IN BIOLOGICAL-RELEVANT SYSTEMS We will apply the high-order low-storage methods outlined in Section Ⅱ to a more realistic simulation, that is, the non-Markovian exciton diffusion in a linear aggregate.The exciton dynamics plays a crucial role in organic photovoltaic cell [48], photosynthesis [49], organic light-emitting diodes [50]and many others [51, 52].In such systems, energy transfers to the reaction center along with the exciton diffusion.The natural photosynthesis systems exhibit surprisingly high efficiency in the exciton energy transfer [49].The theoretical understanding of the design features and the underline mechanism will greatly help the artificial solar energy application [53].We are still at the very beginning to understand the process and now expect that dissipation caused by the environment is the key for efficient energy transport [49, 53]. Here, we study the noise-assisted exciton diffusion in an artificial chain consisting ofN, N'-Di(N-(2-aminoethyl)-benzamide)-1, 6, 7, 12-tetra(4-tert-butylph-enoxy)-3, 4:9, 10-tetracarboxylic acid bisimide (PBI-1).Perylene bismide systems are very flexible in tuning the geometric configuration as well as the electronic properties in a large extent upon proper substitution or different solvation [54-56].Therefore, they serve as idea test systems for theories to investigate the effects of various factors on exciton transfer and have been extensively studied in the literature.For example, Wolter and co-workers investigated the size-dependent exciton dynamics in one-dimensional PBI aggregates [57].A maximal diffusion length was estimated to be 96 nm, which is about10 times larger than that in disordered polymers [58].The photophysical properties of PBI monomer and oligomers have been calculated in Refs.[59, 60]. As the time scale of the exciton dynamics is of the order of picosecond and longer, it is comparable to that of the environment. Thus, the dynamics is essentially non-Markovian. Here, the non-Markovian dynamics is investigated with the hierarchical approach, which introduces auxiliary density matrices to characterize the effect of the bath and is one of a few available methods to study non-Markovianity [6-9, 61-64].Usually, the number of auxiliary matrices are large and the requirement for the memory storage is high at low temperatures.The low-storage scheme thus is ideal to integrate the hierarchical equations. A. Hierarchical approach for dissipative exciton dynamics In order to focus on the exciton diffusion, the exciton is supposed to be stable and its dissociation as well as the decay is not taken into account.Moreover, we only investigate the dynamics in the one-exciton manifold, that is, only the diabatic electronic state corresponding to the S_0 state is included.With this restriction, there are only N_{\textrm{agg}}one-exciton states |m\rangle (only site m is in S_1)for an N_{\textrm{agg}}-site aggregate. The Hamiltonian for the exciton diffusion under dissipation can be approximated with the system-plus-bath model,  \begin{eqnarray}\hat{H}_{\rm tot} = \hat{H}_{\textrm{ex}}+ \sum_m \hat{H}_m + \hat{H}_{\rm b}\end{eqnarray} (34) where \hat{H}_{\textrm{ex}} is the Hamiltonian of the exciton, and\hat{H}_{\rm b} is the Hamiltonian of the linear bath consisting of independent harmonic oscillators.We will investigate the homogeneous aggregate, so we only include the site-site coupling leading to single exciton transfer,  \begin{eqnarray}\hat{H}_{\textrm{ex}} = \sum_{mn} J_{mn}|m\rangle\langle n|\end{eqnarray} (35) where J_{mn} is the coupling constant. When the size of the monomer in the aggregate is big, each site may have a set of localized vibrations coupled to the exciton at this site only.Operator \hat{H}_m in Eq.(34) stands for the exciton-vibration coupling at site m and a linear form is accepted [65]  \begin{eqnarray}\hat{H}_m &=& \hat{h}_m \sum_\xi \hat{q}_{m,\xi}g_{m,\xi},\quad \hat{h}_m \equiv|m\rangle\langle m|\end{eqnarray} (36) For linear dissipation, the effect of the vibrations on the exciton dynamics is fully captured by the bath response function,  \begin{eqnarray} \alpha(t) = \frac{1}{\pi} \int^\infty_0 \textrm{d}\omega J(\omega) \left[\coth\left(\frac{\beta\hbar\omega}{2}\right) \cos(\omega t) -i\sin(\omega t )\right]\end{eqnarray} (37) which in turn is determined by the spectral density functionJ(\omega) [62, 64].Here, \beta=1/k_{\rm B}T is the inverse temperature rescaled by the Boltzmann constant k_{\rm B}.Generally, the spectral density function can be expressed in terms of the coupling constant, reduced mass, and frequency of each oscillator in the bath.But most often theoretical simulations adopt special form to model the bath.The localized vibrational modes normally have high harmonic frequencies and can be described by a damped Brownian oscillator bath model [66]  \begin{eqnarray}J(\omega) =\frac{\eta \omega{_0}^3\omegaλ} {[(\omega^2 - \omega{_0}^2) ^2 + λ^2\omega^2]}\end{eqnarray} (38) where \eta is the dissipation strength, \omega_0 the central mode frequency, and λ the cutoff.Below we will refer to the under-damped limit, where the central mode frequency \omega_0 is much larger than the cutoff λ. For this spectral density the response function can be expanded into a sum of exponentials as follows [62]  \begin{eqnarray}\alpha(t) &=& \frac{\eta\hbar\omega{_0}^3}{4\xi}\Big\{-\textrm{e}^{\Omega_1 t} \Big[\coth\Big(\frac{i}{2}\hbar\beta\Omega_1\Big) - 1\Big] \nonumber \\&&+\textrm{e}^{\Omega_2 t} \Big[\coth\Big(\frac{i}{2}\hbar\beta\Omega_2\Big) - 1\Big] \Big\} \nonumber \\ && - 2 \textrm{e}\omega_0^3λ^2\hbar \beta \sum_{k=1}^\infty \frac{\nu_k\textrm{e}^{-\nu_k t}}{(\nu_k^2 + \omega_0^2) ^2 - λ^2\nu_k^2} \nonumber \\ &\equiv& \sum_{j=1}^{N_\textrm{e}} c_j \exp(-\Omega_j t) + \alpha^{\textrm{(short)}}(t)\end{eqnarray} (39) where \Omega_{1,2}=λ/2±i \xi with \xi=\sqrt{\omega{_0}^2-{λ^2/4}}, and \Omega_{k+2}=2\pi k/(\hbar \beta) (k$$>$0) is the $k$th Matsubara frequency of the bath. In writing this equation, we have separated the short memory contribution of the response function with the smallest integer ${N_\textrm{e}}$ satisfying$\Omega_{N_\textrm{e}}$$\gg$$λ$.

With the exponential decomposition of the response functions, the hierarchical approach is efficient to propagate the reduced density matrix for the description of the non-Markovian exciton dynamics.The hierarchical approach is a scheme to attack the non-perturbative non-Markovian dynamics for the open quantum system by converting the memory of the bath into a series of auxiliary density matrices.Here we will follow the stochastic decoupling procedure of Shao and coworkers [9, 67]. Derivation based on fundamental Itô calculus can be also found somewhere else [68, 69]. The resulting equations are as follows

 $\begin{array}{l} i\hbar \frac{{\partial {{\tilde \rho }_A}}}{{\partial t}} = - i\hbar (\sum\limits_{m = 1}^{{N_{{\rm{agg}}}}} {\sum\limits_{j = 1}^{{N_{\rm{e}}}} {{A_{m,j}}} } {\Omega _j} + {\cal R}){{\tilde \rho }_A}\\ \quad + \left[{{{\hat H}_{{\rm{ex}}}},{{\tilde \rho }_A}} \right] + \sum\limits_{m = 1}^{{N_{{\rm{agg}}}}} {\sum\limits_{j = 1}^{{N_{\rm{e}}}} {\left[{{{\hat h}_m},{{\tilde \rho }_{A_{m,j}^ + }}} \right]} } \\ \quad + \sum\limits_{m = 1}^{{N_{{\rm{agg}}}}} {\sum\limits_{j = 1}^{{N_{\rm{e}}}} {{A_{m,j}}} } {\rm{(}}{\alpha _{j; + }}\left[{{{\hat h}_m},{{\tilde \rho }_{A_{m,j}^ - }}} \right] + \\ \quad {\alpha _{j; - }}\left\{ {{{\hat h}_m},{{\tilde \rho }_{A_{m,j}^ - }}} \right\}{\rm{)}} \end{array}$ (40)

$\alpha_{j;\pm}$ are defined as:

 $\begin{eqnarray}\alpha_{j;+}=\frac{c_j + c_{j^\prime}^\ast}{2},\quad \quad\alpha_{j;-}=\frac{ -i (c_j - c_{j^\prime}^\ast)}{2}\end{eqnarray}$ (41)

with $j^\prime$ being the index satisfying$\gamma_{j^\prime}^\ast=\gamma_j$.The matrix $A$ has the dimension of $N_{\textrm{agg}}$×$N_\textrm{e}$ and its matrix element $A_{m,j}$ refers to the $j$-th exponential in the response function of the bath for the $m$-th site.The operator "$+/-$'' is defined as$\left[A^\pm_{m,j}\right]_{n,k}=A_{m,j} \pm \delta_{mn}\delta_{jk}$.In Eq.(40), $\mathcal{R}$ is the Redfield superoperator which approximates the effect of the short memory part of the response function in Eq.(39) [62].

In the above equation $\tilde{\rho}_0$ is the reduced density matrix for the exciton with the initial condition$\tilde{\rho}_0(0) =\hat{\rho}(0)$.And $\tilde{\rho}_{A}$($A$$\neq0) are the auxiliary density matrices which characterize the memory effect caused by the system-bath interaction with the initial condition \tilde{\rho}_A(0) =0. Eq.(40) is an infinite-dimensional differential equation with constant coefficients.In numerical simulations, it is an unattainable yet unnecessary task to explicitly treat an infinite system.In practice, one can safely truncate the hierarchy equation to the Lth layer with a sufficiently large integer L to yield a preselected accuracy.That is, all terms \rho_A(t) with \displaystyle{\sum_{m=1}^{N_{\textrm{agg}}}}\sum_{j=1}^{N_\textrm{e}}A_{m,j}$$>$$Lare set to zero [69]. B. Results and discussions As far as the present model is concerned, we will adopt the same parameters as those in Ref.[62].Specifically, in the excitonic part we will only consider the nearest neighbor coupling and use the calculated value J_{mn}=-514 cm^{-1} [60].For the bath, we assume all of them are independent and identical and take the theoretical calculated values \omega_0=1415 cm^{-1}and \eta=0.44 in the spectral density function Eq.(32) [59].The simulations are carried out at a temperature of 300 K with an initial condition {\rho}(0) =|1\rangle\langle1|. There are two key factors for the hierarchical approach to obtain converged results.One is the number of exponentials N_\textrm{e} to approximate the bath response function. The other is the level of truncation L of the hierarchy.The number of auxiliary density matrices depends exponentially on N_\textrm{e} and L.To reduce the number of auxiliary density matrices and yet keep the accuracy, all matrices \tilde{\rho}_{A} with the factor\displaystyle{\sum_{j=1}^{N_\textrm{e}} }A_{m,j} \Omega_{j}$$>$4.8$\pi \hbar/\beta$are not included in the hierarchy.For the present model, choice with $N_\textrm{e}$=3 for all bathes and$L$=6 yields converged results for the dynamics at temperature of 300 K.

With these settings, there are totally 1, 674, 992 auxiliary density matrices in the hierarchy and it takes about 5.6 GB memory to store all complex matrices.A LSRK method with any order will need 11.2 GB memory storage for the numerical propagation.By contrast, the classical fourth-order Runge-Kutta method requires 16.8 GB space; the fifth-order Runge-Kutta-Fehlberg method needs 39.2 GB [45].

The numerical simulations are done with the Hyshe package [70], which employs a shared memory parallelization.Specifically, we use a 24-core Intel(R) Xeon(R) E5-2650v3 CPU with a clock speed of $\mbox{2.30 GHz}$ for the calculations.The efficiency is estimated as about 0.75 for this simulation.With the LSRK12-12 scheme, the dynamics can be propagated using a step size of 100 a.u. ($\sim$2.42 fs).As such, the numerical integration takes 6.5 h for 1000 steps.

The exciton population at site one is shown in FIG. 3.The population follows a steady equilibration after an initial quick drop. Note that there exist three clear spikes around 0.24, 0.47, 0.69 ps, and minor ones around 0.92, 1.2, and 1.5 ps.The peak heights decrease as time elapses, but the peak separation is almost equal in time.One may ask why this quasi-periodic phenomenon emerges?

 FIG. 3 Exciton population evolution at site one.

To answer this question, we scrutinize the exciton population at all sites and present the data in FIG. 4. The results reveal that the exciton transfers in the aggregate with a constant rate and is bounced by the two ends of the aggregate.Together with the dissipation induced by the vibrations, these two factors lead to an under-damped exciton motion.Based on this observation, the exciton hopping rate is about 123.4 ps$^{-1}$, which is significantly larger than 7.8$\pm$2.7 ps$^{-1}$, a value measured with ultrafast transient absorption spectroscopy [56].The reason of this gap is because we are using the theoretically calculated bare Coulomb coupling $J_{mn}$ of -514 cm$^{-1}$, which is larger than the experimental value (120 cm$^{-1}$) [56].A closer examination of the population at each time reveals that the exciton population spread in the aggregate and thus the exciton transfer as a wave.Of course, the environment causes dissipation and the exciton diffuses along the aggregate irreversibly.

 FIG. 4 Exciton population evolution of the aggregate.The region $\displaystyle{\left[n-{1}/{2},n+{1}/{2}\right]}$ along the $y$-axis refers to the population at Site $n$.

The wave-like motion of the exciton transfer can also be manifested by the coherence beating. FIG. 5plots the absolute value of the off-diagonal matrix elements${\rho}_{n,n+1}(t)$ (1$\leq$$n$$\leq$14) of the reduced density matrix at different times.The coherence exhibits a similar pattern as the population.As time elapses, the coherence does not vanish and approaches a constant instead, which reflects the exciton entanglement in equilibrium.

 FIG. 5 Coherence beating in the aggregate.The region $\left[n,n+1\right]$ along the $y$-axis refers to the absolute value of the off-diagonal matrix elements${\rho}_{n,n+1}(t)$ of the reduced density matrix.
Ⅳ. SUMMARY

Previously, low-storage Runge-Kutta methods were designed for integration of general nonlinear or time-variant systems, which is efficient in the use of the memory storage.To generate a low-storage scheme for general equations, we have to take additional stages because there are more restrictions for the method.However, the dynamics of many quantum systems is governed by linear differential equations with constant coefficients.For these autonomous linear equations, there are only $p$restrictions versus $2p$-1 parameters for a $p$-stage low-storage method. Thus, the order barrier does not apply for Runge-Kutta methods in solving autonomous linear equations.The system of equations for the parameters in the current scheme can be solved analytically and it is straightforward to design a new LSRK method. This allows us to obtain $p$-stage $p$th-order low-storage methods which require less stages than the classical $p$th-order Runge-Kutta schemes when $p$ is larger than four.While the general LSRK method consumes longer computational time than the same order classical Runge-Kutta scheme, the current method takes less computational time for autonomous linear equations.In this work, we have introduced a series of low-storage schemes specific for autonomous linear equations.

Numerics shows that the 12-stage 12th-order LSRK method could be an order of magnitude faster than the fourth-order Runge-Kutta method.

As an example, we have applied the suggested 12-stage12th-order method for the hierarchical simulations of the exciton diffusion in a 15-site linear aggregate, where it takes 5.6 GB memory to store the reduced and auxiliary density matrices. Thanks to the efficient low-stage scheme, we can propagate the non-Markovian dynamics up to 2.4 ps within 6.5 h with a 24-core Intel(R) Xeon(R) E5-2650v3 CPU.

Ⅴ. ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (No.21373064), the Program for Innovative Research Team of Guizhou Province (No.QKTD[2014]4021), and the Natural Scientific Foundation from Guizhou Provincial Department of Education (No.ZDXK[2014]18).All the calculations were performed at Guizhou Provincial High-Performance Computing Center of Condensed Materials and Molecular Simulation in Guizhou Education University.

Reference