
Ji Qi, Huimin Zhang, Dezun Shan, Minghui Yang. Accelerating Hartree-Fock Self-consistent Field Calculation on C86/DCU Heterogenous Computing Platform[J]. Chinese Journal of Chemical Physics , 2025, 38(1): 81-94. DOI: 10.1063/1674-0068/cjcp2403028

Accelerating Hartree-Fock Self-consistent Field Calculation on C86/DCU Heterogenous Computing Platform

More Information
  • Corresponding author:

    Minghui Yang, E-mail: yangmh@wipm.ac.cn

  • + The authors contributed equally to this work

  • Received Date: March 04, 2024
  • Accepted Date: April 09, 2024
  • Available Online: April 11, 2024
  • Issue Publish Date: February 16, 2025
  • In this study, we investigate the efficacy of a hybrid parallel algorithm aimed at accelerating the evaluation of two-electron repulsion integrals (ERIs) and the generation of the Fock matrix on the Hygon C86/DCU (deep computing unit) heterogeneous computing platform. Multiple hybrid parallel schemes are assessed using a range of model systems, including those with up to 1200 atoms and 10000 basis functions. Our results reveal that, in Hartree-Fock (HF) calculations, a single DCU achieves a speedup of up to 33.6 over 32 C86 CPU cores. Compared with the efficiency of the Wuhan Electronic Structure Package on an Intel X86/NVIDIA A100 computing platform, the Hygon platform exhibits good cost-effectiveness, showing great potential in quantum chemistry calculations and other high-performance scientific computations.

  • Heterogeneous computing is currently a major trend in high-performance scientific computing, and the use of accelerators such as graphics processing units (GPUs) significantly reduces energy consumption and improves computational efficiency. In the latest 62nd edition of the TOP500 (November 2023), heterogeneous supercomputers accounted for 37% of the list and provided 71% of the total computing power [1]. As a typical application field of high-performance scientific computing, quantum chemistry is intricately linked to the development of computer hardware. Numerous efforts have been made to leverage the powerful computing capabilities of GPUs to enhance the efficiency of energy calculations with various quantum chemical methods, including semi-empirical methods [2–4], Hartree-Fock (HF) methods [5–31], density functional theory (DFT) methods [32–46], and electron correlation methods such as Møller-Plesset perturbation theory [47–53], coupled cluster theory [54–68], configuration interaction theory [69–71], the complete active space self-consistent field method [72–76], and time-dependent density functional theory [77–79], as well as property calculations such as energy gradients [15, 80–82].

    As fundamental methods of quantum chemistry, HF and DFT self-consistent field (SCF) calculations have attracted widespread attention. The main time-consuming steps in HF and DFT calculations are the evaluation of a tremendous number of electron repulsion integrals (ERIs) and the construction of the Fock matrix from these ERIs and density matrices. Therefore, GPU applications in quantum chemistry have mainly focused on accelerating the computation of ERIs. Currently, numerous GPU-accelerated quantum chemistry packages and modules have been developed, including TeraChem [83, 84], Fermions++ [23], QUICK [14, 15, 20, 40], LibintX [85, 86], LibCChem of GAMESS-US [87], BrianQC [18], and the QSL libraries in GAMESS-UK [16].

    On the other hand, multi-core CPU technology has also made rapid progress in recent years; for example, AMD launched an EPYC CPU containing 128 cores in 2023. Unlike GPUs, which are equipped with numerous lightweight programmable parallel stream processors, CPUs are characterized by larger multi-level caches and higher clock frequencies and are capable of handling more complex problems. To fully leverage the characteristics of both CPUs and GPUs, hybrid CPU/GPU parallel algorithms for accelerating ERI calculations have emerged in recent years. In 2012, Asadchev and Gordon proposed a multi-threaded hybrid CPU/GPU approach to Hartree-Fock computation [28] that achieved up to 22.7 times faster computation relative to a single CPU core. In 2017, Kussmann and Ochsenfeld developed a hybrid CPU/GPU integral engine for exchange matrix computation [25], which provided significant performance enhancements, particularly as the CPU/GPU ratio increases.

    The Hygon computing platform is a Chinese domestic general-purpose heterogeneous computing platform. The Hygon CPU (named C86) adopts the C86 architecture, which is compatible with X86 and enables direct compilation of X86 code into executable programs. Its acceleration component, the deep computing unit (DCU), is a type of general-purpose graphics processing unit (GPGPU) and provides significant computational power through the heterogeneous-compute interface for portability (HIP) programming model. The Hygon platform has been widely applied in scientific research, including physics [88–101], chemistry [102], computer science [103–117], atmospheric science [118], engineering [119–121], and energy science [122].

    In this study, we tested our recently proposed hybrid CPU/GPU method on the Hygon platform to assess its effectiveness in quantum chemistry computations. The method is designed to accelerate ERI computation and Fock matrix generation by establishing a task queue for ERIs and implementing dynamic task scheduling across different hardware components. Previous calculations have demonstrated the method's high efficiency in HF calculations using various combinations of Intel X86 CPUs and NVIDIA GPUs, effectively leveraging the computing power available on the platform. The hybrid approach has been integrated into our in-house ab initio quantum chemistry software, the Wuhan Electronic Structure Package (WESP), with the CPU component programmed in Fortran and the DCU component in HIP C.

    The remainder of this article is organized as follows. Section II provides a brief description of the CPU/GPU hybrid method, together with the migration and optimization of the ERI calculation and Fock matrix construction of the HF SCF procedure on the Hygon computing platform. Section III presents and discusses the test results. Finally, we summarize our findings in the conclusion section.

    The SCF procedure is illustrated in FIG. 1. After the computational task, basis set, and geometry of the system are defined in the first step, the overlap matrix S, the kinetic matrix T, the nuclear attraction matrix V, and the transformation matrix X, which transforms the orbitals into an orthonormal set, are calculated only once and stored in memory. In the third step, an initial guess is obtained by methods such as the extended Hückel method [123–127] or the superposition of atomic densities (SAD) [128]. The next step is to construct the Fock matrix F from the density matrix D and the ERIs, which is the most time-consuming part of the SCF procedure and is discussed below. The fifth step is to solve the Hartree-Fock-Roothaan equation FC = SCε to obtain the molecular orbitals C and a new density matrix D; the matrix multiplications and diagonalization involved here can also be time-consuming. If convergence is not reached, the procedure returns to the fourth step; otherwise, the SCF procedure ends, optionally followed by post-HF calculations and/or evaluation of the required properties, which are outside the scope of this work.

    Figure 1. Illustration of the HF SCF procedure. Here S is the overlap matrix, T is the kinetic matrix, V is the nuclear attraction matrix, X is the transformation matrix, and D is the density matrix.
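
    To make the procedure above concrete, the following minimal C++ sketch outlines the SCF loop of FIG. 1. It is illustrative only: the Matrix type and every helper routine (one_electron_and_overlap, initial_guess, build_fock, solve_roothaan, density_change) are hypothetical stubs, not WESP's actual Fortran/HIP code.

```cpp
// Illustrative sketch of the SCF flow in FIG. 1; the Matrix type and all
// helper routines are hypothetical stubs standing in for WESP's real code.
struct Matrix {};                                                  // dense matrix placeholder

Matrix one_electron_and_overlap()                   { return {}; } // step 2: S, T, V, X (built once)
Matrix initial_guess()                              { return {}; } // step 3: extended Hueckel or SAD
Matrix build_fock(const Matrix&, const Matrix&)     { return {}; } // step 4: F from H, D and the ERIs
Matrix solve_roothaan(const Matrix&, const Matrix&) { return {}; } // step 5: FC = SCe, new density
double density_change(const Matrix&, const Matrix&) { return 0.0; }

void scf(double thresh = 1.0e-8, int max_iter = 100)
{
    Matrix H = one_electron_and_overlap();     // core Hamiltonian (S, T, V, X kept in memory)
    Matrix X = one_electron_and_overlap();     // orthogonalizing transformation
    Matrix D = initial_guess();
    for (int it = 0; it < max_iter; ++it) {
        Matrix F    = build_fock(H, D);        // step 4: the dominant cost
        Matrix Dnew = solve_roothaan(F, X);    // step 5: diagonalization, new D
        if (density_change(D, Dnew) < thresh)  // step 6: convergence check
            break;
        D = Dnew;
    }
}
```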

    The most time-consuming part of the HF calculation is Step 4, the Fock matrix construction. The elements of the Fock matrix, $F_{\mu\nu}$, are expressed as follows:

    $$F_{\mu\nu} = H_{\mu\nu} + \sum_{\lambda\sigma} D_{\lambda\sigma}\left[(\mu\nu|\lambda\sigma) - \frac{1}{2}(\mu\lambda|\nu\sigma)\right]$$
    (1)

    Here, $H_{\mu\nu}$ and $D_{\lambda\sigma}$ represent elements of the core Hamiltonian matrix and the density matrix, respectively, while $(\mu\nu|\lambda\sigma)$ denotes an ERI. Comprehensive and detailed discussions of the evaluation of ERIs can be found elsewhere [129–131]. The second most time-consuming part consists of the matrix multiplications and diagonalization in the fifth step, which are usually handled by mathematical libraries such as BLAS and LAPACK.
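
    As a hedged illustration of how Eq. (1) is "digested", the short C++ routine below adds the Coulomb and exchange contributions of a single computed ERI $(\mu\nu|\lambda\sigma)$ to the Fock matrix, ignoring the permutational symmetry and the core Hamiltonian; the function name and the row-major array layout are assumptions made for this sketch.

```cpp
// Digestion of one ERI (mu nu|lambda sigma) into the Fock matrix per Eq. (1),
// ignoring the eight-fold permutational symmetry for clarity.  F and D are
// row-major n x n arrays; illustrative sketch only.
inline void digest_eri(double* F, const double* D, int n,
                       int mu, int nu, int lam, int sig, double eri)
{
    // Coulomb term:  F(mu,nu)  += D(lam,sig) * (mu nu|lam sig)
    F[mu * n + nu]  += D[lam * n + sig] * eri;
    // Exchange term: F(mu,lam) -= 0.5 * D(nu,sig) * (mu nu|lam sig)
    F[mu * n + lam] -= 0.5 * D[nu * n + sig] * eri;
}
```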

    The CPUs and GPGPUs employed in this study are Hygon C86 7255 CPUs and Hygon DCU Z100 GPGPUs, respectively (refer to Table I for their technical specifications). The DCU software stack, known as the DCU ToolKit (DTK), offers development tools and runtime environments for high-performance computing and deep learning applications. DTK supports multiple parallel programming models, including the heterogeneous-compute interface for portability (HIP), open multi-processing (OpenMP), open accelerators (OpenACC), and, more recently, the open computing language (OpenCL). In this work, we use the HIP programming model in DTK-23.04 to accelerate the Fock matrix construction on the DCU.

    Table I. Main hardware specifications of Hygon C86 7255 and Z100.
    C86 7255                               Z100
    CPU cores      16                      Compute units (CU)              60
    Frequency      2200 MHz                Cores per CU                    128
    L2 cache       512 KB                  Boost clock                     1319 MHz
    L3 cache       8192 KB                 Double-precision performance    10.0 TFLOPS
                                           Memory bandwidth                1024 GB/s
                                           Memory                          32 GB

    In order to verify and compare the actual performance of the DCU, computational tests were conducted on both the C86 CPU and the DCU for two commonly used routines, DGEMM (matrix multiplication) and DSYEVD (matrix diagonalization), which play a significant role in SCF calculations. The Intel Math Kernel Library (MKL) is used on the CPU, while on the DCU the hipBLAS and MAGMA libraries are employed for matrix multiplication and the hipSOLVER and MAGMA libraries for matrix diagonalization. To reflect efficiency from a practical perspective, we use the speedup ratio defined as the wall time using 32 C86 CPU cores divided by the wall time using various combinations of CPU cores and DCUs.
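
    For reference, a minimal sketch of such a DGEMM test on the DCU through hipBLAS is shown below. It assumes the standard hipBLAS/HIP runtime API (hipblasCreate, hipblasDgemm, hipMalloc, hipMemcpy); the header paths and matrix size are placeholders and error checking is omitted, so the exact code used for the benchmarks in FIG. 2 may differ.

```cpp
// Minimal DGEMM test on the DCU via hipBLAS: C = alpha*A*B + beta*C for
// column-major N x N matrices.  Error checking is omitted and header paths
// may differ between DTK versions; the matrix size is a placeholder.
#include <hip/hip_runtime.h>
#include <hipblas.h>
#include <vector>

int main()
{
    const int N = 4096;
    const size_t bytes = static_cast<size_t>(N) * N * sizeof(double);
    std::vector<double> hA(static_cast<size_t>(N) * N, 1.0), hB(hA), hC(hA.size(), 0.0);

    double *dA, *dB, *dC;
    hipMalloc((void**)&dA, bytes);
    hipMalloc((void**)&dB, bytes);
    hipMalloc((void**)&dC, bytes);
    hipMemcpy(dA, hA.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(dC, hC.data(), bytes, hipMemcpyHostToDevice);

    hipblasHandle_t handle;
    hipblasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    hipblasDgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                 N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
    hipDeviceSynchronize();                       // wait before stopping the wall clock

    hipMemcpy(hC.data(), dC, bytes, hipMemcpyDeviceToHost);
    hipblasDestroy(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```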

    As shown in FIG. 2, the DGEMM calculations with the hipBLAS and MAGMA libraries on the DCU achieve average speedups of 22.3× and 18.9× over MKL on the C86, respectively. These speedups are higher than the ratio of theoretical peak performance of 17.8 (563.2 GFLOPS for two Hygon C86 7255 CPUs vs. 10.0 TFLOPS for the Z100), possibly because the Intel MKL library is not specifically optimized for the C86 architecture. For the DSYEVD calculation, both the hipSOLVER and MAGMA libraries achieve significant speedups on the DCU compared to the CPU. Although hipSOLVER is more efficient than MAGMA, an integer overflow occurs in the integer variable that stores the size of a workspace buffer needed by DSYEVD in hipSOLVER when the matrix size exceeds 7314. Therefore, the MAGMA library is used when the number of basis functions exceeds 7314; otherwise, hipSOLVER is used.
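
    The resulting selection rule can be summarized by a tiny helper like the one below; the enum and function names are placeholders for the corresponding wrapper logic, with 7314 basis functions as the switching point described above.

```cpp
// Placeholder for the solver-selection rule described above: hipSOLVER's
// DSYEVD path is used up to 7314 basis functions; beyond that, MAGMA is
// called to avoid the workspace-size integer overflow.
enum class EigenSolver { HipSolver, Magma };

inline EigenSolver pick_dsyevd_backend(int n_basis)
{
    constexpr int kHipSolverLimit = 7314;   // largest size free of the overflow
    return (n_basis <= kHipSolverLimit) ? EigenSolver::HipSolver
                                        : EigenSolver::Magma;
}
```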

    Figure 2. Wall time of DGEMM (left panel) and DSYEVD (right panel) on the C86 CPU and DCU using the MKL, MAGMA, hipBLAS, and hipSOLVER libraries.

    For the Fock matrix construction involving the calculation of ERIs on the DCU, GPU/DCU optimization techniques are crucial to fully exploit the computational capabilities that GPUs/DCUs provide. Hijma et al. presented a systematic and comprehensive overview of GPU optimization methods [132], covering the techniques used here. The following techniques are employed in the Fock matrix construction on DCUs and GPUs:

    The kernel fission technique to reduce branch divergence. In principle, the Fock matrix construction could be completed by one single kernel. However, the evaluation of ERIs differs between angular momentum classes, while the computation of ERIs with the same angular momenta is identical. Therefore, each kernel calculates all the ERIs of a single angular momentum quartet, so that all threads within such a kernel execute identical instructions without any further optimization.
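
    The sketch below illustrates the kernel-fission idea under the assumption of a templated HIP kernel per angular momentum quartet; the kernel body, data layout, and launch configuration are placeholders rather than WESP's actual DCU kernels.

```cpp
// Kernel-fission sketch: one specialized HIP kernel per angular momentum
// quartet (tagged by template parameters), so every thread in a launch
// follows the same instruction path.  The ERI evaluation itself is elided
// and all names are illustrative.
#include <hip/hip_runtime.h>

template <int LA, int LB, int LC, int LD>
__global__ void eri_kernel(const double* shell_pair_data, double* fock, int n_quartets)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per shell quartet
    if (q >= n_quartets) return;
    // ... evaluate and digest all (LA LB|LC LD) integrals of quartet q ...
    (void)shell_pair_data; (void)fock;
}

void launch_all_classes(const double* sp, double* fock, const int* n_quartets)
{
    const dim3 block(256);
    // One launch per class instead of a single branching "do everything" kernel:
    eri_kernel<0, 0, 0, 0><<<dim3((n_quartets[0] + 255) / 256), block>>>(sp, fock, n_quartets[0]); // (ss|ss)
    eri_kernel<0, 0, 0, 1><<<dim3((n_quartets[1] + 255) / 256), block>>>(sp, fock, n_quartets[1]); // (ss|sp)
    // ... one specialized instantiation per remaining quartet class ...
}
```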

    The thread/data remapping technique to reduce branch divergence. To reduce the branch divergence caused by the Schwarz screening technique, the shell pair data are sorted in descending order of the Schwarz upper bound. Consequently, the upper bounds of the ERIs are inherently in descending order, and once the upper bound falls below the ERI threshold, the thread exits the innermost loop, as sketched below. Another advantage of the sorting is the reduction of branch divergence in evaluating the Boys function, as the argument values of adjacent threads are likely to be similar.
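
    A minimal sketch of this remapping, assuming a simple ShellPair structure, is given below: the pairs are sorted by descending Schwarz bound, and the per-thread ket loop terminates at the first pair whose bound product falls below the threshold.

```cpp
// Thread/data remapping sketch: shell pairs sorted by descending Schwarz
// bound so the per-thread ket loop can stop at the first pair below the
// ERI threshold.  Types and fields are illustrative, not WESP's data layout.
#include <algorithm>
#include <vector>

struct ShellPair { int a, b; double schwarz; };   // schwarz = sqrt((ab|ab))

void sort_by_schwarz(std::vector<ShellPair>& pairs)
{
    std::sort(pairs.begin(), pairs.end(),
              [](const ShellPair& x, const ShellPair& y) { return x.schwarz > y.schwarz; });
}

// Inner loop of one thread (bra pair fixed): kets are visited in descending
// bound order, so the first failed test ends the loop for all remaining kets.
inline void loop_kets(const ShellPair& bra, const std::vector<ShellPair>& kets,
                      double eri_threshold)
{
    for (const ShellPair& ket : kets) {
        if (bra.schwarz * ket.schwarz < eri_threshold) break;  // all later kets are smaller
        // ... evaluate and digest the (bra|ket) integral batch ...
    }
}
```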

    Varying work per thread technique to achieve better parallelism-related balance. There are several ways to map ERIs to the GPU threads, such as “1T1CI”, “1B1CI” and “1T1PI/1T1B”. The “1T1PI/1T1B” mapping is chosen because it achieves better parallelism-related balance, providing abundant parallelism even for small systems.

    Recomputing technique to reduce communication and irregular access to global memories. The utilization of eight-fold symmetry can reduce the number of integrals that need to be evaluated, but it also introduces an increase in expensive inter-block communication overhead, irregular memory access, and potential memory conflicts, despite recent progress in leveraging the eight-fold symmetry in GPU algorithms [14, 15, 18, 30, 31]. To address these issues, Ufimtsev and Martínez proposed an efficient GPU algorithm that utilizes a recomputing technique, where the Coulomb and exchange matrices are constructed separately [8]. In Coulomb matrix construction, the symmetry between bra and ket is not adopted, but the symmetry within bra and ket is utilized, resulting in a four-fold permutation symmetry. Similarly, only two-fold permutation symmetry between bra and ket is exploited in exchange matrix construction. Although this algorithm recomputes the ERIs six times in total, GPU/DCU still outperforms the traditional CPU.
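
    The HIP kernel below sketches the separate Coulomb build in this spirit: one thread per bra pair, a loop over all ket pairs, and a single thread-private write to the J matrix, so no inter-block communication is required. It is a simplified illustration (basis-function pairs instead of shells, a stub ERI routine, and no special weighting of off-diagonal density elements), not the actual WESP kernel.

```cpp
// Separate Coulomb build in the spirit of the recomputing technique: one
// thread per bra pair, a loop over all ket pairs, and a single thread-private
// write to J, so no inter-block communication is needed.  Simplified sketch.
#include <hip/hip_runtime.h>

struct PairData { int mu, nu; /* plus exponents/contractions in practice */ };

__device__ double eri_value(const PairData&, const PairData&) { return 0.0; } // placeholder

__global__ void coulomb_kernel(const PairData* pairs, const double* D, double* J,
                               int n_pairs, int n_basis)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per bra pair
    if (b >= n_pairs) return;
    double acc = 0.0;
    for (int k = 0; k < n_pairs; ++k)                // all ket pairs (sorted by Schwarz bound)
        acc += D[pairs[k].mu * n_basis + pairs[k].nu] * eri_value(pairs[b], pairs[k]);
    J[pairs[b].mu * n_basis + pairs[b].nu] = acc;    // regular, thread-private write
}
```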

    Reducing register usage. The high register pressure is probably the greatest challenge for the evaluation of ERIs with high angular momenta on GPUs/DCUs, because the calculation of these ERIs requires far more registers than the hardware provides, which leads to register spilling, where thread-local variables are stored in slow off-chip global memory. However, there are several ways to reduce register usage. For example, the J-engine technique in the Coulomb matrix construction dramatically reduces the register usage as well as the number of floating-point operations [8, 133]. The exchange matrix construction is more complicated, but the use of a computer algebra system (CAS) can be helpful [10]. The algebraic expressions of the ERIs are multivariate polynomials, which are converted into computationally efficient Horner forms; note that the choice and order of principal variables in this conversion greatly affect efficiency. The register usage can often be reduced further by simplifying the Horner forms by means of common subexpression elimination (CSE). For ERIs with very high angular momenta, the kernel fission technique also helps to reduce register usage, at the expense of recomputing some variables.
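
    A toy example of the Horner/CSE idea is shown below for a small bivariate polynomial; in practice the CAS emits such forms automatically for the far larger ERI expressions.

```cpp
// Toy example of the Horner/CSE idea for a small bivariate polynomial:
// p(x, y) = 2*x^2*y + 2*x^2 + 6*x*y + 6*x + 1.
inline double poly_naive(double x, double y)
{
    return 2.0*x*x*y + 2.0*x*x + 6.0*x*y + 6.0*x + 1.0;   // many redundant products
}

inline double poly_horner_cse(double x, double y)
{
    const double t = y + 1.0;              // common subexpression, used twice
    return (2.0*t*x + 6.0*t)*x + 1.0;      // Horner form in x: fewer multiplies and temporaries
}
```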

    During direct SCF calculations involving thousands of basis functions, the evaluation and digestion of ERIs is the most time-consuming process [11]. In general, the number of low-angular-momentum electron repulsion integrals (low-ERIs) is much larger than that of high-angular-momentum integrals (high-ERIs). However, the algorithms for low-ERIs are simple and easy to implement on GPUs, whereas the algorithms for high-ERIs are more complicated, so that CPUs probably have an advantage over GPUs for them. This observation motivated our development of an effective CPU/GPU hybrid method to accelerate ERI evaluation and Fock matrix generation.

    The CPU/GPU hybrid method has been described in detail in our previous work [134]; only its key features are listed here. A task queue is generated, with each task defined as all ERIs sharing the same angular momentum quartet. The queue begins with tasks of low angular momentum quartets and ends with tasks of high angular momentum quartets. All tasks can be performed on the CPU, but for reasons of computational efficiency only the tasks involving s, p, and d orbitals are carried out on the GPU. As a result, WESP can run in "CPU-only" mode, in "GPU-only" mode, or with the hybrid method; for "GPU-only" calculations, the basis functions are limited to s, p, and d orbitals.
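
    A schematic version of such a task queue is sketched below. The EriTask structure, the enumeration of quartets, and the GPU-capability rule (here simply "all angular momenta at most d") are illustrative assumptions and do not reproduce WESP's exact canonical task list.

```cpp
// Schematic task queue: one task per angular momentum quartet, ordered by
// ascending total angular momentum, with a flag for classes that have DCU
// kernels.  Illustrative only.
#include <algorithm>
#include <vector>

struct EriTask {
    int la, lb, lc, ld;       // angular momenta of the quartet
    bool gpu_capable;         // true if a DCU kernel exists for this class
    int total() const { return la + lb + lc + ld; }
};

std::vector<EriTask> build_task_queue(int lmax, int lmax_gpu /* = 2 for s, p, d */)
{
    std::vector<EriTask> queue;
    for (int la = 0; la <= lmax; ++la)
      for (int lb = la; lb <= lmax; ++lb)
        for (int lc = 0; lc <= lmax; ++lc)
          for (int ld = lc; ld <= lmax; ++ld)
            queue.push_back({la, lb, lc, ld,
                             la <= lmax_gpu && lb <= lmax_gpu &&
                             lc <= lmax_gpu && ld <= lmax_gpu});
    std::sort(queue.begin(), queue.end(),
              [](const EriTask& x, const EriTask& y) { return x.total() < y.total(); });
    return queue;
}
```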

    The ERI calculations on the CPU exploit the eight-fold permutation symmetry of the ERIs. However, for the reasons of computational efficiency mentioned in Section II.A, only part of the permutation symmetry is utilized in the GPU algorithm, and the Coulomb and exchange matrices are calculated separately. For instance, an (sp|sd) task on the CPU corresponds to the GPU calculation of the (sp|sd) and (sd|sp) Coulomb integrals, as well as the (sp|sd), (sp|ds), (ps|sd), and (ps|ds) exchange integrals. The correspondences between CPU tasks and GPU kernel functions (GPU-J and GPU-K denote the GPU kernels for the Coulomb and exchange matrices, respectively) for calculations including s, p, d, and f orbitals are listed in Table II, whose second column gives the sum of the angular momenta of the ERIs, denoted Ltot. Because the algorithms for high-ERIs are so complicated that the CPU is probably more advantageous than the GPU, only the ERIs involving s, p, and d orbitals, together with those ERIs involving f orbitals whose sum of angular momenta is smaller than 7, are executed on the GPU. The Head-Gordon-Pople (HGP) algorithm [82] is employed to evaluate ERIs on the CPU, and an automatic code generator (ACG) is used to optimize the algorithm. Within the CPU algorithm, the Coulomb and exchange matrices are built from the same computed ERIs.

    Table  II.  The correspondences between CPU tasks and GPU kernel functions.
    Label Ltot CPU GPU-J GPU-K
    1 0 (ss|ss) (ss|ss) (ss|ss)
    2 1 (ss|sp) (ss|sp), (sp|ss) (ss|sp), (ss|ps)
    3 2 (ss|sd) (ss|sd), (sd|ss) (ss|sd), (ss|ds)
    4 2 (ss|pp) (ss|pp), (pp|ss) (ss|pp)
    5 2 (sp|sp) (sp|sp) (sp|sp), (sp|ps), (ps|ps)
    6 3 (ss|sf) (ss|sf), (sf|ss) (ss|sf), (ss|fs)
    7 3 (ss|pd) (ss|pd), (pd|ss) (ss|pd), (ss|dp)
    8 3 (sp|pp) (sp|pp), (pp|sp) (sp|pp), (ps|pp)
    9 3 (sp|sd) (sp|sd), (sd|sp) (sp|sd), (sp|ds), (ps|sd),(ps|ds)
    10 4 (ss|dd) (ss|dd), (dd|ss) (ss|dd)
    11 4 (pp|pp) (pp|pp) (pp|pp)
    12 4 (ss|pf) (ss|pf), (pf|ss) (ss|pf), (ss|fp)
    13 4 (pp|sd) (sd|pp), (pp|sd) (sd|pp), (pp|ds)
    14 4 (sd|sd) (sd|sd) (sd|sd), (sd|ds), (ds|ds)
    15 4 (sp|sf) (sp|sf), (sf|sp) (sp|sf), (sp|fs), (ps|sf), (ps|fs)
    16 4 (sp|pd) (sp|pd), (pd|sp) (sp|pd), (ps|pd), (sp|dp), (ps|dp)
    17 5 (ss|df) (ss|df), (df|ss) (ss|df), (ss|fd)
    18 5 (sp|dd) (sp|dd), (dd|sp) (sp|dd), (ps|dd)
    19 5 (pp|sf) (sf|pp), (pp|sf) (sf|pp), (pp|fs)
    20 5 (sd|sf) (sd|sf), (sf|sd) (sd|sf), (sd|fs), (ds|sf), (ds|fs)
    21 5 (pp|pd) (pp|pd), (pd|pp) (pd|pp), (pp|dp)
    22 5 (sd|pd) (pd|sd), (sd|pd) (ds|dp), (sd|dp), (sd|pd), (pd|ds)
    23 5 (sp|pf) (sp|pf), (pf|sp) (sp|pf), (ps|pf), (sp|fp), (ps|fp)
    24 6 (ss|ff) (ss|ff), (ff|ss) (ss|ff)
    25 6 (sd|dd) (sd|dd), (dd|sd) (sd|dd), (ds|dd)
    26 6 (pp|dd) (pp|dd), (dd|pp) (pp|dd)
    27 6 (sf|sf) (sf|sf) (sf|sf), (sf|fs), (fs|fs)
    28 6 (sp|df) (sp|df), (df|sp) (sp|df), (ps|df), (sp|fd), (ps|fd)
    29 6 (pd|pd) (pd|pd) (pd|pd), (pd|dp), (dp|dp)
    30 6 (sd|pf) (sd|pf), (pf|sd) (sd|pf), (ds|pf), (sd|fp), (ds|fp)
    31 6 (pp|pf) (pp|pf), (pf|pp) (pf|pp), (pp|fp)
    32 6 (sf|pd) (sf|pd), (pd|sf) (sf|pd), (fs|pd), (sf|dp), (fs|dp)
    33 7 (pd|dd) (pd|dd), (dd|pd) (pd|dd), (dp|dd)
    34 8 (dd|dd) (dd|dd) (dd|dd)
     | Show Table
    DownLoad: CSV

    The aim of the hybrid method is to leverage all available CPU and GPU computing resources and to distribute each task to the more efficient device. OpenMP is employed to parallelize ERI calculation and digestion on a compute node with multiple CPU cores and multiple GPUs. An OpenMP thread is referred to as a "GPU thread" when it is associated with a CPU core hosting a GPU, while the other threads are referred to as CPU threads. During the computation, both CPU and GPU threads dynamically grab and complete tasks from the queue. Each GPU thread grabs and completes GPU tasks in ascending order of the queue, including all Coulomb and exchange integrals, whereas CPU threads grab tasks in descending order. A CPU task is divided into batches and assigned to the CPU cores, as described in the previous section. The CPU core associated with a GPU thread also participates in the computation of the current CPU task and is assigned a batch; however, after completing its batch, this core no longer accepts additional CPU work but instead grabs and completes the next GPU task. Once all GPU tasks are completed, the GPU thread continues to execute the remaining CPU tasks as a CPU thread.
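
    The following OpenMP sketch captures this two-ended dynamic scheduling under simplifying assumptions: a single lock guards the head and tail indices so each task is claimed exactly once, the batch-level splitting of a CPU task over CPU cores is elided, and all names are placeholders rather than WESP's actual scheduler.

```cpp
// Two-ended dynamic scheduling sketch: GPU threads claim tasks from the
// low-angular-momentum end, CPU threads from the high end, each under one
// lock so every task is claimed exactly once.  Illustrative only.
#include <mutex>
#include <omp.h>
#include <vector>

struct EriTask { bool gpu_capable; /* quartet description ... */ };

void fock_build(const std::vector<EriTask>& queue, int n_gpus)
{
    std::mutex mtx;
    int head = 0, tail = static_cast<int>(queue.size()) - 1;
    auto claim = [&](bool from_front) -> int {
        std::lock_guard<std::mutex> lock(mtx);
        if (head > tail) return -1;              // queue exhausted
        return from_front ? head++ : tail--;
    };

    #pragma omp parallel
    {
        const bool gpu_thread = omp_get_thread_num() < n_gpus;
        if (gpu_thread) {
            for (int i = claim(true); i >= 0; i = claim(true)) {
                if (queue[i].gpu_capable) {
                    // ... launch the Coulomb and exchange kernels of task i on this DCU ...
                } else {
                    // ... no DCU kernel for this class: evaluate task i on the CPU ...
                }
            }
        } else {
            for (int j = claim(false); j >= 0; j = claim(false)) {
                // ... evaluate and digest task j on the CPU (split into batches in WESP) ...
            }
        }
    }
}
```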

    Using the energies calculated with Gaussian 09 [135] as the reference, we compared the computational results obtained with different devices on the Hygon platform (see Table III). The results demonstrate that the "CPU-only" calculations produce energies accurate to eleven significant digits. When the DCU is involved, in either "GPU-only" or hybrid calculations, more than eight significant digits are identical to the Gaussian 09 results, with an energy error of less than 0.01 kcal/mol. These results are well within chemical accuracy (errors less than 1 kcal/mol), indicating that the DCU-involved calculations satisfy the precision requirements of most chemical research.

    Table III. Energies calculated using Gaussian 09 on an X86 CPU (Intel Xeon CPU E5-2630 v2) and WESP on the Hygon platform (in a.u.).
    Software      Hardware        Buckyball C60           Valinomycin C54H90N6O18    Linear alkane C100H202
                                  (900 basis functions)   (1620 basis functions)     (2510 basis functions)
    Gaussian 09   12 X86 cores    −2270.27955140          −3768.44749387             −3902.08641320
    WESP          1 C86 core      −2270.27955138          −3768.44749353             −3902.08641321
    WESP          32 C86 cores    −2270.27955138          −3768.44749353             −3902.08641321
    WESP          1 DCU           −2270.27956627          −3768.44749626             −3902.08642312
    WESP          2 DCUs          −2270.27956627          −3768.44749626             −3902.08642312
    WESP          32C/1D          −2270.27956389          −3768.44749599             −3902.08642221
    WESP          32C/2D          −2270.27956603          −3768.44749618             −3902.08642224

    In order to evaluate the performance of WESP on Hygon C86 platform, we first conducted HF/def2-SVP calculations for linear alkanes containing 20–320 carbon atoms and branched alkanes containing 35–400 carbon atoms. Linear alkanes serve as loose low-dimensional model systems, while branched alkanes serve as dense three-dimensional model systems.

    Because only s, p, and d orbitals are involved in the def2-SVP basis set, the Fock matrix can be built entirely on the DCU, so the "CPU only", "GPU only", and hybrid methods can all be applied and their efficiencies compared. For the devices used in the calculations, the labels "1D", "2D", "32C/1D", and "32C/2D" denote 1 DCU, 2 DCUs, 32 C86 cores with 1 DCU, and 32 C86 cores with 2 DCUs, respectively. Because our server is equipped with only two DCUs, the test calculations were limited to at most two DCUs in parallel; in fact, the parallel utilization of GPUs by our algorithm is limited only by the number of GPUs within a computing node.

    As shown in FIG. 3, the calculations for both linear and branched alkanes generally show an increasing speedup as the system size increases. This is primarily because the CPU suffers more performance degradation than the DCU for larger systems, which can be attributed to two main factors: first, larger systems contain more ERIs to be evaluated, leading to more accesses to the density and Fock matrices and hence more cache misses; second, as the density and Fock matrices grow, the larger strides taken to access these matrices cause additional cache misses. The high memory bandwidth of the DCU (1024 GB/s) mitigates these issues, so higher speedups are achieved for larger systems.

    Figure  3.  The speedups of “GPU only” and hybrid methods over “CPU only” in the HF/def2-SVP calculations. The left and right panels represent linear alkanes (20–320 carbon atoms, 510–8010 basis functions) and branched alkanes (25–400 carbon atoms, 635–10010 basis functions), respectively.

    However, some exceptions are observed: the speedups of linear C300H602 are smaller than those of C200H402 with the 2D, 32C/1D, and 32C/2D configurations, and the speedup of branched C300H602 is slightly smaller than that of C250H502 with the 32C/2D configuration. This is because the speedup accounts for the whole SCF procedure, including both the Fock matrix build and the algebraic operations, which consist mainly of matrix multiplication and diagonalization. As mentioned in Section II.A, the matrix diagonalization library is switched from hipSOLVER to MAGMA when the number of basis functions exceeds 7314, roughly doubling the diagonalization time. Explicitly, in the calculations of linear C200H402 with 5010 basis functions and branched C250H502 with 6260 basis functions the faster hipSOLVER is used, whereas for C300H602 with 7510 basis functions MAGMA is employed, so more time is spent on matrix diagonalization while the other parts remain the same. As a result, the speedups of the whole SCF procedure decrease for C300H602 compared to the smaller systems.

    Another interesting result highlighted in FIG. 3 is the higher speedup ratio for the branched alkanes compared to the linear ones. This is mainly due to the different screening schemes used on the CPU and GPU, as well as the cache-miss problem on the CPU, as discussed thoroughly in our previous paper [134]. However, because cache-miss counts cannot be obtained on the C86 CPU as they can on X86, a detailed analysis is not presented in this work.

    However, the speedups of the hybrid CPU/DCU configurations "32C/1D" and "32C/2D" are lower than those of the corresponding "DCU only" configurations "1D" and "2D", respectively. This differs from our previous work on the Intel/NVIDIA platform, where the hybrid method enhanced the speedups relative to the "GPU only" method. To investigate this, we used a static schedule in which the number of tasks assigned to the CPU is fixed during the SCF procedure, varied this number from zero to eight, and recorded the time of each part, including the CPU shell pair construction, the GPU shell pair construction, the linear algebra, and the Fock build. As shown in FIG. 4, there is a trough at 5 and 6 tasks for linear and branched C80H162, respectively, indicating that the hybrid CPU/DCU scheme does enhance the efficiency of the Fock build. However, with our dynamic schedule, 6 tasks are assigned to the CPU for linear C80H162, which induces load imbalance and reduces the efficiency. Moreover, in the hybrid CPU/DCU method the overhead of the CPU shell pair construction is not negligible, whereas in the "DCU only" method the CPU shell pairs are unnecessary and this overhead is absent; this is the second source of the reduced efficiency. Thus, although the hybrid CPU/DCU method can accelerate the Fock build itself, the overall efficiency is reduced by the possible load imbalance of the dynamic schedule and the extra overhead of constructing the CPU shell pairs.

    Figure 4. Wall time of the Fock build in the first iteration with different numbers of tasks assigned to the CPU by the hybrid method. Left panel: linear C80H162; right panel: branched C80H162.

    For calculations involving f or higher angular momentum basis functions, not all ERIs can be evaluated on the DCU, so the hybrid CPU/DCU method can still provide significant acceleration over the "CPU only" method. Comparing FIG. 3 and FIG. 5, it is obvious that the speedups with the def2-TZVP basis set are lower than those with the def2-SVP basis set. This can be attributed to the reduction of the GPU screening threshold from $10^{-11}$ to $10^{-15}$, which is required to maintain accuracy and numerical stability in the SCF iterations when f and higher orbitals are included, as suggested by Johnson et al. [11], while the CPU screening threshold remains unchanged. Hence, the DCU is not as advantageous as in calculations involving only up to d orbitals. Furthermore, only 34 of the 55 tasks can be executed on the DCU; in the test calculations all 34 of these tasks are assigned to the DCU, which means the DCUs become idle once their tasks are completed. Nonetheless, speedups of up to 6.7 and 6.3 are obtained with 32C/1D for linear and branched alkanes, respectively, and the highest speedups with 32C/2D are 7.0 and 9.1, as shown in FIG. 5.

    Figure 5. Speedups over the CPU (32 CPU cores) in the HF/def2-TZVP calculations. The left and right panels represent linear alkanes (5–200 carbon atoms, 252–9612 basis functions) and branched alkanes (10–200 carbon atoms, 492–9612 basis functions), respectively. The calculations are performed on the same server as in FIG. 3.

    The number of ERIs formally scales as $N^4$, where N is the number of basis functions in the calculation. By utilizing the Schwarz inequality to eliminate negligible ERIs, the scaling exponent of the computational time can theoretically be reduced to 2–3. The scaling actually achieved, however, depends on the implemented algorithm and the hardware performance. The scaling behavior of WESP on the Hygon platform is investigated by fitting the average SCF iteration time against the number of basis functions for the HF/def2-SVP calculations of linear and branched alkanes. FIG. 6 shows that the scaling exponents of the "CPU-only" calculations for linear and branched alkanes are 2.05 and 2.62, respectively, slightly lower than the values of 2.15 and 2.79 obtained on the Intel CPU in our previous work [134]. For the "DCU-only" calculations, the scaling exponents with 1D are 1.55 and 2.18 [134], which are lower than the "GPU only" values of 1.86 and 2.37 [42]. This indicates the excellent compatibility between WESP and the Hygon platform.
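
    The exponents quoted above are obtained by fitting $t = aN^b$ on a log-log scale; the sketch below shows such a least-squares fit, with the (N, t) data points being placeholders rather than the measured timings.

```cpp
// Least-squares fit of t = a * N^b on a log-log scale, as used to extract the
// scaling exponents; the (N, t) values below are placeholders, not measured data.
#include <cmath>
#include <cstdio>
#include <vector>

struct PowerFit { double prefactor, exponent; };

PowerFit fit_power_law(const std::vector<double>& N, const std::vector<double>& t)
{
    const double n = static_cast<double>(N.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < N.size(); ++i) {
        const double x = std::log(N[i]), y = std::log(t[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope = exponent
    const double a = std::exp((sy - b * sx) / n);                // exp(intercept) = prefactor
    return {a, b};
}

int main()
{
    std::vector<double> N = {510, 1010, 2010, 4010, 8010};       // basis functions (placeholder)
    std::vector<double> t = {0.8, 3.0, 12.0, 48.0, 200.0};       // s per iteration (placeholder)
    const PowerFit f = fit_power_law(N, t);
    std::printf("t ~ %.3g * N^%.2f\n", f.prefactor, f.exponent);
    return 0;
}
```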

    Figure  6.  Average SCF iteration time (in s) of the HF/def2-SVP calculations. The left and right panels represent the cases of linear alkanes and branched alkanes, respectively. The calculations are performed using CPUs and DCUs as indicated in the legend. A logarithmic scale is used on both axes, and the fits demonstrate scaling with increasing system size. The pre-factors and exponents are provided in the legend.

    GAMESS-US 2020(R2) was compiled on the Hygon platform with an Intel compiler (ifort 2021.6.0), and its performance is compared with that of WESP running on different hardware; the results are summarized in Table IV. WESP outperforms GAMESS-US slightly when using the CPU for small molecules and has a clear advantage for the olestra molecule (453 atoms and 4015 basis functions), possibly owing to the better screening algorithms in WESP. Furthermore, when the DCU is utilized, WESP demonstrates significantly higher efficiency than GAMESS-US.

    Table IV. Computational time (in s) of GAMESS-US and WESP on the Hygon platform.
    Molecule       Natom; Nbfs a    GAMESS     WESP (32C)   WESP (1D)
    Caffeine       24; 260          0.76       0.68         0.44
    Cholesterol    74; 650          10.23      9.10         1.42
    Buckyball      60; 900          43.95      39.16        5.64
    Taxol          110; 1160        43.76      42.01        4.39
    Valinomycin    168; 1620        105.87     122.15       7.94
    Olestra        453; 4015        1207.38    313.14       20.45
    a Natom and Nbfs refer to the number of atoms and basis functions, respectively.

    The computational times of WESP on the NVIDIA A100 and the Hygon Z100 for one SCF iteration are compared in FIG. 7. The A100 has a theoretical peak double-precision floating-point performance of 9.7 TFLOPS, close to the 10.0 TFLOPS of the Hygon Z100. Although the two devices have similar theoretical peak performance, the average SCF time obtained on the A100 is lower than that on the Z100, especially for larger systems. This may be attributed to the fact that the A100 has a GPU memory bandwidth of 1555 GB/s, much higher than the 1024 GB/s of the Z100. However, considering the lower price of the Z100, it remains a competitive option in the high-performance computing market.

    Figure  7.  Average SCF iteration time (in s) of the HF/def2-SVP calculations. The left and right panels represent the cases of linear alkanes and branched alkanes, respectively. The calculations are performed using Z100 and A100 as indicated in the legend.

    The WESP software has been ported to the Hygon C86 platform and optimized specifically to leverage the hardware advantages, particularly the DCU, and HF calculations for various molecules have been tested.

    For HF calculations involving only s, p, and d orbitals, a single DCU achieves speedups of up to 17.5 and 33.6 over 32 CPU cores for linear and branched alkanes, respectively, indicating that the DCU is very efficient for HF calculations. However, the overall efficiency of the hybrid CPU/DCU method is reduced by the possible load imbalance and the extra overhead of constructing the CPU shell pairs. For HF calculations involving up to f orbitals, the hybrid CPU/DCU method has to be employed, since not all ERIs can be evaluated on the DCU. Because the DCU completes its tasks so quickly that it sits idle after all GPU tasks are finished, the speedups for calculations involving f orbitals are reduced to 6.3 (32 CPU cores and 1 DCU) and 9.0 (32 CPU cores and 2 DCUs), respectively. The scaling exponents of the computational time are estimated by fitting the average iteration time as a function of the number of basis functions; the exponents are lower than those obtained on the corresponding X86 CPU + NVIDIA GPU platform, suggesting that the WESP software is highly compatible with the Hygon platform. The well-known open-source quantum chemistry package GAMESS-US was compiled and tested on the Hygon CPUs. On the Hygon CPUs, the efficiency of WESP and GAMESS-US is comparable for small molecules; however, for the larger olestra molecule, WESP achieves a speedup of 3.9 over GAMESS-US, and when WESP uses the DCU, speedups of up to 59.0 are achieved.

    In future work, enhancements can be made to WESP on the Hygon platform in several ways. Firstly, for calculations involving f or higher angular momenta, the performance of WESP is reduced compared to those involving up to d orbitals because not all electron repulsion integrals (ERIs) can be evaluated on DCUs. To resolve this issue, more ERIs, especially those with high angular momenta, can be evaluated on DCUs. The DRK algorithm is highly recommended for achieving greater efficiency when evaluating these ERIs on DCUs. Secondly, additional features such as DFT and geometry optimization should be added to WESP.

    This work was supported by the National Natural Science Foundation of China (No.22373112 to Ji Qi, No.22373111 and 21921004 to Minghui Yang) and GHfund A (No.202107011790). We also acknowledge National Supercomputer Center in Kunshan and Wuhan Supercomputing Center for providing computational resources.

  • [1]
    E. Strohmaier, J. Dongarra, H. Simon, and M. Meuer. TOP500. 2023 Available at: https://www.top500.org/lists/top500/2023/11/. Accessed Dec. 14 (2023).
    [2]
    J. D. C. Maia, G. A. U. Carvalho, C. P. Jr. Mangueira, S. R. Santana, L. A. F. Cabral, and G. B. Rocha, J. Chem. Theory Comput. 8, 3072 (2012). doi: 10.1021/ct3004645
    [3]
    X. Wu, A. Koslowski, and W. Thiel, J. Chem. Theory Comput. 8, 2272 (2012). doi: 10.1021/ct3001798
    [4]
    J. D. C. Maia, L. dos Anjos Formiga Cabral, and G. B. Rocha, J. Mol. Model. 26, 313 (2020). doi: 10.1007/s00894-020-04571-6
    [5]
    K. Yasuda, J. Comput. Chem. 29, 334 (2008). doi: 10.1002/jcc.20779
    [6]
    K. Yasuda and H. Maruoka, Int. J. Quantum Chem. 114, 543 (2014). doi: 10.1002/qua.24607
    [7]
    I. S. Ufimtsev and T. J. Martinez, J. Chem. Theory Comput. 4, 222 (2008). doi: 10.1021/ct700268q
    [8]
    I. S. Ufimtsev and T. J. Martinez, J. Chem. Theory Comput. 5, 1004 (2009). doi: 10.1021/ct800526s
    [9]
    N. Luehr, I. S. Ufimtsev, and T. J. Martinez, J. Chem. Theory Comput. 7, 949 (2011). doi: 10.1021/ct100701w
    [10]
    A. V. Titov, I. S. Ufimtsev, N. Luehr, and T. J. Martinez, J. Chem. Theory Comput. 9, 213 (2013). doi: 10.1021/ct300321a
    [11]
    K. G. Johnson, S. Mirchandaney, E. Hoag, A. Heirich, A. Aiken, and T. J. Martínez, J. Chem. Theory Comput. 18, 6522 (2022). doi: 10.1021/acs.jctc.2c00414
    [12]
    A. Asadchev, V. Allada, J. Felder, B. M. Bode, M. S. Gordon, and T. L. Windus, J. Chem. Theory Comput. 6, 696 (2010). doi: 10.1021/ct9005079
    [13]
    K. A. Wilkinson, P. Sherwood, M. F. Guest, and K. J. Naidoo, J. Comput. Chem. 32, 2313 (2011). doi: 10.1002/jcc.21815
    [14]
    Y. P. Miao and K. M. Merz Jr., J. Chem. Theory Comput. 9, 965 (2013). doi: 10.1021/ct300754n
    [15]
    Y. P. Miao and K. M. Merz Jr., J. Chem. Theory Comput. 11, 1449 (2015). doi: 10.1021/ct500984t
    [16]
    K. D. Fernandes, C. A. Renison, and K. J. Naidoo, J. Comput. Chem. 36, 1399 (2015). doi: 10.1002/jcc.23936
    [17]
    Á. Rák and G. Cserey, Chem. Phys. Lett. 622, 92 (2015). doi: 10.1016/j.cplett.2015.01.023
    [18]
    G. J. Tornai, I. Ladjánszki, Á. Rák, G. Kis, and G. Cserey, J. Chem. Theory Comput. 15, 5319 (2019). doi: 10.1021/acs.jctc.9b00560
    [19]
    J. Kalinowski, F. Wennmohs, and F. Neese, J. Chem. Theory Comput. 13, 3160 (2017). doi: 10.1021/acs.jctc.7b00030
    [20]
    M. Manathunga, C. Jin, V. W. D. Cruzeiro, Y. P. Miao, D. W. Mu, K. Arumugam, K. Keipert, H. M. Aktulga, K. M. Jr. Merz, and A. W. Götz, J. Chem. Theory Comput. 17, 3955 (2021). doi: 10.1021/acs.jctc.1c00145
    [21]
    Y. Q. Tian, B. B. Suo, Y. J. Ma, and Z. Jin, J. Chem. Phys. 155, 034112 (2021). doi: 10.1063/5.0052105
    [22]
    Y. Wang, Y. Q. Tian, Z. Jin, and B. B. Suo, Acta Chim. Sin. 79, 653 (2021). doi: 10.6023/A21020044
    [23]
    J. Kussmann and C. Ochsenfeld, J. Chem. Phys. 138, 134114 (2013). doi: 10.1063/1.4796441
    [24]
    M. Beuerle, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 146, 144108 (2017). doi: 10.1063/1.4978476
    [25]
    J. Kussmann and C. Ochsenfeld, J. Chem. Theory Comput. 13, 3153 (2017). doi: 10.1021/acs.jctc.6b01166
    [26]
    J. Kussmann and C. Ochsenfeld, J. Chem. Theory Comput. 13, 2712 (2017). doi: 10.1021/acs.jctc.7b00515
    [27]
    H. Laqua, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 154, 214116 (2021). doi: 10.1063/5.0045084
    [28]
    A. Asadchev and M. S. Gordon, J. Chem. Theory Comput. 8, 4166 (2012). doi: 10.1021/ct300526w
    [29]
    G. M. J. Barca, D. L. Poole, J. L. G. Vallejo, M. Alkan, C. Bertoni, A. P. Rendell, and M. S. Gordon, International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, (2020).
    [30]
    G. M. J. Barca, J. L. Galvez-Vallejo, D. L. Poole, A. P. Rendell, and M. S. Gordon, J. Chem. Theory Comput. 16, 7232 (2020). doi: 10.1021/acs.jctc.0c00768
    [31]
    G. M. J. Barca, M. Alkan, J. L. Galvez-Vallejo, D. L. Poole, A. P. Rendell, and M. S. Gordon, J. Chem. Theory Comput. 17, 7486 (2021). doi: 10.1021/acs.jctc.1c00720
    [32]
    K. Yasuda, J. Chem. Theory Comput. 4, 1230 (2008). doi: 10.1021/ct8001046
    [33]
    L. Genovese, M. Ospici, T. Deutsch, J. F. Méhaut, A. Neelov, and S. Goedecker, J. Chem. Phys. 131, 034103 (2009). doi: 10.1063/1.3166140
    [34]
    L. E. Ratcliff, A. Degomme, J. A. Flores-Livas, S. Goedecker, and L. Genovese, J. Phys.: Condens. Matter 30, 095901 (2018). doi: 10.1088/1361-648X/aaa8c9
    [35]
    L. Wang, Y. Wu, W. L. Jia, W. G. Gao, X. B. Chi, and L. W. Wang, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, (2011).
    [36]
    W. L. Jia, Z. Y. Cao, L. Wang, J. Y. Fu, X. B. Chi, W. G. Gao, and L. W. Wang, Comput. Phys. Commun. 184, 9 (2013). doi: 10.1016/j.cpc.2012.08.002
    [37]
    Q. C. Jiang, L. Y. Wan, S. Z. Jiao, W. Hu, J. S. Chen, and H. An, 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems, Yanuca Island, (2020).
    [38]
    Z. L. Zhang, S. Z. Jiao, J. L. Li, W. T. Wu, L. Y. Wan, X. M. Qin, W. Hu, and J. L. Yang, Chin. J. Chem. Phys. 34, 552 (2021). doi: 10.1063/1674-0068/cjcp2108139
    [39]
    N. Luehr, A. Sisto, and T. J. Martínez, Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics, R. C. Walker and A. W. Götz Eds., Chichester: John Wiley & Sons, Ltd., 67 (2016).
    [40]
    M. Manathunga, Y. P. Miao, D. W. Mu, A. W. Götz, and K. M. Merz Jr., J. Chem. Theory Comput. 16, 4315 (2020). doi: 10.1021/acs.jctc.0c00290
    [41]
    D. B. Williams-Young, W. A. de Jong, H. J. J. van Dam, and C. Yang, Front. Chem. 8, 581058 (2020). doi: 10.3389/fchem.2020.581058
    [42]
    M. Hacene, A. Anciaux-Sedrakian, X. Rozanska, D. Klahr, T. Guignon, and P. Fleurat-Lessard, J. Comput. Chem. 33, 2581 (2012). doi: 10.1002/jcc.23096
    [43]
    M. Hutchinson and M. Widom, Comput. Phys. Commun. 183, 1422 (2012). doi: 10.1016/j.cpc.2012.02.017
    [44]
    F. Fathurahman, E. Alfianto, H. K. Dipojono, and M. A. Martoprawiro, Proceedings of the 3rd International Conference on Computation for Science and Technology, 168 (2015).
    [45]
    S. Hakala, V. Havu, J. Enkovaara, and R. Nieminen, Proceedings of the 11th International Conference on Applied Parallel and Scientific Computing, Helsinki, (2012).
    [46]
    X. Andrade and A. Aspuru-Guzik, J. Chem. Theory Comput. 9, 4360 (2013). doi: 10.1021/ct400520e
    [47]
    L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik, J. Phys. Chem. A 112, 2049 (2008). doi: 10.1021/jp0776762
    [48]
    R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. H. Shao, and A. Aspuru-Guzik, J. Chem. Theory Comput. 6, 135 (2010). doi: 10.1021/ct900543q
    [49]
    S. A. Maurer, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 141, 051106 (2014). doi: 10.1063/1.4891797
    [50]
    L. Á. Martínez-Martínez and C. Amador-Bedolla, J. Mexican Chem. Soc. 61, 60 (2017). doi: 10.29356/jmcs.v61i1.129
    [51]
    M. Katouda, A. Naruse, Y. Hirano, and T. Nakajima, J. Comput. Chem. 37, 2623 (2016). doi: 10.1002/jcc.24491
    [52]
    D. Bykov and T. Kjaergaard, J. Comput. Chem. 38, 228 (2017). doi: 10.1002/jcc.24678
    [53]
    T. Kjaergaard, P. Baudin, D. Bykov, J. J. Eriksen, P. Ettenhuber, K. Kristensen, J. Larkin, D. Liakh, F. Pawlowski, A. Vose, Y. M. Wang, and P. Jørgensen, Comput. Phys. Commun. 212, 152 (2017). doi: 10.1016/j.cpc.2016.11.002
    [54]
    A. E. DePrince III and J. R. Hammond, J. Chem. Theory Comput. 7, 1287 (2011). doi: 10.1021/ct100584w
    [55]
    A. E. DePrince III, M. R. Kennedy, B. G. Sumpter, and C. D. Sherrill, Mol. Phys. 112, 844 (2014). doi: 10.1080/00268976.2013.874599
    [56]
    W. J. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski, J. Chem. Theory Comput. 7, 1316 (2011). doi: 10.1021/ct1007247
    [57]
    K. Bhaskaran-Nair, W. J. Ma, S. Krishnamoorthy, O. Villa, H. J. J. van Dam, E. Aprà, and K. Kowalski, J. Chem. Theory Comput. 9, 1949 (2013). doi: 10.1021/ct301130u
    [58]
    W. J. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Cluster Comput. 16, 131 (2013). doi: 10.5555/2451462.2451482
    [59]
    J. Kim, A. Sukumaran-Rajam, C. W. Hong, A. Panyala, R. K. Srivastava, S. Krishnamoorthy, and P. Sadayappan, Proceedings of 2018 International Conference on Supercomputing, Beijing, 96 (2018).
    [60]
    T. Nelson, A. Rivera, P. Balaprakash, M. Hall, P. D. Hovland, E. Jessup, and B. Norris,44th International Conference on Parallel Processing, Beijing, (2015).
    [61]
    I. A. Kaliman and A. I. Krylov, J. Comput. Chem. 38, 842 (2017). doi: 10.1002/jcc.24713
    [62]
    B. S. Fales, E. R. Curtis, K. G. Johnson, D. Lahana, S. Seritan, Y. H. Wang, H. Weir, T. J. Martínez, and E. G. Hohenstein, J. Chem. Theory Comput. 16, 4021 (2020). doi: 10.1021/acs.jctc.0c00336
    [63]
    Z. F. Wang, M. G. Guo, and F. Wang, Phys. Chem. Chem. Phys. 22, 25103 (2020). doi: 10.1039/D0CP03800H
    [64]
    J. V. Pototschnig, A. Papadopoulos, D. I. Lyakh, M. Repisky, L. Halbert, A. S. P. Gomes, H. J. A. Jensen, and L. Visscher, J. Chem. Theory Comput. 17, 5509 (2021). doi: 10.1021/acs.jctc.1c00260
    [65]
    Z. F. Wang, B. He, Y. Z. Lu, and F. Wang, Acta Chim. Sin. 80, 1401 (2022). doi: 10.6023/A22070313
    [66]
    R. M. Parrish, Y. Zhao, E. G. Hohenstein, and T. J. Martinez, J. Chem. Phys. 150, 164118 (2019). doi: 10.1063/1.5092505
    [67]
    E. G. Hohenstein, Y. Zhao, R. M. Parrish, and T. J. Martínez, J. Chem. Phys. 151, 164121 (2019). doi: 10.1063/1.5121867
    [68]
    E. G. Hohenstein, B. S. Fales, R. M. Parrish, and T. J. Martinez, J. Chem. Phys. 156, 054102 (2022). doi: 10.1063/5.0077770
    [69]
    B. S. Fales and B. G. Levine, J. Chem. Theory Comput. 11, 4708 (2015). doi: 10.1021/acs.jctc.5b00634
    [70]
    T. P. Straatsma, R. Broer, S. Faraji, R. W. A. Havenith, L. E. A. Suarez, R. K. Kathir, M. Wibowo, and C. de Graaf, J. Chem. Phys. 152, 064111 (2020). doi: 10.1063/1.5141358
    [71]
    T. P. Straatsma, R. Broer, A. Sánchez-Mansilla, C. Sousa, and C. de Graaf, J. Chem. Theory Comput. 18, 3549 (2022). doi: 10.1021/acs.jctc.2c00266
    [72]
    E. G. Hohenstein, N. Luehr, I. S. Ufimtsev, and T. J. Martinez, J. Chem. Phys. 142, 224103 (2015). doi: 10.1063/1.4921956
    [73]
    J. W. Jr. Snyder, B. S. Fales, E. G. Hohenstein, B. G. Levine, and T. J. Martínez, J. Chem. Phys. 146, 174113 (2017). doi: 10.1063/1.4979844
    [74]
    J. W. Mullinax, E. Maradzike, L. N. Koulias, M. Mostafanejad, E. Epifanovsky, G. Gidofalvi, and A. E. DePrince III, J. Chem. Theory Comput. 15, 6164 (2019). doi: 10.1021/acs.jctc.9b00768
    [75]
    B. S. Fales and T. J. Martinez, J. Chem. Theory Comput. 16, 1586 (2020). doi: 10.1021/acs.jctc.9b01165
    [76]
    C. C. Song, J. B. Neaton, and T. J. Martinez, J. Chem. Phys. 154, 014103 (2021). doi: 10.1063/5.0035233
    [77]
    A. F. Morrison, E. Epifanovsky, and J. M. Herbert, J. Comput. Chem. 39, 2173 (2018). doi: 10.1002/jcc.25531
    [78]
    T. Yoshikawa, N. Komoto, Y. Nishimura, and H. Nakai, J. Comput. Chem. 40, 2778 (2019). doi: 10.1002/jcc.26053
    [79]
    L. D. M. Peters, J. Kussmann, and C. Ochsenfeld, J. Phys. Chem. Lett. 11, 3955 (2020). doi: 10.1021/acs.jpclett.0c00320
    [80]
    I. S. Ufimtsev and T. J. Martinez, J. Chem. Theory Comput. 5, 2619 (2009). doi: 10.1021/ct9003004
    [81]
    C. A. Renison, K. D. Fernandes, and K. J. Naidoo, J. Comput. Chem. 36, 1410 (2015). doi: 10.1002/jcc.23938
    [82]
    J. Kussmann and C. Ochsenfeld, J. Chem. Theory Comput. 11, 918 (2015). doi: 10.1021/ct501189u
    [83]
    S. Seritan, C. Bannwarth, B. S. Fales, E. G. Hohenstein, S. I. L. Kokkila-Schumacher, N. Luehr, J. W. Jr. Snyder, C. C. Song, A. V. Titov, I. S. Ufimtsev, and T. J. Martínez, J. Chem. Phys. 152, 224110 (2020). doi: 10.1063/5.0007615
    [84]
    S. Seritan, C. Bannwarth, B. S. Fales, E. G. Hohenstein, C. M. Isborn, S. I. L. Kokkila-Schumacher, X. Li, F. Liu, N. Luehr, J. W. Jr. Snyder, C. C. Song, A. V. Titov, I. S. Ufimtsev, L. P. Wang, and T. J. Martínez, Wiley Interdiscip. Rev. 11, e1494 (2021). doi: 10.1002/wcms.1494
    [85]
    A. Asadchev and E. F. Valeev, J. Chem. Theory Comput. 19, 1698 (2023). doi: 10.1021/acs.jctc.2c00995
    [86]
    A. Asadchev and E. F. Valeev, J. Phys. Chem. A 127, 10889 (2023). doi: 10.1021/acs.jpca.3c04574
    [87]
    G. M. J. Barca, C. Bertoni, L. Carrington, D. Datta, N. De Silva, J. E. Deustua, D. G. Fedorov, J. R. Gour, A. O. Gunina, E. Guidez, T. Harville, S. Irle, J. Ivanic, K. Kowalski, S. S. Leang, H. Li, W. Li, J. J. Lutz, I. Magoulas, J. Mato, V. Mironov, H. Nakata, B. Q. Pham, P. Piecuch, D. Poole, S. R. Pruitt, A. P. Rendell, L. B. Roskop, K. Ruedenberg, T. Sattasathuchana, M. W. Schmidt, J. Shen, L. Slipchenko, M. Sosonkina, V. Sundriyal, A. Tiwari, J. L. G. Vallejo, B. Westheimer, M. Wloch, P. Xu, F. Zahariev, and M. S. Gordon, J. Chem. Phys. 152, 154102 (2020). doi: 10.1063/5.0005188
    [88]
    Y. C. Wang, H. Hu, W. Tang, B. Wang, and X. H. Lin, Comput. Eng. Sci. 42, 1 (2020). doi: 10.3969/j.issn.1007-130X.2020.01.001
    [89]
    W. J. He, Y. N. Kong, K. F. He, M. L. Yang, and X. Q. Sheng, 2021 International Applied Computational Electromagnetics Society, Chengdu, (2021).
    [90]
    H. Bai, C. J. Hu, Y. H. Zhu, D. D. Chen, G. S. Chu, and S. Ren, Int. J. High Perform. Comput. Appl. 37, 516 (2023). doi: 10.1177/10943420231162831
    [91]
    Y. Y. Zhang and X. F. Zhou, Front. Energy Res. 10, 1101050 (2023). doi: 10.3389/fenrg.2022.1101050
    [92]
    J. Li, Parallel Optimization of High Performance Atomistic Simulation Algorithm for Solid Covalent Silicon, Zhengzhou: Zhengzhou University, (2023).
    [93]
    G. S. Chu, Research on Large-scale Parallel Algorithm for Atomic Scale Material Irradiation Damage Simulation, Beijing: University of Science & Technology Beijing, (2023).
    [94]
    Y. J. Yan, H. B. Li, T. Zhao, L. W. Wang, L. Shi, T. Liu, G. M. Tan, W. L. Jia, and N. H. Sun, J. Comput. Sci. Technol. 39, 45 (2024). doi: 10.1007/s11390-023-3011-6
    [95]
    Y. Q. Liu, Numerical Simulation of Droplet Motion Deformation Based on DCU Acceleration Device, Zhengzhou: Zhengzhou University, (2022).
    [96]
    G. Y. Hu, B. Zhou, B. Yang, H. B. Wang, and Z. J. Liu, Comput. Geotech. 163, 105722 (2023). doi: 10.1016/j.compgeo.2023.105722
    [97]
    H. B. Hua, Q. Q. Jin, Y. Zhang, P. Han, L. D. Sun, and L. Han, Proceedings of SPIE 12645, International Conference on Computer, Artificial Intelligence, and Control Engineering, Hangzhou, (2023).
    [98]
    R. Z. Mao, M. Q. Lin, Y. Zhang, T. H. Zhang, Z. Q. J. Xu, and Z. X. Chen, Comput. Phys. Commun. 291, 108842 (2023). doi: 10.1016/j.cpc.2023.108842
    [99]
    Y. H. Zhu, X. L. Li, T. B. Guo, H. W. Liu, and F. L. Tong, Phys. Fluids 35, 074112 (2023). doi: 10.1063/5.0154592
    [100]
    S. D. Li, Z. G. Wang, L. K. Bu, J. Wang, Z. K. Xin, S. G. Li, Y. G. Wang, Y. D. Feng, P. Shi, Y. Hu, and X. B. Chi, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, (2023).
    [101]
    Y. M. Shi, N. M. Nie, J. Wang, K. H. Lin, C. B. Zhou, S. G. Li, K. H. Yao, S. D. Li, Y. D. Feng, Y. Zeng, F. Liu, Y. G. Wang, and Y. Gao, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, (2023).
    [102]
    Z. K. Wu, H. H. Shang, Y. J. Wu, Z. C. Zhang, Y. Liu, Y. Y. Zhang, Y. C. Ouyang, H. M. Cui, and X. B. Feng, Front. Chem. 11, 1156891 (2023). doi: 10.3389/fchem.2023.1156891
    [103]
    J. M. Xie, W. F. Hu, L. Han, R. C. Zhao, and L. N. Jing, Comput. Sci. 48, 36 (2021). doi: 10.11896/jsjkx.201200023
    [104]
    J. M. Xie, Large-Scale Quantum Fourier Transform Simulation for “Songshan” Supercomputer System, Zhengzhou: Zhengzhou University, (2021).
    [105]
    L. N. Jing, Research on Large Scale Quantum Computing Simulation Technology, Zhengzhou: Zhengzhou University, (2022).
    [106]
    H. Li, L. Han, Z. Yu, and W. Wang, Comput. Sci. 49, 211200075 (2022). doi: 10.11896/jsjkx.211200075
    [107]
    H. Zhang, Parallel Research of Image Edge Detection Algorithm for DCU Platform, Zhengzhou: Zhengzhou University, (2022).
    [108]
    S. Chen, Z. M. Wu, M. H. Guo, and Z. Y. Wang, Proceedings of the 2022 5th International Conference on Image and Graphics Processing, Beijing, (2022).
    [109]
    Y. Zhang, Porting and Optimizing G-BLASTN to the Sugon Exascale Prototype, Nanjing: Nanjing University of Aeronautics and Astronautics, (2022).
    [110]
    Z. Ju, H. L. Zhang, J. T. Meng, J. J. Zhang, J. P. Fan, Y. Pan, W. G. Liu, X. L. Li, and Y. J. Wei, Future Gener. Comput. Syst. 136, 221 (2022). doi: 10.1016/j.future.2022.05.024
    [111]
    H. F. Wang, H. Zhu, and L. H. Ding, Front. Public Health 10, 1060798 (2022). doi: 10.3389/fpubh.2022.1060798
    [112]
    J. Liu, X. H. Zhou, L. Mo, S. L. Ji, Y. Liao, Z. Li, Q. Gu, and D. J. Dou, Concurr. Comput.: Pract. Exper. 35, e7697 (2023). doi: 10.1002/cpe.7697
    [113]
    C. W. Zhang and X. Yu, Adv. Ultrasound Diagn. Ther. 7, 172 (2023). doi: 10.37015/AUDT.2023.230027
    [114]
    M. Y. Geng, S. W. Wang, D. Z. Dong, H. T. Wang, G. Li, Z. Jin, X. G. Mao, and X. K. Liao, Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, (2024).
    [115]
    Y. N. Liu, F. Zhang, Z. F. Pan, X. G. Guo, Y. H. Hu, X. Zhang, and X. Y. Du, CCF Trans. High Perform. Comput. (2023). DOI: 10.1007/s42514-023-00153-z.
    [116]
    Z. Tian, S. Yang, and C. Y. Zhang, Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, Orlando, 329 (2023).
    [117]
    N. Wang, L. Wang, X. Li, and X. L. Qin, Electronics 12, 3404 (2023). doi: 10.3390/electronics12163404
    [118]
    F. Li, Y. Z. Wang, J. R. Jiang, H. Zhang, X. C. Wang, and X. B. Chi, Future Gener. Comput. Syst. 146, 166 (2023). doi: 10.1016/j.future.2023.04.021
    [119]
    Z. F. Yang, L. Han, B. Y. Li, J. M. Xie, P. Han, and Y. J. Liu, Comput. Eng. 48, 155 (2022). doi: 10.19678/j.issn.1000-3428.0063418
    [120]
    P. Han, H. B. Hua, H. Wang, and J. D. Shang, Energy 282, 128179 (2023). doi: 10.1016/j.energy.2023.128179
    [121]
    P. Han, H. B. Hua, H. Wang, F. Xue, C. M. Wu, and J. D. Shang, J. Supercomput. (2024). DOI: 10.1007/s11227-024-05996-z.
    [122]
    G. J. Zheng, W. T. Wen, H. Deng, and Y. Cai, Energies 16, 3717 (2023). doi: 10.3390/en16093717
    [123]
    R. Hoffmann, J. Chem. Phys. 39, 1397 (1963). doi: 10.1063/1.1734456
    [124]
    R. Hoffmann, J. Chem. Phys. 40, 2474 (1964). doi: 10.1063/1.1725550
    [125]
    R. Hoffmann, J. Chem. Phys. 40, 2745 (1964). doi: 10.1063/1.1725601
    [126]
    R. Hoffmann, J. Chem. Phys. 40, 2480 (1964). doi: 10.1063/1.1725551
    [127]
    R. Hoffmann, Tetrahedron 22, 521 (1966). doi: 10.1016/0040-4020(66)80020-0
    [128]
    J. H. van Lenthe, R. Zwaans, H. J. J. van Dam, and M. F. Guest, J. Comput. Chem. 27, 926 (2006). doi: 10.1002/jcc.20393
    [129]
    P. M. W. Gill, Adv. Quantum Chem. 25, 141 (1994). doi: 10.1016/S0065-3276(08)60019-2
    [130]
    T. Helgaker, P. Jørgensen, and J. Olsen, Molecular Electronic-Structure Theory, T. Helgaker, P. Jørgensen, and J. Olsen, Eds., New York: John Wiley & Sons, Ltd., 336 (2000).
    [131]
    J. Zhang, J. Chem. Theory Comput. 14, 572 (2018). doi: 10.1021/acs.jctc.7b00788
    [132]
    P. Hijma, S. Heldens, A. Sclocco, B. van Werkhoven, and H. E. Bal, ACM Comput. Surv. 55, 239 (2023). doi: 10.1145/3570638
    [133]
    G. R. Ahmadi and J. Almlöf, Chem. Phys. Lett. 246, 364 (1995). doi: 10.1016/0009-2614(95)01127-4
    [134]
    J. Qi, Y. F. Zhang, and M. H. Yang, J. Chem. Phys. 159, 104101 (2023). doi: 10.1063/5.0156934
    [135]
    M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Jr. Montgomery, J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, T. Keith, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, O. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski, and D. J. Fox, Gaussian 09, Revision B.01, Wallingford: Gaussian, Inc. (2010).