In mainstream analytical chemistry, experimental data formats have gradually shifted from one-way vectors to two-way matrices, largely owing to advances in analytical instrumentation [1-5]. Two-way data matrices contain a large amount of chemical information and therefore pose a challenge to qualitative and quantitative analyses. The first step in analyzing a two-way experimental data matrix is to determine the number of chemical species in the chemical or biological system [6, 7]. Knowing the number of chemical species, one can then examine the intermediate species involved in chemical kinetics or identify impurities [8, 9]. For example, determining the number of chemical species permitted a more complete understanding of lithium battery dynamics following structural modification. In addition, knowing the correct number of chemical species enables self-modeling curve resolution to extract pure components from two-way data matrices without prior knowledge of the mixture. For example, knowing the number of chemical species is necessary to determine the distribution of non-target chemical species in plant tissues. Determining the number of chemical species also makes it possible to identify interfacial phases of unknown polymeric materials. By comparison, complete resolutions using other analytical methods may largely depend on exhaustive iterations and expert interaction [13-16].
A variety of methods have been developed for determining the number of chemical species, and many of them are based on principal component analysis (PCA). These methods can be classified into three categories: mathematical, empirical, and statistical. The first category includes methods such as the orthogonal projection approach and least squares (OPALS), the ratio of eigenvalues calculated by smoothed principal component analysis and those calculated by ordinary principal component analysis (RESO), and noise perturbation in functional principal component analysis (NPFPCA). The second category includes methods such as the factor indicator function (IND), frequency analysis of eigenvectors (REFAE), and the morphological score (MS). The third category includes methods such as Fisher variance ratio tests (F-test) [25, 26], median absolute deviation (DRMAD), and augmentation (DRAUG).
These methods are effective in some cases, but are rarely satisfactory for all data types. The application of multiple methods will likely yield a consensus of results. Many methods are more or less data-type specific, which means that satisfactory results are limited to a few types of data. Some methods include crucial parameters that require significant user intervention and, as a result, often yield ambiguous results. Mathematical methods are more robust when dealing with complex data matrices. Empirical indices assume an unsubstantiated noise distribution, and statistical techniques are often limited by matrix size or by the assumption of normally distributed noise. In the REFAE and MS methods, frequency analysis is employed to differentiate chemical information from noise. Frequency analysis reveals that chemical information is concentrated at relatively low frequencies, whereas noise is concentrated at high frequencies. By focusing on low-frequency chemical signals, RESO and NPFPCA can overcome, to some extent, the difficulties of identifying minor components and handling heteroscedastic noise.
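The frequency argument can be illustrated with a short numerical sketch. This is not code from any of the cited methods: the Gaussian band, the white-noise model, the 200-point axis, and the helper name `low_freq_power_fraction` are all assumptions chosen for illustration.

```python
import numpy as np

def low_freq_power_fraction(x, n_low):
    """Fraction of spectral power in the n_low lowest-frequency bins of x."""
    # Remove the mean so that the constant offset does not dominate bin 0
    power = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    return power[:n_low].sum() / power.sum()

rng = np.random.default_rng(0)
axis = np.arange(200)

# Hypothetical smooth band standing in for "chemical information"
peak = np.exp(-0.5 * ((axis - 100) / 15.0) ** 2)
# White Gaussian noise standing in for measurement noise
noise = rng.standard_normal(axis.size)

# 20 of the 101 one-sided frequency bins are treated as "low frequency"
signal_fraction = low_freq_power_fraction(peak, n_low=20)
noise_fraction = low_freq_power_fraction(noise, n_low=20)
print(f"signal: {signal_fraction:.4f}, noise: {noise_fraction:.4f}")
```

Under these assumptions nearly all of the power of the smooth band lies in the lowest bins, whereas white noise spreads its power roughly evenly over all bins.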
In this work, we propose a novel method, referred to as SRISM, in which two self-referencing interlaced submatrices play a key role. In SRISM, two submatrices are constructed from the odd and the even column vectors of the original data matrix, respectively, in an interlacing manner. Both are downsampled versions of the original matrix. The odd and even interlaced submatrices are similar with respect to low-frequency chemical information but differ in terms of high-frequency noise. The two interlaced submatrices are decomposed into two sets of PCs using PCA. A pairwise comparison of the two sets of PCs readily yields the number of chemical species. SRISM was evaluated using both simulated and experimental datasets. The experimental data matrices were produced with up to six chemical species. Compared to other commonly used methods, SRISM was more robust when dealing with interferences such as signal overlapping, minor components, homoscedastic and heteroscedastic noise, instrument aberrations and collinearity. When applied to monitoring ammonia in infrared atmospheric spectra, it was able to detect ammonia at a concentration of 0.1 ppm. Moreover, SRISM was shown to be mathematically rigorous, computationally efficient, and readily automated.
Ⅱ. THEORY
Throughout this work, boldface lower- and upper-case letters denote vectors and matrices, respectively. All vectors are column vectors. Subscripts indicate the matrix size.
PCA is broadly used to reduce the dimension of a data matrix by forming linear combinations of the original variables that best account for the variance of the data matrix. For valid measurements, it is reasonable to expect that true signals are stronger than noise and thus contribute more to the variance than the noise does. It is therefore possible to separate the primary PCs from the secondary PCs. This is the premise of most PCA-based methods, including ours.
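This premise can be checked numerically with a minimal sketch. The two Gaussian profiles, the mixing weights, and the noise level are hypothetical. PCA is computed here as an uncentered singular value decomposition, which is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def explained_fraction(data):
    """Fraction of the total sum of squares captured by each PC (uncentered SVD)."""
    singular_values = np.linalg.svd(data, compute_uv=False)
    squared = singular_values ** 2
    return squared / squared.sum()

rng = np.random.default_rng(1)
axis = np.arange(100)[:, None]

# Two hypothetical pure profiles (columns), mixed in varying proportions
profiles = np.hstack([
    np.exp(-0.5 * ((axis - 35) / 8.0) ** 2),
    np.exp(-0.5 * ((axis - 65) / 8.0) ** 2),
])
weights = rng.uniform(0.2, 1.0, size=(2, 30))

# 100-by-30 data matrix: rank-2 signal plus weak Gaussian noise
data = profiles @ weights + 0.001 * rng.standard_normal((100, 30))

ratios = explained_fraction(data)
print(np.round(ratios[:4], 6))
```

In this example the first two (primary) PCs account for almost all of the sum of squares, and the remaining (secondary) PCs carry only noise.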
The SRISM method comprises the following three steps (see FIG. 1):
(ⅲ) Calculate the correlation coefficients between the paired PCs from
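The procedure of interlaced submatrices, PCA, and pairwise correlation of PCs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the left singular vectors are taken as the PCs, the absolute Pearson correlation is used to remove the sign ambiguity, the 0.9 threshold is the one quoted for the experimental results, and counting stops at the first pair below the threshold. The function names and the simulated three-component matrix are hypothetical.

```python
import numpy as np

def srism_correlations(data, n_pcs=8):
    """Correlate paired PCs of the odd- and even-column submatrices."""
    odd = data[:, 0::2]    # columns 1, 3, 5, ... (1-based numbering)
    even = data[:, 1::2]   # columns 2, 4, 6, ...
    # Left singular vectors serve as PCs; both sets have one entry per row
    u_odd, _, _ = np.linalg.svd(odd, full_matrices=False)
    u_even, _, _ = np.linalg.svd(even, full_matrices=False)
    k = min(n_pcs, u_odd.shape[1], u_even.shape[1])
    # The absolute value removes the arbitrary sign of each singular vector
    return np.array([abs(np.corrcoef(u_odd[:, i], u_even[:, i])[0, 1])
                     for i in range(k)])

def estimate_species(data, threshold=0.9, n_pcs=8):
    """Count the leading paired PCs whose correlation exceeds the threshold."""
    count = 0
    for r in srism_correlations(data, n_pcs):
        if r <= threshold:
            break
        count += 1
    return count

# Hypothetical three-component bilinear data with 0.1% homoscedastic noise
rng = np.random.default_rng(0)
rows = np.arange(200)[:, None]
cols = np.arange(60)[:, None]
row_profiles = np.exp(-0.5 * ((rows - np.array([50, 100, 150])) / 15.0) ** 2)
col_profiles = np.array([1.0, 0.7, 0.4]) * np.exp(
    -0.5 * ((cols - np.array([20, 30, 40])) / 6.0) ** 2)
clean = row_profiles @ col_profiles.T
data = clean + 0.001 * clean.max() * rng.standard_normal(clean.shape)

n_species = estimate_species(data)
print(n_species)
```

The sketch relies on the profiles along the column direction being sampled densely enough that the odd and even columns carry the same chemical information. If that direction is undersampled, the paired PCs would no longer be expected to agree.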
It is noted that SRISM might yield overestimations if the data matrix contains a large amount of low-frequency interference, e.g. sloping baselines and strong fluorescent backgrounds in Raman spectra. In such cases, the original data should be preprocessed with baseline removal or background correction.
Ⅲ. EXPERIMENTAL METHODS
The proposed SRISM method was extensively evaluated using simulated gas chromatography coupled with infrared spectroscopy (GC-IR), experimental high-performance liquid chromatography coupled with diode array detector (HPLC-DAD), experimental pulsed field gradient nuclear magnetic resonance (NMR) datasets, and open-path Fourier transform infrared (OP/FT-IR) spectra obtained from atmospheric monitoring. All programs were written in MATLAB 2017a (The MathWorks, Inc., Natick, MA).
A. Simulated datasets
Based on Beer's law, three-component GC-IR datasets of diethyl ether, ammonia, and beta-propiolactone were emulated with IR spectra and chromatograms. The IR spectra were within the wavenumber range of 750-1250 cm
Each rare-earth oxide (99.95%) was dissolved in hydrochloric acid (1.0 mol/L) to yield a stock solution (1.000 g/L). A three-component mixture solution contained Yb (2.0 mg/L), Tm (2.0 mg/L), and Er (2.0 mg/L) (mixture 1). A six-component mixture solution was prepared containing Lu (1.5 mg/L), Yb (1.0 mg/L), Tm (3.0 mg/L), Er (2.5 mg/L), Ho (3.8 mg/L), and Tb (2.1 mg/L) (mixture 2). Another six-component mixture solution contained Lu (1.0 mg/L), Yb (2.0 mg/L), Tm (3.5 mg/L), Er (3.2 mg/L), Ho (2.4 mg/L), and Tb (2.1 mg/L) (mixture 3). The three rare-earth mixtures were analyzed with a FL 2000 HPLC Workstation (Spectra-Physics, USA), with detection at multiple wavelengths at 5 nm intervals using the ultraviolet-visible (UV-Vis) spectroscopy detector (Spectra-Physics, USA). A 1-dodecanesulphonate solution (0.01 mol/L) was used as the hydrophobic ion reagent to pretreat the reversed-phase column. Two mobile phase solutions were prepared containing 0.25 mol/L lactic acid (pH
A three-component mixture containing glucose (10.65 mg), sucrose (12.82 mg), and maltotriose (17.13 mg) was prepared in D
OP/FT-IR spectra were measured around animal farms using two types of spectrometers (System A: Air-Sentry, Cerex Monitoring Solutions, Atlanta, GA and System B: MDA Corp., Atlanta, GA). A Globar source was coupled with an interferometer (Bomem Michelson 100, Canada), a splitter and a 25 cm expanding telescope. The expanded beam was reflected at a distance of 100-200 m. The reflected beam was measured by a mercury-cadmium-telluride detector. Interferograms, with a resolution of 1 cm
All the GC-IR datasets were simulated with various levels of interference, such as signal overlapping, minor components, and noise.
1. Use of SRISM to determine the number of chemical species
Four GC-IR data matrices were simulated with 0.1% homoscedastic noise added in four different runs. The data matrices were analyzed by SRISM (FIG. 3). In each case, the correlation coefficients were close to 1 for the first three PCs. SRISM analysis therefore gave 3 as the number of chemical species for each data matrix, which is the correct number for the simulated datasets.
2. Effects of chromatographic overlap, strength, and noise
Chromatographic overlap was simulated by moving the chromatographic peak of ammonia toward that of diethyl ether. Variations of chromatographic strength were simulated by decreasing the chromatographic peak height of ammonia. Homoscedastic or heteroscedastic noise was added to all data matrices. The corresponding data matrices were analyzed by SRISM and other commonly used methods. SRISM alone coped with high levels of all four types of interference and gave the correct number of chemical species (see Table Ⅰ).
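A simulation of this kind might be set up as in the sketch below. The Gaussian elution profiles and spectra are hypothetical stand-ins for the real chromatograms and IR spectra, and the noise models are assumptions: homoscedastic noise is given one standard deviation for all elements (a fraction of the maximum signal), and heteroscedastic noise is given a standard deviation proportional to each element. The text does not specify the noise models actually used.

```python
import numpy as np

def add_homoscedastic_noise(clean, level, rng):
    """Add Gaussian noise with the same standard deviation for all elements."""
    sigma = level * np.abs(clean).max()
    return clean + sigma * rng.standard_normal(clean.shape)

def add_heteroscedastic_noise(clean, level, rng):
    """Add Gaussian noise whose standard deviation scales with each element."""
    sigma = level * np.abs(clean)
    return clean + sigma * rng.standard_normal(clean.shape)

rng = np.random.default_rng(0)
time = np.arange(80)[:, None]
wavenumber = np.arange(120)[:, None]

# Hypothetical elution profiles (columns of C) and spectra (columns of S)
C = np.exp(-0.5 * ((time - np.array([30, 40, 50])) / 5.0) ** 2)
S = np.exp(-0.5 * ((wavenumber - np.array([30, 60, 90])) / 10.0) ** 2)

# Beer's law: the absorbance matrix is bilinear in C and S
clean = C @ S.T

homo = add_homoscedastic_noise(clean, 0.001, rng)
hetero = add_heteroscedastic_noise(clean, 0.001, rng)
print(homo.shape, hetero.shape)
```

Shifting a peak center in `C` toward its neighbor would emulate increasing chromatographic overlap, and scaling one column of `C` would emulate a minor component.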
In Tables S2 and S3 (see supplementary materials), it can be seen that mathematical methods generally produced an accurate number of chemical species. SRISM, OPALS, and DRAUG were able to detect the most overlapped or minor components. Empirical and statistical methods performed well when interference levels were low but they tended to underestimate numbers when interference levels were high.
The results shown in Tables S4 and S5 (see supplementary materials) indicate that SRISM has a strong tolerance for the two types of noise at each level and is able to correctly determine the number of chemical species. Most of the other methods were adversely affected by high noise levels and tended to over- or under-estimate the number of chemical species. OPALS and MS were capable of dealing with homoscedastic noise, but had more difficulty producing the correct number of chemical species in the presence of heteroscedastic noise. This was also true of IND, DRMAD, and DRAUG. These methods were unable to deal with added levels of heteroscedastic noise because empirical IND and statistical DRMAD and DRAUG are based on the assumption that noise is specifically or normally distributed. By contrast, SRISM analysis depends upon frequency differences between chemical information and noise and, therefore, is free of such assumptions. Thus, it performs much better regardless of the noise type. Also based on frequency differences, NPFPCA estimated the correct number of chemical species in the presence of high-level noise.
B. Experimental datasets
FIG. S1 (see supplementary materials) shows the three-dimensional (3D) plots and chromatograms of the three HPLC-DAD Datasets 1-3. FIG. S1 (a), (b) (see supplementary materials) show severe pump oscillations in the original Dataset 1, which distorted the chromatographic information. To reduce these instrument aberrations, only the chromatograms in the range 625-720 nm of the original Dataset 1 were used for later calculations. Pump oscillation was nevertheless still present at a high level in Dataset 1. The size of Dataset 1 is 932-by-20. In FIG. S1 (c, d) (see supplementary materials), high-level instrument aberrations appear in Datasets 2 and 3. High levels of signal overlap are also present in both Datasets 2 and 3 (FIG. S1 (c, d) in supplementary materials), as evidenced by only four distinct chromatographic peaks and one minor peak for the corresponding six-component mixtures. The sizes of Datasets 2 and 3 are 1600-by-25. The numbers of chemical species for the three HPLC-DAD Datasets 1, 2, and 3 are 3, 6, and 6, respectively.
Compared to the spectra of GC-IR and HPLC-DAD, NMR spectra contain smaller high-frequency signals. Downsampling was carried out in the decay-profile domain to avoid undersampling any of the chemical information. The three-component mixture of glucose, sucrose, and maltotriose was measured and recorded as the 10470-by-16 Dataset 4. The six-component mixture of methanol, ethanol, butanol, sorbitol, lysine, and sucrose was also analyzed. The corresponding experimental NMR data matrix contained problematic noise and collinearity. To analyze the six-component mixture more easily, four data matrices were obtained using NMR spectral segments within the chemical shift ranges of 3.280-3.240 ppm (262-by-32 Dataset 5, methanol response), 2.000-1.149 ppm (5570-by-32 Dataset 6, butanol and lysine response), 3.864-3.700 ppm (1074-by-32 Dataset 7, sucrose and sorbitol response) and 3.864-3.605 ppm (1696-by-32 Dataset 8, sucrose, sorbitol and lysine response). The numbers of chemical species for the five NMR Datasets 4, 5, 6, 7, and 8 were therefore 3, 1, 2, 2, and 3, respectively.
1. Use of SRISM to determine the number of chemical species
The SRISM results for the experimental data matrices are shown in FIG. 4. The eight plots demonstrate that the correlation coefficients of the first few paired PCs were close to 1 and higher than the 0.9 threshold. The correlation coefficients for the remaining PCs were all below the threshold. The numbers of chemical species were determined to be 3, 6, 6, 3, 1, 2, 2, and 2, which are consistent with the actual numbers of chemical species except for Dataset 8. The SRISM results exhibit a clear separation between the primary and secondary PCs. The marked difference between the two sets of PCs stems from their inherent frequency differences, which SRISM exploits to separate the two types of PCs. In summary, this method produced accurate numbers of chemical species in the presence of high-level instrument aberrations, signal overlapping and collinearity.
3D plots of the NMR experimental Datasets 4 and 8 are shown in FIG. S2 (see supplementary materials). The high collinearity of the decay profiles makes it more difficult to determine the number of chemical species, especially for pulsed field gradient NMR spectra. SRISM nevertheless processed the multiple-component NMR datasets successfully and gave clear determinations despite the high level of collinearity (FIG. 4 (d)-(g)). FIG. 4(h) shows that SRISM gave a chemical species number of 2 for the three-component Dataset 8. The extremely similar decay profiles of sorbitol and lysine account for this incorrect estimation.
2. Comparison among three categories of methods
The eight experimental data matrices were also analyzed by several other commonly used methods to determine the number of relevant chemical species. The results (Table Ⅱ) show that SRISM determined the correct number of chemical species in 7 out of 8 cases, the highest accuracy of all the methods tested. SRISM performed well in the presence of high levels of low-frequency pump oscillation and signal overlapping in Datasets 1-3. These types of high-level interference were not adequately handled by the other methods. SRISM, NPFPCA and REFAE were able to deal with the high-level collinearity in the three-component Dataset 4, whereas the other methods could not yield reliable results (see FIG. S3 in supplementary materials). RESO, REFAE and MS, which are based on frequency differences, performed well with some datasets, but tended to yield underestimations when dealing with complicated multiple-component datasets such as Datasets 2 and 3. In other studies, IND, F-test, DRMAD and DRAUG were reported to give surprisingly good results when the noise follows an assumed or normal distribution. However, they tended to overestimate the number of chemical species in the presence of instrument aberrations, signal overlapping and collinearity in the eight actual datasets. All methods failed for Dataset 8 in the presence of severe collinearity.
These methods were further evaluated in terms of calculation time. The results are listed in Table Ⅲ. The size of the data matrices varies from 932-by-20 (Dataset 1) to 5570-by-32 (Dataset 6). SRISM completed the calculation in a few milliseconds even for the largest Dataset 6, and its calculation time did not increase significantly with the size of the dataset. According to the computation times listed in Table Ⅲ, SRISM was always one of the fastest methods. The other mathematical methods were more time-consuming because of their complex procedures.
3. Application of SRISM to infrared spectra of atmospheric monitoring
OP/FT-IR spectra were continuously measured in four sessions of atmospheric monitoring. The spectra within the wavenumber range of 750-1250 cm
The OP/FT-IR Datasets 9-12 were also analyzed in 10 groups of spectra by other methods (see Tables S6-S9 in supplementary materials). In real atmospheric monitoring datasets, wind, dust and rain are inevitably present and produce unstable OP/FT-IR results and flawed data [33, 34]. Although Datasets 9-12 were measured under different conditions and with different instruments, SRISM yielded the correct number of chemical species in most cases. F-test showed similar efficacy for atmospheric monitoring and determined the chemical species number to be 3 for Dataset 9, which contains high ammonia concentrations. By contrast, REFAE and MS underestimated the number of chemical species. Other methods tended to yield inconsistent chemical species numbers when the concentration of ammonia varied. Therefore, SRISM is a fast and powerful tool for atmospheric monitoring.
Ⅴ. CONCLUSION
In this report, the SRISM method is proposed as a mathematically rigorous, computationally efficient, and readily automated technique for determining the number of chemical species in a mixture. Its performance was evaluated using both simulated and experimental datasets. The results show that it tolerated various types of interference, such as signal overlapping, minor components, homoscedastic and heteroscedastic noise, instrument aberrations and collinearity, and still yielded accurate results. SRISM utilizes frequency differences to differentiate between chemical information and noise. It has considerable potential for application to various datasets, because chemical information is always completely sampled and noise is not. This method requires no user intervention to determine the number of chemical species, which makes it both objective and efficient. Its reliable results are useful for qualitative and quantitative analyses of mixtures.
Ⅵ. ACKNOWLEDGMENTS
This work was supported by the Program for Changjiang Scholars and Innovative Research Team in University and Fundamental Research Funds for the Central Universities (wk2060190040). The authors wish to express their thanks to Dr. Bin Yuan at Wuhan Institute of Physics and Mathematics, Chinese Academy of Sciences for providing the pulsed field gradient NMR data and in-depth discussions about the results.
[1] E. L. Schymanski, H. P. Singer, J. Slobodnik, I. M. Ipolyi, P. Oswald, M. Krauss, T. Schulze, P. Haglund, T. Letzel, S. Grosse, N. S. Thomaidis, A. Bletsou, C. Zwiener, M. Ibáñez, T. Portolés, R. De Boer, M. J. Reid, M. Onghena, U. Kunkel, W. Schulz, A. Guillon, N. Noyon, G. Leroy, P. Bados, S. Bogialli, D. Stipaničev, P. Rostkowski, and J. Hollender, Anal. Bioanal. Chem. 407, 21 (2015).
[2] P. A. Mello, J. S. Barin, F. A. Duarte, C. A. Bizzi, L. O. Diehl, E. I. Muller, and E. M. M. Flores, Anal. Bioanal. Chem. 405, 24 (2013).
[3] B. Meermann and M. Sperling, Anal. Bioanal. Chem. 403, 6 (2012).
[4] M. Li, L. Yang, Y. Bai, and H. W. Liu, Anal. Chem. 86, 1 (2013).
[5] S. Crotty, S. Gerişlioǧlu, K. J. Endres, C. Wesdemiotis, and U. S. Schubert, Anal. Chim. Acta 931, 1 (2016). DOI:10.1016/j.aca.2016.05.013
[6] Y. B. Monakhova and S. P. Mushtakova, Anal. Bioanal. Chem. 409, 13 (2017).
[7] Y. B. Monakhova, S. P. Mushtakova, S. S. Kolesnikova, and S. A. Astakhov, Anal. Bioanal. Chem. 397, 3 (2010). DOI:10.1007/s00216-010-3559-1
[8] M. Garrido, F. X. Rius, and M. S. Larrechi, Anal. Bioanal. Chem. 390, 8 (2008).
[9] N. D. Lourenço, J. A. Lopes, C. F. Almeida, M. C. Sarraguça, and H. M. Pinheiro, Anal. Bioanal. Chem. 404, 4 (2012).
[10] P. Conti, S. Zamponi, M. Giorgetti, M. Berrettoni, and W. H. Smyrl, Anal. Chem. 82, 9 (2010).
[11] J. B. Chen, S. Q. Sun, and Q. Zhou, Anal. Bioanal. Chem. 407, 19 (2015).
[12] G. F. Trindade, M. L. Abel, C. Lowe, R. Tshulu, and J. F. Watts, Anal. Chem. 90, 6 (2018).
[13] L. W. Hantao, H. G. Aleme, M. P. Pedroso, G. P. Sabin, R. J. Poppi, and F. Augusto, Anal. Chim. Acta 731, 11 (2012). DOI:10.1016/j.aca.2012.04.003
[14] H. Parastar, J. R. Radović, J. M. Bayona, and R. Tauler, Anal. Bioanal. Chem. 405, 19 (2013). DOI:10.1007/s00216-012-6520-7
[15] Z. D. Zeng, H. M. Hugel, and P. J. Marriott, Anal. Bioanal. Chem. 401, 8 (2011).
[16] A. Golshan, H. Abdollahi, S. Beyramysoltan, M. Maeder, K. Neymeyr, R. Rajko, M. Sawall, and R. Tauler, Anal. Chim. Acta 911, 1 (2016). DOI:10.1016/j.aca.2016.01.011
[17] E. R. Malinowski, Factor Analysis in Chemistry, 3rd Edn., New York: John Wiley & Sons (2002).
[18] W. Lu and L. M. Shao, J. Uni. Sci. Tech. China 44, 11 (2014).
[19] S. L. Hao and L. M. Shao, Chemom. Intell. Lab. Syst. 149, 17 (2015). DOI:10.1016/j.chemolab.2015.10.011
[20] Z. P. Chen, Y. Z. Liang, J. H. Jiang, Y. Li, J. Y. Qian, and R. Q. Yu, J. Chemom. 13, 1 (1999). DOI:10.1002/(ISSN)1099-128X
[21] C. J. Xu, Y. Z. Liang, Y. Li, and Y. P. Du, Analyst 128, 1 (2003). DOI:10.1039/b211322h
[22] E. R. Malinowski, Anal. Chem. 49, 4 (1977). DOI:10.1021/ac50009a704
[23] T. M. Rossi and I. M. Warner, Anal. Chem. 58, 4 (1986). DOI:10.1021/ac00292a703
[24] H. L. Shen, L. Stordrange, R. Manne, O. M. Kvalheim, and Y. Z. Liang, Chemom. Intell. Lab. Syst. 51, 1 (2000).
[25] E. R. Malinowski, J. Chemom. 3, 1 (1989). DOI:10.1002/(ISSN)1099-128X
[26] E. R. Malinowski, J. Chemom. 18, 9 (2004).
[27] E. R. Malinowski, J. Chemom. 23, 1 (2009). DOI:10.1002/cem.v23:1
[28] E. R. Malinowski, J. Chemom. 25, 6 (2011).
[29] N. M. Faber, L. M. C. Buydens, and G. Kateman, Anal. Chim. Acta 296, 1 (1994). DOI:10.1016/0003-2670(94)85145-X
[30] J. G. Proakis, Digital Signal Processing: Principles, Algorithms, and Applications, 3rd Edn., New York: Macmillan (1996).
[31] P. R. Griffiths, L. M. Shao, and A. B. Leytem, Anal. Bioanal. Chem. 393, 1 (2009). DOI:10.1007/s00216-008-2462-5
[32] L. M. Shao, B. X. Liu, P. R. Griffiths, and A. B. Leytem, Appl. Spectrosc. 65, 7 (2011).
[33] L. M. Shao and P. R. Griffiths, Anal. Chem. 79, 5 (2007).
[34] L. M. Shao, C. W. Roske, and P. R. Griffiths, Anal. Bioanal. Chem. 397, 4 (2010).