Characteristic wavelength selection of volatile organic compounds infrared spectra based on improved interval partial least squares
Abstract
As important components of air pollution, volatile organic compounds (VOCs) can cause great harm to the environment and the human body, so changes in VOC concentration should be tracked in real-time environmental monitoring systems. To solve the problem of wavelength redundancy in full-spectrum partial least squares (PLS) modeling for VOC concentration analysis, a new method based on improved interval PLS (iPLS) integrated with Monte-Carlo sampling, called the iPLS-MC method, is proposed to select the optimal characteristic wavelengths of VOC spectra. The method uses iPLS modeling to preselect the characteristic wavebands of the spectra and generates random wavelength combinations from the selected wavebands by Monte-Carlo sampling. The wavelength combination with the best prediction result in the regression model is selected as the characteristic wavelengths of the spectrum. Different wavelength selection methods were applied to Fourier transform infrared (FTIR) spectra of ethylene and ethanol gas at different concentrations obtained in the laboratory. When the interval number of the iPLS model is set to 30 and the Monte-Carlo sampling runs 1000 times, the iPLS-MC method reduces the number of wavelengths from 8916 to 20, only 0.22% of the full-spectrum wavelengths. The RMSECV and correlation coefficient (Rc) for ethylene are 0.2977 ppm and 0.9999, and those for ethanol gas are 0.2977 ppm and 0.9999. The experimental results show that the iPLS-MC method can select the optimal characteristic wavelengths of VOC FTIR spectra stably and effectively, and that using characteristic wavelengths significantly improves the prediction performance of the regression model while simplifying it.
1. Introduction
Volatile organic compounds (VOCs) are defined as any organic compounds that can produce oxidants by reacting with nitrogen oxides in sunlight, which accelerates the photochemical reactions of the atmosphere.1 There are more than 300 types of VOCs, all of which share the features of high vapor pressure, low boiling point and strong reactivity.2 The major species of VOCs, along with their sources and influences, are listed in Table 1. The table shows that most VOCs can cause serious damage to human health and the environment. For humans, inhalation of or exposure to VOCs can induce various acute and chronic health effects such as central nervous system impairment, cancer, and skin and sensory irritation.3 VOCs also harm the ambient air because they are crucial precursors of photochemical ozone, which is a main contributor to the global greenhouse effect.4 Moreover, VOCs play a key role in the formation of secondary organic aerosols, which are the principal component of PM2.5.5 The sources of VOCs vary between countries and regions, and VOCs are emitted from both anthropogenic and natural sources. In developed areas, the major source of VOCs is automobile exhaust, which contributes 25% of the total VOC concentration; other sources include industrial emissions, petroleum refining and storage, solvent evaporation, fossil fuel combustion and biogenic emissions. Natural sources of VOCs are mainly vegetation and ocean emissions,6 which account for a small proportion of the total VOC (TVOC) emissions in urban areas.
With the accelerated development of industry and the rapid expansion of transportation networks, VOCs have become significant atmospheric pollutants in urban areas of China.7 As the atmospheric pollution incidents caused by VOCs become increasingly serious and prominent, it is necessary to establish a real-time continuous VOC monitoring system to identify specific VOCs and control their emission.

Table 1. Major species of VOCs with their sources and influences.

| Categories | Representatives | Sources | Health impacts | Environment influence |
|---|---|---|---|---|
| Alkane | Methane9 | Part-combusted gas | Breathing irritation | Greenhouse gas |
| | Butane10 | Fuel additive | Memory loss | Marine ecological damage |
| | Hexane | Refrigerant solvent | Anesthetic | Haze |
| | | Gasoline consumption | | |
| Alcohol | Methanol11 | Antiseptics, petrochemical derivative | Skin and eye illness | Large amounts of waste water and gas during production |
| | Ethanol | Cosmetics, pharmaceuticals | Breathing irritation | Photochemical smog |
| | Isopropyl alcohol | | Central nervous system impairment | |
| Aldehyde, esters | Formaldehyde | Building and decorative materials | Throat, eye and skin illness | Precursor of ozone |
| | Acetaldehyde | Cosmetics decomposition | Carcinogen | Detrimental to vegetation |
| | | Tobacco smoking | Central nervous system damage | |
| | | Chemical production | Dizziness | |
| Olefins | Ethylene12 | Petrochemical derivative | Carcinogen | Photochemical ozone |
| | Isoprene | Production of perfumes and pharmaceuticals | Anesthetic | Water and soil pollution |
| | Propylene | Adhesives | Vertigo | |
| BTEX12 | Benzene | Petroleum products | Carcinogenic and mutagenic hazards | Photochemical smog, damage to the ozone layer |
| | Toluene | Part-combusted liquid fuels | Tiredness | Marine ecological damage |
| | Ethylbenzene, xylenes | Adhesives | Sleepiness | |
| | | Production of paints | | |
| Cl-VOCs (chlorinated VOCs) | Carbon tetrachloride | Solvent, insecticide | Acute toxicity | Damage to the ozone layer |
| | Chlorobenzene | Dye industry | Dizziness | Greenhouse gas effects |
| | Tetrachloroethane | Pharmaceutical | Nervous and kidney damage | |
| | Trichloroethylene | Adhesives | | |
| | | Sewage disposal | | |
| Ketones | Acetone13 | Paint | Skin inflammation | Oxygen depletion in aquatic systems |
| | Cyclohexanone | Cleaning agent | Corneal damage | |
| | | Diluent | Narcosis | |
| | | Adhesives | Nausea | |

Classical VOC analytical methods include chemical methods, liquid chromatography (LC),14 gas chromatography-mass spectrometry (GC-MS),15 gas chromatography with flame ionization detection (GC-FID),16 photo-ionization detection (PID)17 and so on. With the continuous progress of sensor and computer technology, various online real-time VOC detection technologies and corresponding portable instruments have been developed. Commercial VOC online monitoring instruments are mostly based on GC-MS technology; for example, the Hazardous Air Pollutants on Site (HAPSITE)18 instrument designed by Inficon is a portable GC-MS unit that can identify and quantify VOCs from environmental samples. Proton transfer reaction-mass spectrometry (PTR-MS) has also become a well-established approach for monitoring VOCs in recent years owing to its high sensitivity and fast time response. Joost et al.19 used PTR-MS to detect the air composition in aged forest-fire and urban plumes. Cui et al.20 continuously measured VOCs with a PTR-MS instrument at an urban roadside in Hong Kong. Chemical ionization reaction mass spectrometry (CIRMS),21 an extension of PTR-MS, has been applied in real-time atmospheric research; it uses various chemical ionization reagents in the ionization process, which makes it a more versatile technique than PTR-MS. Koss et al.22 used NO+ as the CIRMS reagent ion to achieve fast online measurement of trace atmospheric VOCs.
These online real-time detection methods are based on chemical or physical reactions between gases and reagents. Their results are accurate, but they are time-consuming and complicated, and some require toxic reagents for preprocessing. Reagent-free, rapid and continuous measurement will be the new trend in VOC detection.
Owing to progress in modern physics, especially in surface physics, optics and electronics, infrared spectroscopy has developed considerably over the past decades. As a nondestructive analysis tool, spectroscopy has been increasingly applied in fields such as environmental monitoring, food, pharmaceuticals and agriculture, with the advantages of a wide measurement range, multicomponent analysis and continuous real-time monitoring. Spectroscopic methods include nondispersive infrared (NDIR), differential absorption lidar (DIAL), differential optical absorption spectroscopy (DOAS), tunable diode laser absorption spectroscopy (TDLAS) and Fourier transform infrared (FTIR) spectroscopy.23 Among these methods, FTIR is recommended by the US Environmental Protection Agency as the specified VOC online detection method because it is rapid, highly sensitive and able to detect multiple components simultaneously.5 An FTIR spectrum is generated by the absorption of infrared radiation during vibrational transitions of polyatomic molecules that have an asymmetric dipole moment. The spectrum consists of absorption peaks associated with functional groups (C–H, O–H, N–H, etc.) that reflect the composition and concentration of substances. Since the main components of air, nitrogen (78% of dry air), oxygen (21%) and argon (1%), are transparent to infrared radiation because of their symmetrical molecular form, FTIR can detect trace VOCs without interference from these components.
Spectroscopic methods can identify the components of unknown substances and their amounts by building qualitative and quantitative calibration models of the acquired spectra. Conventional calibration models often use the full spectrum so as not to lose any information. A full spectrum usually consists of thousands of variables, which contain not only the information of the target components but also much redundant information, such as noise and interfering components. Modeling with redundant variables degrades the sensitivity and prediction accuracy of the calibration model. Meanwhile, the number of variables is usually much larger than the number of available sample spectra, which makes full-spectrum analysis extremely difficult with common modeling methods. Therefore, variable selection, which selects the most informative variables instead of using the full spectrum, is crucial in spectroscopic analysis.24 Research on variable selection, which is also called characteristic wavelength selection for infrared spectra, has achieved good performance in many fields. Miaw et al.25 used the iPLS method to quantify the FTIR spectra of adulterated fruit syrup. Durand et al.26 adopted a genetic algorithm (GA) integrated with PLS modeling for near infrared (NIR) quantitative prediction of the cotton content in cotton-viscose textile samples. Li et al.27 investigated Monte-Carlo uninformative variable elimination (MC-UVE) combined with the successive projections algorithm (SPA) to select the most effective variables from the NIR spectra of pears in order to determine their soluble solid content and firmness. Han et al.28 presented an ensemble of Monte-Carlo uninformative variable elimination (EMCUVE) for the multivariate calibration of spectral data.
Fan et al.29 used an improved Monte-Carlo sampling method, named competitive adaptive reweighted sampling (CARS), to select the characteristic wavelengths of vinegar spectra in order to determine the total acid content of the vinegar. The results of these studies show that variable selection can efficiently enhance prediction ability and simplify the identification model. However, there has been little research on characteristic wavelength selection for VOC gas spectra.
The objective of this paper is to find the optimal characteristic wavelengths of VOC infrared spectra by variable selection, since the selected characteristic wavelengths make the calibration model of the spectrum more effective and simpler. In addition, this study aims to select as few wavelengths as possible while still identifying the concentration information of the gas spectrum. To achieve these goals, the paper proceeds as follows: (1) the acquisition and pretreatment of the experimental materials are presented and the features of general VOC spectra are described; (2) the principle of the proposed wavelength selection method and its feasibility for selecting characteristic wavelengths of VOC spectra are introduced; (3) the proposed iPLS-MC modeling is compared with full-spectrum modeling and improved iPLS modeling, and its advantages are analyzed in terms of the number of wavelengths, root mean square error of cross-validation (RMSECV), root mean square error of prediction (RMSEP), correlation coefficient of the calibration set (Rc) and correlation coefficient of the prediction set (Rp).
2. Materials and Methods
2.1. Sample preparation
Ethylene (molecular formula C2H4), a representative VOC gas, is one of the world’s most widely produced petrochemical derivatives and is mainly emitted from engine exhaust, thermal power plants and the food industry. Excessive inhalation of ethylene can have an anesthetic effect; ethylene also causes photochemical smog and increases ground-level ozone.30 Ethanol (molecular formula C2H5OH) gas is another typical VOC. It is the most common monohydric alcohol and has been widely used in medical and health services, the chemical industry, the food industry and agriculture.31 According to the carcinogens list published by the World Health Organization, ethanol is a risk factor for many cancer types, including cancer of the pharynx, liver and breast. As a flammable liquid, ethanol readily evaporates into a vapor that can form an explosive mixture with air. Because they are easy to prepare, inexpensive and of little harm to the human body at low concentrations, ethylene and ethanol gas are suitable for laboratory research. Therefore, we chose ethylene and ethanol gas as the experimental materials for this study. The gases for the experiments were produced by Hefei Ningte Gas Company and stored in separate 4 L sealed gas cylinders. The original concentration of ethylene is 2005 ppm (parts per million, volume concentration) and the original concentration of ethanol is 2007 ppm. The experimental gases were mixed with nitrogen as the auxiliary gas to obtain different concentrations. The gas distribution platform is the 4-channel high-precision gas distribution system independently developed by Hefei Institutes of Physical Science, Chinese Academy of Sciences. The precision of the distribution system is 0.1% of the original gas concentration, with an error range of 0.5–1‰.
In total, we obtained 60 groups of ethylene gas with concentrations ranging from 60.15 to 178.445 ppm in steps of 2.005 ppm, and 60 groups of ethanol gas with concentrations ranging from 60.21 to 178.623 ppm in steps of 2.007 ppm. Because of the distribution system error, the accuracy of each concentration is within ±0.5–1‰ of the calculated concentration.
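The concentration series can be checked with a few lines of arithmetic. The sketch below assumes that every level is an integer multiple of the dilution step (multiples 30 to 89, which is an inference from the reported endpoints rather than something stated in the text); the function name is illustrative. Steps are passed in thousandths of a ppm so that the results are not affected by accumulated floating-point error.

```python
def concentration_levels(step_thousandths_ppm, first_multiple, n_levels):
    """Return concentration levels in ppm as integer multiples of the step.

    step_thousandths_ppm: dilution step in units of 0.001 ppm (2005 -> 2.005 ppm).
    first_multiple: multiple of the step used for the lowest level (assumed).
    n_levels: number of concentration groups.
    """
    multiples = range(first_multiple, first_multiple + n_levels)
    return [k * step_thousandths_ppm / 1000 for k in multiples]


# Assumed multiples 30..89 reproduce the reported ranges for both gases.
ethylene_levels = concentration_levels(2005, 30, 60)
ethanol_levels = concentration_levels(2007, 30, 60)
```

Under this assumption the step equals one thousandth of the stock concentration (2005 ppm and 2007 ppm), which matches the stated 0.1% precision of the distribution system.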
2.2. Spectra acquisition
The infrared spectra were collected by an FTIR multicomponent gas analyzer system independently developed by Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences. The analyzer uses FTIR spectroscopy and contains a multiple-reflection gas cell with an optical path of 10 m. Its measuring range covers 700 to 5000 cm−1 with a resolution of 1 cm−1, and it averages 16 scans to produce the output spectrum. The experimental system was completely airtight, and all the pipes passed a vacuum test before the experiment started. At the beginning of the experiment, the gas cylinder was placed in the ventilation room and the air flow was kept unobstructed. The flow rate of gas from each cylinder was controlled by the software parameters of the gas distribution platform. The experimental gas and nitrogen were mixed proportionally to the required concentration. When the gas flow was steady and the pressure in the gas cell was close to 1 bar, we checked the viewing screen of the analyzer to see whether the spectrum was stable, and once it was stable we recorded the spectrum data and the corresponding concentration.
The raw spectra of all 60 groups of C2H4 are shown in Fig. 1. According to HITRAN (the high-resolution transmission molecular absorption database), the infrared spectrum of C2H4 has three strong absorption peaks, located at 2848–2972 cm−1, 1463–1473 cm−1 and 920–980 cm−1, respectively. The three absorption peaks correspond to the C–H stretching vibration band, the –CH2– scissor vibration band and the trans olefin vibration band.

Fig. 1. Raw spectra of C2H4.
The raw spectra of all 60 groups of C2H5OH are shown in Fig. 2. According to the PNNL (Pacific Northwest National Laboratory) database, the infrared spectrum of C2H5OH has three strong absorption peaks, located at 3280–3425 cm−1, 2848–2972 cm−1 and 1050–1090 cm−1, respectively. The three absorption peaks correspond to the –OH stretching vibration band, the C–H stretching vibration band and the C–O stretching vibration band of saturated alcohols.

Fig. 2. Raw spectra of C2H5OH.
As Figs. 1 and 2 show, in addition to the absorption peaks of the target components, there are many interfering absorption peaks, which are primarily generated by H2O(g) and CO2, as well as by the spectral response of the device. The O–H bending vibration peak of H2O(g) appears in the range of 1500–1800 cm−1, its stretching vibration peak appears in the range of 3500–3950 cm−1, and the antisymmetric stretching vibration peak of CO2 appears near 2359 cm−1. Therefore, it is vital to select the characteristic wavelengths of the target components before establishing the quantitative and qualitative calibration models.
2.3. Pretreatment of spectrum
2.3.1. Lifting wavelet spectrum denoising
During spectrum acquisition, the spectrum is usually affected by random noise from the acquisition equipment or the transmission path, so pretreatment of the original spectrum is an essential step before spectrum analysis. In this paper, the lifting wavelet transform (LWT) denoising method is used to pretreat the original spectrum.32 LWT is an improvement of the wavelet algorithm that uses multiplication operations instead of the convolution operations of the traditional wavelet transform. The lifting scheme realizes its spatial transformation in three stages: split, predict and update.
(1) Split: The input signal $S_i$ is divided into two mutually disjoint subsets, $\mathrm{even}_{i-1}$ and $\mathrm{odd}_{i-1}$:

$$\mathrm{even}_{i-1} = S_i[2n], \qquad \mathrm{odd}_{i-1} = S_i[2n-1]. \tag{1}$$

(2) Predict: On the basis of maintaining the correlation of the original data, the predictor $P$ is applied to $\mathrm{even}_{i-1}$ to obtain the predicted value $P(\mathrm{even}_{i-1})$ of $\mathrm{odd}_{i-1}$. The predicted value is subtracted from the actual value of $\mathrm{odd}_{i-1}$ to get the wavelet coefficient $d_{i-1}$:

$$d_{i-1} = \mathrm{odd}_{i-1} - P(\mathrm{even}_{i-1}). \tag{2}$$

(3) Update: The purpose of the update is to preserve some global characteristics of the original signal in the subset $S_{i-1}$. The update operator $U$ is constructed and $\mathrm{even}_{i-1}$ is updated with the wavelet coefficient $d_{i-1}$ to obtain the low-frequency coefficient $S_{i-1}$:

$$S_{i-1} = \mathrm{even}_{i-1} + U(d_{i-1}). \tag{3}$$
After $n$ decompositions with the above three steps, the original signal is expressed as $\{S_{i-n}, d_{i-n}, d_{i-n+1}, \ldots, d_{i-1}\}$, where $S_{i-n}$ represents the low-frequency part of the signal and $\{d_{i-n}, d_{i-n+1}, \ldots, d_{i-1}\}$ represents the high-frequency part. The lifting wavelet simplifies the filtering process into three basic steps, and each decomposition step is reversible. The reconstruction process of the lifting method is the inverse of the decomposition process. The decomposition and reconstruction structure of LWT is shown in Fig. 3.

Fig. 3. Decomposition and reconstruction of the lifting scheme.
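The split, predict and update steps can be illustrated with a minimal sketch. It assumes the simplest possible operators, P(even) = even and U(d) = d/2 (a Haar-style lifting step), not the "db4" lifting filters used for the experiments in this paper. It also uses 0-based indexing, so the even and odd subsets are `signal[0::2]` and `signal[1::2]`. All function names are illustrative.

```python
def lifting_forward(signal):
    """One lifting level with Haar-style operators (an assumed, simplest case)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    even = signal[0::2]  # split: even-indexed samples
    odd = signal[1::2]   # split: odd-indexed samples
    # predict: P(even) = even, so d = odd - P(even)
    detail = [o - e for o, e in zip(odd, even)]
    # update: U(d) = d / 2, so s = even + U(d) keeps the pairwise mean
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def lifting_inverse(approx, detail):
    """Undo one lifting level by reversing update, predict and split."""
    even = [s - d / 2 for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


def decompose(signal, levels):
    """Repeat the lifting step on the low-frequency part `levels` times.

    Returns the final low-frequency part and the detail lists ordered from
    coarsest to finest, mirroring {S_{i-n}, d_{i-n}, ..., d_{i-1}}.
    """
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = lifting_forward(approx)
        details.insert(0, detail)
    return approx, details
```

Because each step only adds or subtracts a function of the other subset, the inverse is obtained by running the same steps backwards with the signs flipped, which is the reversibility property noted above.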
As the above description shows, the lifting scheme needs no data other than the output of the previous lifting step, so the new data flow can replace the old data flow at each point, and the lifting wavelet coefficients can be obtained by repeating the lifting filter banks. The lifting wavelet not only inherits the multiresolution property of the classical wavelet but also has a simple structure, low computational complexity and easy hardware implementation. For an online real-time measurement system, LWT denoising has the advantage of high running speed, so it is suitable for VOC online monitoring.
We take the pretreatment of the C2H4 spectrum as an example. Figure 4 shows the original spectrum of C2H4 at a concentration of 60.15 ppm, and Fig. 5 shows the same spectrum after denoising by a 6-level lifting scheme based on the “db4” wavelet function.

Fig. 4. Original spectrum of C2H4 (60.15 ppm).

Fig. 5. Pretreated spectrum of C2H4 (60.15 ppm).
2.3.2. Comparison of denoising effect
When the resolution of the spectrometer is high enough, an ideal infrared spectrum has sufficiently narrow absorption peaks in the absorbing wavebands and is a smooth curve in the wavebands without infrared absorption. In a measured spectrum, because of the hardware limitations of the spectrometer and the influence of noise, the wavebands without infrared absorption have a sawtooth shape. Based on this property, a nonabsorbing waveband was selected and fitted with a smooth curve. This smooth curve was taken as the pure infrared spectrum in a noise-free environment, and the signal-to-noise ratios (SNRs) of the original spectrum and of the spectrum after lifting wavelet denoising were calculated against it.
Figure 6 takes the waveband of 1100–1200 cm−1 as the object of analysis, since no substance in this experiment absorbs infrared radiation in this band. In the figure, the solid line is the original spectrum, the dotted line is the spectrum after lifting wavelet denoising, and the solid line with point markers is the fitted pure spectrum. The SNR is calculated by Formula (4), and the results are listed in Table 2.

Fig. 6. Spectrum of C2H4 (60.15 ppm) in the waveband of 1100–1200 cm−1.
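Formula (4) is not reproduced in this section, so the sketch below assumes the common definition of SNR in decibels: ten times the base-10 logarithm of the ratio between the power of the reference signal and the power of the residual. Here the reference is the fitted smooth curve and the measured signal is either the original or the denoised spectrum in the nonabsorbing band. The function and argument names are illustrative.

```python
import math


def snr_db(reference, measured):
    """SNR in dB of `measured` relative to a noise-free `reference`.

    Assumed definition: 10 * log10(sum(ref^2) / sum((measured - ref)^2)).
    """
    if len(reference) != len(measured):
        raise ValueError("reference and measured must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((m - r) ** 2 for r, m in zip(reference, measured))
    if noise_power == 0:
        return math.inf  # identical curves: no residual noise
    return 10.0 * math.log10(signal_power / noise_power)
```

With this definition, a residual whose power is one hundredth of the reference power gives 20 dB, and every further tenfold reduction of the residual power adds 10 dB.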
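
Table 2. SNR and PLS modeling results of the original and pretreated C2H4 spectra.

Spectra | SNR (dB) | Number of LVs | Rc (calibration set) | RMSECV (ppm, calibration set) | Rp (prediction set) | RMSEP (ppm, prediction set) |
---|---|---|---|---|---|---|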
Original spectra | 25.4690 | 3 | 0.9568 | 2.7685 | 0.9454 | 2.9731 |
Pretreated spectra | 35.1569 | 3 | 0.9737 | 2.5827 | 0.9629 | 2.7349 |

The original spectra of the 60 groups with concentrations ranging from 60.15 to 178.445 ppm were divided into a calibration set and a prediction set by the Kennard-Stone method with a ratio of 2:1, giving 40 and 20 spectra, respectively. The denoised spectra were divided in the same way. PLS models were built on the original spectra and on the pretreated spectra, and the results are summarized in Table 2.
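The Kennard-Stone split can be sketched as follows. This is a minimal version that assumes Euclidean distance between spectra, which is the usual choice but is not specified in the text; the function name is illustrative. The algorithm starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbor is farthest away, so the calibration set covers the sample space evenly.

```python
import math


def kennard_stone(samples, n_select):
    """Return the indices of `n_select` samples chosen by the Kennard-Stone rule.

    samples: list of equal-length numeric sequences (one per spectrum).
    """
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the pair of samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

For the split described above, `kennard_stone(spectra, 40)` would give the calibration indices, and the 20 indices that are not returned would form the prediction set.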
Figures 4 and 5 show clearly that pretreatment not only reduces the noise interference but also makes the absorption peaks more distinct. Table 2 shows that the spectral SNR increases from 25.4690 to 35.1569 dB after lifting wavelet denoising, an increase of about 38%. For both calibration models, RMSECV and RMSEP are less than 3 ppm and Rc and Rp are over 0.94, which indicates that the PLS model identifies the concentration of VOCs well. The calibration results also show that PLS modeling of the pretreated spectra has better accuracy and predictive ability than modeling of the original spectra.
2.4. Primary methods
2.4.1. PLS
PLS is a multivariate statistical analysis method proposed by Svante Wold in 1983. Beyond a traditional regression model, which directly relates predictors (independent variables) to responses (dependent variables), PLS also accounts for the multiple correlations among variables by transforming the predictors and responses into a set of independent factors named latent variables (LVs). The LVs describe the maximum covariance between predictors and responses.33,34 In the quantitative identification model of VOC spectra, the predictor is the spectral absorption matrix (X) and the response is the gas concentration matrix (Y). At the beginning of modeling, the spectra are divided into a calibration set and a prediction set in a certain ratio. The calibration set is used to establish the calibration model, and its validation parameters are RMSECV and Rc. RMSECV is the error computed on the held-out splits of the calibration set in a cross-validation scheme; it measures the goodness of fit between the known data and the calibration model. The number of LVs is chosen as the one that minimizes RMSECV. The prediction set is used to verify the predictive ability of the calibration model, and its validation parameters are RMSEP and Rp.
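The validation parameters can be computed as in the sketch below. It assumes that Rc and Rp are Pearson correlation coefficients between reference and predicted concentrations, which the text does not state explicitly; the function names are illustrative. RMSECV and RMSEP share the same formula and differ only in which predictions are passed in: held-out cross-validation predictions on the calibration set for RMSECV, and predictions on the prediction set for RMSEP.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumed definition of Rc and Rp)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

A lower RMSE and a correlation coefficient closer to 1 both indicate better agreement, which is how the models in the following tables are compared.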
2.4.2. Interval partial least squares
Interval partial least squares (iPLS) is a development of PLS. It was proposed by Nørgaard in 2000 and has become one of the most commonly used chemometric methods for selecting characteristic variables in recent years.35 The principle of the method is to divide the whole spectrum into several equal-width subintervals and then establish an independent PLS model in each subinterval, each with its own number of LVs. The optimal number of LVs in each subinterval is also determined by the minimum RMSECV from cross-validation. By comparing the model of each subinterval with the global full-spectrum model, the subinterval whose model has the minimum RMSECV is selected as the optimal subinterval. iPLS has the advantages of removing redundant information and simplifying the calculation, but it has the disadvantage of losing useful characteristic information because only a single interval is modeled.
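The interval bookkeeping of classic iPLS can be sketched as follows. The PLS cross-validation itself is left out: `rmsecv_of` is a hypothetical callable that stands in for "fit a PLS model on columns start..stop and return its RMSECV". How the remainder is distributed when the number of variables is not divisible by the number of intervals is also an assumption here (the first intervals get one extra variable).

```python
def split_intervals(n_variables, n_intervals):
    """Return (start, stop) index pairs of contiguous, near-equal intervals."""
    if not 1 <= n_intervals <= n_variables:
        raise ValueError("n_intervals must be between 1 and n_variables")
    base, extra = divmod(n_variables, n_intervals)
    bounds, start = [], 0
    for k in range(n_intervals):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def best_interval(bounds, rmsecv_of):
    """Classic iPLS choice: the single interval with the lowest RMSECV.

    rmsecv_of(start, stop) is a placeholder for PLS cross-validation on the
    variables in [start, stop).
    """
    scores = [rmsecv_of(start, stop) for start, stop in bounds]
    best = min(range(len(bounds)), key=scores.__getitem__)
    return best, scores
```

For example, `split_intervals(8916, 20)` yields 20 index ranges that together cover all 8916 variables, each holding 445 or 446 of them.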
2.4.3. Monte-Carlo sampling
Monte-Carlo sampling, also called the statistical simulation method, is a numerical method guided by the theory of probability and statistics. A characteristic of Monte-Carlo sampling is that the more trials are run, the more accurate the results become. With the development of modern computer technology, Monte-Carlo sampling has become simple and fast in practical applications.36 In the characteristic wavelength selection of spectra, Monte-Carlo sampling uses different combinations of random wavelengths to carry out multiple linear regression (MLR) modeling and calculates the regression coefficient of each combination. The final result of characteristic wavelength selection is determined by the regression coefficient. Although each calculation is simple and fast, Monte-Carlo sampling has the disadvantage that the total amount of calculation becomes very large when there are too many variables to sample from.
2.4.4. Proposed method
The features of VOC infrared spectra include the following: (1) the spectrum usually contains several absorption peaks, and the location and intensity of the peaks are related to the type and quantity of gas molecules, which contain different functional groups; (2) infrared absorption changes rapidly with wavelength, and the spectrum has maxima at certain wavelengths.37 Based on these features, we propose a new characteristic wavelength selection method for VOC gas infrared spectra called the iPLS-MC algorithm. The method uses iPLS to select the subintervals whose RMSECV is less than that of the full-spectrum PLS model as the characteristic wavebands, and then combines these wavebands into a new spectrum combination. Monte-Carlo sampling is used to select random wavelengths from the new spectrum combination, and PLS modeling is carried out to calculate the prediction error of each random wavelength combination. The Monte-Carlo operation is repeated many times to ensure the stability and reliability of the results. Finally, the wavelength combination with the minimum prediction error is selected as the characteristic wavelengths of the spectrum. The method exploits the advantages of both iPLS and Monte-Carlo sampling while avoiding their disadvantages. iPLS selects several characteristic wavebands with good predictive ability, which avoids losing useful information through single-interval modeling. Monte-Carlo sampling is then applied only to the screened spectrum combination, which avoids massive computation, so that it can accurately select the optimal characteristic wavelengths of the VOC infrared spectrum.
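The two stages of the proposed method can be outlined in a short sketch. It is an illustration under stated assumptions, not the implementation used for the experiments: `score` is a hypothetical callable that stands in for "build a PLS model on these variable indices and return its prediction error", the interval errors are assumed to have been computed beforehand, and the function and parameter names are illustrative.

```python
import random


def ipls_mc(bounds, interval_rmsecv, global_rmsecv, score,
            n_pick=20, n_runs=1000, seed=0):
    """Sketch of iPLS preselection followed by Monte-Carlo wavelength sampling.

    bounds: (start, stop) index pairs of the iPLS intervals.
    interval_rmsecv: RMSECV of the PLS model built on each interval.
    global_rmsecv: RMSECV of the full-spectrum PLS model.
    score: placeholder callable mapping a list of indices to a prediction error.
    """
    # Stage 1: keep every interval that beats the full-spectrum model.
    kept = [b for b, err in zip(bounds, interval_rmsecv) if err < global_rmsecv]
    # Pool the variable indices of the kept intervals into one candidate set.
    pool = [i for start, stop in kept for i in range(start, stop)]
    if len(pool) < n_pick:
        raise ValueError("not enough preselected variables to sample from")
    # Stage 2: draw random combinations and keep the one with the lowest error.
    rng = random.Random(seed)
    best_subset, best_error = None, float("inf")
    for _ in range(n_runs):
        subset = sorted(rng.sample(pool, n_pick))
        error = score(subset)
        if error < best_error:
            best_subset, best_error = subset, error
    return best_subset, best_error
```

Because the sampling pool contains only the preselected wavebands, each Monte-Carlo run draws from far fewer candidates than the full spectrum, which is the computational saving the method relies on.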
3. Experiment Results and Discussion
3.1. Model of iPLS
For the iPLS model, the number of intervals has a significant impact on performance. If the number of intervals is too small, the model may degenerate into a full-spectrum PLS model; if it is too large, the amount of computation increases. In this study, the full spectra are divided equally into 20 and 30 intervals, respectively. Figures 7 and 8 show the 20-interval iPLS models of the two gas spectra, and the results of the two models are listed in Tables 3 and 4. In each figure, the gray bars represent the RMSECV of each subinterval, and the dotted line marks the RMSECV of the full-spectrum (global) model. Each table lists the selected interval numbers and their corresponding wavenumber ranges, the optimal number of LVs of each subinterval, the RMSECV and Rc of the calibration set, and the RMSEP and Rp of the prediction set. Under the same experimental conditions, the number of iPLS intervals was set to 30, and the results for the two gas spectra are listed in Tables 5 and 6.

Fig. 7. 20 intervals iPLS model of C2H4 spectra.

Fig. 8. 20 intervals iPLS model of C2H5OH spectra.
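
Table 3. Selected subintervals of the 20 intervals iPLS model of C2H4 spectra.

Interval number | Wavenumber range (cm−1) | Number of LVs | RMSECV (ppm, calibration set) | Rc (calibration set) | RMSEP (ppm, prediction set) | Rp (prediction set) |
---|---|---|---|---|---|---|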
1 | 700–915 | 2 | 0.9074 | 0.9977 | 0.9739 | 0.9951 |
2 | 915–1130 | 3 | 0.6662 | 0.9989 | 0.7043 | 0.9982 |
4 | 1345–1560 | 4 | 1.1394 | 0.9951 | 1.0484 | 0.9935 |
6 | 1775–1990 | 3 | 1.1251 | 0.9974 | 1.2010 | 0.9964 |
11 | 2851–3066 | 5 | 0.8077 | 0.9983 | 0.8205 | 0.9981 |
12 | 3066–3281 | 3 | 1.5176 | 0.9958 | 1.5176 | 0.9940 |
Full spectra | 700–4999 | 3 | 2.5827 | 0.9737 | 2.7349 | 0.9629 |
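
Table 4. Selected subintervals of the 20 intervals iPLS model of C2H5OH spectra.

Interval number | Wavenumber range (cm−1) | Number of LVs | RMSECV (ppm, calibration set) | Rc (calibration set) | RMSEP (ppm, prediction set) | Rp (prediction set) |
---|---|---|---|---|---|---|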
2 | 915–1130 | 2 | 0.6726 | 0.9987 | 0.7043 | 0.9981 |
3 | 1130–1345 | 4 | 1.9062 | 0.9928 | 2.2734 | 0.9859 |
10 | 2636–2851 | 3 | 2.1153 | 0.9814 | 2.5671 | 0.9817 |
11 | 2851–3066 | 3 | 1.4317 | 0.9985 | 1.8502 | 0.9902 |
12 | 3066–3281 | 4 | 1.0554 | 0.9990 | 1.2167 | 0.9953 |
13 | 3281–3496 | 3 | 1.8934 | 0.9976 | 1.9631 | 0.9925 |
Full spectra | 700–4999 | 3 | 3.0184 | 0.9625 | 3.2361 | 0.9594 |
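
Table 5. Selected subintervals of the 30 intervals iPLS model of C2H4 spectra.

Interval number | Wavenumber range (cm−1) | Number of LVs | RMSECV (ppm, calibration set) | Rc (calibration set) | RMSEP (ppm, prediction set) | Rp (prediction set) |
---|---|---|---|---|---|---|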
2 | 843–987 | 3 | 0.6294 | 0.9990 | 0.6458 | 0.9989 |
3 | 987–1131 | 2 | 0.8144 | 0.9983 | 0.8216 | 0.9983 |
5 | 1275–1418 | 10 | 1.2473 | 0.9951 | 1.2487 | 0.9960 |
6 | 1418–1562 | 4 | 1.5858 | 0.9913 | 2.1040 | 0.9892 |
9 | 1848–1992 | 4 | 1.1492 | 0.9969 | 1.2135 | 0.9964 |
16 | 2851–2994 | 2 | 0.9305 | 0.9974 | 1.0895 | 0.9969 |
17 | 2994–3138 | 2 | 2.4827 | 0.9821 | 2.5700 | 0.9804 |
18 | 3138–3281 | 5 | 2.4636 | 0.9836 | 2.5808 | 0.9813 |
Full spectra | 700–4999 | 3 | 2.5827 | 0.9737 | 2.7349 | 0.9629 |
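
Table 6. Selected subintervals of the 30 intervals iPLS model of C2H5OH spectra.

Interval number | Wavenumber range (cm−1) | Number of LVs | RMSECV (ppm, calibration set) | Rc (calibration set) | RMSEP (ppm, prediction set) | Rp (prediction set) |
---|---|---|---|---|---|---|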
2 | 843–987 | 3 | 2.0877 | 0.9895 | 2.3583 | 0.9816 |
3 | 987–1131 | 3 | 0.5219 | 0.9996 | 0.7126 | 0.9983 |
15 | 2708–2851 | 3 | 2.0447 | 0.9901 | 2.6912 | 0.9878 |
16 | 2851–2994 | 3 | 1.1211 | 0.9974 | 1.0895 | 0.9969 |
17 | 2994–3138 | 4 | 1.0516 | 0.9978 | 1.5065 | 0.9904 |
18 | 3138–3281 | 3 | 1.9206 | 0.9927 | 2.4185 | 0.9813 |
19 | 3281–3424 | 4 | 1.7698 | 0.9921 | 1.9219 | 0.9878 |
Full spectra | 700–4999 | 3 | 3.0184 | 0.9625 | 3.2361 | 0.9594 |

Figures 7 and 8 provide an overall view of the relevant information in the different spectral subintervals, which helps us focus on the informative wavebands and remove interfering regions. The predictive ability varies greatly from interval to interval, so the prediction performance of the subinterval models and the global model can be compared easily from the figures.
For the common iPLS model, the subinterval which has the smallest RMSECV value would be selected as the characteristic waveband. The VOC FTIR spectra usually have several absorption bands, so the operation of iPLS model may lose some useful parts of the spectra. In this paper, we select the subintervals in which the RMSECV value is smaller than the global model RMSECV value as the characteristic wavebands, and the corresponding subintervals of each iPLS models are listed in each table. There are six wavebands selected from the 20 intervals iPLS model of C2H4 spectra, 30% of the total wavelengths. Also, eight wavebands are selected from the 30 intervals iPLS model of C2H4 spectra, 27% of the total wavelengths. As for C2H5OH spectra, there are six and seven wavebands selected from the 20 and 30 intervals iPLS model, respectively, 30% and 23% of the total wavelengths. Compared with the information from the standard infrared spectra database, the selected subintervals located in the positions coincide with the absorption peaks in the database. The results indicate that the strategy based on iPLS model could be feasible to select the characteristic regions of the VOC infrared spectra.
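The difference between the two selection rules can be illustrated with a minimal Python sketch. The RMSECV values for intervals 2 to 18 and the global RMSECV are taken from the 30-interval C2H4 table in Sec. 3.1; the values for intervals 1 and 10 are hypothetical placeholders for non-selected intervals, and the function names are introduced here only for illustration.

```python
def select_wavebands(interval_rmsecv, global_rmsecv):
    """Return all interval numbers whose RMSECV is below the global RMSECV."""
    return sorted(k for k, v in interval_rmsecv.items() if v < global_rmsecv)


def best_single_interval(interval_rmsecv):
    """Return the single interval with the smallest RMSECV (common iPLS rule)."""
    return min(interval_rmsecv, key=interval_rmsecv.get)


# RMSECV (ppm) of the selected intervals, 30-interval iPLS model of C2H4.
rmsecv = {2: 0.6294, 3: 0.8144, 5: 1.2473, 6: 1.5858,
          9: 1.1492, 16: 0.9305, 17: 2.4827, 18: 2.4636}
# Hypothetical values for two intervals that perform worse than the global model.
rmsecv.update({1: 3.10, 10: 2.95})

global_rmsecv = 2.5827  # full-spectrum PLS model of C2H4

selected = select_wavebands(rmsecv, global_rmsecv)
single = best_single_interval(rmsecv)
print("multi-waveband rule:", selected)
print("single-interval rule:", single)
```

Under these assumptions the threshold rule keeps all eight intervals listed in the table, whereas the common rule keeps only interval 2.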
3.2. Model of Monte-Carlo sampling
In Sec. 3.1, the iPLS model was used to acquire the characteristic wavebands of the VOC spectra, and the selected wavebands serve as the input to the Monte-Carlo sampling step. A fixed number of wavelengths is drawn at random from the selected wavebands by Monte-Carlo sampling, and a PLS model is established on these wavelengths. We repeated the sampling and PLS modeling many times to ensure the stability and reliability of the results, and calculated the prediction error separately for the wavelength set of each run. Finally, the wavelength set with the minimum prediction error was adopted as the optimal characteristic wavelength combination of the spectrum.
The wavebands selected from the 20-interval iPLS model of the C2H4 spectrum contain 1290 wavelengths in total. In this study, we selected 20 wavelengths from the preselected wavebands by Monte-Carlo sampling and repeated the sampling process 1000 times to find the minimum RMSECV. Over several experiments, the resulting RMSECV was stable at about 0.3 ppm. The results of the iPLS-MC model are shown in Fig. 9, with the characteristic wavelengths denoted by red asterisks. The 20 wavelengths selected from the 30-interval iPLS model of the C2H4 spectra are shown in Fig. 10.
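The sampling loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study: the `score` argument stands in for the cross-validated PLS error (RMSECV), and the `toy_score` function, the integer wavenumber grid and the two "peak" positions are hypothetical placeholders so that the sketch runs without a spectral data set.

```python
import random


def monte_carlo_select(candidates, n_select, n_runs, score, seed=0):
    """Draw n_runs random subsets of size n_select and keep the lowest-error one."""
    rng = random.Random(seed)
    best_subset, best_error = None, float("inf")
    for _ in range(n_runs):
        # Sample without replacement from the preselected wavebands.
        subset = sorted(rng.sample(candidates, n_select))
        error = score(subset)  # in the paper: RMSECV of a PLS model on this subset
        if error < best_error:
            best_subset, best_error = subset, error
    return best_subset, best_error


# Candidate pool: two wavebands from the 30-interval tables, on a hypothetical
# 1 cm-1 grid (the real spectra have their own sampling interval).
candidates = list(range(843, 988)) + list(range(2851, 2995))

# Hypothetical stand-in for RMSECV: distance of the subset to two assumed peaks.
PEAKS = (950, 2900)


def toy_score(subset):
    return sum(min(abs(w - p) for w in subset) for p in PEAKS)


best, best_err = monte_carlo_select(candidates, 10, 1000, toy_score, seed=1)
print(best, best_err)
```

Passing the error function as an argument keeps the random search separate from the regression model, so the same loop could be reused with any cross-validation routine.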

Fig. 9. Wavelengths selected by iPLS (20 intervals)-MC model of C2H4 spectra.

Fig. 10. Wavelengths selected by iPLS (30 intervals)-MC model of C2H4 spectra.
For the C2H5OH spectra, we selected 10 wavelengths in each Monte-Carlo sampling run and repeated the sampling 1000 times to obtain the final result. The 10 wavelengths selected from the 20-interval and 30-interval iPLS models of the C2H5OH spectra are shown in Figs. 11 and 12, respectively. The specific wavelength positions in each figure are listed in Table 7.

Fig. 11. Wavelengths selected by iPLS (20 intervals)-MC model of C2H5OH spectra.

Fig. 12. Wavelengths selected by iPLS (30 intervals)-MC model of C2H5OH spectra.
Spectrum type | Number of iPLS intervals | Number of wavelengths | Wavelength position (cm−1) | RMSECV (ppm) | Rc |
---|---|---|---|---|---|
C2H4 | 20 | 20 | 839, 907, 987, 1025, 1035, 1087, 1111, 1460, 1546, 1823, 1905, 1907, 1951, 2891, 2942, 2963, 3002, 3057, 3077, 3273 | 0.3001 | 0.9999 |
C2H4 | 30 | 20 | 947, 953, 961, 1016, 1285, 1293, 1343, 1435, 1453, 1461, 1882, 1907, 2881, 2899, 2997, 3110, 3196, 3231, 3238, 3256 | 0.3364 | 0.9998 |
C2H5OH | 20 | 10 | 1015, 1049, 2771, 2896, 2981, 3023, 3064, 3161, 3290, 3450 | 0.2977 | 0.9999 |
C2H5OH | 30 | 10 | 1085, 2899, 2942, 2973, 3014, 3079, 3098, 3161, 3257, 3308 | 0.3109 | 0.9999 |
The results listed in Table 7 indicate that Monte-Carlo sampling can select the characteristic wavelengths from gas spectra efficiently: the RMSECV values are much smaller, and the Rc values higher, than the corresponding values of the full-spectrum PLS model. In the C2H4 experiment, we set the number of characteristic wavelengths to 20, which is about 1.6% of the wavelengths in the combined iPLS wavebands. The final results indicate that the proposed method achieves excellent prediction performance. Building on the C2H4 experiment, we reduced the number of characteristic wavelengths to 10, which accounts for only 0.7% of the C2H5OH wavebands selected by the iPLS model, and the iPLS-MC modeling of C2H5OH gave better prediction results than the C2H4 experiment. The reason is combinatorial: when the total number of candidate points is fixed, the fewer points are drawn in each sample, the fewer distinct combinations exist, so a fixed number of sampling runs covers a larger share of all possible combinations. A VOC infrared spectrum usually contains several absorption peaks that characterize the gas, so the number of characteristic wavelengths can be reduced to a small number corresponding to the number of absorption peaks. Comparing the different models on the two VOC spectra shows that the iPLS-MC model selects the informative wavelengths effectively and accurately, and that the prediction accuracy is greatly improved.
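The combinatorial argument can be made concrete with a short Python sketch. It assumes a pool of 1290 candidate wavelengths, the figure given in Sec. 3.2 for the 20-interval C2H4 wavebands; using the same pool size for both subset sizes is an assumption made here only for comparison.

```python
from math import comb

n_candidates = 1290  # wavelengths in the preselected wavebands (assumed for both cases)
n_runs = 1000        # Monte-Carlo sampling runs

for k in (20, 10):
    total = comb(n_candidates, k)  # number of distinct k-wavelength combinations
    print(f"k={k}: {total:.3e} combinations, "
          f"at most {n_runs / total:.3e} of them visited in {n_runs} runs")

# How many times larger the 20-wavelength search space is than the 10-wavelength one.
ratio = comb(n_candidates, 20) // comb(n_candidates, 10)
print(f"search-space ratio (k=20 vs k=10): {ratio:.3e}")
```

The sketch only compares the sizes of the two search spaces; it does not by itself predict which subset size gives the lower RMSECV.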
3.3. Comparison of wavelengths selected by different methods
The performance of each model was evaluated on a prediction set randomly selected from the acquired spectral samples. The prediction set of each gas contains 24 samples with different concentrations. The scatter plots in Figs. 13 and 14 show the correlation between the predicted and actual concentrations for the different modeling methods. In each figure, the star points represent the results of the full-spectrum PLS model, and the circle points and triangle points represent the results of the iPLS model and the iPLS-MC model, respectively. As the figures show, the prediction errors of the iPLS model and the iPLS-MC model are clearly lower than those of the full-spectrum PLS model. In the iPLS-MC model in particular, the predicted and actual concentrations are highly correlated.

Fig. 13. Relationship between the predicted and actual concentration of C2H4 (a) iPLS intervals of 20 and (b) iPLS intervals of 30.

Fig. 14. Relationship between the predicted and actual concentration of C2H5OH (a) iPLS intervals of 20 and (b) iPLS intervals of 30.
Table 8 summarizes the parameters and experimental results of the three models applied to the quantitative analysis of the VOC spectra. Compared with the initial number of wavelengths in the full spectrum (700–4999 cm−1, 4300 wavelengths), the characteristic wavebands selected by the iPLS model contain 215 (20 intervals) and 143 (30 intervals) wavelengths, accounting for 5% and 3.3% of the initial number. Meanwhile, the RMSECV and RMSEP values decreased from about 3 ppm to about 0.6 ppm, and the Rc and Rp values increased from 0.96 to 0.998. The characteristic wavelengths selected by the iPLS-MC model number 20 and 10, accounting for 0.4% and 0.2% of the initial number; the RMSECV and RMSEP values decreased further to about 0.3 ppm, and the Rc and Rp values increased to 0.9999. All of these parameters indicate that the model is stable and achieves satisfactory prediction performance when selecting the characteristic wavelengths of VOC spectra.
Model | C2H4: number of wavelengths | C2H4: RMSECV | C2H4: Rc | C2H4: RMSEP | C2H4: Rp | C2H5OH: number of wavelengths | C2H5OH: RMSECV | C2H5OH: Rc | C2H5OH: RMSEP | C2H5OH: Rp |
---|---|---|---|---|---|---|---|---|---|---|
PLS | 4300 | 2.5827 | 0.9737 | 2.7349 | 0.9629 | 4300 | 3.0184 | 0.9625 | 3.2361 | 0.9594 |
iPLS (20 intervals) | 215 | 0.6662 | 0.9989 | 0.7043 | 0.9982 | 215 | 0.6726 | 0.9987 | 0.7043 | 0.9981 |
iPLS (30 intervals) | 143 | 0.6294 | 0.9990 | 0.6458 | 0.9989 | 143 | 0.5219 | 0.9996 | 0.7126 | 0.9983 |
iPLS-MC (20 intervals) | 20 | 0.3001 | 0.9999 | 0.3118 | 0.9999 | 10 | 0.2977 | 0.9999 | 0.3201 | 0.9999 |
iPLS-MC (30 intervals) | 20 | 0.2846 | 0.9998 | 0.3007 | 0.9999 | 10 | 0.2904 | 0.9999 | 0.3193 | 0.9999 |
The prediction results of the quantitative models improved significantly when characteristic wavelength selection was employed, and the models were also simplified because they use only a small set of informative wavelengths. The results show experimentally that characteristic wavelength selection should be performed before building a calibration model.
4. Conclusion
In this work, an improved characteristic wavelength selection method called the iPLS-MC model was proposed to extract the concentration information of VOCs from their FTIR spectra. As the spectra of different gases vary greatly, it is necessary to select the characteristic wavebands or wavelengths of a gas spectrum before establishing the calibration models. First, we systematically collected the FTIR spectra of ethylene and ethanol in the laboratory, and then used the obtained spectra to build a full-spectrum PLS model, an iPLS model and an iPLS-MC model. Finally, we compared the experimental results of each model and drew the following conclusions: (1) FTIR, as a noncontact spectral technology integrated with PLS modeling, can effectively detect the concentration information of VOCs. (2) Both the iPLS method and the iPLS-MC method can select the characteristic wavebands or wavelengths corresponding to the absorption peak regions of the gas spectrum, and the models built on the selected wavebands or wavelengths predict better than full-spectrum modeling. (3) The iPLS-MC method improves on the single-interval selection of iPLS: it combines multiple wavebands screened by iPLS modeling and selects wavelengths from those wavebands by Monte-Carlo sampling to obtain the optimal wavelength combination. The regression model built on the characteristic wavelengths from the iPLS-MC model outperformed the one built by the iPLS model. In the iPLS-MC model, less than 0.4% of the total spectral wavelengths were selected, and the correlation coefficient of the prediction set (Rp) obtained with the selected wavelengths reached 0.9999. The results demonstrate that the iPLS-MC method is a useful wavelength reduction tool that retains the useful information in VOC spectra.
In this paper, we applied variable selection to the optimization of VOC FTIR spectra for the first time and proposed a new characteristic wavelength selection method, named iPLS-MC, which achieved high-accuracy prediction performance under laboratory conditions. The proposed method offers a new approach to making online monitoring of atmospheric pollution and other spectroscopic analyses simpler and more effective. It is noteworthy that the experiment was carried out with VOC gases because VOC detection is currently a key issue in the prevention and control of atmospheric pollution. The method is applicable not only to VOC gases but also to any other asymmetric polyatomic gas with infrared absorption, such as CO2, SO2 and SO3. To improve the applicability of the model, different kinds of asymmetric polyatomic gases should be tested with it.
Moreover, when the characteristic wavebands of the gases in a mixture overlap in the same spectral region, the results should be discussed in three cases: (1) The concentration of one gas is known, while the concentrations of the other gases in the mixture change little. In this case, the method can accurately select the characteristic wavelengths of the gas with the known concentration. As shown in Fig. 1, one absorption peak of C2H4 overlaps with an absorption peak of water vapor in the waveband near 1500 cm−1, and the proposed method still selects the characteristic wavelengths of C2H4 accurately. (2) The concentration of one gas is known, and the concentrations of the other gases change in the same proportion as the known gas. In this case, the method can select the characteristic wavelengths in the overlapped region, but determining the gas type requires the assistance of other selected absorption peaks. (3) The concentrations of the mixed gases vary in different proportions. In this case, the linear superposition of absorption peaks at the same wavelengths no longer matches the concentration information, and errors or fitting failures may occur when the PLS model is calibrated quantitatively. The new method cannot work in this situation because it is based on the PLS model.
The number of wavelengths drawn in the Monte-Carlo sampling was not optimized in this study; how to choose the optimal number of characteristic wavelengths should be studied in future research.
Conflict of Interest
The authors declare that there is no potential conflict of interest related to this manuscript.
Acknowledgements
The authors are grateful to Prof. Minguang Gao for his technical assistance. This work was supported by the National Key Scientific Instrument and Equipment Development Project of China, Grant No. 2013YQ220643, and the National 863 Program of China, Grant No. 2014AA06A503.