SpectraTr: A novel deep learning model for qualitative analysis of drug spectroscopy based on transformer structure
Abstract
Drug supervision methods based on near-infrared spectroscopy analysis depend heavily on the chemometrics model that characterizes the relationship between spectral data and drug categories. Preliminary applications of convolutional neural networks in spectral analysis demonstrate excellent end-to-end prediction ability, but such networks are sensitive to their hyperparameters. The transformer is a deep learning model based on the self-attention mechanism that rivals convolutional neural networks (CNNs) in predictive performance while having an easier-to-design model structure. Hence, a novel calibration model named SpectraTr, based on the transformer structure, is proposed and used for the qualitative analysis of drug spectra. Experimental results on a seven-class drug dataset and an 18-class drug dataset show that the proposed SpectraTr model can automatically extract features from a large number of spectra, does not depend on pre-processing algorithms, and is insensitive to model hyperparameters. When the ratio of the training set to the test set is 8:2, the prediction accuracy of the SpectraTr model reaches 100% and 99.52% on the two datasets, respectively, outperforming PLS_DA, SVM, SAE, and CNN. The model was also tested on a public drug dataset and achieved a classification accuracy of 96.97% without any pre-processing algorithm, which is 34.85%, 28.28%, 5.05%, and 2.73% higher than PLS_DA, SVM, SAE, and CNN, respectively. The research shows that the SpectraTr model performs exceptionally well in spectral analysis and is expected to become a novel deep calibration model after autoencoder networks (AEs) and CNNs.
1. Introduction
The active ingredients of the same type of drug produced by different manufacturers differ because of distinctions in raw materials and manufacturing processes, resulting in price differences. Unscrupulous merchants exploit the similarity of active ingredients to package low-cost drugs as high-priced drugs from well-known manufacturers. This happens from time to time, seriously damaging consumers' rights and interests as well as the reputations of well-known manufacturers. Therefore, identifying drug manufacturers is a critical task in drug supervision.
Traditional methods for identifying drug manufacturers include chemical methods, HPLC authorization methods, etc., but the detection process is cumbersome, the analysis is slow, and the samples are destroyed. As one of the mainstream nondestructive testing techniques, near-infrared spectroscopy has been widely used in many fields because of its fast detection speed and low cost.1,2,3,4 However, drug supervision methods based on spectral analysis depend heavily on the chemometrics model that characterizes the relationship between spectral data and drug categories. The resolution of spectroscopic measuring devices has grown in tandem with advances in manufacturing technology. While more information about the analysis target can be obtained, the high-dimensional and easily disturbed spectra also make it difficult for traditional chemometrics to extract effective features directly. Therefore, pre-processing and wavelength-selection algorithms are needed prior to modeling.5,6,7 At the same time, because of the wide variety of existing drugs and the continuous development of new ones, the accumulating drug data also pose challenges to methods such as PLS and SVM.8
The application of advanced methods represented by deep learning to spectroscopy analysis promotes the development of chemometrics. Autoencoder networks (AEs) and convolutional neural networks (CNNs) are currently the most common deep learning-based spectral analysis approaches. The former uses an autoencoder network to reconstruct features and then classifies them with traditional classifiers, while the latter extracts features with convolutional layers and uses fully connected layers for classification.9,10,11,12 Although both can automatically learn and extract features from large volumes of spectral data, with low dependence on pre-processing and wavelength-selection engineering, some studies have shown that convolutional neural networks perform better in applications.13,14,15 However, CNNs have many hyperparameters, and their prediction performance depends on appropriate hyperparameter selection, which makes network construction difficult.16,17,18,19 According to the most recent deep learning research, the transformer model based on the attention mechanism has the feature-extraction potential of the CNN model and is easier to configure. It is one of the current research hotspots of deep learning.20,21,22 Some studies have applied transformers to fields such as drug design,23 chemical synthesis,24 protein three-dimensional structure prediction,25 medical imaging,26 and hyperspectral analysis,27 with excellent results. But its application potential in molecular spectroscopy has not yet been studied.
This paper proposes a novel deep learning model for drug spectral identification in order to build an analytical model with better predictive ability and lower hyperparameter dependence. The approach, based on the transformer structure, is named SpectraTr. The performance of the model was first verified on spectral datasets of our drugs from multiple manufacturers, and a public drug dataset was then used to further demonstrate the model's advantages. In the following, we introduce the experimental materials and the mechanism of the transformer network structure in Sec. 2. In Sec. 3, the experimental procedure and the classification results on the different datasets are discussed. Section 4 summarizes the whole paper. The source code is available at https://github.com/FuSiry/Transformer-for-Nirs for academic use only.
2. Materials and Methods
2.1. Datasets
Dataset A is provided by the National Institute for Food and Drug Control, and all samples pass the statutory method test. It comprises 333 cefixime tablets produced by Guangzhou Baiyunshan Pharmaceutical General Factory and 716 phenytoin sodium tablets produced by 11 pharmaceutical companies including Southwest Pharmaceutical Co., Ltd. The raw spectra of both drugs, shown in Fig. 1, were measured on a Bruker Matrix spectrometer. Each spectrum ranges from 4000 to 11995 cm$^{-1}$ at 4 cm$^{-1}$ intervals, for a total of 2074 absorption points. As the figure shows, the spectra of similar drugs from different manufacturers are similar.

Fig. 1. Spectra of cefixime and phenytoin sodium tablets. (a) Spectra of cefixime tablets produced by seven pharmaceutical companies and (b) Spectra of phenytoin sodium tablets produced by 11 companies.
Dataset B is an open-source drug tablet dataset with a wavelength range of 7400–10507 cm$^{-1}$ and 404 absorbance points. The samples are classified into four categories according to their active ingredients. The raw spectra are shown in Fig. 2 and can be obtained from http://www.models.life.ku.dk/Tablets.

Fig. 2. Spectra of the drug tablet dataset.
2.2. SpectraTr architecture
As one of the most groundbreaking works in the deep learning field, the transformer has been widely discussed. We propose SpectraTr, shown in Fig. 3, based on this work.28 SpectraTr is similar to a classifier based on an AE network and consists of an encoder and MLP layers. The raw spectrum is divided into several spectra patches. After position coding, spectral features are extracted by the Spectra encoder, composed of multi-head attention layers, and weighted by the MLP layer, which yields the qualitative analysis result for the drug spectrum.

Fig. 3. The architecture of SpectraTr.
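To make the data flow concrete, the following is a minimal PyTorch sketch of the architecture in Fig. 3. All names and sizes are illustrative rather than the released implementation: the 2074-point spectra are assumed zero-padded to 2080 so they divide evenly into 40 patches, and a learnable position embedding stands in for the encoding of Sec. 2.2.1.

```python
import torch
import torch.nn as nn

class SpectraTrSketch(nn.Module):
    """Sketch: patch split -> position embedding -> Spectra encoder -> MLP head."""
    def __init__(self, n_points=2080, n_patches=40, d_model=256,
                 n_heads=8, depth=5, mlp_dim=512, n_classes=7):
        super().__init__()
        self.n_patches = n_patches
        patch_len = n_points // n_patches            # points per spectra patch
        self.embed = nn.Linear(patch_len, d_model)   # patch -> token embedding
        # learnable position embedding (stand-in for Sec. 2.2.1)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, mlp_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # stacked attention blocks
        self.head = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, n_classes))  # MLP head

    def forward(self, x):                            # x: (batch, n_points)
        x = x.view(x.shape[0], self.n_patches, -1)   # split spectrum into patches
        x = self.embed(x) + self.pos                 # add position information
        x = self.encoder(x)                          # multi-head attention features
        return self.head(x.mean(dim=1))              # pool patch tokens, classify

logits = SpectraTrSketch()(torch.randn(4, 2080))     # 4 spectra -> (4, 7) logits
```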
2.2.1. Position embedding
SpectraTr splits the full spectrum into multiple spectra-patch sequences and extracts features from them, which makes it compatible with 1D spectra as well as 2D image data. Previous researchers have made similar attempts with CNN networks by splitting the raw spectrum into multiple segments before entering the model, with better results.16 However, splitting the spectrum into multi-segment patches destroys the position information of each patch within the raw spectrum. To retain this information, position encoding is performed on the spectra patches. Following the sinusoidal scheme of the original transformer,28 the encoding is

$$PE_{(pos,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d}}\right),$$

where $pos$ is the position of the patch and $d$ is the embedding dimension.
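A minimal NumPy sketch of this encoding, assuming an even embedding dimension $d$:

```python
import numpy as np

def sinusoidal_position_encoding(n_patches: int, d: int) -> np.ndarray:
    """Sinusoidal position encoding from the original transformer (Ref. 28)."""
    pos = np.arange(n_patches)[:, None]            # patch positions
    i = np.arange(d // 2)[None, :]                 # embedding-dimension index
    angles = pos / np.power(10000.0, 2 * i / d)    # pos / 10000^(2i/d)
    pe = np.zeros((n_patches, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(n_patches=40, d=256)  # added to patch embeddings
```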
2.2.2. Multi-head attention mechanism
The self-attention mechanism is the core innovation of the transformer model. It automatically focuses on the important information in the data, increasing the weight of important information and decreasing the weight of unimportant information in the input. The implementation of the attention mechanism in chemometrics is analogous to automatic wavelength selection.29 The computational procedure of the self-attention mechanism is shown in Fig. 4.

Fig. 4. The computational procedure of the self-attention mechanism.
The matrices $q_i$, $k_i$, and $v_i$ are obtained by linearly transforming the input sequence $a_i$. Then, the matrix $q_i$ from $a_i$ is dot-multiplied with the corresponding matrix $k_j$ from $a_j$ to obtain $\alpha_{i,j}$, the correlation between $a_i$ and $a_j$, as shown in the following equation (the scaling by $\sqrt{d}$, with $d$ the dimension of $q$ and $k$, follows the standard transformer formulation):

$$\alpha_{i,j}=\frac{q_i\cdot k_j}{\sqrt{d}}.$$
The correlations are then normalized with a softmax, $\alpha'_{i,j}=\exp(\alpha_{i,j})/\sum_{k}\exp(\alpha_{i,k})$. Finally, the normalized relationship degree $\alpha'_{i,j}$ of each input vector with the other vectors is multiplied by $v_j$ to obtain the attention score of the input:

$$o_i=\sum_j \alpha'_{i,j}\,v_j.$$

If $a_i$ is strongly correlated with $a_j$, the calculated value of $\alpha'_{i,j}$ will be larger; $o_i$ will then weight $v_j$ more heavily, indicating that the information is more critical.
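A single-head sketch of this computation in PyTorch (the dimensions are illustrative):

```python
import torch

def self_attention(a, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    a: (n, d_in) input sequence; Wq/Wk/Wv: (d_in, d) projection matrices."""
    q, k, v = a @ Wq, a @ Wk, a @ Wv             # linear transforms of the input
    alpha = (q @ k.T) / k.shape[-1] ** 0.5       # correlations alpha_{i,j}
    alpha = torch.softmax(alpha, dim=-1)         # normalized alpha'_{i,j}
    return alpha @ v                             # o_i = sum_j alpha'_{i,j} v_j

n, d_in, d = 40, 52, 64
a = torch.randn(n, d_in)                         # 40 embedded spectra patches
o = self_attention(a, *(torch.randn(d_in, d) for _ in range(3)))  # (40, 64)
```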
Multiple self-attention layers are used to form a multi-head attention block to capture richer feature information. This is similar to the principle of using multiple filters simultaneously in a CNN. The schematic is shown in Fig. 5.

Fig. 5. Structure diagram of multi-head attention mechanism.
For a single input $a_1$, the outputs of the $n$ self-attention heads are stacked and then linearly transformed to obtain the final attention matrix $o_1$. The multiple self-attention layers form multiple subspaces, which allow the model to attend to different aspects of the information.
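In PyTorch, this head stacking and output projection is packaged in `nn.MultiheadAttention`; a short illustrative usage:

```python
import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(1, 40, 256)        # 40 embedded spectra patches of dimension 256
out, weights = mha(x, x, x)        # self-attention: query = key = value = x
print(out.shape, weights.shape)    # (1, 40, 256) and (1, 40, 40)
```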
2.2.3. LayerNorm and residual connection
A LayerNorm (LN) layer is added before and after the multi-head attention block to eliminate the undesirable effects of outlier samples, similar to the use of BatchNorm (BN) in CNNs:

$$\mathrm{LN}(x)=\gamma\odot\frac{x-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta,$$

where $\mu$ and $\sigma^{2}$ are the mean and variance of the features of $x$, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.
Since the spectral encoder uses a stacked structure similar to a CNN, an overly deep network may cause the vanishing-gradient problem. Therefore, drawing on the idea of ResNet,30 a residual mechanism is introduced around the multi-head attention (MHA) block and the MLP block:

$$x'=x+\mathrm{MHA}(\mathrm{LN}(x)),\qquad y=x'+\mathrm{MLP}(\mathrm{LN}(x')).$$
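Putting the pieces together, a sketch of one Spectra-encoder block with pre-norm residual connections, under the same illustrative dimensions as before:

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """One Spectra-encoder block: LN -> multi-head attention -> residual,
    then LN -> MLP -> residual."""
    def __init__(self, d_model=256, n_heads=8, mlp_dim=512):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.mha(h, h, h)[0]        # residual over attention
        return x + self.mlp(self.ln2(x))    # residual over MLP

y = EncoderBlock()(torch.randn(2, 40, 256))  # (batch, patches, d_model)
```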
2.3. Modeling
SpectraTr is implemented in Python 3.8 with the open-source deep learning framework PyTorch 1.8.0. It runs on a workstation equipped with a GeForce MP40 GPU, an Intel® Core i7-4770K CPU @ 3.5 GHz, and the Ubuntu operating system. The network structure of SpectraTr is described in Sec. 2.2. The initial parameters are shown in Table 1.
Hyperparameter | Type | Range
---|---|---
Batch size | MT | {16, 32, 64, 128}
Learning rate | MT | [1e-4, 1e-1]
L2 regularization | MT | [0, 3e-2]
Patch num | MA | {5, 10, 20, 25, 40, 50, 80, 100}
Multi-head num | MA | {8, 10, 12, 14, 16, 18, 20}
Layer depth | MA | {2, 3, 4, 5, 6, 7, 8, 9}
Mlp_dim | MA | {128, 256, 512, 1024, 1536, 2048}
On both datasets, cross-entropy is used as the loss function and the Adam optimizer is used for training, with the total number of epochs set to 200. To balance training time against training effect, a learning-rate decay strategy is introduced: if the training-set loss does not decline within 10 epochs, the learning rate is halved. An early-stopping strategy is also introduced to prevent over-fitting: if the test-set loss does not drop within 30 epochs, training stops. The final selected parameters are shown in Table 2; a code sketch of this training strategy follows the table.
Dataset | Batch size | Learning rate | L2 regularization | Patch num | Multi-head num | Layer depth | Mlp_dim
---|---|---|---|---|---|---|---
A | 16 | 0.0001 | 0.01 | 40 | 19 | 5 | 512
B | 16 | 0.0001 | 0 | 10 | 12 | 3 | 1024
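A minimal sketch of the decay and early-stopping logic described above, assuming hypothetical `model`, `train_loss()`, and `test_loss()` helpers:

```python
import torch

def fit(model, train_loss, test_loss, epochs=200):
    """Adam + halve-LR-on-plateau (patience 10) + early stopping (patience 30)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode="min", factor=0.5, patience=10)  # halve LR if train loss stalls
    best, stale = float("inf"), 0
    for epoch in range(epochs):
        loss = train_loss(model, opt)    # one epoch of cross-entropy training
        sched.step(loss)                 # decay driven by the training loss
        t = test_loss(model)
        if t < best - 1e-6:
            best, stale = t, 0
        else:
            stale += 1
        if stale >= 30:                  # early stop: no test-loss drop in 30 epochs
            break
```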
2.4. Baseline comparison
The classical chemometrics models SVM and PLS_DA and the mainstream deep learning models SAE and CNN are used as baselines and compared with the proposed SpectraTr. SVM and PLS_DA are implemented with the machine learning library scikit-learn, where the SVM parameter C is set to 1.0 and gamma to 0.001, the PLS_DA component count is set to 22, and the rest are defaults. The SAE network consists of three stacked encoders of input-1024, 1024-256, and 256-output, as in the reference study.8 The CNN network consists of three convolution layers and three fully connected layers, as in the reference study.13 All baseline models run in the same environment as SpectraTr.
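A hedged scikit-learn sketch of the two classical baselines with the stated parameters (PLS_DA is built here as PLS regression on one-hot labels followed by an argmax, a common construction that the paper does not spell out):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import label_binarize

# X_train, y_train, X_test: spectra matrices and integer class labels (assumed given)
def svm_baseline(X_train, y_train, X_test):
    return SVC(C=1.0, gamma=0.001).fit(X_train, y_train).predict(X_test)

def pls_da_baseline(X_train, y_train, X_test, n_classes):
    Y = label_binarize(y_train, classes=range(n_classes))   # one-hot targets
    pls = PLSRegression(n_components=22).fit(X_train, Y)
    return np.argmax(pls.predict(X_test), axis=1)           # class = largest response
```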
2.5. Model evaluation
Models are evaluated with the classical qualitative analysis metrics: accuracy, precision, recall, and F1-score. Accuracy is the number of correctly predicted samples as a percentage of the total. Precision is the percentage of samples predicted positive that are actually positive. Recall is the percentage of actually positive samples that are predicted positive. The F1-score is the harmonic mean of precision and recall. Their mathematical expressions are

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\quad \mathrm{Precision}=\frac{TP}{TP+FP},\quad \mathrm{Recall}=\frac{TP}{TP+FN},\quad F1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.$$
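These metrics can be computed directly with scikit-learn; a short sketch (macro averaging across the drug classes is an assumption, since the paper does not state the averaging mode):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"Acc": acc, "Precision": prec, "Recall": rec, "F1-Score": f1}

print(evaluate([0, 1, 2, 2], [0, 1, 2, 1]))  # toy example
```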
3. Result and Discussion
3.1. Result of dataset A
3.1.1. Performance of SpectraTr on dataset A
In this section, the near-infrared spectra of cefixime and phenytoin sodium produced by multiple manufacturers are classified with the proposed SpectraTr model and compared against the baseline methods. For comparison with previous research, we use the same dataset as in Ref. 13 and follow the experimental arrangement described there. First, a qualitative analysis of cefixime tablets from seven manufacturers was carried out. Then, the mixture of cefixime tablets and phenytoin sodium tablets was used for a qualitative analysis of the 18 manufacturers across the two drug classes. In the experiments, 80% of the samples were randomly selected as the training set and the remaining 20% were used as the test set. Both sets use standardization pre-processing to speed up model training, and all experimental results come from the test set. The prediction performance of SpectraTr on cefixime and the mixed drugs is shown in Table 3.
Dataset | Acc | Precision | Recall | F1-Score |
---|---|---|---|---|
Cefixime | 100.00 | 100.00 | 100.00 | 100.00 |
Phenytoin sodium and Cefixime | 99.52 | 99.55 | 99.52 | 99.53 |
The proposed SpectraTr model has excellent manufacturer-identification capability for both the cefixime and the mixed-drug spectra. The prediction accuracy on the seven-manufacturer cefixime dataset reaches 100%, and the accuracy on the 18-manufacturer mixed-drug dataset reaches 99.52%.
3.1.2. Model performance under different training sets
Spectral datasets are usually small because gathering and labeling spectral data is expensive. Deep learning, however, is data-driven: with few samples it overfits easily, leading to poor prediction performance. Therefore, training sets were randomly drawn from the drug spectroscopy dataset at ratios ranging from 0.8 down to 0.2 to construct training and test sets of different scales. SpectraTr is compared with the baseline models using accuracy as the evaluation index; the results on the cefixime drugs are shown in Table 4.
Train/Test | SVM | PLS_DA | SAE | CNN | SpectraTr |
---|---|---|---|---|---|
266/67 | 97.61 | 96.05 | 99.01 | 100.00 | 100.00 |
233/100 | 97.00 | 95.00 | 99.00 | 100.00 | 100.00 |
199/134 | 96.25 | 94.06 | 98.76 | 99.25 | 99.83 |
166/167 | 95.20 | 93.40 | 98.28 | 98.70 | 99.57 |
133/200 | 94.00 | 91.37 | 97.80 | 98.23 | 99.04 |
99/234 | 92.87 | 89.15 | 96.86 | 97.50 | 98.58 |
66/267 | 90.63 | 87.21 | 95.32 | 96.38 | 97.06 |
When the pre-processing method is not carefully chosen, the prediction accuracy of the deep learning models is better than that of the classical chemometric methods, and there is no notable performance erosion when the sample size is reduced. The proposed SpectraTr model performs best on every training-set scale. The prediction accuracy even reaches 100% when the training set contains 266 and 233 samples. After the training set is reduced to 66 samples, the accuracy is still 97.06%, slightly better than CNN and SAE and significantly better than PLS_DA and SVM. Training sets of various proportions were constructed for the mixed drugs in the same way; the results are shown in Table 5.
Train/Test | SVM | PLS_DA | SAE | CNN | SpectraTr |
---|---|---|---|---|---|
839/210 | 88.57 | 92.00 | 99.49 | 99.57 | 99.52 |
734/315 | 85.39 | 91.36 | 99.28 | 99.64 | 99.68 |
629/420 | 84.04 | 89.76 | 98.09 | 98.86 | 99.07 |
524/525 | 81.90 | 87.90 | 96.81 | 98.01 | 98.67 |
419/630 | 78.10 | 85.13 | 95.77 | 96.90 | 97.65 |
314/735 | 77.14 | 82.59 | 94.73 | 94.16 | 95.98 |
209/840 | 74.40 | 79.28 | 92.36 | 92.75 | 93.81 |
Similar to the results on cefixime, the deep learning methods remain better, and SpectraTr is the best model in the majority of scenarios. It achieves a prediction accuracy of 99.68% with 734 training samples. When the training set is reduced to 209 samples, the accuracy is still 93.81%, better than CNN and SAE. However, the performance loss as the training set shrinks is more noticeable than on cefixime, indicating that deep learning performance on small samples is also restricted by a dataset's classification difficulty.
3.1.3. Influence of pre-processing algorithm on model prediction ability
The drug spectra are pre-processed with standardization, MSC, SG (window of 15), and SNV, respectively, to compare the effects of different pre-processing algorithms on the models. Under a 7:3 train/test division, the experimental results on dataset A are shown in Table 6. SVM and PLS_DA are greatly affected by the pre-processing algorithm: with suitable pre-processing they exceed 90% accuracy, but in most cases the models are unusable. Deep learning shows a low dependence on pre-processing; although different pre-processing methods yield different predictive performance, all results are acceptable, which is one of the main reasons for the gradual adoption of deep learning in spectral analysis.
Dataset | Model | RAW | Standard | MSC | SG | SNV
---|---|---|---|---|---|---
Cefixime | SVM | 26.00 | 97.00 | 21.00 | 26.00 | 81.00
 | PLS_DA | 20.00 | 95.00 | 20.00 | 20.00 | 67.00
 | SAE | 99.00 | 99.00 | 98.00 | 98.00 | 100.00
 | CNN | 98.00 | 100.00 | 100.00 | 100.00 | 99.00
 | SpectraTr | 99.00 | 100.00 | 100.00 | 100.00 | 99.00
Cefixime and phenytoin sodium | SVM | 21.59 | 85.39 | 18.73 | 21.59 | 50.04
 | PLS_DA | 23.87 | 91.36 | 10.15 | 11.16 | 37.16
 | SAE | 86.03 | 99.28 | 76.82 | 85.71 | 98.41
 | CNN | 92.61 | 99.64 | 91.42 | 99.37 | 95.62
 | SpectraTr | 96.42 | 99.68 | 96.30 | 99.15 | 94.06
In this experiment, we observed that the SpectraTr model is less dependent on pre-processing than the CNN model. On cefixime, the accuracy of the raw-spectrum SpectraTr model was 99%, within 1% of the qualitative analysis models built on pre-processed spectra. On cefixime and phenytoin sodium, the raw-spectrum SpectraTr model achieves 96.42%, while the standardization, MSC, SG, and SNV pre-processing models achieve 99.68%, 96.30%, 99.15%, and 94.06%, respectively. This demonstrates that the SpectraTr model can automatically extract features from spectral data with various noise components.
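For reference, minimal NumPy/SciPy sketches of the four pre-processing methods compared above (the SG polynomial order is an assumption, as the paper specifies only the window of 15):

```python
import numpy as np
from scipy.signal import savgol_filter

# X: (n_samples, n_wavelengths) absorbance matrix
def standardize(X):        # "Standard": zero mean, unit variance per wavelength
    return (X - X.mean(axis=0)) / X.std(axis=0)

def snv(X):                # SNV: normalize each spectrum individually
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def msc(X):                # MSC: regress each spectrum on the mean spectrum
    ref = X.mean(axis=0)
    out = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)   # x ≈ b * ref + a
        out[i] = (x - a) / b
    return out

def sg(X, window=15, polyorder=2):     # SG smoothing; polyorder=2 is an assumption
    return savgol_filter(X, window_length=window, polyorder=polyorder, axis=1)
```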
3.2. Result of dataset B
To compare effectively with previous research, we used the same open-source dataset as in Ref. 10 and conducted the experiments with the arrangement described there. 201 samples were randomly selected to create the training set, and the remaining 99 samples were used as the test set. The prediction performance of the SpectraTr model on the original spectra is shown in Table 7.
Dataset | Acc | Precision | Recall | F1-score | CNN (Ref. 10)
---|---|---|---|---|---
Tablet | 96.97 | 97.03 | 96.97 | 96.98 | 94
The SpectraTr model achieves an accuracy, precision, recall, and F1-score of 96.97%, 97.03%, 96.97%, and 96.98%, respectively, better than the model built on CNN. The prediction performance of the SpectraTr model and the other baseline models under different pre-processing methods is shown in Table 8.
Model | RAW | Standard | MSC | SG | SNV
---|---|---|---|---|---
SVM | 68.69 | 78.57 | 40.20 | 68.86 | 40.20
PLS_DA | 62.12 | 79.58 | 47.20 | 60.25 | 55.25
SAE | 91.92 | 95.95 | 62.26 | 92.93 | 58.59
CNN | 94.24 | 98.21 | 90.77 | 95.54 | 90.53
SpectraTr | 96.97 | 99.21 | 89.36 | 96.09 | 88.85
After standardization, MSC, SG, and SNV pre-processing, SpectraTr reaches 99.21%, 89.36%, 96.09%, and 88.85%, respectively, better than the other models in most cases. However, apart from standardization, pre-processing yields a negative gain for SpectraTr in the vast majority of situations. We believe deep learning methods do not rely on pre-processing, or need only standardization, because they have good spatial invariance. The extensive application of CNNs to images shows that rotation, translation, and clipping have little effect on the predictions of deep learning models, and in our previous studies we also found that deep learning adapts well to slight rotations of spectra.31 Therefore, deep learning does not need pre-processing algorithms to remove noise such as baseline drift from the spectra, but standardization helps the gradients converge faster.
3.2.1. The effect of hyperparameters on model performance
The above experiments have shown the qualitative analysis performance of the SpectraTr model and its ability to extract features automatically under different pre-processing methods. However, research shows that deep learning networks are large and sensitive to network parameter settings. Therefore, on our drug dataset and the open-source drug dataset, the depth of the spectral encoder, the number of attention heads, the number of patches, and the dimension Mlp_dim are varied to study the impact of these settings on SpectraTr. The results are shown in Fig. 6.

Fig. 6. The effect of model hyperparameters on prediction in dataset A and dataset B.
The effect of different patch numbers on the model is shown in Fig. 6(a). On dataset A the differences are small, with the best results when the patch number is set to 40. On dataset B, performance is best when the patch number is set to 10; when it is set to 25, the prediction accuracy drops below 95%. Because datasets A and B have different spectral dimensions, results are generally better when each patch contains about 50 spectral points. Figure 6(b) shows the influence of the Spectra encoder depth; the effects are similar on the two datasets. A network that is too shallow cannot extract sufficient features, while one that is too deep overfits easily. From Fig. 6(c), the number of attention heads has little effect on the model, with prediction accuracy fluctuating within about 3% across settings. The influence of Mlp_dim is shown in Fig. 6(d); performance is better around 512 or 1024. In summary, the four groups of main parameters give the model good predictive performance over wide ranges, showing that SpectraTr has low sensitivity to network-layer parameter settings. Therefore, when setting up the SpectraTr model, there is no need to search the massive parameter space for the optimal combination one by one, which is an advantage over CNN models.
3.3. The result of training time
The training times of the different models on datasets A and B are shown in Fig. 7. The number of training samples affects training time: the tablet-dataset experiment was carried out with 201 training samples, and the rest of the datasets were tested under the 8:2 train/test split.

Fig. 7. The result of training time in datasets A and B.
The training time for deep learning is substantially longer than for the SVM and PLS_DA algorithms, since it requires many training iterations. The need for greater hardware resources and longer training time is the drawback of deep learning methods. Among the deep learning models, SpectraTr has the longest training time, which is affected by both the number of samples and the spectral dimension: training time increases as either grows.
4. Conclusion
A novel deep learning model called SpectraTr is proposed for the qualitative analysis of near-infrared drug spectra. Experimental results on multiple drug classification datasets show that SpectraTr outperforms the classical PLS and SVM models and the existing deep learning methods AE and CNN in most cases. In addition, the model automatically extracts features from spectral data and does not rely on complex pre-processing or model-parameter selection engineering. This initial application of SpectraTr to spectroscopy shows its potential, and it is expected to become a major class of deep learning methods alongside AE and CNN networks.
Acknowledgments
The authors would like to express their gratitude to the anonymous reviewers and editors for their helpful comments. This work was supported by the National Natural Science Foundation of China (61906050 and 21365008), the Guangxi Technology R&D Program (2018AD11018), and the Innovation Project of GUET Graduate Education (2021YCXS050).