Use of Autoencoders for Improving the Performance of Classification and Interpretation of Convolutional Neural Networks in Histopathological Images
Abstract
Breast cancer is one of the most common types of cancer and it is the leading cause of cancer death among women. If it is diagnosed early enough, the probability of curing the patient increases. Recently, the use of deep neural network techniques to aid pathologists in their diagnoses has become more common, but pathologists still do not fully trust these models because they lack interpretability. In light of this, this work investigates whether previously training the models as encoders can enhance their accuracy in both classification and interpretability. Three models were applied to the BreakHis and BreCaHAD datasets: NASNet Mobile, DenseNet201, and MobileNetV2. The experiments showed that all three models increased their classification performance and two models improved their interpretability with the proposed strategy. The DenseNet201 encoder performed almost 23% better than its vanilla version in classifying a tumor, and the NASNet Mobile encoder improved its tumor interpretation by 28.5%.
1. Introduction
Currently, cancer is one of the leading causes of death in the world.1 It is characterized by an uncontrolled growth of cells, usually due to some anomaly, that can spread aggressively to other tissues, resulting in the formation of tumors.2 In 2020, reports indicated that approximately 19.3 million cases of cancer were detected worldwide, and an analysis of 36 cancer types in 185 countries found that 11.7% of all cases were diagnosed as breast cancer.3,4 Breast cancer can occur in both sexes, being more common in women between 35 and 55 years of age.5 Thus, early detection of this disease can help prevent the metastatic phase and, consequently, patient mortality.
Malignant breast tumors are classified as in situ carcinoma, invasive ductal carcinoma, inflammatory cancer, and metastatic cancer.6 There are several ways to investigate the onset of the disease,7 among them magnetic resonance imaging, computed tomography, mammography, and ultrasound. However, the most effective (although more invasive) method is diagnosis through histopathological analysis, in which a sample of breast tissue is extracted for further cell analysis.8 Additionally, digital histopathological diagnosis requires an experienced pathologist and demands time to thoroughly analyze the collected tissue sample.
The development of Artificial Intelligence (AI), with its capacity to extract information,9 has played a pivotal role in classifying tumors from images of tissue samples, which has reduced diagnosis time and contributed to early treatment.10,11
Thus, this work conducted a comparative analysis of three convolutional neural network architectures, DenseNet, NASNet Mobile, and MobileNet, each undergoing two types of training: as encoders in an autoencoder architecture for subsequent classification, and purely as classifiers through transfer learning. With this, it was possible to identify whether this encoder training improved the models' tumor classification accuracy and interpretability.
2. Relevant Works
In this project, several works related to Deep Learning (DL) models were reviewed, as well as their training as encoders in autoencoder architectures.
Over the last few years, scientists have tried several techniques to improve the classification accuracy of models, particularly in the medical field.12 However, accuracy should not be the only factor taken into account.13 Within the error scenarios, it is important to observe the number of false positives detected (related to precision) and false negatives (related to sensitivity). Therefore, it is necessary to develop models that reduce the number of false negatives, in order to increase medical confidence.
A study14 proposed a modified version of Inception V3, a fine-tuned version of DenseNet121, and a convolutional autoencoder model with the aim of considering sensitivity and precision data. In contrast, the present work analyzes three architectures trained as autoencoders and investigates an improvement in the interpretability of the models, considering not only accuracy but also the ability of the models to “understand” what a tumor is. Recently, an autoencoder-based image reconstruction model was proposed15 to improve feature extraction, and another published paper16 compared four DL architectures and one transformer using classification accuracy and interpretability criteria (although those analyses showed that the MobileNetV2 network presented the best performance, further analyses could have been performed). In addition, other studies17,18 have used autoencoders to train a feature extractor for subsequent cancer classification and to reduce noise present in histopathological images of osteosarcoma so that model performance can be improved.
Also, to preserve information and improve the segmentation of mammography images, the use of deeper architectures is recommended.19 One study used models such as U-Net to improve the accuracy of breast cancer classification in radiographic images.20 In the present work, three of the architectures used in a previous study16 are employed with the same interpretability metrics, but pre-trained as autoencoders to investigate whether their performance improves. Using that interpretability metric, this work also applies the DenseNet architecture to histopathology images.
A few other studies have explored the use of convolutional neural networks to identify and recognize breast cancer cells present in pathology images.21,22 Besides that, another study13 proposed a convolutional autoencoder with a DenseNet architecture to reduce the complexity of the model and improve its ability to extract features from histopathological images. In this work, a similar approach was developed, but considering other architectures trained as encoders of an autoencoder architecture and serving as the convolutional section prior to a fully connected network.
Several articles have used autoencoder networks for histopathological image segmentation23 in order to extract relevant features from cancer cell images. Encoder–decoder models can also have different architectural configurations, and a specific pre-trained network can even be used as the encoder of the model.24 In this work, three different models were loaded and analyzed as previously mentioned, whether pre-trained or not (vanilla models).
The transfer learning technique is highly recommended in medical applications, since DL models require many images and the image annotation (labeling) process consumes a lot of time and money.25 The ImageNet dataset26 is the most widely used for pre-training networks to improve their performance in tasks such as classification, detection, and segmentation.27 However, a model trained with random weight initialization has presented performance very similar to one pre-trained on ImageNet.28 One possible reason for this is the different nature of the image types. Despite this, this work performed training in both ways (with random weights and with ImageNet weights), since features learned by the initial layers of ImageNet-pretrained models can be reused, saving training time.
Currently, there are still barriers to using Machine Learning (ML) in the medical field, especially because many specialists have difficulty understanding what the models classify. Thus, a survey29 was conducted to evaluate histopathology articles that utilize ML algorithms, along with articles that elucidate the behavior of the models and the metrics necessary to assess their interpretation. Although there are other alternatives for explaining model results,30 Grad-CAM has been shown to be a safe option for interpreting what a model is classifying, due to its reliability on histopathological images,29 and it is therefore utilized in this work. The survey also highlighted that there is no definitive metric for evaluating interpretability, although the intersection of regions is the most common in annotated datasets. This work, which uses two datasets, applies a way of automatically measuring interpretability based on the conditions provided by the annotated dataset.
3. Methodology
In this work, not only the behavior of DL models regarding classification after being trained as autoencoders was investigated, but also their ability to correctly interpret tumor characteristics in histopathological image datasets. Three state-of-the-art architectures were evaluated and compared in terms of classification accuracy and interpretability. The models were trained on a breast cancer histopathological dataset with two classes, featuring nuclei of malignant and benign tumors, and tested on another histopathological dataset containing only malignant tumors. The three models were selected due to their architectural simplicity, their widespread use, and their ability to address this particular challenge without adding unnecessary complexity.
To evaluate the interpretability of the models, the Grad-CAM technique was used on the annotated dataset. Then, it was observed how many annotations were within the regions pointed out by the models as having the highest incidence of tumors. In this context, a metric known as Interpretable Region Accuracy (IRA) was proposed to evaluate the interpretation of the images.16 Both this metric and the datasets are explained in Sec. 4 of this work. The approach of this work uses a pipeline that describes how the models are trained and validated on one dataset, and then how they are tested on another dataset for accuracy and interpretability metrics.
3.1. Approach
First, the models were trained as the encoders of autoencoder architectures for histopathological image reconstruction using the BreakHis dataset.31 After training, the encoder models were saved along with their latent representation space (LRS), the intermediate space that encodes and stores one or more representations of the dataset's characteristics, and a densely connected network was then added for classification training on the same BreakHis images. The adopted method is shown in Fig. 1.

Fig. 1. Training architecture as encoder.
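As a rough illustration of this two-stage procedure, the sketch below (TensorFlow/Keras, assuming 128×128 RGB inputs as described in Sec. 4.2) first wraps a backbone, here DenseNet201, as the encoder of a convolutional autoencoder trained with a reconstruction loss, and then reuses the trained encoder under a densely connected classification head. The decoder design, layer sizes, and function names are illustrative assumptions; the paper does not specify these details.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201

IMG_SHAPE = (128, 128, 3)  # images are resized to 128x128 (see Sec. 4.2)

def build_autoencoder(weights="imagenet"):
    """Stage 1: the backbone is the encoder of a convolutional autoencoder."""
    encoder = DenseNet201(include_top=False, weights=weights,
                          input_shape=IMG_SHAPE)
    x = encoder.output                      # 4x4 feature maps for 128x128 inputs
    # Illustrative upsampling decoder that reconstructs the input image.
    for filters in (256, 128, 64, 32, 16):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
    decoded = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    autoencoder = models.Model(encoder.input, decoded)
    # BCE reconstruction loss, since inputs are normalized to [0, 1] (Sec. 3.2).
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    return autoencoder, encoder

def build_classifier(trained_encoder):
    """Stage 2: reuse the trained encoder and add a densely connected head."""
    x = layers.GlobalAveragePooling2D()(trained_encoder.output)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)   # benign vs. malignant
    classifier = models.Model(trained_encoder.input, out)
    classifier.compile(optimizer="adam", loss="binary_crossentropy",
                       metrics=["accuracy"])
    return classifier

# autoencoder, encoder = build_autoencoder()
# autoencoder.fit(train_images, train_images, epochs=70)   # image reconstruction
# classifier = build_classifier(encoder)
# classifier.fit(train_images, train_labels, epochs=100)   # tumor classification
```

Passing `weights=None` instead of `"imagenet"` would correspond to the random-initialization scenario described later in Sec. 3.3.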
Next, the networks were tested on an annotated dataset, BreCaHAD.32 This dataset contains only images of malignant tumors, with point annotations identifying them. After testing each model, heatmaps were generated with Grad-CAM for each image correctly classified as a malignant tumor. These heatmaps, in turn, indicated the regions in the image that the model deemed relevant for the classification.
Using the Otsu method, a mask was generated from each heatmap, and the number of annotations present in each masked region was then counted. These annotations, which indicate the locations of tumors, have their coordinates described in a JSON file included with the dataset. Finally, the accuracy of the interpretable region was measured by checking how many points fell within the zones the model deemed relevant for its classification, indicating how well it was able to identify tumor characteristics. This procedure is described in Fig. 2.

Fig. 2. Interpretation of testing set.
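A minimal sketch of this counting step is shown below. It assumes the Otsu mask has already been computed at the original image resolution and that the BreCaHAD JSON stores normalized (x, y) point coordinates; the key name "tumor" and the helper name are our assumptions, not the paper's code.

```python
import json

def fraction_of_points_in_mask(mask, json_path, key="tumor"):
    """Fraction of annotated tumor points that fall inside the binary mask.

    `mask` is a 2-D boolean array (the Otsu-binarized Grad-CAM heatmap) at the
    resolution of the original image; the JSON is assumed to hold normalized
    (x, y) coordinates in the [0, 1] range under `key`.
    """
    height, width = mask.shape
    with open(json_path) as f:
        points = json.load(f).get(key, [])
    if not points:
        return None
    inside = sum(
        mask[min(int(p["y"] * height), height - 1),
             min(int(p["x"] * width), width - 1)]
        for p in points)
    return inside / len(points)
```

Averaging this fraction over all correctly classified test images yields the IRA metric described in Sec. 4.3.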
3.2. Autoencoders
Autoencoders33 are a particular type of neural network that encodes an input signal, through an encoder, into a compressed vector of meaningful representation and then decodes it back to its original format as faithfully as possible.34 Their goal can be understood as learning, in an “unsupervised” way, a simplified representation of the model's input data. In the medical imaging field, this type of network is widely used due to its ability to extract and learn relevant features from the original image in order to denoise it for further analysis.35
Both the encoder and the decoder can be convolutional neural networks,36 as this allows their layers to have various sizes. In a more general sense, the autoencoder can be seen as a generalization of Principal Component Analysis (PCA).37
To reduce the difference between the input images and the reconstructed output images during autoencoder training, two error functions are commonly used in the weight-update calculations38: the Mean Squared Error (MSE) and the Binary Cross-Entropy (BCE).
This paper used BCE, since the training images were normalized between 0 and 1. Its mathematical description is shown in Eq. (1), where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $N$ is the number of pixels:

$$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]. \tag{1}$$
3.3. Convolutional neural networks
In this study, the MobileNetV2, DenseNet201, and NASNet Mobile networks were used. It is important to note that, for training as encoders, both transfer learning with ImageNet39 weights and random weight initialization were used. For the classification and feature extraction process, in addition to the pre-trained encoder models, “vanilla” (or simply “pure”) models were also used, loaded with ImageNet weights via transfer learning. These networks were selected because they are considered state-of-the-art in the literature and have seen little or almost no use on the BreCaHAD dataset. Table 1 shows the main parameters used.
Table 1. Main training parameters used for each model.

| | NASNet Mobile | MobileNetV2 | DenseNet201 |
|---|---|---|---|
| Epochs (autoencoder) | 50 | 70 | 70 |
| Epochs (classifier) | 100 | 100 | 100 |
| Training samples | 2074 | 2074 | 2074 |
| Validation samples | 360 | 360 | 360 |
| Test samples | 159 | 159 | 159 |
| Learning rate | [0.001, 0.0001] | [0.001, 0.0001] | [0.001, 0.0001] |
| Optimizer | Adam, SGD | Adam, SGD | Adam, SGD |
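A hedged sketch of how these configurations could be instantiated in Keras is given below; the helper names are ours. Note that NASNet Mobile may constrain the input size when ImageNet weights are loaded, so its input shape may need to differ from that of the other two networks.

```python
from tensorflow.keras import optimizers
from tensorflow.keras.applications import DenseNet201, MobileNetV2, NASNetMobile

BACKBONES = {"NASNet Mobile": NASNetMobile,
             "MobileNetV2": MobileNetV2,
             "DenseNet201": DenseNet201}

def make_backbone(name, imagenet=True, input_shape=(128, 128, 3)):
    # weights=None reproduces the random-initialization scenario;
    # weights="imagenet" corresponds to transfer learning (Sec. 3.3).
    return BACKBONES[name](include_top=False,
                           weights="imagenet" if imagenet else None,
                           input_shape=input_shape)

def make_optimizer(kind="adam", learning_rate=1e-3):
    # Table 1 lists Adam and SGD with learning rates of 0.001 and 0.0001.
    if kind == "adam":
        return optimizers.Adam(learning_rate=learning_rate)
    return optimizers.SGD(learning_rate=learning_rate)
```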
3.4. MobileNetV2
This architecture was developed to improve the state of the art of convolutional networks applied to computer vision, reducing memory consumption while seeking to maintain the same accuracy.40 It is an evolution of the MobileNetV1 architecture.41 It consists, in short, of the addition of a new block of layers containing a residual connection with a linear “bottleneck”. Figure 3 shows the representation of the convolution block used.

Fig. 3. MobileNetV2 convolutional block.
3.5. DenseNet201
The DenseNet201 model42 appeared as an alternative to the problem of vanishing gradients in the information passed through convolutional network layers during the weight updates of backpropagation, given that these networks have become increasingly deep. Vanishing gradients can occur during the calculation of the gradient of the error function with respect to the network weights. As more layers are added to the architecture, this gradient tends to zero. This occurs because certain activation functions limit the output of the layers to values between 0 and 1 (sigmoid) or −1 and 1 (tanh), so that a large variation in the input of the activation function causes only a small variation in its output. Thus, the more layers use the sigmoid function, for example, the more the gradient decreases exponentially toward the initial layers. As a result, the weights of the initial layers are not properly updated, which can impair the model's ability to generalize.
The DenseNet201 also has a connectivity pattern in its architecture that allows all layers to be connected to each other (as long as they have equivalent feature map sizes), allowing maximum information to be preserved. Figure 4 visually illustrates how this architectural pattern occurs.

Fig. 4. DenseNet201 layer connections.
Another interesting aspect of this network is that it requires fewer parameters for training than other networks, because, due to the dense connections, there is no need to relearn redundant feature maps. It is named 201 because its architecture is 201 layers deep.
Similarly to MobileNetV2, it was trained as an encoder for image reconstruction and as a convolutional model for classification.
3.6. NASNet mobile
This network43 was developed based on the Neural Architecture Search (NAS) framework.44 However, applying this method is computationally costly, as it can produce complex models in the search for better accuracy. In general terms, the proposed network was obtained through a search over a space of different architectures and weights (the NASNet search space). However, instead of searching for the best complete architecture (the one with the highest accuracy), the search looked for the best convolution blocks.
Figure 5 shows how the NAS model generally works.

Fig. 5. NAS recursive training method.
Like the previous ones, the goal was to identify the best model that could be obtained with NASNet Mobile to be trained as an encoder for image reconstruction and, later, as a convolutional network for classification.
3.7. Grad-CAM
Grad-CAM45 is an interpretation technique for deep classification models that identifies the features relevant to the model's classification. It is computed from the gradients between the model's prediction target and the activation maps of the last layer of the convolutional model, in order to highlight the feature maps most relevant to that particular classification. In addition, it is one of the most recommended techniques for interpreting results on medical images.29
It is understood that Grad-CAM is a generalization of the CAM method46 — Class Activation Maps — and therefore is applicable to various types of convolutional networks.
To obtain the localization map of a class predicted by the model, the gradient of the predicted value of a certain class $b$, or $\gamma^b$, before applying the softmax, is computed with respect to the activation maps $\beta^k$ of a convolutional layer ($\partial\gamma^b/\partial\beta^k$). These gradients are then globally average-pooled to obtain the “importance”, or relevance, of each feature map with respect to class $b$ ($\alpha_k^b$), as shown in the following equation:

$$\alpha_k^b = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial\gamma^b}{\partial\beta_{ij}^k},$$

where $Z$ is the number of elements in the feature map and $(i, j)$ indexes its spatial positions.

Fig. 6. Activation maps combination result.
The previously generated activation map is then upsampled to cover the entire image, indicating the level of relevance of each pixel to the predicted class, as observed in the heat map of Fig. 7. From this result, a mask is computed using the Otsu method and then used to calculate the proposed IRA metric.

Fig. 7. Applied heatmap generated by Grad-CAM.
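The sketch below follows the widely used tf.GradientTape recipe for Grad-CAM applied to a binary (sigmoid) classifier; the choice of the last convolutional layer name and the use of the sigmoid output as the class score are assumptions rather than details taken from the paper.

```python
import numpy as np
import tensorflow as tf

def grad_cam_heatmap(model, image, last_conv_layer_name):
    """Normalized Grad-CAM heatmap for one image (H x W x 3, float in [0, 1])."""
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(image[np.newaxis, ...])
        score = prediction[:, 0]                       # gamma_b for the malignant class
    grads = tape.gradient(score, conv_maps)            # d(gamma_b) / d(beta_k)
    alphas = tf.reduce_mean(grads, axis=(0, 1, 2))     # global average pooling -> alpha_k^b
    heatmap = tf.reduce_sum(conv_maps[0] * alphas, axis=-1)
    heatmap = tf.nn.relu(heatmap)                      # keep only positive influence
    return (heatmap / (tf.reduce_max(heatmap) + 1e-8)).numpy()
```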
3.8. The Otsu method
Each heat map generated in the cases where the model correctly classified the tumor as malignant contains information about the features that the model considered relevant for the classification. Since the annotated BreCaHAD dataset only provides the Cartesian coordinates of the tumor locations in each image, a binarization technique was applied to the heat maps to generate a mask and identify how many points fell within that region. The technique used was the Otsu method. This method is widely used in medical image processing47 and benefits from its simplicity and speed (it automatically determines the threshold value that separates background and foreground regions). It consists of identifying a threshold value ($\alpha$) that minimizes the intra-class variance of an image whose histogram can be approximated by two Gaussian functions. This can be algebraically represented by the following equation:

$$\sigma_w^2(\alpha) = \omega_0(\alpha)\,\sigma_0^2(\alpha) + \omega_1(\alpha)\,\sigma_1^2(\alpha),$$

where $\omega_0$ and $\omega_1$ are the probabilities of the two pixel classes separated by the threshold $\alpha$, and $\sigma_0^2$ and $\sigma_1^2$ are their variances.
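A minimal sketch of this binarization step with OpenCV is shown below; cv2.threshold with the THRESH_OTSU flag selects the threshold automatically. Resizing the heatmap to the original image size before thresholding is our assumption about the order of operations.

```python
import cv2
import numpy as np

def otsu_mask(heatmap, image_width, image_height):
    """Binarize a Grad-CAM heatmap (values in [0, 1]) into a relevance mask."""
    resized = cv2.resize(heatmap, (image_width, image_height))
    heatmap_u8 = np.uint8(255 * resized)
    # Otsu's method picks the threshold that minimizes intra-class variance.
    _, mask = cv2.threshold(heatmap_u8, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask > 0
```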
4. Experiments and Results
The experiments conducted in this work aimed to investigate, through comparison, the effects of pre-training each model as the encoder of an autoencoder on its classification and interpretation accuracies.
4.1. Computational resources
The models were trained on an NVIDIA GeForce GTX 1650 GPU. TensorFlow 2.7 was used, and the experiments were run in Jupyter Notebook.
During training, hyperparameters were adjusted to avoid high memory consumption. A single round of experiments with the models took almost 7 h. Such experiments would take even longer on the free version of Google Colab, whose GPU usage is limited; therefore, the experiments were conducted locally. The use of autoencoders can require a lot of time and computational cost.48 Overtraining can compromise the robustness of the model, while insufficient training can prevent the models from achieving high performance.
4.2. Datasets
In this study, two datasets of histopathological images were used: BreakHis, used for model training, and the annotated BreCaHAD dataset, used to measure classification accuracy and how well each model interprets tumor characteristics. Both datasets expose only histopathological images, without revealing any confidential patient information. This approach of training and testing on distinct datasets helps to reveal possible overfitting behavior of the models; if a single dataset were used, such a problem could be difficult to investigate.
BreakHis contains 9109 images of breast tissue with tumors. The images are in PNG format, with a size of 700×460 pixels and three color channels (RGB). The dataset covers eight specific tumor types, but it was divided into two classes: malignant and benign. It also provides different magnification factors of the tissue regions: 40×, 100×, 200×, and 400×. In this work, only the 400× magnification was used, resulting in a final set of 1820 images (588 benign and 1232 malignant). The training/validation split was 80/20.
Due to the large class imbalance (approximately 2:1 malignant to benign), a data augmentation process was performed on the training set.49,50,51 Transformations such as rotation and mirroring of the images were applied.
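A sketch of such an augmentation setup with Keras' ImageDataGenerator is shown below; the specific parameter values and the directory layout are illustrative assumptions, since the paper only mentions rotation and mirroring.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation and mirroring applied only to the training split; values are illustrative.
train_augmenter = ImageDataGenerator(rescale=1.0 / 255,
                                     rotation_range=90,
                                     horizontal_flip=True,
                                     vertical_flip=True)

train_generator = train_augmenter.flow_from_directory(
    "breakhis/train",      # hypothetical directory with benign/ and malignant/ subfolders
    target_size=(128, 128),
    class_mode="binary",
    batch_size=32)
```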
The BreCaHAD dataset has 162 histopathological images of malignant tumors (three were removed: two duplicates and one without an annotation indicating a malignant tumor). These images are in TIFF format, have dimensions of 1360×1024, and also have three color channels (RGB). The magnification factor is 400×. The dataset provides annotations with the locations of the tumors in Cartesian coordinates (JSON format), indicated by experts in the field of medical pathology. Figure 8 shows an annotated image; the blue points represent the locations of malignant tumors. In this work, BreCaHAD was used as the test set.

Fig. 8. Annotated images from BreCaHAD.
In this project, there was no prior feature selection; after a resizing process, the original 700×460 image (with its three RGB color channels) was fed into the training networks at a size of 128×128, due to the computational limitations of the machine used in the study.
4.3. Metrics
The interpretation method to be used depends on the context of the problem and is subjectively evaluated by the researcher.29 Along these lines, a previous study16 proposed counting the number of annotations present in the mask extracted, via the Otsu method, from the heatmap generated by Grad-CAM for the BreCaHAD dataset.
The identification of tumors within the region delimited by the mask was done using the JSON file containing the coordinates of the points that represent the tumors. For each image, the percentage of points (tumor cells) present in the region considered “relevant” for the model's classification was computed, and the average of these percentage values was then taken as the IRA.
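Formalizing that description (the notation below is ours, not taken from the original papers), the metric can be written as:

```latex
\mathrm{IRA} = \frac{1}{M} \sum_{j=1}^{M}
  \frac{\lvert \{\, p \in P_j : p \in R_j \,\} \rvert}{\lvert P_j \rvert},
```

where $P_j$ is the set of annotated tumor points of the $j$-th correctly classified test image, $R_j$ is the relevant region given by the Otsu mask of its Grad-CAM heatmap, and $M$ is the number of such images.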
Thus, in the experiments, we sought to answer which model had the better ability to classify and interpret tumor regions, and also to observe whether pre-training the models as encoders favors the extraction of relevant features.
It is important to highlight that the training of each architecture in the classification phase was carried out in three versions: (1) with the encoder-training weights initialized randomly, (2) with the encoder-training weights initialized from ImageNet via transfer learning, and (3) without autoencoder training but with ImageNet weights (in the context of this work, this scenario is referred to as the Vanilla model).
4.4. Results
Table 2 displays four types of information about each model, with their respective results: the classification accuracy, the mean IRA, the median percentage of tumors identified by each model (Interpretable Region Median, IRM), and the standard deviation of the Interpretable Region Accuracy (SD-IRA).
Table 2. Classification accuracy and interpretability metrics (IRA, SD-IRA, IRM) for each model and training scenario.

| CNN Model | Training | Loaded Weights | Classification Accuracy | IRA | SD-IRA | IRM |
|---|---|---|---|---|---|---|
| NASNet Mobile | Vanilla | ImageNet | 59.12% | 56.47% | 21.64% | 59.68% |
| NASNet Mobile | Encoder | ImageNet | 96.23% | 72.62% | 14.51% | 75.26% |
| NASNet Mobile | Encoder | Random | 84.28% | 47.71% | 23.70% | 43.57% |
| DenseNet201 | Vanilla | ImageNet | 79.87% | 36.37% | 23.42% | 31.93% |
| DenseNet201 | Encoder | ImageNet | 98.11% | 45.06% | 28.80% | 39.24% |
| DenseNet201 | Encoder | Random | 91.82% | 45.40% | 30.11% | 36.11% |
| MobileNetV2 | Vanilla | ImageNet | 72.33% | 58.48% | 20.78% | 57.89% |
| MobileNetV2 | Encoder | ImageNet | 76.73% | 48.53% | 20.25% | 47.10% |
| MobileNetV2 | Encoder | Random | 56.60% | 22.15% | 9.03% | 22.50% |
Comparing the results, it can be observed that training the architectures as encoders increased their classification accuracy. There was also an improvement in interpretation when the networks were not only pre-trained as encoders but also initialized with the ImageNet weights.
Although the Vanilla MobileNetV2 architecture interpreted what a tumor is almost 20% better (Region Accuracy), for the other networks the (pre-trained) encoders performed better. For MobileNetV2, pre-training also increased its classification accuracy, but this was not accompanied by an improvement in interpretation, indicating that the model may not be understanding well what a tumor is.
Also, DenseNet201 pre-trained as an encoder with ImageNet weights showed the best classification result, although only about 2% higher than NASNet Mobile, and it performed almost 40% worse than NASNet Mobile when trying to infer what represents a tumor, with a result similar to that of its version pre-trained as an encoder with random weights. It was likely “seeing” information in the image that did not represent a tumor and classifying it as if it did.
Finally, the NASNet Mobile network in its pre-trained encoder version presented the best overall performance among the networks, improving by almost 62% in classification accuracy and 28% in the ability to correctly interpret a tumor relative to its Vanilla version. This could mean that its LRS was actually learning to compress the input information, forcing the model to learn the most relevant features of the image.
The performance improvement of both NASNet Mobile and DenseNet201 is associated with their pre-training as autoencoders, which could indicate that these networks were learning weights that better describe the histopathological features of the images. The same did not occur with MobileNetV2, suggesting that future investigations could help to understand how the feature extraction of this network behaves.
In Fig. 9, the result of the best model for an image from the BreCaHAD dataset can be seen. The model highlights much of the image, which may assist a pathologist in searching for cancerous cells. For the same image, the Vanilla version of NASNet produced a misclassification.

Fig. 9. NASNet generated maps.
When the previous results were compared to those of similar networks,16 it could be noted that only the ImageNet pre-trained DenseNet201 showed improvement in both Accuracy and Region Accuracy (0.6% and 180% more, respectively). Figure 10 shows the performance of DenseNet201 visually.

Fig. 10. DenseNet201 generated maps.

Fig. 11. MobileNetV2 generated maps.
The MobileNetV2 in this work obtained better performance in Region Accuracy (16.8% and 40.8% more when pre-trained with ImageNet and when trained in Vanilla mode, respectively). However, MobileNetV2 showed low performance compared to the others in classification accuracy (approximately 12%) when pre-trained as an autoencoder with ImageNet weights. This may have occurred due to probable overfitting of the model on the first dataset. Finally, NASNet Mobile performed better than its Large version in both Accuracy and Region Accuracy when pre-trained as an encoder with ImageNet weights (15% and 742%, respectively).
5. Conclusion
In this work, we evaluated how pre-training the models MobileNetV2, DenseNet201, and NASNet Mobile could impact their ability to classify and interpret breast cancer tumors in the BreCaHAD dataset (even with few epochs for autoencoder training: 50–70). As observed, all three networks improved their classification accuracy, and NASNet and DenseNet201 improved their IRA.
As observed, the NASNet Mobile network pre-trained as an encoder with ImageNet weights showed a 62% improvement in classification accuracy, a 28% improvement in IRA, and a rightward shift of the median interpretable-region percentage (from 59.68% to 75.26%). That is, it not only classified tumor images well but also better interpreted what a tumor is in the image passed to the model, which can be explained by its pre-training as an autoencoder. The DenseNet201 network pre-trained as an encoder with ImageNet weights showed a nearly 23% improvement in classification accuracy, a 25% improvement in IRA, and a rightward shift of the median interpretable-region percentage (from 31.93% to 39.24%). The MobileNetV2 network only showed an improvement in its classification accuracy (5%).
It is known that it takes a long time for a pathologist to accurately and confidently identify a tumor in a tissue sample. There are many challenges in the field of diagnosis,52 despite many advances having been made and new technologies being welcomed. In cases of pregnancy, the detection scenario becomes even more delicate.53
Applying deep networks that perform well in classifying and in interpreting the results of their classification, providing clear indications of those results, would streamline the clinical diagnostic process and also increase the trust of pathologists in adopting these techniques.
This paper presented a few limitations: low computational processing capacity (due to the complexity of the architectures and the high resolution of the images, some models, such as MobileNet, had to be loaded with only 50% of the original weights, and the images had their resolutions reduced to fit the input of the networks); high training time, since, due to the high computational cost, even at low resolutions the autoencoder architectures could only be trained for a few epochs (which still consumed several hours of training); and, finally, a small number of test samples, as it is still very difficult to find histopathological datasets with reliable annotations from specialists. Due to the nature of the problem, it may be more interesting to use datasets annotated by regions rather than by coordinates.
As suggestions for future work, we recommend the use of other histopathological databases, such as KIMIA Pathb; the use of generative models to improve training54,55 as well as other robust networks11,56 and other model interpretability techniques;57 and, finally, partnerships with hospitals and specialist teams for data collection.
ORCID
Daniel C. Macedo https://orcid.org/0009-0002-0236-3836
Fernando M. de Paula Neto https://orcid.org/0000-0003-4264-1124
Tasso L. O. Moraes https://orcid.org/0009-0005-0224-9621
Vinicius D. Santos https://orcid.org/0009-0008-3169-999X
John W. S. de Lima https://orcid.org/0000-0003-1606-6517
Notes
a https://d2l.ai/chapter_multilayer-perceptrons/mlp.html#relu-function.