Use of Autoencoders for Improving the Performance of Classification and Interpretation of Convolutional Neural Networks in Histopathological Images
Abstract
Breast cancer is one of the most common types of cancer and it is the leading cause of cancer death among women. If it is diagnosed early enough, the probability of curing the patient increases. Recently, the use of deep neural network techniques to aid pathologists in their diagnoses has become more common, but pathologists still do not fully trust these models because they lack interpretability. In light of this, this work investigates whether previously training the models as encoders can enhance their accuracy in both classification and interpretability. Three models were applied to the BreakHis and BreCaHAD datasets: NASNet Mobile, DenseNet201, and MobileNetV2. The experiments showed that all three models increased their classification performance and two models improved their interpretability with the proposed strategy. The DenseNet201 encoder performed almost 23% better than its vanilla version in classifying a tumor, and the NASNet Mobile encoder improved its tumor interpretation by 28.5%.
1. Introduction
Currently, cancer is one of the leading causes of death in the world.1 It is characterized by an uncontrolled growth of cells, usually due to some anomaly, that can spread aggressively to other tissues, resulting in the formation of tumors.2 In 2020, reports indicated that approximately 19.3 million cases of cancer were detected worldwide, and an analysis of 36 cancer types in 185 countries found that 11.7% of all cases were diagnosed as breast cancer.3,4 Breast cancer can occur in both sexes, being more common in women between 35 and 55 years of age.5 Thus, early detection of this disease can help prevent the metastatic phase and, consequently, patient mortality.
Malignant breast tumors are classified as in situ carcinoma, invasive ductal carcinoma, inflammatory cancer, and metastatic cancer.6 There are several ways to investigate the onset of the disease,7 among them magnetic resonance imaging, computed tomography, mammography, and ultrasound. However, the most effective (although more invasive) method is diagnosis through histopathological analysis, in which a sample of breast tissue is extracted for further cell analysis.8 Additionally, digital histopathological diagnosis requires an experienced pathologist and demands time to thoroughly analyze the collected tissue sample.
The development of Artificial Intelligence (AI), with its capacity to extract information,9 has played a pivotal role in classifying tumors from images of tissue samples, which has reduced diagnosis time and contributed to early treatment.10,11
Thus, this work conducted a comparative analysis of three convolutional neural network architectures, DenseNet, NASNet Mobile, and MobileNet, each undergoing two types of training: as encoders in an autoencoder architecture for subsequent classification, and purely as classifiers through transfer learning. With this, it was possible to identify whether this encoder training improved the models' tumor classification accuracy and interpretability.
2. Relevant Works
In this project, several works related to Deep Learning (DL) models were reviewed, as well as their training as encoders in autoencoder architectures.
Over the last few years, scientists have tried several techniques to improve the classification accuracy of models, particularly in the medical field.12 However, accuracy should not be the only factor taken into account.13 Within the error scenarios, it is important to observe the number of false positives detected (related to precision) and false negatives (related to sensitivity). Therefore, it is necessary to develop models that reduce the number of false negatives, in order to increase medical confidence.
A study14 proposed a modified version of Inception V3, a fine-tuned version of DenseNet121, and a convolutional autoencoder model with the aim of considering sensitivity and precision data. In contrast, the present work analyzes three architectures trained as autoencoders and investigates an improvement in the interpretability of the models, considering not only accuracy but also the ability of the models to “understand” what a tumor is. Recently, an autoencoder-based image reconstruction model was proposed15 to improve feature extraction, and another published paper16 compared four DL architectures and one transformer using classification accuracy and interpretability criteria (although those analyses showed that the MobileNetV2 network presented the best performance, further analyses could have been performed). In addition, other studies17,18 have used autoencoders to train a feature extractor for subsequent cancer classification and to reduce noise present in histopathological images of osteosarcoma so that model performance can be improved.
Also, to preserve information and improve the segmentation of mammography images, the use of deeper architectures is recommended.19 One study used models such as U-Net to improve the accuracy of breast cancer classification in radiographic images.20 In the present work, three of the architectures used in a previous study16 are employed with the same interpretability metrics, but pre-trained as autoencoders to investigate whether their performance improves. Using that interpretability metric, this work also applies the DenseNet architecture to histopathology images.
A few other studies have explored the use of convolutional neural networks to identify and recognize breast cancer cells present in pathology images.21,22 Besides that, another study13 proposed a convolutional autoencoder with a DenseNet architecture to reduce the complexity of the model and improve its ability to extract features from histopathological images. In this work, a similar approach was developed, but considering other architectures trained as encoders of an autoencoder architecture and serving as the convolutional section prior to a fully connected network.
Several articles have used autoencoder networks for histopathological image segmentation23 in order to extract relevant features from cancer cell images. Encoder–decoder models can also have different architectural configurations, and a specific pre-trained network can even be used as the encoder of the model.24 In this work, three different models were loaded and analyzed as previously mentioned, whether pre-trained or not (vanilla models).
The transfer learning technique is highly recommended in medical applications, since DL models require many images and the image annotation (labeling) process consumes a lot of time and money.25 The ImageNet dataset26 is the most widely used for pre-training networks to improve their performance in tasks such as classification, detection, and segmentation.27 However, a model trained with random weight initialization has presented performance very similar to one pre-trained on ImageNet.28 One possible reason for this is the different nature of the image types. Despite this, this work performed training in both ways (with random weights and with ImageNet weights), since features learned by the initial layers of ImageNet-pretrained models can be reused, saving training time.
Currently, there are still barriers to using Machine Learning (ML) in the medical field, especially because many specialists have difficulty understanding what the models classify. Thus, a survey29 was conducted to evaluate histopathology articles that utilize ML algorithms, along with articles that elucidate the behavior of the models and the metrics necessary to assess their interpretation. Although there are other alternatives for explaining model results,30 Grad-CAM has been shown to be a safe option for interpreting what a model is classifying, due to its reliability on histopathological images,29 and it is therefore utilized in this work. The survey also highlighted that there is no definitive metric for evaluating interpretability, although the intersection of regions is the most common in annotated datasets. This work, which uses two datasets, applies a way of automatically measuring interpretability based on the conditions provided by the annotated dataset.
3. Methodology
In this work, not only the behavior of DL models regarding classification after being trained as autoencoders was investigated, but also their ability to correctly interpret tumor characteristics in histopathological image datasets. Three state-of-the-art architectures were evaluated and compared in terms of classification accuracy and interpretability. The models were trained on a breast cancer histopathological dataset with two classes, featuring nuclei of malignant and benign tumors, and tested on another histopathological dataset containing only malignant tumors. The three models were selected due to their architectural simplicity, their widespread use, and their ability to address this particular challenge without adding unnecessary complexity.
To evaluate the interpretability of the models, the Grad-CAM technique was used on the annotated dataset. Then, it was observed how many annotations were within the regions pointed out by the models as having the highest incidence of tumors. In this context, a metric known as Interpretable Region Accuracy (IRA) was proposed to evaluate the interpretation of the images.16 Both this metric and the datasets are explained in Sec. 4 of this work. The approach of this work uses a pipeline that describes how the models are trained and validated on one dataset, and then how they are tested on another dataset for accuracy and interpretability metrics.
3.1. Approach
First, the models were trained as the encoders of autoencoder architectures for histopathological image reconstruction using the BreakHis dataset.31 After training, the encoder models were saved along with their latent representation space (LRS), the intermediate space that encodes and stores one or more representations of the dataset's characteristics, and a densely connected network was then added for classification training on the same BreakHis images. The adopted method is shown in Fig. 1.

Fig. 1. Training architecture as encoder.
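As a rough illustration of this two-stage procedure, the sketch below (TensorFlow/Keras, assuming 128×128 RGB inputs as described in Sec. 4.2) first wraps a backbone, here DenseNet201, as the encoder of a convolutional autoencoder trained with a reconstruction loss, and then reuses the trained encoder under a densely connected classification head. The decoder design, layer sizes, and function names are illustrative assumptions; the paper does not specify these details.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201

IMG_SHAPE = (128, 128, 3)  # images are resized to 128x128 (see Sec. 4.2)

def build_autoencoder(weights="imagenet"):
    """Stage 1: the backbone is the encoder of a convolutional autoencoder."""
    encoder = DenseNet201(include_top=False, weights=weights,
                          input_shape=IMG_SHAPE)
    x = encoder.output                      # 4x4 feature maps for 128x128 inputs
    # Illustrative upsampling decoder that reconstructs the input image.
    for filters in (256, 128, 64, 32, 16):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
    decoded = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    autoencoder = models.Model(encoder.input, decoded)
    # BCE reconstruction loss, since inputs are normalized to [0, 1] (Sec. 3.2).
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    return autoencoder, encoder

def build_classifier(trained_encoder):
    """Stage 2: reuse the trained encoder and add a densely connected head."""
    x = layers.GlobalAveragePooling2D()(trained_encoder.output)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)   # benign vs. malignant
    classifier = models.Model(trained_encoder.input, out)
    classifier.compile(optimizer="adam", loss="binary_crossentropy",
                       metrics=["accuracy"])
    return classifier

# autoencoder, encoder = build_autoencoder()
# autoencoder.fit(train_images, train_images, epochs=70)   # image reconstruction
# classifier = build_classifier(encoder)
# classifier.fit(train_images, train_labels, epochs=100)   # tumor classification
```

Passing `weights=None` instead of `"imagenet"` would correspond to the random-initialization scenario described later in Sec. 3.3.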
Next, the networks were tested on an annotated dataset, BreCaHAD.32 This dataset contains only images of malignant tumors, with point annotations identifying them. After testing each model, heatmaps were generated with Grad-CAM for each image correctly classified as a malignant tumor. These heatmaps, in turn, indicated the regions in the image that the model deemed relevant for the classification.
Using the Otsu method, a mask was generated from each heatmap, and the number of annotations present in each masked region was then counted. These annotations, which indicate the locations of tumors, have their coordinates described in a JSON file included with the dataset. Finally, the accuracy of the interpretable region was measured by checking how many points fell within the zones the model deemed relevant for its classification, indicating how well it was able to identify tumor characteristics. This procedure is described in Fig. 2.

Fig. 2. Interpretation of testing set.
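A minimal sketch of this counting step is shown below. It assumes the Otsu mask has already been computed at the original image resolution and that the BreCaHAD JSON stores normalized (x, y) point coordinates; the key name "tumor" and the helper name are our assumptions, not the paper's code.

```python
import json

def fraction_of_points_in_mask(mask, json_path, key="tumor"):
    """Fraction of annotated tumor points that fall inside the binary mask.

    `mask` is a 2-D boolean array (the Otsu-binarized Grad-CAM heatmap) at the
    resolution of the original image; the JSON is assumed to hold normalized
    (x, y) coordinates in the [0, 1] range under `key`.
    """
    height, width = mask.shape
    with open(json_path) as f:
        points = json.load(f).get(key, [])
    if not points:
        return None
    inside = sum(
        mask[min(int(p["y"] * height), height - 1),
             min(int(p["x"] * width), width - 1)]
        for p in points)
    return inside / len(points)
```

Averaging this fraction over all correctly classified test images yields the IRA metric described in Sec. 4.3.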
3.2. Autoencoders
Autoencoders33 are a particular type of neural network that encodes an input signal, through an encoder, into a compressed vector of meaningful representation and then decodes it back to its original format as faithfully as possible.34 Their goal can be understood as learning, in an “unsupervised” way, a simplified representation of the model's input data. In the medical imaging field, this type of network is widely used due to its ability to extract and learn relevant features from the original image in order to denoise it for further analysis.35
Both the encoder and the decoder can be convolutional neural networks,36 as this allows their layers to have various sizes. In a more general sense, the autoencoder can be seen as a generalization of Principal Component Analysis (PCA).37
To reduce the difference between the input images and the reconstructed output images during autoencoder training, two error functions are commonly used in the weight-update calculations38: the Mean Squared Error (MSE) and the Binary Cross-Entropy (BCE).
This paper used BCE, since the training images were normalized between 0 and 1. Its mathematical description is shown in Eq. (1), where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $N$ is the number of pixels:

$$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]. \tag{1}$$
3.3. Convolutional neural networks
In this study, the MobileNetV2, DenseNet201, and NASNet Mobile networks were used. It is important to note that, for training as encoders, both transfer learning with ImageNet39 weights and random weight initialization were used. For the classification and feature extraction process, in addition to the pre-trained encoder models, “vanilla” (or simply “pure”) models were also used, loaded with ImageNet weights via transfer learning. These networks were selected because they are considered state-of-the-art in the literature and have seen little or almost no use on the BreCaHAD dataset. Table 1 shows the main parameters used.
Table 1. Main training parameters used for each model.

| | NASNet Mobile | MobileNetV2 | DenseNet201 |
|---|---|---|---|
| Epochs (autoencoder) | 50 | 70 | 70 |
| Epochs (classifier) | 100 | 100 | 100 |
| Training samples | 2074 | 2074 | 2074 |
| Validation samples | 360 | 360 | 360 |
| Test samples | 159 | 159 | 159 |
| Learning rate | [0.001, 0.0001] | [0.001, 0.0001] | [0.001, 0.0001] |
| Optimizer | Adam, SGD | Adam, SGD | Adam, SGD |
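A hedged sketch of how these configurations could be instantiated in Keras is given below; the helper names are ours. Note that NASNet Mobile may constrain the input size when ImageNet weights are loaded, so its input shape may need to differ from that of the other two networks.

```python
from tensorflow.keras import optimizers
from tensorflow.keras.applications import DenseNet201, MobileNetV2, NASNetMobile

BACKBONES = {"NASNet Mobile": NASNetMobile,
             "MobileNetV2": MobileNetV2,
             "DenseNet201": DenseNet201}

def make_backbone(name, imagenet=True, input_shape=(128, 128, 3)):
    # weights=None reproduces the random-initialization scenario;
    # weights="imagenet" corresponds to transfer learning (Sec. 3.3).
    return BACKBONES[name](include_top=False,
                           weights="imagenet" if imagenet else None,
                           input_shape=input_shape)

def make_optimizer(kind="adam", learning_rate=1e-3):
    # Table 1 lists Adam and SGD with learning rates of 0.001 and 0.0001.
    if kind == "adam":
        return optimizers.Adam(learning_rate=learning_rate)
    return optimizers.SGD(learning_rate=learning_rate)
```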
3.4. MobileNetV2
This architecture was developed to improve the state of the art of convolutional networks applied to computer vision, reducing memory consumption while seeking to maintain the same accuracy.40 It is an evolution of the MobileNetV1 architecture.41 It consists, in short, of the addition of a new block of layers containing a residual connection with a linear “bottleneck”. Figure 3 shows the representation of the convolution block used.

Fig. 3. MobileNetV2 convolutional block.
3.5. DenseNet201
The DenseNet201 model42 appeared as an alternative to the problem of vanishing gradients in the information passed through convolutional network layers during the weight updates of backpropagation, given that these networks have become increasingly deep. Vanishing gradients can occur during the calculation of the gradient of the error function with respect to the network weights. As more layers are added to the architecture, this gradient tends to zero. This occurs because certain activation functions limit the output of the layers to values between 0 and 1 (sigmoid) or −1 and 1 (tanh), so that a large variation in the input of the activation function causes only a small variation in its output. Thus, the more layers use the sigmoid function, for example, the more the gradient decreases exponentially toward the initial layers. As a result, the weights of the initial layers are not properly updated, which can impair the model's ability to generalize.
The DenseNet201 also has a connectivity pattern in its architecture that allows all layers to be connected to each other (as long as they have equivalent feature map sizes), allowing maximum information to be preserved. Figure 4 visually illustrates how this architectural pattern occurs.

Fig. 4. DenseNet201 layer connections.
Another interesting aspect of this network is that it requires fewer parameters for training than other networks, because, due to the dense connections, there is no need to relearn redundant feature maps. It is named 201 because its architecture is 201 layers deep.
Similarly to MobileNetV2, it was trained as an encoder for image reconstruction and as a convolutional model for classification.
3.6. NASNet mobile
This network43 was developed based on the Neural Architecture Search (NAS) framework.44 However, applying this method is computationally costly, as it can produce complex models in the search for better accuracy. In general terms, the proposed network was obtained through a search over a space of different architectures and weights (the NASNet search space). However, instead of searching for the best complete architecture (the one with the highest accuracy), the search looked for the best convolution blocks.
Figure 5 shows how the NAS model generally works.

Fig. 5. NAS recursive training method.
Like the previous ones, the goal was to identify the best model that could be obtained with NASNet Mobile to be trained as an encoder for image reconstruction and, later, as a convolutional network for classification.
3.7. Grad-CAM
Grad-CAM45 is an interpretation technique for deep classification models that identifies the features relevant to the model's classification. It is computed from the gradients between the model's prediction target and the activation maps of the last layer of the convolutional model, in order to highlight the feature maps most relevant to that particular classification. In addition, it is one of the most recommended techniques for interpreting results on medical images.29
It is understood that Grad-CAM is a generalization of the CAM method46 — Class Activation Maps — and therefore is applicable to various types of convolutional networks.
To obtain the localization map of a class predicted by the model, the gradient of the predicted value of a certain class $b$, or $\gamma^b$, before applying the softmax, is computed with respect to the activation maps $\beta^k$ of a convolutional layer ($\partial\gamma^b/\partial\beta^k$). These gradients are then globally average-pooled to obtain the “importance”, or relevance, of each feature map with respect to class $b$ ($\alpha_k^b$), as shown in the following equation:

$$\alpha_k^b = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial\gamma^b}{\partial\beta_{ij}^k},$$

where $Z$ is the number of elements in the feature map and $(i, j)$ indexes its spatial positions.

Fig. 6. Activation maps combination result.
The previously generated activation map is then upsampled to cover the entire image, indicating the level of relevance of each pixel to the predicted class, as observed in the heat map of Fig. 7. From this result, a mask is computed using the Otsu method and then used to calculate the proposed IRA metric.

Fig. 7. Applied heatmap generated by Grad-CAM.
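The sketch below follows the widely used tf.GradientTape recipe for Grad-CAM applied to a binary (sigmoid) classifier; the choice of the last convolutional layer name and the use of the sigmoid output as the class score are assumptions rather than details taken from the paper.

```python
import numpy as np
import tensorflow as tf

def grad_cam_heatmap(model, image, last_conv_layer_name):
    """Normalized Grad-CAM heatmap for one image (H x W x 3, float in [0, 1])."""
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(image[np.newaxis, ...])
        score = prediction[:, 0]                       # gamma_b for the malignant class
    grads = tape.gradient(score, conv_maps)            # d(gamma_b) / d(beta_k)
    alphas = tf.reduce_mean(grads, axis=(0, 1, 2))     # global average pooling -> alpha_k^b
    heatmap = tf.reduce_sum(conv_maps[0] * alphas, axis=-1)
    heatmap = tf.nn.relu(heatmap)                      # keep only positive influence
    return (heatmap / (tf.reduce_max(heatmap) + 1e-8)).numpy()
```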
3.8. The Otsu method
Each heat map generated in the cases where the model correctly classified the tumor as malignant contains information about the features that the model considered relevant for the classification. Since the annotated BreCaHAD dataset only provides the Cartesian coordinates of the tumor locations in each image, a binarization technique was applied to the heat maps to generate a mask and identify how many points fell within that region. The technique used was the Otsu method. This method is widely used in medical image processing47 and benefits from its simplicity and speed (it automatically determines the threshold value that separates background and foreground regions). It consists of identifying a threshold value ($\alpha$) that minimizes the intra-class variance of an image whose histogram can be approximated by two Gaussian functions. This can be algebraically represented by the following equation:

$$\sigma_w^2(\alpha) = \omega_0(\alpha)\,\sigma_0^2(\alpha) + \omega_1(\alpha)\,\sigma_1^2(\alpha),$$

where $\omega_0$ and $\omega_1$ are the probabilities of the two pixel classes separated by the threshold $\alpha$, and $\sigma_0^2$ and $\sigma_1^2$ are their variances.
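A minimal sketch of this binarization step with OpenCV is shown below; cv2.threshold with the THRESH_OTSU flag selects the threshold automatically. Resizing the heatmap to the original image size before thresholding is our assumption about the order of operations.

```python
import cv2
import numpy as np

def otsu_mask(heatmap, image_width, image_height):
    """Binarize a Grad-CAM heatmap (values in [0, 1]) into a relevance mask."""
    resized = cv2.resize(heatmap, (image_width, image_height))
    heatmap_u8 = np.uint8(255 * resized)
    # Otsu's method picks the threshold that minimizes intra-class variance.
    _, mask = cv2.threshold(heatmap_u8, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask > 0
```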
4. Experiments and Results
The experiments conducted in this work aimed to investigate, through comparison, the effects of pre-training each model as the encoder of an autoencoder on its classification and interpretation accuracies.
4.1. Computational resources
The models were trained on an NVIDIA GeForce GTX 1650 GPU. TensorFlow 2.7 was used, and the experiments were run in Jupyter Notebook.
During training, hyperparameters were adjusted to avoid high memory consumption. A single round of experiments with the models took almost 7 h. Such experiments would take even longer on the free version of Google Colab, whose GPU usage is limited; therefore, the experiments were conducted locally. The use of autoencoders can require a lot of time and computational cost.48 Overtraining can compromise the robustness of the model, while insufficient training can prevent the models from achieving high performance.
4.2. Datasets
In this study, two datasets of histopathological images were used: BreakHis, used for model training, and the annotated BreCaHAD dataset, used to measure classification accuracy and how well each model interprets tumor characteristics. Both datasets expose only histopathological images, without revealing any confidential patient information. This approach of training and testing on distinct datasets helps to reveal possible overfitting behavior of the models; if a single dataset were used, such a problem could be difficult to investigate.
BreakHis contains 9109 images of breast tissue with tumors. The images are in PNG format, with a size of 700×460 pixels and three color channels (RGB). The dataset covers eight specific tumor types, but it was divided into two classes: malignant and benign. It also provides different magnification factors of the tissue regions: 40×, 100×, 200×, and 400×. In this work, only the 400× magnification was used, resulting in a final set of 1820 images (588 benign and 1232 malignant). The training/validation split was 80/20.
Due to the large class imbalance (approximately 2:1 malignant to benign), a data augmentation process was performed on the training set.49,50,51 Transformations such as rotation and mirroring of the images were applied.
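A sketch of such an augmentation setup with Keras' ImageDataGenerator is shown below; the specific parameter values and the directory layout are illustrative assumptions, since the paper only mentions rotation and mirroring.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation and mirroring applied only to the training split; values are illustrative.
train_augmenter = ImageDataGenerator(rescale=1.0 / 255,
                                     rotation_range=90,
                                     horizontal_flip=True,
                                     vertical_flip=True)

train_generator = train_augmenter.flow_from_directory(
    "breakhis/train",      # hypothetical directory with benign/ and malignant/ subfolders
    target_size=(128, 128),
    class_mode="binary",
    batch_size=32)
```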
The BreCaHAD dataset has 162 histopathological images of malignant tumors (three were removed: two duplicates and one without an annotation indicating a malignant tumor). These images are in TIFF format, have dimensions of 1360×1024, and also have three color channels (RGB). The magnification factor is 400×. The dataset provides annotations with the locations of the tumors in Cartesian coordinates (JSON format), indicated by experts in the field of medical pathology. Figure 8 shows an annotated image; the blue points represent the locations of malignant tumors. In this work, BreCaHAD was used as the test set.

Fig. 8. Annotated images from BreCaHAD.
In this project, there was no prior feature selection; after a resizing process, the original 700×460 image (with its three RGB color channels) was fed into the training networks at a size of 128×128, due to the computational limitations of the machine used in the study.
4.3. Metrics
The interpretation method to be used depends on the context of the problem and is subjectively evaluated by the researcher.29 Along these lines, a previous study16 proposed counting the number of annotations present in the mask extracted, via the Otsu method, from the heatmap generated by Grad-CAM for the BreCaHAD dataset.
The identification of tumors within the region delimited by the mask was done using the JSON file containing the coordinates of the points that represent the tumors. For each image, the percentage of points (tumor cells) present in the region considered “relevant” for the model's classification was computed, and the average of these percentage values was then taken as the IRA.
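Formalizing that description (the notation below is ours, not taken from the original papers), the metric can be written as:

```latex
\mathrm{IRA} = \frac{1}{M} \sum_{j=1}^{M}
  \frac{\lvert \{\, p \in P_j : p \in R_j \,\} \rvert}{\lvert P_j \rvert},
```

where $P_j$ is the set of annotated tumor points of the $j$-th correctly classified test image, $R_j$ is the relevant region given by the Otsu mask of its Grad-CAM heatmap, and $M$ is the number of such images.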
Thus, in the experiments, we sought to answer which model had the better ability to classify and interpret tumor regions, and also to observe whether pre-training the models as encoders favors the extraction of relevant features.
It is important to highlight that the training of each architecture in the classification phase was carried out in three versions: (1) with the encoder-training weights initialized randomly, (2) with the encoder-training weights initialized from ImageNet via transfer learning, and (3) without autoencoder training but with ImageNet weights (in the context of this work, this scenario is referred to as the Vanilla model).
4.4. Results
Table 2 displays four types of information about each model, with their respective results: the classification accuracy, the mean IRA, the median percentage of tumors identified by each model (Interpretable Region Median, IRM), and the standard deviation of the Interpretable Region Accuracy (SD-IRA).
Table 2. Classification accuracy and interpretability metrics (IRA, SD-IRA, IRM) for each model and training scenario.

| CNN Model | Training | Loaded Weights | Classification Accuracy | IRA | SD-IRA | IRM |
|---|---|---|---|---|---|---|
| NASNet Mobile | Vanilla | ImageNet | 59.12% | 56.47% | 21.64% | 59.68% |
| NASNet Mobile | Encoder | ImageNet | 96.23% | 72.62% | 14.51% | 75.26% |
| NASNet Mobile | Encoder | Random | 84.28% | 47.71% | 23.70% | 43.57% |
| DenseNet201 | Vanilla | ImageNet | 79.87% | 36.37% | 23.42% | 31.93% |
| DenseNet201 | Encoder | ImageNet | 98.11% | 45.06% | 28.80% | 39.24% |
| DenseNet201 | Encoder | Random | 91.82% | 45.40% | 30.11% | 36.11% |
| MobileNetV2 | Vanilla | ImageNet | 72.33% | 58.48% | 20.78% | 57.89% |
| MobileNetV2 | Encoder | ImageNet | 76.73% | 48.53% | 20.25% | 47.10% |
| MobileNetV2 | Encoder | Random | 56.60% | 22.15% | 9.03% | 22.50% |
Comparing the results, it can be observed that training the architectures as encoders increased their classification accuracy. There was also an improvement in interpretation when the networks were not only pre-trained as encoders but also initialized with the ImageNet weights.
Although the Vanilla MobileNetV2 architecture interpreted what a tumor is almost 20% better (Region Accuracy), for the other networks the (pre-trained) encoders performed better. For MobileNetV2, pre-training also increased its classification accuracy, but this was not accompanied by an improvement in interpretation, indicating that the model may not be understanding well what a tumor is.
Also, DenseNet201 pre-trained as an encoder with ImageNet weights showed the best classification result, although only about 2% higher than NASNet Mobile, and it performed almost 40% worse than NASNet Mobile when trying to infer what represents a tumor, with a result similar to that of its version pre-trained as an encoder with random weights. It was likely “seeing” information in the image that did not represent a tumor and classifying it as if it did.
Finally, the NASNet Mobile network in its pre-trained encoder version presented the best overall performance among the networks, improving by almost 62% in classification accuracy and 28% in the ability to correctly interpret a tumor relative to its Vanilla version. This could mean that its LRS was actually learning to compress the input information, forcing the model to learn the most relevant features of the image.
The performance improvement of both NASNet Mobile and DenseNet201 is associated with their pre-training as autoencoders, which could indicate that these networks were learning weights that better describe the histopathological features of the images. The same did not occur with MobileNetV2, suggesting that future investigations could help to understand how the feature extraction of this network behaves.
In Fig. 9, the result of the best model for an image from the BreCaHAD dataset can be seen. The model highlights much of the image, which may assist a pathologist in searching for cancerous cells. For the same image, the Vanilla version of NASNet produced a misclassification.

Fig. 9. NASNet generated maps.
When the previous results were compared to those of similar networks,16 it could be noted that only the ImageNet pre-trained DenseNet201 showed improvement in both Accuracy and Region Accuracy (0.6% and 180% more, respectively). Figure 10 shows the performance of DenseNet201 visually.

Fig. 10. DenseNet201 generated maps.

Fig. 11. MobileNetV2 generated maps.
The MobileNetV2 in this work obtained better performance in Region Accuracy (16.8% and 40.8% more when pre-trained with ImageNet and when trained in Vanilla mode, respectively). However, MobileNetV2 showed low performance compared to the others in classification accuracy (approximately 12%) when pre-trained as an autoencoder with ImageNet weights. This may have occurred due to probable overfitting of the model on the first dataset. Finally, NASNet Mobile performed better than its Large version in both Accuracy and Region Accuracy when pre-trained as an encoder with ImageNet weights (15% and 742%, respectively).
5. Conclusion
In this work, we evaluated how pre-training the models MobileNetV2, DenseNet201, and NASNet Mobile could impact their ability to classify and interpret breast cancer tumors in the BreCaHAD dataset (even with few epochs for autoencoder training: 50–70). As observed, all three networks improved their classification accuracy, and NASNet and DenseNet201 improved their IRA.
As observed, the NASNet Mobile network pre-trained as an encoder with ImageNet weights showed a 62% improvement in classification accuracy, a 28% improvement in IRA, and a rightward shift of the median interpretable-region percentage (from 59.68% to 75.26%). That is, it not only classified tumor images well but also better interpreted what a tumor is in the image passed to the model, which can be explained by its pre-training as an autoencoder. The DenseNet201 network pre-trained as an encoder with ImageNet weights showed a nearly 23% improvement in classification accuracy, a 25% improvement in IRA, and a rightward shift of the median interpretable-region percentage (from 31.93% to 39.24%). The MobileNetV2 network only showed an improvement in its classification accuracy (5%).
It is known that it takes a long time for a pathologist to accurately and confidently identify a tumor in a tissue sample. There are many challenges in the field of diagnosis,52 despite many advances having been made and new technologies being welcomed. In cases of pregnancy, the detection scenario becomes even more delicate.53
Applying deep networks that perform well in classifying and in interpreting the results of their classification, providing clear indications of those results, would streamline the clinical diagnostic process and also increase the trust of pathologists in adopting these techniques.
This paper presented a few limitations: low computational processing capacity (due to the complexity of the architectures and the high resolution of the images, some models, such as MobileNet, had to be loaded with only 50% of the original weights, and the images had their resolutions reduced to fit the input of the networks); high training time, since, due to the high computational cost, even at low resolutions the autoencoder architectures could only be trained for a few epochs (which still consumed several hours of training); and, finally, a small number of test samples, as it is still very difficult to find histopathological datasets with reliable annotations from specialists. Due to the nature of the problem, it may be more interesting to use datasets annotated by regions rather than by coordinates.
As suggestions for future work, we recommend the use of other histopathological databases, such as KIMIA Pathb; the use of generative models to improve training54,55 as well as other robust networks11,56 and other model interpretability techniques;57 and, finally, partnerships with hospitals and specialist teams for data collection.
ORCID
Daniel C. Macedo https://orcid.org/0009-0002-0236-3836
Fernando M. de Paula Neto https://orcid.org/0000-0003-4264-1124
Tasso L. O. Moraes https://orcid.org/0009-0005-0224-9621
Vinicius D. Santos https://orcid.org/0009-0008-3169-999X
John W. S. de Lima https://orcid.org/0000-0003-1606-6517
Notes
a https://d2l.ai/chapter_multilayer-perceptrons/mlp.html#relu-function.