Research Paper

Large-Scale Image Retrieval with Deep Attentive Global Features

College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, Guangdong 518060, P. R. China

E-mail Address: zhuyy@szu.edu.cn

Corresponding author.

Yinghao Wang

College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, Guangdong 518060, P. R. China

Equal contributors.

Haonan Chen

College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, Guangdong 518060, P. R. China

Equal contributors.

Zemian Guo

College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, Guangdong 518060, P. R. China

Qiang Huang

College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, Guangdong 518060, P. R. China

https://doi.org/10.1142/S0129065723500132

Abstract

Obtaining discriminative features is a core problem in image retrieval. Many recent works extract features with convolutional neural networks (CNNs). However, clutter and occlusion reduce the distinctiveness of CNN features. To address this problem, we use attention mechanisms to obtain high-response activations in the feature map. We propose two attention modules: a spatial attention module and a channel attention module. In the spatial attention module, we first capture global information and model the relations between channels as a region evaluator, which evaluates local features and assigns them new weights. In the channel attention module, a vector of trainable parameters weights the importance of each feature map. The two attention modules are cascaded to adjust the weight distribution over the feature map, which makes the extracted features more discriminative. Furthermore, we present a scale-and-mask scheme that scales the major components and filters out meaningless local features. This scheme applies multiple scale filters to mitigate the effect of the varying scales of the major components in images, and filters out redundant features with the MAX-Mask. Extensive experiments demonstrate that the two attention modules are complementary, and that our network with all three modules outperforms state-of-the-art methods on four well-known image retrieval datasets.
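The cascade of the two attention modules can be illustrated with a minimal pure-Python sketch. It is not the network described in the paper: the abstract does not specify how the region evaluator is computed, in which order the modules are cascaded, or how the scale filters and the MAX-Mask work. The sketch therefore assumes a sigmoid of the dot product between each local feature and a global average descriptor as the region score, applies spatial attention before channel attention, uses sum-pooling with L2 normalisation to form the global feature, and omits the scale-and-mask scheme entirely. All function names are hypothetical.

```python
import math


def spatial_attention(features):
    """Reweight each local feature by a region score.

    `features` is a list of local feature vectors, one per spatial location.
    Assumption: the score is a sigmoid of the dot product between the local
    feature and a global average descriptor. Activations are assumed to be
    non-negative (post-ReLU), so the exponent stays well-behaved.
    """
    n = len(features)
    channels = len(features[0])
    # Global information: average of every channel over all locations.
    global_desc = [sum(f[c] for f in features) / n for c in range(channels)]
    scores = []
    for f in features:
        dot = sum(x * g for x, g in zip(f, global_desc))
        scores.append(1.0 / (1.0 + math.exp(-dot)))
    weighted = [[x * s for x in f] for f, s in zip(features, scores)]
    return weighted, scores


def channel_attention(features, channel_weights):
    """Scale each channel by its entry in a weight vector.

    In a real network `channel_weights` would be trainable; here it is a
    fixed list of floats supplied by the caller.
    """
    return [[x * w for x, w in zip(f, channel_weights)] for f in features]


def attentive_descriptor(features, channel_weights):
    """Cascade spatial then channel attention, sum-pool, and L2-normalise."""
    weighted, _ = spatial_attention(features)
    weighted = channel_attention(weighted, channel_weights)
    channels = len(weighted[0])
    pooled = [sum(f[c] for f in weighted) for c in range(channels)]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


if __name__ == "__main__":
    # Three spatial locations, two channels (toy values).
    local_features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    print(attentive_descriptor(local_features, [0.5, 2.0]))
```

Retrieval with such descriptors would then typically rank database images by the dot product between their unit-length descriptors and that of the query; that step is likewise an assumption here, not something stated in the abstract.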
