Frequency-Assisted Local Attention in Lower Layers of Visual Transformers

https://doi.org/10.1142/S0129065725500157

Since vision transformers excel at establishing global relationships between features, they play an important role in current vision tasks. However, the global attention mechanism restricts the capture of local features, making convolutional assistance necessary. This paper shows that, with a suitable initialization method, transformer-based models can attend to local information much like convolutional kernels, without relying on convolutional blocks. Accordingly, this paper proposes a novel hybrid multi-scale model called the Frequency-Assisted Local Attention Transformer (FALAT). FALAT introduces a Frequency-Assisted Window-based Positional Self-Attention (FWPSA) module that limits the attention distance of query tokens, enabling the capture of local content in the early stages. Information from value tokens in the frequency domain enhances feature diversity during self-attention computation. In the later stages, the conventional convolutional downsampling in the spatial-reduction attention module is replaced with a depth-wise separable convolution to capture long-distance content. Experimental results demonstrate that FALAT-S achieves 83.0% accuracy on ImageNet-1k with a 224×224 input using 29.9M parameters and 5.6G FLOPs. The model outperforms Next-ViT-S by 0.9 AP^b/0.8 AP^m with Mask R-CNN 1× on COCO and surpasses the recent FastViT-SA36 by 3.1% mIoU with FPN on ADE20K.
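To illustrate the idea behind the FWPSA module, the following PyTorch sketch combines window-based self-attention, a learnable distance-based positional bias initialized to favor nearby tokens (limiting the effective attention distance early on), and a frequency-domain branch that enriches the value tokens. The class name, the bias initialization, and the low-pass FFT branch are illustrative assumptions for a single window, not the paper's exact implementation.

```python
# Minimal sketch of frequency-assisted, window-based local attention.
# All design details below are assumptions inspired by the abstract.
import torch
import torch.nn as nn


class WindowLocalAttention(nn.Module):
    def __init__(self, dim: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.window, self.heads = window, heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Distance-based positional bias, initialized so nearby tokens receive
        # larger scores; this limits the attention distance early in training
        # while remaining learnable.
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing="ij"), dim=-1)
        dist = (coords.view(-1, 1, 2) - coords.view(1, -1, 2)).float().norm(dim=-1)
        self.bias = nn.Parameter(-dist / window)        # (W*W, W*W)

    def forward(self, x):                               # x: (B, H, W, C), H = W = window
        B, H, W, C = x.shape
        n = H * W
        q, k, v = self.qkv(x.reshape(B, n, C)).chunk(3, dim=-1)

        # Frequency-assisted values: add a low-pass FFT branch along the
        # token dimension to diversify the information carried by v.
        freq = torch.fft.rfft(v, dim=1)
        k_low = freq.shape[1] // 2
        low = torch.cat([freq[:, :k_low], torch.zeros_like(freq[:, k_low:])], dim=1)
        v = v + torch.fft.irfft(low, n=n, dim=1)

        def split(t):                                   # (B, n, C) -> (B, heads, n, C/heads)
            return t.reshape(B, n, self.heads, C // self.heads).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, n, C)
        return self.proj(out).view(B, H, W, C)


# Toy usage: one 7x7 window of 64-dimensional tokens.
x = torch.randn(2, 7, 7, 64)
print(WindowLocalAttention(64, window=7, heads=4)(x).shape)  # torch.Size([2, 7, 7, 64])
```

In a full multi-scale model, blocks like this would operate on non-overlapping windows in the early stages, with spatial-reduction attention (using depth-wise separable convolution for downsampling, as described above) handling long-distance content in the later stages.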