With the rapid development of deep learning technology, object detection algorithms have made significant breakthroughs in the field of computer vision. However, due to the complexity and computational requirements of deep Convolutional Neural Network (CNN), these models face many challenges in practical applications, especially on resource-constrained edge devices. To address this problem, researchers have proposed many lightweight methods that aim to reduce the model size and computational complexity while maintaining high performance. The popularity of mobile devices and embedded systems has led to an increasing demand for lightweight models. However, existing lightweight methods often lead to accuracy loss, limiting their feasibility in practical applications. Therefore, how to realize the light weight of the model while maintaining high accuracy has become an urgent problem to be solved. To address this challenge, this paper proposes a lightweight YOLOv7 method based on PConv, Squeeze-and-Excitation (SE) attention mechanism and Wise-IoU (WIoU), which we refer to as YOLOv7-PSW. PConv can effectively reduce the number of parameters and computational complexity. The SE can help the model focus on important feature information, thereby improving performance. WIoU is introduced to measure the similarity between the detection box and the Ground Truth, so that the model can effectively reduce the False Positive rate. By applying these advanced techniques to the YOLOv7, we achieve a lightweight model while maintaining a high detection accuracy. Experimental results on PASCAL VOC dataset show that YOLOv7-PSW performs better than the original YOLOv7 on object detection tasks. The number of parameters is reduced by 12.3%, FLOPs is reduced by 18.86%, and the accuracy is improved by about 0.5%. While the detection accuracy is not decreased or even slightly improved, the number of FLOPs and parameters is greatly reduced, which realizes lightweight to a certain extent. The proposed method can provide new ideas and directions for the subsequent research on lightweight object detection, and is expected to promote its application on edge devices. Meanwhile, YOLOv7-PSW can also be applied to other computer vision tasks to improve its performance and efficiency. In summary, the proposed YOLOv7-PSW lightweight method realizes the light weight of the model while maintaining high accuracy. This is of great significance for promoting the application of object detection algorithms on edge devices.

Keywords:

1. Introduction

As an important task in computer vision, object detection is widely used in face recognition, pedestrian detection, vehicle recognition, and other fields. However, due to the limitation of computing resources and real-time requirements, the performance of traditional object detection algorithms on mobile devices and embedded devices often cannot meet the needs of practical applications. Therefore, it is of great significance to study lightweight object detection algorithms. In recent years, object detection algorithms based on deep learning have made significant progress. Among them, You Only Look Once (YOLO)^1,2,3,4,5 series algorithms have attracted wide attention due to their high speed and high accuracy. However, although YOLOv7⁶ achieves fast inference speed while maintaining high detection accuracy, it is difficult to meet the real-time and resource consumption requirements of mobile devices and embedded devices due to its huge model scale and complex calculation process, which leads to many challenges in the deployment of YOLOv7 algorithm on mobile devices and embedded devices. First, YOLOv7 requires a lot of computing resources for inference. This includes high-resolution input images, lots of convolution operations, pooling operations, etc. For mobile and embedded devices, these computing resources tend to be very limited. Therefore, how to reduce the amount of computation while maintaining high detection accuracy has become an urgent problem to be solved. Second, YOLOv7 also has certain limitations when dealing with complex backgrounds and multi-target scenes. The complex background often interferes with the detection of objects, and the multi-object scene will lead to the overlap and occlusion between objects. All these problems will reduce the accuracy and robustness of object detection. In addition, if only a simple lightweight method is used, the detection performance of YOLOv7 will be reduced due to the reduction of the model size, which will make it difficult for the YOLOv7 algorithm to accurately locate and identify small targets or targets with serious occlusion. This is unacceptable for some applications that require accurate object detection. Therefore, how to achieve lightweight while maintaining high accuracy is a key issue.

In response to the above problems, this paper aims to study a lightweight YOLOv7⁶ algorithm to improve its convenient use performance on mobile devices and embedded devices. Specifically, in order to reduce the number of model parameters and computational complexity without affecting the detection performance of YOLOv7, we introduced PConv⁷ and SE⁸ into YOLOv7, and the model structure of the modified part can be referred to Fig. 1. Through a series of experiments, YOLOv7-PSW does achieve lightweight and minimizes the impact on detection performance. The reasons can be summarized as the following three points: (1) PConv can reduce the amount of calculation of the model, reduce the memory occupation of the model, and improve the running speed of the model, so as to speed up the training and inference speed of the model. (2) By calculating the importance weight for each channel, SE achieves a better selective screening effect, helps the network to learn a more representative feature representation, improves the accuracy of detection, and improves the robustness of the model. (3) The traditional IoU calculation method is easily affected by the overlap degree between the target position and the detection box, and is prone to False positives. After the introduction of WloU,⁹ the False Positive false detection rate can be effectively reduced. After the improvement of the above three aspects, the retrained model can indeed get good results, whether in the amount of parameters and calculation, or in the recognition accuracy. Specific experimental details will be explained in Sec. 4.

Fig. 1. This graph depicts PConv and where SE is added. The module to which PConv is added is PBS, P is PConv, B is BN, and S is SiLU. Replacing the original Conv with PConv completes the introduction. The places where SE is introduced are the two modules C7_3 and SPPCSPC. Refer to Fig. 2 for the specific structure of each module.

Fig. 2. This figure describes in detail the specific structure of each module in Fig. 1.

The contributions of this paper are mainly reflected in the following aspects: First, we propose a new YOLOv7 network structure based on PConv, SE attention mechanism, and Wise-IoU (WIoU). By changing the convolution method of YOLOv7, the parameter amount and computational complexity of the model are reduced, so as to realize the lightweight and efficiency of the model. The SE attention mechanism and WIoU are introduced to enhance the processing ability of complex background and multi-target scenes, so as to improve the perception of complex background and multi-target scenes. This attention mechanism can effectively improve the accuracy and robustness of object detection. Second, we experimentally verify the performance benefits of the proposed method on the PASCAL VOC dataset, demonstrating its applicability in different scenarios. Finally, we provide a comprehensive analysis and discussion of the proposed method, which provides a valuable reference for further improving the lightweight object detection algorithm.

Finally, the structure of this paper is as follows: Sec. 2 introduces the related work; The third part describes the lightweight method of YOLOv7 network structure in detail. In the fourth part, the experimental evaluation and result analysis are carried out. Section 5 summarizes the research work of this thesis and looks forward to the future research directions.

Through the research of this thesis, we expect to provide a lightweight solution for object detection applications on mobile devices and embedded devices, and promote the wide application and development of computer vision technology in these fields.

2. Related Work

2.1. Optimization of YOLOv7

YOLOv7⁶ is a new real-time object detection model. The model can quickly and accurately detect multiple types of objects, such as people, cars and animals, and can provide more fine detection details for objects. The main optimization work of YOLOv7 includes the following aspects:

(I)	The backbone structure is optimized: the lightweight structure is adopted to reduce the complexity and redundant calculation of the network, so as to improve the operation speed and detection accuracy of the network.
(II)	An efficient feature extraction method is adopted: YOLOv7 extracts the features of the object image by means of Feature Pyramid Network (FPN) and attention mechanism,¹⁰ thereby reducing the computational complexity of object detection.
(III)	A new data augmentation method is introduced: YOLOv7 improves the generalization ability and robustness of the model by increasing the diversity and amount of data, thereby improving the performance of the model.
(IV)	The model training process is optimized: the cumulative gradient correction, weight adjustment and multi-scale training techniques are used to optimize the model training process, thereby improving the training efficiency and accuracy of the model.

In short, YOLOv7⁶ is an efficient, accurate and fast object detection model, which is one of the more excellent real-time object detection models at present. It has been widely used and promoted, but unfortunately, in some scenarios, the detection performance is still unsatisfactory, and there is still room for improvement.

2.2. Advantages of PConv

PConv⁷ exploits the redundancy in the feature maps and applies regular convolution (Conv) on only a subset of the input channels without affecting the rest. In essence, PConv has lower FLOPs than regular Conv, while FLOPS is higher than DWConv and GConv. In other words, PConv makes better use of the computing power on the device. PConv is also effective in extracting spatial features, and it is likely to replace the existing DWConv.

Chen et al.⁷ further introduced PConv to design FasterNet as a new family of networks that runs very fast on a variety of devices. In particular, FasterNet achieves state-of-the-art performance on classification, detection, and segmentation tasks while having lower latency and higher throughput. For example, the small model FasterNet-T0 is 3.1, 3.1, and 2.5 times faster than MobileVit-XXS¹¹ on GPU, CPU, and ARM processors, respectively, while achieving 2.9% higher accuracy on ImageNet-1k. The large model FasterNet-L achieves 83.5% Top-1 accuracy, on par with Swin-B,¹² while providing 49% higher throughput on the GPU and 42% less computation time on the CPU. To fully and efficiently utilize the information from all channels, Chen et al. further appended PWConv to PConv. Their effective receptive field on the input feature map looks like a T-shaped Conv, and the PWConv is more focused on the central position compared to the regular Conv which processes patches uniformly.

2.3. Advantages of the WIoU

As a core problem in computer vision, object detection performance depends on the design of loss function. The bounding box loss function is an important part of the object detection loss function, and its good definition will bring significant performance improvement to the object detection model. In recent years, most of the research assumes that the examples in the training data have high quality, and focuses on strengthening the fitting ability of the bounding box loss. However, Tong et al.⁹ noticed that the object detection training set contains low-quality examples, and if the regression of bounding boxes on low-quality examples is blindly strengthened, it will obviously endanger the improvement of model detection performance. Focal-EIoUv1¹³ was proposed to solve this problem, but because its focusing mechanism is static, it does not fully exploit the potential of non-monotonic focusing mechanism. Based on this idea, Tong et al. proposed a dynamic non-monotonic focusing mechanism and designed WIoU.⁹ The dynamic non-monotonic focusing mechanism uses “outlier” instead of IoU to evaluate the quality of anchor boxes and provides a sensible gradient gain allocation strategy. This strategy reduces the competitiveness of high-quality anchor boxes while also reducing the harmful gradients generated by low-quality examples. This allows WIoU to focus on ordinary quality anchor boxes and improve the overall performance of the detector. When WIoU is applied to the most advanced single-stage detector YOLOv7,⁶ the AP-75 on MS-COCO dataset is improved from 53.03% to 54.50%.

WIoU v1⁹ constructed the attention-based bounding box loss, and WIoU v2 and WIoU v3 added the focusing mechanism by constructing the calculation method of gradient gain (focusing coefficient) on this basis. In terms of calculation speed, the computational cost increased by WIoU mainly lies in the calculation of the focusing coefficient and the mean statistics of the IoU loss. Under the same experimental conditions, WIoU has faster speed because it does not calculate the aspect ratio, and the calculation time of WIoU is 87.2% of Complete IoU Loss (CIoU).¹⁴

2.4. Introduction to the SE

In Ref. 8, Hu et al. studied the relationship between channels in network design. They introduced a new architectural unit, called the “Squeeze-and-Excitation” (SE) block, which aims to improve the quality of the representations produced by the network by explicitly modeling the interdependencies between the channels of its convolutional features. To this end, Hu et al. proposed a mechanism that allows the network to perform feature recalitioning, by which the network can learn to use global information to selectively emphasize informative features and suppress less useful ones.

An SE network⁸ (SENet) can be built by simply stacking a collection of SE blocks. In addition, these SE blocks can also be used as replacement blocks for the original blocks with different depths in the network architecture. While the template for building blocks is generic, it plays different roles at different depths throughout the network. In the early layers, it motivates informative features in a class-agnosia manner, enforcing the shared underlying representation. In later layers, the SE blocks become increasingly specialized and respond to different inputs in a highly class-specific manner. Thus, the benefit of the feature recalitioning performed by the SE block can be accumulated through the network.

3. Methods

In this section, we describe the model modification approach, detailing how PConv,⁷ SE,⁸ and WIoU⁹ are incorporated into the YOLOv7⁶ model. In the YOLOv7 baseline model, there is a case of false detection, see Fig. 3. In YOLOv7-PSW, the situation of false detection is effectively improved (see Fig. 8.).

Fig. 3. This figure depicts the case where the YOLOv7 baseline model has recognition errors when the background map has high noise. At this point, the left finger of the man on the left is recognized as scissors, simply because his gesture is somewhat sciss-like. The reason for the recognition error is the weak feature extraction ability in the fuzzy background.

3.1. Introduce the PConv

At the beginning, PConv⁷ was not our first choice, but we tried a variety of popular convolution methods, including DSConv,¹⁵ CoordConv,¹⁶ DCNv2,¹⁷ and PConv. After trying, we found that PConv had the best effect and contributed the most to the light weight of the model. Finally, it was decided to choose PConv as the convolution mode of the model. The locations and quantities added by the above several convolution methods are all the same during the experiment. The specific comparison results are shown in the Fig. 4.

Fig. 4. This figure shows the changes in the number of parameters and FLOPs after YOLOv7 is combined with different convolution methods. YOLOv7+PConv has the best effect and the least number of parameters and FLOPs.

PConv⁷ can reduce the amount of computation with almost no impact on accuracy, and spatial features can be effectively extracted by simultaneously reducing redundant computations and memory accesses. PConv applies regular Conv to some input channels for spatial feature extraction, while leaving the rest unchanged. Sort of like GhostNet,¹⁸ but instead of using DWConv from GhostNet, keep normal convolutions. Let’s look at the FLOPs and memory accesses of PConv:

• FLOPs of PConv⁷

h \times w \times k 2 \times c 2 p, <math display="block" altimg="eq-00001.gif"><mi>h</mi><mo>\times</mo><mi>w</mi><mo>\times</mo><msup><mrow><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>\times</mo><msubsup><mrow><mi>c</mi></mrow><mrow><mi>p</mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo>,</mo></math> (1)

where h and w are the width and height of the feature map, k is the size of the convolution kernel, and

$c_{p}$ is the number of channels operated by conventional convolution. In practical implementation, it is generally

$r = c_{p}$ /

$c =$ 1/4, so the FLOPs of PConv is only 1/16 of that of a regular Conv.

• Memory access of PConv⁷

h \times w \times 2 c p + k 2 \times c 2 p \approx h \times w \times 2 c p, <math display="block" altimg="eq-00005.gif"><mi>h</mi><mo>\times</mo><mi>w</mi><mo>\times</mo><mn>2</mn><msub><mrow><mi>c</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>+</mo><msup><mrow><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>\times</mo><msubsup><mrow><mi>c</mi></mrow><mrow><mi>p</mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo>\approx</mo><mi>h</mi><mo>\times</mo><mi>w</mi><mo>\times</mo><mn>2</mn><msub><mrow><mi>c</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>,</mo></math> (2)

where h and w are the width and height of the feature map, k is the size of the convolution kernel,

$c_{p}$ is the number of channels used by conventional convolution, the number of memory accesses of PConv is only 1/4 of that of conventional convolution, and the remaining (

$c - c_{p})$ channels are not involved in the calculation, so there is no need to access memory.

When combined with YOLOv7,⁶ it was able to significantly improve the performance of the baseline model. In the YOLOv7 baseline model, stacking is mainly carried out by convolution, so the introduction of PConv⁷ is not difficult. We mainly replace the $3 \times 3$ convolution with stride = 1 in the backbone part of the baseline model. The reason why we do not replace the $1 \times 1$ convolution at the same time is that the performance improvement is not very large after doing so. The reason is that the number of $1 \times 1$ convolution parameters is small, so the replacement is not very meaningful. So we only made $3 \times 3$ convolutional substitutions. First, we replace the normal Conv module in the CBS module with the PConv module, thus becoming PBS. Next, we replace the four CBS modules in the C7_1 module with PBS modules and refer to the replaced modules as C7_3. Finally, we integrate the C7_3 module into the backbone of YOLOv7, and successfully integrate PConv into YOLOv7. In the Fig. 5, we can clearly see the process of adding PConv.

Fig. 5. This figure shows how to add PConv to YOLOv7.

3.2. Introduce the WIoU

Due to the reduction of the number of parameters and FLOPs, it is bound to affect the accuracy of detection to a certain extent. In order to reduce the degree of influence, we try to modify the coordinate loss function. The original coordinate loss function of YOLOv7 is CIoU,¹⁴ which is defined as follows :

CIoU=IoU−ρ2(b,bgt)c2−αv,<math display="block" altimg="eq-00012.gif"><mstyle><mtext mathvariant="normal">CIoU</mtext></mstyle><mo>=</mo><mstyle><mtext mathvariant="normal">IoU</mtext></mstyle><mo>−</mo><mfrac><mrow><msup><mrow><mi>ρ</mi></mrow><mrow><mn>2</mn></mrow></msup><mo stretchy="false">(</mo><mi>b</mi><mo>,</mo><msup><mrow><mi>b</mi></mrow><mrow><mi>g</mi><mi>t</mi></mrow></msup><mo stretchy="false">)</mo></mrow><mrow><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac><mo>−</mo><mi>α</mi><mi>v</mi><mo>,</mo></math>(3)

where

$α$ is a weighting function, here the priority is assigned according to the IoU value, and the coefficient is larger when the IoU of the prediction and target boxes is larger. The v is used to measure the aspect ratio similarity :

v=4π2(arctanwgthgt−arctanwh)2,<math display="block" altimg="eq-00014.gif"><mi>v</mi><mo>=</mo><mfrac><mrow><mn>4</mn></mrow><mrow><msup><mrow><mi>π</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac><msup><mrow><mfenced separators="" open="(" close=")"><mrow><mstyle><mtext mathvariant="normal">arctan</mtext></mstyle><mfrac><mrow><msup><mrow><mi>w</mi></mrow><mrow><mi>g</mi><mi>t</mi></mrow></msup></mrow><mrow><msup><mrow><mi>h</mi></mrow><mrow><mi>g</mi><mi>t</mi></mrow></msup></mrow></mfrac><mo>−</mo><mstyle><mtext mathvariant="normal">arctan</mtext></mstyle><mfrac><mrow><mi>w</mi></mrow><mrow><mi>h</mi></mrow></mfrac></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo></math>(4)

α=v(1−IoU)+v.<math display="block" altimg="eq-00015.gif"><mi>α</mi><mo>=</mo><mfrac><mrow><mi>v</mi></mrow><mrow><mo stretchy="false">(</mo><mn>1</mn><mo>−</mo><mstyle><mtext mathvariant="normal">IoU</mtext></mstyle><mo stretchy="false">)</mo><mo>+</mo><mi>v</mi></mrow></mfrac><mo>.</mo></math>(5)

It can be seen that CIoU¹⁴ adds the influence factor of image similarity on the basis of DIoU,¹⁴ so it can better reflect the difference between two boxes. However, there are still problems in the following four aspects:

(I)	The aspect ratio describes the relative value, and there is some ambiguity.
(II)	The balance problem of hard and easy samples is not considered.
(III)	The calculation of CIoU loss is relatively complex, which may lead to a large computational overhead in the training process — does not meet the requirements of lightweight.
(IV)	The v in the CIoU formula reflects the difference in aspect ratio, rather than the real difference between width and height and their confidence, which sometimes prevents the model from effectively optimizing the similarity.

Based on the above problems, we consider replacing the coordinate loss function with WIoU.

At present, there are three versions of WIoU v1, v2, and v3.⁹ WIoU v1 constructs the attention-based bounding box loss, and WIoU v2 and WIoU v3 add the focusing mechanism by constructing the calculation method of gradient gain (focusing coefficient) on this basis.

3.3. Join the SE

The structure of SE⁸ is simple, and can be directly used in the architecture of YOLOv7⁶ by replacing the components with the corresponding components of SE, so as to effectively improve the performance. The SE block is also computationally lightweight and only slightly increases the model complexity and computational burden. In YOLOv7, SE can be added to either the backbone or the head section. The specific location is the place where the output of the feature layer is added. First, we introduce the way of adding SE attention. After many heat comparison experiments, we find that adding SE module directly after CBS module can achieve the best effect, so CBS module becomes CBSS module. As mentioned above, we added PConv to C7_1 to form C7_3. On this basis, we replaced the CBS module after Concat in C7_3 with the CBSS module, which became the C7_4 module. At the same time, we replace the CBS module after Concat in the SPPCSPC module with the CBSS module, so as to become a new SPPCSPC module. Finally, the two parts are integrated into the backbone part and the head part of YOLOv7 to become the final structure of YOLOv7-PSW. In the Fig. 6, we can clearly see the process of adding SE.

Fig. 6. This figure shows how to add SE to YOLOv7.

4. Experiment and Results

4.1. Dataset selection

The datasets used in this experiment are all public datasets. To verify whether the YOLOv7-PSW improvement is successful, we first select a small dataset, COCO128, which consists of the first 128 images in COCO train2017 and contains only a few categories. The reason to use a small dataset for training is to quickly test whether the training results are working as expected. To prevent overfitting, we conduct experiments with different numbers of epochs and find that the best performance can be achieved exactly when epochs = 100, while the training results also verify that the optimization of YOLOv7-PSW is successful.

To increase the robustness of YOLOv7-PSW, we decided to use the full MS-COCO dataset for training. The dataset has over 330K images containing 1.5 million objects, 80 object categories (pedestrian, car, elephant, etc.), and 91 stuff categories (stuff categories: grass, wall, sky, etc.). However, due to the limited computing resources, we can only use MS-COCO dataset for the training process which leads to out of memory and the training is interrupted. Finally, we decided to adopt the PASCAL VOC dataset for our experiments. We integrate PASCAL VOC2007 with PASCAL VOC2012 to form a medium-sized dataset, which we refer to as VOC0712 for short. This dataset contains a total of four categories: vehicle, household, animal, and person, and these four categories contain a total of 20 sub-categories. The 20 sub-categories and their hierarchy are shown in Fig. 7. The VOC0712 dataset contains about 30,000 images, and we split the entire dataset into training and validation sets with an 8:2 ratio, using the same data as the validation set. The VOC0712 dataset is in “xml” format, so we converted it to “txt” format to make it “YOLO format”. At this point, the preparation of the dataset is complete.

Fig. 7. Categories structure diagram of VOC0712, italic and bold for the 20 sub-categories.

4.2. Introduction of relevant parameters

Before the training starts, some parameters need to be set, and some of the parameters involved are shown in Table 1. First, let’s explain the meaning of some parameters: lr0 stands for the initial value of the learning rate, lrf stands for the final value of the learning rate. The momentum can be understood as the inertia of the parameter update. It does this by maintaining a momentum vector that records the weighted average of the previous gradient directions and uses it for parameter updates. Doing so can speed up training and improve model stability. Properly adjusting the magnitude of the momentum can make the parameter update direction smoother, and a larger momentum can accelerate the parameter update. Another parameter is weight_decay, which is a common regularization technique designed to reduce model complexity to prevent overfitting. Higher values of weight_decay lead to stronger regularization and better model generalization. However, too much weight_decay will cause the problem of underfitting the model. Now to warmup_epochs, when training a deep learning model, it is sometimes necessary to warmup with a smaller learning rate first to avoid unstable gradients or loss in the initial stage. warmup_epochs controls the number of warm-up epochs, that is, using a smaller learning rate for the first few epochs of training to allow the model to converge to a steady state faster. After the warm-up phase, the learning rate will gradually increase to the set initial learning rate, and the training will continue at the set learning rate. Finally, we have warmup_momentum, which, in a nutshell, represents the setting of the momentum during warmup.

**Table 1. Parameters of YOLOv7-PSW during training.**
Parameters	Values
Epoch	200
Lr0	0.01
Lrf	0.1
Image size	640
Batch size	8
Momentum	0.935
Weight_Decay	0.0005
Warmup_Epochs	3

4.3. Training results of YOLOv7-PSW

In order to compare the performance of YOLOv7⁶ with YOLOv7-PSW, we trained YOLOv7 based on VOC0712 dataset. Then, in the process of optimizing YOLOv7, we recorded the performance of some key steps. For example, when only PConv was added to YOLOv7, in order to explore the influence of PConv on lightweight, we performed a training on the dataset, and then recorded the number of parameters, FLOPs, mAP@0.5 and mAP@0.5:0.95. Next, we trained YOLOv7+SE, YOLOv7+WIoU, YOLOv7+PConv+SE on the dataset, and finally we trained YOLOv7+ALL, namely YOLOv7-PSW. Table 2 shows some key data recorded during the experiment.

**Table 2. Comparison of YOLOv7-PSW with baseline model.**
Model	#Param. (M)	FLOPs (G)	mAP@0.5	mAP@0.5:0.95
YOLOv7	37.62	106.5	80.9	61.5
YOLOv7+PConv	33.11	86.2	81.1	57.8
YOLOv7+SE	37.55	106.1	80.8	60.2
YOLOv7+WIoU	37.62	106.5	80.9	59.3
YOLOv7+PConv+SE	32.99	85.3	81.1	62.2
YOLOv7+ALL (YOLOv7-PSW)	32.99	85.3	81.6	62.4

After YOLOv7-PSW training, in order to check whether the false recognition has improved, we detected the same image as Fig. 3. From the detection results, YOLOv7-PSW did not incorrectly detect the palm as a scissors. Figure 8 shows the result after detection.

Fig. 8. YOLOv7-PSW does not recognize the palm of the person in the figure as a scissors.

As can be seen from Fig. 9, the performance of our YOLOv7-PSW rises smoothly during the training process, and the training effect is relatively good.

Figure 10 shows the training process of each category in VOC0712 dataset and the average of all categories (the thick line in the figure indicates the average).

Fig. 10. Panel (a) shows the relationship between recall and confidence, which means the probability of each class being recalled when the confidence is set to a certain value. We can see that when the confidence is smaller, the category detection is more comprehensive. Panel (b) shows the relationship between accuracy and confidence. When the confidence is higher, the category detection is more accurate. Panel (c) is the relationship between precision and recall. The area enclosed under this curve is the AP, and the average AP of all categories is the mAP.

Figure 11 is the confusion matrix after YOLOv7-PSW training. The confusion matrix summarizes the records in the dataset in the form of a matrix according to the two criteria of the true category and the category judgment predicted by the classification model. The rows of the matrix represent the actual values and the columns represent the predicted values.

4.4. Comparison and analysis of YOLOv7-PSW with other models

4.4.1. Comparison with YOLOv7 baseline model

As can be seen in Fig. 12, from the baseline YOLOv7 to YOLOv7-PSW, the number of parameters and FLOPs show a decreasing trend in general, while mAP@0.5 and mAP@0.5:0.95 show an increasing trend in general.

Figure 13 shows the variation of YOLOv7 and optimized YOLOv7-PSW in terms of accuracy. On our dataset, the precision of YOLOv7 is 80.98%, and the precision of optimized YOLOv7-PSW is improved to 82.37%, which is about 1.4 percentage points higher. This is not an easy achievement, and we did it without making the detection accuracy drop while reducing the model size.

Fig. 13. Precision comparison between YOLOv7-PSW and YOLOv7.

Figure 14 shows the improvement of both in terms of recall. Compared with YOLOv7, the recall rate of YOLOv7-PSW is improved by about 3 percentage points.

Fig. 14. Recall comparison between YOLOv7-PSW and YOLOv7.

From Fig. 15, we can observe that the promotion of the mAP values. The mAP values have two forms: mAP@0.5 and mAP@0.5:0.95. Here, mAP@0.5 represents the average mAP at a threshold greater than 0.5, and mAP@0.5:0.95 represents the average mAP at different IoU thresholds (0.5–0.95 in steps of 0.05) (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95).

4.4.2. Comparison with other object detectors

In this experiment, YOLOv7-PSW was compared with six other object detection models in addition to the YOLOv7⁶ baseline model. Including YOLOv4,⁴ YOLOv5-M,⁵ PPYOLO-M,¹⁹ YOLOR-CSP,²⁰ YOLOX-M,²¹ CenterNet,²² all models were trained and tested using the same dataset and under the same hardware environment. We trained and tested the above models through VOC0712 dataset, and evaluated each model, including four indicators: the number of parameters, the FLOPs, mAP@0.5, mAP@0.5:0.95.

In Fig. 16, we compare the above mentioned models with our YOLOv7-PSW data and see that our model has higher mAP@0.5 and mAP@0.5:0.95 compared to the others.

5. Conclusion

The research of this paper aims to optimize YOLOv7 based on PConv, SE attention mechanism and WIoU to achieve a lightweight YOLOv7 algorithm. By reducing the number of parameters and computational complexity of the model, we successfully compress the YOLOv7 model to a size that can run efficiently on mobile and embedded devices. Meanwhile, we also improve the performance and robustness of the model by introducing the SE attention mechanism and WIoU. The results of this paper show that the number of parameters and the amount of computation can be significantly reduced by using PConv to replace the traditional convolution operation. By using PConv, we successfully reduce the size of the YOLOv7 model by about 20% while maintaining a high detection accuracy. In this paper, we introduce the SE attention mechanism to improve the model’s attention to key features. The SE attention mechanism enables the model to automatically learn the weights of important features by rescaling the feature maps at the channel level. By introducing the SE attention mechanism, we further improve the performance of YOLOv7 model and make it achieve better results in object detection tasks. In addition, WIoU is proposed as a new coordinate loss function. WIoU considers not only the overlap degree between the predicted box and the true box, but also the relative position relationship between the predicted box and the true box. By using WIoU, we can improve the performance of object detection algorithms more comprehensively and find some problems that are not captured by traditional metrics.

The research results of this paper are of great significance for solving practical problems. First, by making the YOLOv7 model lightweight, we can achieve efficient object detection on mobile and embedded devices. This has important application value for real-time monitoring, intelligent transportation system, unmanned driving and other fields. Second, by introducing the SE attention mechanism and WIoU, we can improve the performance and robustness of the model, so that it can achieve good detection results in different scenarios. This is important for improving the reliability and adaptability of object detection algorithms.

However, there are some limitations and future research directions in this paper. First, although we successfully lightweight the YOLOv7 model, there are still some parameters and computational redundancy. Future research can further explore how to reduce the redundancy of the model to improve the efficiency and speed of the model. Second, this paper only studies the optimization of YOLOv7 model, and other object detection algorithms can also learn from these optimization methods. Future research can extend the proposed method to other object detection algorithms to further improve the performance of the whole object detection field. Finally, the research of this thesis mainly focuses on the object detection task in a single image, and future research can further explore the object detection problem in complex scenes such as multi-scale, multi-view, and multi-object tasks.

In summary, this paper optimizes YOLOv7 based on PConv, SE attention mechanism and WIoU, and realizes a lightweight object detection algorithm. By reducing model size and computational complexity and improving performance and robustness, we provide an efficient, accurate, and resource-friendly solution for real-time object detection on mobile and embedded devices. Future research can further explore how to reduce the redundancy of the model, extend the optimization method to other object detection algorithms, and solve the object detection problem in complex scenes.

Acknowledgments

This work is partially supported by Natural Science Foundation of China Grants (61972456, 61173032) and Tianjin Natural Science Foundation (20JCYBJC00140). Also, we would like to thank the School of Computer Science and Technology, Tiangong University for supporting our work.

ORCID

Liu Zhigang https://orcid.org/0009-0008-0987-7649

Sun Baoshan https://orcid.org/0009-0008-2967-9201

Bi Kaiyu https://orcid.org/0009-0009-5977-2471

Remember to check out the Most Cited Articles!
Check out these titles in artificial intelligence!

References

1. J. Redmon, S. Divvala, R. Girshick and A. Farhadi, You only look once: Unified, real-time object detection, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, Vol. 1 (IEEE, Piscataway, NJ, 2016), pp. 779–788. Crossref, Google Scholar
2. J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, Vol. 1 (IEEE, Piscataway, NJ, 2017), pp. 6517–6525. Crossref, Google Scholar
3. X. Li, Y. Jiang, C. Liu, S. Liu, H. Luo and S. Yin, Playing against deep-neural-network-based object detectors: A novel bidirectional adversarial attack approach, IEEE Trans. Artif. Intell. 3 (2022) 20–28. Crossref, Google Scholar
4. C.-Y. Wang, A. Bochkovskiy and H.-Y. M. Liao, Scaled-YOLOv4: Scaling cross stage partial network, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, 2021, Vol. 1 (IEEE, Piscataway, NJ, 2021), pp. 13024–13033. Crossref, Google Scholar
5. N. D. T. Yung, W. K. Wong, F. H. Juwono and Z. A. Sim, Safety helmet detection using deep learning: Implementation and comparative study using YOLOv5, YOLOv6, and YOLOv7, in Proc. 2022 Int. Conf. Green Energy, Computing and Sustainable Technology (GECOST), Miri Sarawak, Malaysia, 2022, Vol. 1 (IEEE, Piscataway, NJ, 2022), pp. 164–170. Crossref, Google Scholar
6. C.-Y. Wang, A. Bochkovskiy and H.-Y. M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, Vol. 1 (IEEE, Piscataway, NJ, 2023), pp. 7464–7475. Crossref, Google Scholar
7. J. Chen et al., Run, don’t walk: Chasing higher FLOPS for faster neural networks, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, Vol. 1 (IEEE, Piscataway, NJ, 2023), pp. 12021–12031. Crossref, Google Scholar
8. J. Hu, L. Shen, S. Albanie, G. Sun and E. Wu, Squeeze-and-excitation networks, IEEE Trans. Pattern Anal. Mach. Intell. 42 (2020) 2011–2023. Crossref, Google Scholar
9. Z. Tong, Y. Chen, Z. Xu and R. Yu, Wise-IoU: Bounding box regression loss with dynamic focusing mechanism, arXiv preprint (2023), arXiv:2301.1005. Google Scholar
10. M. Hu, Y. Li, L. Fang and S. Wang, A2-FPN: Attention aggregation based feature pyramid network for instance segmentation, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, 2021, Vol. 1 (IEEE, Piscataway, NJ, 2021), pp. 15338–15347. Crossref, Google Scholar
11. J. Yi, Z. Shen, F. Chen, Y. Zhao, S. Xiao and W. Zhou, A lightweight multiscale feature fusion network for remote sensing object counting, IEEE Trans. Geosci. Remote Sens. 61 (2023) 1–13. Google Scholar
12. Z. Liu et al., Swin transformer: Hierarchical vision transformer using shifted windows, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, QC, Canada, 2021, Vol. 1 (IEEE, Piscataway, NJ, 2021), pp. 9992–10002. Crossref, Google Scholar
13. Y.-F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang and T. Tan, Focal and efficient IoU loss for accurate bounding box regression, Neurocomputing 506 (2022) 146–157. Crossref, Google Scholar
14. Z. Zheng, P. Wang, W. Liu, Ji. Li, R. Ye and D. Ren, Distance-IoU loss: Faster and better learning for bounding box regression, in Proc. AAAI Conf. Artificial Intelligence, New York, 2020, Vol. 34. (AAAI, Palo Alto, CA, 2020), pp. 12993–130000. Crossref, Google Scholar
15. M. G. D. Nascimento, V. Prisacariu and R. Fawcett, DSConv: Efficient convolution operator, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, South Korea, 2019, Vol. 1 (IEEE, Piscataway, NJ, 2019), pp. 5147–5156. Crossref, Google Scholar
16. R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev and J. Yosinski, An intriguing failing of convolutional neural networks and the CoordConv solution, in Proc. 32nd Int. Conf. Neural Information Processing Systems (NIPS’18), Montreal, Canada, 2018, Vol. 31 (NIPS, La Jolla, CA, 2018), pp. 9628–9639. Google Scholar
17. X. Zhu, H. Hu, S. Lin and J. Dai, Deformable ConvNets V2: More deformable, better results, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 2019, Vol. 1 (IEEE, Piscataway, NJ, 2019), pp. 9300–9308. Crossref, Google Scholar
18. K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu and C. Xu, GhostNet: More features from cheap operations, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, 2020, Vol. 1 (IEEE, Piscataway, NJ, 2020), pp. 1577–1586. Crossref, Google Scholar
19. Y. Li, H. Huang, Q. Chen, Q. Fan and H. Quan, Research on a product quality monitoring method based on multi scale PP-YOLO, IEEE Access 9 (2021) 80373–80387. Crossref, Google Scholar
20. G. Puthilibai, M. A. Kishore, S. Sankar, B. Sanjay Krishna, K. Vishal and K. Bharathi, Autonomous traffic system management based on YOLOR, in Proc. 2022 1st Int. Conf. Computational Science and Technology (ICCST), Chennai, India, 2022, Vol. 1 (IEEE, Piscataway, NJ, 2022), pp. 897–902. Crossref, Google Scholar
21. X. Wang, J. Wu, J. Zhao and Q. Niu, Express carton detection based on improved YOLOX, in Proc. 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conf. (IMCEC), Chongqing, China, 2022, Vol. 1 (IEEE, Piscataway, NJ, 2022), pp. 1267–1272. Crossref, Google Scholar
22. C. Tang, Z. Wu, S. Wang, C. Deng and L. Luo, Industrial object detection method based on improved CenterNet, in Proc. 2021 Int. Conf. Computer Engineering and Artificial Intelligence (ICCEAI), Shanghai, China, 2021, Vol. 1 (IEEE, Piscataway, NJ, 2021), pp. 121–125. Crossref, Google Scholar

Vol. 23, No. 01

Metrics

Downloaded 1,168 times

History

Received 24 July 2023

Revised 16 November 2023

Accepted 14 December 2023

Published: 7 March 2024

Information

This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND) License which permits use, distribution and reproduction, provided that the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

Keywords

PDF download