^{#}These authors contributed equally to this work.

Object detection plays an important role in the sorting process of mechanical fasteners. Although object detection has been studied for many years, it remains a difficult problem in industrial settings. Edge-based model matching tolerates only a small range of illumination changes, and its matching accuracy is low. The optical flow method and the difference method are sensitive to noise and light, and CamShift tracking is less effective in complex backgrounds. In this paper, an improved target detection method based on YOLOv3-tiny is proposed. The redundant regression boxes generated by the prediction network are filtered by soft nonmaximum suppression (NMS) instead of the hard-decision NMS algorithm. In addition, a 52 × 52 detection scale is added to the network structure to improve the detection accuracy of small targets, and the MobileNetv2 basic structure block is used in the feature extraction network, which strengthens feature extraction as the number of network layers increases and improves network performance. The experimental results show that the improved YOLOv3-tiny target detection algorithm improves the detection of bolts, nuts, screws and washers. The per-class accuracy for bolts, nuts and screws improves, which suggests that the network is better able to learn objects with slightly complex features. The detection result for washers, which have a single simple shape, improves only slightly but remains higher than the recognition accuracy of the other types. The average accuracy increases from 0.813 to 0.839, an increase of about 2.6 percentage points. The recall rate increases from 0.804 to 0.821.

According to a survey, the global automotive fastener market is expected to grow by 4 billion US dollars, with an increase rate of 2.6%. In many developed countries, the demand for fasteners is more apparent. For example, the United States will maintain a 3.2% growth momentum. In the next five to six years, Germany will increase investment by 212 million US dollars. To date, the fastener market in Japan is 952.9 million US dollars. As the second largest economy in the world, the growth potential of China is 2.4% [

In recent years, deep convolutional networks have made breakthrough progress in various fields. A deep convolutional network mainly deepens the network level by a weight sharing strategy so that the network has a stronger analysis ability. Hamdia et al. developed a deep neural network (DNN) model to evaluate the flexoelectric effect in truncated pyramid nanostructures under compression conditions [

Guo et al. [

In actual production, because of variability in the assembly process and unsystematic parts management, a large number of fasteners with different specifications are mixed together, which greatly reduces the efficiency and accuracy of subsequent assembly production. Accurately separating these workpieces is therefore of great practical significance.

Target detection technology is a core technology of vision systems, including target recognition and target location technology [

In the research of Wren et al. [

In recent years, as GPU technology has matured, deep learning and neural networks have developed rapidly. In most cases, training deep learning models without graphics processors is very slow [

The target detection method based on candidate boxes is a two-stage detection method that first generates proposal boxes and then classifies and regresses the objects in them to produce the final results. In recent years, many researchers have studied target detection using this approach. In 2014, Girshick et al. [

A target detection algorithm based on regression obtains the detection position and category directly from the whole picture, and the representative algorithms are the You Only Look Once (YOLO) series. In 2016, Redmon of the University of Washington proposed the YOLO algorithm, which directly resized the original image to 448 × 448 and divided it into 7 × 7 grids [

The algorithms in the works above still have shortcomings, such as slow detection speed, large consumption of network resources, and low accuracy and recall rate. In this paper, we use deep learning to realize target detection. Because the background is simple and the target features are clear, this paper adopts the lightweight YOLOv3-tiny as the basic detection framework and improves it. The model is simple in structure and low in computational complexity and can run on mobile terminals and device terminals. Experimental results show that the algorithm achieves high detection accuracy and robustness in target detection.

As shown in

As shown in

YOLOv3-tiny is initialized by using the prior boxes obtained by the anchor mechanism. YOLOv3-tiny predicts the offset between the center point of the bounding box and the upper left corner of the corresponding grid cell [

Here, t_{x}, t_{y}, t_{w} and t_{h} are the network prediction values for each frame; they encode the offset between the center of the bounding box and the upper left corner of the target grid cell. b_{x}, b_{y}, b_{w} and b_{h} are the predicted coordinate values. c_{x} and c_{y} are the offset values of the upper left corner of the grid cell relative to the upper left corner of the original image. p_{w} and p_{h} are the width and height of the prior frame, respectively. The center point offsets t_{x} and t_{y} of the prediction frame are normalized by the sigmoid function. Therefore, the coordinate conversion formula of the bounding box is shown in
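The coordinate conversion described above can be sketched in a few lines. This is a minimal sketch that assumes the standard YOLOv3 parameterization: the center is the sigmoid of the raw offset plus the grid cell corner, and the size is the prior size scaled by an exponential. The function name `decode_box` and the symbol names t, c, p and b are illustrative assumptions, not taken from the authors' code.

```python
import math


def sigmoid(z):
    """Logistic function used to squash the raw center offsets into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert raw network outputs into a box (center x, center y, width, height).

    t_*: raw network predictions for one frame
    c_x, c_y: upper left corner of the grid cell (in grid units)
    p_w, p_h: width and height of the prior (anchor) frame
    """
    b_x = sigmoid(t_x) + c_x   # center stays inside the responsible cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior size rescaled by the predicted factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h
```

With all raw predictions equal to zero, the decoded box sits at the center of its grid cell and has exactly the prior size, which is a convenient sanity check.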

Nonmaximum Suppression, also known as the NMS algorithm, keeps the local maximum and suppresses the nonmaximum value [

Intersection over union (IOU) is an index used to measure the degree of overlap between two bounding boxes. It is the ratio of the intersection area to the union area of the predicted box and the ground-truth box. A schematic diagram is shown in
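As an illustration of this definition, the following sketch computes the IOU of two axis-aligned boxes. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Degenerate boxes with zero area give an IOU of 0 by convention here.
    return inter / union if union > 0 else 0.0
```

For two 2 × 2 boxes shifted by one unit in each direction, the intersection is 1 and the union is 7, so the IOU is 1/7.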

The flow of the NMS algorithm is as shown in
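The hard-decision NMS flow can be sketched as follows. This is a minimal sketch of the commonly used greedy procedure, not the authors' implementation; the box format `(x1, y1, x2, y2)`, the function names and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy hard NMS; returns the indices of the boxes that are kept."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Hard decision: drop every remaining box that overlaps too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Because the decision is binary, a correct detection of a neighboring object is discarded whenever its overlap with a higher-scoring box reaches the threshold.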

In the YOLOv3-tiny target detection framework, k-means clustering is used to cluster the prior box. K-means is an unsupervised learning algorithm based on distance clustering, in which k represents the number of classification categories, and means represents the average value of data samples of each category as the centroid of the category. The division is based on the nearest distance between data sample points and the centroid of each category, that is, to minimize

For data set clustering, the k-means algorithm is implemented as follows:

Randomly generate k central points as k centroids;

Calculate the Euclidean distance from each sample point to the k center points, and assign the point to the class with the smallest Euclidean distance;

Recompute the centroid of each class from the sample points assigned to it in the second step;

Determine whether the distance is reduced according to
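The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the initial centroids are drawn from the data points, the squared Euclidean distance is used for assignment, and iteration stops when the centroids no longer change. The function name `kmeans` and its parameters are illustrative, and in the anchor setting each point would be a (width, height) pair.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups.

    Returns (centroids, clusters), where clusters[j] lists the points
    assigned to centroids[j].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        # Step 4: stop once the assignment no longer moves the centroids.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

The result depends on the initial centroids, so in practice the procedure is often repeated with several seeds and the run with the smallest objective value is kept.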

During the target detection process, the positions of objects tend to overlap. The YOLOv3-tiny overall network structure model predicts three frames for each grid at each scale, which generates a large number of redundant prediction frames in the network. If NMS is used as a hard decision algorithm, it is easy to delete the frames with high overlap, which can easily lead to missed detection. These problems can directly influence the performance of the model. See Section 2.4 for the algorithm flow of NMS, and see

The principle and steps of filtering prediction frames using soft NMS algorithms are the same as those for NMS. According to

As shown in

If the hard NMS algorithm is used, the IOU between prediction frame 0 and frames 1, 2, 3 and 4 may be too large, so frames 1, 2 and 4 are treated as false frames and deleted. With soft NMS, frames 1, 2 and 4 are not deleted by mistake; instead, their confidence is attenuated by Gaussian weighting. The higher the degree of overlap, the stronger the suppression and the larger the reduction in confidence.
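The difference from hard NMS can be seen in a short sketch of Gaussian-weighted soft NMS. This is a minimal sketch, not the authors' implementation: the decay `exp(-iou^2 / sigma)` is applied to every remaining frame, and the values of `sigma` and `score_threshold` are assumptions chosen for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, final_score) pairs in the order the frames are selected."""
    scores = list(scores)  # work on a copy, scores are decayed in place
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        survivors = []
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Higher overlap gives a smaller weight and a stronger suppression.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
            if scores[i] >= score_threshold:
                survivors.append(i)
        remaining = survivors
    return keep
```

A frame that overlaps the selected frame heavily keeps a reduced confidence instead of being removed, so it can still be reported if its decayed score stays above the confidence threshold applied afterwards.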

The YOLOv3-tiny network has a light structure and can realize target detection under limited experimental equipment conditions. However, compared with deep feature extraction networks such as Darknet-53, its accuracy is low. To ensure sorting accuracy, it is necessary to improve the detection accuracy of YOLOv3-tiny, provided that the detection speed of the network still meets the requirements.

The structure of the mixed domain is composed of a channel domain and a spatial domain. The attention structure of the channel domain [

According to the functional requirements of the sorting system, the human-computer interaction software is developed on the host computer. The software interface is shown in

This experiment adopts a simpler nine-point calibration method: the mobile manipulator is moved to the nine calibration points one by one, and the corresponding manipulator coordinates are recorded and stored in the hand-eye calibration module. Then, the software obtains and saves the current pixel coordinate positions of the centers of the nine points in

When calling the trained network model for target detection, the detection results of the upper computer are shown in

For robot grabbing, the grabbed workpiece is placed at the designated position to finish the sorting task. The fastener grabbing and sorting experiment is shown in

A total of 100 fasteners, including the same number of bolts, nuts, screws and washers, are uniformly mixed in 5 batches and placed in a specific area within the camera's visual field, and the number of the four fasteners successfully sorted by the robot is recorded.

| Batch | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| | 19 | 18 | 17 | 18 | 19 |
| | 5 | 4 | 4 | 5 | 4 |
| | 5 | 4 | 5 | 4 | 5 |
| | 4 | 5 | 3 | 4 | 4 |
| | 5 | 5 | 5 | 5 | 5 |
| | 95 | 90 | 85 | 90 | 90 |
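The per-batch figures can be checked with a short calculation. This sketch assumes that the four rows holding values between 3 and 5 are the per-type counts of successfully sorted fasteners, and that each batch holds 20 fasteners (100 fasteners in 5 batches); the row labels did not survive in the table, so the variable names are illustrative.

```python
# Counts of successfully sorted fasteners, one inner list per fastener type,
# one column per batch (values copied from the table above).
per_type_counts = [
    [5, 4, 4, 5, 4],
    [5, 4, 5, 4, 5],
    [4, 5, 3, 4, 4],
    [5, 5, 5, 5, 5],
]

fasteners_per_batch = 100 // 5  # 100 fasteners split evenly over 5 batches

# Success rate of each batch in percent.
batch_rates = [
    100 * sum(column) / fasteners_per_batch for column in zip(*per_type_counts)
]

# Overall success rate over all 100 fasteners in percent.
overall_rate = 100 * sum(sum(row) for row in per_type_counts) / 100

print(batch_rates, overall_rate)
```

Under these assumptions the per-batch rates reproduce the last row of the table.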

A deep network has many parameters to train, but in practice the number of training samples may be insufficient, which leads to a risk of overfitting. A large amount of data is therefore needed to reduce this risk. Large amounts of data can be obtained either by collecting many new samples or by enhancing (augmenting) the existing samples. The former has a high cost, while the latter has a low cost. Data enhancement serves two purposes: it increases the data volume and thus the generalization ability of the model, and it improves the robustness of the model by adding various kinds of noisy data. Data enhancement is commonly divided into offline enhancement and online enhancement. Offline enhancement processes the data set samples before training, which is suitable for small data sets. Online enhancement processes each batch of samples during training, which is typically used for large data sets. Commonly used data enhancements include rotation, scaling, flipping, noise, color enhancement, and affine transformation. With data enhancement, the accuracy on the training set generally decreases, but the testing accuracy of the model improves, and the labeling workload that collecting a large amount of new data would cause is reduced. In this work, each sample is rotated, flipped, color jittered, contrast enhanced and perturbed with Gaussian noise to improve the generalization ability of the model.
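A few of these offline enhancements can be sketched without any image library. This is a minimal sketch that treats a grayscale image as a list of rows of pixel values in the range 0 to 255; the function names, the noise level `sigma` and the choice of a 90 degree rotation are assumptions made for the example.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def add_gaussian_noise(image, sigma=5.0, seed=None):
    """Add zero-mean Gaussian noise and clamp the result to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def augment(image, seed=0):
    """Offline enhancement: return the original image plus three variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        add_gaussian_noise(image, seed=seed),
    ]
```

For a detection data set, the bounding box labels have to be flipped or rotated together with the image, whereas noise and color changes leave the labels unchanged.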

The collected and enhanced training images are manually labeled with a labeling assistant tool, and all the images are randomly divided into a training set and a test set at a ratio of 8:2. During the network training stage, the stochastic gradient descent method is used to optimize the weight coefficients. Over many iterations, the network weight parameters are constantly updated, and YOLOv3-tiny and the improved target detection model are trained. The adaptive gradient (Adagrad) algorithm is well suited to sparse data, and the root mean square propagation (RMSprop) algorithm alleviates the problem that the learning rate of Adagrad drops too fast. The main advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a certain range, so the parameter updates are relatively stable. Therefore, the Adam optimization algorithm is more advantageous than RMSprop and Adagrad. After 2500 batches of iterative training of YOLOv3-tiny and its improved target detection model, the loss becomes stable, and the loss value is reduced to approximately 0.009 as shown in
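The remark about bias correction can be made concrete with a sketch of a single Adam update. This is a minimal sketch of the commonly published update rule on plain Python lists; the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are the usual textbook values and are assumptions, not the settings used in the experiments.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return (new_params, new_m, new_v).

    m and v are the running first and second moment estimates,
    t is the 1-based iteration counter.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g      # first moment (mean of gradients)
        v_i = beta2 * v_i + (1 - beta2) * g * g  # second moment (mean of squares)
        m_hat = m_i / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first iteration the corrected moments reduce to the gradient and its square, so the step has a magnitude of about `lr` regardless of the gradient scale, which is the bounded step size the paragraph refers to.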

As shown in

The improved YOLOv3-tiny model uses the same training set, test set and confidence threshold as the original YOLOv3-tiny. The two networks, trained on the same data set, are used to detect targets in pictures of the four fastener types. The confidence threshold and the soft NMS threshold are both 0.6. After repeated training, the model parameters are adjusted to their best condition, and the results of repeated sampling are shown in

As the table shows, compared with YOLOv3-tiny, the improved YOLOv3-tiny network detects bolts, nuts and screws better, and the AP of each of these classes improves (for example, from 0.768 to 0.814 for bolts), indicating that the network is better able to learn objects with slightly complex features. The result for washers (Shim in the table), which have a single simple shape feature, improves only slightly (from 0.887 to 0.892) but remains higher than that of the other categories. The mAP increases from 0.813 to 0.839, about 2.6 percentage points, and the recall rate increases from 0.804 to 0.821, nearly two percentage points. The detection speed of both models was also tested. The average detection time of the YOLOv3-tiny model is 44.71 ms per picture. Because of the additional network layers and detection scale, the improved YOLOv3-tiny is more accurate but slower, with an average detection time of 84.64 ms per picture.

Network type | Detailed category | AP | mAP | Recall | Average time per picture (ms) |
---|---|---|---|---|---|
YOLOv3-tiny | Bolt | 0.768 | 0.813 | 0.804 | 44.71 |
YOLOv3-tiny | Nut | 0.786 | | | |
YOLOv3-tiny | Screw | 0.812 | | | |
YOLOv3-tiny | Shim | 0.887 | | | |
Improved YOLOv3-tiny | Bolt | 0.814 | 0.839 | 0.821 | 84.64 |
Improved YOLOv3-tiny | Nut | 0.817 | | | |
Improved YOLOv3-tiny | Screw | 0.836 | | | |
Improved YOLOv3-tiny | Shim | 0.892 | | | |
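The summary columns can be related to the per-class values with a short calculation. This sketch assumes that mAP is the unweighted mean of the four per-class AP values and converts the average time per picture into frames per second; the dictionary keys and function names are illustrative.

```python
# Per-class AP in the order bolt, nut, screw, shim (values from the table).
ap = {
    "YOLOv3-tiny": [0.768, 0.786, 0.812, 0.887],
    "Improved YOLOv3-tiny": [0.814, 0.817, 0.836, 0.892],
}
time_ms = {"YOLOv3-tiny": 44.71, "Improved YOLOv3-tiny": 84.64}


def mean_ap(values):
    """Unweighted mean of the per-class average precision values."""
    return sum(values) / len(values)


def frames_per_second(ms_per_picture):
    """Throughput implied by an average detection time in milliseconds."""
    return 1000.0 / ms_per_picture


for name in ap:
    print(name, round(mean_ap(ap[name]), 4), round(frames_per_second(time_ms[name]), 1))

# Difference between the reported mAP values, in percentage points.
map_gain_points = (0.839 - 0.813) * 100
```

The recomputed means agree with the reported mAP values to within 0.001, and the timing figures correspond to roughly 22 and 12 pictures per second, respectively.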

Using YOLOv3-tiny and improved YOLOv3-tiny to detect different bolts, nuts, screws and washers, the description of target attributes is shown in

Name representation | Bolt | Nut | Washer | Screw |
---|---|---|---|---|
Outline color | red | blue | yellow | orange |

YOLOv3-tiny and the improved YOLOv3-tiny algorithm were used to test some images in the test set. The YOLOv3-tiny test results are shown in

To address the poor anti-interference ability and low matching accuracy of traditional edge feature matching, an improved target detection algorithm based on YOLOv3-tiny is proposed within a deep learning target detection framework. The redundant regression frames generated by network prediction are filtered by the Gaussian weighted attenuation soft NMS algorithm instead of the hard-decision NMS algorithm. The network structure adds a 52 × 52 scale to improve the detection accuracy of small targets and uses the MobileNetv2 basic structure block in the feature extraction network, which improves network performance.

The experimental results show that the improved YOLOv3-tiny target detection algorithm proposed in this paper improves the detection of bolts, nuts and screws and raises the accuracy of each of these categories. The average accuracy increases by about 2.6 percentage points, from 0.813 to 0.839, and the recall rate increases by nearly two percentage points, from 0.804 to 0.821. In summary, the improved target detection algorithm based on YOLOv3-tiny proposed in this paper has higher accuracy and stronger robustness.

In this paper, an improved target detection method based on YOLOv3-tiny is proposed, which adopts the Gaussian weighted attenuation soft NMS (nonmaximum suppression) algorithm and uses the MobileNetv2 basic structure block in the feature extraction network. The basic concept of the improved YOLOv3-tiny algorithm can be divided into two parts. The first part generates a series of candidate regions in the picture according to certain rules and then labels the candidate regions according to the positional relationship between them and the real frame of the object in the picture. An anchor box can be defined by the aspect ratio and the area (scale) of the frame, which is equivalent to a series of preset frame generation rules. According to these rules, a series of anchor boxes can be generated at any position of the image. The calculation formula for the adaptive anchor frame is shown in

The second part extracts image features by using a convolutional neural network and predicts the position and category of the candidate regions. In this way, each predicted frame can be regarded as a sample, and its label values are obtained from the position and category of the real frame it is matched with. The loss function is established by predicting the position and category through the network model and comparing the predicted values of the network with the labeled values. The confidence loss function, the classification loss function and the regression loss function of the bounding box are expressed in

The calculation of total loss is shown in

For the Soft-NMS algorithm, first, the NMS algorithm can be expressed by the

To replace the hard threshold of NMS while following the principle that the larger the IOU, the lower the score (the larger the IOU, the more likely the box is a false positive), Soft NMS can be expressed by the following formula:

However, the above formula is not continuous in the IOU, which causes abrupt jumps in the scores of the box set, so the following Gaussian Soft NMS formula is used instead (it is also the one used in most experiments):
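The three formulas referred to in this passage did not survive in the text. For orientation, the commonly cited forms of hard NMS, linear Soft NMS and Gaussian Soft NMS are sketched below. The notation is an assumption: $s_i$ is the score of box $b_i$, $M$ is the box with the current highest score, $N_t$ is the IOU threshold, $\sigma$ is the Gaussian attenuation parameter and $D$ is the set of boxes already selected.

```latex
% Hard NMS: any box whose overlap with M reaches the threshold is removed.
\begin{equation}
s_i =
\begin{cases}
s_i, & \mathrm{iou}(M, b_i) < N_t \\
0,   & \mathrm{iou}(M, b_i) \ge N_t
\end{cases}
\end{equation}

% Linear Soft NMS: the score decays linearly with the overlap above the threshold.
\begin{equation}
s_i =
\begin{cases}
s_i, & \mathrm{iou}(M, b_i) < N_t \\
s_i \left( 1 - \mathrm{iou}(M, b_i) \right), & \mathrm{iou}(M, b_i) \ge N_t
\end{cases}
\end{equation}

% Gaussian Soft NMS: a continuous penalty applied to every remaining box.
\begin{equation}
s_i = s_i \, e^{-\frac{\mathrm{iou}(M, b_i)^2}{\sigma}}, \quad \forall\, b_i \notin D
\end{equation}
```

The linear form jumps at $\mathrm{iou}(M, b_i) = N_t$, whereas the Gaussian form leaves non-overlapping boxes unchanged and increases the penalty smoothly with the overlap.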

The K-means algorithm divides a feature matrix X of n samples into k disjoint clusters. Intuitively speaking, a cluster is a group of data points that are considered to belong to the same class; the clusters are the result of clustering. The average of all data in a cluster is usually called the centroid of the cluster. In a two-dimensional plane, the abscissa of the centroid of a cluster of data points is the average value of the abscissas of the cluster data points, and the ordinate of the centroid is the average value of the ordinates. The same principle also extends to high-dimensional space.

For a cluster, the smaller the distance between all sample points and the centroid, the more similar the samples in the cluster and the smaller the difference within the cluster. There are many ways to measure distance, and we use absolute distance here. x represents a sample point in a cluster, μ represents the centroid of the cluster, n represents the number of features of each sample point, and i indexes the features of x. The distance formula from the sample point to the centroid is as follows:

The sum of squares of distances from all sample points to the cluster center is expressed by
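The within-cluster quantity described here can be sketched as follows. The formula itself is missing from the text, so this sketch assumes the squared Euclidean distance summed over the n features, which matches the phrase "sum of squares of distances"; the function names are illustrative.

```python
def centroid(points):
    """Feature-wise mean of the sample points in one cluster."""
    count = len(points)
    return tuple(sum(feature) / count for feature in zip(*points))


def squared_distance(x, mu):
    """Sum over the n features of (x_i - mu_i) squared."""
    return sum((x_i - mu_i) ** 2 for x_i, mu_i in zip(x, mu))


def cluster_sum_of_squares(points):
    """Sum of squared distances from all sample points to the cluster centroid."""
    mu = centroid(points)
    return sum(squared_distance(x, mu) for x in points)


def total_inertia(clusters):
    """Objective minimized by k-means: the cluster sums added over all clusters."""
    return sum(cluster_sum_of_squares(points) for points in clusters)
```

For example, the cluster {(0, 0), (2, 0)} has centroid (1, 0) and a sum of squares of 2; a smaller total over all clusters indicates more compact clusters.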