In the past several years, remarkable achievements have been made in the field of object detection. Although overall performance continues to improve, the accuracy of small object detection remains low compared with that of large object detection. In addition, localization misalignment is common for small objects, as seen in GoogLeNets and residual networks (ResNets). To address these problems, we propose an improved region-based fully convolutional network (R-FCN). The presented technique improves detection accuracy and eliminates localization misalignment by replacing position-sensitive region of interest (PS-RoI) pooling with position-sensitive precise region of interest (PS-Pr-RoI) pooling, which avoids coordinate quantization and directly computes double (two-dimensional) integrals over the position-sensitive score maps, thus preventing a loss of spatial precision. A validation experiment was conducted in which the Microsoft Common Objects in Context (MS COCO) training dataset was oversampled. Results showed an accuracy improvement of
Deep learning (DL) has become an active topic of research in the field of artificial intelligence, as it offers significant advantages in the domains of object detection, automated speech recognition, and natural language processing [
Object detection is one of the most fundamental tasks in computer vision [
Small objects of interest can be defined in two ways: by relative size (an area smaller than 12% of the original image) or by absolute size (a rectangular area smaller than 32 × 32 pixels). The absolute size was used to define small objects in the MS COCO dataset [
This paper makes two main contributions to object detection. The first is a PS-Pr-RoI pooling layer that represents relative spatial positioning and aligns coordinates without quantization, improving detection accuracy and efficiency for small objects. The second is a training procedure in which images containing small objects are oversampled and the detector is then trained on the augmented data. This approach achieved high detection accuracy on the MS COCO dataset. The proposed PS-Pr-RoI pooling layer can also be easily embedded in other detection frameworks.
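The oversampling step can be illustrated with a minimal sketch. It assumes the absolute-size definition of a small object (area below 32 × 32 pixels) and a hypothetical annotation layout in which each image is a dictionary with an `id` and a list of `(width, height)` boxes. The function names and the `rate` parameter are illustrative assumptions and are not taken from the experiments reported here.

```python
import random

SMALL_AREA = 32 * 32  # absolute-size threshold, in pixels


def is_small(width, height, threshold=SMALL_AREA):
    """Return True when the box area is below the absolute-size threshold."""
    return width * height < threshold


def oversample_small(images, rate=3, seed=0):
    """Return a training list in which every image that contains at least
    one small box appears `rate` times; all other images appear once.

    `images` is a list of dicts: {"id": ..., "boxes": [(w, h), ...]}.
    `rate` is a hypothetical setting chosen only for illustration.
    """
    augmented = []
    for image in images:
        has_small = any(is_small(w, h) for w, h in image["boxes"])
        augmented.extend([image] * (rate if has_small else 1))
    # Shuffle so that repeated images are not adjacent within an epoch.
    random.Random(seed).shuffle(augmented)
    return augmented


if __name__ == "__main__":
    data = [
        {"id": 1, "boxes": [(10, 10)]},
        {"id": 2, "boxes": [(100, 100)]},
    ]
    print([img["id"] for img in oversample_small(data)])
```

Repeating whole images keeps the annotations untouched and only changes how often small objects are seen during training.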
The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 outlines the proposed methodology, including the overall detection framework and specific implementation details. Section 4 discusses the validation experiment and analyzes corresponding results. Section 5 provides conclusions and considers avenues for future work.
There have been two key periods in object detection history: the use of traditional models and the application of deep learning [
As the technology developed further, deep learning-based object detection became more accurate than traditional methods [
Existing DL detectors can be classified as either one-stage or two-stage processes. One-stage models, such as YOLO and SSD [
As discussed above, despite the recent progress made in object detection, practical issues remain due to low image resolution, reduced information availability, and increased noise associated with small objects. After several years of development, some solutions have emerged to improve small object detection performance, including the use of smaller and denser anchors, improved anchor-matching strategies, and generative adversarial networks (GANs) [
The primary aim of this paper is to improve the accuracy of small object detection and eliminate localization misalignment. As such, a two-stage object detector (R-FCN) was selected for the study [
As shown in
The overall architecture for the presented detector can be divided into four parts: a basic convolutional neural network (ResNet-101) [
Residual networks are powerful tools that make training deeper networks possible. Prior to the development of residual networks, two potential problems arose as the number of network layers increased: decreasing accuracy and vanishing or exploding gradients [
The residual network used in this study is a modified version of ResNet-101, developed by the authors of the R-FCN [
RPNs can be used to generate RoIs for object detection. In a fast R-CNN [
where
Fully convolutional networks are suitable for image classification applications due to their strong feature extraction capabilities. However, these networks focus solely on features and do not consider relative spatial positioning, making them unsuitable for object detection. In contrast, position-sensitive score maps can represent relative object locations. In this study, PS-Pr-RoI pooling was used to generate output of the same size from RoIs of different sizes, thereby producing a more accurate relative spatial position than that of RoI pooling [
This approach can be described as follows. First, the original image is sent to a ResNet-101 [
In conventional RoI pooling, two quantization operations are performed on the RoI boundaries and bins, resulting in low position accuracy. RoI features are then updated without a gradient and thus cannot be adjusted during training. An RoI alignment step [
Precise RoI pooling does not involve any quantization operations and solves problems associated with the number of sampling points by eliminating it as a system parameter [
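As a rough illustration of quantization-free pooling, the sketch below averages a bilinearly interpolated feature map over one continuous bin. It relies on the fact that the bilinear interpolant is a sum of separable "hat" functions, so its integral over a rectangle has a closed form. The function names, the single-channel input, and the `(x1, y1, x2, y2)` bin convention are assumptions made for this example. It is not the implementation used in the experiments, and it assumes the bin lies inside the feature grid.

```python
import numpy as np


def hat_cdf(t):
    """Antiderivative of the hat function max(0, 1 - |t|), with value 0
    at t <= -1 and value 1 at t >= 1."""
    t = np.clip(t, -1.0, 1.0)
    return np.where(t < 0, 0.5 * (t + 1.0) ** 2, 0.5 + t - 0.5 * t ** 2)


def precise_roi_pool_bin(feature, x1, y1, x2, y2):
    """Average the bilinear interpolant of `feature` (shape H x W) over the
    continuous bin [x1, x2] x [y1, y2]. No coordinate is rounded."""
    h, w = feature.shape
    # Integral of each grid point's hat function over the bin, per axis.
    wx = hat_cdf(x2 - np.arange(w)) - hat_cdf(x1 - np.arange(w))
    wy = hat_cdf(y2 - np.arange(h)) - hat_cdf(y1 - np.arange(h))
    integral = wy @ feature @ wx
    return integral / ((x2 - x1) * (y2 - y1))


if __name__ == "__main__":
    ramp = np.tile(np.arange(5, dtype=float), (5, 1))  # feature value = x
    print(precise_roi_pool_bin(ramp, 0.3, 0.7, 2.9, 3.1))
    print(precise_roi_pool_bin(ramp, 0.5, 0.7, 3.1, 3.1))
```

On the ramp input, shifting the bin by 0.2 pixels along x changes the pooled value by 0.2, which is the sub-pixel sensitivity that rounding the bin boundaries would discard. Because the output is a smooth function of the bin coordinates, it is also differentiable with respect to them.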
Here,
It is evident this function works to average all continuous features, yielding the pooled response in the
The input of the PS-Pr-RoI pooling is
The position-sensitive object score for the c
The
This term was used to calculate the cross-entropy loss during training and to rank RoIs during inference [
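The scoring steps described above can be sketched as follows, assuming the standard R-FCN formulation in which the k × k pooled bin responses of each category are averaged (voting) and the resulting per-category scores are passed through a softmax. The array layout `(C + 1, k, k)` and the function names are assumptions made for illustration.

```python
import numpy as np


def vote_and_softmax(bin_scores):
    """bin_scores: array of shape (C + 1, k, k) holding the pooled
    position-sensitive responses of one RoI (C classes plus background).

    Returns the per-category voted scores and their softmax responses."""
    class_scores = bin_scores.mean(axis=(1, 2))   # average voting over k*k bins
    shifted = class_scores - class_scores.max()   # for numerical stability
    exp = np.exp(shifted)
    return class_scores, exp / exp.sum()


def cross_entropy(probs, label):
    """Cross-entropy loss of one RoI given its ground-truth category index."""
    return -np.log(probs[label])


if __name__ == "__main__":
    scores = np.zeros((3, 3, 3))
    scores[1] = 2.0                               # category 1 responds in every bin
    voted, probs = vote_and_softmax(scores)
    print(voted, probs, cross_entropy(probs, 1))
```

The same softmax responses serve both purposes mentioned in the text: they feed the loss during training and provide the ranking of RoIs at inference.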
Bounding box regression was defined in a similar way using a
Detection performance was evaluated experimentally for small objects using the mean average precision (mAP) metric, as shown in
The validation experiment was run on an Nvidia GeForce GTX TITAN X GPU. The software environment was configured as follows. All detectors were implemented in Python 3.7 using the TensorFlow deep learning framework with CUDA 10.1 and were run on Ubuntu 16.04.
The accuracy of the detector was evaluated using the MS COCO dataset, which contains a total of 80 object categories. By default, we used single-scale training and testing, following a process similar to that of Kisantal et al. [
The proposed detector is simple and requires no additional features, such as multi-scale training or testing. In addition, with the aid of PS-Pr-RoI pooling, our detector can increase object detection accuracy beyond that of some conventional algorithms.
Detector | Basis architecture | Pooling | Training data | Test data | AP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|---|---|
Faster R-CNN | ResNet-101 | RoI pooling | Train | Val | 27.2 | 6.6 | 28.6 | 45.0 |
R-FCN | ResNet-101 | PS-RoI pooling | Trainval | Test-dev | 29.2 | 10.3 | 32.4 | 43.3 |
Deformable convolutional networks | ResNet-101 + deformable convolution | Deformable PS-RoI pooling | Multi-sc trainval | Test-dev | 34.5 | 14.0 | 37.7 | 50.3 |
Precise R-FCN | ResNet-101 | PS-Pr-RoI pooling | Trainval | Test-dev | 32.9 | 16.3 | 36.2 | 44.8 |
Detector | Basis architecture | Pooling | Training data | AP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|---|
R-FCN | ResNet-101 | PS-RoI pooling | MS COCO | 29.2 | 10.3 | 32.4 | 43.3 |
Precise R-FCN | ResNet-101 | PS-Pr-RoI pooling | MS COCO | 32.8 | 15.7 | 36.2 | 45.2 |
Precise R-FCN | ResNet-101 | PS-Pr-RoI pooling | MS COCO | 32.9 | 16.3 | 36.2 | 44.8 |
In this study, the accuracy of small object detection was improved and localization misalignment was eliminated using PS-Pr-RoI pooling. This pooling module was embedded in the two-stage R-FCN detector, where it substantially increased detection accuracy for small objects. In a future study, we intend to combine PS-Pr-RoI pooling with other detectors for use in other types of image research [