In the last decade, there has been remarkable progress in object detection and recognition, driven by the availability of high-quality color images together with the depth maps provided by RGB-D cameras. These enable artificially intelligent machines to detect and recognize objects and to make real-time decisions in a given scenario, and depth cues can further improve the quality of object detection and recognition. The main purpose of this research study is to find an optimized way of detecting and identifying objects; to this end, we propose object detection techniques evaluated on two RGB-D datasets. The proposed methodology extracts surface normals from depth maps and then performs clustering using the Modified Watson Mixture Model (mWMM). Clustering is challenging when image quality is poor; hence, the proposed RGB-D-based system uses depth cues for segmentation with the help of mWMM. It then extracts multiple features from the segmented images. The selected features are fed to an Artificial Neural Network (ANN) and a Convolutional Neural Network (CNN) for detecting objects. We achieved a mean accuracy of 92.13% on the NYU V1 dataset and 90.00% on the ReDWeb V1 dataset. Finally, the results are compared, and the proposed model with CNN outperforms other state-of-the-art methods. The proposed architecture can be used in autonomous cars, traffic monitoring, and sports scenes.
The majority of current RGB-D datasets were gathered with depth sensors such as Kinect or LiDAR. Kinect can only be used for indoor scenes, whereas LiDAR is frequently employed for outdoor scenes. Due to the variety of situations, it is challenging to achieve good results in the wild when training on outdoor scene datasets. The accessibility of publicly available datasets such as ImageNet [
To develop robots capable of perceiving the environment as humans do, researchers have focused heavily on Scene Semantic Recognition (SSR), the automated analysis of object placements [
One difficulty encountered in 3D image processing is the mathematically consistent blending of point clouds gathered from multiple angles. Various object detection techniques have been proposed for RGB-D images and videos. Some systems localize objects based on depth maps, for example by establishing a Conditional Random Field (CRF) model and a system to comprehend indoor scenes. Utilizing 3D characteristics yields high-precision outcomes. One proposed method extracts a fusion of characteristics such as depth edges, 3D shapes, and size features; using this fusion, considerably high performance was obtained with RGB-D images [
An approach based on multi-object categorization is proposed to conduct scene classification on a variety of benchmark datasets in order to circumvent the issues inherent in scene classification. The suggested approach first preprocesses the images. In the second stage, the modified Watson Mixture Model (mWMM) technique is used to generate efficient segmentation results, and clustering is conducted. Multiple characteristics are extracted in the third stage, including 3D point clouds, shape features, and a bag of words. In the final phase, the features are provided to two distinct architectures, an Artificial Neural Network (ANN) and a CNN [ An approach for multi-object detection and scene understanding based on the modified WMM, ANN, and CNN (VGG-16) is proposed. The main contributions of this work are improved segmentation for the detection of multiple regions of different objects using the modified WMM, and novel 3D geometric features for scene understanding that have refined the scene recognition accuracy with both ANN and CNN architectures. The proposed model's efficiency and effectiveness are validated on two different publicly available datasets, and its outcomes are compared with other state-of-the-art approaches. Section 2 discusses relevant work. Section 3 describes the approach and the suggested scene categorization system in depth. Section 4 provides an analysis of the experimental outcomes and a comprehensive explanation of the data. Section 5 contains the paper's summary.
Our technique is connected to a large body of work on both CNNs for fusion and machine vision. Furthermore, we briefly analyze the appropriateness of CNN-based depth estimation. A comprehensive literature review of CNNs for these three parameters is beyond the scope of this research; therefore, we present a brief summary of existing studies, with a focus on more recent publications.
Reference [
Early fusion [
It has been demonstrated that CNN-based techniques perform better than classical handcrafted-feature methods such as HOG [
The majority of CNN-based detectors consist of one-stage processes (e.g., SSD [
Because most images include numerous bounding boxes, extensive textural variations, and intricate geometric elements, depth estimation remains a significant obstacle in image interpretation. Diverse depth estimation strategies employing supervised [
CNNs exhibit promising effectiveness for this task, but supervised techniques require costly and time-consuming datasets with extensive labelling. To overcome this issue, various studies apply a self-supervised learning approach to estimate depth maps from unlabeled video sequences. Self-supervised learning approaches overcome the need for labelled data by training a network to predict the appearance of a target image from the viewpoint of another image. Df-net [
In this section, the suggested architecture for object detection is discussed.
A number of researchers have used surface normals in their recent work. In depth images, surface normals are unit vectors with 3D properties that describe the orientation of the pixels. The most common method used to compute normals is the plane fitting method [
With a depth sensor,
The sequence diagram presented in
During data classification, the surface normals of the depth images are computed. Normals are produced for each pixel by choosing neighboring pixels within a depth threshold and estimating a least-squares surface. Then, the modified WMM is used to cluster the normals. Each cluster in the output is a collection of pixels from the same surface region.
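The per-pixel normal estimation described above can be sketched as follows. This is a minimal illustration of least-squares plane fitting over a local neighborhood, not the paper's exact implementation; the window size `k` and the loop-based fitting are simplifying assumptions.

```python
import numpy as np

def depth_to_normals(depth, k=1):
    """Estimate a unit surface normal per pixel from a depth map.

    For each pixel, a plane z = a*x + b*y + c is fit by least squares to
    the (2k+1)x(2k+1) neighborhood; the normal is (-a, -b, 1), normalized.
    """
    h, w = depth.shape
    normals = np.zeros((h, w, 3))
    ys, xs = np.mgrid[-k:k + 1, -k:k + 1]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones((2 * k + 1) ** 2)])
    # The neighborhood grid is fixed, so the pseudo-inverse is computed once.
    A_pinv = np.linalg.pinv(A)
    for i in range(k, h - k):
        for j in range(k, w - k):
            z = depth[i - k:i + k + 1, j - k:j + k + 1].ravel()
            a, b, _ = A_pinv @ z          # fitted plane coefficients
            n = np.array([-a, -b, 1.0])
            normals[i, j] = n / np.linalg.norm(n)
    return normals
```

For a depth map that is a perfect inclined plane, every interior normal points in the same direction, which makes the routine easy to sanity-check.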
The modified Watson Mixture Model is a generative model, which assumes that the data samples are drawn from a mixture of multivariate Watson distributions (mWDs) [
After soft clustering, a set of mWMMs is produced using hierarchical agglomerative clustering. In the end, a model selection method is applied to select the optimal mWMM. In the proposed model, more than one mWMM is generated for a single depth map, and the optimal mWMM, which gives the best clusters, is picked. The proposed technique retains distributional information acquired from depth gradients and needs no repetitive numerical calculation during the optimization procedure. Moreover, the approach provides a lower bound on the marginal likelihood. Results are shown in
The segmentation step starts with the initialization of the variational parameters. The optimization of the variational posterior distribution involves a series of alternating steps. First, the current distributions over the model parameters are used to evaluate the responsibilities. Next, these responsibilities are used to re-estimate the posterior distribution over the parameters. The process is guaranteed to converge, as the lower bound increases monotonically. A summary of the process is presented in Algorithm 1.
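The alternation between evaluating responsibilities and re-estimating parameters can be illustrated with a simplified EM-style loop for a Watson mixture. This sketch makes two assumptions that differ from the paper's variational mWMM: the concentration parameter `kappa` is held fixed rather than estimated, and point estimates replace full posterior distributions over the parameters.

```python
import numpy as np

def watson_mixture_em(X, n_clusters, kappa=20.0, n_iter=50):
    """EM-style clustering of unit vectors with a mixture of Watson
    distributions, f(x; mu, kappa) ∝ exp(kappa * (mu·x)^2).

    Simplified vs. the variational mWMM: kappa is fixed, and point
    estimates stand in for posterior distributions over parameters.
    """
    n, d = X.shape
    # Deterministic init: top principal axes of the data scatter matrix.
    _, v0 = np.linalg.eigh(X.T @ X)
    mus = v0[:, -n_clusters:].T.copy()
    pis = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # E-step: responsibilities from the (axial) Watson log-densities.
        logp = kappa * (X @ mus.T) ** 2 + np.log(pis)   # (n, K)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: mu_k is the leading eigenvector of the weighted
        # scatter matrix S_k = sum_n r_nk x_n x_n^T.
        pis = r.mean(axis=0)
        for k in range(n_clusters):
            S = (X * r[:, k:k + 1]).T @ X
            _, v = np.linalg.eigh(S)
            mus[k] = v[:, -1]                           # top eigenvector
    return r.argmax(axis=1), mus
```

Because the Watson distribution is axial (it scores x and -x identically), it is a natural fit for surface normals, whose sign convention is ambiguous.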
There are various feature extraction techniques, including spatiotemporal motion variation features [
Some geometric features that describe a given 3D lattice model lie on the model's surface. 3D geometric information plays a significant role in many problems related to computer vision applications.
Be that as it may, these features have a scale-dependent nature, reflecting the relative variation in the spatial extents of nearby geometric structures. For this reason, a scale-space type of representation is constructed that reliably encodes the scale variability of the surface geometry. The given geometry is represented by its surface normals, and a dense, regular 2D field of them is computed by parameterizing the surface on a 2D plane. A scale-space of this surface normal field is then built by deriving and applying a scale-space operator that correctly accounts for geodesic distances on the surface. A 2D representation of the 3D geometry, given as a 3D lattice model, is obtained by first unfolding the surface of the model onto a 2D plane. A significant set of scale-dependent features can be acquired from the resulting normal-space representation.
Geometric edges and sharp points are extracted at various scales. To establish these edges, the first- and second-order derivatives of the depth map are obtained carefully. The outcome is a set of scale-dependent 3D geometric features that provide a rich and distinctive basis for representing 3D shape. In
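The derivative-based edge extraction above can be sketched in a few lines. This is an illustrative simplification, not the paper's multi-scale pipeline: the thresholds are assumed values, and only one scale is shown. First-order gradients respond to depth discontinuities (jump edges), while the second-order Laplacian responds to creases where surface orientation changes.

```python
import numpy as np

def depth_edges(depth, grad_thresh=0.5, lap_thresh=0.25):
    """Detect geometric edges in a depth map from its first- and
    second-order derivatives (thresholds are illustrative)."""
    gy, gx = np.gradient(depth)          # first-order derivatives
    grad_mag = np.hypot(gx, gy)
    gyy, _ = np.gradient(gy)             # second-order derivatives
    _, gxx = np.gradient(gx)
    laplacian = np.abs(gxx + gyy)
    return (grad_mag > grad_thresh) | (laplacian > lap_thresh)
```

A multi-scale variant would smooth the depth map with Gaussians of increasing width before differentiating, collecting edges at each scale.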
In feature extraction, specific geometric forms, such as cylindrical, rectangular, and other configurations, are used to extract significant features from an image. In the suggested approach, shape fitting on the depth values of the object shapes is utilized. From depth information, this can be attempted by fitting a quantized model based on boundaries, shape priors, and detected spatial properties. A contour-based approach often has limited applicability beyond specific shapes, such as generalized cylinders. In some scenarios, objects in a cluttered environment are divided into region-of-interest (ROI) parts. In this section, a straight-line strategy is applied to detect the shape of our regions of interest. First of all, the region contour is extracted by a boundary-tracing technique.
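As a minimal sketch of the contour extraction step, the following marks the boundary pixels of a binary ROI mask: pixels inside the region that have at least one 4-neighbor outside it. This is an assumed, simplified stand-in for the boundary-tracing technique referenced above.

```python
import numpy as np

def region_contour(mask):
    """Contour of a binary ROI mask: region pixels with at least one
    4-neighbor outside the region."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior only if all four 4-neighbors are in the region.
    interior = (padded[1:-1, :-2] & padded[1:-1, 2:] &
                padded[:-2, 1:-1] & padded[2:, 1:-1])
    return m & ~interior
```

The resulting contour pixels can then be approximated by straight-line segments to recover the region's shape.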
A bag-of-words representation of the image features has also been used for better scene categorization. Since this algorithm does not account for spatial relationships between the features, it is bound to miscategorize some scenes.
The process starts with the extraction of features from the training set; these features are used to build a vocabulary that helps in image classification. The next step is to cluster all the features found in the images and to define the vocabulary words as the cluster centers. Each feature extracted from an image is then assigned to the closest word in the vocabulary. In this way, a bag-of-words representation is made, and a histogram is constructed for each bag of words.
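The assignment and histogram steps above can be sketched as follows, assuming the vocabulary (cluster centers, e.g., from k-means on training descriptors) has already been built. This is an illustrative encoding routine, not the paper's exact implementation.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Bag-of-words encoding: assign each local descriptor to its nearest
    vocabulary word (cluster center) and return the normalized histogram.
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```

The normalized histogram is what gets passed to the classifier as the image-level feature vector.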
Two different methods have been used for multi-object recognition and classification: ANN and CNN. Both are robust in multi-object and scene classification problems.
An ANN is a computational model used for modelling non-linear statistical data. It is a tool inspired by the nervous system that replicates the brain's learning process. An ANN discovers correlations in data, or input-output relationships, using artificially generated neurons.
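For illustration, the forward pass of a small feed-forward ANN can be written out directly; the one-hidden-layer shape shown here is an assumption for clarity, not the paper's exact architecture.

```python
import numpy as np

def ann_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer ANN: a ReLU hidden layer
    followed by a softmax output over the scene classes."""
    h = np.maximum(0.0, x @ W1 + b1)        # hidden activations
    logits = h @ W2 + b2
    # Numerically stable softmax over the class scores.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

In training, the weights `W1, b1, W2, b2` would be fitted by backpropagation on the extracted feature vectors.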
A pre-trained VGG-16 CNN model is also used for object classification in the proposed system.
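VGG-16 is built by stacking convolution + ReLU layers with periodic 2x2 max pooling. As a self-contained illustration of that basic building block (not the pre-trained network itself, which would typically be loaded from a deep learning framework), one such block can be written in plain NumPy:

```python
import numpy as np

def conv_relu_pool(x, kernels):
    """One VGG-style block: valid 2-D convolution with each kernel,
    ReLU, then 2x2 max pooling (pure-NumPy illustration)."""
    kh, kw = kernels.shape[1:]
    h = x.shape[0] - kh + 1
    w = x.shape[1] - kw + 1
    out = np.empty((len(kernels), h, w))
    for c, k in enumerate(kernels):           # one feature map per kernel
        for i in range(h):
            for j in range(w):
                out[c, i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    out = np.maximum(out, 0.0)                # ReLU
    ph, pw = h // 2, w // 2
    pooled = out[:, :ph * 2, :pw * 2].reshape(len(kernels), ph, 2, pw, 2)
    return pooled.max(axis=(2, 4))            # 2x2 max pool
```

VGG-16 repeats this pattern with 3x3 kernels, doubling the number of feature maps after each pooling stage, before the final fully connected classifier.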
This section describes the study's preparation and assessment procedure in detail.
Two different datasets have been used to test the proposed methodology. These datasets comprise various scenes with multiple objects and various classes. A description of these datasets is given in the next section.
ReDWeb V1 is a comprehensive database comprising a variety of photos and their corresponding relative depth maps. The ReDWeb V1 [
The NYU V1 [
In this part, the setup and assessment are explained in greater detail. In the evaluation, classification accuracy and comparability with established state-of-the-art approaches were tested by analysing all indoor photos. Owing to the strong object segmentation (mWMM), which exhibits high efficiency in object identification using the ANN and CNN architectures, the suggested system produced consistent results.
For the ReDWeb V1 dataset, the proposed system's scene classification accuracy was evaluated.
Objects | WH | CF | CR | CHR | CMP | MD | LIB | LAB | BS | COR |
---|---|---|---|---|---|---|---|---|---|---|
WH | 0.03 | 0.01 | 0.01 | 0.01 | 0.02 | 0.03 | 0.01 | 0.00 | 0.01 | |
CF | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | |
CR | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | |
CHR | 0.02 | 0.01 | 0.02 | 0.01 | 0.00 | 0.01 | 0.02 | 0.03 | 0.01 | |
CMP | 0.01 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 | |
MD | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.03 | 0.01 | |
LIB | 0.03 | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.03 | 0.01 | 0.02 | |
LAB | 0.02 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.02 | 0.02 | |
BS | 0.01 | 0.02 | 0.01 | 0.01 | 0.03 | 0.02 | 0.03 | 0.01 | 0.01 | |
COR | 0.03 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 |
Note: WH = warehouse, CR = computer room, CHR = chair, CMP = computer, MD = mobile device, LAB = laboratory, COR = corridor.
Objects | WH | CF | CR | CHR | CMP | MD | LIB | LAB | BS | COR |
---|---|---|---|---|---|---|---|---|---|---|
WH | 0.00 | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | |
CF | 0.01 | 0.02 | 0.01 | 0.01 | 0.00 | 0.02 | 0.01 | 0.01 | 0.01 | |
CR | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.00 | 0.02 | |
CHR | 0.01 | 0.00 | 0.01 | 0.02 | 0.01 | 0.02 | 0.00 | 0.00 | 0.01 | |
CMP | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.03 | 0.01 | |
MD | 0.01 | 0.02 | 0.01 | 0.00 | 0.00 | 0.01 | 0.02 | 0.01 | 0.01 | |
LIB | 0.01 | 0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | |
LAB | 0.03 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | |
BS | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.00 | |
COR | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
Note: WH = warehouse, CR = computer room, CHR = chair, CMP = computer, MD = mobile device, LAB = laboratory, COR = corridor.
In experiments with the NYU V1 dataset, a classification accuracy of 89.1% was achieved, as shown in
Objects | BR | BD | BS | CF | KIT | LR | OFF |
---|---|---|---|---|---|---|---|
BR | 0.02 | 0.02 | 0.01 | 0.03 | 0.02 | 0.03 | |
BD | 0.03 | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | |
BS | 0.02 | 0.02 | 0.03 | 0.02 | 0.01 | 0.02 | |
CF | 0.03 | 0.04 | 0.03 | 0.01 | 0.02 | 0.03 | |
KIT | 0.03 | 0.01 | 0.01 | 0.02 | 0.03 | 0.03 | |
LR | 0.03 | 0.02 | 0.02 | 0.03 | 0.03 | 0.02 | |
OFF | 0.04 | 0.02 | 0.03 | 0.01 | 0.02 | 0.02 |
Note: BR = bar room, BD = bed room, B
Objects | BR | BD | BS | CF | KIT | LR | OFF |
---|---|---|---|---|---|---|---|
BR | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 | 0.02 | |
BD | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | |
BS | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | |
CF | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | |
KIT | 0.02 | 0.01 | 0.02 | 0.01 | 0.03 | 0.01 | |
LR | 0.03 | 0.01 | 0.01 | 0.02 | 0.03 | 0.01 | |
OFF | 0.01 | 0.02 | 0.03 | 0.01 | 0.00 | 0.02 |
Note: BR = bar room, BD = bed room, B
Methods | NYU V1 | Redweb V1 |
---|---|---|
 | Mean accuracy | Mean accuracy |
K. Chen [ | – | 60.5 |
S. Gupta [ | – | 65.0 |
A. Zeng et al. [ | – | 78.1 |
Silberman et al. [ | 70 | – |
Multiscale convnet [ | 51.1 | – |
The classification results of the three classifiers, i.e., Random Forest, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN), on the NYU V1 and ReDWeb V1 datasets are reported in
Multi Objects | Random Forest | ANN | CNN | ||||||
---|---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Measures | Precision | Recall | F-Measures | Precision | Recall | F-Measures | |
WH | 0.781 | 0.770 | 0.762 | 0.879 | 0.862 | 0.874 | 0.891 | 0.890 | 0.885 |
CF | 0.750 | 0.751 | 0.759 | 0.864 | 0.859 | 0.854 | 0.889 | 0.890 | 0.885 |
CR | 0.756 | 0.749 | 0.753 | 0.880 | 0.872 | 0.880 | 0.883 | 0.880 | 0.890 |
CHR | 0.752 | 0.751 | 0.750 | 0.879 | 0.866 | 0.876 | 0.878 | 0.875 | 0.872 |
CMP | 0.739 | 0.745 | 0.749 | 0.880 | 0.878 | 0.878 | 0.890 | 0.885 | 0.880 |
MR | 0.748 | 0.740 | 0.742 | 0.879 | 0.874 | 0.875 | 0.878 | 0.875 | 0.870 |
LIB | 0.749 | 0.748 | 0.7750 | 0.880 | 0.879 | 0.879 | 0.889 | 0.880 | 0.887 |
Note: WH = warehouse, CR = computer room, CHR = chair, CMP = computer, MD = mobile device, LAB = laboratory, COR = corridor.
Multi Objects | Random Forest | ANN | CNN | ||||||
---|---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Measures | Precision | Recall | F-Measures | Precision | Recall | F-Measures | |
BR | 0.840 | 0.839 | 0.843 | 0.890 | 0.890 | 0.889 | 0.911 | 0.910 | 0.899 |
BD | 0.850 | 0.848 | 0.847 | 0.889 | 0.887 | 0.886 | 0.899 | 0.900 | 0.886 |
BS | 0.843 | 0.840 | 0.842 | 0.890 | 0.886 | 0.888 | 0.901 | 0.892 | 0.898 |
CF | 0.851 | 0.850 | 0.849 | 0.889 | 0.887 | 0.890 | 0.910 | 0.910 | 0.900 |
KIT | 0.849 | 0.850 | 0.846 | 0.890 | 0.888 | 0.882 | 0.900 | 0.887 | 0.889 |
LR | 0.851 | 0.847 | 0.845 | 0.890 | 0.884 | 0.890 | 0.910 | 0.889 | 0.899 |
In this paper, a novel and effective approach for the segmentation and classification of single and multiple objects is presented. Objects were segmented using the powerful modified Watson Mixture Model algorithm. Furthermore, many characteristics were extracted from both datasets. Experiments show that the suggested scheme outperforms previous state-of-the-art methods in terms of computation, segmentation outcomes, and precision. In future work, the authors plan to conduct an in-depth analysis of photos of outdoor spaces in order to increase the accuracy of semantic segmentation and to find a solution to the computational cost of semantic segmentation.
This study was funded by the
The authors declare that they have no conflicts of interest to report regarding the present study.