Content-based image retrieval (CBIR) is an approach for retrieving similar images from a large image database. Recently, CBIR has posed new challenges in the semantic categorization of images. Different feature extraction techniques have been proposed to overcome the semantic gap problem; however, these methods suffer from several shortcomings. This paper contributes an image retrieval system that extracts local features based on the fusion of the scale-invariant feature transform (SIFT) and KAZE. The strength of the local feature descriptor SIFT complements the global feature descriptor KAZE. SIFT covers the complete region of an image using fine, highly distinctive feature points, while KAZE concentrates on the details of boundaries. The fusion of the local and global feature descriptors boosts the retrieval of images with diverse semantic classifications and also helps achieve better results in large-scale retrieval. To enhance the scalability of image retrieval, the bag of visual words (BoVW) model is used. The fusion of local and global feature representations is selected for image retrieval because SIFT effectively captures shape and texture and is robust to changes in scale and rotation, while KAZE responds strongly to boundaries and to changes in illumination. Experiments conducted on two image collections, namely Caltech-256 and Corel 10k, demonstrate that the proposed scheme appreciably enhances CBIR performance compared with state-of-the-art image retrieval techniques.
The progressive advancement of technology has led to a rapid increase in the collection of digital images as well as the image repository. Retrieval of images according to the user’s objective from the haystack of a large database is a tedious process. CBIR [
Generally, features are classified as global features and local features. Global features describe the feature distribution of an entire image and provide global descriptions based on color, shape, and texture. Several studies have addressed feature extraction based on color [
It has been observed that a single feature-based representation is not suitable for attaining a higher rate of retrieval performance. Hence several works are done based on a combination of multiple features. Some research works combined color with other features [
However, the major constraint of global features is the segmentation needed for the general description of an image. Further, they are more likely to fail in the presence of occlusion and clutter. They are sensitive to locality and fail to capture vital visual characteristics, which makes global features too rigid for image representation [
The aforementioned facts motivated us to explore the fusion of local features. Local features are capable of capturing minuscule details in images; moreover, they are compact and expressive. Local features identify salient keypoints in an image and provide more robustness to occlusion and clutter. To the authors' knowledge, the proposed work is the first to perform the feature fusion of SIFT and KAZE for image retrieval.
Another major hindrance in image retrieval is bridging the semantic gap. To bridge the semantic gap relevance feedback is used. In large image datasets, the number of local features extracted for every image may be enormous. To solve this problem, BoVW [
The fusion of the SIFT and KAZE descriptors preserves the unique properties of an image representation. The introduction of the BoVW framework enhances scalability. The inclusion of a relevance feedback (RF) system based on the user's feedback reduces the semantic gap.
The rest of this paper is organized as follows. Section 2 presents the related works. The general idea of the BoVW model, k-means, and SVM is discussed in Section 3. Section 4 addresses the local feature descriptors. Section 5 presents the proposed method. Experimental results are discussed in Section 6. Section 7 addresses the conclusion and future work.
CBIR has emerged extensively since the 1990s [
Ashraf et al. [
Bu et al. [
The Bag of visual word model is shown in
Feature extraction was performed on all the training images. Visual vocabulary consists of clustered features and these features are vector quantized. The visual word defines every cluster. Vocabulary terms are the codes in the codebook. The count of each term that appears in an image creates a normalized histogram that represents the BoVW.
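The quantization and counting step described above can be sketched as follows. This is a minimal illustration, assuming a codebook of cluster centers is already available; the function names `nearest_word` and `bovw_histogram` and the choice of L1 normalization are assumptions made here, not details given in the paper.

```python
import math


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(codebook):
        dist = math.dist(descriptor, word)  # Euclidean distance
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, codebook):
    """Count visual-word occurrences and L1-normalize the counts."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy example: two visual words, four 2-D descriptors.
toy_codebook = [(0.0, 0.0), (10.0, 10.0)]
toy_descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
toy_histogram = bovw_histogram(toy_descriptors, toy_codebook)
```

In the toy example, two descriptors fall nearest to each visual word, so each histogram bin holds half of the mass.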
In this work, we present a method for producing a robust set of features by the utilization of SIFT and KAZE. K-means clustering computes the nearest neighbor points and the cluster center. It makes use of the approximation of the nearest neighbor method for computation and it scales to a similar large size vocabulary [
Here, we choose the Hellinger kernel or Bhattacharya coefficient [
Various applications in image processing emphasize that the obtained feature should possess good uniqueness against diverse image transformations [
SIFT detects salient feature points and remains invariant to image scale and rotation. It is also robust to diverse image transformations, including illumination changes, occlusion, and several affine transforms. It entails the following four steps.
This step generates a multi-scale pyramid of the original image by applying Gaussian blur at successive scales. Details that do not persist across scales are eliminated, so the image is left with information that is invariant to scale. The algorithm searches over all image locations and scales for candidate points that are invariant to scale and orientation, using a Gaussian scale-space kernel to produce progressively blurred versions of the original image.
Gaussian convolution and interpolation are performed to create the Gaussian scale space. Difference-of-Gaussian (DoG) images, which approximate the Laplacian of Gaussian, are then computed from adjacent blurred images, followed by down-sampling.
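A DoG image can be sketched in a few lines of plain Python. The example below is only an illustration: the scale ratio `k = sqrt(2)`, the kernel radius of three standard deviations, and the clamped border handling are assumptions chosen here, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with a radius of about 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
             for k in range(-radius, radius + 1))
         for x in range(width)]
        for row in image
    ]


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def difference_of_gaussians(image, sigma, k=math.sqrt(2)):
    """Subtract the finer blur from the coarser blur (assumed ratio k)."""
    fine = gaussian_blur(image, sigma)
    coarse = gaussian_blur(image, k * sigma)
    return [[c - f for c, f in zip(coarse_row, fine_row)]
            for coarse_row, fine_row in zip(coarse, fine)]
```

A constant image yields a DoG response of zero everywhere, while an isolated bright pixel yields a strong response of opposite sign at its center, which is what makes DoG useful for blob-like structures.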
To determine a keypoint, each sample point is compared with its eight neighbors on the same scale and its nine neighbors on each of the two adjoining scales. Every pixel is thus examined across scales, and a point is selected as a keypoint only if it is an extremum among these neighbors, which yields invariance to scale.
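The neighbor comparison can be sketched as below, assuming the standard SIFT formulation in which a sample is compared with 8 neighbors on its own scale and 9 on each of the two adjoining scales (26 in total). The layout `dog_stack[scale][row][column]` and the function name are assumptions made for this illustration.

```python
def is_scale_space_extremum(dog_stack, s, y, x):
    """Check whether dog_stack[s][y][x] is a strict extremum of its 26 neighbors.

    The caller must keep s, y, and x at least one step away from every border.
    """
    value = dog_stack[s][y][x]
    neighbors = [
        dog_stack[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return value > max(neighbors) or value < min(neighbors)
```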
The number of candidate keypoints is large, and it is crucial to eliminate the unstable ones, which are degraded by noise or have low contrast. Similarly, keypoints detected along edges, which are prone to noise, should be discarded. This is done by applying the Harris corner detector, and maximum gradients in all scales are detected.
To achieve strong rotation invariance, each keypoint is assigned an orientation. The gradient magnitudes and directions around the keypoint are used to compute an orientation histogram. The direction of the highest histogram peak is assigned as the orientation of the keypoint.
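A simplified version of this orientation assignment is sketched below. It is only an illustration: the 36-bin histogram follows the common SIFT convention rather than anything stated in this paper, Gaussian weighting of the magnitudes is omitted, and the function name is hypothetical.

```python
import math


def dominant_orientation(patch, num_bins=36):
    """Return the center angle (degrees) of the strongest orientation bin."""
    histogram = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the image gradient.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # Each pixel votes for its direction, weighted by magnitude.
            histogram[int(angle // bin_width) % num_bins] += magnitude
    peak = max(range(num_bins), key=histogram.__getitem__)
    return (peak + 0.5) * bin_width
```

For a patch whose intensity increases equally along both axes, every gradient points at 45 degrees, so the function returns the center of the bin covering 40 to 50 degrees.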
Around each keypoint, a 16 × 16 neighborhood is chosen and divided into 16 sub-blocks of 4 × 4 pixels. An 8-bin orientation histogram is generated for each sub-block, so the resulting descriptor vector has 16 × 8 = 128 elements. Each histogram entry accumulates the gradient magnitudes weighted by their orientations. To match keypoints, the Euclidean distance between descriptors is calculated. The matched keypoints are called control points, and these are employed to estimate the image transformation.
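The 16 × 8 = 128 layout of the descriptor can be made concrete with a small sketch. It is deliberately simplified and rests on assumptions: there is no rotation to the keypoint orientation, no Gaussian weighting, and no trilinear interpolation, and the function name is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Build a 128-element, L2-normalized descriptor from a 16x16 patch."""
    assert len(patch) == 16 and all(len(row) == 16 for row in patch)
    descriptor = []
    for block_y in range(0, 16, 4):
        for block_x in range(0, 16, 4):
            bins = [0.0] * 8  # one 8-bin histogram per 4x4 sub-block
            for y in range(block_y, block_y + 4):
                for x in range(block_x, block_x + 4):
                    # Central differences with indices clamped at the borders.
                    dx = patch[y][min(x + 1, 15)] - patch[y][max(x - 1, 0)]
                    dy = patch[min(y + 1, 15)][x] - patch[max(y - 1, 0)][x]
                    angle = math.degrees(math.atan2(dy, dx)) % 360.0
                    bins[int(angle // 45.0) % 8] += math.hypot(dx, dy)
            descriptor.extend(bins)
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm else descriptor


def descriptor_distance(a, b):
    """Euclidean distance used to match two descriptors."""
    return math.dist(a, b)
```

Two keypoints are then considered a candidate match when `descriptor_distance` between their vectors is small relative to the distances to other descriptors.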
The KAZE algorithm detects and describes image features based on nonlinear diffusion filtering and a conductivity function, operating in a nonlinear scale space. The conductivity function proposed by Perona and Malik depends on the gradient of a Gaussian-smoothed version of the original image, evolving over time. To build the adaptive blur, nonlinear diffusion filtering is combined with the additive operator splitting (AOS) method. In SIFT, Gaussian blurring smooths details and noise to the same degree and does not preserve the natural boundaries of the image, whereas in KAZE the blur is made adaptive to the image features.
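$$\frac{\partial L}{\partial t} = \operatorname{div}\left(c(x, y, t) \cdot \nabla L\right)$$

where div and $\nabla$ are the divergence and gradient operators, respectively, and $c(x, y, t)$ is the conductivity function.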
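KAZE utilizes the combination of nonlinear diffusion filtering with a conductivity function. The conductivity function of Perona and Malik is represented as:

$$c(x, y, t) = g\left(\left|\nabla L_{\sigma}(x, y, t)\right|\right)$$

where $\nabla L_{\sigma}$ is the gradient of a Gaussian-smoothed version of the original image $L$.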
KAZE computes the scale-normalized response of the determinant of the Hessian (DoH) matrix. DoH is a blob detection method with automatic scale selection. The locations where the response attains a maximum are the candidate keypoints.
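A scale-normalized DoH response at a single pixel can be sketched as follows. This is an illustration under stated assumptions: the second derivatives are taken with plain finite differences rather than the derivative filters a real KAZE implementation would use, the normalization by `sigma ** 2` follows the form commonly given for KAZE and should be treated as an assumption here, and the function name is hypothetical.

```python
def hessian_determinant_response(image, y, x, sigma):
    """Scale-normalized determinant of the Hessian at an interior pixel."""
    # Second-order finite differences along each axis.
    lxx = image[y][x + 1] - 2 * image[y][x] + image[y][x - 1]
    lyy = image[y + 1][x] - 2 * image[y][x] + image[y - 1][x]
    # Mixed derivative from the four diagonal neighbors.
    lxy = (image[y + 1][x + 1] - image[y + 1][x - 1]
           - image[y - 1][x + 1] + image[y - 1][x - 1]) / 4.0
    return sigma ** 2 * (lxx * lyy - lxy ** 2)
```

A blob-like intensity profile gives a positive response, whereas a saddle gives a negative one, so thresholding and maximum selection on this value favor blob centers.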
In this process, the dominant orientation is computed in a circular area of radius 6s, where s is the scale, using a sliding orientation window of size π/3. The first-order derivatives Lx and Ly are computed and weighted with a Gaussian centered at the keypoint. The responses falling within each position of the sliding window are summed into a vector, and the orientation of the longest such vector is assigned as the dominant orientation of the keypoint.
SIFT deals with the entire region of the image through its salient keypoints, whereas KAZE focuses on the details present at boundaries. By combining KAZE and SIFT, the following characteristics are achieved. Definite boundary representation. Detection of salient keypoints. Property of uniqueness.
SIFT smooths details and noise to the same degree and therefore fails to preserve boundaries, whereas KAZE preserves them. The illustration of SIFT and KAZE keypoints is shown in the
In an image when compared to other areas, the salient region is sparser, and the keypoints present in this region have the capacity of unique identification of an image. SIFT helps to detect these salient keypoints and attain better responses in these regions.
To differentiate images against the background, the keypoints near the object boundaries play a vital role. The utilization of KAZE keypoints addresses this requirement and achieves the property of uniqueness. Although the feature descriptor plays a vital role in CBIR, the semantic gap is the most prominent concern influencing the performance of a CBIR system. To bridge this gap, relevance feedback (RF) is combined with this robust image representation, obtained by fusing distinctive local and global features of the image. The RF method incorporates user feedback to enhance retrieval performance and return related images [
The method deploys SIFT and KAZE techniques to extract keypoints and calculate feature descriptors.
The acquired image set requires preprocessing, such as converting RGB images to grayscale and resizing them. A grayscale image provides better density details. Digital filtering is applied to eliminate noise and inconsistency, and normalization is achieved by scaling.
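Two of these preprocessing steps, grayscale conversion and normalization by scaling, can be sketched as below. The luma weights (0.299, 0.587, 0.114) and the min-max scaling to the range [0, 1] are common conventions assumed here, not values stated in the paper; resizing and filtering are left out, and the function names are hypothetical.

```python
def rgb_to_gray(image):
    """Convert rows of (r, g, b) tuples to rows of luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]


def min_max_normalize(gray):
    """Scale intensities linearly so that they span [0, 1]."""
    values = [v for row in gray for v in row]
    low, high = min(values), max(values)
    if high == low:
        return [[0.0] * len(row) for row in gray]
    return [[(v - low) / (high - low) for v in row] for row in gray]


# Toy 2x2 image: white, black, pure red, pure blue.
toy_image = [[(255, 255, 255), (0, 0, 0)],
             [(255, 0, 0), (0, 0, 255)]]
toy_gray = rgb_to_gray(toy_image)
toy_normalized = min_max_normalize(toy_gray)
```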
First, the image features are extracted as keypoints by applying the SIFT and KAZE feature descriptors. The extracted features contain rich local information and predominant image patches. SIFT captures the spatial structure and the local orientation distribution around the keypoints. A multi-neighborhood strategy is applied to find the KAZE features. For a large volume of data, the extracted features may have a high dimension and contain redundant and correlated data, so they cannot be processed directly. In this case, we employ principal component analysis (PCA) to reduce the dimensionality of the features and achieve better precision. SIFT and KAZE are fused with the help of canonical correlation analysis in a BoVW model, which enhances classification performance.
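The dimensionality reduction step can be illustrated with a small PCA sketch. It extracts only the first principal component by power iteration on the covariance matrix, which is a simplification chosen to keep the example dependency-free; a practical system would use a linear algebra library and retain several components. The function name and iteration count are assumptions.

```python
import math


def top_principal_component(data, iterations=200):
    """Return the first principal direction and the 1-D projections of data."""
    n, d = len(data), len(data[0])
    mean = [sum(column) / n for column in zip(*data)]
    centered = [[v - m for v, m in zip(row, mean)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0.0:
            break
        vector = [v / norm for v in product]
    projections = [sum(c * v for c, v in zip(row, vector)) for row in centered]
    return vector, projections
```

For points lying on the line y = 2x, the recovered direction is proportional to (1, 2), and each point is reduced from two coordinates to a single projection value.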
The BoVW approach creates the vocabulary by vector quantization of the feature descriptors. K-means clustering is used to cluster the descriptors, and a histogram is then computed over the clustered descriptors. The clustering proceeds as follows:
1. Select the initial cluster centroids for the vocabulary randomly.
2. Calculate the distance between each feature vector and each cluster centroid.
3. Allocate every feature vector to the cluster with the nearest centroid (minimum distance).
4. Recalculate each centroid as the mean of the feature vectors allocated to it.
5. Repeat steps 2 to 4 until the assignments no longer change.
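The clustering procedure described above can be sketched as a plain k-means loop. This is a minimal illustration: the seeded random initialization, the iteration cap, and the function name `kmeans` are assumptions, and the paper's MATLAB implementation may differ.

```python
import math
import random


def kmeans(vectors, k, max_iterations=100, seed=0):
    """Cluster vectors into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Pick k distinct input vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assign every vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

The returned centroids play the role of visual words: each descriptor is represented by the index of the centroid it was assigned to.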
Once related features have been allocated to the same cluster, visual words are formed and stored in the codebook. Gathering all similar features under the same code yields indexed visual words, which together constitute the visual dictionary. In large-scale image retrieval systems, BoVW uses index pruning to reduce the retrieval cost. The idea is to identify and remove the images that are unlikely to contribute to the top results.
SVM is then used to classify the images, with the concatenated histograms of code words in
The cross-validation is done on the training dataset and the finest value for the regularization parameter
The retrieved images are re-ranked according to the relevance feedback provided by the user. RF is the process of refining the results returned by the CBIR system over the iterations of an interaction session. In the proposed method, the user marks the relevant retrieved images as positive samples and the irrelevant ones as negative samples. Based on these positive and negative samples, the proposed method refines the image retrieval results. These steps are repeated over several iterations, and the user's preference is learned from the positive samples. Thus, the proposed method reduces the semantic gap, and the search strategy improves as the number of iterations increases.
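The paper does not specify how the positive and negative samples update the ranking. As a stand-in only, the sketch below uses the classic Rocchio query update, which moves the query vector toward the positive samples and away from the negative ones before re-ranking by Euclidean distance. The weights `alpha`, `beta`, and `gamma` and all function names are assumptions, not the authors' method.

```python
import math


def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """Shift the query toward positive samples and away from negative ones."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    pos_mean, neg_mean = mean(positives), mean(negatives)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos_mean, neg_mean)]


def rank_by_distance(query, database):
    """Return database indices ordered from nearest to farthest."""
    return sorted(range(len(database)),
                  key=lambda i: math.dist(query, database[i]))


# One feedback round on a toy database of 2-D feature vectors.
toy_database = [[1.0, 0.0], [0.0, 1.2], [5.0, 5.0]]
toy_query = [0.0, 0.0]
initial_ranking = rank_by_distance(toy_query, toy_database)
refined_query = rocchio_update(toy_query,
                               positives=[toy_database[1]],
                               negatives=[toy_database[0]])
refined_ranking = rank_by_distance(refined_query, toy_database)
```

After the user marks the second image as relevant and the first as irrelevant, the refined query ranks the second image first.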
The following algorithm builds an efficient visual dictionary of images for the different features and proposes a high-performance classification model.
1. 2. 3. 4.
BoVW model ()
  a. Preprocess the image
  b. Creating-dictionary (
  c. Feature-extraction (
  d. K-means (data,
  e. SVM-training (samples, label)
  f. Classify-image (
  g. Classification-result (
  h. Feature-extraction (
    i. SIFT-key (
    j. KAZE (
    k. Generate feature vector
    l. Apply PCA
  }
  m. Set initial centers of clusters,
  n. Classify each vector
  o. Recalculate the cluster center
  p. Assign the label.
  q. }
build SVM-kernel classifier
classify-image (
  input image for classification
  match (
  //extract local descriptor for testing image
    i. SIFT-key (
    ii. KAZE-key (
  Assign the descriptor to visual modeling
  Compare images using SIFT-key and KAZE-key
  //classify image using SVM
  SVMClass (Test Feature),
  Display classification result
The simulations were performed in MATLAB on a 64-bit Windows 7 machine with 4 GB of RAM and an Intel Core i5 CPU running at 3.1 GHz.
To evaluate the performance of the proposed method, two standard datasets, Caltech 256 and Corel 10 K, are used.
Data sets | Number of categories | Number of images | Size of the image | URL |
---|---|---|---|---|
Caltech 256 | 256 | 30,607 | 300 × 200 pixels | |
Corel 10 K | 100 | 10,000 | 192 × 128 pixels | |
The corresponding sample images are shown in
The image retrieval performance is evaluated on the test image dataset. Images are retrieved according to the closeness of their classifier score values, and the image class is determined by the classifier output label. To validate the retrieved images, the Euclidean distance is computed between the scores of the images stored in the image database and the score of a given query image. The percentage of selected features is varied to reduce the computational cost. It is observed that increasing the size of the dictionary improves image retrieval performance. To find the best results, the dictionary size is varied (i.e., 100, 200, 300, 400, 600, 800, 1000, and 1200).
In this work, we consider the dictionary size and the percentage of features per image as two important parameters that influence CBIR performance. To find the best performance of the proposed technique, dictionaries of different sizes are used with different feature percentages (i.e., 10%, 25%, 50%, 75%, and 100%) per image. Precision is the number of correctly retrieved images over the total number of retrieved images from the test image database. It measures the specificity of an image retrieval system, represented as:

$$\text{Precision} = \frac{\text{Number of correctly retrieved images}}{\text{Total number of retrieved images}}$$
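Recall is the ratio of correctly retrieved images to the total number of relevant images of that semantic class in the image database. It measures the sensitivity of the image retrieval system, represented as:

$$\text{Recall} = \frac{\text{Number of correctly retrieved images}}{\text{Total number of relevant images in the database}}$$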
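Precision and recall as defined here, together with mean average precision, can be computed with a few short functions. The sketch assumes that results are given as ranked lists of image identifiers and that average precision is the mean of the precision values at the ranks where relevant images appear, which is the usual definition but is not spelled out in the paper; the function names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant item."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """Average the AP over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy query: four retrieved images, three relevant images in the database.
toy_ranked = ["a", "b", "c", "d"]
toy_relevant = {"a", "c", "e"}
toy_precision, toy_recall = precision_recall(toy_ranked, toy_relevant)
toy_ap = average_precision(toy_ranked, toy_relevant)
```

In the toy query, two of the four retrieved images are relevant, giving a precision of 0.5 and a recall of 2/3.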
The performance is measured by mean average precision and accuracy. The performance analysis measured in terms of the mean average precision (mAP) of the proposed method is presented in
mAP analysis based on different dictionary sizes

Selected feature | 100 | 200 | 300 | 400 | 600 | 800 | 1000 | 1200 |
---|---|---|---|---|---|---|---|---|
25% | 71.63 | 73.45 | 74.58 | 76.62 | 77.52 | 79.68 | 83.99 | 84.23 |
50% | 76.21 | 77.68 | 76.91 | 78.91 | 79.96 | 82.11 | 84.72 | 84.98 |
75% | 78.32 | 79.32 | 81.89 | 82.87 | 83.87 | 86.22 | 89.94 | 90.12 |
mAP | 75.38 | 76.81 | 77.79 | 79.46 | 80.45 | 82.63 | 86.21 | 90.55 |

mAP analysis based on different dictionary sizes

Selected feature | 100 | 200 | 300 | 400 | 600 | 800 | 1000 | 1200 |
---|---|---|---|---|---|---|---|---|
25% | 77.13 | 78.12 | 79.11 | 81.98 | 82.61 | 83.68 | 84.99 | 85.62 |
50% | 79.21 | 80.68 | 81.93 | 83.94 | 84.96 | 85.23 | 86.72 | 86.74 |
75% | 83.28 | 84.78 | 85.91 | 86.58 | 86.78 | 88.39 | 89.52 | 89.81 |
mAP | 79.87 | 81.19 | 82.31 | 84.16 | 84.78 | 85.76 | 87.07 | 87.39 |

mAP value by dictionary size

Method | 100 | 200 | 300 | 400 | 600 | 800 | 1000 | 1200 |
---|---|---|---|---|---|---|---|---|
SIFT & FREAK based on BoVW | 71.13 | 73.99 | 74.12 | 74.55 | 74.87 | 74.25 | 73.28 | 76.02 |
SIFT & LIOP based on BoVW | 74.1 | 77.2 | 77.92 | 78.12 | 79.24 | 82.9 | 74.64 | 76.04 |
Proposed method (SIFT & KAZE based on BoVW) | 79.87 | 81.19 | 82.31 | 84.16 | 84.78 | 85.76 | 87.07 | 87.39 |
The
In CBIR, the image descriptor plays a vital role in assessing the similarities among images. The proposed method, based on the feature fusion of SIFT and KAZE, enhances the scalability of CBIR. The PCA employed in this method mitigates overfitting by reducing the feature dimension. The RF employed in this method reduces the semantic gap between high-level and low-level features while preserving the original topology of the high-dimensional space. It is observed that the fusion of the SIFT and KAZE feature descriptors overcomes multi-scale issues. The performance comparison shows that the proposed method achieves a better mean average precision. In the future, we plan to deploy pre-trained convolutional neural networks.