Plant species recognition has become an important research area in image recognition in recent years. However, existing plant species recognition methods achieve low accuracy that does not meet professional requirements. In this study, ShuffleNetV2 was therefore improved by combining currently popular attention mechanisms, convolution kernel size adjustment, convolution cropping, and CSP technology to improve accuracy and reduce the amount of computation. Six convolutional neural network models with sufficient trainable parameters were designed for discriminative learning. The SGD algorithm was used to optimize the training process to avoid overfitting or falling into a local optimum. A conventional plant image dataset, TJAU10, collected with cell phones against natural backgrounds, was constructed, containing 3,000 images of 10 plant species on the campus of Tianjin Agricultural University. Finally, the improved models were compared with the baseline version and achieved better results in both improving accuracy and reducing computational effort. The recognition accuracy tested on the TJAU10 dataset reaches 98.3%, and the recognition precision reaches 93.6%, which is 5.1% higher than that of the original model, while the computational effort is reduced by about 31% compared with the original model. In addition, the experimental results were evaluated using metrics such as the confusion matrix; the improved model can meet the requirements of professionals for the accurate identification of plant species.
It is well known that the diversity of plant species plays a very important role in many fields, such as agriculture, industry, medicine, environmental protection, and human daily life and production activities. Plants provide a large amount of food and other necessities and maintain the balance of carbon dioxide and oxygen in the atmosphere. In addition, plants are important for conserving water, preventing desertification, and improving the climate. However, in recent years, with the increase in human production activities and rapid urban development, as well as over-cultivation, global warming, severe environmental pollution, and insufficient knowledge of plant species, human beings have destroyed not only their own living environment but also the ecological balance, leading to the extinction of hundreds of plant species [
With the development of computer hardware and software, mobile devices, image processing, and machine learning [
Traditional methods for plant species recognition have several drawbacks: features must be manually designed rather than automatically learned, and multiple vision-based features (such as shape, texture, and color) often cannot be considered simultaneously, which reduces recognition accuracy [
The improved method based on the ShuffleNetV2 convolutional neural network proposed in this study achieved better results in plant image recognition compared to the traditional methods. Six convolutional neural network models were designed, containing enough trainable parameters to learn discriminative features, and the training process was optimized using the stochastic gradient descent (SGD) algorithm to avoid overfitting or falling into local optima [
There are two main novelties in this research. First, the lightweight model ShuffleNetV2 1.0X is chosen as the baseline; its advantage is that memory access cost (MAC) was considered from the beginning of its design, enabling very low latency when deployed on mobile devices. In this study, we make several optimizations to improve accuracy and reduce computation, design six models with better structures than the baseline, and test them on the constructed TJAU10 dataset to verify their generalization performance. Second, this paper not only optimizes the convolutional neural network model but also, as a distinctive feature, uses the features of the whole plant (stem, leaf, flower, fruit, etc.) for plant species recognition. The advantage is that a deep learning algorithm based on a convolutional neural network can learn plant image features autonomously, reduce human intervention, and improve the image recognition rate by excluding noise interference in natural-background plant images. Existing plant species recognition methods have low recognition accuracy that cannot meet professional needs. The purpose of this paper is to use convolutional neural network models to recognize plant species from overall plant image characteristics, by autonomously learning image features and reducing human intervention, and to further improve recognition precision through model optimization. Finally, the best-performing model was selected to meet the requirements of professionals for the accurate identification of plant species.
In November 2021, nearly 3,000 images of 10 plant species against natural backgrounds were collected with cell phones on the campus of Tianjin Agricultural University to construct the Tianjin Agricultural University 10 dataset (hereinafter TJAU10). Cell phone cameras with a resolution of 4,000 × 3,000 pixels were used, and images were acquired under different lighting conditions and from different shooting angles to ensure that the collected data had a certain complexity. To facilitate the experiments, the 10 plants were numbered from 1 to 10; the plant name corresponding to each number is shown in
Number | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---
Plant name | | | | |

Number | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---
Plant name | | | | |
The purpose of data preprocessing is to enhance the image data: suppressing distortions and strengthening the image features relevant to further processing. The preprocessing subprocess receives an image as input and produces a modified image as output for the next step, feature extraction. Preprocessing operations usually include image noise reduction [
In the mobile scenario, there are many good lightweight networks to choose from, such as Google's MobileNet series, Megvii's ShuffleNet series, Huawei's GhostNet, etc. Among these models, ShuffleNetV2 is widely used because of its clear and concise structure [
To improve accuracy, we combine ShuffleNetV2 with the SENet and SKNet attention mechanisms to obtain the ShuffleNet_SE and ShuffleNet_SK networks, respectively. We also adjust the size of the depthwise convolution kernels in the downsampling units of ShuffleNetV2: expanding all 3 × 3 depthwise convolutions in the downsampling units to 5 × 5 yields the ShuffleNet_K5 network, and expanding the 3 × 3 depthwise convolution in the single downsampling branch to 4 × 4 yields the ShuffleNet_Ks4 network.
We mainly borrow the channel attention mechanism from SENet [
For each input X, there are two main operations: squeeze and excitation. Squeeze compresses the spatial information of the features by global average pooling, reducing the original C × H × W tensor to C × 1 × 1. Excitation uses two fully connected layers: the first reduces the dimensionality from C × 1 × 1 to C/r × 1 × 1 (with ReLU activation), and the second maps the features back to C × 1 × 1 (without ReLU activation); a sigmoid then yields the weight coefficient of each channel. The weight coefficients are multiplied with the original output features of the corresponding channels to obtain new weighted features, a step called feature recalibration.
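The squeeze-and-excitation steps above can be sketched in a few lines of PyTorch (a minimal illustration written for this description, not the paper's released code; the class name and the reduction ratio r = 16 are assumptions):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pool, two FC layers, sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # squeeze to C/r
        self.fc2 = nn.Linear(channels // reduction, channels)  # restore to C

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: C x H x W -> C per sample
        s = torch.relu(self.fc1(s))       # first FC with ReLU
        s = torch.sigmoid(self.fc2(s))    # second FC, sigmoid gives channel weights
        return x * s.view(b, c, 1, 1)     # feature recalibration per channel
```

Multiplying the sigmoid output back onto `x` is the feature recalibration step: each channel is rescaled by its learned weight.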
We mainly borrowed from the attention mechanism in SKNet [
The new features U_{3×3} and U_{5×5} are obtained by passing the feature X through a small convolution kernel (3 × 3) and a large convolution kernel (5 × 5), respectively. The two new features are then summed, and weight vectors are obtained after the same squeeze and excitation operations as in SENet. Finally, each weight vector is multiplied with the corresponding U and the results are summed to obtain the new features.
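The selective-kernel fusion described above can be sketched as follows (a simplified, hypothetical PyTorch version with depthwise 3 × 3 and 5 × 5 branches, a shared squeeze, and a softmax over the two branch weights; the class name and reduction ratio are assumptions):

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    """Selective kernel sketch: fuse 3x3 and 5x5 branches with learned soft weights."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        mid = max(channels // reduction, 4)
        self.fc_z = nn.Linear(channels, mid)   # shared squeeze
        self.fc_a = nn.Linear(mid, channels)   # weights for the 3x3 branch
        self.fc_b = nn.Linear(mid, channels)   # weights for the 5x5 branch

    def forward(self, x):
        u3, u5 = self.conv3(x), self.conv5(x)
        z = torch.relu(self.fc_z((u3 + u5).mean(dim=(2, 3))))  # sum, pool, squeeze
        # softmax across the two branches so the weights sum to 1 per channel
        w = torch.softmax(torch.stack([self.fc_a(z), self.fc_b(z)], dim=0), dim=0)
        b, c = x.shape[:2]
        return u3 * w[0].view(b, c, 1, 1) + u5 * w[1].view(b, c, 1, 1)
```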
Looking at the distribution of computation in ShuffleNetV2, the depthwise convolutions account for very little of it; most of the computation lies in the 1 × 1 convolutions. ShuffleNet_K5 replaces all the 3 × 3 depthwise convolutions with 5 × 5 depthwise convolutions. In the PyTorch implementation, the padding needs to be changed from 1 to 2 so that the output feature map keeps the same resolution as before. The principle is shown in
ShuffleNet_Ks4 replaces the 3 × 3 depthwise convolution in the single downsampling branch with a 4 × 4 depthwise convolution; in the PyTorch implementation, the padding is set to 1 to keep the input and output resolutions of the feature map consistent. The principle is shown in
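Both padding choices follow the usual output-size formula out = ⌊(H + 2p − k)/s⌋ + 1. A quick PyTorch check (with an arbitrary 24-channel, 56 × 56 input) confirms that the 3 × 3/p = 1, 5 × 5/p = 2, and 4 × 4/p = 1 stride-2 depthwise convolutions all produce the same output resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 24, 56, 56)

# Stride-2 depthwise convolutions; padding chosen so all three kernel sizes
# give the same output resolution: out = (H + 2p - k) // s + 1.
dw3 = nn.Conv2d(24, 24, kernel_size=3, stride=2, padding=1, groups=24)  # baseline
dw5 = nn.Conv2d(24, 24, kernel_size=5, stride=2, padding=2, groups=24)  # ShuffleNet_K5
dw4 = nn.Conv2d(24, 24, kernel_size=4, stride=2, padding=1, groups=24)  # ShuffleNet_Ks4

print(dw3(x).shape, dw5(x).shape, dw4(x).shape)  # all (1, 24, 28, 28)
```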
To reduce the computational effort, we optimized ShuffleNetV2 using convolution cropping and the CSP technique, respectively. Cropping the less significant 1 × 1 convolution in the ShuffleNetV2 block yields the ShuffleNet_LiteConv network, and reorganizing the network using the CSP technique yields the ShuffleNet_CSP network.
Observing the blocks of ShuffleNetV2, they can be divided into two structures. One is the first (downsampling) block of each stage, which feeds the input to two branches, extracts features through the convolutions of branch1 and branch2 respectively, and then concatenates the results, as shown in
Generally, a 1 × 1 convolution is used before or after a depthwise convolution for two purposes. One is to fuse information across channels, compensating for the fact that depthwise convolution does not mix information between channels. The other is to reduce and then expand the dimensionality, as in the inverted residual module of MobileNetV2 [
We mainly draw on the network reorganization technique in CSPNet to comply with the gradient variability by integrating the feature maps at the beginning and end of the network phases. It also reduces parameters and computational effort. In addition, it improves accuracy and reduces inference time [
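The idea can be sketched as a stage wrapper that splits the channels, routes only half of them through the heavy blocks, and merges the two paths at the end of the stage (a hypothetical simplification of CSPNet, not the exact ShuffleNet_CSP structure; `blocks` stands in for a stage's ShuffleNetV2 units):

```python
import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """Cross-stage-partial sketch: half the channels bypass the blocks,
    and the two paths are fused at the end of the stage."""
    def __init__(self, channels: int, blocks: nn.Module):
        super().__init__()
        self.blocks = blocks                           # heavy path (half channels)
        self.fuse = nn.Conv2d(channels, channels, 1)   # merge the two paths

    def forward(self, x):
        part1, part2 = x.chunk(2, dim=1)   # split channels in half
        part2 = self.blocks(part2)         # only half goes through the blocks
        return self.fuse(torch.cat([part1, part2], dim=1))
```

Because only half of the feature map passes through the stage's blocks, both the parameter count and the FLOPs of the stage drop, which matches the computation-reduction goal described above.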
To compare the effect of model improvement before and after, we take the comparison between ShuffleNetV2 (baseline) and ShuffleNet_SE as an example.
The experimental software environment is the Windows 10 (64-bit) operating system, the Python language (Python 3.6), and the PyTorch framework. The computer has 24 GB of memory, an Intel(R) Core(TM) i7-8550U quad-core processor @ 1.80 GHz, and an NVIDIA MX150 graphics card with 4 GB of video memory.
The ShuffleNet_SE model is used as the standard for parameter settings. The initial learning rate is 0.01, with exponential decay and a decay factor of 0.9; the number of training epochs is 20, the batch size is 16, and the SGD algorithm is used to optimize the training process to avoid overfitting or falling into a local optimum. The same image preprocessing method and training details are adopted for all experiments in this paper.
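Under these settings, the optimizer and learning-rate schedule can be set up as follows (a sketch with a stand-in model; the real experiments train the ShuffleNet variants over a DataLoader with batch size 16):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the ShuffleNet variant being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                  # initial lr 0.01
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay factor 0.9

epochs, batch_size = 20, 16
for epoch in range(epochs):
    # ... one full pass over a DataLoader with batch_size=16 would go here ...
    optimizer.step()   # placeholder update (parameters with no gradient are skipped)
    scheduler.step()   # learning rate decays by a factor of 0.9 each epoch
```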
An epoch represents one complete pass of training over the dataset, and an appropriate number of epochs is important for the convergence of the model fit: when the dataset is large, one epoch is not enough to update the weights sufficiently, while too many epochs can lead to overfitting. In this paper, experiments were carried out with epoch values of 15, 20, and 30, respectively, and the results are shown in
Epoch | 15 | 20 | 30 |
---|---|---|---|
Accuracy (%) | 98.8 | 99.9 | 99.7 |
The computation of the attention mechanism can be summarized as two processes: the first computes the weight coefficients from the query and the keys, and the second computes a weighted sum of the values using those coefficients. The first process can be subdivided into two stages: the first stage computes the similarity or relevance between the query and each key, and the second stage normalizes the raw scores of the first stage. In this way, the computation of attention can be abstracted into the three stages shown in
In the first stage, different functions and calculation mechanisms can be introduced to compute the similarity or correlation between the query and a given key_i; the most common method is to take the vector dot product of the two, as shown in
The scores generated in the first stage have different ranges of values depending on how they are produced. In the second stage, a softmax calculation is applied to transform the first-stage scores numerically. On the one hand, this normalizes the scores into a probability distribution in which the weights of all elements sum to 1; on the other hand, the inherent mechanism of softmax emphasizes the weights of the important elements. It is generally calculated using
The result a_i of the second stage is the weight coefficient corresponding to value_i; the weighted summation is then performed to obtain the attention value, as shown in
Through the above three stages of calculation, the attention value for the query can be obtained, and most current attention mechanisms conform to this three-stage abstract computation process.
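The three stages map directly onto a dot-product attention computation; a minimal sketch for a single query (the function name and tensor shapes are illustrative assumptions):

```python
import torch

def attention(query, keys, values):
    """Three-stage attention: dot-product scores, softmax weights, weighted sum."""
    scores = keys @ query              # stage 1: similarity of the query with each key_i
    a = torch.softmax(scores, dim=0)   # stage 2: normalize scores into weights summing to 1
    return a @ values                  # stage 3: weighted sum of the value_i
```

With identical keys the scores are equal, so the softmax weights become uniform and the result is simply the mean of the values.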
Precision and recall are metrics used to evaluate the quality of the model output, and they are particularly informative when the classes are unbalanced. In information retrieval, precision measures the relevance of the returned results, while recall measures how many of the truly relevant results are returned. Accuracy is the proportion of predictions that match the known true values. These metrics are defined as follows:
(1) If an instance is a positive class and is predicted to be positive, it is a true positive (TP).
(2) If an instance is a negative class and is predicted to be negative, it is a true negative (TN).
(3) If an instance is a negative class but is predicted to be positive, it is a false positive (FP).
(4) If an instance is a positive class but is predicted to be negative, it is a false negative (FN).
Precision (P) is defined as the number of true positive classes (TP) over the number of true positive classes (TP) plus the number of false positive classes (FP), as shown in
Recall (R) is defined as the number of true positive classes (TP) over the number of true positive classes (TP) plus the number of false negative classes (FN), as shown in
Accuracy (A) is defined as the number of correctly classified samples as a percentage of the total number of samples, as shown in
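The three definitions can be checked with a small helper for a single positive class (an illustrative function written for this section, not part of the paper's code):

```python
def precision_recall_accuracy(y_true, y_pred, positive=1):
    """P = TP/(TP+FP), R = TP/(TP+FN), A = correct/total, for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn), correct / len(y_true)
```

For example, with true labels [1, 1, 0, 0] and predictions [1, 0, 1, 0], TP = 1, FP = 1, and FN = 1, so precision, recall, and accuracy are all 0.5.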
The loss function is used to evaluate the degree of inconsistency between the predicted value of the model
Precision is a measure of the relevance of the results, and the precision comparison curve is shown in
Recall is a measure of how many truly relevant results are returned, and the recall comparison curve is shown in
Accuracy is our most common evaluation metric. Usually, a higher accuracy indicates a better model, and the accuracy comparison curve is shown in
The loss function is a metric used to evaluate the degree of inconsistency between the predicted value
The confusion matrix is a tool used to evaluate the accuracy of the classification results at each iteration during the learning process of the algorithm. As shown in
To better evaluate the performance of the models, we mixed the training and test sets and trained the models using 10-fold cross-validation. Its core idea is to divide the dataset several times and average the results of the multiple evaluations, thus eliminating the adverse effects of an unbalanced split in any single division. The experimental results are shown in
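The index splitting behind k-fold cross-validation can be sketched as follows (a hypothetical helper; the experiments above use k = 10 on the mixed TJAU10 data):

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation;
    the first n % k folds get one extra sample so all n samples are tested once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Each sample appears in exactly one test fold, and averaging the k evaluation results gives the cross-validated score.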
ShuffleNet_SE, ShuffleNet_SK, ShuffleNet_K5, ShuffleNet_Ks4, ShuffleNet_LiteConv, and ShuffleNet_CSP were each trained on the TJAU10 dataset, and their final precision, parameter size, measured FLOPs, and model running time are compared with those of the baseline ShuffleNetV2. The results are shown in
Model | Precision (%) | Parameters (M) | FLOPs (M) | Model running time |
---|---|---|---|---|
ShuffleNetV2 (baseline) | 89.1 | 1.264 | 151.34 | 2 h 14 m 5 s |
ShuffleNet_SE | 90.9 | 1.406 | 151.49 | 2 h 30 m 13 s |
ShuffleNet_SK | 91.0 | 1.690 | 156.14 | 2 h 11 m 48 s |
ShuffleNet_K5 | 91.7 | 1.303 | 158.91 | 2 h 3 m 34 s |
ShuffleNet_Ks4 | 90.1 | 1.266 | 151.72 | 1 h 47 m 56 s |
ShuffleNet_LiteConv | 93.2 | 0.923 | 107.29 | 2 h 14 m 50 s |
ShuffleNet_CSP |
From the perspective of improving precision, ShuffleNet_K5 gives the best result; its running time is shorter than that of the SE and SK variants with the added attention mechanisms, but its parameter count and computation are higher. ShuffleNet_Ks4 improves precision by 1.1% over ShuffleNetV2 (baseline) with essentially the same computation and parameter size, and its running time is reduced more significantly. This indicates that the gain from the attention mechanisms is not obvious in a lightweight network, and is not as large as the gain from directly enlarging the depthwise convolution kernel. In terms of reducing computation, the ShuffleNet_CSP network is the most effective, as can be seen from
To select a more suitable improved model, the six improved network models were tested separately with other CNN models on the TJAU10 dataset under the same experimental conditions. The results of recognition accuracy on the test set are shown in
Model | Classification rate (%) |
---|---|
AlexNet | 48.8 |
VGG11 | 81.3 |
ResNet50 | 86.0 |
ResNet101 | 86.2 |
ShuffleNetV2 (baseline) | 89.7 |
ShuffleNet_SE | 98.0 |
ShuffleNet_SK | 95.5 |
ShuffleNet_K5 | 97.3 |
ShuffleNet_Ks4 | 96.8 |
ShuffleNet_LiteConv | |
ShuffleNet_CSP | 97.5 |
The CNN-based approach, which performs feature extraction and classification within the same architecture, achieves good results because its convolutional input layer acts as a self-learning feature extractor that can learn optimal features directly from the raw pixels of the input image. The integrated features learned are not limited to shape, texture, or color, but extend to specific kinds of leaf features, such as structural splits, leaf tip, leaf base, and leaf margin types [
This paper introduces a plant species recognition method based on an improved convolutional neural network and explains its importance. Among the improved models, ShuffleNet_K5 is the best from the perspective of improving precision, with a shorter running time than the models with the SE and SK attention mechanisms, but a larger parameter size and more computation. ShuffleNet_Ks4 improves precision by 1.1% over ShuffleNetV2 (baseline) with essentially the same parameter size and computational effort, and its running time is the shortest among the four precision-improving models. In terms of reducing computation, the ShuffleNet_CSP model has the most obvious effect, with reduced parameter size and running time and a clear precision improvement, although its plant image recognition precision on the overall test set is not as good as that of the ShuffleNet_LiteConv model. In general, each of the six models has its advantages and disadvantages, and all of them achieve the goal of model optimization to different degrees. A deep learning algorithm based on a convolutional neural network can learn plant image features autonomously, reduce human intervention, exclude noise interference in natural-background plant images, and improve the image recognition rate.
Although the advantages of modifying the network structure are clear, there are many other ways to change the structure to improve generalization ability and reduce computation. For example, changing the depth and width of the neural network might also yield better results.
Plants have a close relationship with human beings and the environment in which they live. Quickly identifying unknown plants without relevant expertise is an important and difficult task, because there are a large number of plant species whose leaves can vary greatly within a species yet appear similar across species. With the development of the Internet, computer hardware and software, image processing, and pattern recognition techniques, automatic plant identification based on image processing has become possible. In this paper, we show that CNNs can accurately classify plants in natural environments by improving the network structure for better feature extraction and by reducing the complexity of the network while improving accuracy. Experiments show that the improved networks obtain better features: the classification accuracy of the improved model reaches 98.3%, the recognition precision reaches 93.6%, the largest increase in recognition precision is 5.1%, and the computational effort is reduced by about 31% compared with the original model. However, our method still has some drawbacks, such as the small variety of plant images used for training and the lack of a visualization application platform. In future work, plant image data of more species can be collected, and this lightweight model can be deployed on mobile, embedded, and PC devices, respectively, to test plant species recognition in large-scale natural environments and improve usability.
The authors confirm contribution to the paper as follows: study conception and design: C.Y., T.L.; data collection: C.Y., S.S., F.G., R.Z.; analysis and interpretation of results: C.Y.; draft manuscript preparation: C.Y. All authors reviewed the results and approved the final version of the manuscript.
This study was supported by the Key Project Supported by
The authors declare that they have no conflicts of interest to report regarding the present study.