Hearing a species in a tropical rainforest is much easier than seeing it. A visitor to the forest may not be able to look around and spot every type of bird and frog present, but these species can still be heard. A forest ranger might know what to do in such situations and may be an expert at recognizing the different insects and dangerous species found in the forest, but an ordinary person traveling to a rainforest for an adventure might not even be able to recognize these species, let alone take suitable action. In this work, a model is built that takes an audio signal as input, performs intelligent signal processing to extract features and patterns, and outputs which species is present in the signal. The model works end to end on raw input, and a pipeline is also created to perform all the preprocessing steps. Different neural network architectures based on Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) are tested. Both show reliable performance: the CNN achieves an accuracy of 95.62% with a log loss of 0.21, while the LSTM achieves an accuracy of 93.12% with a log loss of 0.17. These results show that the CNN performs better than the LSTM in terms of accuracy, while the LSTM performs better than the CNN in terms of log loss. Both models are therefore combined to achieve high accuracy and low log loss: the combined LSTM and CNN model achieves an accuracy of 97.12% and a log loss of 0.16.
A long time ago (more than 50 million years), rainforests formed after tropical temperatures dropped drastically, once the Atlantic Ocean had widened enough to provide a warm, moist climate to the Amazon basin [
The presence of rainforest species can serve as an indicator of climate change and habitat loss. Since recognizing these species visually is much harder than hearing them, it is important to use acoustic technologies that can work on a global scale. Machine learning techniques can supply real-time information, enabling early-stage detection of human impacts on the environment and driving more effective conservation management decisions [
This paper is organized into sections and sub-sections that explain the various stages of the work. Section 2 gives a high-level overview of the work already done in audio signal processing by different researchers. The methodology used in this paper, including data collection and all the necessary preprocessing steps, is discussed in Section 3. Section 4 presents the formulation and evaluation of results with the LSTM and CNN models. Finally, Section 5 gives the conclusion and future scope of the work.
This section describes work related to audio classification, along with other preprocessing techniques for audio signals that have been used by other researchers to improve results.
In [
In [
In [
In [
Based on the researcher views discussed above, it is clear that much work remains to be done in this area and that several challenges persist. In this paper, techniques such as frequency-domain conversion of the audio signals and a combination of two types of models are used to improve the results.
In this section, dataset collection and preprocessing steps are discussed, along with the performance metrics used for evaluation.
The dataset used in this paper was collected from a Kaggle competition repository [
Species Name | Train | Test |
---|---|---|
 | 160 | 40 |
 | 160 | 40 |
 | 120 | 28 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
Eleutherodactylus richmond | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 116 | 28 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 160 | 40 |
 | 220 | 56 |
Setophaga angelae | 160 | 40 |
 | 132 | 36 |
 | 128 | 32 |
 | 160 | 40 |
 | 132 | 36 |
 | 320 | 80 |
Total | 3888 | 976 |
A total of 4727 audio signals were recorded, and this data was later checked manually by experts. For ~1100 of the audio files, the experts found that the device had detected the correct species; the information for these ~1100 files was stored in a separate file containing the name of the audio file, the species present in the audio, the maximum and minimum frequency of the audio, and the time at which the species was heard. For the remaining ~3600 files, the experts found that the species detected by the device was not correct, and the information for these files was stored in another file. Both files are provided in the competition repository. All the audio recordings are in .flac format, which stores a lossless digital signal; as a result, the complete dataset is over 17 GB in size. The audio files are sampled at 44100 Hz. Since a correct species label is provided only for the ~1100 files, only those files are used in this work. Preprocessing techniques such as augmentation and time-based slicing are then used to increase the data size to 4864 samples in total, of which 3888 files are used for training and 976 files for testing.
To implement a deep learning model, all the files were loaded using Librosa and then sliced according to the time at which a species is heard (e.g., if an audio file is 60 s long and a species is heard at 5-8 s, the signal at that particular time span is separated for the further steps). However, the slice boundaries should not be too precise, as that may hurt the model's generalization; so, in addition to slicing the audio file according to the given time, 0.2 s was also added at the start and at the end.
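The time-based slicing with the extra 0.2 s margin can be sketched as follows. This is a minimal illustration using plain NumPy indexing; the function name and the exact boundary handling are assumptions, not the paper's implementation.

```python
import numpy as np

SR = 44100          # sampling rate stated in the paper
MARGIN_S = 0.2      # extra context added on both sides of the annotated span

def slice_species_call(signal, t_min, t_max, sr=SR, margin=MARGIN_S):
    """Cut out the annotated time span, widened by a small margin so the
    model does not overfit to overly precise slice boundaries."""
    start = max(0, round((t_min - margin) * sr))
    end = min(len(signal), round((t_max + margin) * sr))
    return signal[start:end]

# toy example: a 60 s "recording" with a call annotated at 5-8 s
audio = np.zeros(60 * SR, dtype=np.float32)
clip = slice_species_call(audio, t_min=5.0, t_max=8.0)  # 3.4 s of audio
```

Clamping `start` and `end` to the signal boundaries keeps the margin from reading past the start or end of short recordings.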
A train-test split was performed, and new labels were generated since each file was loaded twice. Data augmentation techniques such as time stretch and pitch shift were also applied, based on a uniform distribution, to make the model generalize better, and new labels were generated for the augmented samples; random numbers drawn from a uniform distribution decided which augmentation technique was applied to each sample. Every audio signal needed to be of the same length, and data analysis showed that the lengths of the audio files differ greatly, so the maximum length was determined and all other signals were post-padded accordingly. Finally, the mean and standard deviation were computed to normalize the data.
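The uniform augmentation draw, post-padding, and normalization steps can be sketched in plain NumPy. The actual stretching and shifting would use librosa's effects module; the equal three-way split between the two augmentations and "no augmentation" is an assumption, as the paper does not state the exact probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_augmentation(rng):
    """Draw from a uniform distribution to decide which augmentation
    (if any) is applied to a sample. The 1/3-1/3-1/3 split is an
    assumption, not taken from the paper."""
    u = rng.uniform()
    if u < 1 / 3:
        return "time_stretch"   # e.g. librosa.effects.time_stretch
    elif u < 2 / 3:
        return "pitch_shift"    # e.g. librosa.effects.pitch_shift
    return "none"

def pad_and_normalize(signals):
    """Post-pad every signal with zeros to the longest length, then
    standardize using the global mean and standard deviation."""
    max_len = max(len(s) for s in signals)
    padded = np.stack([np.pad(s, (0, max_len - len(s))) for s in signals])
    mean, std = padded.mean(), padded.std()
    return (padded - mean) / std

batch = pad_and_normalize([np.ones(5), np.ones(3)])  # shape (2, 5)
```

Computing the mean and standard deviation once over the whole batch (rather than per signal) matches the global normalization the text describes.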
This section applies LSTM analysis to the considered dataset. The model captures the sequential information in the frequencies of a species' sound (an LSTM layer with 50 units) and then averages over the sample (Global Average Pooling). Next, a Dense layer with 32 units is used, followed by a Dropout layer that regularizes the model and helps prevent overfitting. The final layer is a Dense layer with a Softmax activation that generates the final prediction. The weights were initialized randomly and the model was trained for 100 epochs.
This section applies CNN analysis to the considered dataset. The model captures patterns in frequency over a sliding window, using kernels and strides in a convolutional layer with 24 filters and a kernel size of 5. A MaxPooling layer with a stride of 2 then takes the maximum over each window; this block is repeated twice, after which a GlobalAveragePooling layer averages over the sequence. A Dropout layer is then applied to regularize the model and help prevent overfitting, and the final layer is a Dense layer with 24 units and a Softmax activation that generates the final prediction. The weights were initialized randomly and the model was trained for 100 epochs.
In this section, both the CNN and LSTM models are combined. The input layer is connected to two different branches: a convolutional layer with 24 filters and a kernel size of 5, and an LSTM layer with 50 units. After this, the same layers are used as in the previous two models. The Dense outputs of the CNN and LSTM branches are then concatenated, a Dropout layer with a rate of 0.3 is applied to prevent overfitting, and the final layer is a Dense layer with 24 units and a Softmax activation that generates the final prediction. The weights were initialized randomly and the model was trained for 300 epochs.
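A minimal tf.keras sketch of this two-branch fusion follows, using the layer sizes stated in the text (Conv1D with 24 filters and kernel size 5, LSTM with 50 units, Dropout 0.3, Dense output with 24 units) and the (64, 1115) input shape from the preprocessing pipeline. The ReLU activations and the exact placement of the Dense(32) layer on the LSTM branch are assumptions; the parameter count of this sketch will not necessarily match the paper's reported figures.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 1115))  # spectrogram shape from the pipeline

# CNN branch: sliding-window pattern detection (Conv-Pool repeated twice)
x = layers.Conv1D(24, 5, activation="relu")(inputs)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(24, 5, activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.GlobalAveragePooling1D()(x)

# LSTM branch: sequential frequency information, averaged over the sample
y = layers.LSTM(50, return_sequences=True)(inputs)
y = layers.GlobalAveragePooling1D()(y)
y = layers.Dense(32, activation="relu")(y)

# merge both branches, regularize, and classify the 24 species
z = layers.concatenate([x, y])
z = layers.Dropout(0.3)(z)
outputs = layers.Dense(24, activation="softmax")(z)

model = tf.keras.Model(inputs, outputs)
```

Feeding both branches from the same input layer and concatenating their pooled features lets the classifier use convolutional and recurrent evidence jointly, which is the design the text describes.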
As probabilities were to be predicted instead of actual class labels, multiclass log loss is a better evaluation option than accuracy [

F = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} y_ij * log(p_ij)

where F is the loss, p_ij is the probability output by the classifier for sample i and class j, y_ij is a binary variable (1 if j is the expected label for sample i, 0 otherwise), N is the number of samples, and M is the number of classes. Equivalently,

F = -(1/N) * Σ_{i=1..N} log(P_i)

where P_i is the probability assigned to the actual class label of sample i and N is the number of samples.
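The multiclass log loss described above reduces to averaging the negative log-probability assigned to each sample's true class. A minimal NumPy sketch (the clipping constant is a common convention, not from the paper):

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """F = -(1/N) * sum_i log(p_i of the true class), with probabilities
    clipped away from 0 and 1 to keep the logarithm finite."""
    probs = np.clip(probs, eps, 1 - eps)
    n = len(y_true)
    return -np.mean(np.log(probs[np.arange(n), y_true]))

# two samples, three classes; true labels are class 0 and class 1
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1]])
loss = multiclass_log_loss(np.array([0, 1]), probs)  # ≈ 0.164
```

Unlike accuracy, this metric penalizes confident wrong predictions heavily, which is why it suits probabilistic outputs.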
This paper uses a combination of LSTM and CNN. The following steps describe the end-to-end working of the algorithm and the model architecture, along with the purpose of each part. The model can be divided into the following five steps:
- Load the audio file using Librosa at a sampling rate of 44100 Hz
- Convert the time domain to the frequency domain (i.e., a spectrogram)
- Time stretch by factors of 0.7 and 1.3
- Pitch shift by -1 and +1
- Pad all audio signals to the same length

The final shape of each audio signal is (64, 1115).
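The time-to-frequency conversion step can be illustrated with a bare-bones magnitude spectrogram in NumPy. The paper's final (64, 1115) shape suggests a mel-scaled spectrogram computed with Librosa; the plain FFT-based version below is only a stand-in to show the framing-and-transform idea, and its frame parameters are assumptions.

```python
import numpy as np

def spectrogram(signal, n_fft=1024, hop=512):
    """Magnitude spectrogram: slide a window over the signal and take
    the magnitude of the FFT of each frame (a minimal stand-in for
    Librosa's spectrogram utilities)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)

# one second of noise at 44100 Hz -> (513, 85) spectrogram
spec = spectrogram(np.random.default_rng(0).standard_normal(44100))
```

In practice the magnitude bins would then be mapped onto a mel filter bank to reach a compact 64-bin representation like the one the pipeline produces.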
Several types of neural network architectures were tried to find the best model. Experimentation started with a simple LSTM-based architecture, and then a convolution-based architecture was also tried. Finally, both architectures were combined to obtain the best performance.
This model captures the sequential information in the frequencies of a species' sound (LSTM layer) and then averages over the sample (Global Average Pooling). Next, a Dense layer is used, followed by a Dropout layer that regularizes the model and helps prevent overfitting. The final layer is a Dense layer with a Softmax activation that generates the final prediction.
This model captures patterns in frequency over a sliding window, using kernels and strides in a convolutional layer. A MaxPooling layer then takes the maximum over each window; this block is repeated twice, after which a GlobalAveragePooling layer averages over the sequence. A Dropout layer is then applied to regularize the model and help prevent overfitting, and the final layer is a Dense layer with a Softmax activation that generates the final prediction.
In this model, both the CNN and LSTM models are combined. The input layer is connected to two different branches: a convolutional layer and an LSTM layer. After this, the same layers are used as in the previous two models. The Dense outputs of the CNN and LSTM branches are then concatenated, a Dropout layer is applied to prevent overfitting, and the final layer is a Dense layer with a Softmax activation that generates the final prediction.
The Receiver Operating Characteristic (ROC) curve shows how capable the model is of differentiating between classes.
The above-discussed results are shown in
Model | Parameters | Accuracy (%) | Log loss |
---|---|---|---|
LSTM | 235,624 | 93.15 | 0.17 |
CNN | 155,896 | 95.62 | 0.21 |
LSTM and CNN | 396,936 | 97.12 | 0.16 |
Few people these days show interest in becoming a forest ranger. Given the scarcity of experts in this area, this work can be a boon. Multiple models were tried to classify the distinct species of a tropical rainforest from audio signals, and the combination of LSTM and CNN showed the best results. The model is built so that it can give almost real-time predictions and does not require a high-end GPU, which makes it much easier to productionize. This work showed particularly strong performance, with greater than 97% accuracy and a multiclass log loss of 0.16, but it can surely be improved with widely popular techniques such as the attention mechanism. If more data can be collected for more species, the model can be retrained to recognize a larger variety of species. The model can also be deployed on an IoT device.