Parkinson’s disease (PD), one of whose symptoms is dysphonia, is a prevalent neurodegenerative disease. The use of outdated diagnostic techniques, which yield inaccurate and unreliable results, remains an obstacle to early-stage detection and diagnosis for clinical professionals. To address this issue, this study proposes using machine learning and deep learning models to analyze processed speech signals from patients’ voice recordings. Datasets of these processed speech signals were obtained and experimented on with random forest and logistic regression classifiers. Results were highly encouraging, with 90% accuracy produced by the random forest classifier and 81.5% by the logistic regression classifier. Furthermore, a deep neural network was implemented to investigate whether such a variation in method could add to the findings. It proved effective, yielding an accuracy of nearly 92%. These results suggest that it is possible to accurately diagnose early-stage PD merely by testing patients’ voices. This research calls for a revolutionary diagnostic approach in decision support systems and is a first step toward market-wide implementation of healthcare software dedicated to aiding clinicians in the early diagnosis of PD.
Parkinson’s disease (PD) is a brain disorder whose symptoms include stiffness, shaking, uneven gait, and difficulty with walking and coordination [
Studies indicate that approximately 89% of PD patients experience speech and voice disorders, including a soft, monotone, and hoarse voice, coupled with hesitancy and uncertain articulation. This is because of the disordered motor system that accompanies PD. Poor muscle activation leads to bradykinesia and hypokinesia, which can carry over to the muscles involved in speech, possibly leading to reduced movement of the respiratory system and larynx and to deficient articulation [
Notably, studies have shown that initial diagnoses conducted by general neurologists proved erroneous in 24% to 35% of cases upon postmortem patient examination [
This paper uses machine learning and deep learning models to enhance the accuracy of early-stage PD diagnosis through the classification of processed speech signals. Unlike previous research, which focuses mainly on the speech-signal processing phase, this paper is dedicated to enhancing classification accuracy as much as possible. A gap in the research persists: maintaining the same tightly focused scope of those algorithms while prioritizing experimentation with both machine and deep learning methods on the resulting processed data. This combined voice-and-computer approach minimizes human error and provides a robust substitute for the outdated diagnostic techniques practiced in clinical settings worldwide.
Studies are showing that early intervention in PD could potentially help preserve neuron functionality, reduce symptoms, slow disease progression, and improve patient quality of life (QoL) [
The structure of the paper is as follows. The first section is the introduction, which clarifies the challenge at hand and the paper’s proposed solution. The second section is a background and literature review focused on researchers’ previous efforts to solve the issue of inaccurate early diagnosis of PD; the scope starts out general and narrows down to the implementation of computational methods using processed speech signals. The third section presents the methodology: a detailed description of the model implementation, including model selection, feature selection, and the implementation of each specific machine learning algorithm, supported by graphs and statistical equations. The fourth section shows the experimental results, followed by a comparative analysis and brief discussion to contextualize the value of the paper’s contribution. Finally, the fifth section provides the conclusion and future recommendations.
Efforts to improve accuracy of early-stage PD detection have been headlined by biological markers and advancements in neuropathological findings. Using the latter as the gold standard, studies have indeed increased accuracy and called for diagnostic biomarkers [
Perhaps the progress most relevant to this paper in the effort to increase PD early-diagnosis accuracy (outside of speech signal processing) comes in a study by Mohskova et al., in which hand movements were captured via a motion sensor to detect PD through machine learning methods. The kinematic parameters of subjects with PD and of a control group were obtained via three motor tasks: finger tapping, pronation–supination of the hand, and opening–closing hand movements. Several classifiers were used, and key points were determined using a maxima-and-minima finder algorithm in order to determine the binary disease status (PD or non-PD) of each subject. The results were highly informative, with 95.3% accuracy for finger tapping, 90.6% for opening–closing hand movements, and 93.8% for pronation–supination [
There is an intricate web of steps taken to convert analog sound signals phonated by patients into numbers that the model can analyze. Such is the process carried out in feature extraction, or “extracting features characterizing the underlying patterns of the speech signals using signal processing algorithms” [
An important step in noninvasive PD diagnostic decision support was taken in perhaps the most similar study to the current one; a wide spectrum of speech signal processing algorithms (dysphonia measures) were analyzed using two statistical classifiers: random forests and support vector machines. Patients were asked to vocalize sustained vowels, from which 132 different dysphonia measures were computed. The results were beyond state-of-the-art, with nearly 99% accuracy of classification of ten dysphonia features, proving that this suggested approach can complement existing algorithms in assisting classifiers in differentiation between control and PD patients [
A separate study took this idea a step further by applying non-linear analysis of the range of speech signal processing algorithms to the standard clinical score that determines PD symptom severity (the Unified Parkinson’s Disease Rating Scale, or UPDRS). Along with the normal set of tasks required of the patient, the study tested accuracy using self-administered speech tests. Selection algorithms were used to filter for the best feature subset, which was fed into non-parametric regression and classification algorithms. The results were more accurate than clinicians’ predictions, differing by about 2 points, which suggests scaling this technology up to large-scale clinical trials [
Mustaqeem et al. [
It is no secret that quality of results and model accuracy ultimately depend on two factors:

1. Data quality
2. Model selection (then fine-tuning that model to optimal performance)
Therefore, choosing which model to work with is a decision that cannot be taken lightly. A variety of factors bear on this question, and they played a substantive role in determining which models were selected for this research.
Factors affecting model selection:
- Size of training data: A large dataset such as the one present in this research is better suited to low-bias/high-variance algorithms such as decision trees, random forest, and K-nearest neighbors.
- Accuracy: There will always be a tradeoff between accuracy and interpretability of output, as is represented in
- Speed/training time: Models with higher accuracy usually require longer training time, such as SVM and random forest, while models like logistic regression are quicker to implement.
- Linearity: Kernel SVM and random forest are preferred for non-linear data, while logistic regression and linear SVM are preferred for linear data.
- Number of features: Because this dataset has an extremely high number of features, dimension reduction is necessary before inputting the data to a classification model [
Based on the previous factors, and taking into consideration that a classification model is required to divide between positive detection and negative detection (0’s and 1’s), a model with the following specifications is required:
- Classification model
- High size of training data (low bias/high variance)
- High accuracy
- Linear or non-linear data (tested to see)
- High number of features
Thus, the decision was made to use random forest, logistic regression, and deep neural network algorithms. These three were chosen specifically because together they covered all of the aforementioned criteria while remaining significantly different from one another: each sits at a different point on the accuracy-interpretability tradeoff spectrum, and each varies in run time and in the complexity of data it can handle. With such a unique challenge at hand, it is imperative to diversify the approach in order to identify the best point of attack for coming trials. The high versatility among these three models facilitated algorithm experimentation, as the results made evident which models were better suited for this task. Because of this, the findings hold significant importance for future research on the same topic.
Data was retrieved as a CSV file from a dataset on Kaggle™ (the Google-owned online data science community); this data was originally collected for the UCI Machine Learning Repository [
The data was gathered from 188 patients (107 men, 81 women) with ages ranging from 33 to 87 (65.1 ± 10.9), provided by the Department of Neurology, Faculty of Medicine, Istanbul University. The control group is made up of 64 people (23 men, 41 women) with ages ranging from 41 to 82 (61.1 ± 8.9). The data collection process consisted of each subject sustaining phonation of the vowel “a” for three repetitions.
Attribute information: A variety of speech signal processing algorithms have been applied to the dataset, including Time Frequency Features, the tunable Q-factor wavelet transform (TQWT), Wavelet Transform-based Features, and Vocal Fold Features, in order to derive clinically significant information for PD diagnosis [
Data import: Using an Anaconda Jupyter™ notebook in Python, a multitude of libraries were used, namely NumPy (for linear algebra and arithmetic), Pandas (data processing and CSV file reading), and scikit-learn™ (a free machine learning library for Python).
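The import and loading step might be sketched as follows. The filename and the tiny stand-in table below are purely illustrative assumptions, since the actual Kaggle CSV is not reproduced here; the real file holds 754 speech features plus a class label.

```python
import numpy as np   # linear algebra and arithmetic
import pandas as pd  # data processing and CSV file reading

# For illustration only: write a tiny stand-in CSV. In the actual study,
# the file would be the 754-feature dataset downloaded from Kaggle.
pd.DataFrame({"feature_1": [0.1, 0.2],
              "feature_2": [1.5, 1.7],
              "class": [1, 0]}).to_csv("pd_speech_demo.csv", index=False)

df = pd.read_csv("pd_speech_demo.csv")  # same call used for the real data
print(df.shape)   # (rows, columns)
print(df.head())  # inspect the first rows
```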
Data visualization was done to identify general patterns within the data that would be difficult to recognize from the numbers alone. Tools such as Matplotlib, Seaborn, and others facilitate this process. An example is a heatmap, such as the one included in
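A correlation heatmap of the kind mentioned above could be produced with a minimal sketch like the following; the randomly generated features stand in for the actual speech dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative stand-in for the speech-feature table.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 6)),
                  columns=[f"feat_{i}" for i in range(6)])

corr = df.corr()  # pairwise Pearson correlations between features
sns.heatmap(corr, cmap="coolwarm")
plt.title("Feature correlation heatmap")
plt.savefig("heatmap.png")
```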
One of the most crucial phases of the data cleaning process, feature selection is the process in which the 754 features are slimmed down to only 15–20 via
A staple of any predictive modelling, splitting divides the dataset into a training set that the model can learn from and a test set on which its predictive accuracy is measured. There are many ways to split the data; in this research, a 70/30 split was used.
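The 70/30 split described above can be sketched with scikit-learn's train_test_split; the feature matrix here is random stand-in data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # stand-in feature matrix
y = rng.integers(0, 2, size=100)  # stand-in 0/1 labels

# 70% training, 30% test, as used in this research.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)  # (70, 5) (30, 5)
```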
One of the most useful and accurate models, the random forest is essentially an aggregation of decision trees whose final decision is the majority vote of the decisions of its constituent trees. After obtaining and importing the dataset (as explained above), the dataset’s first five rows are inspected using the head() function from the Pandas library.
Feature selection is done using the wrapper method, which follows a greedy search approach by evaluating all the possible combinations of features against the evaluation criterion. In this case for classification, the evaluation criterion is accuracy. The method then selects the combination of features that gives the optimal results for the algorithm [
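Exhaustively evaluating every feature combination is infeasible at 754 features, so wrapper methods in practice search greedily. A sketch of such a greedy wrapper, using scikit-learn's SequentialFeatureSelector on stand-in data (the study's exact implementation is not specified), looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Stand-in dataset; the real one has 754 speech features.
X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=5, random_state=0)

# Greedy forward wrapper: repeatedly adds the feature whose inclusion
# most improves cross-validated accuracy (the evaluation criterion).
rf = RandomForestClassifier(n_estimators=25, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=5,
                                direction="forward",
                                scoring="accuracy", cv=2)
sfs.fit(X, y)
print("features kept:", sfs.get_support().sum())
```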
To test whether the model works, train_pred and test_pred are compared. Parameters used to create the model:

- n_estimators (number of trees to build before taking the maximum voting or averages of predictions)
- random_state (facilitates replication of any solution)
- max_depth (longest path between root node and leaf node)

Finally, the data is fit to the random forest classifier and a confusion matrix is produced to indicate true positives, true negatives, false positives, and false negatives. Lastly, the accuracy is computed.
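The fit-and-evaluate step might look as follows. The dataset and the specific parameter values (n_estimators=100, max_depth=8) are illustrative assumptions, not the study's tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in binary-classification dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Parameter names mirror those listed above; values are illustrative.
rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
rf.fit(X_train, y_train)

train_pred = rf.predict(X_train)
test_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, test_pred)  # [[TN, FP], [FN, TP]]
print(cm)
print("test accuracy:", accuracy_score(y_test, test_pred))
```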
One of the simpler models, logistic regression is a good way to test the data and see where it stands in terms of complexity and linearity. It is based on the standard sigmoid logistic function in statistics.
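The standard sigmoid (logistic) function mentioned above maps any real-valued input to the interval (0, 1), which is what makes it usable as a class probability:

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the decision boundary
print(sigmoid(4.0))   # close to 1 (positive class)
print(sigmoid(-4.0))  # close to 0 (negative class)
```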
Steps:
- Import data and libraries (previously explained)
- Display data (previously explained)
- Select features using a tree-based classifier:
Because logistic regression is a restrictive model valued for interpretability and time efficiency, using the wrapper method would be too time-consuming. Univariate selection would be incompatible because it works only with positive values, while the dataset used here contains both positive and negative values. Another technique, a correlation matrix with heatmap, likewise requires considerable time and computational power. The method used, a tree-based classifier, was therefore the most suitable: although it might not yield the highest accuracy, it is much more efficient and handles negative values, as it relies on decision trees.
- Split data
- Model.
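The selection-plus-modelling pipeline above could be sketched as follows. The choice of ExtraTreesClassifier as the tree-based selector and all parameter values are assumptions, as the paper does not name the exact estimator used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; the real dataset has 754 features with mixed signs.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=6, random_state=0)

# Tree-based selection: rank features by impurity-based importance
# and keep those above the mean importance.
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(trees, prefit=True)
X_sel = selector.transform(X)

# Split the reduced data and fit logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.30, random_state=42)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("features kept:", X_sel.shape[1])
print("test accuracy:", logreg.score(X_test, y_test))
```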
In the context of this research paper, a neural network can be best defined as an intricate web of classifiers all linked together in the form of a network. This network contains input, output, and hidden layers that pick up on hard-to-detect patterns that simple classifiers would find difficulty with. Neural networks are extremely beneficial in situations where pattern complexity becomes a viable obstacle, and this became the driving force behind the idea of attempting to implement a deep neural network in this research.
Steps:
Noteworthy libraries used here are TensorFlow, one of the main neural network frameworks, from which keras (its API) and layers are imported; Pandas is also used to display the data entries in a table. Next, the data is split: first into 60% training and 40% validation. At the end of each epoch, the loss and any model metrics are evaluated on this data. The validation data is then split into 80% training and 20% cross-validation. Normalization of the data follows; the training, validation, and test data are normalized, a transformation that keeps the output mean close to 0 and the output standard deviation close to 1. Subsequently, layers are created via a sequential model, which trains a stack of layers. Data is input and output in the form of tensors (multidimensional arrays). The type of layer used in this research is a regular densely connected neural network layer (dense layer).

Layers: 1 input layer, 7 hidden layers (ReLU activation function), and 1 output layer (sigmoid activation function)

Neurons:
- First five hidden layers: 70 neurons
- Sixth hidden layer: 50 neurons
- Seventh hidden layer: 30 neurons
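The layer stack described above might be sketched in Keras as follows. The compile settings (Adam optimizer, binary cross-entropy loss) are assumptions, since the paper does not name them at this point; the layer counts and neuron widths follow the description.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features):
    """Sequential stack: 7 dense hidden layers (ReLU), sigmoid output."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    # First five hidden layers 70 neurons, then 50, then 30.
    for units in (70, 70, 70, 70, 70, 50, 30):
        model.add(layers.Dense(units, activation="relu"))
    # Single sigmoid unit for the binary PD / non-PD decision.
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(20)  # 20 is a stand-in feature count
model.summary()
```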
Activation functions:
ReLU (Rectified Linear Unit): allows use of a non-zero threshold, changing the maximum value of the activation, and using a non-zero multiple of the input for values below the threshold, as shown in
Sigmoid: plays a similar role to ReLU but always returns a value between 0 and 1, which is why it is used for the output (see the sigmoid function and graph in the logistic regression section).
After the creation of each layer, it undergoes batch normalization to provide more layer stability. In addition, the Dropout() function is used with parameter 0.1 to randomly remove 10% of the nodes, which prevents overfitting. Optimization of the neural network is done using the notable method. Another detail worth analyzing is the loss function: the whole goal of training is to increase accuracy by minimizing loss. The loss function used here is
The callback and training parameters used in this algorithm were:

- min_delta (the minimum change in the monitored quantity that qualifies as an improvement)
- patience (the number of epochs without improvement after which training stops)
- batch size (how many samples are processed before the model is updated)
- epochs (the number of complete passes through the dataset)
- verbose (gives a status report of the training)
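An early-stopping setup with those callback parameters might look like this minimal sketch on random stand-in data; the specific values (min_delta=1e-4, patience=3, batch_size=32) are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")      # stand-in features
y = rng.integers(0, 2, size=(200, 1)).astype("float32")  # stand-in labels

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# min_delta: smallest change that counts as an improvement;
# patience: how many non-improving epochs to tolerate before stopping.
stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                     min_delta=1e-4, patience=3,
                                     restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=20,
                    batch_size=32, verbose=0, callbacks=[stop])
print("epochs run:", len(history.history["loss"]))
```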
Lastly, accuracy is determined and visual representation is shown (elaborated on in the results section).
After applying different approaches to the same dataset in terms of feature selection and predictive modelling, the resulting accuracies show discrepancies that indicate significant quality differences among the experimented models. The neural network once again showed its capability to increase accuracy, with a highly pleasing result of nearly 92%.
For the sake of a legitimate comparative analysis, the paper chosen for comparison with this study’s results was reference 16, “A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform”. This is because that paper is not only considered state-of-the-art (SOTA) in its domain but is also the paper most similar to the current research. Results of that study are included below in
The results in this figure show that, across different trials and various classification methods, accuracies varied from the mid-60% range up to the mid-80% range. The highest accuracy found across all trials is 86%, when all feature subsets were used and classified by SVM (RBF). The highest accuracy obtained by logistic regression in the SOTA study is 85% (using all feature subsets), against the current study’s 81.5%; the SOTA study has the edge in that regard. The highest accuracy obtained by random forest in the SOTA study is 85% (using all feature subsets), against the current study’s 90.7%; the current study is superior in that regard. Finally, the current study went a step beyond anything done in the SOTA study by implementing a deep neural network in addition to the machine learning methods; the resulting accuracy of nearly 92% is almost 6% higher than any model in any trial conducted in the SOTA study.
Thus, the current study significantly enhanced the accuracy achieved on the same data as its counterpart. This implies that had this research’s methods been used in the SOTA study, they would have yielded more accurate results, although that was not the SOTA study’s purpose, as its authors set out only to compare feature subsets relative to one another. In light of these numbers, the bigger-picture takeaway is evident: this study makes a significant contribution to the field of PD diagnostics in a unique and simple way, through patients’ voices.
With the development of rapidly advancing machine learning technologies such as this one, the future of diagnostics looks promising for the medical field. Setting out with the goal of improving the early diagnosis of Parkinson’s disease, this research has achieved its task. After ending up with
We would like to recognize Dr. Ali Wagdy, Eng. Ruwaa Ibrahem, and Eng. Ahmed Kamal for their invaluable insight, constructive feedback, and undying support throughout the whole duration of this arduous research. They are our unsung heroes and the driving force behind the initiative to publish this paper.