Traditional linear statistical methods cannot provide effective predictions due to the complexity of the human mind. In this paper, we apply machine learning to funding allocation decision making and explore two questions: do the personal characteristics of evaluators help predict the outcome of evaluation decisions, and how can the accuracy of machine learning methods be improved on the imbalanced dataset of grant funding? Since funding data is characterized by an imbalanced class distribution, we propose a slacked weighted entropy decision tree (SWE-DT), which assigns a weight to each class with the help of a slack factor. The experimental results show that the SWE decision tree performs well, with a sensitivity of 0.87, a specificity of 0.85, and an average accuracy of 0.75. It also provides satisfactory classification accuracy with an Area Under the Curve (AUC) of 0.87, implying that the proposed method accurately classifies minority-class instances and is suitable for imbalanced datasets. Adding evaluator factors to the model improves sensitivity by over 9%, specificity by nearly 8%, and average accuracy by 7%, demonstrating the feasibility of using evaluators' characteristics as predictors. By innovatively using machine learning to predict evaluation decisions from the personal characteristics of evaluators, this work enriches the literature on both decision making and machine learning.
Evaluators are the core of project evaluation. Their expertise affects not only project selection but also the fairness of project evaluation. However, existing studies of evaluator characteristics and evaluation decisions mainly focus on exploring causal relationships between variables, neglecting the extent to which evaluators' characteristics can predict evaluation results.
The human mind is a complex system; traditional linear statistical methods cannot provide effective predictions and may struggle to clearly and accurately capture the complex relationships involved [
Among the many machine learning algorithms, the decision tree is particularly efficient for data analysis because it is a non-parametric approach without distributional assumptions, and its output is easy to understand [
Compared with the existing literature, the contributions of this article are as follows. First, this paper innovatively uses a machine learning method on the personal characteristics of evaluators to predict evaluation decisions, enriching the literature in both decision making and machine learning. Second, the experimental results show that the proposed slacked weighted entropy decision tree performs well, with a sensitivity of 0.87, a specificity of 0.85, and an average accuracy of 0.75. It also provides satisfactory classification accuracy with AUC = 0.87, implying that the proposed method accurately classifies minority-class instances and is suitable for imbalanced datasets.
In this section, we describe the potential factors influencing funding decision making and then introduce the advantages and disadvantages that classical classification methods may have when dealing with imbalanced data. Next, we identify the research gap in current machine learning algorithms for imbalanced datasets. Finally, we describe the dataset and prepare it for testing the proposed slacked weighted entropy decision tree model.
The factors influencing reviewers can be divided into the following categories: demographic factors, life-experience factors, and project evaluation factors.
Among the many machine learning algorithms, the calculation process of K-nearest neighbors (KNN) is simple and efficient, yet it is sensitive to the choice of k [
Models | Advantages | Disadvantages
---|---|---
KNN | • The calculation process is simple and efficient | • Sensitive to the choice of k [
Naïve Bayes | • Performs well on huge datasets | • Attribute dependence
Decision tree | • A self-explanatory and easy-to-follow representation | • Overfitting
SVM | • Suitable for high-dimensional problems | • Difficult to train on large-scale data
ANN | • Able to handle a variety of input data | • Overfitting
However, when dealing with imbalanced datasets, these classical classification algorithms tend to favor the majority class, leaving the minority class ignored. It is therefore necessary to improve these algorithms to adapt them to imbalanced data.
To obtain better predictions on imbalanced datasets, scholars modify machine learning algorithms at either the data level or the algorithm level. Data-level methods concentrate on modifying the training dataset to make it suitable for a standard learning algorithm. With respect to balancing distributions, we may distinguish approaches that generate new objects for minority groups (over-sampling) from approaches that remove examples from majority groups (under-sampling); a brief data-level illustration in code follows the table below. Standard approaches select target samples for preprocessing at random. They do not depend on a specific classifier and adapt well, but random selection often leads to the removal of important samples or the introduction of meaningless new objects [
Classification methods | | | Advantages | Disadvantages
---|---|---|---|---
Data-level methods | Under-sampling | Random under-sampling | • Easy to manipulate | • Misses potentially useful data
 | | Under-sampling based on nearest neighbors | • High generalization ability | • Increases training time and learning cost
 | Over-sampling | Random over-sampling | • Easy to manipulate | • Over-fitting
 | | SMOTE | • Overcomes the overfitting problem of random over-sampling | • Sample overlapping
 | Hybrid sampling | | • Improves generalization ability | • Time consuming
Algorithm-level methods | Cost-sensitive learning | | • Assigns differential misclassification costs to classes | • Misclassification costs are usually not available [
 | One-class learning | One-class SVM | • Effectively reduces training | • Overfitting and poor generalization ability
 | Ensemble learning | Bagging | • Effectively reduces model variance | • Poor fitness
 | | Boosting | • Adaptive to weights | • Sensitive to noise
 | | Random forest | • Stable and scalable, with low risk of overfitting | • Sensitive to noise and time consuming
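As a concrete illustration of a data-level method, the sketch below applies SMOTE to a toy imbalanced dataset. It assumes the third-party imbalanced-learn package; the synthetic data and parameters are hypothetical stand-ins, not the study's data.

```python
import numpy as np
from imblearn.over_sampling import SMOTE  # third-party imbalanced-learn package

# Toy imbalanced dataset standing in for grant-funding records.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.binomial(1, 0.1, size=500)  # roughly 10% positive (funded) cases

# SMOTE synthesizes new minority-class samples by interpolating between
# a minority sample and its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # classes are now balanced
```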
Algorithm-level methods concentrate on modifying existing learners to alleviate their bias towards majority groups [
These conventional class-imbalance handling methods may suffer from the loss of potentially useful information, unexpected mistakes, or an increased likelihood of overfitting, because they alter the original data distribution.
The dataset we use is the Beijing Innofund dataset.
Because both Gain and Gain Ratio are based on entropy evaluated before and after splitting the dataset, decision trees have been shown to be skew-sensitive. The main problem with entropy-based measures is that entropy reaches its maximal value when the dataset is fully balanced, i.e., when all classes have equal proportions. If the initial dataset is not balanced, the initial entropy will be low, which results in a high false-negative rate [
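To make this skew sensitivity concrete, here is a short illustration (our own, not from the original study) comparing the entropy of a balanced and an imbalanced binary class distribution:

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy (base 2) of a class distribution."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))  # balanced: 1.0, the maximal value
print(entropy([0.9, 0.1]))  # imbalanced: ~0.469, low before any split
```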
To solve the above-mentioned problem, researchers have continued to improve decision tree algorithms. Reference [
In addition to an imbalanced data distribution, grant funding datasets are also characterized by nonlinearity and complicated interactions between variables. Existing research has not made a breakthrough on this issue. To address it, we propose a slacked weighted entropy decision tree algorithm targeted at grant funding datasets.
In this section, we introduce the decision tree model and propose a solution to its degraded performance on imbalanced classification, i.e., the slacked weighted entropy decision tree.
The decision tree is a classical machine learning method. It aims to classify an instance into one of a predefined set of classes based on its attribute values (features) [
A decision tree is constructed sequentially in a top-down manner, i.e., the entire dataset is split into smaller and smaller partitions until no further partitioning can be made. Take a binary tree as an example. For a single binary target variable Y (0 or 1) and four attributes, the construction of a decision tree can be divided into three steps (a minimal code sketch follows the steps below). We illustrate the construction of a typical decision tree in
Step 1: According to the splitting criterion, select the most important input attribute (X1) for the dataset (D). This root node represents a choice that results in the subdivision of all records into two mutually exclusive subsets. At this split, all attributes/features are taken into account and the training data is divided into different groups. As presented in
Step 2: Select the optimal splitting point for the optimal splitting attribute and divide the dataset into two child nodes. It is worth mentioning that only input variables related to the target variable are used to split parent nodes into purer child nodes. As the child nodes split further, the decision tree continues to grow. The decision tree uses the information gain or the information gain rate as the basis for selecting attributes.
Step 3: When all leaf nodes satisfy the stopping criterion, the decision tree stops growing. In this condition, the data in each child node all belong to the same class. The stopping criterion usually consists of conditions such as purity, the number of leaf nodes, and tree depth.
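The following minimal Python sketch mirrors Steps 1-3 for binary, threshold-style splits. The function names and the exhaustive threshold search are our own simplification, not the exact procedure used in the paper:

```python
import numpy as np

def info(y):
    """Average amount of information (entropy) of a label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(X, y):
    """Steps 1-2: choose the attribute and threshold with the largest
    information gain. Step 3 (the stopping criterion) is the caller's
    job, e.g., stop when the best gain is 0 (the node is pure)."""
    base = info(y)
    best_attr, best_thr, best_gain = None, None, 0.0
    for j in range(X.shape[1]):                # consider every attribute
        for thr in np.unique(X[:, j]):         # and every candidate cut
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            # weighted average entropy of the two child nodes
            child = (len(left) * info(left) + len(right) * info(right)) / len(y)
            if base - child > best_gain:
                best_attr, best_thr, best_gain = j, thr, base - child
    return best_attr, best_thr, best_gain
```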
Since the predictors in the Beijing Innofund data consist of both firm-level and individual-level information, normalization is needed. In this study, data are scaled into the interval [0, 1] using min-max normalization:

x' = (x − min(x)) / (max(x) − min(x)),
where x is the original value, x' is the scaled value, max(x) is the maximum value of feature x, and min(x) is the minimum value of feature x. The dataset is then divided into two parts for training and testing the model: the training set is used to train the model, while the testing set is used to evaluate its predictive and generalization ability. Since there is little guidance in the literature on the division ratio, we follow convention and set 80% of the data for training and 20% for testing.
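A minimal sketch of this preprocessing with scikit-learn, using synthetic stand-in data (the real feature matrix is not reproduced here); the stratified split is our addition to preserve the class ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the firm-level plus individual-level features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.binomial(1, 0.2, size=1000)  # imbalanced binary target

# 80% training / 20% testing, preserving the class ratio in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# x' = (x - min(x)) / (max(x) - min(x)), fitted on the training data only
# so that no information from the test set leaks into training.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```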
When the data distribution is imbalanced, the C4.5 algorithm tends to favor the attributes of the majority partition. To address this issue, we improve the C4.5 algorithm by adding a class weight:
where
Based on information theory, the decision tree uses entropy to determine which node to split next. It calculates the amount of information for each class and then obtains the average amount of information for the training set. The higher the entropy, the higher the potential to improve the classification. The weighted information entropy of a set (D) can be written as
where
where
To normalize the gain, the decision tree divides the gain value by the split information, formulated as
In extreme cases, the information gain rate becomes an unbounded multiple of the information gain, and the information gain obviously no longer plays a leading role. Therefore, a slack factor δ is introduced into the calculation of the information gain rate. The improved information gain rate is calculated as follows:
After introducing the slack factor, the information gain rate is limited to the range of [
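Since the formulas above are truncated in the source, the sketch below is only our reading of the scheme: a class-weighted entropy, and a gain ratio whose denominator is padded with a slack factor δ so the ratio stays bounded as the split information approaches zero. The weight dictionary, the functional form, and the value of δ are all assumptions.

```python
import numpy as np

def weighted_entropy(y, class_weight):
    """Class-weighted entropy: each class's contribution is scaled by an
    assigned weight so the minority class is not drowned out. The exact
    weighting in the paper is not shown; this form is an assumption."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    w = np.array([class_weight[c] for c in classes])
    return float(-np.sum(w * p * np.log2(p)))

def slacked_gain_ratio(gain, split_info, delta=0.1):
    """Gain ratio with a slack factor in the denominator (assumed form):
    as split_info -> 0 the ratio is capped at gain / delta instead of
    growing without bound."""
    return gain / (split_info + delta)
```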
In this section, we present the classification results of the slacked weighted entropy decision tree. Comparing it with other machine learning methods, we find that the proposed method achieves satisfying classification results.
In the case of imbalanced data, sensitivity and specificity are introduced to measure classification accuracy. Sensitivity (the true positive rate) is the ratio of correctly classified positive instances to all true positive instances, where true positives (TP) is the number of positive samples predicted to be positive and false negatives (FN) is the number of positive samples predicted to be negative.
Specificity indicates the percentage of correctly classified negative samples among the true negative samples, as shown in
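In code, both metrics can be read off a confusion matrix; the label arrays below are purely illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # illustrative true labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]  # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # TP / (TP + FN), the true positive rate
specificity = tn / (tn + fp)  # TN / (TN + FP), the true negative rate
```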
A receiver operating characteristic (ROC) graph is a useful tool for visualizing and comparing classifier performance based on the TPR and FPR measures. In the presence of imbalanced datasets with unequal error costs, it is more appropriate to use the ROC curve or similar techniques. ROC curves can be thought of as representing the family of best decision boundaries for the relative costs of TP and FP. On an ROC curve, the X-axis represents FPR = FP/(TN + FP) and the Y-axis represents TPR = TP/(TP + FN). The ideal point on the ROC curve is (0, 1), i.e., all positive examples are classified correctly and no negative examples are misclassified as positive [
Besides the above-mentioned metrics, we also include the average accuracy rate in our evaluation. Unlike the plain accuracy rate, the average accuracy rate is suitable for imbalanced data because it calculates the accuracy of each class separately and then averages them.
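Under the interpretation that the average accuracy rate is the mean of the per-class accuracies (scikit-learn's balanced accuracy), both it and the AUC can be computed as follows; the scores below are illustrative:

```python
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]                       # labels
y_score = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.2, 0.3, 0.1]  # P(positive)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]              # hard predictions

auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
avg_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
```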
These default decision tree parameters often work well across a broad range of problems. In detail, the splitting criterion is entropy; the minimum samples per leaf = 1; the minimum samples per split = 2; the maximum features = none; the random state = none; the maximum leaf nodes = none; the minimum impurity split = none; and the maximum depth = none. Because our dataset is imbalanced, we set class weight = 'balanced' and gradually increase the minimum weight fraction per leaf from 0 to 0.5. As shown in
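In scikit-learn terms, the configuration described above corresponds roughly to the sketch below (min_impurity_split is deprecated in recent releases, so it is omitted); the grid of six values for the leaf-weight fraction is our choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Defaults from the text, plus class_weight='balanced' for the imbalanced
# data; min_weight_fraction_leaf is swept gradually from 0 to 0.5.
for mwfl in np.linspace(0.0, 0.5, 6):
    clf = DecisionTreeClassifier(
        criterion='entropy',
        min_samples_leaf=1,
        min_samples_split=2,
        max_features=None,
        random_state=None,
        max_leaf_nodes=None,
        max_depth=None,
        class_weight='balanced',
        min_weight_fraction_leaf=mwfl,
    )
    # clf.fit(X_train, y_train) and evaluate on the held-out 20% here.
```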
ROC Results of SWE-DT
Robust Check of SWE-DT
Comparison between Slacked Weighted Entropy Algorithm and Classical Decision Tree
As can be seen from the comparison between machine learning methods shown below,
 | Sensitivity | Specificity | Accuracy
---|---|---|---
Decision tree | 0.835 | 0.817 | 0.72 |
SWE decision tree | 0.869 | 0.85 | 0.75 |
As can be seen from
 | Baseline model | Evaluator model | Improvement
---|---|---|---
Sensitivity | 0.794 | 0.869 | 9.4% |
Specificity | 0.785 | 0.85 | 8.28% |
Accuracy | 0.701 | 0.75 | 7% |
We proposed a new tree induction method, slacked weighted entropy, to modify the classical decision tree algorithm. Instead of examining plain information entropy, the proposed method uses differences in class weights to determine the best split when growing a decision tree. We compared our method with the original decision tree algorithm on Beijing Innofund data. The experimental results showed that the slacked weighted entropy algorithm with the proposed enhancement performs well, with a sensitivity of 0.87, a specificity of 0.85, and an average accuracy of 0.75. The proposed method appears to be more suitable for binary imbalanced data classification. More specifically, our method does not sacrifice the false positive rate to increase the true positive rate. It is therefore useful when misclassification costs are unknown in imbalanced classification tasks.
For future research, more evaluator factors could be fed into the developed model to improve performance. Learning which factors matter, and how much improvement is needed to win funding, would be a challenging issue to study. Another possible direction is collecting data from other funding programs, such as the NNSFC and NSSFC, to examine the feasibility and generality of the slacked weighted entropy decision tree model.
We would like to thank the Beijing municipal government for supporting this research.