|Computers, Materials & Continua |
Improved KNN Imputation for Missing Values in Gene Expression Data
1Faculty of Science and Technology, Pibulsongkram Rajabhat University, Thailand
2Center of Excellence in Artificial Intelligence and Emerging Technologies, School of Information Technology, Mae Fah Luang University, Chiang Rai 57100, Thailand
*Corresponding Author: Tossapon Boongoen. Email: email@example.com
Received: 17 May 2021; Accepted: 12 July 2021
Abstract: The problem of missing values has long been studied by researchers working in areas of data science and bioinformatics, especially the analysis of gene expression data that facilitates an early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis process, while others have tried to fill the gaps with possible values. While the former is simple, the latter safeguards information loss. For that, a neighbour-based (KNN) approach has proven more effective than other global estimators. The paper extends this further by introducing a new summarization method to the KNN model. It is the first study that applies the concept of ordered weighted averaging (OWA) operator to such a problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbor-based models. Using different ratios of missing values from 1%–20% and a set of six published gene expression datasets, the experimental results suggest that new methods usually provide more accurate estimates than those compared methods. Specific to the missing rates of 5% and 20%, the best NRMSE scores as averages across datasets is 0.65 and 0.69, while the highest measures obtained by existing techniques included in this study are 0.80 and 0.84, respectively.
Keywords: Gene expression; missing value; imputation; KNN; OWA operator
DNA microarray technology  is used to monitor expression data under a variety of conditions. In previous decades, gene expression data obtained from various microarray experiments has inspired several applications, including the discovery of differential gene expression for molecular studies or drug therapy response , the creation of predictive systems for improved cancer diagnosis  and the identification of unknown effect of a specific therapy . However, using this technology to generate gene expression data sometimes leave a number of spots on the array missing . These may be caused by an insufficient resolution, an image corruption, array fabrication and experimental errors during the laboratory process [6–8]. In general, missing of data around 1%–10% would affect up to 95% of the genes in any microarray experiments . The treatment of missing values is a critical pre-processing step, as data quality is a major concern in the downstream analysis and actual medical applications. Ignoring this may degrade the reliability of knowledge or model generated from the underlying data set. In many domains ranging from gene expression to survey responses in social science, missing data causes several statistical models and machine learning algorithms to be incompetent as they are designed to work with a complete data . In fact, data cleansing prior the actual analysis is critical to the quality of outcome . It helps to decrease the need of repeating experiments, which can be expensive and time consuming. Above all, the repetition of experiments may not guarantee completeness of the data .
Instead of repeating an experiment, one can attempt to estimate missing values by imputation. As a result, many algorithms have been proposed to tackle this problem found in gene expression data . A quick search in PubMed for the phrase ‘missing value imputation’ in the Title/Abstract field returns more than 100 articles, in which 76 of them were published during 2010–2020. Among these, an obvious solution is simply to exclude any samples with missing information from the analysis step. However, it is recommended to apply this only when a large volume of data is available, such that representatives of different data patterns remain in the final data set . In addition, different statistical measurements such as zero, means, maximum and minimum are exploited as a reference value . A rich collection of machine learning techniques has also illustrated a leap of improvement during past decades. These include linear regression imputation and K nearest-neighbors imputation , maximum likelihood , decision trees  and the fuzzy approach . Among those, K nearest-neighbors imputation or KNNimpute  is perhaps one of the earliest and most frequently used missing value imputation algorithms. It makes use of pairwise information between the target gene with missing values and the K nearest reference genes. The missing value j in the target gene is estimated as the weighted average of the j-th component of those K reference genes, where the weights are inversely proportional to the proximity measures (e.g., Euclidean distance) between the target and the reference genes. Based on the empirical study with published gene expression data sets , KNNimpute and its variants often perform better than other alternatives, provided that a strong local correlation exists between genes in the data.
Several modifications to the basic KNNimpute algorithm have been proposed in the literature. For sequential KNNimpute or SKNNimpute , imputed genes are reused in later imputation processes of other genes. In particular, the data matrix is first split into two sets: the first set (i.e., the reference set) consists of genes with no missing value and the second set (i.e., the target set) consists of genes with missing values that are ranked with respect to the missing rate. Missing values are estimated sequentially, starting with the gene having the smallest missing rate in the target set. Once all the missing values in a target gene are imputed, the target gene is moved to the reference set to be used for subsequent imputation of the remaining genes in the target set. Another variation of KNN imputation is introduced as an iterative KNN imputation or IKNNimpute . This algorithm is based on a procedure that initially involves replacing all missing values via means imputation and iteratively refining these estimates. In each iteration, K closest reference genes selected from the previously imputed complete matrix are used to refine the missing values estimated of the target gene. The iteration terminates when the sum of square difference between the current and the previous estimated complete matrix falls below a pre-specified threshold. In addition, the study of  compares the performance of incomplete case KNN imputation (ICKNNI) against complete case KNN imputation (CCKNNI). The empirical results show that using incomplete cases often increases the effectiveness of nearest-neighbors imputation, especially at a high missing level. In the work of , a method for nearest-neighbors selection for iteratively KNN imputation is proposed. This so-called GKNN algorithm selects k nearest neighbors for each missing data via calculating the grey distance instead of the traditional Euclidean distance. This function calculates the proximity from only one dimension, while other methods conduct this measurement on multiple dimensions. Besides, the feature weighted grey KNN (FWGKNN) imputation technique  incorporates the concept of feature relevance to determine estimated values using the mutual information (MI) metric.
Unlike the aforementioned, a trend to apply data structure or cluster to guide the neighbor selection has recently emerged with reported successes over gene expression data. Specific to the Evolutionary kNNimpute (EvlkNNImputation) model introduced by , it extends KNNimpute with the genetic algorithm being employed to optimize parameters of the underlying KNNimpute algorithm. Given the prior step of clustering, the missing data will be filled in by taking into account all neighbor instances belonging to the same cluster. Similarly, the cluster based KNN imputation or CKNNimpute  also makes a good use of cluster analysis, where the k-means clustering algorithm is employed to obtain clusters of data set under examination. Instead of using all available genes, only those in the cluster whose centroid is the closet to the target gene are candidates for the selection of nearest neighbors. However, a simple average operator is still used to deliver the imputed value at the end, which may be ineffective for cases with extreme values or noises. In fact, a number of alternatives have been proposed under the umbrella of ‘aggregation operator’ that combines multiple sources of information into a global outcome . For this purpose, Yager’s ordered weighted averaging (OWA) operators  have proven useful for many problem domains such as data mining, decision making, artificial neural networks, approximate reasoning and fuzzy system . Furthermore, a rich collection of weight determination methods for OWA can also be found in the literature [28–36].
In order to improve the quality of imputation, the work presented in this paper proposes an organic combination of CKNNimpute with the argument-dependent OWA operator [33,34], which has not been investigated thus far in the literature. In particular, the performance of CKNNimpute technique may be enhanced, where imputed values are summarized from those of selected neighbors using a data-centric aggregation operator instead of a conventional average function. New models are evaluated with several published gene expression data sets, in comparison with basic statistical models, the conventional KNNimpute and its weighted variation. The behavior of these models are also assessed using different levels of missing values, with the results providing a guideline for their practical uses. The rest of this paper is organized as follows. Section 2 presents the methodology of proposed imputation process, in which a clustering of data under examination is obtained prior the selection of nearest neighbors. Then, basic and argument-dependent OWA operators are applied to a set of reference inputs each belonging to a particular neighbor, to create an imputed value. After that, the performance evaluation of this new technique and compared methods are included and discussed in Section 3. At the end, the conclusion with directions of future research is given in Section 4.
2 Proposed Method
In this section, the proposed imputation methods called OWA-KNN and OWA-CKNN are fully explained. It combines the cluster-based selection of neighboring genes and the application of argument-dependent OWA operator that helps to reduce the effect of false or biased judgment in a group decision-making. In particular, these new models commonly include three steps of: (i) finding the appropriate number of clusters and creating a clustering model; (ii) using this as a reference for the following gene selection process; and (iii) applying the ordered weighted averaging operator with K nearest-neighbor imputation algorithm in conjunction with the data cluster previously discovered. Each of these is described in the following sub-sections.
2.1 Acquisition of Gene Clusters
The objective of this initial phase is to obtain a set of data clusters that will be exploited as references to the next stage of nearest neighbor selection. In particular, the k-means clustering algorithm is employed here for its simplicity and efficiency. This technique aims to divide a given set of data into a predefined number of groups or clusters, provided that there is no missing value in the underlying data matrix. Therefore, a simple average imputation is introduced to the proposed framework to firstly estimate those missing entries in the original data matrix , where , and correspond to rows and columns (i.e., genes and experiments, respectively). This step delivers the so-called complete data matrix of the same dimensionality as the original. For a given matrix , the k-means algorithm searches for the partition of genes into clusters, such that genes in the same cluster are more similar to each other than to those in the others. This is achieved through minimizing the following objective function .
where denotes a set of vectors representing centroids of clusters, i.e., . Furthermore, is another matrix in which each entry represents a membership degree that a specific gene having with cluster ( and for crisp/hard and soft clustering, respectively), provided that . For many clustering algorithms including k-means, the parameter k indicating a number of gene clusters is to be determined prior the generation of reference partition π. In fact, setting the value of k requires either knowledge of the investigated data or the alternative of trial-and-error experiment. For the latter, a user must have sufficient expertise to know what a good clustering looks like. However, if the data set is very large or of a high dimensionality, human verification could become difficult or even impossible at times. As such, it is necessary to have an algorithm that can efficiently justify a reasonable number of clusters to use. With this in mind, the next step is to identify the appropriate k value, which can be summarized as follows.
Step1: the process starts with applying the k-means algorithm to the data matrix . Specific to the trial , this generates a set of partitions using different value of . In the current research, is used for the advantage of efficiency. It is noteworthy that a more general heuristics such as can be applied, however with higher computational/time requirement .
Step2: find the data partition from trial with the best internal cluster quality, based on a group of cluster validity indices . In other words, one vote is given to that is a member of the collection if it provides the best score of a quality index . Having evaluated across indices, the partition with b clusters that has the majority vote  is taken as the optimal setting of cluster numbers, i.e., is the preferred k for the trial. In case of a tie, a vote is divided between those relevant partitions. Note that a cluster validity index is one of standard tools to assess the goodness of clustering results. For this work, five of the most well-known quality indices are included to form a committee (i.e., λ = 5) that judges the appropriate value. These include Silhouette index, Dunn’s index, DB index, Calinski-Harabasz index and Kzannowski-Lia index, respectively. Please refer to [9,38] for more details of these validity indices.
Step3: since k-means is non-deterministic, Steps 1–3 are repeated for times, i.e., . This is to ensure that the target number of clusters is not randomly obtained from a few trials. Results from trials are then used to form a vector of preferred cluster numbers (i.e., . With this information, the optimal value of is the most frequently occurring numbers in the aforementioned vector. For instance, would be 3 given the result vector of trials. In case of a tie, a smaller value is preferred. Note that is set to 20 for the current research, as several works on ensemble clustering  have commonly identified that the promotion of diversity within an ensemble is limited as the size grows larger than 20 k-means repetitions. In other words, the patterns of data partitions become highly overlapping, as more results are included.
Step4: once the value of optimal k is known, the quality of all partitions of clusters from trials are examined again. Let be the quality of , which can be calculated by:
where denotes the quality measure of partition with respect to the quality index in the normalized domain of across different λ indices. Following that, the selected partition for the next stage is one with the maximum value of .
Instead of selecting a partition with the best quality from the given pool, another alternative that can be investigated in the future work is to exploit the cluster ensemble approach to summarize all available partitions . With this intuition, the final partition may be more accurate and robust. Despite higher time requirement, this research direction seems promising.
2.2 Cluster-Directed Selection of Nearest Neighbours
With the optimal clustering model obtained from the previous stage, the selection of nearest neighbours to the target gene is emphasised next. Note that the reference partition consists of k clusters , each of which is represented by a unique centroid . Firstly, find the cluster for any row with missing values to associate with. Such a row in gene expression data is called a target gene whose missing values will be estimated. A target gene is formally assigned to a cluster only when
where denotes the centroid of cluster , while is the Euclidean distance between vectors and . After that, the cluster membership previously discovered is utilized to determine the gene set for a particular target gene . The size of this gene set may be subjectively specified by a human expert, but a data-driven counterpart is usually preferred for better adaptability. Specific to this study, the number of genes in is dynamically determined by the intuition that only when
where the target gene belongs to cluster , and denotes the size of (or the number of genes in that cluster). A missing value in is then estimated by applying a KNN imputation to the set of those genes of . The KNNimpute method has proven simple and effective in the literature, whilst being generally competitive to other advanced techniques. However, the efficiency of KNN imputation is still subjected to the number of genes (), with the time complexity of searching for nearest neighbours being around . In other words, KNN does not scale up well with a very large set of data. In order to increase to efficiency of imputing missing values in microarray data, the idea is to reduce biases and increase correlation of data by clustering method before imputing the missing values. With a clustering model, the neighbor search is restricted only to the cluster that the target gene is closest related. Thus, the time complexity would reduce dramatically to where is the average size (or number of genes) of clusters and .
2.3 Application of Argument-Dependent OWA Operator
The general process of OWA consists of three steps: (i) input values are rearranged in the descending order, (ii) weights of these inputs are determined using a preferred method, and (iii) based on the derived weights, these rearranged input values are combined into a single value. In the community of OWA research, weight determination has long attracted a large number of studies and publications. These include different types of methods such as constraint optimization models [28,29], quantifier functions , and data distribution assumption [31,32]. Given these techniques, weights are generated in an objective way, without considering actual distribution characteristics of input values. It is simply assumed that the distribution of inputs follows one of common probability density functions. This hypothesis may be unrealistic provided the observation that OWA weights often cannot fit any pre-defined functions in many real problems . Unlike the aforementioned families of weight specification methods, another category takes into account distribution characteristics of input values. At first, the argument dependent method [33,34] is proposed where large weights are assigned to input values close to the average, and small weights to those values further away from the center. While this initial method treats the whole inputs as one global cluster, local clusters are also exploited to estimate weights [35,36]. However, the underlying cluster analysis can be highly expensive, especially to a big data set. For the present work, the argument-dependent OWA operator introduced by [33,34] is exploited to deliver the proposed OWA-CKNN imputation model. For a comparative purpose, another new method of OWA-KNN is also introduced here by making a good use of the same OWA operator to the basic KNNimpute technique. Details of these applications to create the final estimate from the set of genes identified previously are presented next.
Specific to OWA-CKNN, the -th attribute or component that has been missing in the target gene can be estimated from a set of -th attribute values of all genes in that set of cluster-based nearest neighbors. With CKNNimpute that is the baseline counterpart of OWA-CKNN, is obtained as an average of the aforementioned attribute values.
This is considered to be unreliable at times, such that OWA-CKNN applies the argument-dependent OWA instead. In particular, each value in the set is given a weight that can be approximated by the following equations.
After that, the estimate of is calculated by the next equation.
For the OWA-KNN model, this same process is repeated, with the argument-dependent OWA being also applied to the set of nearest neighbors. However, this set is simply obtained from a simple search for nearest neighbors without the constraint of clustering reference explained in Section 2.2. Note that OWA-KNN is considered to be the extension of KNNimpute, while OWA-CKNN is a novel modification made to CKNNimpute.
3 Performance Evaluation
To obtain a rigorous assessment of proposed methods, OWA-CKNN and OWA-KNN, this section presents the framework that is systematically designed and employed for the performance evaluation. It includes details of datasets to be examined, compared methods, parameter settings, an evaluation metric and the statistical assessment. Also, experimental results, observations and discussion are provided herein.
3.1 Experimental Design
Specific to this empirical study, proposed models are evaluated on six gene expression datasets that are obtained from published microarray experiments. This follows the previous study of , which initially introduces the concept of CKNNimpute, i.e., the baseline of OWA-CKNN. Four of these datasets originate from the cell-cycle expression of yeast Saccharomyces Cerevisiae (or S. Cerevisiae) that has been reported in . The first set named Sp.Alpha is represented as an data matrix of genes and experiments. The second dataset that is referred to as Sp.Elu hereafter is generated from elutriation data and presented as a matrix. Next, Sp.Cyca is the third set in which time series data for the analysis of cell cycle regulate genes is recorded as a matrix of dimension . The fourth microarray dataset or Sp.Cycb is also drawn from this set of time series, and contains 242 genes and 14 experiments. In addition to these, the fifth dataset is acquired from the study of , which investigates response in yeast to environmental changes. This data matrix called Ga.Env contains 5,431 genes and 13 experiments. At last, the sixth dataset or Ta.Crc presents cDNA microarray data  relevant to human colorectal cancer (CRC). In particular, the underlying data matrix contains 758 genes and 205 primary CRCs. Details of these six datasets are summarized in Tab. 1. Note that these have been investigated in many studies, including the survey of  and comparative reports by [42,43]. Henceforth, these can be considered as the benchmark data collection for the comparison of new and existing imputation methods. In order to make use of these datasets for the problem of missing values, they are modified such that missing values are randomly inserted to make up the proportion of up to of the data matrix, where . This can be achieved by the salt-and-pepper selection of corresponding positions across the space of .
To gain a thorough comparison of performance, the next five methods are included in experiments in addition to OWA-CKNN and OWA-KNN. The setting of method-specific parameters are also specified.
• Two basic imputation techniques of zero heuristics and row average, which are referred to as Zero and RA hereafter.
• Two common neighbor-based models KNNimpute  and its weighted variation, WKNN . Algorithmic variables are set in accordance with those reported in published reports.
• CKNN: cluster-based KNNImpute or called CKNNimpute in the original work of . Note that, as the baseline of OWA-CKNN such that the steps of finding clustering-based nearest neighbor set is identical to that of OWA-CKNN. Furthermore, CKNN is a good representative of many imputation techniques found in the literature as it demonstrates performance superior than others  such as SKNNimpute , IKNNimpute , LLSimpute , and BPCAimpute .
• Similar to many previous studies, normalized root mean square error or NRMSE  is used to determine a goodness of imputation. It is based on the difference between values estimated by an imputation technique and their true values. Intuitively, the lower such a difference is the better the performance is. Formally, NRMSE is defined by the following. Note that is the actual value in the original data matrix, is the corresponding estimated value, is the variance of the actual values. The lower NRMSE is, the better the value estimated by a computerized method becomes.
• Each experiment setting (imputation method, dataset and missing rate) is repeated for 20 trials to generalize the results and comparison.
3.2 Experimental Results
According to Fig. 1 that presents method-specific NRMSE measures as averages across datasets and multiple trials, the two basic alternatives of Zero and RA appear to be the worst among all seven techniques investigated here. The gap of difference between these two with the others is obvious when the missing rate is less than 15%, while KNN is only slightly better than Zero and RA as the rate rises up to 20%. Not only the basic KNN model, performance of other neighbor-based imputation techniques also drops as the magnitude of missing value inclines. Provided that WKNN and OWA-KNN are extensions of KNN, it is only natural to compare their NRMSE scores across the range of missing rates, from 1% to 20%. In particular to this objective, the results illustrated in Fig. 1 suggest that both WKNN and OWA-KNN similarly improve the effectiveness of KNN, whereas OWA-KNN usually provides more accurate estimates than the other for all the missing rates. However, this proposed use of OWA with KNN is still not as good as CKNN, thus confirming the benefit of cluster-based selection of nearest neighbors. This leads to the comparison between OWA-CKNN and its baseline, i.e., CKNN. It is observed that the former performs consistently better than the latter, with the different between their NRMSE scores becomes gradually larger along the increase of missing rate. It is noteworthy that OWA-CKNN is a promising choice as it is able to keep the NRMSE measure below 0.68 even with a large amount of missing values. Apart from this overview, details of dataset-specific results are given next.
Figs. 2–7 provide further results summarized for each of datasets examined in this study. Note that the results of Zero and RA are not included in these figures as they are significantly higher than five neighbor-based counterparts, hence making illustrations rather difficult to understand. With the Sp.Alpha dataset, it is shown that all the four variants of KNN outperform the baseline model, with OWA-CKNN achieving the best NRMSE score for each of missing rates. This trend can be similarly observed with other datasets (based on Figs. 3–7), where NRMSE scores of OWA-CKNN are significantly lower than those of CKNN. It is also interesting to see that OWA-KNN can be more effective than CKNN in datasets like Sp.Elu, Sp.Cyca, Sp.Cycb and Ga.Env. This suggests that the exploitation of OWA operator can provide a reliable estimate even from a set of simple nearest neighbors, i.e., without referring to the cluster-based reference. Specific to Fig. 7 that shows the results with Ta.Crc, OWA-CKNN is able to boost the performance of CKNN, which is originally only comparable to WKNN and slightly better than the KNN baseline.
To achieve a more reliable assessment, the number of times (or frequencies) that one technique is ‘significantly better’ and ‘significantly worse’ (of 95% confidence level) than the others are considered. This comparison framework has been successfully used by [37,38] to discover trustworthy conclusions from the results. Let be the average of NRMSE measures across runs ( = 20 in this evaluation) for an imputation method ( is a set of seven methods assessed here), on a specific dataset ( is a set of investigated datasets). In other words, is estimated by the next equation.
where denotes the measure obtained from the -th run of method , on dataset . The comparison of average values (or means) to discriminate the effectiveness of examined methods may be misleading, as the difference between means can be statistically insignificant at times. Thus, such an evaluation decision can be more robust using the 95% confidence interval for the mean , which is defined as follows.
where denotes the standard deviation of the measures across runs for a method over a dataset . The statistical significance of the difference between any two techniques over any dataset is found if there is no intersection between their confidence intervals of and . For any dataset , a method is significantly better than another method when the following is true.
Following that, the number of times that one method is significantly better than its competitors across all experimented datasets, i.e., , can be estimated by the following equation.
Similarly, the number of times that one method is significantly worse than its competitors, i.e., can be computed by the next pair of equations.
Given these definitions, it is useful to evaluate the quality of imputation techniques based on the frequencies of better () and worse () performance than competitors. Figs. 8 and 9 depict better and worse statistics at low missing rates of 1% and 5%. These lead to a familiar conclusion that OWA-CKNN is the most effective, and another proposed model of OWA-KNN is usually better than both WKNN and KNN. Let use turn to the case of a high missing rate of 20%. The same observation is also obtained from the results shown in Fig. 10. Note that the quality of OWA-CKNN is exceptional in this extreme case as compared to other alternatives. This is implied by the fact that ‘Worse’ frequency of this model is zero. Nonetheless, the goodness of all methods including OWA-CKNN is likely to rapidly decrease as the rate of missing values grows beyond the mark of 20%.
This paper presents an organic combination of KNN imputation and argument-dependent OWA operator, which has been missing from the literature, especially for improving quality of gene expression data. Instead of relying on a simple average as the representative of attribute values extracted from a set of nearest neighbors, a weighted aggregation is exploited. Each of these reference values is assigned with a specific weight emphasizing its significant to the underlying summarization. A simple global approach to determine argument-dependent weights is employed in the current work for its simplicity and efficiency. The proposed models of OWA-CKNN and OWA-KNN are assessed against benchmark competitors, over a set of published data and a widely used quality metric of NRMSE. In addition, an additional evaluation framework of significant better and worse is also exploited herein to provide further comparison. Based on the experimental results, both new techniques usually perform better than their baselines, whilst reaching the best performance per setting of dataset and missing rate. Despite this success, it is recommended to make use of other alternatives to resolve the problem of missing values when the rate of missing values grows higher than the level of 20%. Perhaps, a re-run of microarray experiment might be a better choice than analyzing uncertain data. With respect to the current development, several directions of possible future work can be considered. These include the exploitation of consensus clustering [37,47] to provide accurate clusters for the proposed neighbor-based imputation method. In particular, an investigation of generating those using quality-diversity based selection of ensemble members  and noise-induced ensemble generation  can be truly useful in practice. Besides, possible applications of imputation techniques to fuzzy reasoning  and clustering-based data discretization  can also be further studied.
Acknowledgement: This research work is partly supported by Pibulsongkram Rajabhat University and Mae Fah Luang University.
Funding Statement: This work is funded by Newton Institutional Links 2020--21 project: 6237188 81, jointly by British Council and National Research Council of Thailand (https://www.britishcouncil.org). The corresponding author is the project PI.
Conflicts of Interest: There is no conflict of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|