Missing Data Treatment on Cluster Analysis
Data mining is important for discovering patterns in raw data and for extracting information that can be transformed for further use. One major aspect of data mining is data pre-processing. Data pre-processing is closely associated with business intelligence and ensures that valid, high-quality data is prepared for data analysis and decision making. Dealing with missing data, by recovering or filling in the missing values, is one of the objectives of data pre-processing. There are several methods of dealing with missing values, but the chosen method should be compatible with the properties of the selected data. In this project, an integrated regression and clustering method was used to assess how demographic data from the Bank could be recovered to aid in analyzing customer behavior in readiness for a new product launch. The missing demographic data was a great challenge to the marketing team. However, the proposed method of integrated clustering and regression was found effective in accurately predicting the demographic values for some clients. The results indicated that regression clustering could estimate missing values of the income, age and gender attributes in a set of data at the bank.
Keywords: Data mining, Pre-processing, Regression clustering
Data collected from the field is often incomplete, noisy and inconsistent. As such, it requires data cleaning, data integration, transformation, data reduction, discretization and the generation of a concept hierarchy before actual data analysis can be done. Data mining is an important aspect of extracting essential information from a set of data and transforming it into a comprehensible form for further use. Through data mining, a computational approach is used to discover distinctive patterns in large sets of data through the integration of machine learning, artificial intelligence, statistics and database systems.
Data mining helps in organizing data and establishing patterns in order to understand particular trends that may be useful for business analysis. Data mining techniques have developed greatly and have uncovered new knowledge for various business industries (Tseng & Kuo-Howang, 2003: 32). According to most researchers, accurate prediction of missing values can be achieved by matching the data with a suitable analysis method (O’Brien & Marakas, 2011: 132).
However, according to Han and Kamber (2000: 123), there is no single method that can deal with most kinds of missing data, hence the importance of understanding the underlying properties of a particular data set. Data pre-processing is an important aspect of data mining and enhances knowledge discovery by improving the quality of the data to be analyzed. In this project, an integration of clustering and regression techniques was used to predict the missing values in the customer database. The management of missing values arises in various real-life situations. For instance, in most cases business decisions are based on quality data mining in order to have effective market analysis, and the data warehouse needs to have a consistent integration of quality data.
Researchers would have a rough time analyzing data collected directly from the field. For instance, consider the customer database at Baroda Bank of Saudi Arabia. An analysis of the customer database at the banking institution found that the customer records did not have adequate demographic data (occupation, income, age and gender) (Han & Kamber, 2000: 122). In this case the existing customer database does not support market predictions that could be used to launch a new product or to pursue strategies aimed at retaining existing customers. As such, further surveys would be conducted to gather the available data on customer demographics and then arrange them in different clusters (age, gender, income) as part of finding the missing values. When business decisions are made, data preprocessing is often taken for granted, and this contributes to great errors in the overall business intelligence analysis. As such, as part of data preprocessing, it is imperative to clean raw data in order to deal with missing values, noisy data and any other inconsistent values. Data preprocessing is achieved through four main stages: cleaning, transformation, integration and reduction (O’Brien & Marakas, 2011: 132).
Missing values: this problem is rectified by filling in the missing values, for example manually, with the arithmetic mean of the attribute, or with the most probable value, and by correcting inconsistent data.
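As a minimal sketch of the two automatic fills named above, the following Python snippet replaces missing numerical values with the arithmetic mean and missing categorical values with the most frequent value. The function names and the sample incomes and genders are hypothetical, and using the mode as the "most probable value" is an assumption made for illustration only.

```python
from statistics import mean, mode

def fill_with_mean(values):
    """Replace None entries with the arithmetic mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def fill_with_mode(values):
    """Replace None entries with the most frequent known value."""
    known = [v for v in values if v is not None]
    fill = mode(known)
    return [fill if v is None else v for v in values]

# Hypothetical customer attributes; None marks a missing value.
incomes = [40000, None, 60000, 50000, None]
genders = ["F", "M", None, "F", "F"]

filled_incomes = fill_with_mean(incomes)
filled_genders = fill_with_mode(genders)
```

Mean filling is simple but pulls every missing record toward the centre of the attribute, which is one reason the later sections estimate values within clusters instead.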
Noisy data refers to data that has outliers and errors arising from limitations of the data collection methods or instruments used, incorrect data collection in the field, use of the wrong technology to collect data, and problems during data transmission. In addition, noisy data may result from data entry issues or from inconsistencies in the naming process (Tseng & Kuo-Howang, 2003: 32).
In this case, noisy data may lead to ineffective analysis and consequently wrong decisions. Several methods are used to address the problem of noisy data, such as binning, clustering and regression. The binning method entails partitioning the sorted data into equal-frequency bins and smoothing each bin by its mean, its median or its boundaries. The clustering method entails detecting and removing outliers that fall outside the expected groups, while regression may be used to smooth data by fitting it to a function (Han & Kamber, 2000: 124).
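The binning step can be illustrated with a short Python sketch of equal-frequency binning followed by smoothing by bin means. The function name, the bin size and the list of customer ages are hypothetical values chosen for illustration, and the sketch assumes the last bin may simply be shorter when the data does not divide evenly.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical customer ages, three values per bin.
ages = [22, 25, 31, 38, 40, 45, 52, 58, 61]
smoothed_ages = smooth_by_bin_means(ages, 3)
# Bins: [22, 25, 31], [38, 40, 45], [52, 58, 61]
```

Smoothing by bin medians or bin boundaries follows the same pattern, with only the replacement value changed.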
Inconsistent data: this means data that has discrepancies in names or codes, or between related fields. For instance, in a banking institution a customer’s current age may be recorded as 30 years whereas the recorded date of birth is 01/05/1970.
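A consistency check of this kind can be sketched in Python as follows. The sketch assumes the date is written as day/month/year and that ages are compared as of the end of 2013, the year in which the records were collected; the function names and the one-year tolerance are hypothetical choices.

```python
from datetime import date, datetime

def age_on(birth_date, as_of):
    """Completed years between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def is_age_consistent(recorded_age, dob_text, as_of, tolerance=1):
    """True when the recorded age agrees with the date of birth
    to within `tolerance` years (dob_text assumed to be dd/mm/yyyy)."""
    birth_date = datetime.strptime(dob_text, "%d/%m/%Y").date()
    return abs(age_on(birth_date, as_of) - recorded_age) <= tolerance

reference_day = date(2013, 12, 31)  # assumed reference date
consistent = is_age_consistent(30, "01/05/1970", reference_day)
```

For the record in the example, the date of birth implies an age of 43 at the assumed reference date, so the recorded age of 30 is flagged as inconsistent.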
Data integration refers to combining sets of data derived from various sources into a coherent format. Integration may be achieved through schema integration, where metadata from several sources is combined. For instance, in the case of Baroda Bank, several sources of customer information were integrated to form a customer database, but the main problem was that some information was held in files with different names, and some records had differing values. For instance, some annual income figures were in ‘derived statements’ while others were in ‘table’ formats, and some values were inconsistent. This is because most databases are not updated well, leading to synchronization problems, and many attributes lack the required consistency. In such a case, ignoring such records may be necessary during data cleaning. Therefore, careful integration of data from multiple sources is important for reducing redundant data and other inconsistencies and for improving the overall data mining quality (Tseng & Kuo-Howang, 2003: 23).
In this case, transformation means smoothing the data to remove noise, and aggregating, normalizing and generalizing certain aspects of the data.
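As one illustration of the normalization step, the sketch below rescales an attribute to the range 0 to 1 using min-max normalization. Min-max scaling is only one of several possible normalizations and is assumed here for illustration; the function name and the income figures are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical annual incomes.
incomes = [20000, 35000, 50000, 80000]
scaled_incomes = min_max_normalize(incomes)
```

Putting attributes such as age and income on a common scale keeps the attribute with the larger numeric range from dominating distance-based clustering.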
Banking institutions have large, complex customer databases, and mining quality data from such databases may take long to pre-process, hence the need for reduction techniques. Data reduction helps to improve the efficiency, simplicity and accuracy of data mining. Although certain aspects of the data are lost through reduction strategies, data reduction yields a reduced representation that is smaller and likely to give the same analytical results. Strategies commonly used in data reduction include data cube aggregation, dimension reduction, numerosity reduction, discretization and concept hierarchy generation (Tseng & Kuo-Howang, 2003: 31).
In research work, missing values present a great challenge to making accurate predictions. In most cases, respondents are wary and do not give their demographic information easily, or they give false information. This is a great problem for business market analysis (Han & Kamber, 2000: 126). For instance, a bank that has missing or false information about customers’ age, gender or income may not be able to make verifiable market analyses and decisions if it wants to launch a new product in the market.
In the case of Baroda Bank, certain demographics on customers’ annual income, sources of income, age, marital status and gender were missing. The incomplete data cannot support any effective analysis. Several reasons could be given for the incomplete database: customers may opt to lie, hide or refuse to provide certain demographic information, or the banking officer may have intentionally failed to record the information. If the bank management wants to make important market and customer decisions, dealing with the problem of missing and false values is important (Tseng & Kuo-Howang, 2003: 26).
Ideally, the problem of missing values is addressed through various methods such as manual filling, imputation and computational methods. However, researchers agree that the best way of addressing the challenge of missing values is to apply the method that best matches the inherent properties of the data set (O’Brien & Marakas, 2011: 130).
Case study: in this case, consider that the management of Baroda Bank wishes to analyze customer behaviors based on their demographics.
Proposed method of data mining
In this project, a cluster method will be used to group customer attributes based on the existing data records in order to handle the problem of missing values. Cluster analysis allows the researcher to gather significant information on customers’ demographics. For instance, in the above case an integration of the clustering method and regression would help estimate the missing values.
The regression method is first used to predict values of the missing data to form a data set that is complete, or cleaned. A clustering method is then applied to the completed data set to find the best clustering of the data. The missing values are then recalculated using the regression method within each cluster, and this helps to estimate the missing values more accurately. To evaluate the efficacy of the method, prior experiments would be done with different sets of data.
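The three steps described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the method as implemented in the project: it assumes a single predictor (age) for a single missing attribute (income), simple least-squares regression, and a one-dimensional two-cluster k-means seeded with the smallest and largest values. All function names and customer records are hypothetical, and the sketch does not guard against clusters with fewer than two complete records.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def two_means(values, rounds=20):
    """One-dimensional k-means with k = 2; returns a label (0 or 1) per value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(rounds):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

def regression_cluster_impute(records):
    """records: list of (age, income or None). Returns completed (age, income) pairs."""
    known = [(a, inc) for a, inc in records if inc is not None]
    # Step 1: global regression gives provisional values for the missing incomes.
    slope, intercept = fit_line(known)
    provisional = [inc if inc is not None else slope * a + intercept
                   for a, inc in records]
    # Step 2: cluster the provisionally completed data set.
    labels = two_means(provisional)
    # Step 3: re-estimate each missing income from complete records in its own cluster.
    completed = []
    for (age, income), label in zip(records, labels):
        if income is None:
            base = [(a, inc) for (a, inc), lab in zip(records, labels)
                    if lab == label and inc is not None]
            s, b = fit_line(base)
            income = s * age + b
        completed.append((age, income))
    return completed

# Hypothetical records: two customer groups with different age-income trends.
records = [(20, 20000), (22, 22000), (24, 24000), (26, None), (28, 28000),
           (30, 30000), (50, 100000), (52, 106000), (54, None), (56, 118000),
           (58, 124000), (60, 130000)]
completed = dict(regression_cluster_impute(records))
```

In this constructed example the single global regression line gives a provisional income of about 29,161 for the 26-year-old customer, whereas the regression fitted inside that customer's cluster gives 26,000, which matches the trend of the neighbouring records.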
Previous studies on the cluster method of estimating missing values
Previous empirical studies indicate that the cluster analysis method estimates missing values more accurately than other methods. Past methods used in this category are the data-mining-based and the imputation-based methods. The imputation method has previously been used for missing numerical values, while the data-mining-based method has been used for categorical data.
The imputation method relies on existing values as a basis for making accurate estimates of the missing values. In this case, the imputation method assumes that there are correlations between the known data and the unknown attributes. The data-mining-based method, on the other hand, makes use of associations, clusters and classifications to assess the patterns between data sets, which are then applied to estimate the missing values. This study was confined to missing numerical values and used a twofold approach: exploiting the cluster properties of the data set and integrating them with a prediction method to generate accurate estimates (Tseng & Kuo-Howang, 2003: 35).
The most recent customer information in the records at Baroda Bank was collected in 2013 with the intention of informing the bank on customers’ behaviors as a strategy for assessing the viability of the market for a new product. The survey was supported by the bank with the interests of the stakeholders in mind, as part of assessing customers’ behavior before launching a new product in the market, and balancing those interests made the research tricky.
Missing data is a serious problem in most research work, and worse for business firms that rely on unprepared data to make market decisions. Nonetheless, data preprocessing techniques address the problem of missing data and thereby enhance the quality of the analyzed data and of the resulting decisions.
The aim of this project was to fill in the missing demographic data for the clients in order to bridge the gap left by the missing data and enable effective customer analysis. Filling in the missing data would aid the bank managers in making accurate decisions on the bank’s financial health, its market strength and customer behaviors. The study results would be helpful to the marketing team in making effective decisions with regard to customer behaviors and market trends.
In this case, data was collected from the existing customer databases in all files at the bank. In particular, the data collection method focused on assembling important data based on certain aspects such as gender, income per annum, age and spending habits. The collected data came from the central and the Eastern regions of Saudi Arabia, covering 1200 customers. Furthermore, analysis of the existing customer database indicated some missing demographic data for customers from the western remote region. Twenty-five percent of customers were selected randomly from each branch of the bank in the region. The region has a vast customer base, and it was hard for the marketing team to analyze all of this data in order to investigate the missing data. The cluster analysis method was, therefore, used to analyze the data as a way of predicting the missing values and obtaining samples that would be used in data collection from the whole population.
Sampling design and selection procedures
The sampling clusters were derived from the bank’s set of data with missing values. A nonlinear stratified sample of 500 customer details was selected from the bank’s customer database as collected in 2013. The selected sample would aid in grouping data into different clusters based on attributes such as age, income, region and spending habits. These samples would further aid in regression analysis and in accurately predicting the values of the missing data (Han & Kamber, 2000: 121).
In this project, as the best way to recover the missing values in the customer database, an integration of cluster analysis and the regression method was used to predict the missing values. However, rather than depending on fresh survey data, the regression method would use samples from previously collected data. The underlying strategy is to use regression and the cluster method to deal with the large set of data and its missing values. Clustering would be done on all customer data in a remote region in order to find the best clusters of the groups. For instance, the customer database could be clustered on the basis of income, age and gender. Later, a regression method would be used to estimate the missing values from the records.
Using the regression and clustering method to determine the missing data
In the case of Baroda Bank, the whole data set (D) with missing values in the Eastern region would be divided into two parts, D1 and D2, with group D1 containing all records with missing values and group D2 serving as the base. In this method, the missing values in D1 would be predicted through the regression method, with D2 as the base. For instance, the clusters can be made up of high-income earners versus low-income earners, or of males versus females.
Using the clustering method, a set of possible clusters of missing values would be developed, as in Ca, Cb, Cc…Cf. In this case, a regression analysis would be used to recover the missing values in the clusters using the existing records. In this way, the estimated values of the missing data would be used in the subsequent steps and help resolve the problem that would occur during clustering-based imputation where many samples are involved. After selecting the best sample clusters, regression analysis was performed to generate more accurate results.
The process of finding the best clustering for any given data is problematic, but a validation approach can be used to get good results. In this case, the CAST algorithm or the Hubert statistic could be used to validate the quality of the clustering results. Efficient results are obtained by running the data through the CAST algorithm and selecting the clusters that have the highest affinity threshold value. For instance, in the case of the missing income and age demographic data for the Bank of Baroda Eastern region, clusters initially formed on the basis of a range of income values or customer ages could be improved by using the threshold values to get accurate and effective clusters.
In addition, if an assessment of the similarity between two sets of data is required, a Pearson correlation coefficient could be used to enhance effective integration with the regression analysis. For instance, if two attributes in the missing data at Baroda Bank are related (income and age), these could be grouped in the same clusters to make the analysis of customer behavior or spending habits easier (Han & Kamber, 2000: 127).
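The Pearson correlation coefficient mentioned above can be computed with a few lines of Python. The sketch below uses the standard definition (covariance divided by the product of the standard deviations); the function name and the age and income figures are hypothetical, and the sketch assumes neither attribute is constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical complete records for one cluster of customers.
ages = [25, 32, 41, 50, 58]
incomes = [30000, 41000, 52000, 60000, 75000]
r = pearson(ages, incomes)
```

A coefficient close to 1 or to -1 suggests that one attribute is a useful predictor of the other within the cluster, while a value near 0 suggests that a regression on that attribute would add little.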
In another example, if two selected sample clusters in the whole set of data have similar means even though some values are missing, the values of the missing attributes could be predicted accurately using the Pearson correlation method. For instance, for the missing demographics at Baroda Bank, if the mean incomes per customer over the years in a given cluster are similar, this information could be used to recover the income values of other customers (O’Brien & Marakas, 2011: 132).
Limitations of using the regression and clustering method to determine the missing data
One major limitation of the regression and clustering method is that it is not effective for dispersed data and is suited only to data that share a similar cluster base. Secondly, the sample base used for predicting the missing values is sometimes too large to aid effective prediction, and this may lead to imprecise imputation. In the same vein, clustering a large set of data requires caution, as the missing attributes may affect the quality of the clusters, leading to poor prediction of the missing values. Nevertheless, the integration of the regression and cluster methods (RC) is effective in handling missing values in large sets of data, thereby ensuring the quality of the data mined.
Data mining is an important part of research work. Raw data brings many statistical and analytical problems if it is not well prepared and pre-processed in the warehouse. Before engaging in any statistical analysis, researchers need to prepare the data by cleaning, integrating, transforming and reducing it to suit the required analysis. Addressing the problem of missing values in raw data should be the first step in data pre-processing.
Data pre-processing helps in producing valid and reliable data that can be used to make valid analyses and decisions. In the business world, data pre-processing is equally important. Business managers cannot make valid market, customer or financial health analyses if the existing data in their records has missing values. As such, before the analysis of any data, it is important to prepare the data so that only quality and reliable data is used for analysis and decisions. Data pre-processing involves data cleaning, data integration, data transformation, data discretization and data reduction.
Several techniques are used in data cleaning, and regression clustering (RC) is among the more recent and effective of them. In this case, regression analysis is applied within several clusters to improve the prediction of missing values. Using the regression clustering technique, it was easier to estimate missing values of the income, age and gender attributes in a set of data at the bank.
Han, J., & Kamber, M. (2000). “Data Mining: Concepts and Techniques.” San Mateo, CA: Morgan Kaufmann.
O’Brien, J. A., & Marakas, G. M. (2011). “Management Information Systems.” New York, NY: McGraw-Hill/Irwin.
Tseng, Shin-Mu, & Kuo-Howang (2003). “A Pre-Processing Method to deal with missing values by integrating clustering and regression techniques.” Applied Artificial Intelligence. London, GB: Taylor & Francis.