Data Mining and Business Intelligence
1.0 Introduction
2.0 Classification
2.1 Risk.csv Dataset and definition.
2.1.a Decision tree.
2.1.b Rule-based classifier
Evaluating the results
2.2 Heart disease.csv Dataset and definition
2.3 Skeletal measurements.csv Dataset and definition.
2.3.a Neural Network training
2.3.b Rule induction training
3.0 Clustering
3.1 Clusterdata.csv
3.2 Clustering with K-Means
3.3 Clustering with Hierarchical clustering (Agglomerative)
Results of clustering the data
1.0 Introduction
Many practical application domains currently employ data mining algorithms to analyze data, most often for classification and prediction problems. Data mining as a practice involves techniques and processes for computationally analyzing large sets of data, discovering patterns, and using those patterns to make sense of the data.
Businesses, most notably online stores, use predictive analytics to suggest to their customers items they should consider buying or viewing. Data mining algorithms have evolved over the years, but challenges remain in the industry, such as “Spamdexing” [1], which is used to manipulate a search engine’s page-rank algorithm into assigning a page a higher rank.
This paper evaluates four data sets from several application domains using two data mining methods: classification and cluster analysis. It is organized in two parts, one for each method. Three of the data sets are investigated using classification algorithms and the remaining data set using clustering algorithms.
All experiments in this paper are conducted using the RapidMiner data analytics tool.
2.0 Classification
Classification [2] problems attempt to assign objects to classes based on their characteristics. This pattern-based classification can be used to make sense of existing instances and to predict the class of new instances.
2.1 Risk.csv Dataset and definition.
The data set comprises 4117 examples in 12 columns, which are broken down in the table below. The data in risk.csv describes loan applicants and their corresponding risk factor.
The risk assigned to a loan applicant is based on past applicants who had similar characteristics. The data set contains twelve attributes: RISK, AGE, ID, HOWPAID, MORTGAGE, STORECAR, INCOME, GENDER, MARITAL, NUMKIDS, NUMCARDS, and LOANS. This information is used as a decision matrix to build a classification model from previous applicants who were either granted or denied a loan. The model can then be used to predict whether a new applicant should be granted a loan.
Table 2.1.1 shows the instance values for each of the attributes and the label.
Table 2.1.1 Risk.csv metadata definition and role
Two classification methods are used to learn patterns in this dataset: a rule-based classifier and a decision tree. In both cases, the data is passed through the learner, the model is applied, and finally a performance vector (confusion matrix) is generated.
These classifiers are appropriate because they can work with both polynominal (multi-valued nominal) and integer attributes. This data set has seven integer attributes and five polynominal attributes.
The learned pattern should be simple and provide a summary of the dataset, so that the available data is easier to interpret.
Before classification training, the data in CSV format must first be imported.
2.1.a Decision tree.
The imported data is added to the process, then split into training and validation sets using the Split Validation operator.
Figure 2.1.a.1 Decision tree setup
The Split Validation operator is a nested operator with two sub-processes: a training sub-process and a validation sub-process. The validator used stratified sampling with the default split ratio of 70%. Stratified sampling can make the sample more representative by reducing sampling error [3].
Figure 2.1.a.2 Split validation sub-process
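The stratified 70% split described above can be sketched outside RapidMiner as follows. This is a minimal illustration in plain Python, not the Split Validation operator itself; the function name `stratified_split`, the seed, and the class labels `class_a` and `class_b` are assumptions made for the example and do not come from risk.csv.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.7, seed=42):
    """Return (train_idx, test_idx); each class keeps roughly train_ratio in training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # random order inside each class
        cut = round(len(members) * train_ratio)   # per-class training share
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label column: 70 examples of one class, 30 of the other.
labels = ["class_a"] * 70 + ["class_b"] * 30
train_idx, test_idx = stratified_split(labels)
```

Because the split is made per class, the 70/30 class balance of the full data is preserved in both the training and the validation part, which is the property the paragraph above relies on.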
Running the process produced two views: a decision tree and a performance table.
The process was rerun several times with various changes to the decision tree parameters, to observe whether the alterations affected the efficiency and performance of the algorithm.
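One parameter a decision tree learner typically exposes is the split criterion. The performance tables in this section are labelled with information gain, so the sketch below assumes that criterion; it shows how the gain of a single nominal attribute could be computed. The attribute values and labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute separates the classes, the other does not.
labels = ["good", "good", "bad", "bad"]
informative = ["y", "y", "n", "n"]
uninformative = ["m", "f", "m", "f"]
gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

A tree learner using this criterion would split on the attribute with the larger gain first.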
2.1.b Rule-based classifier
The processes above were repeated, this time using a rule-based classifier instead.
Figure 2.1.b.1 Rule-based validation sub-process.
Evaluating the results
Decision trees performed better than rule-based classification in the following respects:
– The training time for the decision tree was small compared to the time taken to learn the rule-based classifier.
– The decision tree results were easier to use: although the rule-based algorithm outputs a more detailed result sheet, a decision tree is much faster and easier to interpret.
– The accuracy obtained from the decision tree was much higher than that of the rule-based algorithm.
Table 2.1.b.1 Information-gain performance vector for a decision tree classifier
Table 2.1.b.2 Information-gain performance vector for a rule-based classifier
Figure 2.1.b.2 Decision tree
2.2 Heart disease.csv Dataset and definition
All values in this data set are numeric. This presents a problem because the naïve Bayes and support vector machine (SVM) learners do not support a numeric label.
Solution: att14 is used as the label and must therefore be imported as a polynominal type.
The remaining data is entirely numeric and is split into two distinct groups by the label. The classification algorithms chosen are therefore naïve Bayes and a support vector machine.
The data set contains 270 examples with 14 attributes.
Table 2.2.1 Heart disease.csv metadata
Att14 represents the diagnosis for heart disease: 1 for no and 2 for yes.
A simple approach fits the current classification problem. The algorithms we use should work with a categorical label (heart disease = yes or no). In our case Att14 is either 1 or 2, so every example can be assigned to one of the two categories. Such a hard classification loses information; the alternative is to use probability density function values: for each of the two outcomes, we estimate the probability density function value at the decision points.
With this idea in mind, two classification algorithms were chosen: naïve Bayes as a generative model and support vector machines (SVM) as a discriminative model.
Naïve Bayes classifiers are a class of probabilistic classifiers which, given a sample input, return a probability distribution over a set of classes.
We first retrieve the sample data that the algorithm will learn from. Next, we connect the output of the retrieved data to a Select Attributes operator, which selects the two attributes used to produce the probability distribution. The two selected attributes and the label are then passed to the Naïve Bayes operator, which generates the distribution. Finally, we apply the model and generate a performance statistics table for the operation.
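To make the idea of estimating a probability density per outcome concrete, the following is a minimal Gaussian naïve Bayes sketch in plain Python. It is an illustration under stated assumptions, not the RapidMiner operator: the two attributes (age and a blood-pressure-like value), the six rows, and all function names are hypothetical, and a small constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate, per class, the prior and the mean/variance of every attribute."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9  # smoothing term (assumption)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def gaussian_pdf(x, mean, var):
    """Normal probability density function value at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def predict_proba(model, row):
    """Return a normalised probability for each class given one example."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        likelihood = prior
        for x, m, v in zip(row, means, variances):
            likelihood *= gaussian_pdf(x, m, v)  # attributes treated as independent
        scores[label] = likelihood
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Hypothetical rows: (age, blood pressure); label 1 = no disease, 2 = disease.
rows = [(40, 120), (45, 125), (50, 130), (60, 150), (65, 155), (70, 160)]
labels = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(rows, labels)
probs = predict_proba(model, (68, 158))
```

The returned dictionary is the probability distribution over the two diagnoses for one new example; choosing the larger value gives the hard classification.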
Figure 2.2.1 Setup for Naïve Bayes training
Figure 2.2.2 Setup for SVM training
Setup for SVM learner.
Table 2.2.1 Performance indicators for Naïve Bayes
Table 2.2.2 Performance indicators for SVM
From the performance statistics, naïve Bayes has better accuracy but low precision when predicting diagnosis 2. The SVM, however, is better at predicting a yes diagnosis for heart disease, at the cost of overall accuracy.
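The distinction between accuracy and per-class precision used in this comparison can be illustrated with a short sketch. The actual and predicted labels below are invented for the example and are not the results of either learner; the function names are likewise assumptions.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of true positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical predictions; 2 = heart disease, 1 = no heart disease.
actual = [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
predicted = [2, 2, 2, 1, 2, 1, 1, 1, 1, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive=2)
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

A learner can therefore score well on accuracy while still having low precision for one class, which is the trade-off noted above.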
Figure 2.2.3 Naïve Bayes graph displaying probability of heart disease with reference to age
The graph shows that the probability of diagnosis 2 (heart disease = yes) is higher than that of diagnosis 1 (heart disease = no) for older people, from age 52 onwards.
2.3 Skeletal measurements.csv Dataset and definition.
The data set comprises five hundred and seven (507) examples, each with twelve (12) regular attributes and one (1) special attribute, gender, whose role is to act as the label. The attributes are listed in Table 2.3.1 below.
Each row in the dataset represents a person, and each of the twelve regular attributes represents a characteristic, in our case the value of a specific skeletal measurement.
In this exercise, we attempt to classify people by gender: 1 for female and 0 for male. With this information, we generate patterns that can be used to predict a person’s gender from their skeletal measurements.
Table 2.3.1 Metadata for skeletal measurements.csv
From the metadata provided, two classification algorithms were chosen to learn the patterns in the dataset: neural networks and rule induction.
Just as in the setup for 2.1 Risk.csv, the training has two main operations on the data, except that in this case we use a cross-validation (X-Validation) operator. We first retrieve the data, then connect it to the validator port. The cross-validation operator is a nested operator with two sub-processes, one for training and one for testing. The main setup for the training is shown in Figure 2.3.1.
Figure 2.3.1 The main setup process.
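The way a cross-validation operator partitions the data can be sketched as follows. This is a simplified illustration that splits the rows in their given order, without the shuffling or stratification a real tool may apply; the function name `k_fold_indices` and the choice of 10 folds are assumptions, while 507 is the number of examples stated above.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; every example is tested exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 507 examples as in the skeletal measurements data; 10 folds is an example value.
folds = list(k_fold_indices(507, 10))
tested = sorted(i for _, test_idx in folds for i in test_idx)
```

Each fold trains a model on the training indices and measures it on the held-out indices; the reported performance is the average over all folds.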
2.3.a Neural Network training
To generate a neural network, we first add a neural network learner to the cross-validation training sub-process. We then apply the model and measure its performance on the data.
A neural network is a three-part architecture made up of a single input layer that corresponds to the input variables, a single output layer that corresponds to the possible outcomes, and the hidden layer(s) in between. A typical setup may have more than one hidden layer of different sizes; this, however, compromises training performance. We repeated the experiment with different configurations. In both experiments, optimal performance was obtained with the number of cross-validations [4] between 9 and 12; with any higher or lower value the performance decreased. The learning rate [3] controls the step size when weights are iteratively adjusted and so determines how fast the network learns. A higher learning rate also increases the chance that the weight updates overshoot, contributing to network instability.
Figure 2.3.2 Sub-process setup for neural-network training.
We use one hidden layer of size -1, which means the layer size is determined automatically by a heuristic.
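The effect of the learning rate on weight updates can be seen in a toy example. The sketch below is not a neural network: it applies the same kind of iterative update to a single weight minimising the assumed function f(w) = w², whose gradient is 2w. The three learning rates are arbitrary example values.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    w = start
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # step size is scaled by the learning rate
        history.append(w)
    return history


slow = gradient_descent(0.01)     # small steps: still far from the minimum at 0
steady = gradient_descent(0.3)    # moderate steps: converges quickly
unstable = gradient_descent(1.1)  # large steps: overshoots 0 and grows each time
```

With the largest rate the weight jumps past the minimum on every step and its magnitude grows, which is the instability the paragraph on learning rate refers to.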
2.3.b Rule induction training
The setup is the same as in 2.3.a above, but the neural-network operator is replaced by a rule-induction operator. Rule induction was used to verify the results obtained in 2.3.a. Both learners performed relatively well, each obtaining an accuracy above 90%.
Comparing the results obtained in 2.3.a and 2.3.b gives an overview of the main attributes involved in determining a person's gender from the 12 attributes. On average, the neural network performed better at predicting a person's gender, with an accuracy of 93.51% and a precision above 92%. The results from both experiments indicate that a person's biacromial measurement is the key determinant of whether the person is male or female.
Performance of the neural-network trainer
Performance of the rule-induction trainer.
The strength of the line connecting two neurons indicates the weight (importance) of that connection. We can see that the connection from the biacromial input is stronger and thus carries a higher weight.
Rule induction training results
The rules obtained from the rule-based classifier also show that the biacromial measurement has a higher importance.
3.0 Clustering
Clustering is unsupervised learning that attempts to find natural groupings of instances in unlabeled data.
Clustering groups instances into sets based on a comparison of attribute values across the dataset. For example, given a group of organisms, we can group them based on features such as colour or the presence of an exoskeleton. It is unsupervised in that the groups are not known in advance; they are formed from the variables within the data.
For this data set, we use the K-Means and hierarchical clustering algorithms.
K-means splits a large data set into k smaller clusters, each with a centroid. The initial centroids are placed as far away from each other as possible. Each row of data is then treated as a point and associated with the centroid nearest to it. Once all rows have been grouped, the first step is complete. We then recalculate k new centroids as the centers of the clusters that result from the initial grouping. With these k new centroids, the same data is regrouped so that each point is assigned to the nearest new centroid. Because of the regrouping, the centroids move from their original locations; this process is repeated until there are no more changes.
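The steps above can be sketched in plain Python. This is a minimal illustration and not RapidMiner's implementation: the farthest-point rule used to spread the initial centroids, the function names, and the six sample points are assumptions chosen to mirror the description.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centroids that are spread far apart from each other."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def k_means(points, k, max_iter=100):
    """Return (centroids, assignment) after iterating until assignments stop changing."""
    centroids = farthest_point_init(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical 3-attribute points forming two well-separated groups.
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
centroids, assignment = k_means(points, k=2)
```

On this toy input the loop stops after the second pass, because the regrouping no longer moves any point to a different centroid.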
Hierarchical clustering arranges the data into a tree structure. It falls into two categories: bottom-up clustering and top-down clustering.
Bottom up: each record is initially considered a single cluster; clusters are then merged based on the smallest distance between them.
Top down: the whole data set is initially considered a single cluster. The data is then subdivided into smaller clusters based on how far apart neighbors are. This process is repeated until every record has its own cluster.
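The bottom-up (agglomerative) variant can be sketched as below, including the single, complete and average linkage options that are compared later in the results. This is a simple quadratic-time illustration with hypothetical two-attribute points and assumed function names, not the operator used in the experiments.

```python
import math


def linkage_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(dists)               # closest pair of members
    if linkage == "complete":
        return max(dists)               # farthest pair of members
    if linkage == "average":
        return sum(dists) / len(dists)  # mean over all member pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage="single"):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Hypothetical points: two pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
single = agglomerative(points, 3, "single")
complete = agglomerative(points, 3, "complete")
average = agglomerative(points, 3, "average")
```

On well-separated data such as this, all three linkage rules give the same grouping; they tend to differ on elongated or overlapping clusters.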
Table 3.1 shows the column names, the data type of each column, the minimum and maximum values, the deviation, the average and the number of missing values. The dataset holds 1000 records, each with three attributes: att1, att2 and att3.
Table 3.1 Metadata for clusterdata.csv
We first import the data and inspect it in a 3D scatter plot to see whether any natural clustering patterns are visible to the human eye. Each attribute is plotted on one of the x, y and z axes. From the plot, it is evident that there are five natural globular clusters.
Figure 3.1.1 3D colour plot showing 5 different clusters
Figure 3.1.2 3D plot showing 4 dominant clusters
Figure 3.1.3 3D plot showing one smaller cluster
3.2 Clustering with K-Means
We first retrieve the data set and connect it to a parameter alteration operator. The parameter alteration operator is a nested operator with one sub-process.
Figure 3.2.1 The main setup
Figure 3.2.2 Parameter alteration sub-process setup.
The K-Means clustering algorithm is added to the parameter alteration sub-process alongside the Cluster Distance Performance operator and a Log operator. The parameter alteration operator loops through the k values of the K-Means operator. We use Davies Bouldin as the main criterion for the Cluster Distance Performance operator. The Log operator is embedded in the parameter alteration operator to record the Davies Bouldin metric and the number of clusters.
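The selection loop described above, which records the Davies Bouldin value for each k and keeps the smallest, can be sketched as follows. The sketch uses the standard textbook definition of the index (lower is better); a tool may report the value with a different sign or scaling. The candidate groupings for k = 2 and k = 4 are hypothetical partitions of eight made-up points.

```python
import math


def centroid(cluster):
    """Mean point of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    centroids = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical groupings of the same eight points for two values of k.
candidates = {
    2: [[(0, 0), (0, 1), (1, 0), (1, 1)], [(10, 0), (10, 1), (11, 0), (11, 1)]],
    4: [[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(10, 0), (10, 1)], [(11, 0), (11, 1)]],
}
scores = {k: davies_bouldin(groups) for k, groups in candidates.items()}  # the "log"
best_k = min(scores, key=scores.get)
```

Splitting each natural group in two raises the index because neighbouring clusters end up close together relative to their spread, so the loop prefers k = 2 here.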
3.3 Clustering with Hierarchical clustering (Agglomerative)
The imported data is connected to a loop operator, which is a nested operator.
Figure 3.3.2 Loop parameter sub-process setup.
The Distribution Performance operator observes how the objects fall into clusters and reports how evenly they are distributed. The aim is a value at or close to zero.
Results of clustering the data
When the experiment was run with single linkage, the maximum number of clusters was found to be 12 for both the distribution performance and the density before the graph started to converge. The outcome was different with complete linkage, where the optimal number of clusters was found to be 3. The final test was conducted with average linkage, which gave 5 clusters. With this setup, the smallest value of the Davies Bouldin metric occurred when the total number of clusters was five.
Figure 3.3.3 K-Means cluster plot
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Burlington, Elsevier Science. http://www.123library.org/book_details/?id=37236.
Data Mining and Neural Networks Danny Leung. (n.d.). Scribd. Retrieved August 31, 2014, from http://www.scribd.com/doc/217619078/Data-Mining-and-Neural-Networks-Danny-Leung
A Programmer's Guide to Data Mining. (n.d.). A Programmer's Guide to Data Mining. Retrieved August 31, 2014, from http://guidetodatamining.com/
Zhu, X., & Davidson, I. (2007). Knowledge discovery and data mining: challenges and realities. Hershey, Information Science Reference. http://www.books24x7.com/marc.asp?bookid=20780.
Larose, D. T. (2006). Data Mining Methods and Models. Hoboken, John Wiley & Sons. http://www.123library.org/book_details/?id=10193.
Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms. Hoboken, N.J., John Wiley.
Cabena, P., Pablo, H., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering Data Mining: From Concept to Implementation. Upper Saddle River, NJ, USA, Prentice-Hall, Inc.
Witten, I. H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Amsterdam, Morgan Kaufmann. http://public.eblib.com/choice/publicfullrecord.aspx?p=234978.
Raymond, T., & Han, J. (n.d.). Efficient and Effective Clustering Methods for Spatial Data Mining. Techreports. Retrieved August 31, 2014, from ftp://ftp.cs.ubc.ca/.snapshot/sv_weekly.2/local/techreports/1994/TR-94-13.pdf
(2000). Principles of data mining. Cambridge, Mass, MIT Press.