IMPORTANCE OF STATISTICAL MEASURES TO DATA EXPLORATION
Data exploration is an important aspect of any statistical research and an important step in the analysis of data. In many instances, insufficient attention is paid to the preliminary assessment of the data collected. Most researchers proceed through the various phases of statistical analysis without critically assessing the accuracy of the data entered or collected. Data exploration is therefore an important step to take before engaging in other data processing activities, because it gives a clear understanding of the attributes of the collected or entered data. Data exploration is also very useful in choosing the right technique for data analysis. Data mining is a methodological approach to investigating patterns in large data sets. Its objective is to derive information from a set of collected data and to present it in a structure that is more understandable and more useful. The process of data exploration may rely on graphical displays or on summary statistics.
Ideally, data exploration means assessing the variability of data, the patterns in data and the distribution of variables through statistical tools such as charts, histograms and graphs. In addition, data exploration means assessing any other unusual characteristic in the collected data before formal data analysis commences; this is critical in giving more insight into hypothesis testing. Data exploration focuses on assessing the assumptions behind the data and the hypotheses modeled, as well as the relationships that exist between variables. The relationship between variables is assessed by visualizing them through graphs, scatterplots and bar graphs (Vaughan 150). Moreover, data exploration entails handling missing values and transforming variables. This research paper discusses the importance of statistical measures in data exploration.
Data mining is a methodological approach to investigating patterns in large data sets, in which various techniques from database systems, artificial intelligence and statistics are used (Chen, Han & Yu 872). Data fishing and data dredging are terms used to refer to the poor practice of analyzing data without a prior hypothesis (Ye 150). The objective of data mining is to derive information from a set of collected data and to present it in a structure that is more understandable and more useful.
The process of data mining involves data management aspects, data pre-processing, model and inference considerations, metrics, complexity, visualization and data updating (Healy 189). Data mining may be done through semi-automatic or automatic analysis of a large set of data in order to discover any unusual characteristic in the data clusters; in this process, techniques such as spatial indices may be used to check for anomalies in the collected data (Ye 150). In the data mining process, data is grouped in order to obtain more accurate results (Chen, Han & Yu 870). However, other data analysis steps such as data preparation, data collection and the interpretation of results are not part of the data mining process itself.
The principle of data mining is useful in the modern world, where statistical measures are used to analyze specific data that is in turn used to make important management decisions. It is also possible to base decisions on large sets of data as a whole, but this becomes a problem when specific and measurable data is required. Furthermore, data mining is important because it helps to analyze large data sets that other standard methods of data analysis cannot handle (Nisbet, Elder & Gary 75).
Researchers should spend quality time on the preliminary analysis of data, which helps them assess the patterns in the data in their discussions. In practice, most researchers spend little time on descriptive and exploratory techniques for data analysis. Data exploration, on the other hand, enables users to analyze large sets of data through techniques that allow the important data to be sieved out. In most cases, data that exhibit unusual or otherwise unique characteristics should be analyzed using the exploration method. Although the process of data exploration takes longer, it is important that any statistical data be explored and arranged in an organized structure that can be understood (Vaughan 163).
During data collection, vast amounts of data are gathered from the field, and only part of this data is useful for statistical analysis or hypothesis testing. Data exploration is therefore an important part of data processing, especially in providing useful information on the relationships and patterns that exist between variables (Ye 156). In the absence of data exploration, the other aspects of data processing cannot be considered valid. As such, data exploration needs to be conducted at the start of the exploratory data analysis (EDA) in order to free the collected data from typing errors, data entry errors and other mistakes made during data collection and analysis (William 15). In some cases, the application of effective descriptive and exploratory techniques may be all that is required for a complete analysis of the data. For instance, when data is collected from case studies and recorded in tables and graphs, only a few statistical measures may be needed to check the significance of the results.
Many researchers and students fail to realize that statistical analysis does not always come at the end of data analysis; it is also required before hypothesis testing. In addition, statistical analysis is just one technique for reaching conclusions from data and not an end in itself (Weisberg 154). In a large set of data, therefore, the application of statistical measures in the exploratory stage is useful in identifying notable patterns and relationships between the variables, which in turn helps the researcher to know which data analysis tool will be effective for the final analysis.
Aims of carrying out an exploratory data analysis
Facilitates the selection of the best technique for data preprocessing and analysis
Aids the researcher in discovering the best way to address the null hypothesis
Facilitates the development of hypotheses for future studies
The exploratory phase helps in the detection of unusual data values in the collected data and the best way of fixing them
The exploratory phase enables the researcher to know the variables to analyze
Preliminary summaries are made which are useful in providing answers for particular study objectives.
The exploratory phase helps in laying out a plan for statistical analysis.
There are two approaches used in data analysis: the one-dimensional approach and the two-dimensional approach. In one-dimensional data analysis, a single variable is analyzed independently of the other attributes in the data set (Weisberg 154). In this case, analysis is conducted through statistical tools such as histograms and measures of central tendency. The two-dimensional approach analyzes two related variables together, such as income and customer. Many other analysis methods used in data exploration involve multiple variables, including cluster analysis (Vaughan 155).
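The contrast between the two approaches can be illustrated with a minimal Python sketch. The figures below are hypothetical, and the use of the Pearson correlation coefficient as the two-variable measure is an assumption made for illustration rather than a method prescribed by the sources cited above.

```python
import math
import statistics

# Hypothetical figures for five customers (illustrative only).
income = [30, 45, 50, 62, 80]      # yearly income, in thousands
spending = [12, 18, 21, 24, 33]    # yearly spending, in thousands

# One-dimensional analysis: summarize a single variable on its own.
income_mean = statistics.mean(income)
income_median = statistics.median(income)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Two-dimensional analysis: measure how two related variables move together.
r = pearson_r(income, spending)

print("mean income:", income_mean)
print("median income:", income_median)
print("income/spending correlation:", round(r, 3))
```

A coefficient close to 1 would suggest that the two variables rise together, which is the kind of relationship the two-dimensional approach is meant to reveal.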
Importance of data exploration
Data exploration is very useful in choosing the right technique for data analysis. Large businesses increasingly use data exploration for market analysis, and this has helped business expansion. Many industries now use data exploration to assess their online performance and dominance. Data mining techniques are also applied to games such as chess and dots-and-boxes, for example through tablebases. In the business world, data mining is prominent in analyzing records of past business activities stored in data warehouses, with the objective of investigating the underlying trends and patterns (Nisbet, Elder & Gary 56).
In this case, data mining algorithms are used to sift through vast amounts of collected information in order to isolate a particular aspect of it. For instance, if a business wants to understand trends in the sales of a new product, data mining techniques are applied to perform market analysis. In other cases, data mining is applied by large manufacturing plants to detect problems in the manufacturing process. Other applications include acquiring new customers, profiling customers and cross-selling products or services to existing clients.
Applications of data mining techniques
Most business corporations in the world collect vast amounts of data each day through online and manual transactions. The collected transaction information is stored in large centralized databases. This information may be of little use if there is no technique for analyzing or mining it. For instance, large multinational retailers like Wal-Mart, Amazon and Alibaba periodically wish to analyze market trends for particular goods, customer trends and sales trends. The information gathered in these areas is important to the firm when making strategic plans for promotions, marketing campaigns and advertising in order to increase sales. In addition, the analysis of such large sets of data helps businesses make accurate predictions about customers.
In the same vein, when clients use their credit cards for financial transactions, vast amounts of data are collected, which helps firms assess consumer behavior, detect crime and identify other unique aspects of the customer database (William 15). Likewise, data mining is useful in customer relationship management systems, where it is used to retrieve important information about particular clients when the need to contact them arises. Sophisticated software is now used to build data mining tools for customer profiling, prediction and automatically contacting the required set of clients from large databases (Vaughan 151). Recent studies indicate that businesses that employ data mining tools have recorded higher returns than businesses that do not. This is because data mining techniques help businesses to predict sales and customers for particular regions, which in turn helps them apply the right marketing approach for each type of client (Chen, Han & Yu 870).
Furthermore, data mining is now used in human resources departments to identify and select successful employees based on certain attributes. In the same vein, corporations rely on data mining when making strategic enterprise management plans in relation to the goals to be achieved, profits, sales target margins and work plans for employees (Liu 130). Market basket analysis applies the data mining principle to identify clients' tastes, preferences and purchase patterns, which gives a solid basis for predicting future buying and supply trends. Overall, there are various areas where data mining is important, but its most critical role is assisting the decision-making process.
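The idea behind the basket analysis of purchase patterns mentioned above can be sketched in a few lines of Python. The baskets are hypothetical, and counting how often pairs of items are bought together (their "support") is only the simplest form of the technique, assumed here for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: the share of all baskets in which the pair appears together.
support = {pair: count / len(baskets) for pair, count in pair_counts.items()}

for pair, share in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, share)
```

Pairs with high support are candidates for joint promotion or shelf placement, which is how such counts feed into the decision-making process.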
In research studies, statistics refers to the approaches used in collecting, classifying, computing, synthesizing, analyzing and interpreting a set of collected data in a systematic and quantitative way. Statistics provides important information on a set of data and more insight into a particular experiment (Healy 179). Statistics also gives an idea of the general attributes of the collected data. In general, statistics has descriptive and inferential aspects. In descriptive statistics, graphical and numerical summaries are used to describe a set of gathered data. Inferential statistics makes inferences about a given population based on the analysis of variables collected from a sample (Siddharth 2).
Ideally, the approach used in data exploration and transformation is determined by the type of variables in the collected data; the variables may be ordinal, numeric or nominal. The process of data exploration may rely on graphical displays or on summary statistics. Measures of central tendency are commonly used in most statistical analyses; a measure of central tendency is also referred to as the central value of a probability distribution (Nisbet, Elder & Gary 50). In short, central tendency means the clustering of quantitative data around a particular central value. The measures of central tendency include the median, the mode, the arithmetic mean and the geometric mean. Measures of central tendency are useful in statistics for the following reasons:
Finding a representative value for a given set of data
In the statistical analysis of data, a measure of central tendency provides one value that represents the attributes of the whole data distribution. For instance, when assessing and comparing academic performance among many students and subjects in the same class, measures of central tendency are effective in computing single figures that can be used to distinguish the performance of one student from another (Quinn, Keough & Michael 47).
Helps in condensing data
Ideally, data is collected in large quantities, and this may present problems when making important decisions about a particular attribute of a population. In such a case, the data is condensed to an average figure that can be used to make a decision about the whole population. To illustrate, a large retailer like Wal-Mart may wish to identify the best-selling product among its millions of daily transactions worldwide (Nisbet, Elder & Gary 67).
Central tendency measures are useful in comparisons
When a large set of data is collected on multiple variables or distributions, a measure of central tendency becomes necessary. In this case, the central value is calculated to act as a representative of one distribution. With single values representing each population, it is easy to make comparisons between or among multiple variables (Cook & Swayne 165).
Measures of central tendency are important in making further statistical analysis
In most statistical analyses, measures of central tendency are used as inputs to further statistical assessments. For instance, arithmetic means and modes, together with standard deviations, are used in calculating the skewness of data, correlation and index numbers.
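One way in which the mean, the mode and the standard deviation combine into a skewness figure is Pearson's mode skewness, (mean - mode) / standard deviation. The sketch below assumes this particular coefficient and uses hypothetical scores; other skewness measures exist and may be preferred in practice.

```python
import statistics


def pearson_mode_skewness(values):
    """Pearson's mode skewness: (mean - mode) / sample standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)
    sd = statistics.stdev(values)
    return (mean - mode) / sd


# Hypothetical scores with one large value pulling the mean upward.
scores = [2, 3, 3, 3, 4, 5, 9]
skew = pearson_mode_skewness(scores)

print("skewness:", round(skew, 3))
```

A positive result indicates that the mean lies above the mode, that is, a longer tail toward the high values; a result of zero indicates that the two coincide.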
Methods of data exploration
In the initial part of data exploration, the computation of means, ranges, standard deviations and counts provides an insightful summary from which aspects of data patterns can be discovered; this is important for further assessment of the variation in the data. Using these methods also helps to identify missing values in a given set of data. The mean is the most helpful value when investigating the peculiarities of a data set, because variations or missing values in the data are reflected in its magnitude. The mean is used when a central measure is required for a particular set of data and when finding its point of balance. In order to describe data distributions effectively, measures of central tendency and dispersion become the key values (Siddharth 2).
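A minimal sketch of such an initial summary is given below. The readings are hypothetical, and representing a missing value as `None` is an assumption about how the data was entered.

```python
import statistics


def summarize(values):
    """Return counts, mean, range and standard deviation, ignoring None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "range": max(present) - min(present),
        "stdev": statistics.stdev(present),
    }


# Hypothetical readings; None marks a value that was never recorded.
readings = [12, 15, None, 14, 13, None, 16]
summary = summarize(readings)

for name, value in summary.items():
    print(name, value)
```

Comparing the count of present values with the number of rows is often the quickest way to see how much of a variable is missing before any further analysis is attempted.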
The median is the central value that divides a given set of data into two halves, the upper values and the lower values. To find this measure of central tendency, the data is arranged in order and the middle value is picked. No arithmetic on the magnitudes is involved, and therefore extreme observations have little effect on this value (Quinn, Keough & Michael 45). However, when the data contain no extreme observations, the mean is usually the more effective description of the data set because it uses every value.
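The limited effect of extreme observations on the median can be seen in a short sketch. The salary figures are hypothetical and serve only to show how a single extreme value moves the mean far more than the median.

```python
import statistics

# Hypothetical salaries, in thousands.
salaries = [30, 32, 35, 36, 40]
with_extreme = salaries + [400]   # the same data plus one extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_extreme)
median_after = statistics.median(with_extreme)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```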
The mode of a particular set of data is the most common value observed, and it can be used even for nominal data. In this case, the most frequently repeated value is picked as the measure of central tendency to describe the characteristics of the data set.
These measures of central tendency are valued for their effectiveness in handling the variation present in data; the logic is to express the variation of the data relative to a central value.
Sampling as a method of data exploration
In most study projects, time constraints, costs, geographic factors and other considerations make it impossible to process a large set of data covering a wide population. As such, sampling is used to select a portion of the wider population for data analysis. Various sampling approaches are used, including simple random sampling, stratified sampling and sampling without replacement. The data collected from the sample is then used to make inferences about the attributes of the larger population (Nisbet, Elder & Gary 50).
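Two of the sampling approaches named above can be sketched as follows. The record layout, the `region` field, the 10% sampling fraction and the fixed seed are all hypothetical choices made for illustration; `random.Random.sample` draws without replacement.

```python
import random


def simple_random_sample(population, k, seed=0):
    """Draw k items without replacement, each item equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 60 records from the north, 40 from the south.
records = [{"id": i, "region": "north" if i < 60 else "south"}
           for i in range(100)]

random_part = simple_random_sample(records, 10)
stratified_part = stratified_sample(records, lambda r: r["region"], 0.1)

print(len(random_part), len(stratified_part))
```

The stratified draw guarantees that each region is represented in proportion to its size, whereas the simple random draw only does so on average.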
In this case, the data is arranged in tables of distributions and means; in these tables one can easily investigate missing values and understand the data distribution. When data is held in Excel sheets, an Excel pivot table is useful for building frequency tables that can be used to detect individual outliers in the data (Healy 189).
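Outside a spreadsheet, a comparable frequency table can be built with a few lines of Python. The ages below are hypothetical, and treating values that occur only once as candidates for inspection is a rough rule of thumb assumed here, not a formal outlier test.

```python
from collections import Counter

# Hypothetical ages; 230 is a likely data entry error for 23.
ages = [23, 25, 25, 24, 23, 25, 24, 230, 24, 25]

# Frequency table: each distinct value with the number of times it occurs.
frequency = Counter(ages)
table = sorted(frequency.items())

# Values seen only once are worth a closer look, though not all are errors.
rare_values = [value for value, count in table if count == 1]

for value, count in table:
    print(value, count)
print("inspect:", rare_values)
```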
Graphical methods of data exploration
Histograms give a visual picture of the data distribution. For instance, if the data is concentrated near the center and the two sides of the histogram are symmetrical or balanced, the data is said to be approximately normally distributed. In some cases, the data may be concentrated on one side, leaving the histogram unbalanced; in such a case the data distribution is said to be skewed, and other statistical tools should then be used to analyze it.
[Figure: histogram illustrating data that is normally distributed]
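The reading of a histogram as balanced or skewed can be approximated numerically. In the sketch below the data sets are hypothetical, and comparing the mean with the median is a rough heuristic assumed for illustration; it does not replace a look at the histogram itself.

```python
import statistics
from collections import Counter


def histogram(values, bin_width=1):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def describe_shape(values):
    """Rough shape check: a mean above the median suggests a right tail."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right-skewed"
    if mean < median:
        return "left-skewed"
    return "roughly symmetric"


balanced = [1, 2, 2, 3, 3, 3, 4, 4, 5]
lopsided = [1, 1, 1, 2, 2, 3, 4, 7, 9]

for name, data in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, describe_shape(data))
    for start, count in histogram(data).items():
        print(f"{start:>3} | {'#' * count}")
```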
Scatter graphs show the relationship between two variables visually; the points are plotted without a fitted straight line. Scatter plots are also used to identify outliers in the data, since the individual points indicate the relationship and the differences between individual values of a data set.
Bar charts and pie charts
Bar charts are an important way of expressing data distributions visually as frequencies, proportions or mean values. Values are presented in pictorial form as a series of columns. In pie charts, sectors of a circle are used to illustrate data attributes. Charts are the most understandable way of presenting the distribution of data that takes ordinal or nominal values.
In line graphs, the data distribution is used to plot a line; the points on the line represent single observations or statistical values derived from the sum, the mean or the median (Healy 199). Line graphs are used when the data exhibits a regular pattern or consists of repeated measurements. Line graphs present data in visual form, which helps to describe the data set.
Other methods of graphical data representation involve the use of boxplots, outlier plots, trend graphs and survival curves (Cook & Swayne 135).
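The figures that a boxplot draws can be computed directly. The sketch below uses hypothetical data and assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the quartiles come from `statistics.quantiles` with its default method, and other quartile definitions give slightly different values.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, interquartile range and points outside the 1.5 * IQR fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}


# Hypothetical measurements with one unusually large value.
measurements = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
box = box_plot_summary(measurements)

print(box)
```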
Data exploration is critical before the final data analysis is made. The process of data exploration is essential in discovering missing data or variables that create problems. In such cases, it is important to correct bad variables and deal with the missing values so that the final data analysis contains few errors. Many statistical measures are used in data exploration, and it is important to understand them when choosing the right measure to describe a data distribution. It is also imperative to consider the relationships that exist between variables in order to use the right method of data exploration.
Data mining and data exploration are useful when analyzing large data sets in order to obtain a central value that can be used to make decisions. Data mining is commonly used in large organizations to retrieve specific information from large sets of data. The process of data exploration uses measures of central tendency to describe data distributions. These measures of central tendency are the mean, the median and the mode; relationships between variables are examined separately, for example through correlation and regression. The process of data exploration is not only useful in describing the pattern of a data distribution and identifying missing values but also helps in determining the appropriate statistical technique for the overall data analysis.
Chen, M.S., J. Han, and P.S. Yu. “Data mining: an overview from a database perspective.” Knowledge and Data Engineering, IEEE Transactions on 8 (6), (1996) 866-883.
Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence). “Interactive and Dynamic Graphics for Data Analysis: With R and GGobi.” Springer, 2007.
Healy, Joseph F. “The Essentials of Statistics: A Tool for Social Research (2nd Ed.).” Belmont, CA: Cengage Learning. 2009. pp. 177–205.
Liu, Bing. “Web Data Mining: Exploring Hyperlinks, Contents and Usage Data.” Springer. 2007.
Nisbet, Robert; Elder, John; Miner, Gary. “Handbook of Statistical Analysis & Data Mining Applications.” Academic Press/Elsevier. 2009.
Quinn, Geoffrey R.; Keough, Michael J. “Experimental Design and Data Analysis for Biologists (1st Ed.).” Cambridge, UK: Cambridge University Press. 2002. pp. 46–69.
Siddharth Kalla. “Statistical Mean.” (Jan 13, 2009). Retrieved Aug 05, 2014 from Explorable.com: https://explorable.com/statistical-mean
Vaughan, Simon. “Scientific Inference: Learning from Data (1st Ed.).” Cambridge, UK: Cambridge University Press. 2013. pp. 146–152.
Weisberg, H.F. “Central Tendency and Variability,” Sage University Paper Series on Quantitative Applications in the Social Sciences. 1992.
William Barnett II. “Dimensions and Economics: Some Problems,” Quarterly Journal of Austrian Economics 7(1) 2007.
Ye, Nong. “The Handbook of Data Mining,” Mahwah, NJ: Lawrence Erlbaum. 2003.