Introduction
XLMiner offers an array of features for exploring and visualizing data; for example, users can opt to view detailed charts of their data using the Chart Wizard — there are 8 different types of charts available for selection, from a standard bar chart to a parallel coordinates chart.
Data cleansing is facilitated via dimensionality reduction techniques, such as Feature Selection and Principal Components Analysis. These methods enable users to increase the accuracy of their models by reducing extraneous input/independent variables. Of particular interest in XLMiner v2015 is the Feature Selection tool, which is used to discover the best data subset within classification and prediction models.
Lastly, data can further be organized via the use of data clustering methods. XLMiner supports k-means Clustering and Hierarchical Clustering, both of which offer robust clustering processes using different techniques.
Exploring Data
Creating a Chart
The Chart Wizard interface is used to create a wide variety of charts based on specified data. The types of charts that can be created within XLMiner include:
- Bar Chart: Bar charts are straightforward and particularly effective in comparing an individual statistic across a group of variables.
- Histogram: Histograms excel at conveying the shape and spread of data within datasets. A histogram is essentially a bar graph that depicts the range & scale of observations on the x-axis and the number of data points on the y-axis.
- Line Chart: Line charts convey data over time and are therefore preferred for time-series datasets. Typically the time period/range is displayed on the x-axis.
- Parallel Coordinates: Parallel coordinate charts are useful when doing predictive analytics. These charts display a backdrop comprised of n parallel lines (usually vertical and evenly-spaced). A data point within the n-dimensional space is conveyed as a polyline (connected series of lines) with vertices on the parallel axes. Axes can be re-ordered by dragging to different locations, enabling users to add meaning or insight to the chart.
- Scatterplot: Scatterplot charts are useful when identifying clusters of data and variable overlap. Scatterplots use Cartesian coordinates to display data points and can be color-coded in order to increase the number of displayed variables.
- Scatterplot Matrix: Scatterplot matrices are becoming increasingly common as a way to quickly determine whether there is a linear correlation between multiple variables. A scatterplot matrix essentially combines multiple scatterplots into one panel. The names of the variables are displayed in a diagonal line, from top left to bottom right. The axis titles and variable values appear at the edge of the rows or columns, enabling users to easily compare interactions between variables.
- Box Plot: Box plot charts are typically used to summarize datasets and to perform exploratory data analysis. These charts convey groups of numerical data via their quartiles and also convey key statistical values, such as the mean, median, maximum, and minimum. The spacing between the different parts of the box conveys the spread and skewness of the data. Box plots are highly useful because they display sample variations without making any assumptions about the underlying statistical distribution and, because the box is built from quartiles, are not strongly influenced by extreme values (outliers).
- Variable: The variables graph simply plots each selected variable's distribution.
- Export to PowerBI: This option is used to export data directly to Microsoft Power BI, a data visualization module within Office 365.
- Export to Tableau: This option enables you to convert the results of your optimization model into a Tableau Data Extract file (.tde), allowing you to use the popular Tableau solution to visually explore and analyze your data.
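The statistics behind the Box Plot entry in the list above can be sketched in a few lines of Python. This is only an illustration of the quartile summary a box plot draws, not XLMiner's own calculation; the function name `box_plot_summary`, the sample values, and the choice of the "inclusive" quartile method are all assumptions made for the example.

```python
import statistics


def box_plot_summary(values):
    """Return the statistics a box plot typically displays (illustrative helper)."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.fmean(data),
        "iqr": q3 - q1,  # height of the box: spread of the middle 50% of the data
    }


# Hypothetical sample; 30 is a deliberate extreme value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this sample the extreme value pulls the mean above the median, while the quartiles that define the box stay put, which is the resistance to outliers described in the list.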
The Feature Selection tool, in combination with Principal Components Analysis, provides access to a process called Dimensionality Reduction, which reduces the number of extraneous input/independent variables in order to increase the accuracy of a model. XLMiner v2015 offers both branches of dimensionality reduction: Feature Selection and Principal Components Analysis. The latter is covered under Data Transformation below.
The former, Feature Selection, is used to discover the "best" subset of variables from all available variables to be used as input to a classification or prediction method. The primary purposes of the Feature Selection tool are to:
- Clean the data;
- Eliminate data redundancies; and,
- Identify the most relevant data in order to reduce its scale/dimensionality.
The end result is easier data exploration and visualization, and the possible conversion of infeasible analytic models into feasible ones. In XLMiner, the Feature Selection tool is used to define the "best" data subset within classification/prediction models, with the best subset being the one with the lowest residual error (e.g., the misclassification rate for a classification model). Feature Selection does this by using filter methods to rank variables according to one or more univariate measures and to select the top-ranked variables for data representation purposes. A top-ranked variable is one that most accurately predicts the output variable.
The filter method is a powerful technique because it is model-independent; that is, it can be widely applied as a pre-processing step for supervised learning algorithms.
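A filter method of this kind can be sketched as follows. The example scores each input variable against the output variable with a single univariate measure and keeps the top-ranked ones. The measure used here, the absolute Pearson correlation, is an assumption chosen for illustration; the text above does not say which measures XLMiner applies, and the names `pearson`, `rank_features`, and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(features, target, top_k):
    """Score every variable on its own, then keep the top_k highest-scoring names."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Hypothetical data: x1 tracks the target closely, x2 loosely, x3 is constant.
features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [3, 1, 4, 1, 5, 9],
    "x3": [7, 7, 7, 7, 7, 7],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]

selected, scores = rank_features(features, target, top_k=2)
print("selected:", selected)
```

Because each variable is scored independently of any model, the same ranking step can precede whichever supervised learning algorithm is applied afterwards.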
Data Transformation
On one side of the Dimensionality Reduction coin we have Feature Selection, described above, while on the flip side we have Principal Components Analysis (PCA). PCA reduces the number of redundant/unnecessary variables by creating a lower-dimensional representation of the original model data. This process is also referred to as feature extraction.
The PCA method accomplishes this by transforming a number of correlated variables into a smaller number of uncorrelated variables called principal components. The goal here is to reduce the dimensionality of the dataset without compromising the original data variability.
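The transformation can be illustrated with a minimal sketch that finds only the first principal component of two correlated variables. It computes the covariance matrix and uses power iteration to approximate its leading eigenvector. Both the power-iteration approach and the names `covariance_matrix` and `first_principal_component` are assumptions for this example, as is the sample data; a production implementation would normally rely on a numerical library.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along direction v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue


# Hypothetical dataset of two strongly correlated variables.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

cov, centered = covariance_matrix(rows)
component, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance / total_variance

# Project each observation onto the component: two variables become one.
reduced = [sum(c[j] * component[j] for j in range(len(component))) for c in centered]
print(f"explained variance ratio: {explained:.3f}")
```

The explained variance ratio indicates how much of the original variability survives when the two variables are replaced by the single principal component.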
Data Clustering
Data clustering is the process of grouping together a collection of objects based on their similarity to each other and relative dissimilarity compared to objects in other clusters. There are multiple algorithms available for clustering data objects; XLMiner offers two of the most popular approaches: k-means clustering and hierarchical clustering.
In k-means Clustering, data is divided into a set number of clusters (k), and each object is assigned to one of the k clusters so as to minimize dispersion within the clusters. The measurement used is typically the sum of squared distances of each object from the mean of its cluster; the mean acts as the prototype for each cluster. The problem is usually solved using a heuristic algorithm capable of quickly converging to a local optimum.
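One common heuristic of this kind alternates between assigning objects to the nearest mean and recomputing the means (often called Lloyd's algorithm). The sketch below is a minimal version of that idea, not necessarily the procedure XLMiner runs. Seeding the means with the first k points is a simplifying assumption, and the function name `kmeans` and the sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    # Assumption: seed the cluster means with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged (to a local optimum)
        assignments = new_assignments
        # Update step: move each mean to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
centroids, assignments = kmeans(points, k=2)
print("assignments:", assignments)
print("centroids:", centroids)
```

Because the result depends on the starting means, practical implementations often repeat the procedure from several different starting points and keep the run with the smallest within-cluster dispersion.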
In the other variety of data clustering, Hierarchical Clustering, a hierarchy of clusters is created using an agglomerative method. In this approach, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. In the agglomerative technique used by XLMiner, each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Hierarchical clustering is typically conveyed via a dendrogram, which illustrates the fusions made at each successive stage of analysis.
[Figure: example dendrogram]
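The agglomerative procedure can be sketched as a loop that repeatedly merges the two closest clusters and records each fusion, which is the information a dendrogram plots. The distance between clusters is measured here by single linkage (the closest pair of members); this is an assumption for illustration, since the text does not specify a linkage rule. The function name `agglomerative` and the sample points are hypothetical.

```python
import math


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = {i: [i] for i in range(len(points))}  # every point starts alone
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage (assumption): distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))  # one fusion, i.e. one join in the dendrogram
        next_id += 1
    return merges


# Hypothetical one-dimensional observations, written as 1-tuples.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerative(points)
for a, b, height in merges:
    print(f"merge cluster {a} with cluster {b} at distance {height}")
```

Each recorded tuple names the two clusters joined and the distance at which they were joined; n observations always produce n - 1 fusions, and the final fusion corresponds to the single cluster containing all objects.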
Resources
- Getting Data into XLMiner: How to import data into XLMiner via database, file folder, and/or Apache Spark.
- Text Mining Features: Introduction to text mining in XLMiner.
- Data Mining Features: Introduction to data mining in XLMiner using multiple methods (classification, prediction, and ensemble methods).