Data Preprocessing

When the source for the data has been specified (be it a database or a text file), it is possible to analyse and preprocess the data before it is used for learning.

The data can be analysed using the Data Analyser Tool. The functionality of the Data Analyser Tool becomes available when the Analyse button is pushed.

Three different kinds of built-in preprocessing are available in the Learning Wizard: exclusion, value replacement, and discretization.

These can be supplemented by additional external preprocess-plugins that the user provides.

While specifying the preprocesses, it is possible to preview the preprocessed data, and when the stream is ready for the next step it is also possible to save the resulting data stream to avoid later repetition of the preprocess definitions.

Exclusion

It is possible to define the exact set of variables to use for the learning. This is done in the "Variable exclusion" window:

The Learning Wizard reads the complete set of variables from the data source. This set of variables is presented to the user, and the user can check/uncheck the variables which are to be included/excluded from the set.

Furthermore, it is possible to exclude or include all of the variables by pushing either the "Exclude All" or "Include All" button.

Value Replacement

Choosing the "Value Replace" preprocessing from the "Preprocess Type" combo-box allows "on the fly" replacement of values in the data set.

The variable for which the value replacement is to be performed is selected in the "Variable" combo-box. This combo-box lists all variables that have not been excluded.

Once a variable has been selected, any value replacements that have been defined are shown in the "Value Replace" table.

New value replacements are added by writing the value to replace in the "From" field and the value to replace it with in the "To" field, and then pressing the "Add" button. Note that it is not possible to replace the same value with two different values. Neither is it possible to replace a value with a value that is itself being replaced.

When a replacement is selected in the "Value Replace" table, the "Remove" button will be enabled, and it is possible to remove the replacement.
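The two rules for adding replacements can be illustrated with a minimal sketch. The class ValueReplacements and its methods are hypothetical names chosen for illustration; they are not part of the Learning Wizard.

```python
class ValueReplacements:
    """Hypothetical stand-in for the "Value Replace" table of one variable."""

    def __init__(self):
        self._table = {}  # maps a "From" value to its "To" value

    def add(self, source, target):
        # Rule 1: the same value cannot be replaced with two different values.
        if source in self._table and self._table[source] != target:
            raise ValueError(f"{source!r} already has a different replacement")
        # Rule 2: a value cannot be replaced with a value that is being replaced.
        if target in self._table:
            raise ValueError(f"{target!r} is itself being replaced")
        self._table[source] = target

    def remove(self, source):
        # Corresponds to selecting a row and pressing "Remove".
        del self._table[source]

    def apply(self, values):
        # Values without a replacement pass through unchanged.
        return [self._table.get(value, value) for value in values]


replacements = ValueReplacements()
replacements.add("N/A", "unknown")
replacements.add("?", "unknown")
cleaned = replacements.apply(["N/A", "yes", "?"])
```

The sketch checks only the two rules stated in the text, at the moment a replacement is added. How the tool handles other orderings is not described here and is not modelled.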

Discretization

If a variable is continuous or has a large set of discrete values, it is possible to reduce the number of states in the resulting table by discretizing the data.

As can be seen from the picture, the process of defining the discretization is very similar to that of defining a value replacement:
The bottom and top values of an interval in the discretization are written in the "From" and "To" fields, and the "Add" button is pressed. This is done for each interval in the discretization.

The "Minimize Entropy" button discretizes the data using Information Entropy Minimization (U. M. Fayyad and K. B. Irani. 'Multi-Interval discretization of continuous valued attributes').
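As a rough illustration of the idea behind entropy minimization, the sketch below finds a single cut point that minimizes the weighted class entropy of the two resulting intervals. This is only the basic step that the cited method applies recursively; the stopping criterion is omitted, and the sketch is not the tool's implementation. The function names, and the assumption that each case carries a class label, are illustrative only.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the cut point minimizing the weighted class entropy of the two
    resulting intervals, or None if all values are equal."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best_cut, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut can separate two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * class_entropy(left)
                    + len(right) * class_entropy(right)) / len(pairs)
        if weighted < best_entropy:
            # Place the cut halfway between the two neighbouring values.
            best_cut, best_entropy = (low + high) / 2, weighted
    return best_cut


cut = best_cut_point([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

In this made-up example the classes separate cleanly between 3 and 10, so the chosen cut lies halfway between those two values.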

It is also possible to make an automatic specification of the intervals, if all the intervals are to be equally sized. To do this, the "Auto" button is pressed. This will bring up a window for specifying the lower and upper bounds, and the number of intervals for the discretization.

The resulting intervals will run from the lower bound to the upper bound, and each interval will have the size

( (upper bound - lower bound) / number of intervals )

Apart from specifying the bounds and the number of intervals, it is possible to extend each bound to infinity. If the bounds are extended, the intervals are still calculated in the same way, but once they are calculated, the lower value of the first interval is set to -Infinity (if the lower bound is extended), and the upper value of the last interval is set to Infinity (if the upper bound is extended).

For example, the intervals resulting from the depicted auto discretization will be:
[ 0 ; 20 )
[ 20 ; 40 )
[ 40 ; 60 )
[ 60 ; 80 )
[ 80 ; Infinity )
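The calculation of the automatic intervals can be sketched as follows. The function auto_intervals and its parameter names are hypothetical. The call at the end assumes a lower bound of 0, an upper bound of 100, five intervals, and an extended upper bound, which is consistent with the intervals listed above.

```python
import math


def auto_intervals(lower, upper, count, extend_lower=False, extend_upper=False):
    """Return `count` equally sized (low, high) intervals between the bounds."""
    size = (upper - lower) / count
    edges = [lower + i * size for i in range(count)] + [upper]
    intervals = [(edges[i], edges[i + 1]) for i in range(count)]
    # The bounds are extended only after the intervals have been calculated.
    if extend_lower:
        intervals[0] = (-math.inf, intervals[0][1])
    if extend_upper:
        intervals[-1] = (intervals[-1][0], math.inf)
    return intervals


intervals = auto_intervals(0, 100, 5, extend_upper=True)
```

Each pair is read as a half-open interval [ low ; high ), matching the notation used in the example.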

Note that the intervals need not be contiguous: gaps between intervals are allowed. However, every value in the data set must fall within some interval. Otherwise, an error is shown when the data is used.
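The requirement that every value falls within an interval can be checked in advance with a small sketch like the one below. The function names are hypothetical, and intervals are assumed to be half-open, [ low ; high ), as in the example above.

```python
import math


def find_interval(value, intervals):
    """Return the index of the half-open interval [low, high) that contains
    `value`, or None if no interval contains it."""
    for index, (low, high) in enumerate(intervals):
        if low <= value < high:
            return index
    return None


def uncovered_values(values, intervals):
    """Return the values that do not fall within any interval."""
    return [value for value in values if find_interval(value, intervals) is None]


# Intervals with gaps are allowed, but the data must avoid the gaps.
gapped = [(0, 20), (40, 60), (80, math.inf)]
problems = uncovered_values([5, 45, 30, 1000, -1], gapped)
```

Here the values 30 and -1 lie outside every interval, so data containing them would trigger the error described above.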

If a specified interval needs to be removed, it is selected in the "Discretization" table and the "Remove" button is pressed.

Loading and Saving Preprocess descriptions

Sometimes a great number of preprocesses must be specified to load a data file. If the data file must be loaded often (as would likely be the case when doing analysis), manually specifying the same preprocess definitions each time becomes tiresome.
Preprocess definitions can be re-used by saving them to a file and loading them from the file each time they are needed (please note that this functionality only applies to value-replace and discretize preprocesses).

Loading and saving preprocess definitions can be done using these buttons:

External Preprocess-Plugins

If the built-in preprocesses prove insufficient for a given task, the user can add external preprocess-plugins. These come in the form of Java classes, and all that is required to make them accessible is that the classes are placed in the plugins subdirectory under the Hugin installation directory.

Once they are placed there, they can be accessed by choosing "External Plug-Ins" from the preprocess combo-box. This shows the panel depicted below:

When one of the available preprocess-plugins is added, the plugin may show a dialog that helps specify the behaviour of the preprocess. Once this is done, a description of the preprocess is shown in the panel, from where it may be removed. For information on how to make your own preprocess-plugins, please consult the online manual, the source code of the sample preprocess-plugins, and the Javadoc in the plugins directory.

Saving the Data Stream

At any time during data preprocessing, the current preprocessing can be applied and the resulting data stream can be saved to a file. This preprocessed data stream can later be loaded into the Learning Wizard, avoiding repetition of the preprocessing step.

To save the data stream, simply push the "Save Data" button.

Previewing Data

Also, at any point in the preprocessing, it is possible to preview the data stream as it will appear after it has been preprocessed. This is done by pressing the "Preview Data" button.

Pressing the "Preview Data" button will bring up a view of the preprocessed data:

Due to the linear nature of data streams, it is only possible to move forwards in the preprocessed data. This is done by pressing the "More Data" button. This will show the next range of cases from the data stream. The position in the data source is displayed at the top of the preview window. The only way to move backwards in the data stream is to reset the data stream by pressing the "Reset" button.
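The forward-only behaviour of the preview can be modelled with a small sketch. The class PreviewStream, its methods, and the page size are hypothetical and only mirror the "More Data" and "Reset" buttons; they do not describe the tool's internals.

```python
import itertools


class PreviewStream:
    """Hypothetical forward-only view over a stream of cases."""

    def __init__(self, open_source, page_size=10):
        self._open_source = open_source  # callable returning a fresh iterator
        self._page_size = page_size
        self.reset()

    def reset(self):
        # The only way to go back is to reopen the source from the start.
        self._cases = self._open_source()
        self.position = 0

    def more_data(self):
        # Read the next range of cases; cases already read are not kept.
        page = list(itertools.islice(self._cases, self._page_size))
        self.position += len(page)
        return page


stream = PreviewStream(lambda: iter(range(25)), page_size=10)
first_page = stream.more_data()
second_page = stream.more_data()
stream.reset()
first_page_again = stream.more_data()
```

Because the stream is consumed as it is read, returning to earlier cases requires a reset followed by reading forwards again.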

Data Analyser Tool

The "Data Analyser Tool" allows the user to analyse the data items associated with each variable in the data as shown in the left part of the figure below.
The "Data Analyser Tool" shows the number of values, the number of distinct values, and the percentage of missing values.
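The three figures reported for a variable can be computed along the lines of the sketch below. The function summarize is hypothetical, and two points are assumptions: that missing values are represented by None or an empty string, and that the number of values counts only the values that are present.

```python
def summarize(values, missing=(None, "")):
    """Return the number of present values, the number of distinct present
    values, and the percentage of missing values for one variable."""
    present = [value for value in values if value not in missing]
    missing_count = len(values) - len(present)
    percent_missing = 100 * missing_count / len(values) if values else 0.0
    return {
        "values": len(present),
        "distinct": len(set(present)),
        "missing_percent": percent_missing,
    }


summary = summarize(["red", "blue", "red", None])
```

A variable with many distinct values relative to its number of values is a natural candidate for discretization.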

If the selected variable is continuous or has a large set of discrete values, it is possible to reduce the number of states in the resulting table by discretizing the data (as shown in the right part of the above figure). Several variables can be discretized at the same time by selecting the "Multiple selection" radio button. If the data contains numerical variables, the "Select all numerical" button is enabled, which selects all numerical variables. The selection can also be made in the usual way, with ctrl-left click or shift-left click. If an error occurs while discretizing, the affected variables are marked in red. The "(n)" indicates that the variable is numerical.

The "Discretize variable" functionality of the "Data Analyser Tool" supports two discretization methods: equi-distance and uniform distribution. The "Data Analyser Tool" is disabled when returning to the "Data Preprocessing" step. To enable it again, it is necessary to return to the "Data Acquisition" step and push "Next".
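The difference between the two discretization methods can be sketched as follows. This rests on an assumption about the terminology: equi-distance is taken to mean intervals of equal width, and uniform distribution is taken to mean intervals holding roughly the same number of cases. Both function names are hypothetical, and the tool's exact choice of cut points may differ.

```python
def equi_distance_cuts(values, count):
    """Cut points splitting the range of the data into `count` equal widths."""
    low, high = min(values), max(values)
    size = (high - low) / count
    return [low + k * size for k in range(1, count)]


def uniform_distribution_cuts(values, count):
    """Cut points chosen so that each interval holds roughly the same number
    of cases (assumed meaning of "uniform distribution")."""
    ordered = sorted(values)
    return [ordered[k * len(ordered) // count] for k in range(1, count)]


# Skewed sample data: one large value stretches the range.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
width_cuts = equi_distance_cuts(data, 3)
frequency_cuts = uniform_distribution_cuts(data, 3)
```

With skewed data such as this, equal-width intervals leave most cases in the first interval, whereas the equal-count cut points spread the cases evenly. When the data contains many repeated values, some equal-count cut points may coincide.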