The data can be analysed using the Data Analyser Tool. Its functionality becomes available when the "Analyse" button is pushed.
Three different kinds of built-in preprocessing are available in the Learning Wizard: variable selection, value replacement, and discretization.
These can be supplemented by additional external preprocess-plugins that the user provides. While specifying the preprocesses, it is possible to preview the preprocessed data. When the stream is ready for the next step, it is also possible to save the resulting data stream, so that the preprocess definitions do not have to be repeated later.
The Learning Wizard reads the complete set of variables from the data source. This set of variables is presented to the user, who can check or uncheck each variable to include it in or exclude it from the set.
Furthermore, it is possible to exclude or include all of the variables by pushing either the "Exclude All" or "Include All" button.
The variable for which the value replacement is to be performed is selected in the "Variable" combo-box. This combo-box lists all variables that have not been excluded.
Once a variable has been selected, any value replacements that have been defined are shown in the "Value Replace" table.
New value replacements are added by writing the value to replace in the "From" field and the value to replace it with in the "To" field, and then pressing the "Add" button. Note that it is not possible to replace the same value with two different values. Neither is it possible to replace a value with a value that is itself being replaced.
When a replacement is selected in the "Value Replace" table, the "Remove" button will be enabled, and it is possible to remove the replacement.
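The two restrictions on value replacements can be illustrated with a small sketch. It is not the Learning Wizard's implementation: the class `ValueReplacements` and its methods are hypothetical names chosen for this example, and the extra check that a new "From" value is not already used as a "To" value is an assumption made so that the second restriction holds regardless of the order in which replacements are added.

```python
class ValueReplacements:
    """Hypothetical stand-in for the "Value Replace" table of one variable."""

    def __init__(self):
        self._rules = {}  # maps a "From" value to its "To" value

    def add(self, from_value, to_value):
        # Restriction 1: the same value cannot be replaced with two different values.
        if from_value in self._rules:
            raise ValueError(f"{from_value!r} already has a replacement")
        # Restriction 2: a value cannot be replaced with a value that is being replaced.
        if to_value in self._rules:
            raise ValueError(f"{to_value!r} is itself being replaced")
        # Assumption: also reject the reverse order of restriction 2.
        if from_value in self._rules.values():
            raise ValueError(f"{from_value!r} is the target of another replacement")
        self._rules[from_value] = to_value

    def remove(self, from_value):
        # Corresponds to selecting a row and pressing "Remove".
        del self._rules[from_value]

    def apply(self, values):
        # Values without a replacement pass through unchanged.
        return [self._rules.get(v, v) for v in values]
```

With these checks, a single lookup per value is enough when applying the replacements, because no replacement can feed into another.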
As can be seen from the picture, the process of defining the discretization is very similar to that of defining a value replacement:
The bottom and top values of an interval in the discretization are written in the "From" and "To" fields, and the "Add" button is pressed. This is done for each interval in the discretization.
The "Minimize Entropy" button discretizes the data using Information Entropy Minimization (U. M. Fayyad and K. B. Irani. 'Multi-Interval discretization of continuous valued attributes').
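As a rough illustration of the idea behind entropy-based discretization, the sketch below finds a single cut point that minimizes the weighted class entropy of the two resulting intervals. It is a simplification under stated assumptions: Fayyad and Irani's method applies such splits recursively and uses a minimum description length criterion to decide when to stop, neither of which is shown here, and how the Learning Wizard chooses the class labels is not described in this text. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the cut that minimizes the weighted class entropy, or None.

    Candidate cuts are the midpoints between adjacent distinct sorted values.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a cut
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (low + high) / 2, score
    return best_cut
```

For example, `best_cut_point([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])` returns `6.5`, the only cut that separates the two classes completely.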
It is also possible to make an automatic specification of the intervals, if all the intervals are to be equally sized. To do this, the "Auto" button is pressed. This will bring up a window for specifying the lower and upper bounds, and the number of intervals for the discretization.
The resulting intervals will run from the lower bound to the upper bound, and each interval will have the size (upper bound - lower bound) / number of intervals.
Apart from specifying the bounds and the number of intervals, it is possible to extend each bound to infinity. If the bounds are extended, the intervals are still calculated in the same way, but once they are calculated, the lower value of the first interval is set to -Infinity (if the lower bound is extended), and the upper value of the last interval is set to Infinity (if the upper bound is extended).
For example, the intervals resulting from the depicted auto discretization will be:
[ 0 ; 20 )
[ 20 ; 40 )
[ 40 ; 60 )
[ 60 ; 80 )
[ 80 ; Infinity )
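The automatic specification can be sketched as follows. The sketch takes the interval size to be the range between the bounds divided by the number of intervals, which is consistent with the example above if the bounds are 0 and 100 with 5 intervals and the upper bound extended (the actual dialog settings are an assumption, since only the resulting intervals are listed). The function name `auto_intervals` and its parameters are hypothetical.

```python
import math


def auto_intervals(lower, upper, count, extend_lower=False, extend_upper=False):
    """Return `count` equally sized (from, to) pairs between lower and upper."""
    size = (upper - lower) / count
    edges = [lower + i * size for i in range(count + 1)]
    edges[-1] = upper  # keep the last edge exact despite floating point rounding
    # The bounds are extended only after the intervals have been calculated.
    if extend_lower:
        edges[0] = -math.inf
    if extend_upper:
        edges[-1] = math.inf
    return list(zip(edges[:-1], edges[1:]))


# Reproduces the intervals listed above, each read as [from ; to).
print(auto_intervals(0, 100, 5, extend_upper=True))
```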
Note that the intervals need not be contiguous; gaps between them are allowed. However, all values in the data set must be included in an interval. If this is not the case, an error will be shown when the data is used.
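The requirement that every value falls into some interval can be pictured with a short lookup sketch. It assumes half-open intervals, as suggested by the "[ from ; to )" notation in the example, and the function name `find_interval` is hypothetical. The error raised here only stands in for whatever error the tool reports.

```python
def find_interval(value, intervals):
    """Return the index of the [from ; to) interval containing value."""
    for index, (low, high) in enumerate(intervals):
        if low <= value < high:
            return index
    # No interval covers the value: the discretization is incomplete for this data.
    raise ValueError(f"value {value!r} is not covered by any interval")


# Intervals with a gap between 10 and 20 are allowed as long as no data falls there.
intervals = [(0, 10), (20, 30)]
print(find_interval(5, intervals))   # index of the first interval
print(find_interval(25, intervals))  # index of the second interval
```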
If a specified interval needs to be removed, it is selected in the "Discretization" table and the "Remove" button is pressed.
Sometimes a great number of preprocesses must be specified to load a data file. If the data file must be loaded often (as would likely be the case when doing analysis), manually specifying the same preprocess definitions each time would become tiresome.
Preprocess definitions can be re-used by saving the definitions to a file and loading them from the file each time they are needed (please note that this functionality only applies to value-replace and discretize preprocesses).
Loading and saving preprocess definitions can be done using these buttons:
External preprocess-plugins are placed in the "plugins" sub directory under the Hugin installation directory. Once they are placed there, they can be accessed by choosing "External Plug-Ins" from the preprocess combo box. This shows the panel depicted below:
When one of the available preprocess-plugins is added, the plugin may or may not show a dialog that helps specify the behaviour of the preprocess. Once this is done, a description of the preprocess is shown in the panel, from where it may be removed. For information on how to make your own preprocess-plugins, please consult the online manual, the source code of the sample preprocess-plugins, and the Javadoc in the plugins directory.
To save the data stream, simply push the "Save Data" button.
Pressing the "Preview Data" button will bring up a view of the preprocessed data:
Due to the linear nature of data streams, it is only possible to move forwards in the preprocessed data. This is done by pressing the "More Data" button. This will show the next range of cases from the data stream. The position in the data source is displayed at the top of the preview window. The only way to move backwards in the data stream is to reset the data stream by pressing the "Reset" button.
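The forward-only behaviour of the preview can be modelled with an ordinary iterator. The following is a minimal sketch, not the tool's implementation: `StreamPreview`, `more_data`, `reset`, and `page_size` are hypothetical names, and the data source is assumed to be something that can be reopened from the start.

```python
from itertools import islice


class StreamPreview:
    """Pages forward through a data stream; going back requires a reset."""

    def __init__(self, open_stream, page_size=10):
        self._open_stream = open_stream  # callable returning a fresh iterable of cases
        self._page_size = page_size
        self.reset()

    def reset(self):
        # Corresponds to "Reset": reopen the source and start from the first case.
        self._cases = iter(self._open_stream())
        self.position = 0

    def more_data(self):
        # Corresponds to "More Data": return the next range of cases, if any.
        page = list(islice(self._cases, self._page_size))
        self.position += len(page)
        return page
```

Because the iterator is consumed as it is read, there is no way to step back one page; the only option is to call `reset()` and page forward again.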
If the selected variable is continuous or has a large set of discrete values, it is possible to reduce the number of states in the resulting table by discretizing the data (as shown in the right part of the above figure). It is possible to discretize many variables at the same time by selecting the "Multiple selection" radio button. If there are numerical variables, the "Select all numerical" button is enabled, which allows the selection of all numerical variables. The selection can also be made in the classic way, with Ctrl+left-click or Shift+left-click. If an error occurs while discretizing, the variables with an error are marked in red. The "(n)" indicates that the variable is numerical.
The "Discretize variable" functionality of the "Data Analyser Tool" supports equi-distance and uniform distribution. The "Data Analyser Tool" is disabled when returning to the "Data Preprocessing" step. To enable the "Data Analyser Tool" it is necessary to return to the "Data Acquisition" step and push "Next".
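The text does not define "uniform distribution" discretization. One common reading, which is only an assumption here, is equal-frequency binning: cut points are chosen so that each interval receives roughly the same number of cases, in contrast to equi-distance intervals, which all have the same width. A minimal sketch of that reading, with the hypothetical function name `equal_frequency_cuts`:

```python
def equal_frequency_cuts(values, count):
    """Return up to count - 1 cut points splitting values into equally filled bins.

    Assumption: a value equal to a cut point belongs to the interval above it.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, count):
        cut = ordered[(k * n) // count]
        # Skip repeated cut points, which occur when many values are identical.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


# Eight evenly spread values split into four bins of two values each.
print(equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

With skewed data, these cut points crowd together where the values are dense, whereas equi-distance intervals would leave some intervals nearly empty.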