The user can select from four different algorithms for learning the structure of the model from data: PC, NPC, Chow-Liu tree, or Tree Augmented Naive Bayes (TAN). The first two algorithms (PC and NPC) are constraint-based approaches, whereas the latter two are both based on the Chow-Liu algorithm for learning a tree representation of the underlying joint distribution.
Note that, in this step of the Learning Wizard, only the structure of the network is determined. The conditional probability tables (CPTs) to be associated with the learned structure will be learned in a subsequent step.
Two pieces of information are needed in order to start the learning process using the PC or NPC algorithm:
The level of significance can be entered into this text field. The
default value is set to 0.05, which should be appropriate for most
learning sessions, but it can take on any value in the open interval
(0, 1). In general, the higher the significance level, the more links
will be included in the learned structure. Reducing the level of
significance will usually also reduce the running time of the
structure learning algorithms.
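To make the effect of the significance level concrete, here is a minimal Python sketch (not the tool's own code) of how a statistical test can gate link inclusion. The function name keep_link and the use of a marginal chi-square test are illustrative assumptions; the PC and NPC algorithms rely on conditional independence tests.

```python
# Minimal sketch: keep a link between two discrete variables when a
# chi-square test rejects independence at the chosen significance level.
import numpy as np
from scipy.stats import chi2_contingency

def keep_link(x, y, alpha=0.05):
    """Return True if independence is rejected at level alpha."""
    table = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(table, (x, y), 1)            # contingency table of counts
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha                 # higher alpha => more links kept

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
y = np.where(rng.random(1000) < 0.2, 1 - x, x)  # noisy copy of x
print(keep_link(x, y))                     # True: dependence is detected
```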
The PC algorithm performs the following steps:
1. Perform statistical tests for conditional independence between each pair of variables, producing a set of conditional independence and dependence statements (CIDs).
2. Construct the skeleton (the undirected graph) from the derived statements.
3. Identify colliders (v-structures).
4. Identify derived directions and orient the remaining links to obtain a directed acyclic graph.
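As an illustration of the first two steps, the following sketch starts from the complete graph and removes a link whenever some conditioning set renders the pair independent. The function ci_test is an assumed black-box conditional independence test (e.g., a conditional version of the chi-square test above); note that the actual PC algorithm restricts conditioning sets to neighbors of the tested pair, which this simplified version does not.

```python
# Simplified PC-style skeleton construction. Illustrative only.
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond_size=2):
    """Remove the link between each pair of variables that some
    conditioning set of size <= max_cond_size renders independent."""
    links = {frozenset(pair) for pair in combinations(variables, 2)}
    for size in range(max_cond_size + 1):
        for x, y in [tuple(link) for link in links.copy()]:
            others = [v for v in variables if v not in (x, y)]
            for cond in combinations(others, size):
                if ci_test(x, y, cond):    # independence accepted
                    links.discard(frozenset((x, y)))
                    break
    return links                           # undirected skeleton
```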
Traditional constraint-based learning algorithms produce provably correct structures under the assumptions of infinite data sets, perfect tests, and DAG faithfulness (i.e., that the data can be assumed to be simulated from a DAG). In the case of limited data sets, however, these algorithms often derive too many conditional independence statements. Also, they may in some cases leave out important dependence relations.
Generally, it is recommended to use the NPC algorithm, as the resulting graph will be a better map of the (conditional) independence relations represented in the data. In particular, when the data set is small, the NPC algorithm should be preferred. The NPC algorithm does, however, have a longer running time than the PC algorithm.
The NPC algorithm seeks to repair the deficiencies of the PC algorithm, which occur especially in the face of limited data sets. The solution provided by the NPC algorithm is based on the inclusion of a criterion known as the necessary path condition. This criterion forms the basis for introducing the notion of ambiguous regions, which in turn provides a language for selecting among sets of interdependent uncertain links. The resolution of ambiguous regions is performed in interaction with the user (see the next page of the Learning Wizard).
In constraint-based learning algorithms, the skeleton of the graph is constructed by not including a link in the induced graph whenever the corresponding nodes are found to be conditionally independent. There can, however, be inconsistencies among the set of conditional independence and dependence statements (CIDs) derived from a limited data set. That is, not all CIDs can be represented simultaneously. The inconsistencies are assumed to stem solely from sampling noise; i.e., we still assume that there exists a perfect independence map of the (unknown) probability distribution from which the data was generated.
The number of inconsistencies in the set of CIDs reflects the uncertainty of the model structure. Thus, the number of inconsistencies is a confidence measure for the learned structure and can as such be used as an indication of whether or not sufficient data has been used to perform the learning. The inconsistent CIDs produce multiple solutions when a directed acyclic graph (DAG) is induced from them. These solutions differ with respect to the set of links included.
To resolve the inconsistencies, the NPC algorithm relies on user interaction, where the user is given the opportunity to decide on the directionality of undirected links and to resolve the ambiguous regions.
The main goal is to obtain as few and as small ambiguous regions as possible. It should be noted that deterministic relations between variables will also produce ambiguous regions.
The user is offered the possibility of providing information as to how
the ambiguous regions should be resolved. The next page of the
Learning Wizard will provide an intuitive graphical interface for this
interaction.
A Chow-Liu tree is the best possible tree-shaped network approximation to a probability distribution such that all edges are directed away from the root. The quality of the approximation is measured using the Kullback-Leibler distance between the true distribution and the distribution defined by the Chow-Liu tree. When we learn from data, the true distribution is defined by the frequencies of the observations. Chow and Liu (1968) show that the optimum tree can be found as the maximum-weight spanning tree over all variables, where the weight of each edge is given as the mutual information between the variables connected by the edge.
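For reference, the mutual-information weight of the edge between two variables X and Y is the standard quantity

```latex
I(X;Y) \;=\; \sum_{x,\,y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}
```

where P denotes the empirical distribution given by the observation frequencies.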
A Chow-Liu tree can be constructed as follows:
1. Calculate the mutual information between each pair of variables.
2. Build a maximum-weight spanning tree using these mutual information values as edge weights.
3. Select a root variable and direct all edges away from it.
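The construction can be sketched compactly in Python for discrete data; variable and function names here are our own, and the data argument is assumed to be a 2-D integer array with one column per variable.

```python
# Chow-Liu sketch: pairwise mutual information + maximum-weight
# spanning tree (Prim's algorithm), edges directed away from the root.
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information between two discrete columns."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def chow_liu_tree(data, root=0):
    """Directed edges of the optimum tree over the columns of data."""
    n = data.shape[1]
    weight = {frozenset((i, j)): mutual_information(data[:, i], data[:, j])
              for i, j in combinations(range(n), 2)}
    in_tree, edges = {root}, []
    while len(in_tree) < n:
        u, v = max(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree),
                   key=lambda e: weight[frozenset(e)])
        edges.append((u, v))               # directed away from the root
        in_tree.add(v)
    return edges
```

Prim's algorithm is used here only for brevity; any maximum-weight spanning tree algorithm (e.g., Kruskal's) finds the same optimum.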
The TAN algorithm is useful for constructing classification networks where a specific node of the model is the target of the reasoning. The target node is used to construct a conditional Chow-Liu tree (i.e., a Chow-Liu tree over all nodes except the selected target) with the selected root as the root of the tree. The edge weights are defined as conditional mutual information (conditional on the target), and all nodes (except the target) have the target as an extra parent.
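The TAN construction can be sketched by replacing the edge weights in the Chow-Liu sketch above with conditional mutual information and adding the target as a parent of every feature. This reuses the mutual_information helper defined earlier; all names are illustrative, not the tool's API.

```python
# TAN sketch: conditional Chow-Liu tree over the features plus
# target -> feature links. Assumes mutual_information from above.
import numpy as np
from itertools import combinations

def conditional_mi(x, y, c):
    """I(X;Y|C) = sum over values c of P(C=c) * I(X;Y | C=c)."""
    return sum((c == v).mean() * mutual_information(x[c == v], y[c == v])
               for v in np.unique(c))

def tan_structure(data, target, root):
    """Directed edges of a TAN model; root must be a feature column."""
    feats = [j for j in range(data.shape[1]) if j != target]
    w = {frozenset((i, j)): conditional_mi(data[:, i], data[:, j],
                                           data[:, target])
         for i, j in combinations(feats, 2)}
    in_tree, edges = {root}, []
    while len(in_tree) < len(feats):
        u, v = max(((u, v) for u in in_tree for v in feats
                    if v not in in_tree),
                   key=lambda e: w[frozenset(e)])
        edges.append((u, v))               # tree edge, away from root
        in_tree.add(v)
    edges += [(target, f) for f in feats]  # target as extra parent
    return edges
```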