This is going to have to be a short post because I’m simply out of time for the day.
I’ve been doing some reading about information theory as the basis for using entropy and information gain as measures to choose the branches and leaves of a decision tree. I’ve also been reading other people’s implementations of decision tree algorithms, so I won’t be totally reinventing the wheel with this project. First off, a very good general explanation of decision trees can be found in the Rudjer Boskovic Institute’s Data Mining Server tutorial titled Decision Trees. The tutorial presents J. Ross Quinlan’s classic decision tree algorithm, ID3:
function ID3
    Input: (R: a set of non-target attributes,
            C: the target attribute,
            S: a training set) returns a decision tree;
begin
    If S is empty, return a single node with value Failure;
    If S consists of records all with the same value for the
        target attribute, return a single leaf node with that value;
    If R is empty, then return a single node with the value of the
        most frequent of the values of the target attribute that are
        found in records of S; [in that case there may be errors,
        examples that will be improperly classified];
    Let A be the attribute with largest Gain(A,S) among attributes in R;
    Let {aj | j=1,2,..,m} be the values of attribute A;
    Let {Sj | j=1,2,..,m} be the subsets of S consisting respectively
        of records with value aj for A;
    Return a tree with root labeled A and arcs labeled a1, a2,..,am
        going respectively to the trees
        (ID3(R-{A}, C, S1), ID3(R-{A}, C, S2), .., ID3(R-{A}, C, Sm));
    Recursively apply ID3 to subsets {Sj | j=1,2,..,m}
        until they are empty
end
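The heart of the algorithm is the step that picks the attribute A with the largest Gain(A,S). As a point of reference, here is a minimal sketch of how entropy and information gain could be computed, written as plain Java rather than Processing-specific code. The InfoGain class, the representation of records as maps from attribute name to value, and the method names are my own assumptions for illustration, not code from the tutorial or from my project.

import java.util.*;

// Sketch: entropy and information gain over a table of string-valued records.
// Assumes each record is a Map from attribute name to value, with the class
// label stored under the target attribute's name.
class InfoGain {

  // Shannon entropy of the target attribute over a set of records.
  static double entropy(List<Map<String, String>> records, String target) {
    Map<String, Integer> counts = new HashMap<>();
    for (Map<String, String> r : records) {
      counts.merge(r.get(target), 1, Integer::sum);
    }
    double h = 0.0;
    for (int c : counts.values()) {
      double p = (double) c / records.size();
      h -= p * Math.log(p) / Math.log(2);  // log base 2
    }
    return h;
  }

  // Information gain of splitting on an attribute: entropy(S) minus the
  // size-weighted entropy of the subsets produced by each value of the attribute.
  static double gain(List<Map<String, String>> records, String attribute, String target) {
    Map<String, List<Map<String, String>>> subsets = new HashMap<>();
    for (Map<String, String> r : records) {
      subsets.computeIfAbsent(r.get(attribute), k -> new ArrayList<>()).add(r);
    }
    double remainder = 0.0;
    for (List<Map<String, String>> subset : subsets.values()) {
      remainder += ((double) subset.size() / records.size()) * entropy(subset, target);
    }
    return entropy(records, target) - remainder;
  }
}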
I’m doing my best to implement something similar to this in Processing; we’ll see how it goes. So far I’m able to calculate the entropy and gain of a collection and all its attributes, but I haven’t yet written the code to branch on the attributes and outputs in order to recursively traverse the tree. If this doesn’t prove too difficult, I’d also like to add graphical output to show the structure of the tree, and possibly post-prune it, as the experts recommend, to remove over-fitted branches that reduce the decision tree’s ability to generalize to data outside the training set.
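For the branching and recursion step I still need to write, a rough sketch might look something like the following, reusing the gain() helper from the sketch above. The Node class and everything else here are again my own hypothetical names, and a real version would need more care around the Failure case and the eventual post-pruning.

import java.util.*;

// Hypothetical tree node: either a leaf holding a class value, or an internal
// node labeled with an attribute and one branch per attribute value.
class Node {
  String label;                              // attribute name (internal) or class value (leaf)
  Map<String, Node> branches = new HashMap<>();
  boolean isLeaf() { return branches.isEmpty(); }
}

class ID3Sketch {
  // Build a tree from records (maps of attribute -> value), a set of candidate
  // attributes R, and the name of the target attribute C.
  static Node id3(List<Map<String, String>> records, Set<String> attributes, String target) {
    Node node = new Node();

    // Empty training set: return a Failure leaf, as in the pseudocode.
    if (records.isEmpty()) { node.label = "Failure"; return node; }

    // Count target values; if they all agree, or no attributes remain,
    // return a leaf (with the single value, or the most frequent one).
    Map<String, Integer> counts = new HashMap<>();
    for (Map<String, String> r : records) counts.merge(r.get(target), 1, Integer::sum);
    String majority = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    if (counts.size() == 1 || attributes.isEmpty()) {
      node.label = majority;
      return node;
    }

    // Pick the attribute A with the largest information gain.
    String best = null;
    double bestGain = -1;
    for (String a : attributes) {
      double g = InfoGain.gain(records, a, target);
      if (g > bestGain) { bestGain = g; best = a; }
    }
    node.label = best;

    // Partition the records by their value of A and recurse on each subset,
    // with A removed from the candidate attributes.
    Map<String, List<Map<String, String>>> subsets = new HashMap<>();
    for (Map<String, String> r : records) {
      subsets.computeIfAbsent(r.get(best), k -> new ArrayList<>()).add(r);
    }
    Set<String> remaining = new HashSet<>(attributes);
    remaining.remove(best);
    for (Map.Entry<String, List<Map<String, String>>> e : subsets.entrySet()) {
      node.branches.put(e.getKey(), id3(e.getValue(), remaining, target));
    }
    return node;
  }
}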