More Decision Trees in Processing

This is going to have to be a short post because I’m simply out of time for the day.

I’ve been doing some reading about information theory as the basis for using entropy and information gain as the measures that choose the branches and leaves of a decision tree. I’ve also been reading other people’s implementations of decision tree algorithms, so I won’t be totally reinventing the wheel with this project. First off, a very good general explanation of decision trees can be found in the Rudjer Boskovic Institute’s Data Mining Server tutorial titled Decision Trees. The tutorial includes J. Ross Quinlan’s classic decision tree algorithm, ID3:

function ID3
Input:   (R: a set of non-target attributes,
          C: the target attribute,
          S: a training set) returns a decision tree;
begin    If S is empty, return a single node with
      value Failure;
   If S consists of records all with the same
      value for the target attribute,
      return a single leaf node with that value;
   If R is empty, then return a single node
      with the value of the most frequent of the
      values of the target attribute that are
      found in records of S; [in that case
      there may be errors, examples
      that will be improperly classified];
   Let A be the attribute with largest
      Gain(A,S) among attributes in R;
   Let {aj| j=1,2, .., m} be the values of
      attribute A;
   Let {Sj| j=1,2, .., m} be the subsets of
      S consisting respectively of records
      with value aj for A;
   Return a tree with root labeled A and arcs
      labeled a1, a2, .., am going respectively
      to the trees (ID3(R-{A}, C, S1), ID3(R-{A}, C, S2),
      ....., ID3(R-{A}, C, Sm));
   Recursively apply ID3 to subsets {Sj| j=1,2, .., m}
      until they are empty
end
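
The Gain(A,S) in the pseudocode is the information gain of attribute A over training set S: the entropy of S minus the weighted average entropy of the subsets you get by splitting S on A. For what it’s worth, here is a minimal sketch of those two calculations in Processing, assuming each record is a String[] whose last entry is the target attribute; the function names are just my own working names, not anything from the tutorial.

// Shannon entropy of the target column: H(S) = -sum(p * log2(p))
// over the distinct target values in the record list.
float entropy(ArrayList<String[]> records) {
  HashMap<String, Integer> counts = new HashMap<String, Integer>();
  for (String[] r : records) {
    String target = r[r.length - 1];
    counts.put(target, counts.containsKey(target) ? counts.get(target) + 1 : 1);
  }
  float h = 0;
  for (int c : counts.values()) {
    float p = c / (float) records.size();
    h -= p * log(p) / log(2);  // Processing's log() is natural log
  }
  return h;
}

// Gain(A,S) = H(S) - sum over each value v of attribute A of
// (|Sv| / |S|) * H(Sv), where Sv holds the records with value v for A.
float gain(ArrayList<String[]> records, int attribute) {
  HashMap<String, ArrayList<String[]>> subsets = partitionOn(records, attribute);
  float remainder = 0;
  for (ArrayList<String[]> s : subsets.values()) {
    remainder += (s.size() / (float) records.size()) * entropy(s);
  }
  return entropy(records) - remainder;
}

// Group records by their value for the given attribute column.
HashMap<String, ArrayList<String[]>> partitionOn(ArrayList<String[]> records, int attribute) {
  HashMap<String, ArrayList<String[]>> subsets = new HashMap<String, ArrayList<String[]>>();
  for (String[] r : records) {
    String v = r[attribute];
    if (!subsets.containsKey(v)) subsets.put(v, new ArrayList<String[]>());
    subsets.get(v).add(r);
  }
  return subsets;
}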

I’m doing my best to implement something similar to this in Processing; we’ll see how it goes. So far I am able to calculate the entropy and gain of a collection and all of its attributes, but I have not yet written the code to branch on the attributes and outputs in order to recursively build the tree. If this doesn’t prove too difficult, I’d also like to produce graphical output to show the structure of the tree, and possibly post-prune the tree, as experts recommend, to remove over-fitted branches that reduce the decision tree’s ability to generalize to data outside the training set.
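
I haven’t written that branching code yet, but for my own notes, here is a rough sketch of what the recursion might look like, reusing the gain() and partitionOn() functions above; the Node class and id3() function are hypothetical placeholders, not finished code.

// One node of the tree: internal nodes store the attribute they split
// on and one child per attribute value; leaves store a target value.
class Node {
  int attribute = -1;   // column index this node splits on, -1 for a leaf
  String label;         // target value if this node is a leaf
  HashMap<String, Node> children = new HashMap<String, Node>();
}

Node id3(ArrayList<String[]> records, ArrayList<Integer> attributes) {
  Node node = new Node();
  // Empty set: the pseudocode's single "Failure" node
  if (records.isEmpty()) { node.label = "Failure"; return node; }
  // Leaf if every record shares one target value
  String first = records.get(0)[records.get(0).length - 1];
  boolean pure = true;
  for (String[] r : records) {
    if (!r[r.length - 1].equals(first)) { pure = false; break; }
  }
  // Out of attributes: fall back to the most frequent target value
  if (pure || attributes.isEmpty()) {
    node.label = pure ? first : majorityTarget(records);
    return node;
  }
  // Split on the attribute with the largest information gain
  int best = attributes.get(0);
  for (int a : attributes) {
    if (gain(records, a) > gain(records, best)) best = a;
  }
  node.attribute = best;
  // Recurse on each subset with the chosen attribute removed
  ArrayList<Integer> remaining = new ArrayList<Integer>(attributes);
  remaining.remove(Integer.valueOf(best));
  HashMap<String, ArrayList<String[]>> subsets = partitionOn(records, best);
  for (String v : subsets.keySet()) {
    node.children.put(v, id3(subsets.get(v), remaining));
  }
  return node;
}

// Most frequent target value in the record list.
String majorityTarget(ArrayList<String[]> records) {
  HashMap<String, Integer> counts = new HashMap<String, Integer>();
  String best = null;
  for (String[] r : records) {
    String t = r[r.length - 1];
    int c = counts.containsKey(t) ? counts.get(t) + 1 : 1;
    counts.put(t, c);
    if (best == null || c > counts.get(best)) best = t;
  }
  return best;
}

Since every internal node would just map attribute values to subtrees, drawing the structure should come down to a recursive walk over children, which is where the graphical output would come in.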
