Formatting Data for the Decision Tree Generator

In my last post I explained a bit about what needs to be passed to the decision tree generator’s constructor function, but I didn’t go into detail about how to get the numbers into the formats specified. Luckily it’s very easy! I have included a function in the decision tree generator sketch which will take a properly formatted text file and automatically turn it into the column array of integers that you need, and the example sketch folder contains some such text files as examples. Essentially you will need one text file for the output array and one for each of your attributes. Each file should contain only integers from 0 up to one less than the number of levels that particular attribute or output can take on, with each number on its own line and no punctuation anywhere in the file. This may sound tedious to make, but Microsoft Excel, Matlab and many other free spreadsheet programs can save a column of numbers in just such a format by saving the file as a .txt or .csv. The text file just needs to be placed in the same folder as the Processing .pde file (the sketch itself) and it should be good to go. Now you should be ready to follow this example and make your decision tree! The example shows generically how to load the output and three attributes, then create their labels and variables. The variable and text file names can obviously be renamed to anything you may want.
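
For example, a hypothetical outputs.txt for a two-class output with six input/output pairs would contain nothing but a column like this:

  0
  1
  1
  0
  1
  0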

  //*****************
  // Load a text file with separate values (just an integer number) on each line
  // the text file is converted to an integer array which can be input to the decision tree generator

  int[] output = loadData("outputs.txt");
  int[] attribute0 = loadData("attribute0.txt");
  int[] attribute1 = loadData("attribute1.txt");
  int[] attribute2 = loadData("attribute2.txt");

  //******************
  // OUTPUT Information
  String[] output_labels = {"out_label1", "out_label2"};

  // INPUT #0 / Attribute 0
  int num_attribute0_levels = 3;   // 0 = level0, 1 = level1, 2 = level2
  String[] attribute0_labels = {"level0", "level1", "level2"};

  // INPUT #1 / Attribute 1
  int num_attribute1_levels = 2;   // 0 = false, 1 = true
  String[] attribute1_labels = {"false", "true"};

  // INPUT #2 / Attribute 2
  int num_attribute2_levels = 3;  // 0 = low, 1 = med, 2 = high
  String[] attribute2_labels = {"Low", "Med", "High"};

  // INPUT Information
  int[][] input_attributes = {attribute0, attribute1, attribute2};
  String[] attribute_labels = {"Attribute_0", "Attribute_1", "Attribute_2"};
  String[][] attribute_level_labels = {attribute0_labels, attribute1_labels, attribute2_labels};
  Decision_Tree tree;
  int leave_out = 2;
  int[] node = {1, 1};
  float screen_width = width;
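
With all of these variables defined, the last step is the constructor call explained in the next section. Here is a minimal sketch of that call, assuming a hypothetical tree name of "my_tree" (the Decision_Tree class is instantiated like any other Processing/Java class):

  //******************
  // Build the tree (see the next section for what each argument means)
  tree = new Decision_Tree("my_tree", output, output_labels, input_attributes,
                           attribute_labels, attribute_level_labels,
                           leave_out, node, screen_width);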

How to Use the Decision Tree Generator

For the longest time I have simply been showing outputs from my decision tree generator and telling you all the things it is capable of doing, but I haven’t given you access to it or shown you the code itself. For the most part this is because the code is ugly and, some may say, hacked together, but since I am not a professionally trained software engineer, I don’t care too much. I am a roboticist who is interested in writing software and designing systems that allow people (especially those with little experience) to enjoy robots and machine learning, more than in making optimally efficient software. My code is fully commented (maybe even overly so) in the hope that someone looking at it in the future won’t have trouble understanding what it does (or how to make it better). The code is hosted on my Bitbucket account and there is a link to it on my software page.

A brief explanation of how to use the generator should start with the fact that the decision tree generator is essentially a single Java class, and the only function the user is required to interact with is the constructor for that class. In object-oriented programming, the constructor is the function/method which initializes an object (an instance) of a class; in other words, the constructor builds you one of the things you want. All you have to do is pass the constructor some inputs. The decision tree generator constructor is relatively large and looks like this:

// CONSTRUCTOR //
  Decision_Tree(String tree_name, int[] outputs, String[] output_labels, int[][] attributes, String[] attribute_labels, String[][] attribute_level_labels, int leave_out, int[] initial_node, float screen_width)

Yes, there are a lot of inputs to this single function, but since it is the only one you need to use to generate the decision tree and Arduino code, it really isn’t all that bad. Plus, I’ll explain what each input is and how to make it.

(String tree_name) should be the name of the decision tree you are making, such as “play_tennis”. Make sure this name contains no spaces and no punctuation other than underscore (“_”) characters.

(int[] outputs) should be an integer array representing the output label/class for each input/output pair. This integer array should have an element for each input/output pair, and it should contain only numbers ranging from 0 to one less than the number of possible output classes. If your data is something like “no”/“yes” or “left”/“right”/“center”, simply replace each label with an integer so that no = 0 and yes = 1, or left = 0, right = 1 and center = 2. This array may look something like int[] outputs = {0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0};

(String[] output_labels) should be an array of String objects, where each element is the actual label for the corresponding output level/class. We just learned that the outputs are integers from 0 to one less than the number of output classes, so the number of labels should be equal to the number of classes. For example: String[] output_labels = {"no", "yes"}; or String[] output_labels = {"left", "right", "center"};

(int[][] attributes) might be a little more confusing because it is an array of integer arrays, but if you think about it as a matrix or a spreadsheet it shouldn’t be too difficult to understand. Each column of the matrix/spreadsheet of attributes represents a different attribute, for example: int[][] attributes = {outlook, humidity, temperature, wind}. Each one of these columns is just like the outputs array we already learned about. The column arrays for all the attributes must be the same length (have the same number of elements), and each one must only contain numbers from 0 to one less than the number of levels that specific attribute can take on. The length of all the columns must also match the outputs array, because the decision tree algorithm expects to see input/output pairs: the entries across all the attribute columns form the input, and the corresponding outputs entry is obviously the output.
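
To make the layout concrete, here is a hypothetical miniature example with just three input/output pairs and two attributes (the values are made up purely for illustration):

  int[] outputs  = {0, 1, 1};                // e.g. 0 = no, 1 = yes
  int[] outlook  = {2, 0, 1};                // e.g. 0 = overcast, 1 = rain, 2 = sunny
  int[] humidity = {1, 0, 0};                // e.g. 0 = normal, 1 = high
  int[][] attributes = {outlook, humidity};  // one column per attribute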

(String[][] attribute_level_labels) is very similar to the attributes matrix, except that the length of each column is simply equal to the number of different levels for that given attribute. For example: we could have String[] attribute1_labels = {"hot", "mild", "cold"} and String[] attribute2_labels = {"cloudy", "sunny"}. Then String[][] attribute_level_labels = {attribute1_labels, attribute2_labels}. We need a label for each possible value of a given attribute, and an array of labels for each of the attributes.

(int leave_out) is simply the number of input/output pairs passed to the constructor that we do not want to use to generate the decision tree. In general it is always a good idea to randomly partition your data into training and validation sets, and then use the percent accuracy on the validation set to determine how well the tree generated from the training set will generalize to new instances. If leave_out is greater than 0, then that many pairs at the end of the output and attribute columns will be ignored during tree generation. The validation data will be printed to the command line in a format ready for use on the Arduino for validation.
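
Since leave_out simply takes the pairs from the end of the columns, the random partitioning comes down to shuffling the rows before passing them in. Here is a minimal sketch of one way to do that in Processing (this helper is my own illustration, not part of the generator):

  // Fisher-Yates shuffle of the input/output pairs, keeping each row intact;
  // after shuffling, the last leave_out rows form a random validation set
  void shufflePairs(int[] outputs, int[][] attributes) {
    for (int i = outputs.length - 1; i > 0; i--) {
      int j = int(random(i + 1));  // random row index from 0 to i
      int tmp = outputs[i];        // swap the two output entries
      outputs[i] = outputs[j];
      outputs[j] = tmp;
      // swap the same two rows in every attribute column
      for (int a = 0; a < attributes.length; a++) {
        int t = attributes[a][i];
        attributes[a][i] = attributes[a][j];
        attributes[a][j] = t;
      }
    }
  }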

(int[] initial_node) is used to help generate the graphical tree and the auto-generated code. This should always be passed int[] initial_node = {1, 1}.

(float screen_width) is used to determine the size of the graphical output. Processing has a global variable called width which should be passed for this value.

I’ll go into more detail explaining exactly how to generate some of this data without having to manually enter it all by hand in my next post.

Validation of System to Date

Now comes the moment you’ve all been waiting for. The moment when I actually explain how I intend to test my system and verify that it works as I designed. All machine learning and data analysis systems which attempt classification must be vetted by measuring their percent accuracy at classifying some aptly named “test set” after being trained on some equally well named “training set”. By set we simply mean some vectors of input attributes with their associated output labels/classes/decisions. In general it is simple for a machine learning system to be good at classifying the data it was trained on (because it has seen those examples during the training phase), but it is more difficult (and more beneficial) for the system to have high accuracy at inferring the proper classification for inputs it has not seen before. Because decision trees specifically use labeled data for training (meaning they require both input vectors and output labels), simply partitioning the data set into a training portion and a test portion makes it possible to measure the accuracy of the system by comparing the system’s recommended decisions for the test set to the actual output labels.

To validate my system I have used two relatively small data sets so far: the popular “play tennis” data set from Tom Mitchell’s book “Machine Learning”, and the ubiquitous (in the world of machine learning) Fisher’s Iris Data Set. The play tennis data set consists of 16 days’ worth of data, where each day records the outlook (overcast, sunny, rainy), temperature (hot, mild, cool), humidity (high, normal), wind (weak, strong) and the resulting decision to play tennis (yes, no). This data was randomly ordered and two days were held back from training to be used for validation. This process of randomly mixing the days and then selecting two for validation occurred seven times. Each time, the 14 training days were used to build the decision tree and the two validation days were then run through the decision tree function on the Arduino three times each; each generated tree was used three times to see the effects of the random decisions at any undetermined nodes, giving 42 classifications in total. The result: 78.6% accurate classification.

The Iris Data Set was a bit different to implement, simply because its attributes (sepal length, sepal width, petal length and petal width) are all measured on a continuous scale. Because the decision tree program as currently written can only handle discrete attributes, I first had to essentially histogram the attributes into discrete levels/bins. I used a scatter plot of the binned data to somewhat arbitrarily choose eight bins, or eight levels, for each of the four attributes. In the future, modifications to the decision tree program to handle continuous data would be useful; for now, simply comparing percent accuracy on the test set can be used to verify the number of discrete levels chosen. With the discrete data I built ten decision trees, each using a random 90% of the data set (135 data vectors), and ran the remaining 10% (15 data vectors) through the implemented decision tree on the Arduino three times for each generated tree. The result: 94.4% accurate classification.

Following these tests I attempted to get a sense of how well the tree program could generalize when built with less training data. I built trees using random samples of the 150 data vectors of size 100, 75, 50 and 25 vectors, then ran each tree on the remainder of the 150 data vectors for validation. The results:

# of Vectors to Build Tree | # of Vectors Tested | % Correct
100                        | 50                  | 94.5%
75                         | 75                  | 93.1%
50                         | 100                 | 90.2%
25                         | 125                 | 80.9%

In the future I hope to test additional data sets with more data points and potentially adapt the decision tree to handle continuous data as well as discrete data. Below is the Arduino code used to implement the validation of one of the generated decision trees.

#include <play_tennis.h>

int output_tennis[]  = {0, 0};
int outlook[]  = {2, 2};
int temp[]  = {2, 2};
int humidity[]  = {0, 1};
int wind[]  = {1, 0};

void setup() {
  Serial.begin(9600);

  int correct_count = 0;
  int wrong_count = 0;
  for (int i = 0; i < 2; i++) {
    int decision = play_tennis(outlook[i], temp[i], humidity[i], wind[i], analogRead(0));
    if(decision == output_tennis[i]){
      correct_count++;
    }
    else{
      wrong_count++;
    }
    Serial.print("Decision = ");
    Serial.print(decision);
    Serial.print(" | Answer = ");
    Serial.println(output_tennis[i]);
  }
    Serial.print("\nCorrect = ");
    Serial.print(correct_count);
    Serial.print(" | Wrong = ");
    Serial.println(wrong_count);
}

void loop(){
}

Auto-generated Arduino Code and Randomness

I am aware that I haven’t posted anything new for a few days now, and that is simply because I’ve been making too much progress to slow down and write about what’s going on. All in all the auto-generated code portion is essentially complete. I have added some functionality to allow the generated code to also include comments which make the if/else branch structure nearly as easy to read and follow as the graphical tree generated in Processing. The idea is for future students to make trees by hand and compare their work to the generated code, to take the generated code and write about how and why it works, or to implement a task with an Arduino that requires a decision tree to make choices. The comments explain what the numbers in the if/else branches correspond to, what the decisions are, and why.

In addition to adding commenting to the auto-generated code, I decided to also add the random number generation that was occurring in Processing to the Arduino code. In general the decision tree attempts to traverse down a given branch, gaining information the whole time and reducing entropy to zero, at which point there is only one option left: the decision. However, it is possible for a decision branch to check all the attributes and still be unsure (not yet at zero entropy). If this occurs my decision tree randomly chooses a leaf node to place on the end of the branch (if there were no training examples on this branch) or else randomly chooses a weighted leaf node to place on the branch (if there were training examples on this branch). What this means is: if a branch has never occurred in training then the decision is made uniformly at random (because there is not enough data to make an inference), and it should be correct 1/n*100% of the time, where n is the total number of possible decisions. However, if there were examples on that branch then the random choice is weighted in favor of the decisions that occurred more often. For example, if in training the examples for a given non-zero entropy branch were a, a, b, then decision a should be chosen twice as often as decision b.
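
To make that concrete, here is a minimal sketch of the weighted choice (written in Processing syntax; the generated Arduino code follows the same pattern) for a branch whose training examples were a, a, b, encoded as decisions 0, 0, 1:

  int[] random_choices = {0, 0, 1};               // decision 0 appears twice, so it has twice the weight
  int decision = random_choices[int(random(3))];  // uniform index into the weighted list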

I have chosen to allow non-zero entropy random branches to exist in my decision tree implementation because, in general, some randomness can be desirable in mobile robots and other applications with noisy data that can be prone to local minima. For this reason I have chosen to move the random decision generation to the Arduino. If the random choice were made in Processing, then the same decision would always be made for that branch for as long as the tree is loaded on the Arduino. With the random number generation on the Arduino, it has the ability to make a different decision each time an undetermined condition occurs. Further testing will be necessary to determine if this design choice is valid. For now I will leave you with some of the generated code with comments and random number generation included.

/* play_tennis decision Tree file for Arduino	*/
/* Date: 20 Feb 2012					*/
/* Time: 14:50:24						*/

#ifndef play_tennis_H
#define play_tennis_H

#if defined(ARDUINO) && ARDUINO >= 100
#include "Arduino.h"
#else
#include "WProgram.h"
#endif

// 	Output (Decision):
// play_tennis	levels: 0 = No 1 = Yes
// 	Input (Attributes):
// Outlook	levels: 0 = Overcast 1 = Rain 2 = Sunny
// Temp	levels: 0 = Cool 1 = Mild 2 = Hot
// Humidity	levels: 0 = Normal 1 = High
// Wind	levels: 0 = Weak 1 = Strong

int play_tennis(int Outlook, int Temp, int Humidity, int Wind, int random_seed){
	randomSeed(random_seed); // seed the pseudo random number generator
	if(Outlook == 0){	// Outlook == Overcast?
		return 1;	// play_tennis = Yes
	}
	else if(Outlook == 1){	// Outlook == Rain?
		if(Wind == 0){	// Wind == Weak?
			return 1;	// play_tennis = Yes
		}
		else if(Wind == 1){	// Wind == Strong?
			return 0;	// play_tennis = No
		}
	}
	else if(Outlook == 2){	// Outlook == Sunny?
		if(Humidity == 0){	// Humidity == Normal?
			if(Temp == 0){	// Temp == Cool?
				return 1;	// play_tennis = Yes
			}
			else if(Temp == 1){	// Temp == Mild?
				return 1;	// play_tennis = Yes
			}
			else if(Temp == 2){	// Temp == Hot?
				if(Wind == 0){	// Wind == Weak?
					return random(2);	// play_tennis = random choice
				}
				else if(Wind == 1){	// Wind == Strong?
					int random_choices[] = {0, 1};
					return random_choices[random(2)]; // play_tennis = weighted random choice
				}
			}
		}
		else if(Humidity == 1){	// Humidity == High?
			return 0;	// play_tennis = No
		}
	}
}

#endif

Auto-generated Arduino Code: Day 1

So the moment I’ve been working toward for the past few weeks has arrived. I’ve finally started the process of having my decision tree program in Processing output generated code that can be run on the Arduino platform. In this first day I’ve made a huge amount of progress, going from nothing to having all the necessary code generated to create a function which can be run on the Arduino. What I’ve done is allow the decision tree program in Processing to create a .h header file with the name of the tree as the name of the generated function. The inputs to the function are integer variables whose names are the names of the attributes used to create the tree. The auto-generated code includes comments which give the name of the function and the date and time it was generated. Once the .h file is generated, all I have to do to use the decision tree function with the Arduino is place the generated “tree_name”.h file in the Arduino libraries folder and then add #include <“tree_name”.h> to my Arduino sketch. From there on, the decision tree function will simply take in integer inputs for the attributes and output the decision as an integer, which is the output level. I’ll be working to clean up all my code and examples and to run some data sets through the system for verification and to determine percent accuracy. For now I’ll leave you with some of the auto-generated code and the decision tree diagram that goes with it.

/* tennis decision Tree file for Arduino	*/
/* Date: 13 Feb 2012                        */
/* Time: 18:22:33                           */

#ifndef tennis_H
#define tennis_H

#if defined(ARDUINO) && ARDUINO >= 100
#include "Arduino.h"
#else
#include "WProgram.h"
#endif

int tennis(int Outlook, int Temp, int Humidity, int Wind){
	if(Outlook == 0){
		return 1;
	}
	else if(Outlook == 1){
		if(Wind == 0){
			return 1;
		}
		else if(Wind == 1){
			return 0;
		}
	}
	else if(Outlook == 2){
		if(Humidity == 0){
			if(Temp == 0){
				return 1;
			}
			else if(Temp == 1){
				return 1;
			}
			else if(Temp == 2){
				if(Wind == 0){
					return 0;
				}
				else if(Wind == 1){
					return 1;
				}
			}
		}
		else if(Humidity == 1){
			return 0;
		}
	}
}

#endif

[Image: Tennis decision tree diagram]

Basic Graphics for Decision Trees

As I’ve continued to work towards making my decision tree program capable of auto-generating Arduino code, I’ve realized that it would be helpful to get a visual sense of what the decision tree structure looks like. By producing a graphical representation of the decision tree I thought I would get a better understanding of how to generate the program control structure, and where in my program to generate the if/else branches. I have learned a couple of things: first, that it is difficult for a recursive structure to keep track of where in the structure it currently is, and second, that Processing is very adept at producing simple graphical output. In summary, I was able to use the pushMatrix() and popMatrix() functions, which are part of Processing, to save my current coordinate system on a stack before each recursive branch. This allows each branch to behave identically to the larger tree. I also used a very clever (at least I think it’s clever) method of storing my current node location in terms of the level number and the branch number. This allows the tree to make decisions about how to represent and space the branches at a given node, given the depth and the number of other nodes at that level. A stripped-down sketch of the idea follows, along with a couple of teaser images of the graphical output.
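
This sketch is illustrative only (it is not the actual generator code, and the node-spacing rule here is invented just for the demo), but it shows how pushMatrix()/popMatrix() let every subtree draw itself as if it were the whole tree:

  // Each recursive call saves the current coordinate system, draws its
  // children in local coordinates, then restores, so every subtree can
  // behave identically to the larger tree.
  void setup() {
    size(640, 360);
    background(0);
    stroke(255);
    fill(255);
    noLoop();
  }

  void draw() {
    translate(width / 2, 20);  // put the root near the top center
    drawBranch(3, 2);          // a depth-3 binary tree
  }

  void drawBranch(int depth, int numBranches) {
    ellipse(0, 0, 10, 10);     // draw this node at the local origin
    if (depth == 0) return;
    for (int b = 0; b < numBranches; b++) {
      // exponentially shrinking spacing so sibling subtrees don't collide
      float dx = (b - (numBranches - 1) / 2.0) * 30 * pow(2, depth - 1);
      pushMatrix();            // save this node's coordinate system
      translate(dx, 60);       // move into the child's coordinate system
      line(-dx, -60, 0, 0);    // edge from the parent down to this child
      drawBranch(depth - 1, numBranches);  // the child behaves like a full tree
      popMatrix();             // restore the parent's coordinate system
    }
  }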

[Images: decision tree graphic 1, decision tree graphic 2]

Now that I have some basic (very basic) graphical output to represent the tree structure, I should be ready to begin the process of using the graphical tree structure to create actual Arduino code.

Loading Data into Processing

For the past couple of days I have been doing my best to use data that has already been collected and analyzed with commercial software, like Matlab, to verify that my decision tree program is working. One common data set that is often used is Fisher’s Iris Data. This data set is useful because it has been analyzed with many different algorithms and methods over the years, and also because it is the example set used in the Matlab Statistics Toolbox Classification Tutorial. I will go further into how I used this data set to test my decision tree in an upcoming post. One problem with using a data set like Fisher’s Iris Data is that there are over a hundred data points for each of several factors; these take a long time to type into an array, as is required by my basic decision tree program. In Matlab a simple command like “load fisheriris” is all that is needed to load the whole data set into a collection of arrays. I wanted to find a way to do something similar with my program that would allow data to be collected and manipulated in Excel or Matlab or a simple text editor and then loaded directly into the program. There are surely multiple ways to handle this problem, but I’ve got a simple solution that uses a combination of the Processing function loadStrings() and a simple method to convert arrays of Strings to arrays of integers, as described on Manikandan’s Weblog. Enjoy the example:


// Each Processing Sketch requires a setup() and draw() function, this is the setup()
void setup()
{
  // Basic Processing setup codes
  size(640, 360);           // Create window of size 640 by 360 pixels
  background(0);            // Make window background color black
  stroke(255);              // Make stroke (line) color white
  noLoop();                 // Don't repeat (loop) the draw() function

  // Load a text file with separate values (just an integer number) on each line
  // into an array of Strings, with each value in its own String
  // 100.txt contains the numbers 1 to 100, each on its own line
  String[] dataString = loadStrings("100.txt");
  // Print how many lines the file contained
  println("there are " + dataString.length + " lines");
  for (int i=0; i < dataString.length; i++) {
    // Print each line or number String
    println(dataString[i]);
  }
  println();             // Add a blank line to the command line output

  // Use custom function convertStringArraytoIntArray to convert each
  // String in an array of Strings to an integer in an array of integers
  int[] dataIntArray = convertStringArraytoIntArray(dataString);
  // Print how many numbers are in the array
  println("there are " + dataIntArray.length + " numbers in array");
  for (int i=0; i < dataIntArray.length; i++) {
    // Print each number in the array
    println(dataIntArray[i]);
  }
  println();             // Add a blank line to the command line output
}

// Each Processing Sketch requires a setup() and draw() function
// We're not using the draw function, so we leave it blank
void draw()
{
}

// Custom Function that takes an array of Strings
// and returns an array of integers
public int[] convertStringArraytoIntArray(String[] sarray) {
  // If the array of Strings is not null then process it
  if (sarray != null) {
    // Create an int array with an element for each String in the String array
    int intarray[] = new int[sarray.length];
    for (int i = 0; i < sarray.length; i++) {
      // For each String in the String array, use parseInt to get an integer
      intarray[i] = Integer.parseInt(sarray[i]);
    }
    return intarray;      // Return the integer array
  }
  return null;            // If the String array was null, return null
}
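
These two steps can be wrapped into a single helper like the loadData() used in the first section above. Here is a minimal sketch of such a wrapper (the actual function in the generator sketch may differ slightly):

  // Read one integer per line from a text file and return it as an int array
  int[] loadData(String filename) {
    String[] lines = loadStrings(filename);      // one number String per line
    return convertStringArraytoIntArray(lines);  // convert Strings to ints
  }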