Demo Box v1.2

When I was preparing for my first set of interviews with Apple (January 2015) I decided to throw together a box combining various hardware/software projects I had worked on in the past. Putting together this box of “demos” was a way to remind myself of some of the projects I’ve worked on over the past few years. I also thought it might make a cool visual aid if the opportunity arose to show it off during an interview.

Just recently I dusted the thing off and decided to take some pictures and shoot the above video, so that I could document what is actually being demonstrated in this box.

Formatting Data for the Decision Tree Generator

In my last post I explained a bit about what needs to be passed to the decision tree generator’s constructor function, but I didn’t go into detail about how to get the numbers into the formats specified. Luckily it’s very easy! I have included a function in the decision tree generator sketch which will take a properly formatted text file and automatically turn it into the column array of integers that you need, and the example sketch folder includes some such text files as examples. Essentially you will need one text file for the output array and one for each of your attributes. Each file should contain only integers from 0 up to one less than the number of levels that particular attribute or output can take on, with each number on its own line and no punctuation anywhere in the file. This may sound tedious to make, but Microsoft Excel, MATLAB and many free spreadsheet programs can save a column of numbers in just such a format by saving the file as a .txt or .csv. The text file just needs to be placed in the same folder as the Processing .pde file (the sketch itself) and it should be good to go.

Now you should be ready to follow this example and make your decision tree! The example below shows generically how to load the output and three attributes, then create their labels and variables. The variable and text file names can obviously be renamed to anything you want.
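To make the format concrete, a hypothetical outputs.txt for a two-level output with six training examples would contain nothing but one number per line, like so:

0
1
1
0
1
0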

  //*****************
  // Load a text file with separate values (just an integer number) on each line
  // the text file is converted to an integer array which can be input to the decision tree generator

  int[] output = loadData("outputs.txt");
  int[] attribute0 = loadData("attribute0.txt");
  int[] attribute1 = loadData("attribute1.txt");
  int[] attribute2 = loadData("attribute2.txt");

  //******************
  // OUTPUT Information
  String[] output_labels = {"out_label1", "out_label2"};

  // INPUT #0 / Attribute 0
  int num_attribute0_levels = 3;   // 0 = level0, 1 = level1, 2 = level2
  String[] attribute0_labels = {"level0", "level1", "level2"};

  // INPUT #1 / Attribute 1
  int num_attribute1_levels = 2;   // 0 = false, 1 = true
  String[] attribute1_labels = {"false", "true"};

  // INPUT #2 / Attribute 2
  int num_attribute2_levels = 3;  // 0 = low, 1 = med, 2 = high
  String[] attribute2_labels = {"Low", "Med", "High"};

  // INPUT Information
  int[][] input_attributes = {attribute0, attribute1, attribute2};
  String[] attribute_labels = {"Attribute_0", "Attribute_1", "Attribute_2"};
  String[][] attribute_level_labels = {attribute0_labels, attribute1_labels, attribute2_labels};

  // The tree object and the remaining constructor inputs (described in the last post)
  Decision_Tree tree;
  int leave_out = 2;
  int[] node = {1, 1};
  float screen_width = width;  // width of the Processing sketch window
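For reference, loadData() can be as simple as a wrapper around Processing’s built-in loadStrings(). This is just a minimal sketch of the idea, not necessarily the exact function shipped with the generator sketch:

  int[] loadData(String filename) {
    String[] lines = loadStrings(filename);  // one integer per line
    int[] values = new int[lines.length];
    for (int i = 0; i < lines.length; i++) {
      values[i] = int(trim(lines[i]));       // convert each line to an int
    }
    return values;
  }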

Validation of System to Date

Now comes the moment you’ve all been waiting for: the moment when I actually explain how I intend to test my system and verify that it works as designed. All machine learning and data analysis systems that attempt classification must be vetted by measuring their percent accuracy at classifying some aptly named “test set” after being trained on some equally well named “training set”. By set we simply mean some vectors of input attributes with their associated output labels/classes/decisions. In general it is easy for a machine learning system to be good at classifying the data it was trained on (because it has seen those examples during the training phase), but it is more difficult (and more beneficial) for the system to have high accuracy at inferring the proper classification for inputs it has not seen before. Because decision trees use labeled data for training (meaning they require both input vectors and output labels), simply partitioning the data set into a training portion and a test portion makes it possible to measure the accuracy of the system by comparing the system’s recommended decisions for the test set against the actual output labels.

To validate my system I have used two relatively small data sets so far: the popular “play tennis” data set from Tom Mitchell’s book “Machine Learning”, and the (in the world of machine learning) ubiquitous Fisher’s Iris data set. The play tennis data set consists of 16 days’ worth of data, where each day records the outlook (overcast, sunny, rainy), temperature (hot, mild, cool), humidity (high, normal), wind (weak, strong) and the play tennis decision (yes, no). This data was randomly ordered and two days were held back from training to be used for validation. This process of randomly mixing the days and then selecting two for validation was repeated seven times. Each time, the 14 training days were used to build the decision tree, and then the two validation days were run through the decision tree function on the Arduino three times each; each generated tree was used three times to see the effects of the random decisions at undetermined nodes. The result: 78.6% accurate classification.

The Iris data set was a bit different to implement simply because its attributes (sepal length, sepal width, petal length and petal width) are all measured on a continuous scale. Because the decision tree program as currently written can only handle discrete attributes, I first had to essentially histogram each attribute into discrete levels/bins. I used a scatter plot of the data to somewhat arbitrarily choose eight bins, i.e. eight levels for each of the four attributes. In the future, modifications to the decision tree program to handle continuous data would be useful; for now, comparing percent accuracy on the test set can be used to sanity-check the number of discrete levels chosen. With the discretized data I built ten decision trees, each using a random 90% of the data set (135 data vectors), and ran the remaining 10% (15 data vectors) through the implemented decision tree on the Arduino three times per generated tree. The result: 94.4% accurate classification.
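As a rough sketch of that discretization step, a simple equal-width binning function in Processing could look like the following (equal-width bins are an assumption here; the actual eight levels were chosen somewhat arbitrarily from the scatter plot):

int binAttribute(float value, float minVal, float maxVal, int numBins) {
  int bin = int((value - minVal) / (maxVal - minVal) * numBins);
  return constrain(bin, 0, numBins - 1);  // clamp so the maximum value lands in the top bin
}

// e.g. a sepal length of 5.8 with an observed range of 4.3 to 7.9 and 8 bins:
// binAttribute(5.8, 4.3, 7.9, 8) returns level 3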

Following these tests I tried to get a sense of how well the tree program could generalize when built from less training data. I built trees using random samples of 100, 75, 50 and 25 of the 150 data vectors, then ran each tree on the remainder of the 150 vectors for validation. The results:

# of Vectors to Build Tree | # of Vectors Tested | % Correct
                       100 |                  50 |     94.5%
                        75 |                  75 |     93.1%
                        50 |                 100 |     90.2%
                        25 |                 125 |     80.9%

In the future I hope to test additional data sets with more data points and potentially adapt the decision tree to handle continuous data as well as discrete data. Below is the Arduino code used to implement the validation of one of the generated decision trees.

#include <play_tennis.h>

// Attribute levels and expected outputs for the two held-out validation days
int output_tennis[] = {0, 0};
int outlook[]  = {2, 2};
int temp[]     = {2, 2};
int humidity[] = {0, 1};
int wind[]     = {1, 0};

void setup() {
  Serial.begin(9600);

  int correct_count = 0;
  int wrong_count = 0;
  for (int i = 0; i < 2; i++) {
    // analogRead(0) supplies a varying seed for the tree's random choices
    int decision = play_tennis(outlook[i], temp[i], humidity[i], wind[i], analogRead(0));
    if (decision == output_tennis[i]) {
      correct_count++;
    }
    else {
      wrong_count++;
    }
    Serial.print("Decision = ");
    Serial.print(decision);
    Serial.print(" | Answer = ");
    Serial.println(output_tennis[i]);
  }
  Serial.print("\nCorrect = ");
  Serial.print(correct_count);
  Serial.print(" | Wrong = ");
  Serial.println(wrong_count);
}

void loop() {
}

Auto-generated Arduino Code and Randomness

I am aware that I haven’t posted anything new for a few days now, and that is simply because I’ve been making too much progress to slow down and write about what’s going on. All in all, the auto-generated code portion is essentially complete. I have added functionality that lets the generated code include comments which make the if/else branch structure nearly as easy to read and follow as the graphical tree generated in Processing. The idea is for future students to make trees by hand and compare their work to the generated code, to take the generated code and write about how and why it works, or to implement a task with an Arduino that requires a decision tree to make choices. The comments explain what the numbers in the if/else branches correspond to, what the decisions are, and why.

In addition to adding commenting to the auto-generated code, I decided to also move the random number generation that was occurring in Processing into the Arduino code. In general the decision tree traverses down a given branch, gaining information the whole time and reducing entropy to zero, at which point there is only one option left: the decision. However, it is possible for a decision branch to check all the attributes and still be unsure (not yet at zero entropy). If this occurs, my decision tree randomly chooses a leaf node to place on the end of the branch (if there were no training examples on this branch) or randomly chooses a weighted leaf node to place on the branch (if there were training examples on this branch). What this means is: if a branch never occurred in training, then the decision is made uniformly at random (because there is not enough data to make an inference), and it should be correct (1/n) * 100% of the time, where n is the total number of possible decisions. If there were examples on that branch, however, the random choice is weighted in favor of the outcomes that occurred more often. For example, if training showed outcomes a, a, b on a given non-zero-entropy branch, then decision a should be chosen twice as often as decision b, as sketched below.
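As a minimal, hypothetical illustration of that weighting (with a encoded as 0 and b as 1), a weighted leaf simply gives each observed training outcome one slot in the array:

int weighted_leaf() {
  int random_choices[] = {0, 0, 1};  // one slot per training example: a, a, b
  return random_choices[random(3)];  // random(3) draws 0..2 uniformly, so 0 comes up twice as often as 1
}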

I have chosen to allow non-zero entropy random branches to exist in my decision tree implementation because, in general, some randomness can be desirable in mobile robots and other applications with noisy data that can be prone to local minima. For this reason I have chosen to move the random decision generation to the Arduino. If the random choice were made in Processing, then for as long as the tree is loaded on the Arduino the same decision would always be made for that branch. With the random number generation on the Arduino, the tree can make a different choice each time an undetermined condition occurs. Further testing will be necessary to determine if this design choice is valid. For now I will leave you with some of the generated code with comments and random number generation included.

/* play_tennis decision Tree file for Arduino	*/
/* Date: 20 Feb 2012					*/
/* Time: 14:50:24						*/

#ifndef play_tennis_H
#define play_tennis_H

#if defined(ARDUINO) && ARDUINO >= 100
#include "Arduino.h"
#else
#include "WProgram.h"
#endif

// 	Output (Decision):
// play_tennis	levels: 0 = No 1 = Yes
// 	Input (Attributes):
// Outlook	levels: 0 = Overcast 1 = Rain 2 = Sunny
// Temp	levels: 0 = Cool 1 = Mild 2 = Hot
// Humidity	levels: 0 = Normal 1 = High
// Wind	levels: 0 = Weak 1 = Strong

int play_tennis(int Outlook, int Temp, int Humidity, int Wind, int random_seed){
	randomSeed(random_seed); // seed the pseudo random number generator
	if(Outlook == 0){	// Outlook == Overcast?
		return 1;	// play_tennis = Yes
	}
	else if(Outlook == 1){	// Outlook == Rain?
		if(Wind == 0){	// Wind == Weak?
			return 1;	// play_tennis = Yes
		}
		else if(Wind == 1){	// Wind == Strong?
			return 0;	// play_tennis = No
		}
	}
	else if(Outlook == 2){	// Outlook == Sunny?
		if(Humidity == 0){	// Humidity == Normal?
			if(Temp == 0){	// Temp == Cool?
				return 1;	// play_tennis = Yes
			}
			else if(Temp == 1){	// Temp == Mild?
				return 1;	// play_tennis = Yes
			}
			else if(Temp == 2){	// Temp == Hot?
				if(Wind == 0){	// Wind == Weak?
					return random(2);	// play_tennis = random choice
				}
				else if(Wind == 1){	// Wind == Strong?
					int random_choices[] = {0, 1};
					return random_choices[random(2)]; // play_tennis = weighted random choice
				}
			}
		}
		else if(Humidity == 1){	// Humidity == High?
			return 0;	// play_tennis = No
		}
	}
}

#endif

Auto-generated Arduino Code: Day 1

So the moment I’ve been working toward for the past few weeks has arrived: I’ve finally started the process of having my decision tree program in Processing output generated code that can be run on the Arduino platform. In this first day I’ve made a huge amount of progress, going from nothing to having all the necessary code generated to create a function which can be run on the Arduino. What I’ve done is allow the decision tree program in Processing to create a .h header file, with the name of the tree as the name of the generated function. The inputs to the function are integer variables whose names are the names of the attributes used to create the tree. The auto-generated code includes comments giving the name of the function and the date and time it was generated. Once the .h file is generated, all I have to do to use the decision tree function with the Arduino is place the generated “tree_name”.h file in the Arduino libraries folder and then add #include <“tree_name”.h> to my Arduino sketch. From there on, the decision tree function simply takes in integer inputs for the attributes and outputs the decision as an integer output level. I’ll be working to clean up all my code and examples and to run some data sets through the system for verification and to determine percent accuracy. For now I’ll leave you with some of the auto-generated code, a quick example of calling it, and the decision tree diagram that goes with it.

/* tennis decision Tree file for Arduino	*/
/* Date: 13 Feb 2012                        */
/* Time: 18:22:33                           */

#ifndef tennis_H
#define tennis_H

#if defined(ARDUINO) && ARDUINO >= 100
#include "Arduino.h"
#else
#include "WProgram.h"
#endif

int tennis(int Outlook, int Temp, int Humidity, int Wind){
	if(Outlook == 0){
		return 1;
	}
	else if(Outlook == 1){
		if(Wind == 0){
			return 1;
		}
		else if(Wind == 1){
			return 0;
		}
	}
	else if(Outlook == 2){
		if(Humidity == 0){
			if(Temp == 0){
				return 1;
			}
			else if(Temp == 1){
				return 1;
			}
			else if(Temp == 2){
				if(Wind == 0){
					return 0;
				}
				else if(Wind == 1){
					return 1;
				}
			}
		}
		else if(Humidity == 1){
			return 0;
		}
	}
}

#endif
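And here is a quick example of calling the generated function from a sketch. The level encodings are an assumption here, taken from the commented play_tennis version above (Outlook 2 = Sunny, Temp 1 = Mild, Humidity 1 = High, Wind 0 = Weak):

#include <tennis.h>  // generated header, placed in the Arduino libraries folder

void setup() {
  Serial.begin(9600);
  // Outlook = Sunny, Temp = Mild, Humidity = High, Wind = Weak (assumed encodings)
  int decision = tennis(2, 1, 1, 0);
  Serial.print("Decision = ");
  Serial.println(decision);  // this branch of the tree returns 0 (No)
}

void loop() {
}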

Tennis Decision Tree