Validation of System to Date

Now comes the moment you’ve all been waiting for: the moment when I actually explain how I intend to test my system and verify that it works as designed. Machine learning and data analysis systems that attempt classification are vetted by measuring their percent accuracy at classifying an aptly named “test set” after being trained on an equally well-named “training set”. By set we simply mean some vectors of input attributes with their associated output labels/classes/decisions. In general it is easy for a machine learning system to classify the data it was trained on (it has already seen those examples during the training phase), but it is harder (and more useful) for the system to accurately infer the proper classification for inputs it has never seen. Because decision trees train on labeled data (they require both input vectors and output labels), simply partitioning the data set into a training portion and a test portion makes it possible to measure accuracy: compare the system’s recommended decisions for the test set against the actual output labels.

To validate my system I have used two relatively small data sets so far: the popular “play tennis data set” from Tom Mitchell’s book “Machine Learning”, and the ubiquitous (in the world of machine learning) Fisher’s Iris Data Set. The play tennis data set consists of 16 days’ worth of data, where each day records the outlook (overcast, sunny, rainy), temperature (hot, mild, cool), humidity (high, normal), wind (weak, strong), and whether tennis was played (yes, no). This data was randomly ordered and two days were held back from training to be used for validation. This process of randomly mixing the days and then selecting two for validation was repeated seven times. Each time, the 14 training days were used to build the decision tree, and then the two validation days were each run through the decision tree function on the Arduino three times. Each generated tree was used three times to see the effects of the random decisions at some undetermined nodes. The result: 78.6% accurate classification.

The Iris Data Set was a bit different to implement simply because its attributes (sepal length, sepal width, petal length and petal width) are all measured on a continuous scale. Because the decision tree program as currently written can only handle discrete attributes, I first had to essentially histogram the attributes into discrete levels/bins. I used a scatter plot of the binned data to somewhat arbitrarily choose eight bins, or eight levels, for each of the four attributes. In the future, modifying the decision tree program to handle continuous data would be useful; for now, comparing percent accuracy on the test set serves as a sanity check on the number of discrete levels chosen. With the discrete data I built ten decision trees, each using a random 90% of the data set (135 data vectors), and ran the remaining 10% (15 data vectors) through the decision tree implemented on the Arduino, three times per generated tree. The result: 94.4% accurate classification.

Following these tests I wanted a sense of how well the tree program could generalize when built from less training data. I built trees using random samples of 100, 75, 50, and 25 of the 150 data vectors, then ran each tree on the remainder of the 150 data vectors for validation. The results:

# of Vectors to Build Tree | # of Vectors Tested | % Correct
           100             |          50         |   94.5%
            75             |          75         |   93.1%
            50             |         100         |   90.2%
            25             |         125         |   80.9%

In the future I hope to test additional data sets with more data points, and potentially adapt the decision tree to handle continuous data as well as discrete data. Below is the Arduino code used to validate one of the generated decision trees.

#include <play_tennis.h>  // generated decision tree function

// Attribute codes and expected outputs for the two held-out validation days.
int output_tennis[] = {0, 0};
int outlook[]       = {2, 2};
int temp[]          = {2, 2};
int humidity[]      = {0, 1};
int wind[]          = {1, 0};

void setup() {
  Serial.begin(9600);

  int correct_count = 0;
  int wrong_count = 0;
  for (int i = 0; i < 2; i++) {
    // analogRead(0) on a floating pin supplies the randomness used at
    // undetermined nodes in the generated tree.
    int decision = play_tennis(outlook[i], temp[i], humidity[i], wind[i], analogRead(0));
    if(decision == output_tennis[i]){
      correct_count++;
    }
    else{
      wrong_count++;
    }
    Serial.print("Decision = ");
    Serial.print(decision);
    Serial.print(" | Answer = ");
    Serial.println(output_tennis[i]);
  }
  Serial.print("\nCorrect = ");
  Serial.print(correct_count);
  Serial.print(" | Wrong = ");
  Serial.println(wrong_count);
}

void loop() {
  // Validation runs once in setup(); nothing to do here.
}
