Learning Decision Trees

Material:

The zip archive contains three data sets and Java code to learn a decision tree from those data sets. Your task is to check how well the learning algorithm performs on the given data sets.

Tasks:

  1. Change the existing code (changing the main method should suffice) so that it prints the data needed for a learning curve to standard output. That is, change the code so that it learns trees from increasing numbers of training examples and tests each of those trees on the test set.
  2. Plot the learning curves for all three data sets (e.g., using a spreadsheet application of your choice) and interpret them.
  3. Are all the learned trees consistent with the training data (consistent = all training examples are classified correctly)? If not, what could be the reason? (Hint: The consistency of a tree with its training set can be checked easily by adding a single line of code.)
  4. Look at the true functions for the three data sets in monk.names (Section 9 of the file). These are the functions that were used to generate the training and test data. Now that you know the function, design a good (small) decision tree for monks-1 by hand.
  5. Compare your hand-designed decision tree to the one that the algorithm learns from the whole training set and explain the differences.
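
The loop behind tasks 1 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the zip archive: the class LearningCurve, the Example type, and the methods accuracy, curve, and majorityLearner are all hypothetical names, and majorityLearner is a trivial stand-in that you would replace with a call to the provided tree learner. The sketch trains on growing prefixes of the training set and records both the training accuracy (a value below 1.0 means the model is not consistent with its training data) and the test accuracy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.Predicate;

public class LearningCurve {

    /** Hypothetical example type: attribute values plus a boolean class label. */
    public static final class Example {
        final int[] attributes;
        final boolean label;

        public Example(int[] attributes, boolean label) {
            this.attributes = attributes;
            this.label = label;
        }
    }

    /** Fraction of the given examples that the classifier labels correctly. */
    public static double accuracy(Predicate<int[]> classifier, List<Example> examples) {
        if (examples.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Example e : examples) {
            if (classifier.test(e.attributes) == e.label) {
                correct++;
            }
        }
        return (double) correct / examples.size();
    }

    /**
     * Trains on the first n examples for n = step, 2*step, ... (the last point
     * always uses the whole training set) and returns one row per model:
     * {n, accuracy on the n training examples, accuracy on the test set}.
     */
    public static List<double[]> curve(Function<List<Example>, Predicate<int[]>> learner,
                                       List<Example> train, List<Example> test, int step) {
        List<double[]> rows = new ArrayList<>();
        for (int n = step; n < train.size() + step; n += step) {
            int size = Math.min(n, train.size());
            List<Example> subset = train.subList(0, size);
            Predicate<int[]> model = learner.apply(subset);
            rows.add(new double[] {size, accuracy(model, subset), accuracy(model, test)});
        }
        return rows;
    }

    /** Stand-in learner (NOT a decision tree): always predicts the majority label. */
    public static Predicate<int[]> majorityLearner(List<Example> examples) {
        int positives = 0;
        for (Example e : examples) {
            if (e.label) {
                positives++;
            }
        }
        boolean majority = 2 * positives >= examples.size();
        return attributes -> majority;
    }

    public static void main(String[] args) {
        // Made-up toy data, only so that the sketch runs; use the real data sets instead.
        List<Example> train = Arrays.asList(
                new Example(new int[] {1, 1}, true),
                new Example(new int[] {2, 2}, true),
                new Example(new int[] {1, 2}, false),
                new Example(new int[] {3, 3}, true));
        List<Example> test = Arrays.asList(
                new Example(new int[] {2, 2}, true),
                new Example(new int[] {2, 3}, false));

        // One CSV line per model, ready to paste into a spreadsheet.
        System.out.println("n,train_accuracy,test_accuracy");
        for (double[] row : curve(LearningCurve::majorityLearner, train, test, 2)) {
            System.out.printf(Locale.ROOT, "%d,%.3f,%.3f%n", (int) row[0], row[1], row[2]);
        }
    }
}
```

Printing comma-separated values with a fixed locale keeps the output easy to import into a spreadsheet for task 2. Taking prefixes of the training set assumes the examples are in no particular order; shuffling them first may be a reasonable alternative.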

Hand in your code together with a PDF document containing:

  1. Learning curves for the three data sets.
  2. Interpretation of the learning curves.
  3. Answer to task 3.
  4. Decision tree for task 4.
  5. Interpretation of your findings for task 5.