Machine Learning with Java - Part 4 (Decision Tree)

In my previous articles, we looked at Linear Regression, Logistic Regression and Nearest Neighbor. This article focuses on Decision Tree classification and a sample use case.

Decision Tree

Decision trees are a classic supervised learning algorithm.

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance-event outcomes, resource costs, and utility. The decision tree algorithm can be used to solve both regression and classification problems.

The main goal of a decision tree is to achieve perfect classification with the minimum number of decisions, although this is not always possible because of inconsistencies in the data.

Sample Example

Suppose you are planning to go out for dinner because your friends are visiting, but you cannot decide which restaurant to choose. Whenever you want to dine out, you ask Bobby whether he thinks you will like a place or not. You give him a list of restaurants that you have visited and tell him whether you liked each one or not (a labelled training dataset). To answer your question, Bobby asks you a few questions, such as: Do you like rooftop seating? Does the restaurant serve Indian food? Is the restaurant open till midnight? Does the restaurant have live music? He asks several informative questions before replying whether you will like that restaurant or not. In this example, Bobby is a decision tree for finding your restaurant preferences.
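
To make the analogy concrete, Bobby's questioning can be sketched as nested conditions, where each question is a node and each final answer is a leaf. This is only an illustrative sketch: the class name `RestaurantTree`, the order of the questions and the outcome at each leaf are assumptions, not something learned from real data.

```java
// Hypothetical hand-written decision tree for the restaurant example.
// The four questions come from the text; their order and the leaf
// outcomes are assumptions made for illustration only.
final class RestaurantTree {

    static boolean willLike(boolean rooftopSeating, boolean servesIndianFood,
                            boolean openTillMidnight, boolean liveMusic) {
        if (servesIndianFood) {            // root question
            if (rooftopSeating) {
                return true;               // leaf: you will like it
            }
            return liveMusic;              // leaf decided by live music
        }
        // No Indian food: liked only if it is open late and has live music.
        return openTillMidnight && liveMusic;
    }

    public static void main(String[] args) {
        System.out.println(willLike(true, true, false, false));
        System.out.println(willLike(false, false, true, false));
    }
}
```

A learning algorithm such as ID3 or J48 builds this kind of structure automatically from the labelled examples instead of having it written by hand.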

Types of Decision Trees

Classification Trees
Regression Trees

Classification Trees

This is the default kind of decision tree, used to separate the dataset into different classes. The response variable is categorical (two categories or more).

Example: We have two variables, age and weight. Based on these, we determine whether a person will join a gym or not.

Regression Trees

It is used when the response variable is continuous or numerical. The relationship between the predictors and the response can be either linear or nonlinear.

Example: Predicting the price of a consumer good.
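
As a rough illustration of how a regression tree produces a number, the sketch below builds a single-split tree (a "stump") whose leaves predict the mean of the training targets that fall into them, which is a common convention for regression trees. The class name `RegressionStump`, the package-size feature, the prices and the fixed threshold are all hypothetical; a real learner would also search for the best threshold.

```java
// Minimal one-split regression tree. Each leaf predicts the mean of the
// training targets that reached it. Names and data are hypothetical.
final class RegressionStump {
    private final double threshold;
    private final double leftMean;   // prediction when x <= threshold
    private final double rightMean;  // prediction when x > threshold

    RegressionStump(double[] x, double[] y, double threshold) {
        this.threshold = threshold;
        double leftSum = 0, rightSum = 0;
        int leftCount = 0, rightCount = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] <= threshold) {
                leftSum += y[i];
                leftCount++;
            } else {
                rightSum += y[i];
                rightCount++;
            }
        }
        // An empty leaf falls back to 0.0 to avoid dividing by zero.
        this.leftMean = leftCount == 0 ? 0.0 : leftSum / leftCount;
        this.rightMean = rightCount == 0 ? 0.0 : rightSum / rightCount;
    }

    double predict(double x) {
        return x <= threshold ? leftMean : rightMean;
    }

    public static void main(String[] args) {
        double[] packageSizeGrams = {100, 200, 300, 400}; // made-up feature
        double[] price = {2.0, 4.0, 7.0, 9.0};            // made-up targets
        RegressionStump stump = new RegressionStump(packageSizeGrams, price, 250);
        System.out.println(stump.predict(150)); // mean of 2.0 and 4.0
        System.out.println(stump.predict(350)); // mean of 7.0 and 9.0
    }
}
```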

When to use Decision Trees?

A few scenarios where we can use the decision tree algorithm are:

Decision trees are suited to training data that contains errors, because they are robust to errors.
They can be used when the training data has missing values, because they can handle missing values by looking at the data in other columns.

Advantages

A few advantages of decision trees are:

Easy to explain.
The data type is not a constraint, as they can handle both categorical and numerical values.
Helpful in data exploration, as they implicitly perform feature selection, which is very helpful in predictive analysis.

Overfitting

Overfitting is a practical problem when building a decision tree model. The model overfits when the algorithm keeps going deeper and deeper to reduce the training set error while the test set error increases (the accuracy of prediction goes down).

It mainly happens because many branches are constructed due to irregularities in the data. Overfitting can be avoided using two approaches:

Pre-Pruning
Post-Pruning

Pre-Pruning

It stops the tree construction a bit early. A node is not split if its goodness measure is below a threshold value.
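
The threshold check can be sketched as follows. The text does not say which goodness measure is used, so this sketch assumes information gain (the reduction in entropy after a split) for a two-class problem; the class name `PrePruning` and the `minGain` parameter are hypothetical.

```java
// Pre-pruning sketch: split a node only if the information gain of the
// candidate split reaches a minimum threshold. Information gain is an
// assumed goodness measure; the source does not name one.
final class PrePruning {

    static double log2(double v) {
        return Math.log(v) / Math.log(2);
    }

    // Entropy (in bits) of a node with `pos` positive and `neg` negative examples.
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (pos == 0 || neg == 0) {
            return 0.0; // a pure (or empty) node has no uncertainty
        }
        double p = (double) pos / total;
        double q = (double) neg / total;
        return -(p * log2(p) + q * log2(q));
    }

    // Gain of splitting a non-empty node into a left and a right child.
    static double informationGain(int leftPos, int leftNeg, int rightPos, int rightNeg) {
        int left = leftPos + leftNeg;
        int right = rightPos + rightNeg;
        int total = left + right;
        double parent = entropy(leftPos + rightPos, leftNeg + rightNeg);
        double children = ((double) left / total) * entropy(leftPos, leftNeg)
                + ((double) right / total) * entropy(rightPos, rightNeg);
        return parent - children;
    }

    // The pre-pruning rule: do not split when the gain is below minGain.
    static boolean shouldSplit(int leftPos, int leftNeg, int rightPos, int rightNeg,
                               double minGain) {
        return informationGain(leftPos, leftNeg, rightPos, rightNeg) >= minGain;
    }

    public static void main(String[] args) {
        // A split that separates the classes perfectly is accepted.
        System.out.println(shouldSplit(5, 0, 0, 5, 0.1));
        // A split that leaves both children as mixed as the parent is rejected.
        System.out.println(shouldSplit(3, 3, 2, 2, 0.1));
    }
}
```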

Post-Pruning

It first goes deeper and deeper to build a complete tree.

If the tree shows the overfitting problem, pruning is done as a post-pruning step. We use cross-validation data to check the effect of our pruning: it tests whether expanding a node makes an improvement or not. If it shows an improvement, we can continue expanding the node; otherwise, it should not be expanded.
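
One way to sketch this idea is reduced-error pruning on a held-out validation set: working bottom-up, a subtree is collapsed into a leaf whenever the leaf classifies the validation rows reaching it at least as well as the expanded subtree does. This is an illustrative technique chosen for the sketch, not necessarily what J48 does internally; the `ReducedErrorPruning` and `Node` classes, the numeric features and the integer labels are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Post-pruning sketch (reduced-error pruning on validation data).
final class ReducedErrorPruning {

    // Binary tree node over numeric features; a node without children is a leaf.
    static final class Node {
        int feature;        // index of the feature tested at this node
        double threshold;   // go left when the value is <= threshold
        Node left, right;   // both null for a leaf
        int majorityLabel;  // majority training label seen at this node

        static Node leaf(int label) {
            Node n = new Node();
            n.majorityLabel = label;
            return n;
        }

        static Node split(int feature, double threshold, int majorityLabel,
                          Node left, Node right) {
            Node n = new Node();
            n.feature = feature;
            n.threshold = threshold;
            n.majorityLabel = majorityLabel;
            n.left = left;
            n.right = right;
            return n;
        }

        boolean isLeaf() {
            return left == null;
        }

        int predict(double[] x) {
            if (isLeaf()) {
                return majorityLabel;
            }
            return x[feature] <= threshold ? left.predict(x) : right.predict(x);
        }
    }

    // Prunes the tree in place using the validation rows and their labels.
    static void prune(Node root, double[][] rows, int[] labels) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            all.add(i);
        }
        prune(root, rows, labels, all);
    }

    private static void prune(Node node, double[][] rows, int[] labels, List<Integer> idx) {
        if (node.isLeaf()) {
            return;
        }
        // Route the validation rows that reach this node to its children.
        List<Integer> leftIdx = new ArrayList<>();
        List<Integer> rightIdx = new ArrayList<>();
        for (int i : idx) {
            if (rows[i][node.feature] <= node.threshold) {
                leftIdx.add(i);
            } else {
                rightIdx.add(i);
            }
        }
        prune(node.left, rows, labels, leftIdx);   // children first (bottom-up)
        prune(node.right, rows, labels, rightIdx);

        int subtreeHits = 0; // correct answers if the node stays expanded
        int leafHits = 0;    // correct answers if the node becomes a leaf
        for (int i : idx) {
            if (node.predict(rows[i]) == labels[i]) subtreeHits++;
            if (node.majorityLabel == labels[i]) leafHits++;
        }
        if (leafHits >= subtreeHits) { // ties favour the simpler tree
            node.left = null;
            node.right = null;
        }
    }
}
```

In this sketch a node that no validation row reaches is also collapsed, since expanding it shows no measurable improvement.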

Decision Tree Demo

Id3 and J48 Classifier in Decision Tree

package com.gg.ml;
import java.io.File;
import java.io.IOException;
import weka.classifiers.trees.Id3;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
/**
 * @author Gowtham Girithar Srirangasamy
 *
 */
public class DecisionTreeDemo {
    /** Names of the training and testing data set files. */
    public static final String TRAINING_DATA_SET_FILENAME = "decision-train.arff";
    public static final String TESTING_DATA_SET_FILENAME = "decision-test.arff";

    /**
     * Loads a data set from an ARFF file on the classpath.
     *
     * @param fileName name of the ARFF resource to load
     * @return the loaded instances, with the class index set
     * @throws IOException if the data cannot be read
     */
    public static Instances getDataSet(String fileName) throws IOException {
        // Index of the class attribute; set it based on the data in the ARFF files.
        int classIdx = 1;
        // The ArffLoader reads the ARFF file.
        ArffLoader loader = new ArffLoader();
        // Load the data set from the classpath.
        loader.setSource(DecisionTreeDemo.class.getResourceAsStream("/" + fileName));
        // Alternatively, the data can be loaded directly from a file:
        // loader.setFile(new File(fileName));
        Instances dataSet = loader.getDataSet();
        dataSet.setClassIndex(classIdx);
        return dataSet;
    }
    /**

Code Explanation

In the above code, we have used both the Id3 and J48 algorithms. ID3 can be used when we need a faster, simpler result without considering the additional factors that J48 considers. J48 handles missing values, has more robust splitting and has routines for pruning the tree structure. In short, it is an industrial-strength decision tree learner.

Real-World Applications of Decision Trees

A few real-world applications are:

Finance, for option pricing
Pattern recognition based on decision trees
Banks, to classify loan applicants
A popular baby product company used a decision tree machine learning algorithm to decide whether they should continue using PVC plastic in their products.
Identifying at-risk patients and disease trends
Selecting the category of a question paper based on expert level
Deciding whether to accept or reject an employment offer
