Machine Learning with Java - Part 6 (Random Forest)

Gowgi

In my previous articles, we discussed Linear Regression, Logistic Regression, Nearest Neighbor, Decision Tree and Naive Bayes. In this article, we are going to discuss one of the most important classification algorithms, the Random Forest algorithm.

Random Forest

Random Forests is a trademarked term for an ensemble classifier (a learning algorithm that constructs a set of classifiers and then classifies new data points by taking a (weighted) vote of their predictions). It consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. A random forest is a collection of trees, all slightly different. It belongs to supervised learning.
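To make the voting idea concrete, here is a minimal sketch in plain Java (no Weka). The class name MajorityVoteSketch, the method majorityVote and the place labels are made up for illustration. It only shows how the mode of the tree outputs can be computed, not how any particular library does it internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: combines per-tree class predictions by majority vote. */
final class MajorityVoteSketch {

    /** Returns the most frequent label; on a tie, the label seen first wins. */
    static String majorityVote(String... treePredictions) {
        if (treePredictions.length == 0) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        // Count how many trees voted for each label, keeping first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : treePredictions) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of five trees for one new data point.
        System.out.println(majorityVote("place1", "place2", "place1", "place3", "place1"));
        // prints: place1
    }
}
```

A weighted vote would work the same way, except that each tree would add its weight instead of 1 to the count.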

The Random Forest algorithm can be used for both classification and regression problems. You might wonder how a single algorithm can handle both kinds of problems. For classification, the forest returns the class that most trees vote for; for regression, it returns the average of the values predicted by the trees.
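For regression, the usual way to combine the trees is to average their numeric outputs instead of counting votes. A small sketch, assuming each tree has already produced a number; ForestAverageSketch and averagePrediction are hypothetical names and the values are invented.

```java
/** Illustrative only: combines per-tree numeric predictions by averaging. */
final class ForestAverageSketch {

    /** Returns the arithmetic mean of the values predicted by the trees. */
    static double averagePrediction(double... treePredictions) {
        if (treePredictions.length == 0) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        double sum = 0.0;
        for (double value : treePredictions) {
            sum += value;
        }
        return sum / treePredictions.length;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of three regression trees for one data point.
        System.out.println(averagePrediction(10.0, 20.0, 30.0));
        // prints: 20.0
    }
}
```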

Difference with Decision Tree

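In Random Forest, we create many decision trees, and randomness is added to the construction of each tree. Each tree is typically trained on a random sample of the training data, and at each node only a random subset of the features is considered for the split. Within that subset, the split is still chosen with a measure such as information gain or the Gini index.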
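The randomness in a typical random forest comes from two places: each tree sees a random sample of the rows (drawn with replacement), and each split looks at only a random subset of the features. The sketch below shows both steps with the Java standard library. The names RandomizationSketch, bootstrapIndices and randomFeatureSubset are assumptions made for this example, and a real library may implement these steps differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: the two random choices made while growing a tree. */
final class RandomizationSketch {

    /** Draws numRows row indices with replacement (a bootstrap sample). */
    static int[] bootstrapIndices(int numRows, Random rng) {
        if (numRows <= 0) {
            throw new IllegalArgumentException("numRows must be positive");
        }
        int[] sample = new int[numRows];
        for (int i = 0; i < numRows; i++) {
            sample[i] = rng.nextInt(numRows); // the same row may be drawn twice
        }
        return sample;
    }

    /** Picks k distinct feature indices out of numFeatures. */
    static List<Integer> randomFeatureSubset(int numFeatures, int k, Random rng) {
        if (k < 0 || k > numFeatures) {
            throw new IllegalArgumentException("k must be between 0 and numFeatures");
        }
        List<Integer> all = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) {
            all.add(f);
        }
        Collections.shuffle(all, rng);
        return new ArrayList<>(all.subList(0, k));
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        System.out.println("rows for one tree: " + Arrays.toString(bootstrapIndices(8, rng)));
        System.out.println("features for one split: " + randomFeatureSubset(3, 2, rng));
    }
}
```

Because every tree gets a different sample and different candidate features, the trees end up slightly different from one another, which is what makes their combined vote useful.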

Suppose you are planning a trip and decide to ask a friend to suggest a place. Your friend will ask you some questions to work out which places you would like and which ones would not interest you, based on your answers. In this case, a decision tree is used to find the place you will like.

This is a decision tree: your friend used your answers to predict your likes, and the final decision is made by a single person using only one decision tree.

But you don't want suggestions only from your close friend, so you decide to ask all your friends. Each friend asks different, random questions to predict your likes. In this case, a random forest is used: the place that gets the most votes is chosen. Many friends are involved and each has asked different questions, i.e. many trees are involved and the final decision is based on the number of votes. So, it is a random forest.

I hope the difference between a decision tree and a random forest is clear from the above example.

Why Random Forest

A few advantages of the random forest algorithm are listed below:

  1. It can be used for both classification and regression problems.

  2. It can handle missing values more easily.

  3. Overfitting (fitting the noise in the training data) is reduced by combining many trees, which usually makes the model more accurate than a single tree.

  4. It runs efficiently on large data sets.

Case Study

Let us consider that we need to find people's favorite places based on a few factors.

In this scenario, we want to find people's favorite tourist places (Places 1, 2 and 3 are the three given places). Based on age, gender and residence, we pick the place with the highest percentage as the one people will like. Another way to calculate this is to use the number of likes instead of percentages.

Age: Here the votes are grouped by age and given as percentages. [Chart: votes by age]

Gender: The votes grouped by gender, as percentages, are given below. [Chart: votes by gender]

Residence: Residence is used as one of the factors that determine people's favorite spot. [Chart: votes by residence]

Final Chart: Based on the above sets, we will find the favorite spot for a man whose age is 30 and who lives in a metro. [Chart: final result] The final chart shows that he likes place 1 with 70%, place 2 with 20% and possibly place 3 with 6%.
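One plausible way to combine the percentage votes of several factors is to average them per place and pick the highest. This is only a sketch under that assumption: the article does not say exactly how the final chart was computed, the numbers below are invented and do not come from the charts, and PercentageVoteSketch with its methods is a hypothetical name.

```java
/** Illustrative only: averages per-factor percentages and picks the best place. */
final class PercentageVoteSketch {

    /** Averages, for every place, the percentage given by each factor. */
    static double[] averagePercentages(double[][] percentagesByFactor) {
        if (percentagesByFactor.length == 0) {
            throw new IllegalArgumentException("need at least one factor");
        }
        int numPlaces = percentagesByFactor[0].length;
        double[] average = new double[numPlaces];
        for (double[] factor : percentagesByFactor) {
            if (factor.length != numPlaces) {
                throw new IllegalArgumentException("every factor must rate every place");
            }
            for (int p = 0; p < numPlaces; p++) {
                average[p] += factor[p];
            }
        }
        for (int p = 0; p < numPlaces; p++) {
            average[p] /= percentagesByFactor.length;
        }
        return average;
    }

    /** Returns the index of the largest value; the first one wins on a tie. */
    static int indexOfMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical percentages for places 1..3 from two factors.
        double[][] votes = {
            {50.0, 30.0, 20.0}, // e.g. the person's age group
            {70.0, 20.0, 10.0}  // e.g. the person's residence type
        };
        double[] combined = averagePercentages(votes);
        int best = indexOfMax(combined);
        System.out.println("Place " + (best + 1) + " with " + combined[best] + "%");
        // prints: Place 1 with 60.0%
    }
}
```

Counting likes instead of percentages would only change the input numbers; the averaging and the final pick stay the same.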

Random Forest Demo

package com.gg.ml;

import java.io.File;
import java.io.IOException;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

/**
 * @author Gowtham Girithar Srirangasamy
 */
public class RandomForestDemo {

    /** File names of the training and testing data sets. */
    public static final String TRAINING_DATA_SET_FILENAME = "decision-train.arff";
    public static final String TESTING_DATA_SET_FILENAME = "decision-test.arff";

    /**
     * Loads a data set from an ARFF file on the classpath.
     *
     * @param fileName name of the ARFF resource
     * @return the loaded instances with the class index set
     * @throws IOException if the data cannot be read
     */
    public static Instances getDataSet(String fileName) throws IOException {
        // Index of the class attribute; set it based on the data in the ARFF files.
        int classIdx = 1;

        // The ArffLoader reads the ARFF file.
        ArffLoader loader = new ArffLoader();

        // Load the data from the classpath.
        loader.setSource(RandomForestDemo.class.getResourceAsStream("/" + fileName));

        // We can also load from the file system instead:
        // loader.setFile(new File(fileName));

        Instances dataSet = loader.getDataSet();
        dataSet.setClassIndex(classIdx);
        return dataSet;
    }

    // ... (the rest of the listing is cut off here)
}

Code Explanation

In the above sample code, we have used the Random Forest classifier of Weka. We can also set the number of trees. Please don't rely on the data used here when executing the class files, because a model can learn and predict well only with large data sets.

Note: The example case study data and the demo file data are different. In my upcoming articles, I will explain information gain and the Gini index in detail.
