Machine Learning with Java - Part 2 (Logistic Regression)

Gowgi

Regression analysis is a predictive modelling technique used to investigate the relationship between a dependent variable and one or more independent variables. It is an important tool for modelling and analyzing data. Given the data points, we fit a curve or a line so that the distances between the data points and the curve or line are minimal.
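
To make the idea of a "minimal distance" fit concrete, here is a minimal sketch that fits a straight line by ordinary least squares. The choice of least squares as the distance criterion, the class name LeastSquaresSketch and the data points are all assumptions made for illustration; the article itself does not prescribe them.

```java
/** Minimal sketch: fit y = intercept + slope * x by ordinary least squares. */
final class LeastSquaresSketch {

    /** Returns {intercept, slope} minimising the sum of squared vertical distances. */
    static double[] fitLine(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covariance = 0.0;
        double varianceX = 0.0;
        for (int i = 0; i < x.length; i++) {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            varianceX += (x[i] - meanX) * (x[i] - meanX);
        }
        if (varianceX == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = covariance / varianceX;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Hypothetical data points that lie exactly on y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fitLine(x, y);
        System.out.println("intercept=" + line[0] + ", slope=" + line[1]);
    }
}
```

With noisy data the fitted line no longer passes through every point, but it still minimises the total squared distance to them.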

In my last article, I explained linear regression with sample data points. This article focuses on logistic regression and its types, with simple examples.

Logistic Regression

Logistic regression is used when there are one or more independent variables that determine an outcome. It requires large sample sizes because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares.

The difference from linear regression is that a linear regression output is continuous, whereas a logistic regression output is limited to a finite number of possible values.

Example: Logistic regression can be used to determine whether or not we can play, based on weather data.
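
One way to picture this difference is the logistic (sigmoid) function, which squashes any continuous score into a probability between 0 and 1 that can then be mapped to a label. The following is a minimal sketch; the class name SigmoidSketch, the scores and the 0.5 threshold are assumptions chosen for illustration.

```java
/** Minimal sketch: the logistic (sigmoid) function maps any real score into (0, 1). */
final class SigmoidSketch {

    /** Maps a real-valued linear score z to a probability in (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Hypothetical linear scores; a linear regression would report these as-is.
        double[] scores = {-4.0, -1.0, 0.0, 1.0, 4.0};
        for (double z : scores) {
            double p = sigmoid(z);
            // Thresholding at 0.5 (an assumed cut-off) yields one of two labels.
            String label = p >= 0.5 ? "yes" : "no";
            System.out.printf("score=%5.1f  probability=%.4f  label=%s%n", z, p, label);
        }
    }
}
```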

Types of Logistic Regression

The types of logistic regression are:

1. Ordinal logistic regression

2. Multinomial logistic regression

3. Binomial logistic regression

Ordinal logistic regression

If the values of the dependent variable are ordinal, the model is called ordinal logistic regression. It is used to predict a dependent variable with multiple 'ordered' categories, given one or more independent variables.

Example: To predict the belief that a tax is too high, the dependent variable ranges from strongly agree to strongly disagree, and the independent variables are age and income. In this case, we use ordinal logistic regression.
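
A common way to model ordered categories is the proportional-odds formulation, where P(Y <= j) = sigmoid(threshold_j - score). The article does not name a specific formulation, so treat this as a sketch under that assumption. The five answer levels, the thresholds and the coefficients for age and income below are hypothetical, not fitted values.

```java
/** Minimal sketch of a proportional-odds (cumulative logit) model for ordered categories. */
final class OrdinalLogitSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * Category probabilities when P(Y <= j) = sigmoid(thresholds[j] - eta).
     * The thresholds must be in increasing order; there is one fewer than categories.
     */
    static double[] categoryProbabilities(double eta, double[] thresholds) {
        int categories = thresholds.length + 1;
        double[] probs = new double[categories];
        double previousCumulative = 0.0;
        for (int j = 0; j < categories; j++) {
            // The last category absorbs the remaining probability mass.
            double cumulative = (j < categories - 1) ? sigmoid(thresholds[j] - eta) : 1.0;
            probs[j] = cumulative - previousCumulative;
            previousCumulative = cumulative;
        }
        return probs;
    }

    public static void main(String[] args) {
        String[] levels = {"strongly agree", "agree", "neutral", "disagree", "strongly disagree"};
        double[] thresholds = {-2.0, -1.0, 1.0, 2.0}; // hypothetical cut points
        double age = 40.0;                            // hypothetical respondent
        double incomeInThousands = 50.0;
        // Hypothetical coefficients for age and income.
        double eta = 0.02 * age - 0.01 * incomeInThousands;
        double[] probs = categoryProbabilities(eta, thresholds);
        for (int j = 0; j < levels.length; j++) {
            System.out.printf("%-18s %.4f%n", levels[j], probs[j]);
        }
    }
}
```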

Multinomial Logistic regression

Multinomial logistic regression is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression.

Example: To understand which type of drink consumers prefer based on their location in the US and their age. The dependent variable would be the type of drink (Coffee, Soft Drink, Tea and Water), and the independent variables would be the nominal variable location in the US and the age (in years).
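
In a multinomial model, each category gets its own linear score, and the softmax function turns those scores into probabilities that sum to 1. The sketch below illustrates this for the four drinks; the class name MultinomialLogitSketch and the scores are hypothetical and not derived from any data.

```java
/** Minimal sketch: softmax turns one linear score per category into probabilities. */
final class MultinomialLogitSketch {

    /** Numerically stable softmax: subtracts the maximum score before exponentiating. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] probs = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            probs[i] = Math.exp(scores[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    /** Index of the largest value, i.e. the predicted category. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] drinks = {"Coffee", "Soft Drink", "Tea", "Water"};
        // Hypothetical linear scores for one consumer (one score per drink).
        double[] scores = {0.2, 1.5, -0.3, 0.9};
        double[] probs = softmax(scores);
        for (int i = 0; i < drinks.length; i++) {
            System.out.printf("%-10s %.4f%n", drinks[i], probs[i]);
        }
        System.out.println("Predicted: " + drinks[argMax(probs)]);
    }
}
```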

Binomial Logistic regression

A binomial logistic regression predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable, based on one or more independent variables that can be either continuous or categorical. It is often called simple logistic regression.

Example: Let us predict whether students will pass their final exam (i.e. the dependent variable takes the values Pass and Fail), based on internal marks, assignment submission and a few other independent variables.
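
The pass/fail example can be sketched with one continuous variable (internal marks) and one categorical variable (assignment submitted or not). The coefficients below are hypothetical values picked for illustration, not the result of fitting any data, and the class name PassFailSketch is likewise an assumption.

```java
/** Minimal sketch: binomial logistic prediction from a continuous and a categorical input. */
final class PassFailSketch {

    // Hypothetical coefficients, not fitted to any real data.
    static final double INTERCEPT = -4.0;
    static final double PER_INTERNAL_MARK = 0.08;
    static final double ASSIGNMENT_SUBMITTED = 1.2;

    /** Probability of passing, computed from the linear predictor via the logistic function. */
    static double probabilityOfPass(double internalMarks, boolean assignmentSubmitted) {
        // The categorical input is dummy-coded: its weight applies only when true.
        double z = INTERCEPT
                + PER_INTERNAL_MARK * internalMarks
                + (assignmentSubmitted ? ASSIGNMENT_SUBMITTED : 0.0);
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Maps the probability to one of the two categories using an assumed 0.5 cut-off. */
    static String predict(double internalMarks, boolean assignmentSubmitted) {
        return probabilityOfPass(internalMarks, assignmentSubmitted) >= 0.5 ? "Pass" : "Fail";
    }

    public static void main(String[] args) {
        System.out.println(predict(50.0, true));
        System.out.println(predict(20.0, false));
    }
}
```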

Sample Training and Testing Data

[Figure: Training Data (Train Diagram)]
[Figure: Testing Data (Test Diagram)]

Notes

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the University of Waikato for use with the Weka machine learning software.
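
To show what such a file looks like, the sketch below embeds a small ARFF text and reads its header and data section with plain Java (no Weka needed). The relation name, attributes and rows are illustrative only and are not the contents of the files used in this article; the class ArffSketch is a hypothetical helper, not a replacement for Weka's loader.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: the basic layout of an ARFF file, read with plain Java. */
final class ArffSketch {

    // Illustrative ARFF text: a header (@relation, @attribute) followed by @data rows.
    static final String SAMPLE =
            "% lines starting with a percent sign are comments\n"
            + "@relation weather.sketch\n"
            + "\n"
            + "@attribute outlook {sunny, overcast, rainy}\n"
            + "@attribute windy {TRUE, FALSE}\n"
            + "@attribute play {yes, no}\n"
            + "\n"
            + "@data\n"
            + "sunny,FALSE,no\n"
            + "overcast,TRUE,yes\n"
            + "rainy,FALSE,yes\n";

    /** Collects the attribute names declared in the header. */
    static List<String> attributeNames(String arff) {
        List<String> names = new ArrayList<>();
        for (String line : arff.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("@attribute")) {
                names.add(trimmed.split("\\s+")[1]);
            }
        }
        return names;
    }

    /** Counts the instance rows that follow the @data marker. */
    static int countInstances(String arff) {
        boolean inData = false;
        int count = 0;
        for (String line : arff.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("@data")) {
                inData = true;
                continue;
            }
            if (inData && !trimmed.isEmpty() && !trimmed.startsWith("%")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("attributes: " + attributeNames(SAMPLE));
        System.out.println("instances:  " + countInstances(SAMPLE));
    }
}
```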

Logistic Regression Demo

Logistic Regression
package com.gg.ml;

import java.io.File;
import java.io.IOException;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class LogisticRegressionDemo {

    /** File names of the data sets, resolved from the classpath. */
    public static final String TRAINING_DATA_SET_FILENAME="weather.nominal.arff";
    public static final String TESTING_DATA_SET_FILENAME="weather.nominal .test.arff";
    public static final String PREDICTION_DATA_SET_FILENAME="weather.nominal-confused.arff";

    /**
     * Loads a data set from an ARFF file on the classpath.
     *
     * @param fileName name of the ARFF resource to load
     * @return the loaded instances, with the class index set
     * @throws IOException if the resource cannot be read
     */
    public static Instances getDataSet(String fileName) throws IOException {
        // Index of the class attribute; set it based on the data given in the ARFF files.
        int classIdx = 1;
        // The ArffLoader that reads the ARFF file.
        ArffLoader loader = new ArffLoader();
        // Alternatively, load from the file system:
        // loader.setFile(new File(fileName));
        // Load the data from a classpath resource.
        loader.setSource(LogisticRegressionDemo.class.getResourceAsStream("/" + fileName));
        Instances dataSet = loader.getDataSet();
        dataSet.setClassIndex(classIdx);
        return dataSet;
    }

    /**
     * This method is used to process the input and return the statistics.
     *
     * @throws Exception
     */
    public static void process() throws Exception {
        Instances trainingDataSet = getDataSet(TRAINING_DATA_SET_FILENAME);
        Instances testingDataSet = getDataSet(TESTING_DATA_SET_FILENAME);
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Explanation

  1. Logistic regression outputs probabilities based on the following equation:

logit(p_i) = log(p_i / (1 − p_i)) = β_0 + β_1 x_1 + ... + β_k x_k

The coefficients are the β values, β_0 through β_k.

  2. Odds ratios are simply the exponentials of the weights. For example, the first coefficient in the output is outlook=sunny: -3.5821. Calculating exp(-3.5821) gives 0.0278, which is the corresponding value in the odds ratio table. The coefficient itself is the log odds ratio, log(Odds(outlook=sunny)/Odds(outlook=¬sunny)).
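
The arithmetic in the two points above can be checked with a few lines of plain Java. This is only a sketch: the class and method names are made up for illustration, and the only value taken from the article is the coefficient -3.5821.

```java
/** Minimal sketch: logit, its inverse, and the odds ratio of a coefficient. */
final class OddsRatioSketch {

    /** logit(p) = log(p / (1 - p)), the log-odds of a probability p in (0, 1). */
    static double logit(double p) {
        return Math.log(p / (1.0 - p));
    }

    /** Inverse of logit: turns log-odds back into a probability. */
    static double inverseLogit(double logOdds) {
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    /** The odds ratio is the exponential of a coefficient (a log odds ratio). */
    static double oddsRatio(double coefficient) {
        return Math.exp(coefficient);
    }

    public static void main(String[] args) {
        double coefficient = -3.5821; // outlook=sunny, as quoted in the text
        // Roughly 0.0278, the value reported in the odds ratio table.
        System.out.println("odds ratio = " + oddsRatio(coefficient));
        // Taking the log of the odds ratio recovers the coefficient.
        System.out.println("log(odds ratio) = " + Math.log(oddsRatio(coefficient)));
    }
}
```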

The output also shows the number of correctly and incorrectly classified instances. From these counts, we can judge the accuracy of the algorithm on the data sets that we have.

Note: In the next article, I will focus on the next algorithm with an example, and also on how to use the Weka library.
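
Accuracy follows directly from those two counts: it is the share of correctly classified instances among all evaluated instances. A minimal sketch follows; the counts used in main are hypothetical and are not results from this article's data sets.

```java
/** Minimal sketch: accuracy from correctly and incorrectly classified instance counts. */
final class AccuracySketch {

    /** Fraction of evaluated instances that were classified correctly. */
    static double accuracy(int correct, int incorrect) {
        int total = correct + incorrect;
        if (total == 0) {
            throw new IllegalArgumentException("no instances were evaluated");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only.
        int correct = 9;
        int incorrect = 5;
        System.out.printf("accuracy = %.2f%%%n", 100.0 * accuracy(correct, incorrect));
    }
}
```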
