Open Source Your Knowledge, Become a Contributor
Technology knowledge has to be shared and made accessible for free. Join the movement.
Machine Learning with Java  Part 5 (Naive Bayes)
In my previous articles we have seen series of algorithms : Linear Regression , Logistic Regression , Nearest Neighbor,Decision Tree and this article describes about the Naive Bayes algorithm. The simplest solutions are the most powerful ones and Naive Bayes is the best example for the same.
Naive Bayes
Naive Bayes is a family of probabilistic algorithm that takes an advantage of probability theory and Bayes theory to predict the category of sample. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation which makes it particularly useful for very large datasets.
Example
To find whether an email is ham or spam, we have training samples listed below
Training Data
now we will calculate which category the sentence "you won lottery" belongs to? Naive Bayes is a probabilistic classifier. To determine whether an email is ham or spam, we want to calculate the probability that the sentence "you won lottery" is ham and the probability that its spam. After the calculation, we will take the largest one.
How to calculate the probabilities?
The probabilities are calculated using the word frequencies. It is calculated using some basic properties of probabilities and the Bayes theorem. The conditional probabilities like the one discussed here will suits for the Bayes theorem.
P (A  B) = (P (B  A) * P(A))/ P(B)
Let us do one sample example to understand the probability calculation.
P (ham you won lottery) = (P (you won lottery  ham) * P(ham))/ P (you won lottery)
In our classifier, we are trying to find out the category which has the bigger probability.so we can discard the divisor and perform how many times the sentence "you won lottery" appears in ham category divide it by the total and obtain P (you won lottery  ham)
we here assume that every word in a sentence is independent of the other ones.so, we are not considering entire sentences but rather at individualâ€™s words. Example: "Good things are â€śand "things are Good" are same.
P (you won lottery  ham) = P (you  ham) * P (won  ham) * P (lottery  ham)
P (you won lottery  spam) = P (you  spam) * P (won  spam) * P (lottery  spam)
with this above formula, we will find which one this sentence belongs to.
There are cases, some of the words will not be in training sets and the value is 0. The multiplication of 0 with others will be 0. To avoid this, we are going to use Laplace smoothing. i.e., we will add 1 to every count so it will not be zero
The words are listed in the above table. The total number of words: 15 The total number of ham words: 7 The total number of spam words: 10
You won Lottery?
P (you  ham) = (1 + 1)/ (7 + 15)
P (you  spam) = (1 + 1)/ (10 + 15)
Like the above we will calculate for every word and multiply using the above probability. The results for the one who have higher probability won.
"You won Lottery" belongs to spam category
Naive Bayes Multinomial Demo
Code Explanation
In the above example, we have used the multinomial weka classifier for naive bayes. We can use another naive Bayes classifier in weka. If you want to try out different classifier just instantiate the specific classifier in the code (Line number 64 in code) and work on the same.
Advantages

Easy to understand and implement

The classifier model is fast to build

No optimization is required

Easily updateable if new training data is received

Performance is good
Challenges with Naive Bayes
Despite its many strengths and robust performance, there are well known weaknesses with this classification technique,

Incomplete training Data  In our case also to overcome this, we have done smoothing so that the probability wonâ€™t be zero if any of the words are not in training set.

Continuous Variable  The probability computation by the traditional method of frequency count is not possible.

Attribute independence  This requires a little bit of extra effort. In the calculation of outcome probabilities, the implicit assumption is that all the attributes are mutually independent. This allows us to multiply the class conditional probabilities to compute the outcome probability.