Python Naive Bayes in Machine Learning


Bayes' theorem finds many uses in probability theory and statistics. There is only a slim chance that you have never come across this theorem before. It has also found its way into the world of machine learning, where it forms the basis of one of the most widely used algorithms. In this article, we will learn all about the Naive Bayes algorithm, along with its variations for different purposes in machine learning.

As you might have guessed, this requires us to view things from a probabilistic point of view. Just as in machine learning, we have attributes, response variables, and predictions or classifications. Using this algorithm, we will work with the probability distributions of the variables in the dataset and predict the probability that the response variable takes a particular value, given the attributes of a new instance. Let's start by reviewing Bayes' theorem.

Bayes' Theorem

Bayes' theorem lets us examine the probability of an event based on prior knowledge of conditions related to that event. For example, the probability that the price of a house is high can be assessed more accurately if we know about the facilities around it, compared to an assessment made without any knowledge of the house's location.

P(A|B) = P(B|A) * P(A) / P(B)

The above equation is the basic statement of Bayes' theorem. Here A and B are two events, and:

P(A|B) : the conditional probability that event A occurs, given that B has occurred. This is also known as the posterior probability.

P(A) and P(B) : the probabilities of A and B without regard to each other.

P(B|A) : the conditional probability that event B occurs, given that A has occurred.
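To make the theorem concrete, here is a small sketch in Python using made-up numbers for the house-price example above (the probabilities are assumptions chosen purely for illustration):

```python
# Toy illustration of Bayes' theorem with assumed numbers:
# A = "house price is high", B = "house is near good facilities".
p_a = 0.3          # P(A): prior probability that the price is high
p_b_given_a = 0.8  # P(B|A): prob. of good facilities given a high price
p_b = 0.4          # P(B): overall probability of good facilities

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.6 -- observing B raises our belief in A from 0.3
```

Notice how conditioning on B doubles the probability we assign to A, which is exactly the kind of update the theorem formalizes.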

Now, let's see how this suits well to the purpose of machine learning.

Take a simple machine learning problem, where we need to learn a model from a given set of attributes (the training examples) and then form a hypothesis, or a relation, to a response variable. We then use this relation to predict a response, given the attributes of a new instance. Using Bayes' theorem, it's possible to build a learner that predicts the probability of the response variable belonging to some class, given a new set of attributes.

Consider the previous equation again. Now, assume that A is the response variable and B is the input attribute. So according to the equation, we have

P(A|B) : conditional probability of response variable belonging to a particular value, given the input attributes. This is also known as the posterior probability.

P(A) : The prior probability of the response variable.

P(B) : The probability of the training data, also known as the evidence.

P(B|A) : This is known as the likelihood of the training data.

Therefore, the above equation can be rewritten as

P(A|B) = P(B|A) * P(A) / P(B)  =  (likelihood * prior) / evidence

Let's take a problem where the number of attributes is equal to n and the response is a boolean value, i.e. it can be in one of two classes. Also, the attributes are categorical (2 categories in our case). Now, to train the classifier, we will need to calculate P(B|A) for all the values in the instance and response space. This means we will need to estimate 2*(2^n - 1) parameters to learn this model, which is clearly unrealistic in most practical learning domains. For example, if there are 30 boolean attributes, then we will need to estimate more than 2 billion parameters.
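The combinatorial explosion in the parameter count is easy to verify numerically; the helper name below is just for illustration:

```python
# Parameters needed by a full joint Bayesian classifier with n boolean
# attributes and a boolean class: 2 * (2**n - 1).
def full_joint_params(n):
    return 2 * (2 ** n - 1)

print(full_joint_params(5))   # 62
print(full_joint_params(30))  # 2147483646 -- over 2 billion
```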

Naive Bayes Algorithm

The complexity of the above Bayesian classifier needs to be reduced for it to be practical. The naive Bayes algorithm does this by making an assumption of conditional independence over the training dataset, which drastically reduces the number of parameters in the above problem to just 2n.

The assumption of conditional independence states that, given random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z.

In other words, X and Y are conditionally independent given Z if and only if, given knowledge that Z occurs, knowledge of whether X occurs provides no information on the likelihood of Y occurring, and knowledge of whether Y occurs provides no information on the likelihood of X occurring.

This assumption is what makes the Bayes algorithm naive.

Given n different attribute values, the likelihood can now be written as

P(X|Y) = P(X1|Y) * P(X2|Y) * ... * P(Xn|Y)

Here, X represents the attributes or features, and Y is the response variable. So P(X|Y) becomes the product of the probability distributions of each attribute Xi given Y.
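This factorization is a one-liner in Python. The per-attribute probabilities below are assumed values for a single class, purely for illustration:

```python
# Under the conditional-independence assumption, P(X|Y) factorizes
# into a product of per-attribute probabilities.
from math import prod

# Assumed values of P(X_i = x_i | Y = y) for three attributes
per_attribute = [0.9, 0.5, 0.2]

likelihood = prod(per_attribute)  # P(X|Y) = product of the factors
print(likelihood)  # ~0.09 (up to float rounding)
```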

Maximum a Posteriori

What we are interested in is finding the posterior probability, P(Y|X). For multiple values of Y, we will need to calculate this expression for each of them.

Given a new instance Xnew, we need to calculate the probability that Y will take on any given value, given the observed attribute values of Xnew and given the distributions P(Y) and P(X|Y) estimated from the training data.

So, how do we predict the class of the response variable, based on the different values we obtain for P(Y|X)? We simply take the most probable, i.e. the maximum, of these values. This procedure is therefore known as maximum a posteriori (MAP) estimation.
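A minimal MAP prediction sketch, assuming the prior P(Y) and the per-attribute likelihoods P(Xi|Y) have already been estimated from some hypothetical training data (the class names and numbers below are invented for illustration):

```python
from math import prod

# Assumed estimates: P(Y) and P(X_i = 1 | Y) for two binary attributes
prior = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.7],  # P(x_i = 1 | spam)
    "ham":  [0.1, 0.3],  # P(x_i = 1 | ham)
}

def predict(x):
    # Score each class by P(Y) * prod(P(X_i|Y)); pick the max (MAP).
    def score(y):
        factors = [p if xi == 1 else 1 - p
                   for xi, p in zip(x, likelihoods[y])]
        return prior[y] * prod(factors)
    return max(prior, key=score)

print(predict([1, 1]))  # "spam": 0.4*0.8*0.7 = 0.224 beats 0.6*0.1*0.3
print(predict([0, 0]))  # "ham":  0.6*0.9*0.7 = 0.378 beats 0.4*0.2*0.3
```

Note that the evidence P(X) is omitted: it is the same for every class, so it does not change which class attains the maximum.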

Maximizing Likelihood

If we assume that the response variable is uniformly distributed, that is, every response is equally likely, then we can simplify the algorithm further. With this assumption the prior P(Y) becomes a constant, namely 1/k, where k is the number of classes of the response.

As the prior and the evidence are now independent of the response variable, they can be removed from the equation. Maximizing the posterior therefore reduces to maximizing the likelihood.

Feature Distribution

As seen above, we need to estimate the distribution of the response variable from the training set, or assume a uniform distribution. Similarly, to estimate the parameters of a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. Such assumptions are known as event models, and varying them yields different algorithms for different purposes. For continuous features, Gaussian naive Bayes is the algorithm of choice. For discrete features, multinomial and Bernoulli event models are popular. A detailed discussion of these variations is beyond the scope of this article.
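To see how the Gaussian event model plugs into the machinery above, here is a minimal Gaussian naive Bayes sketch in plain Python (in practice you would use a library implementation such as scikit-learn's GaussianNB; the tiny dataset and function names below are assumptions for illustration):

```python
from math import exp, pi, sqrt
from statistics import mean, variance

def gaussian_pdf(x, mu, var):
    # Density of a Gaussian with mean mu and variance var at x
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def fit(X, y):
    # Estimate, per class: the prior and each feature's (mean, variance)
    params = {}
    for label in set(y):
        rows = [x for x, yi in zip(X, y) if yi == label]
        cols = list(zip(*rows))
        params[label] = (len(rows) / len(y),
                         [(mean(c), variance(c)) for c in cols])
    return params

def predict(params, x):
    # MAP: score each class by prior * product of Gaussian likelihoods
    def score(label):
        prior, stats = params[label]
        s = prior
        for xi, (mu, var) in zip(x, stats):
            s *= gaussian_pdf(xi, mu, var)
        return s
    return max(params, key=score)

# Hypothetical two-feature training set with two classes
X = [[1.0, 2.1], [1.2, 1.9], [5.0, 8.2], [5.2, 7.9]]
y = ["a", "a", "b", "b"]
model = fit(X, y)
print(predict(model, [1.1, 2.0]))  # "a"
```

Swapping `gaussian_pdf` for a categorical count-based estimate would give the multinomial/Bernoulli variants mentioned above.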

Naive Bayes classifiers work remarkably well in complex situations, despite their simplifying assumptions and naivety. One advantage of these classifiers is that they require only a small amount of training data to estimate the parameters necessary for classification. Naive Bayes is also the algorithm of choice for text categorization. This is the basic idea behind naive Bayes classifiers, and it is all you need to start experimenting with the algorithm.

Happy Pythoning....!!
