# Naive Bayes in Machine Learning

Bayes' theorem finds many uses in the probability theory and statistics. There's a micro chance that you have never heard about this theorem in your life. Turns out that this theorem has found its way into the world of machine learning, to form one of the highly decorated algorithms. In this article, we will learn all about the Naive Bayes Algorithm, along with its variations for different purposes in machine learning.

As you might have guessed, this requires us to view things from a probabilistic point of view. Just as in machine learning, we have attributes, response variables and predictions or classifications. Using this algorithm, we will be dealing with the probability distributions of the variables in the dataset and predicting the probability of the response variable belonging to a particular value, given the the attributes of a new instance. Lets start by reviewing the Bayes' theorem.

## Bayes' Theorem

This lets us examine the probability of an event based on the prior knowledge of any event that related to the former event. So for example, the probability that price of a house is high, can be better assessed if we know the facilities around it, compared to the assessment made without the knowledge of location of the house. Bayes' theorem does exactly that.

Above equation gives the basic representation of the Bayes' theorem. Here A and B are two events and,

P(A|B) : the conditional probability that event A occurs , given that B has occurred. This is also known as the posterior probability.

P(A) and P(B) : probability of A and B without regard of each other.

P(B|A) : the conditional probability that event B occurs , given that A has occurred.

Now, let's see how this suits well to the purpose of machine learning.

Take a simple machine learning problem, where we need to learn our model from a given set of attributes(in training examples) and then form a hypothesis or a relation to a response variable. Then we use this relation to predict a response, given attributes of a new instance. Using the Bayes' theorem, its possible to build a learner that predicts the probability of the response variable belonging to some class, given a new set of attributes.

Consider the previous equation again. Now, assume that A is the response variable and B is the input attribute. So according to the equation, we have

P(A|B) : conditional probability of response variable belonging to a particular value, given the input attributes. This is also known as the posterior probability.

P(A) : The prior probability of the response variable.

P(B) : The probability of training data or the evidence.

P(B|A) : This is known as the likelihood of the training data.

Therefore, the above equation can be rewritten as

Let's take a problem, where the number of attributes is equal to n and the response is a boolean value, i.e. it can be in one of the two classes. Also, the attributes are categorical(2 categories for our case). Now, to train the classifier, we will need to calculate P(B|A), for all the values in the instance and response space. This means, we will need to calculate 2*(2^n -1), parameters for learning this model. This is clearly unrealistic in most practical learning domains. For example, if there are 30 boolean attributes, then we will need to estimate more than 3 billion parameters.

## Naive Bayes Algorithm

The complexity of the above Bayesian classifier needs to be reduced, for it to be practical. The naive Bayes algorithm does that by making an assumption of conditional independence over the training dataset. This drastically reduces the complexity of above mentioned problem to just 2n.

The assumption of conditional independence states that, given random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z.

In other words, X and Y are conditionally independent given Z if and only if, given knowledge that Z occurs, knowledge of whether X occurs provides no information on the likelihood of Y occurring, and knowledge of whether Y occurs provides no information on the likelihood of X occurring.

This assumption makes the Bayes algorithm, naive.

Given, n different attribute values, the likelihood now can be written as

Here, X represents the attributes or features, and Y is the response variable. Now, P(X|Y) becomes equal to the products of, probability distribution of each attribute X given Y.

## Maximizing a Posteriori

What we are interested in, is finding the posterior probability or P(Y|X). Now, for multiple values of Y, we will need to calculate this expression for each of them.

Given a new instance Xnew, we need to calculate the probability that Y will take on any given value, given the observed attribute values of Xnew and given the distributions P(Y) and P(X|Y) estimated from the training data.

So, how will we predict the class of the response variable, based on the different values we attain for P(Y|X). We simply take the most probable or maximum of these values. Therefore, this procedure is also known as maximizing a posteriori.

## Maximizing Likelihood

If we assume that the response variable is uniformly distributed, that is it is equally likely to get any response, then we can further simplify the algorithm. With this assumption the priori or P(Y) becomes a constant value, which is 1/categories of the response.

As, the priori and evidence are now independent of the response variable, these can be removed from the equation. Therefore, the maximizing the posteriori is reduced to maximizing the likelihood problem.

## Feature Distribution

As seen above, we need to estimate the distribution of the response variable from training set or assume uniform distribution. Similarly, to estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. Such assumptions are known as event models. The variations in these assumptions generates different algorithms for different purposes. For continuous distributions, the Gaussian naive Bayes is the algorithm of choice. For discrete features, multinomial and Bernoulli distributions as popular. Detailed discussion of these variations are out of the scope of this article.

Naive Bayes classifiers work really well in complex situations, despite the simplified assumptions and naivety. The advantage of these classifiers is that they require small number of training data for estimating the parameters necessary for classification. This is the algorithm of choice for text categorization. This is the basic idea behind naive Bayes classifiers, that you need to start experimenting with the algorithm.

Happy Pythoning....!!

#### More Articles of Aditi Kothiyal:

Name Views Likes
Python PyCaret How to optimize the probability threshold % in binary classification 2240 0
Python K-means Predicting Iris Flower Species 1391 2
Python PyCaret How to ignore certain columns for model building 3243 0
Python PyCaret Experiment Logging 806 0
Python PyWin32 Open a File in Excel 1099 0
Python Guppy GSL Introduction 256 2
Python Usage of Guppy With Example 1176 2
Python Naive Bayes Tutorial 596 2
Python Guppy Recent Memory Usage of a Program 983 2
Python K-Means Clustering Applications 375 2
Python Random Forest Algorithm Decision Trees 486 0
Python K-means Clustering PREDICTING IRIS FLOWER SPECIES 513 1
Python Random Forest Algorithm Bootstrap 549 0
Python PyCaret Util Functions 495 0
Python K-means Music Genre Classification 1863 1
Python PyWin Attach an Excel file to Outlook 1731 0
Python Guppy GSL Document and Test Example 304 2
Python Random Forest Algorithm Bagging 439 0
Python PyWin32 Getting Started PyWin32 881 0
Python Naive Bayes in Machine Learning 414 2
Python PyCaret How to improve results from hyperparameter tuning by increasing "n_iter" 1822 0
Python PyCaret Getting Started with PyCaret 2.0 404 1
Python PyCaret Tune Model 1541 1
Python PyCaret Create your own AutoML software 387 0
Python PyCaret Intoduction to PyCaret 347 1
Python PyCaret Compare Models 2998 1
Python PyWin Copying Data into Excel 1250 0
Python Guppy Error: expected function body after function declarator 451 2
Python Coding Random forest classifier using xgBoost 282 0
Python PyCaret How to tune "n parameter" in unsupervised experiments 715 0
Python PyCaret How to programmatically define data types in the setup function 1476 0
Python PyCaret Ensemble Model 881 1
Python Random forest algorithm Introduction 257 0
Python k-means Clustering Example 376 1
Python PyCaret Plot Model 1382 1
Python Hamming Distance 769 0
Python Understanding Random forest algorithm 341 0
Python PyCaret Sort a Dictionary by Keys 279 0
Python Coding Random forest classifier using sklearn 389 0
Python Guppy Introduction 423 2
Python How to use Guppy/Heapy for tracking down Memory Usage 1154 2
Python AdaBoost Summary and Conclusion 270 1
Python PyCaret Create Model 408 1
Python k -means Clusturing Introduction 386 2
Python k-means Clustering With Example 397 2