NAIVE BAYES CLASSIFIER WITH NLTK
The Naive Bayes classifier is a popular algorithm for text classification, which is why it is such a common first choice for this kind of task.
Before we can train and test our algorithm, however, we need to go ahead and split up the data into a training set and a testing set.
You could train and test on the same dataset, but this would present you with some serious bias issues, so you should never train and test against the exact same data.
To do this, since we've shuffled our data set, we'll assign the first 1,900 shuffled reviews, consisting of both positive and negative reviews, as the training set.
Then, we can test against the last 100 to see how accurate we are.
This is called supervised machine learning because we're showing the machine data, and telling it "hey, this data is positive," or "this data is negative."
First, we have to split the data into training and testing sets, as shown below:
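A minimal sketch of the split, assuming `featuresets` is the shuffled list of (feature dict, label) pairs built in the earlier steps; here a toy stand-in is generated so the snippet runs on its own:

```python
import random

# Stand-in for the real featuresets built from nltk.corpus.movie_reviews:
# each entry is a (feature_dict, label) pair. The names here are hypothetical.
featuresets = [({"word_%d" % i: True}, "pos" if i % 2 else "neg")
               for i in range(2000)]
random.shuffle(featuresets)

# First 1,900 shuffled reviews for training, last 100 for testing.
training_set = featuresets[:1900]
testing_set = featuresets[1900:]
```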
The next step is to define and train the classifier, as shown below:
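A sketch of defining and training the classifier with NLTK's `NaiveBayesClassifier`; the tiny `training_set` here is a stand-in for the movie-review feature sets built earlier:

```python
import nltk

# Toy (feature_dict, label) pairs standing in for the movie-review featuresets.
training_set = [
    ({"great": True}, "pos"),
    ({"awful": True}, "neg"),
    ({"great": True, "fun": True}, "pos"),
    ({"awful": True, "boring": True}, "neg"),
]

# NaiveBayesClassifier.train builds the model from labeled feature dicts.
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Classify a new feature dict against the trained model.
print(classifier.classify({"great": True}))
```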
The last step is to run the test, as given below:
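A sketch of the accuracy test using `nltk.classify.accuracy`; the toy training and testing sets again stand in for the real 1,900/100 split:

```python
import nltk

# Toy labeled feature sets standing in for the real 1,900/100 split.
training_set = [
    ({"great": True}, "pos"),
    ({"awful": True}, "neg"),
    ({"great": True, "fun": True}, "pos"),
    ({"awful": True, "boring": True}, "neg"),
]
testing_set = [
    ({"great": True}, "pos"),
    ({"boring": True}, "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(training_set)

# nltk.classify.accuracy returns the fraction of correct labels (0.0 to 1.0).
accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy percent:", accuracy * 100)
```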
Next, we can take it a step further to see what the most valuable words are when it comes to positive or negative reviews:
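The call for this is `show_most_informative_features`; this sketch reuses the toy classifier from above, so its actual words and ratios will differ from the movie-review output:

```python
import nltk

# Toy training data; with the real movie reviews this table is far richer.
training_set = [
    ({"great": True}, "pos"),
    ({"awful": True}, "neg"),
    ({"great": True, "fun": True}, "pos"),
    ({"awful": True, "boring": True}, "neg"),
]
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Print the 15 features whose presence best separates the two labels.
classifier.show_most_informative_features(15)
```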
This is going to vary again for each person, but you should see something like:
Most Informative Features
      insulting = True              neg : pos    =     10.6 : 1.0
      ludicrous = True              neg : pos    =     10.1 : 1.0
        winslet = True              pos : neg    =      9.0 : 1.0
        detract = True              pos : neg    =      8.4 : 1.0
   breathtaking = True              pos : neg    =      8.1 : 1.0
    silverstone = True              neg : pos    =      7.6 : 1.0
 excruciatingly = True              neg : pos    =      7.6 : 1.0
          warns = True              pos : neg    =      7.0 : 1.0
          tracy = True              pos : neg    =      7.0 : 1.0
        insipid = True              neg : pos    =      7.0 : 1.0
        freddie = True              neg : pos    =      7.0 : 1.0
          damon = True              pos : neg    =      5.9 : 1.0
         debate = True              pos : neg    =      5.9 : 1.0
        ordered = True              pos : neg    =      5.8 : 1.0
           lang = True              pos : neg    =      5.7 : 1.0
For each word, this tells you the ratio of occurrences in negative versus positive reviews, or vice versa.
Using the Naive Bayes algorithm, you can classify text easily.