Python NLTK Text Classification


The goal of text classification can be pretty broad. Maybe we're trying to classify text as being about politics or about the military.

Maybe we're trying to classify it by the gender of the author who wrote it.

A fairly popular text classification task is to identify a body of text as either spam or not spam, for things like email filters.

In our case, we're going to try to create a sentiment analysis algorithm.

To do this, we're going to start with the movie reviews corpus that is included in NLTK's corpus collection.

From there, we'll try to use words as "features" that are part of either a positive or a negative movie review.
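
To make the idea of words as "features" a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the two toy reviews, the small vocabulary, and the helper name find_features are all made up for this example and are not taken from the NLTK data. Each review is turned into a dictionary that records, for every word we track, whether that word appears in the review.

```python
# Hypothetical toy data: two tiny "reviews" with labels,
# standing in for the real labeled movie reviews.
toy_reviews = [
    ("a great and moving film".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
]

# Hypothetical vocabulary: the words we choose to track as features.
vocabulary = ["great", "moving", "dull", "boring"]


def find_features(words, vocabulary):
    """Map each vocabulary word to True/False by presence in the review."""
    present = set(words)  # set lookup avoids rescanning the word list
    return {word: (word in present) for word in vocabulary}


# Pair each review's feature dictionary with its label.
feature_sets = [
    (find_features(words, vocabulary), label)
    for words, label in toy_reviews
]

for features, label in feature_sets:
    print(label, features)
```

A list of (feature dictionary, label) pairs like this is the general shape that NLTK's classifiers are trained on, so the same idea should carry over once we swap the toy reviews for the real ones.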

The NLTK movie reviews data set contains the reviews, and each one is already labeled as positive or negative.

This means we can train and test with this data. First, let's wrangle our data.

