Tagging Sentences
To access NLTK's POS tagger, we'll need to import it. Import statements should go at the beginning of a script, so let's put this new import under our other import statement.
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents
tweets = twitter_samples.strings('positive_tweets.json')
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
Now we can tag each of our tokens. NLTK lets us tag them all at once with pos_tag_sents(). We'll create a new variable, tweets_tagged, to store our tagged lists. This new line can go directly at the end of our current script:
tweets_tagged = pos_tag_sents(tweets_tokens)
To get an idea of what tagged tokens look like, here is what the first element in our tweets_tagged list looks like:
[(u'#FollowFriday', 'JJ'), (u'@France_Inte', 'NNP'), (u'@PKuchly57', 'NNP'), (u'@Milipol_Paris', 'NNP'), (u'for', 'IN'), (u'being', 'VBG'), (u'top', 'JJ'), (u'engaged', 'VBN'), (u'members', 'NNS'), (u'in', 'IN'), (u'my', 'PRP$'), (u'community', 'NN'), (u'this', 'DT'), (u'week', 'NN'), (u':)', 'NN')]
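If you'd like to reproduce this output yourself, you can temporarily add a few print statements after the tagging step. These lines are only for inspection and are not part of the final script:
print(tweets_tagged[0])        # the first tweet, as a list of (token, tag) tuples
print(tweets_tagged[0][0])     # the first (token, tag) pair in that tweet
print(tweets_tagged[0][0][1])  # just the tag of that pair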
Counting POS Tags
Now we will keep track of how many times JJ and NN appear, using accumulator (count) variables that we add to every time we find one of those tags. First, let's create our counts at the bottom of our script and set them to zero.
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents
tweets = twitter_samples.strings('positive_tweets.json')
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
tweets_tagged = pos_tag_sents(tweets_tokens)
JJ_count = 0
NN_count = 0
After we create the variables, we'll write two for loops. The outer loop will iterate through each tagged tweet in tweets_tagged, and the inner loop will iterate through each token/tag pair in that tweet. For each pair, we will look up the tag using the tuple's second index (pair[1]), since each pair is ordered as (token, tag).
We will then use conditional statements to check whether the tag matches the string 'JJ' or 'NN'. If the tag matches, we will add 1 (+= 1) to the appropriate accumulator.
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents
tweets = twitter_samples.strings('positive_tweets.json')
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
tweets_tagged = pos_tag_sents(tweets_tokens)
JJ_count = 0
NN_count = 0
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1
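As a side note on style, because each pair is a two-element tuple, you could also unpack it directly in the inner loop header instead of indexing into it. The sketch below is equivalent to the loops above (the variable names token and tag are our own choice, not required by NLTK):
for tweet in tweets_tagged:
    for token, tag in tweet:
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1
Both versions count the same things; indexing with pair[1] makes the tuple position explicit, while unpacking gives each element a descriptive name.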
After the two loops are complete, we should have the total count for adjectives and nouns in our corpus. To see how many adjectives and nouns our script found, we'll add print statements to the end of the script.
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1
print('Total number of adjectives = ', JJ_count)
print('Total number of nouns = ', NN_count)
At this point, our program will be able to output the number of adjectives and nouns that were found in the corpus.
Save your .py file and run it to see how many adjectives and nouns we find. Be patient; it might take a few seconds for the script to run. If all went well, the script will print the total number of adjectives and the total number of nouns it found in the corpus.
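As an optional aside, if you later want counts for every tag rather than just JJ and NN, Python's built-in collections.Counter can tally them all in one pass. This is a minimal sketch, not part of our script above, and it assumes tweets_tagged has already been built as shown earlier:
from collections import Counter
tag_counts = Counter(tag for tweet in tweets_tagged for (token, tag) in tweet)
print(tag_counts['JJ'])  # number of adjectives
print(tag_counts['NN'])  # number of nouns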