Description:
NLTK is one of the leading platforms for working with human language data in Python; the NLTK module is used for natural language processing. NLTK is an acronym for Natural Language Toolkit.
See the official NLTK documentation at https://www.nltk.org/ if you want to know more about NLTK.
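The examples below assume NLTK is installed and the Punkt tokenizer models have been downloaded; here is a minimal setup sketch (the exact resource name, 'punkt' or 'punkt_tab', depends on your NLTK version):
# Install NLTK first, e.g. with: pip install nltk
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize() and sent_tokenize()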
Tokenize Words:
A sentence or a piece of text can be split into words using the method word_tokenize().
from nltk.tokenize import word_tokenize

data = "Hey! How are You ?"
# Split the text into individual words and punctuation tokens
print(word_tokenize(data))
Output:
['Hey', '!', 'How', 'are', 'You', '?']
All of them are words except the exclamation mark and the question mark: punctuation marks are treated as separate tokens.
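If you only want the word tokens, one common approach (a sketch, not part of the original example) is to filter out tokens that are not alphabetic:
from nltk.tokenize import word_tokenize

data = "Hey! How are You ?"
tokens = word_tokenize(data)
# Keep only alphabetic tokens, dropping punctuation such as '!' and '?'
words_only = [t for t in tokens if t.isalpha()]
print(words_only)  # ['Hey', 'How', 'are', 'You']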
Tokenize Sentences:
The same principle can be applied to sentences: we can tokenize sentences using the method sent_tokenize().
from nltk.tokenize import sent_tokenize

data = "Hey! How are You ? Hey! How are You ?"
# Split the text into individual sentences
print(sent_tokenize(data))
Output:
['Hey!', 'How are You ?', 'Hey!', 'How are You ?']
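A sentence list like this can be broken down further; the sketch below (not part of the original example) runs word_tokenize() over each sentence returned by sent_tokenize():
from nltk.tokenize import sent_tokenize, word_tokenize

data = "Hey! How are You ? Hey! How are You ?"
for sentence in sent_tokenize(data):
    # Tokenize each sentence into its own list of words and punctuation
    print(word_tokenize(sentence))
The first two lines of output would be ['Hey', '!'] and ['How', 'are', 'You', '?'].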
NLTK and Arrays
Both tokenizers return ordinary Python lists, so the results can be stored in variables and processed like any other list.
from nltk.tokenize import sent_tokenize, word_tokenize

data = "Hey! Are you okay? Hey! Are you okay?"
sen = sent_tokenize(data)      # list of sentences
words = word_tokenize(data)    # list of words and punctuation
print(sen)
print(words)
Output:
['Hey!', 'Are you okay?', 'Hey!', 'Are you okay?']
['Hey', '!', 'Are', 'you', 'okay', '?', 'Hey', '!', 'Are', 'you', 'okay', '?']
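Because both results are plain Python lists, they can be passed straight to other NLTK tools. For example, a token frequency count with nltk.FreqDist (a sketch, not part of the original example):
from nltk import FreqDist
from nltk.tokenize import word_tokenize

data = "Hey! Are you okay? Hey! Are you okay?"
words = word_tokenize(data)
freq = FreqDist(words)          # counts how often each token occurs
print(freq.most_common(3))      # e.g. [('Hey', 2), ('!', 2), ('Are', 2)]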