Advanced Project Sentiment and WordCloud Analysis of Online Reviews

Hello, Rishabh here, and this time I bring to you:

Continuing the series - 'Simple Python Project'. These are simple projects with which beginners can get started. This series will cover beginner, intermediate and advanced Python, machine learning, and later deep learning.

Comments recommending other Python projects to cover are very welcome.

Anyway, let's crack on with it!

Sentiment and WordCloud Analysis of Online Reviews

In [15]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from plotly import tools
import plotly.offline as py
import plotly.graph_objs as go
%matplotlib inline
import warnings

# path to the dataset was elided in the original
df1 = pd.read_csv('')
# keep only the columns we need; .copy() avoids a SettingWithCopyWarning
# when new columns are added later
df = df1[['Review Text','Rating','Class Name','Age']].copy()
df.head()
   Review Text                                         Rating  Class Name  Age
0  Absolutely wonderful - silky and sexy and comf...   4      Intimates   33
1  Love this dress! it's sooo pretty. i happene...     5      Dresses     34
2  I had such high hopes for this dress and reall...   3      Dresses     60
3  I love, love, love this jumpsuit. it's fun, fl...   5      Pants       50
4  This shirt is very flattering to all due to th...   5      Blouses     47
In [4]:
# fill NA values with an empty string
df['Review Text'] = df['Review Text'].fillna('')

# CountVectorizer() converts a collection
# of text documents to a matrix of token counts
vectorizer = CountVectorizer()
# assign a shorter name for the analyzer,
# which tokenizes the string
analyzer = vectorizer.build_analyzer()

def wordcounts(s):
    c = {}
    # tokenize the string and continue, if it is not empty
    if analyzer(s):
        d = {}
        # find counts of the vocabulary items and transform to an array
        w = vectorizer.fit_transform([s]).toarray()
        # vocabulary and index (index of w)
        vc = vectorizer.vocabulary_
        # items() yields the dictionary's (word, index) pairs
        for k,v in vc.items():
            d[v] = k  # d -> index:word
        for index,i in enumerate(w[0]):
            c[d[index]] = i  # c -> word:count
    return c

# add a new column to the dataframe
df['Word Counts'] = df['Review Text'].apply(wordcounts)
df.head()
   Review Text                                         Rating  Class Name  Age  Word Counts
0  Absolutely wonderful - silky and sexy and comf...   4      Intimates   33   {'absolutely': 1, 'and': 2, 'comfortable': 1, ...
1  Love this dress! it's sooo pretty. i happene...     5      Dresses     34   {'am': 1, 'and': 2, 'bc': 2, 'be': 1, 'below':...
2  I had such high hopes for this dress and reall...   3      Dresses     60   {'and': 3, 'be': 1, 'bottom': 1, 'but': 2, 'ch...
3  I love, love, love this jumpsuit. it's fun, fl...   5      Pants       50   {'and': 1, 'but': 1, 'compliments': 1, 'every'...
4  This shirt is very flattering to all due to th...   5      Blouses     47   {'adjustable': 1, 'all': 1, 'and': 1, 'any': 1...
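The index-to-word inversion that wordcounts() performs can be checked in isolation on a single sentence. This is a minimal sketch using only CountVectorizer; the review text is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

s = "love love this dress"  # made-up example review
vec = CountVectorizer()
counts = vec.fit_transform([s]).toarray()[0]
# invert vocabulary_ (word -> column index) to recover word -> count,
# exactly as wordcounts() does
word_counts = {w: int(counts[i]) for w, i in vec.vocabulary_.items()}
print(sorted(word_counts.items()))  # [('dress', 1), ('love', 2), ('this', 1)]
```

Note that CountVectorizer lowercases the text and, by default, drops single-character tokens, which is why words like "i" never appear in the Word Counts column above.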

Demonstrating the Densities of Class Names, Some Selected Words and All Words in the Reviews By Using WordCloud

In this section, I demonstrate the word densities, which can be very informative. First, I selected some words that express customer sentiment, like love, hate, fantastic or regret. Second, since we do not know the product names, I decided to check the product class names; by doing this, we can at least learn the most preferred classes. Further, I thought that looking at the densities of all words in the reviews might be interesting. Lastly, I used the WordCloud module and printed the first five lines of the tables which show the word counts for the selected words and the class names.

It can be observed from the figures and tables below that positive words such as love, great and super were used most. When we look at the classes, customers mostly preferred dresses, knits and blouses. We can also see that dress and love are among the frequently used words across all reviews.

In [5]:
# the full word list was truncated in the original; the extra words below
# are restored from the surrounding prose and the printed counts
selectedwords = ['awesome','great','fantastic','extraordinary','amazing','super',
                 'love','happy','glad','hate','regret']

def selectedcount(dic,word):
    if word in dic:
        return dic[word]
    return 0

dfwc = df.copy()
for word in selectedwords:
    dfwc[word] = dfwc['Word Counts'].apply(selectedcount,args=(word,))

word_sum = dfwc[selectedwords].sum()
print('Selected Words')
print(word_sum.sort_values(ascending=False).iloc[:5])

print('\nClass Names')
print(df['Class Name'].fillna("Empty").value_counts().iloc[:5])

fig, ax = plt.subplots(1,2,figsize=(20,10))
wc0 = WordCloud(background_color='white',
                height=400).generate_from_frequencies(word_sum)

cn = df['Class Name'].fillna(" ").value_counts()
wc1 = WordCloud(background_color='white',
                height=400).generate_from_frequencies(cn)

ax[0].imshow(wc0)
ax[0].axis('off')
ax[0].set_title('Selected Words\n',size=25)

ax[1].imshow(wc1)
ax[1].axis('off')
ax[1].set_title('Class Names\n',size=25)

plt.figure(figsize=(16,8))
rt = df['Review Text']
wordcloud = WordCloud(background_color='white').generate(" ".join(rt))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('All Words in the Reviews\n',size=25)
Selected Words
love 8951
great 6117
super 1726
happy 705
glad 614
dtype: int64

Class Names
Dresses 6319
Knits 4843
Blouses 3097
Sweaters 1428
Pants 1388
Name: Class Name, dtype: int64


Build a Sentiment Analyser

Since the dataset does not have a column which shows the sentiment as positive or negative, I defined a new Sentiment column. To do this, I treated reviews with a rating of 4 or higher as positive (True in the new dataframe) and those with a rating of 2 or lower as negative (False in the new dataframe). I did not include the rows with a neutral rating of 3.

In [7]:
# Rating of 4 or higher -> positive, while the ones with 
# Rating of 2 or lower -> negative
# Rating of 3 -> neutral
df = df[df['Rating'] != 3]
df['Sentiment'] = df['Rating'] >= 4
df.head()
   Review Text                                         Rating  Class Name  Age  Word Counts                                         Sentiment
0  Absolutely wonderful - silky and sexy and comf...   4      Intimates   33   {'absolutely': 1, 'and': 2, 'comfortable': 1, ...  True
1  Love this dress! it's sooo pretty. i happene...     5      Dresses     34   {'am': 1, 'and': 2, 'bc': 2, 'be': 1, 'below':...  True
3  I love, love, love this jumpsuit. it's fun, fl...   5      Pants       50   {'and': 1, 'but': 1, 'compliments': 1, 'every'...  True
4  This shirt is very flattering to all due to th...   5      Blouses     47   {'adjustable': 1, 'all': 1, 'and': 1, 'any': 1...  True
5  I love tracy reese dresses, but this one is no...   2      Dresses     49   {'0p': 1, 'alterations': 1, 'am': 1, 'and': 4,...  False
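The labelling rule above can be sanity-checked on a tiny frame with made-up ratings:

```python
import pandas as pd

# toy ratings standing in for the Rating column
toy = pd.DataFrame({'Rating': [5, 4, 3, 2, 1]})
toy = toy[toy['Rating'] != 3]          # drop the neutral reviews
toy['Sentiment'] = toy['Rating'] >= 4  # True = positive, False = negative
print(toy['Sentiment'].tolist())       # [True, True, False, False]
```

The rating-3 row disappears entirely, so the classifier is only ever trained on clearly positive or clearly negative reviews.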

First, I split the data into training and test sets. Afterwards, I fitted the models one by one. Since some of them take a long time, I think running each of them in a different cell is a better choice.

In [8]:
train_data,test_data = train_test_split(df,train_size=0.8,random_state=0)
X_train = vectorizer.fit_transform(train_data['Review Text'])
y_train = train_data['Sentiment']
X_test = vectorizer.transform(test_data['Review Text'])
y_test = test_data['Sentiment']
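One detail worth noting in the cell above: the vectorizer is fitted on the training text only, and the test text is merely transformed, so words never seen during training are silently dropped. A small sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# fit_transform learns the vocabulary from the training sentences
X_train = vec.fit_transform(["love this dress", "great fit"])
# transform reuses that vocabulary; unseen words are ignored
X_test = vec.transform(["love the colour"])

print(sorted(vec.vocabulary_))    # ['dress', 'fit', 'great', 'love', 'this']
print(X_test.toarray()[0].sum())  # 1  (only 'love' matched)
```

Fitting the vectorizer again on the test set would change the column meaning and silently break the trained models.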

Logistic Regression, Naive Bayes, SVM, Neural Network

In [10]:
start = dt.datetime.now()
lr = LogisticRegression().fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

start = dt.datetime.now()
nb = MultinomialNB().fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

start = dt.datetime.now()
svm = SVC().fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

start = dt.datetime.now()
nn = MLPClassifier().fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))
Elapsed time:  0:00:00.453760
Elapsed time: 0:00:00.004985
Elapsed time: 0:00:37.232989
Elapsed time: 0:05:13.507303

Evaluating Models

Adding Results to the Dataframe

First, I added the prediction results to my training data. If you want to observe the prediction probabilities instead of the class labels, you can call each model's predict_proba method in place of predict.

In [11]:
# define a dataframe for the predictions
df2 = train_data.copy()
df2['Logistic Regression'] = lr.predict(X_train)
df2['Naive Bayes'] = nb.predict(X_train)
df2['SVM'] = svm.predict(X_train)
df2['Neural Network'] = nn.predict(X_train)
df2.head()
       Review Text                                        Rating  Class Name  Age  Word Counts                                        Sentiment  Logistic Regression  Naive Bayes  SVM   Neural Network
19218  I love this dress's gentle blue lace. the silh...  5       Dresses     35   {'and': 1, 'as': 1, 'blue': 1, 'chest': 1, 'dr...  True       True                 True         True  True
3530   Beautiful choice...beautiful fit for my daught...  5       Knits       51   {'beautiful': 2, 'body': 1, 'choice': 1, 'daug...  True       True                 True         True  True
15663  If you are shaped anything like me, you will h...  4       Dresses     25   {'am': 1, 'and': 2, 'anything': 1, 'are': 1, '...  True       True                 True         True  True
21310  This top is so cute and of spectacular quality...  5       Blouses     33   {'10': 1, '34c': 1, 'all': 1, 'almost': 1, 'an...  True       True                 True         True  True
15154  First saw this poncho on a petite blog and aft...  5       Sweaters    56   {'after': 1, 'and': 5, 'below': 1, 'blog': 1, ...  True       True                 True         True  True

ROC Curves and AUC

In [13]:
pred_lr = lr.predict_proba(X_test)[:,1]
fpr_lr,tpr_lr,_ = roc_curve(y_test,pred_lr)
roc_auc_lr = auc(fpr_lr,tpr_lr)

pred_nb = nb.predict_proba(X_test)[:,1]
fpr_nb,tpr_nb,_ = roc_curve(y_test.values,pred_nb)
roc_auc_nb = auc(fpr_nb,tpr_nb)

pred_svm = svm.decision_function(X_test)
fpr_svm,tpr_svm,_ = roc_curve(y_test.values,pred_svm)
roc_auc_svm = auc(fpr_svm,tpr_svm)

pred_nn = nn.predict_proba(X_test)[:,1]
fpr_nn,tpr_nn,_ = roc_curve(y_test.values,pred_nn)
roc_auc_nn = auc(fpr_nn,tpr_nn)

f, axes = plt.subplots(2, 2,figsize=(15,10))
axes[0,0].plot(fpr_lr, tpr_lr, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_lr))
axes[0,0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0,0].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[0,0].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Logistic Regression')
axes[0,0].legend(loc='lower right', fontsize=13)

axes[0,1].plot(fpr_nb, tpr_nb, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_nb))
axes[0,1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0,1].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[0,1].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Naive Bayes')
axes[0,1].legend(loc='lower right', fontsize=13)

axes[1,0].plot(fpr_svm, tpr_svm, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_svm))
axes[1,0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1,0].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[1,0].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Support Vector Machine')
axes[1,0].legend(loc='lower right', fontsize=13)

axes[1,1].plot(fpr_nn, tpr_nn, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_nn))
axes[1,1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1,1].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[1,1].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Neural Network')
axes[1,1].legend(loc='lower right', fontsize=13);
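As a sanity check of the roc_curve/auc pair used above, here is a toy example with hand-made labels and scores. Three of the four positive-negative score pairs are ranked correctly, so the AUC is 0.75:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# toy labels and scores; a higher score should mean "more positive"
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, scores)
print(auc(fpr, tpr))  # 0.75
```

This is why the SVC, which has no predict_proba by default, can still be evaluated here: decision_function produces scores whose ranking is all that roc_curve needs.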

Confusion Matrix

In [14]:
# preparation for the confusion matrices
lr_cm  = confusion_matrix(y_test.values, lr.predict(X_test))
nb_cm  = confusion_matrix(y_test.values, nb.predict(X_test))
svm_cm = confusion_matrix(y_test.values, svm.predict(X_test))
nn_cm  = confusion_matrix(y_test.values, nn.predict(X_test))

plt.figure(figsize=(12,10))
plt.suptitle("Confusion Matrices",fontsize=24)

plt.subplot(2,2,1)
plt.title("Logistic Regression")
sns.heatmap(lr_cm, annot=True, cmap="Greens", cbar=False)

plt.subplot(2,2,2)
plt.title("Naive Bayes")
sns.heatmap(nb_cm, annot=True, cmap="Greens", cbar=False)

plt.subplot(2,2,3)
plt.title("Support Vector Machine (SVM)")
sns.heatmap(svm_cm, annot=True, cmap="Greens", cbar=False)

plt.subplot(2,2,4)
plt.title("Neural Network")
sns.heatmap(nn_cm, annot=True, cmap="Greens", cbar=False);


Looking at the model evaluation section, Naive Bayes and Logistic Regression give the best results, so both are very effective at predicting sentiment. On the other hand, Naive Bayes takes far less time to train; on a bigger dataset this difference would grow and become an important advantage.
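To close, here is a minimal end-to-end sketch of how a fitted vectorizer and classifier of this kind would score a brand-new review. The four training sentences and their labels are made up for illustration; the real pipeline trains on the full review dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical mini training set standing in for the review data
texts = ["love this dress great fit and wonderful fabric",
         "absolutely wonderful silky and comfortable",
         "terrible quality very disappointed",
         "awful fit regret buying hate it"]
labels = [True, True, False, False]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# score unseen reviews: transform with the SAME fitted vectorizer
print(clf.predict(vec.transform(["love the wonderful fabric"]))[0])    # True
print(clf.predict(vec.transform(["regret buying terrible quality"]))[0])  # False
```

The same two lines at the end are all that is needed to reuse the models trained above on fresh review text.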


Please comment below any questions or article requests.
Like the articles and Follow me to get notified when I post another article.

