Advanced Project Analysis with Sentiment Classification using Bidirectional Recurrent Neural Network

Hello, Rishabh here. This time I bring to you the next entry in the series 'Simple Python Project'. These are simple projects that beginners can start with. The series covers beginner, intermediate and advanced Python, machine learning and, later, deep learning.

Comments suggesting other Python projects to try are very welcome.

Anyways, let's crack on with it!


Analysis on E-Commerce Reviews, with Sentiment Classification using Bidirectional Recurrent Neural Network (RNN)


Statistical Analysis on E-Commerce Reviews, with Sentiment Classification using Bidirectional Recurrent Neural Network (RNN)

Abstract

Understanding customer sentiment is of paramount importance in marketing strategies today. It gives companies insight into how customers perceive their products and services, and an idea of how to improve their offers. This paper attempts to understand the correlation between different variables in customer reviews on a women's clothing e-commerce site, and to classify each review according to whether it recommends the reviewed product and whether it expresses positive, negative, or neutral sentiment. To achieve these goals, we employed univariate and multivariate analyses on the dataset features other than review titles and review texts, and we implemented a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units for recommendation and sentiment classification. The results show that a recommendation is a strong indicator of a positive sentiment score, and vice versa. Ratings in product reviews, on the other hand, are fuzzy indicators of sentiment scores. We also found that the bidirectional LSTM reached an F1-score of 0.88 for recommendation classification and 0.93 for sentiment classification.
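
The F1-scores quoted in the abstract combine precision and recall into a single number. As a rough illustration only, the metric for a binary task could be computed as below. The labels are made up and the helper name `f1_score_binary` is hypothetical; neither comes from the original work.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return the F1-score of the `positive` class for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up labels, purely for illustration (1 = recommended, 0 = not recommended).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```

Because the harmonic mean is pulled towards the smaller of the two values, a classifier cannot reach a high F1-score by maximising only precision or only recall.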

In [1]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv')
In [3]:
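# Keep only the rows where none of these columns is missing.
for column in ["Division Name", "Department Name", "Class Name", "Review Text"]:
    df = df[df[column].notnull()]
# Drop the first column of the CSV.
df.drop(df.columns[0], inplace=True, axis=1)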
In [4]:
df.shape
Out[4]:
(22628, 10)
In [5]:
df['Label'] = 0
In [6]:
df.loc[df.Rating >= 3, ['Label']] = 1
In [7]:
df['Word Count'] = df['Review Text'].str.split().apply(len)
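
To make the two derived columns concrete, here is a small sketch that applies the same labelling rule and word count to a toy frame. The three reviews and the name `toy` are invented for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Toy reviews, invented for illustration; not rows from the real dataset.
toy = pd.DataFrame({
    "Review Text": ["Love it", "Runs small and the fabric is itchy", "It is fine"],
    "Rating": [5, 2, 3],
})

# Same rule as in the cells above: ratings of 3 and higher get Label 1, the rest stay 0.
toy["Label"] = 0
toy.loc[toy["Rating"] >= 3, "Label"] = 1

# Whitespace tokenisation: split on runs of whitespace and count the pieces.
toy["Word Count"] = toy["Review Text"].str.split().apply(len)

print(toy[["Rating", "Label", "Word Count"]])
```

Note that a rating of 3 falls on the positive side of this threshold, which is worth keeping in mind when reading the Label distribution later on.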
In [8]:
df.sample(5)
Out[8]:
       Clothing ID  Age  Title                     Review Text                                        Rating  Recommended IND  Positive Feedback Count  Division Name   Department Name  Class Name  Label  Word Count
21864  1104         43   Not great                 If you're tall, this dress sits up too high-wa...  3       0                18                       General Petite  Dresses          Dresses     1      69
19630  479          23   Absolutely love           I just bought these shorts in the gold color. ...  5       1                1                        General         Bottoms          Shorts      1      30
2783   864          36   Pretty and classic        This blouse is pretty and drapes wonderfully. ...  5       1                0                        General Petite  Tops             Knits       1      88
22314  862          32   Comfy and flattering      I love this shirt so much that i bought it in ...  5       1                0                        General Petite  Tops             Knits       1      88
4756   1080         53   Beautiful and flattering  This dress is wonderful. wide enough straps fo...  5       1                0                        General         Dresses          Dresses     1      77
In [9]:
df.describe().T.drop('count', axis=1)
Out[9]:
                               mean         std   min    25%    50%     75%     max
Clothing ID              919.695908  201.683804   1.0  861.0  936.0  1078.0  1205.0
Age                       43.282880   12.328176  18.0   34.0   41.0    52.0    99.0
Rating                     4.183092    1.115911   1.0    4.0    5.0     5.0     5.0
Recommended IND            0.818764    0.385222   0.0    1.0    1.0     1.0     1.0
Positive Feedback Count    2.631784    5.787520   0.0    0.0    1.0     3.0   122.0
Label                      0.895263    0.306222   0.0    1.0    1.0     1.0     1.0
Word Count                60.211950   28.533053   2.0   36.0   59.0    88.0   115.0
In [10]:
df[['Title', 'Division Name', 'Department Name', 'Class Name']].describe(include=['O']).T.drop('count', axis=1)
Out[10]:
                 unique       top   freq
Title             13983  Love it!    136
Division Name         3   General  13365
Department Name       6      Tops  10048
Class Name           20   Dresses   6145

Univariate Distributions

Age and Positive Feedback Frequency Distributions

In [11]:
f, ax = plt.subplots(1, 3, figsize=(16, 4), sharey=False)
sns.distplot(df.Age, ax=ax[0])
ax[0].set_title('Age Distribution')
ax[0].set_ylabel('Density')
sns.distplot(df['Positive Feedback Count'], ax=ax[1])
ax[1].set_title('Positive Feedback Count Distribution')
sns.distplot(np.log10((df['Positive Feedback Count'][df['Positive Feedback Count'].notnull()] + 1)), ax=ax[2])
ax[2].set_title('Positive Feedback Count Distribution\n[Log 10]')
ax[2].set_xlabel('Log Positive Feedback Count')
plt.rcParams.update({'font.size': 12})
plt.tight_layout()
plt.savefig('age-and-positive-feedback-freqdist.png', format='png', dpi=600)
plt.show()
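
The third panel above plots log10(count + 1) rather than the raw count. A minimal sketch of why that helps with a right-skewed variable follows; the numbers are made up, apart from 122, which is the maximum Positive Feedback Count reported in the summary table.

```python
import numpy as np

# Made-up, heavily right-skewed counts (many zeros, a few large values).
counts = np.array([0, 0, 0, 1, 1, 2, 3, 5, 20, 122])

# Adding 1 before taking the log keeps zero counts defined: log10(0 + 1) = 0.
log_counts = np.log10(counts + 1)

# Compare the spread of the values on the raw scale and on the log scale.
raw_range = counts.max() - counts.min()
log_range = log_counts.max() - log_counts.min()
print(raw_range, round(float(log_range), 3))
```

On the log scale the few very large counts no longer stretch the axis, so the bulk of the distribution near zero becomes visible.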

Division Name and Department Name Distributions

In [12]:
row_plots = ['Division Name', 'Department Name']
f, axes = plt.subplots(1, len(row_plots), figsize=(14, 4), sharex=False)

for i, x in enumerate(row_plots):
    # Horizontal bars, ordered from the most to the least frequent category.
    sns.countplot(y=x, data=df, order=df[x].value_counts().index, ax=axes[i])
    axes[i].set_title('Count of Categories in {}'.format(x))
    axes[i].set_xlabel('Frequency Count')
axes[0].set_ylabel('Category')
axes[1].set_ylabel('')
plt.savefig('divname-and-deptname-freqdist.png', format='png', dpi=600)
plt.show()

Clothing ID Frequency Distribution

In [13]:
# Clothing ID Category
f, axes = plt.subplots(1, 2, figsize=[16, 7])
num = 30
sns.countplot(y='Clothing ID', data=df[df['Clothing ID'].isin(df['Clothing ID'].value_counts()[:num].index)],
              order=df['Clothing ID'].value_counts()[:num].index, ax=axes[0])
axes[0].set_title('Frequency Count of Clothing ID\nTop 30')
axes[0].set_xlabel('Count')

sns.countplot(y='Clothing ID', data=df[df['Clothing ID'].isin(df['Clothing ID'].value_counts()[num:60].index)],
              order=df['Clothing ID'].value_counts()[num:60].index, ax=axes[1])
axes[1].set_title('Frequency Count of Clothing ID\nTop 30 to 60')
axes[1].set_ylabel('')
axes[1].set_xlabel('Count')
plt.savefig('freqdist-clothingid-top60.png', format='png', dpi=600)
plt.show()

print('Dataframe Dimension: {} Rows'.format(df.shape[0]))
df[df['Clothing ID'].isin([1078, 862,1094])].describe().T.drop('count',axis=1)
Dataframe Dimension: 22628 Rows
Out[13]:
                              mean         std    min    25%     50%     75%     max
Clothing ID              1015.4848  103.396022  862.0  862.0  1078.0  1094.0  1094.0
Age                        42.7248   12.150429   18.0   34.0    41.0    51.0    99.0
Rating                      4.1892    1.104306    1.0    4.0     5.0     5.0     5.0
Recommended IND             0.8184    0.385592    0.0    1.0     1.0     1.0     1.0
Positive Feedback Count     2.8624    6.773021    0.0    0.0     1.0     3.0    98.0
Label                       0.9024    0.296832    0.0    1.0     1.0     1.0     1.0
Word Count                 60.5912   28.731486    2.0   36.0    60.0    89.0   115.0
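
The Clothing ID plots rely on `value_counts()` returning IDs sorted by frequency, then slicing that ranking into groups. A small sketch of the same idea on invented IDs is shown below. It uses `.iloc` to make the positional slicing explicit, which is a choice made for this example and not what the cell above does.

```python
import pandas as pd

# Invented item IDs; the counts are chosen so that the ranking is unambiguous.
ids = pd.Series([7, 7, 7, 7, 3, 3, 3, 9, 9, 5])

counts = ids.value_counts()          # sorted from most to least frequent
top_2 = counts.iloc[:2].index        # the two most frequent IDs
next_2 = counts.iloc[2:4].index      # ranks 3 and 4

# Keep only the rows whose ID is in the top group, as done before plotting.
subset = ids[ids.isin(top_2)]
print(list(top_2), list(next_2), len(subset))
```

Passing the sliced index as `order=` to `sns.countplot` is what keeps the bars in ranked order instead of sorted by ID.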
In [14]:
# Class Name
plt.subplots(figsize=(12, 8))
sns.countplot(y='Class Name', data=df,order=df['Class Name'].value_counts().index)
plt.title('Frequency Distribution of Class Name')
plt.xlabel('Frequency')
plt.tight_layout()
plt.savefig('freqdist-classname.png', format='png', dpi=300)
plt.show()

Frequency Distribution of Rating, Recommended IND, and Label

In [15]:
cat_dtypes = ['Rating', 'Recommended IND', 'Label']
increment = 0
f, axes = plt.subplots(1, len(cat_dtypes), figsize=(16, 6), sharex=False)

for i in range(len(cat_dtypes)):
    # One count plot per column, bars ordered by frequency.
    sns.countplot(x=cat_dtypes[increment], data=df, order=df[cat_dtypes[increment]].value_counts().index, ax=axes[i])
    axes[i].set_title('Frequency Distribution for\n{}'.format(cat_dtypes[increment]))
    axes[i].set_ylabel('Occurrence')
    axes[i].set_xlabel('{}'.format(cat_dtypes[increment]))
    increment += 1
axes[1].set_ylabel('')
axes[2].set_ylabel('')
plt.savefig('freqdist-rating-recommended-label.png', format='png', dpi=300)
plt.show()
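
The count plots show raw frequencies. Proportions can make the class imbalance easier to read, and for a 0/1 column the mean is simply the share of ones (which is why the Label mean of about 0.895 in the summary table means roughly 89.5% of reviews are labelled 1). A minimal sketch on invented labels:

```python
import pandas as pd

# Invented 0/1 labels, for illustration only.
labels = pd.Series([1, 1, 1, 1, 0, 1, 1, 1, 0, 1], name="Label")

# Share of each class instead of raw counts.
shares = labels.value_counts(normalize=True)

# For a binary column, the mean equals the share of ones.
share_of_ones = labels.mean()
print(shares)
print(share_of_ones)
```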

Word Count by Rating, Department Name, and Recommended IND

In [16]:
f, axes = plt.subplots(1, 4, figsize=(30, 5), sharex=False)

for index, y in enumerate(['Rating', 'Department Name', 'Recommended IND']):
    # One density curve per category value of y.
    for x in set(df[y][df[y].notnull()]):
        sns.kdeplot(df['Word Count'][df[y]==x], label=x, shade=False, ax=axes[index])
    axes[index].set_title('{} Distribution (X)\nby {}'.format('Word Count', y))
    axes[index].set_ylabel('Occurrence Density')
    axes[index].set_xlabel('')

# Plot 4
axes[3].set_title('Word Count Distribution (X)\n')
sns.kdeplot(df['Word Count'],shade=True,ax=axes[3])
axes[3].set_xlabel("")
# Remove the legend only if one was created.
if axes[3].get_legend() is not None:
    axes[3].get_legend().remove()
plt.savefig('wordcountdist-rating-deptname-recommended.png', format='png', dpi=300)
plt.show()

print("\nTotal Word Count is: {}".format(df["Word Count"].sum()))
df['Word Count'].describe().T
Total Word Count is: 1362476
Out[16]:
count    22628.000000
mean        60.211950
std         28.533053
min          2.000000
25%         36.000000
50%         59.000000
75%         88.000000
max        115.000000
Name: Word Count, dtype: float64

Multivariate Distributions

Categorical Variable by Categorical Variable

Division Name by Department Name

In [17]:
f, ax = plt.subplots(1, 2, figsize=(16, 4), sharey=True)
sns.heatmap(pd.crosstab(df['Division Name'], df['Department Name']),
            annot=True, linewidths=.5, ax=ax[0], fmt='g', cmap='Purples',
            cbar_kws={'label': 'Count'})
ax[0].set_title('Division Name Count by Department Name - Crosstab\nHeatmap Overall Count Distribution')

sns.heatmap(pd.crosstab(df['Division Name'], df['Department Name'], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1],fmt='g', cmap='Purples',
            cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Division Name Count by Department Name - Crosstab\nHeatmap Overall Percentage Distribution')
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('divname-deptname.png', format='png', dpi=300)
plt.show()
In [18]:
f, ax = plt.subplots(1, 2, figsize=(16, 4), sharey=True)
sns.heatmap(pd.crosstab(df['Division Name'], df['Department Name'], normalize='columns').mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[0], fmt='g', cmap='Greens',
            cbar_kws={'label': 'Percentage %'})
ax[0].set_title('Division Name Count by Department Name - Crosstab\nHeatmap % Distribution by Columns')

sns.heatmap(pd.crosstab(df['Division Name'], df['Department Name'], normalize='index').mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1],fmt='g', cmap='Greens',
            cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Division Name Count by Department Name - Crosstab\nHeatmap % Distribution by Index')
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('divname-deptname-pivot.png', format='png', dpi=300)
plt.show()
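
The heatmaps above pass three different `normalize` settings to `pd.crosstab`. The sketch below shows what each one divides by, using a four-row toy frame; the frame and its column names are invented for this example.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Division": ["General", "General", "General", "Petite"],
    "Department": ["Tops", "Tops", "Dresses", "Tops"],
})

# normalize=True: every cell is divided by the grand total, so all cells sum to 1.
overall = pd.crosstab(toy["Division"], toy["Department"], normalize=True)

# normalize='columns': every cell is divided by its column total, so each column sums to 1.
by_col = pd.crosstab(toy["Division"], toy["Department"], normalize="columns")

# normalize='index': every cell is divided by its row total, so each row sums to 1.
by_row = pd.crosstab(toy["Division"], toy["Department"], normalize="index")

print(overall, by_col, by_row, sep="\n\n")
```

Read the column-normalised table as "within this department, how are the divisions split?" and the index-normalised table as "within this division, how are the departments split?".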

Class Name by Department Name

In [19]:
f, ax = plt.subplots(1, 2, figsize=(16, 9), sharey=True)
fsize = 13
sns.heatmap(pd.crosstab(df['Class Name'], df['Department Name']),
            annot=True, linewidths=.5, ax=ax[0], fmt='g', cmap='Reds',
            cbar_kws={'label': 'Count'})
ax[0].set_title('Class Name Count by Department Name - Crosstab\nHeatmap Overall Count Distribution')

sns.heatmap(pd.crosstab(df['Class Name'], df['Department Name'], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1],fmt='g', cmap='Reds',
            cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Class Name Count by Department Name - Crosstab\nHeatmap Overall Percentage Distribution')
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('classname-deptname.png', format='png', dpi=300)
plt.show()
In [20]:
f, ax = plt.subplots(1, 2, figsize=(16, 9), sharey=True)
fsize = 13
sns.heatmap(pd.crosstab(df['Class Name'], df['Department Name'], normalize = 'columns').mul(100).round(0),
            annot=True, fmt='g', linewidths=.5, ax=ax[0],cbar=False, cmap='Blues')
ax[0].set_title('Class Name Count by Department Name - Crosstab\nHeatmap % Distribution by Column', fontsize=fsize)
ax[1] = sns.heatmap(pd.crosstab(df['Class Name'], df['Department Name'], normalize = 'index').mul(100).round(0),
                    annot=True, fmt='2g', linewidths=.5, ax=ax[1],cmap='Blues',
                    cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Class Name Count by Department Name - Crosstab\nHeatmap % Distribution by Index', fontsize=fsize)
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('classname-deptname-pivot.png', format='png', dpi=300)
plt.show()

Class Name by Division Name

In [21]:
f, ax = plt.subplots(1, 2, figsize=(16, 9), sharey=True)
fsize = 13
sns.heatmap(pd.crosstab(df['Class Name'], df['Division Name']),
            annot=True, linewidths=.5, ax=ax[0], fmt='g', cmap='Oranges',
            cbar_kws={'label': 'Count'})
ax[0].set_title('Class Name Count by Division Name - Crosstab\nHeatmap Overall Count Distribution')

sns.heatmap(pd.crosstab(df['Class Name'], df['Division Name'], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1], fmt='g', cmap='Oranges',
            cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Class Name Count by Division Name - Crosstab\nHeatmap Overall Percentage Distribution')
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('classname-divname.png', format='png', dpi=300)
plt.show()

# Heatmaps of Percentage Pivot Table
f, ax = plt.subplots(1, 2, figsize=(16, 9), sharey=True)
fsize = 13
sns.heatmap(pd.crosstab(df['Class Name'], df['Division Name'], normalize = 'columns').mul(100).round(0),
            annot=True, fmt='g', linewidths=.5, ax=ax[0],cbar=False,cmap='Purples')
ax[0].set_title('Class Name Count by Division Name - Crosstab\nHeatmap % Distribution by Column', fontsize = fsize)
ax[1] = sns.heatmap(pd.crosstab(df['Class Name'], df['Division Name'], normalize = 'index').mul(100).round(0),
                    annot=True, fmt='2g', linewidths=.5, ax=ax[1], cmap='Purples',
                    cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Class Name Count by Division Name - Crosstab\nHeatmap % Distribution by Index', fontsize=fsize)
ax[1].set_ylabel('')
plt.tight_layout(pad=0)
plt.savefig('classname-divname-pivot.png', format='png', dpi=300)