In the 21st century, with the growing use of technology and the internet, the impact of fake news has become widespread. Fake news encapsulates pieces of news that may be hoaxes and is generally spread through social media, often with the intent of increasing the financial profits of the outlets publishing it.
Many people believe that spotting fake news is difficult, if not impossible, and don't bother trying to do anything about it. Twitter, however, has started flagging some posts as potentially misleading, using machine learning models similar in spirit to the one we will train today.
Dataset:
The dataset we will use for this is a CSV file named 'news.csv'. It has three columns, namely Title, Text, and Label, and contains 6,335 examples/rows.
Necessary Installations:
To install the libraries needed for this project:
First, open a Terminal/Command Prompt.
Run the following command (note that scikit-learn is the package that provides the sklearn module we import below):
pip3 install pandas scikit-learn
Installation should be complete.
To verify that the installation succeeded and/or to view details about the installed packages, run the following command:
pip3 show pandas scikit-learn
Training the model:
Make the necessary imports:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
Read the 'news.csv' dataset into a Pandas DataFrame.
news = pd.read_csv('news.csv')
Get the labels into a separate Series named 'labels'.
labels = news['Label']
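Before going further, it can help to take a quick look at the loaded data. A minimal sketch of some sanity checks (the exact label distribution will depend on your copy of the dataset):
print(news.shape)             # expected: (6335, 3)
print(news.head())            # first few rows: Title, Text, Label
print(labels.value_counts())  # how many FAKE vs REAL examples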
Now, we will split the dataset into training and test sets. This lets us measure the accuracy of our model on data it never saw during training, so we are not fooled by overfitting.
The 'random_state' argument sets the random seed, which makes the split (and therefore the results) reproducible.
X_train, X_test, y_train, y_test = train_test_split(news['Text'], labels, random_state=0)
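By default, train_test_split reserves 25% of the rows for the test set; you can pass an explicit test_size argument if you want a different ratio. A quick check of the resulting sizes (numbers are approximate, derived from the 6,335 rows above):
print(X_train.shape, X_test.shape)  # roughly (4751,) and (1584,)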
Transform the training and test data using TfidfVectorizer. TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
TF stands for Term Frequency, which is the number of times a word appears in a document. IDF stands for Inverse Document Frequency, which down-weights words that appear in many documents, so the combined TF-IDF score measures how significant a word is to a particular document within the corpus.
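As a toy illustration of what the vectorizer produces, here is a hypothetical three-document mini-corpus (not part of the news dataset); each document becomes a row of TF-IDF weights over the learned vocabulary:
toy_docs = ['the cat sat', 'the cat ran', 'dogs ran fast']
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
print(toy_vectorizer.get_feature_names_out())  # learned vocabulary (use get_feature_names() on older scikit-learn)
print(toy_matrix.toarray().round(2))           # one row of TF-IDF weights per document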
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_transformed = vectorizer.fit_transform(X_train)
X_test_transformed = vectorizer.transform(X_test)
Note that we fit the vectorizer on the training data only and then use it to transform both the training and test data. This prevents data leakage.
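You can inspect the transformed data to see what the vectorizer produced: both results are sparse matrices with one row per document and one column per term in the training vocabulary.
print(X_train_transformed.shape)  # (number of training documents, vocabulary size)
print(X_test_transformed.shape)   # same number of columns, learned from the training data only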
Finally, we will fit a PassiveAggressiveClassifier model on our training data.
clf = PassiveAggressiveClassifier()
clf.fit(X_train_transformed, y_train)
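As an aside, the vectorizer and classifier can also be bundled into a single scikit-learn Pipeline, which keeps the fit-on-train/transform-on-test discipline automatic. A minimal sketch of this alternative (equivalent in spirit to the step-by-step code above):
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.7)),
    ('clf', PassiveAggressiveClassifier()),
])
pipeline.fit(X_train, y_train)                   # raw text goes in; vectorization happens inside the pipeline
pipeline_predictions = pipeline.predict(X_test)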
Predicting and Testing our model:
We will predict the labels for our test dataset using the predict method.
predictions = clf.predict(X_test_transformed)
Now, we can test the accuracy of our model.
accuracy_score(y_test, predictions)
Our model gives approximately 93% accuracy.
We can also see exactly where our model went wrong using a confusion matrix.
confusion_matrix(y_test, predictions, labels=['FAKE','REAL'])
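For per-class precision and recall, scikit-learn's classification_report gives a more readable summary of the same predictions (the exact figures will depend on the train/test split):
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions, labels=['FAKE', 'REAL']))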