The term regression was first applied to statistics by the polymath Francis Galton. Galton is a major figure in the development of statistics and genetics.
Linear regression is one of the simplest machine learning algorithms. It models the relationship between two variables by fitting a linear equation to observed data. One variable is treated as the explanatory (independent) variable, and the other as the dependent variable.
Simple Linear Regression
In simple linear regression we have a single input, and the goal is to find the line that best fits the data. The best-fit line is the one for which the total prediction error over all data points is as small as possible. The error for a point is the vertical distance between the point and the regression line.
E.g.: the relationship between hours of study and marks obtained.
The error is the difference between the predicted and the actual value. The line has the form y = b0 + b1 * x, where b0 is the intercept and b1 is the slope. To find the best-fit line, we have to find the values of b0 and b1 that minimize the total error, usually measured as the sum of squared errors.
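To make the role of b0 and b1 concrete, here is a minimal sketch of the closed-form least-squares estimates for a single input. The hours and marks values are made up for illustration, and the helper name `fit_line` is hypothetical. It assumes the error being minimized is the sum of squared errors, which is the usual choice for linear regression.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: hours of study vs marks obtained
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

b0, b1 = fit_line(hours, marks)
# Errors (residuals): actual value minus predicted value for each point
errors = [y - (b0 + b1 * x) for x, y in zip(hours, marks)]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to (approximately) zero, which is a quick sanity check on the result.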
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:
- Independence of observations: the observations in the dataset were collected using statistically valid sampling methods and there are no hidden relationships among observations.
- Homogeneity of variance: the size of the error in our prediction doesn't change significantly across the values of the independent variable.
- Normality: the data follows a normal distribution (more precisely, the prediction errors are assumed to be approximately normally distributed).
- Linearity: the relationship between the independent and dependent variable is linear, so the line of best fit through the data points is a straight line.
If your data do not meet the assumptions of homoscedasticity (homogeneity of variance) or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.
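Since the Spearman rank test is mentioned as a fallback, here is a minimal sketch of computing a Spearman rank correlation with `scipy.stats.spearmanr`. The data are invented for illustration, and SciPy is an extra dependency that the rest of this article does not use.

```python
from scipy.stats import spearmanr

# Hypothetical data: a monotonic but clearly non-linear relationship
experience = [1, 2, 3, 4, 5, 6]
salary = [30, 31, 35, 48, 80, 150]

# Spearman works on ranks, so it measures monotonic association
# without assuming linearity or normally distributed errors
rho, p_value = spearmanr(experience, salary)
print(f"rho = {rho:.3f}, p-value = {p_value:.4f}")
```

Because salary increases at every step here, the ranks agree perfectly even though a straight line would fit the raw values poorly.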
Let us now apply machine learning to predict Salary from Years of Experience by training a model on a dataset. We will follow these steps:
Step 1:- Importing Libraries and Datasets
Three Python libraries will be used in the code: pandas, matplotlib, and scikit-learn.
## Importing the necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
## Importing the dataset
data = pd.read_csv("https://raw.githubusercontent.com/codePerfectPlus/DataAnalysisWithJupyter/master/SalaryVsExperinceModel/Salary.csv")
# All columns except the last are the input (years of experience)
X = data.iloc[:, :-1].values
# The last column is the target (salary)
y = data.iloc[:, -1].values
Step 2:- Split the Data
In this step, split the dataset into the Training set, on which the Linear Regression model will be trained, and the Test set, on which the trained model will be applied to visualize the results. Here test_size=0.3 means that 30% of the data will be kept as the Test set and the remaining 70% will be used as the Training set.
## Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state=12)
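To see what test_size does in practice, the following sketch runs the same split on a small made-up array instead of the salary file. The data here are a hypothetical stand-in with 10 rows and one feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the salary data: 10 rows, one feature column
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=12
)

# With 10 rows and test_size=0.3, 3 rows go to the test set and 7 to training
print(train_X.shape, test_X.shape)
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in each set on every run.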
Step 3:- Fit and Predict
Import LinearRegression from sklearn.linear_model and assign an instance of it to the variable lr. lr.fit() is used to learn from the training data, and lr.predict() makes predictions based on what was learned.
## Fit and Predict Model
from sklearn.linear_model import LinearRegression
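lr = LinearRegression()
# Learn the intercept and slope from the training data
lr.fit(train_X, train_y)
# Predict salaries for the unseen test inputs
predicted_y = lr.predict(test_X)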
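After fitting, the learned line can be inspected directly: scikit-learn stores the intercept in `intercept_` and the slope(s) in `coef_`. The sketch below uses made-up, noise-free data generated from y = 1 + 2 * x, so the recovered values are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from y = 1 + 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 1.0 + 2.0 * X.ravel()

lr = LinearRegression()
lr.fit(X, y)

b0 = lr.intercept_   # estimated intercept
b1 = lr.coef_[0]     # estimated slope for the single input
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# predict() expects a 2D array of shape (n_samples, n_features)
prediction = lr.predict(np.array([[5.0]]))
print(prediction)
```

On the real salary data the same two attributes give the base salary and the estimated increase per year of experience.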
Step 4:- Evaluate and Visualize
Create a pandas DataFrame of the predicted and actual values, and visualize the results.
## Visualising the Test set results
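# Actual test salaries (red) and predicted test salaries (green)
plt.scatter(test_X, test_y, color = 'red')
plt.scatter(test_X, predicted_y, color = 'green')
# Fitted regression line
plt.plot(train_X, lr.predict(train_X), color = 'black')
plt.title('Salary vs Experience (Result)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()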
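The comparison table of actual and predicted values can be sketched as follows. This example is self-contained and uses randomly generated stand-in data rather than the salary file, so the numbers it prints are illustrative only. The two metrics (mean absolute error and R^2) are one reasonable choice for evaluation, not something prescribed by the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: salary grows roughly linearly with experience
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30).reshape(-1, 1)
y = 25000 + 9000 * X.ravel() + rng.normal(0, 2000, size=30)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=12
)
lr = LinearRegression().fit(train_X, train_y)
predicted_y = lr.predict(test_X)

# Side-by-side table of actual and predicted values for the test set
results = pd.DataFrame({"Actual": test_y, "Predicted": predicted_y})
results["Error"] = results["Actual"] - results["Predicted"]
print(results.head())

# Summary metrics on the test set
mae = mean_absolute_error(test_y, predicted_y)
r2 = r2_score(test_y, predicted_y)
print("MAE:", mae)
print("R^2:", r2)
```

An R^2 close to 1 and a small mean absolute error relative to the salary scale both suggest that the straight line describes the test data well.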
Now let us see how we can plot our linear regression curve.