The term regression was first applied to statistics by the polymath Francis Galton. Galton is a major figure in the development of statistics and genetics.

Linear regression is one of the simplest machine learning algorithms. It models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

**Simple Linear Regression**

In simple linear regression we have a single input variable, and we need to find the line that best fits the data. The best-fit line is the one for which the total prediction error over all data points is as small as possible. The error for a point is the vertical distance between that point and the regression line.

**E.g.: Relationship between hours of study and marks obtained.**

**y(predict) = b0 + b1*x**

The error is the difference between the predicted and the actual value. To reduce the error and find the best-fit line, we have to find suitable values for b0 and b1.

**Error = (predict - actual)^2**

For the best-fit line, b0 and b1 must be the values that minimize the total error over all data points.
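As a minimal sketch of how b0 and b1 can be found, the snippet below uses the standard closed-form least-squares solution for a single input. The function names and the hours/marks values are hypothetical, made up purely for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_error(xs, ys, b0, b1):
    """Total of (predict - actual)^2 over all data points."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: hours of study vs. marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 54, 61, 63, 70]

b0, b1 = fit_simple_linear_regression(hours, marks)
best_error = sum_squared_error(hours, marks, b0, b1)
print(b0, b1, best_error)
```

Any other choice of b0 or b1 gives a larger total squared error on the same data, which is what "best fit" means here.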

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:

- Independence of observations: the observations in the dataset were collected using statistically valid sampling methods and there are no hidden relationships among observations.
- Homogeneity of variance: the size of the error in our prediction doesn't change significantly across the values of the independent variable.
- Normality: the prediction errors (residuals) follow a normal distribution.
- Linearity: the relationship between the independent and dependent variable is linear, so the line of best fit through the data points is a straight line.

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.
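To illustrate the rank-based alternative, here is a small sketch that computes Spearman's rank correlation coefficient as the Pearson correlation of the ranks, using pandas. The data are hypothetical, and the snippet computes only the coefficient, not a significance test.

```python
import pandas as pd

# Hypothetical data: y increases monotonically with x, but not linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Pearson correlation measures linear association.
pearson = x.corr(y)

# Spearman's rho is the Pearson correlation of the ranked values.
spearman = x.rank().corr(y.rank())

print(pearson, spearman)
```

Because the relationship is perfectly monotonic, the rank correlation reaches 1 even though the linear correlation does not.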

Let us now apply machine learning to train a model that predicts *Salary* from *Years of Experience*. We will follow these steps:

Step 1: Import Libraries and Dataset

Three Python libraries will be used in the code:

- pandas
- matplotlib
- sklearn

import pandas as pd

import matplotlib.pyplot as plt

data = pd.read_csv("https://raw.githubusercontent.com/codePerfectPlus/DataAnalysisWithJupyter/master/SalaryVsExperinceModel/Salary.csv")

X = data.iloc[:, :-1].values

y = data.iloc[:, -1].values

data.head()

Step 2: Split the Data

In this step, split the dataset into a training set, on which the linear regression model will be trained, and a test set, on which the trained model will be applied to visualize the results. Here test_size = 0.3 means that 30% of the data is held out as the test set and the remaining 70% is used as the training set.

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state=12)
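The effect of test_size and random_state can be checked on a small stand-in dataset. The sketch below assumes ten made-up rows rather than the real salary file, so the array contents are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the salary data: 10 rows, one feature column.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 30000.0 + 9000.0 * X.ravel()

# 30% of the rows go to the test set; random_state fixes the shuffle.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=12
)

# Repeating the call with the same random_state reproduces the same split.
train_X2, test_X2, _, _ = train_test_split(X, y, test_size=0.3, random_state=12)

print(train_X.shape, test_X.shape)
```

Fixing random_state is what makes the split repeatable between runs; without it, each run would shuffle the rows differently.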

Step 3: Fit and Predict

Import LinearRegression from sklearn.linear_model and assign an instance of it to the variable lr. lr.fit() learns from the training data, and lr.predict() makes predictions based on what was learned.

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(train_X, train_y)

predicted_y = lr.predict(test_X)
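After fitting, the learned b0 and b1 are available on the model as intercept_ and coef_. The sketch below uses hypothetical data that lies exactly on a known line, so the recovered values can be compared with the ones used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on salary = 30000 + 9000 * years.
years = np.arange(1, 11, dtype=float).reshape(-1, 1)
salary = 30000.0 + 9000.0 * years.ravel()

lr = LinearRegression()
lr.fit(years, salary)

b0 = lr.intercept_  # estimated intercept
b1 = lr.coef_[0]    # estimated slope for the single feature

# Predict for an unseen value; the input must be 2-D (rows x features).
predicted = lr.predict(np.array([[12.0]]))
print(b0, b1, predicted[0])
```

On real data the points will not sit exactly on a line, so the fitted values are estimates rather than exact recoveries.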

Step 4: Evaluate and Visualize

Compare the predicted and actual values, then plot the test points together with the fitted regression line.

plt.scatter(test_X, test_y, color = 'red')

plt.scatter(test_X, predicted_y, color = 'green')

plt.plot(train_X, lr.predict(train_X), color = 'black')

plt.title('Salary vs Experience (Result)')

plt.xlabel('YearsExperience')

plt.ylabel('Salary')

plt.show()
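Alongside the plot, the predicted and actual values can be placed side by side in a pandas DataFrame. This is a sketch with hypothetical salary figures; in the walkthrough above, test_y and predicted_y would take the place of the made-up lists.

```python
import pandas as pd

# Hypothetical actual and predicted salaries, made up for illustration.
actual = [39000.0, 56000.0, 83000.0, 101000.0]
predicted = [41000.0, 55000.0, 80000.0, 104000.0]

comparison = pd.DataFrame({"Actual": actual, "Predicted": predicted})
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]

# Two common summaries of the prediction error.
mean_abs_error = comparison["Error"].abs().mean()
rmse = (comparison["Error"] ** 2).mean() ** 0.5

print(comparison)
print(mean_abs_error, rmse)
```

Both summaries are in the same units as the salary, which makes them easier to interpret than the raw squared error.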
