If you’re new to machine learning, this is the most common model you will implement.

Linear Regression is the simplest and most widely used supervised machine learning algorithm for predictive analysis. It is a model for analyzing the relationship between input and output numerical variables that was established in the field of statistics and has been adopted by Machine Learning. It is therefore both a statistical and a machine learning algorithm.

Linear Regression linearly models the relationship between a dependent variable and one or more independent variables. We introduced two kinds of variables here: independent and dependent. A variable whose value is not affected by the other variables, and which is used to manipulate the dependent variable, is termed an independent variable. On the other hand, a variable whose value changes when the values of the other variables are manipulated is termed a dependent variable.

In this topic, we will consider Y as the dependent variable and X as an independent variable.

Understanding Simple Linear Regression

To model data using Linear regression, it is important to examine two factors. The first is to find the variables that are significant predictors of the outcome variable. The second is how well the regression line fits the data, so that predictions can be made with the highest possible accuracy.

Let’s consider one independent variable (x) versus one dependent variable (y). The simplest form of the linear regression equation is then written mathematically as:

y = m * x + c

Here, m is the slope of the line and c is the intercept (the value of y when x = 0).
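The equation can be sketched in plain Kotlin (no S2 required); the slope m = 2 and intercept c = 5 below are made-up values chosen purely for illustration:

```kotlin
// Hypothetical slope and intercept, chosen only for illustration
val m = 2.0 // slope: change in y per unit change in x
val c = 5.0 // intercept: value of y when x = 0

// y = m * x + c
fun predict(x: Double): Double = m * x + c

println(predict(0.0)) // prints 5.0, the intercept itself
println(predict(3.0)) // prints 11.0
```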

Practical example –

Businesses often use linear regression to understand the relationship between their advertising spend and their revenue. They may run a simple linear regression model with advertising spend as the independent variable and revenue as the dependent variable:

revenue = β0 + β1 * (Ad spend)

The coefficient β0 represents the expected total revenue when ad spending is zero.

The coefficient β1 represents the average change in total revenue when ad spending is increased by one unit.

  • If β1 is negative, it would mean that more ad spending is associated with less revenue.
  • If β1 is close to zero, it would mean that ad spending has little effect on revenue.
  • If β1 is positive, it would mean more ad spending is associated with more revenue.

Depending on the value of β1, a company may decide to either decrease or increase its ad spending. Let’s consider the data below and plot a graph.

Ad Spend (in $)    Revenue (in $)
            500              1200
            700              1500
           1100              2000
           1300              2070
           1800              2500

Plotting the Revenue vs. Ad Spend values above gives the following graph.

Regression Line

As you can infer from the figure above, no single straight line can pass through all the data points. Thus we have to draw the regression line in such a way that it gives us the minimum error. That is, we draw a candidate line and then measure its prediction error. Since we have the actual data points, we can easily compute the error in each prediction. Our ultimate goal is to find the line with the minimal total error. That line is called the line of best fit.
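A minimal plain-Kotlin sketch of this idea, using the Ad Spend / Revenue table above: the sse function measures the total squared error of any candidate line, and a line near the least-squares solution scores lower than a rough guess. Both candidate lines here are hand-picked for illustration, not fitted values.

```kotlin
// The (Ad Spend, Revenue) observations from the table above
val adSpend = listOf(500.0, 700.0, 1100.0, 1300.0, 1800.0)
val revenue = listOf(1200.0, 1500.0, 2000.0, 2070.0, 2500.0)

// Sum of squared errors of a candidate line y = m * x + c
fun sse(m: Double, c: Double): Double =
    adSpend.zip(revenue).sumOf { (x, y) -> val e = y - (m * x + c); e * e }

// Two hand-picked candidate lines (illustrative guesses, not fitted values)
println(sse(1.0, 700.0))   // a rough guess
println(sse(0.98, 793.0))  // close to the least-squares line, so its error is smaller
```

The line of best fit is the one that minimizes this sum of squared errors over all possible (m, c).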

Training Data

In supervised learning problems, each observation consists of an observed output variable and one or more observed input variables. The observations in the training set are the experience that the algorithm uses to learn. 

Testing Data

The test set is defined as a set of observations used to evaluate the performance of the model using some performance metric. It is important that no observations from the training set are included in the test set. If the test set does contain examples from the training set, it will be difficult to tell whether the algorithm has learned to generalize from the training set or has simply memorized it.
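As a sketch, a train/test split can be done in plain Kotlin with a seeded shuffle; the toy dataset and the 80/20 ratio below are illustrative assumptions, not part of the S2 workflow shown later.

```kotlin
// Toy dataset of (x, y) observations; a fixed seed keeps the split reproducible
val data = (1..10).map { it.toDouble() to it * 2.0 }

val shuffled = data.shuffled(kotlin.random.Random(42))
val trainSize = (data.size * 0.8).toInt() // 80% for training
val train = shuffled.take(trainSize)
val test = shuffled.drop(trainSize)       // the remaining 20% for testing

println("train = ${train.size}, test = ${test.size}") // prints train = 8, test = 2
```

Shuffling before splitting avoids any ordering bias in the data, and the seed makes the experiment repeatable.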

Ordinary Least Squares

We train the linear regression algorithm with a method named Ordinary Least Squares (OLS). It is a great starting point for any regression analysis. It creates a single regression equation to describe the variable or process you are attempting to understand; that is, it provides a global model of the variable you are trying to predict.
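For a single predictor, OLS has a closed form: β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β0 = ȳ − β1·x̄. The plain-Kotlin sketch below applies these formulas to the Ad Spend / Revenue table, and reproduces the same R² (≈ 0.9738) that the S2 run below reports:

```kotlin
// Ad-spend data from the table above
val x = listOf(500.0, 700.0, 1100.0, 1300.0, 1800.0)
val y = listOf(1200.0, 1500.0, 2000.0, 2070.0, 2500.0)

val xBar = x.average()
val yBar = y.average()

// beta1 = sum((x - xBar)(y - yBar)) / sum((x - xBar)^2); beta0 = yBar - beta1 * xBar
val beta1 = x.zip(y).sumOf { (xi, yi) -> (xi - xBar) * (yi - yBar) } /
            x.sumOf { xi -> (xi - xBar) * (xi - xBar) }
val beta0 = yBar - beta1 * xBar

// R^2 = 1 - SSE / SST, where SSE is the residual sum of squares
val sse = x.zip(y).sumOf { (xi, yi) -> val e = yi - (beta0 + beta1 * xi); e * e }
val sst = y.sumOf { yi -> (yi - yBar) * (yi - yBar) }
val r2 = 1.0 - sse / sst

// beta1 ≈ 0.9823, beta0 ≈ 793.17, R^2 ≈ 0.9738
println("beta0 = $beta0, beta1 = $beta1, R^2 = $r2")
```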

Implementation using S2/Kotlin

Now that the concepts are clear, it’s coding time! Load your S2/Kotlin IDE and start by loading your dataset. While constructing the regression equation, don’t forget to include all your independent variables.

Model Training:

%use s2

// Enter the dataset
val sg: DataFrame = dataFrameOf(
    "AdSpent", "Revenue")(
    500, 1200,
    700, 1500,
    1100, 2000,
    1300, 2070,
    1800, 2500,
)
println(sg)

// Construct a linear model problem
val problem = LMProblem(
    DenseVector(sg["Revenue"].asDoubles()),              // y, the dependent variable
    DenseMatrix(DenseVector(sg["AdSpent"].asDoubles())), // X, the independent variables
    true)

// Run OLS regression
val ols = OLSRegression(problem)

// The regression coefficients
val beta_hat = ols.beta().betaHat()

// Residuals are the differences between the actual and fitted values
val residuals = ols.residuals()

println("R^2 = ${residuals.R2()}")

Output:

A DataFrame: 5 x 2
    AdSpent   Revenue
1       500      1200
2       700      1500
3      1100      2000
4      1300      2070
5      1800      2500
R^2 = 0.9738135781876655

Model Testing:

In this part we test our model; for simplicity, we test against the same dataset we supplied for training. We are going to predict the value of Revenue for a given amount of AdSpent.

Add the following line of code at the end of the program written till now.

println("Expected value of y at x = 900 is ${ols.Ey(DenseVector(900.0))}")

Running this prints the expected revenue at an ad spend of 900, which is approximately 1677 given the fitted coefficients.

The goal of regression analysis (modeling) is to find values for the unknown parameters of the equation; here, to find the Revenue for a particular amount of AdSpent. Once we have trained and evaluated our model, we can improve it to make more accurate predictions. Generally, this step is done for large datasets with the help of Data Preprocessing and Feature Engineering.
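As a sanity check on the prediction, the same closed-form coefficients give the expected value at x = 900 directly. This plain-Kotlin sketch recomputes E[y | x = 900] = β0 + β1 · 900 from the table data:

```kotlin
// Refit the coefficients in closed form from the (Ad Spend, Revenue) table
val x = listOf(500.0, 700.0, 1100.0, 1300.0, 1800.0)
val y = listOf(1200.0, 1500.0, 2000.0, 2070.0, 2500.0)
val xBar = x.average()
val yBar = y.average()
val beta1 = x.zip(y).sumOf { (xi, yi) -> (xi - xBar) * (yi - yBar) } /
            x.sumOf { xi -> (xi - xBar) * (xi - xBar) }
val beta0 = yBar - beta1 * xBar

// Expected revenue at an ad spend of 900: beta0 + beta1 * 900
val prediction = beta0 + beta1 * 900.0
println("Expected revenue at ad spend 900 = $prediction") // ≈ 1677.19
```

This is the same number the ols.Ey(DenseVector(900.0)) call above should report, since both compute the fitted line at x = 900.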

Applications of Linear Regression –

  1. To determine the economic growth of a state or country over a period of time. It can also be used to predict the GDP of a country.
  2. To predict the future price of a product.
  3. In housing sales, to estimate the number of houses a builder will sell, and at what price, in the coming months or years.
  4. In sports, to predict scores: linear regression can help estimate the number of runs a player will score in upcoming matches based on their previous performance.