Most of us have plotted line graphs on a coordinate plane with an X-axis and a Y-axis. The x variable is the independent variable, usually called the predictor variable, and y is the dependent variable, called the criterion variable. Regression analysis is used to model the relationship between these two variables so that one can be predicted from the other.

For instance, suppose you know that a house of 2,200 square feet costs $50,000, and you wish to buy a 3,500 square foot house. How much will it cost you? To find the price, you need a relationship between the price and the area of the house. Linear regression helps you do this. Let’s find out!

Simple Linear Regression

Simple linear regression is the simplest approach to statistical learning. It is part of bivariate statistics, where bivariate means two variables. In linear regression, the relationship between the two variables is special: the value of one variable is a function of the other variable.

\(y = f(x)\)

The first equation which crosses our minds after reading this is the equation of the line.

**The slope-intercept form of a line:**

\(y=m*x+b\)

In this formula, x is the independent variable and y is the dependent variable, and there are two real parameters, m and b. Here m is the slope of the line and b is the y-intercept, that is, the value of y where the line crosses the Y-axis (x = 0).

Now, let’s take an example:

\(y=2*x+3\)

We can learn a lot about this equation by superimposing this equation on the slope-intercept form of the line equation.

The slope of the line = 2

Y-intercept = 2(0)+3 = 3
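As a small illustration, the slope-intercept form can be written as a Kotlin function. This is only a sketch; the helper name `line` is our own and is not part of any library.

```kotlin
// Slope-intercept form: returns the function y = m * x + b
fun line(m: Double, b: Double): (Double) -> Double = { x -> m * x + b }

fun main() {
    val f = line(m = 2.0, b = 3.0) // the example y = 2x + 3
    for (x in listOf(0.0, 1.0, 2.0, 3.0)) {
        println("x = $x -> y = ${f(x)}")
    }
    // y-intercept: value at x = 0; slope: change in y for a unit step in x
    println("y-intercept = ${f(0.0)}, slope = ${f(1.0) - f(0.0)}")
}
```

Reading the intercept as `f(0)` and the slope as `f(1) - f(0)` mirrors the comparison with the general form made above.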

On plotting, it gives a straight line-

We have looked at the algebraic formula of a line; now we will examine its connection with linear regression.

**Linear Regression Equation:**

Consider the following dataset in Table-1, which has prices of houses based on the area of each house.

Area (in square feet) | Price (in thousand $) |
---|---|
1500 | 340 |
1750 | 390 |
2600 | 550 |
3000 | 565 |
3200 | 610 |
3600 | 680 |
4000 | 725 |
2000 | 490 |

Now, with the help of the data provided in the table, we have to predict the price of the house that has an area of 3,500 square feet.

**Step – 1:** We will plot a scatter graph using the available data points.

**Note:** The order in which data points are plotted doesn’t matter. We can graph the points in any order. This just happens to be the final graph we ended up with.

**Step – 2:** Look for a rough visual line.

Warning: If you try and find a linear regression equation through an automated program like Excel, you will find a solution, but it does not necessarily mean the equation is a good fit for your data.

As you can see, there are several lines passing through the data points. We don’t know that any one of these lines here is the actual regression line. So which line is it? To answer this question, let’s get to the next step.

**Step – 3:** In this step, we use descriptive statistics to find the ‘best-fit’ line. We will start by finding the mean of each variable.

The formula for finding the mean:

\(\text{Mean} = \frac{\text{Sum of observations}}{\text{Total number of observations}}\)

Mean for Area = xm = 2706.25 square feet

Mean for House prices = ym = 543.75 thousand $

Plot this point, (xm,ym) in the graph. This point is termed the centroid.

**Note:** The best-fit regression line must pass through the centroid, which comprises the mean of x variable and mean of y variable.
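The centroid of Table-1 can be checked with a few lines of Kotlin. This is a minimal sketch; the names `areas`, `prices` and `centroid` are our own.

```kotlin
// Table-1 data: area in square feet, price in thousand $
val areas = listOf(1500.0, 1750.0, 2600.0, 3000.0, 3200.0, 3600.0, 4000.0, 2000.0)
val prices = listOf(340.0, 390.0, 550.0, 565.0, 610.0, 680.0, 725.0, 490.0)

// Centroid = (mean of x, mean of y)
fun centroid(xs: List<Double>, ys: List<Double>): Pair<Double, Double> =
    Pair(xs.average(), ys.average())

fun main() {
    val (xm, ym) = centroid(areas, prices)
    println("centroid = ($xm, $ym)") // prints "centroid = (2706.25, 543.75)"
}
```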

**Step – 4:** To make the best-fit line we need two points. The centroid gives you one point to work with; the slope and y-intercept computed below supply the rest. Let’s move ahead with the calculations.

Mathematically, the regression line is a straight line with two adjustable parameters. Recall the general equation of a line that we saw earlier, the slope-intercept form.

\(y = mx+b\)

where m is the slope of the line and b is the y-intercept.

Formula for finding the slope:

\( m=\frac{\displaystyle\sum_{i=1}^{n}(x - xm)(y - ym)}{\displaystyle\sum_{i=1}^{n}(x - xm)^2}\)

To find the y-intercept b, we take ym, the mean of the dependent variable, and subtract the slope times xm, the mean of the independent variable.

\(b=ym-m*(xm)\)

We already calculated values of xm and ym earlier in Step – 3.

**Calculations:** To form our linear equation, every data point must be included when calculating the slope. So let’s re-create Table-1 with the necessary columns added.

Area (sq. feet) \(x\) | Price (thousand $) \(y\) | Area Deviation \( (x - xm)\) | Price Deviation \( (y - ym)\) | Deviation Product \( (x - xm)(y - ym)\) | Area Deviation squared \( (x - xm)^2\) |
---|---|---|---|---|---|
1500 | 340 | -1206.25 | -203.75 | 245773.4375 | 1455039.0625 |
1750 | 390 | -956.25 | -153.75 | 147023.4375 | 914414.0625 |
2000 | 490 | -706.25 | -53.75 | 37960.9375 | 498789.0625 |
2600 | 550 | -106.25 | 6.25 | -664.0625 | 11289.0625 |
3000 | 565 | 293.75 | 21.25 | 6242.1875 | 86289.0625 |
3200 | 610 | 493.75 | 66.25 | 32710.9375 | 243789.0625 |
3600 | 680 | 893.75 | 136.25 | 121773.4375 | 798789.0625 |
4000 | 725 | 1293.75 | 181.25 | 234492.1875 | 1673789.0625 |

**Note:** Regression is very sensitive to rounding. Thus, it is best to take your calculations to four decimal places.

From the table, we get:

Sum of deviation products = \( \displaystyle\sum_{i=1}^n(x - xm)(y - ym)\) = 825312.5

Sum of squared area deviations = \( \displaystyle\sum_{i=1}^n(x - xm)^2\) = 5682187.5

From these two sums,

Slope = m = \( \frac{\displaystyle\sum_{i=1}^{n}(x - xm)(y - ym)}{\displaystyle\sum_{i=1}^{n}(x - xm)^2}\) = \( \frac{825312.5}{5682187.5}\) = 0.14526

We have now found xm, ym and m, and thus the y-intercept is

\(b=ym-m*(xm)\) = \( 543.75 - 0.14526 * 2706.25\) = 150.64

Now, as our final step, we assemble the slope and the y-intercept into the following regression equation.

\(y=0.14526*x+150.64\)

This equation is our answer.
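The hand calculation above can be cross-checked in code. The sketch below applies the same slope and intercept formulas to Table-1; `Line` and `fitLeastSquares` are hypothetical names of our own. Because it does not round intermediate values, it gives a slope of roughly 0.14525 and an intercept near 150.68, so small differences from the hand-rounded figures are expected.

```kotlin
data class Line(val slope: Double, val intercept: Double)

// Least squares fit: slope = sum((x - xm)(y - ym)) / sum((x - xm)^2), intercept = ym - slope * xm
fun fitLeastSquares(xs: List<Double>, ys: List<Double>): Line {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two paired points" }
    val xm = xs.average()
    val ym = ys.average()
    val num = xs.zip(ys).sumOf { (x, y) -> (x - xm) * (y - ym) }
    val den = xs.sumOf { (it - xm) * (it - xm) }
    val m = num / den
    return Line(m, ym - m * xm)
}

fun main() {
    val areas = listOf(1500.0, 1750.0, 2600.0, 3000.0, 3200.0, 3600.0, 4000.0, 2000.0)
    val prices = listOf(340.0, 390.0, 550.0, 565.0, 610.0, 680.0, 725.0, 490.0)
    val fit = fitLeastSquares(areas, prices)
    println("slope = ${fit.slope}, intercept = ${fit.intercept}")
    // The fitted line passes through the centroid (xm, ym)
    println("value at xm: ${fit.slope * areas.average() + fit.intercept}")
}
```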

Remember we discussed that our centroid has to fall on the best-fit regression line. Well, it does.

This method is known as the least squares method.

**Interpretation:** For every 1 square foot increase in area, we would expect the price to increase by 0.14526 thousand $, that is, $145.26.
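With the equation in hand, we can return to the question posed earlier: what should a 3,500 square foot house cost? The sketch below plugs the area into the rounded equation from the text; `predictPrice` and the constant names are our own. It should come out at about 659.05 thousand $.

```kotlin
// Coefficients as rounded in the text (price is in thousand $)
const val SLOPE = 0.14526
const val INTERCEPT = 150.64

// Predicted price, in thousand $, for a house of the given area
fun predictPrice(areaSqFt: Double): Double = SLOPE * areaSqFt + INTERCEPT

fun main() {
    val price = predictPrice(3500.0)
    println("Predicted price for 3500 sq. feet: $price thousand USD")
}
```

Note that 3,500 square feet lies inside the range of areas in Table-1 (1,500 to 4,000), so this is an interpolation rather than an extrapolation.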

Implementation using Kotlin

You are now aware of the concept of linear regression. Let’s start with the implementation of what we learned.

**Dataset-**

We will be using a self-generated dataset. This is an arbitrary choice; you can use any other dataset you like. The first step is to split the data into training and testing sets.

The training set is what the machine learning algorithm learns from. The data is fed to the algorithm, which evaluates it repeatedly to learn its behavior and reveal the patterns that serve the intended purpose.

After the model is built, the testing data validates that it can make accurate predictions. The model should not see the testing data, or its labels, during training. It acts as a real-world check on an unseen dataset to confirm that the algorithm was trained effectively.

We usually split the data roughly 80%-20% between the training and testing stages.
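One way such a split could be done in plain Kotlin is sketched below. The function `trainTestSplit`, its default 20% test fraction and the fixed seed are our own assumptions for illustration; the walkthrough that follows does not actually split its small dataset.

```kotlin
import kotlin.math.roundToInt
import kotlin.random.Random

// Shuffles the data with a fixed seed and returns (training set, testing set)
fun <T> trainTestSplit(
    data: List<T>,
    testFraction: Double = 0.2,
    seed: Int = 42
): Pair<List<T>, List<T>> {
    require(testFraction in 0.0..1.0) { "testFraction must be between 0 and 1" }
    val shuffled = data.shuffled(Random(seed))
    val testSize = (data.size * testFraction).roundToInt()
    return Pair(shuffled.drop(testSize), shuffled.take(testSize))
}

fun main() {
    val rows = (1..10).toList()
    val (train, test) = trainTestSplit(rows)
    println("train (${train.size} rows): $train")
    println("test (${test.size} rows): $test")
}
```

Shuffling before splitting avoids a test set made up only of the last rows, and the fixed seed keeps the split reproducible between runs.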

**Code-**

We start with the training dataset, which records salary (in thousands) against years of experience.

```
val xs = arrayListOf(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) // independent variable: years of experience
val ys = arrayListOf(25, 35, 49, 60, 75, 90, 115, 130, 150, 200) // dependent variable: salary in thousands
```

As discussed earlier, the next step is to find the centroid. That is the mean of both variables.

```
val meanX = xs.average() // mean of the independent variable
println(meanX)
```

Output: 5.5

```
val meanY = ys.average() // mean of the dependent variable
println(meanY)
```

Output: 92.9

The coordinates of the centroid are (5.5, 92.9). Let’s move forward!

**Training the model-**

```
val yearsdeviation = xs.map { it - 5.5 } // deviation of each years value from the mean
// map is a built-in function that applies the given transformation to every element of a list
println(yearsdeviation)
```

Output: [-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5]

```
val salarydeviation = ys.map { it - 92.9 } // deviation of each salary from the mean
println(salarydeviation)
```

Output: [-67.9, -57.900000000000006, -43.900000000000006, -32.900000000000006, -17.900000000000006, -2.9000000000000057, 22.099999999999994, 37.099999999999994, 57.099999999999994, 107.1]

```
val deviationproduct = xs.zip(ys) { x, y -> (x - meanX) * (y - meanY) } // product of the two deviations, pair by pair
// zip combines two lists element by element into a new list
println(deviationproduct)
```

Output: [305.55, 202.65000000000003, 109.75000000000001, 49.35000000000001, 8.950000000000003, -1.4500000000000028, 33.14999999999999, 92.74999999999999, 199.84999999999997, 481.95]

```
val sqofyearsdeviation = yearsdeviation.map { e -> e * e } // squaring each value of the years deviation column
// e denotes each element in the list
println(sqofyearsdeviation)
```

Output: [20.25, 12.25, 6.25, 2.25, 0.25, 0.25, 2.25, 6.25, 12.25, 20.25]

```
val num = deviationproduct.sum() // summation of all elements in the list
println(num)
```

Output: 1482.5

```
val deno = sqofyearsdeviation.sum()
println(deno)
```

Output: 82.5

Now let’s calculate the value of the slope:

Slope = \( \frac {num}{deno}\)

```
val slope = num / deno
println(slope)
```

Output: 17.96969696969697

```
val slope: Double = String.format("%.3f", 17.96969696969697).toDouble() // rounding off the value of slope
println(slope)
```

Output: 17.97

After calculating the slope, the centroid is used to find the y-intercept:

```
val yIntercept = meanY - slope * meanX // y-intercept from the centroid and the slope
yIntercept
```

Output: -5.934999999999988

```
// Final equation of linear regression
val simpleLinearRegression = { independentVariable: Double -> slope * independentVariable + yIntercept }
```

Our model is trained now. Let’s test it!

**Testing the model-**

For instance, how much should an individual with *2.5* or *7.5* years of experience expect to earn?

```
val testcaseone = simpleLinearRegression.invoke(2.5) // predicting the salary of an employee with 2.5 years of experience
testcaseone
```

Output: 38.99000000000001

```
// Rounding off the predicted value of test case 1
val testcaseone: Double = String.format("%.3f", 38.99000000000001).toDouble()
testcaseone
```

Output: 38.99

```
val testcasetwo = simpleLinearRegression.invoke(7.5) // predicting the salary of an employee with 7.5 years of experience
testcasetwo
```

Output: 128.83999999999997

```
// Rounding off the predicted value of test case 2
val testcasetwo: Double = String.format("%.3f", 128.83999999999997).toDouble()
testcasetwo
```

Output: 128.84

Our model predicted the salary with respect to years of experience of the employee. Now the question is, how accurate is this value? What is the error percentage between the original value and the predicted value? To answer these questions let’s move to the next section.
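As a first, rough look at accuracy, we can compare the model's predictions with the salaries it was trained on. The sketch below uses the rounded coefficients from the walkthrough (17.97 and -5.935) and reports each absolute error as a percentage of the observed salary. The function names are our own, and percentage error is only one of several possible measures; the measures developed in the next sections are the standard ones for regression.

```kotlin
import kotlin.math.abs

val years = (1..10).map { it.toDouble() }
val salaries = listOf(25.0, 35.0, 49.0, 60.0, 75.0, 90.0, 115.0, 130.0, 150.0, 200.0)

// Rounded coefficients from the walkthrough above
fun predictSalary(yearsOfExperience: Double): Double = 17.97 * yearsOfExperience - 5.935

// Absolute error of each prediction as a percentage of the observed value
fun percentageErrors(xs: List<Double>, ys: List<Double>): List<Double> =
    xs.zip(ys).map { (x, y) -> abs(y - predictSalary(x)) / y * 100.0 }

fun main() {
    val errors = percentageErrors(years, salaries)
    years.zip(errors).forEach { (x, e) -> println("years = $x, percentage error = $e") }
    println("mean percentage error = ${errors.average()}")
}
```

The first row already shows a large relative error (a predicted 12.035 against an observed 25), a hint that a straight line is only a rough description of this particular dataset.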

Building Regression Model in S2

S2/Kotlin has ample packages for solving linear model problems with regression models. We will implement the Ordinary Least Squares (OLS) model. To build and analyze a model for a given dataset, an \(LMProblem\) is constructed.

**Q.** Consider a response vector (dependent variable) \(Y = \begin{bmatrix}1\\2\\4\\5\\10\end{bmatrix}\) and a design matrix of the explanatory (independent) variable \(X = \begin{bmatrix}25\\35\\60\\75\\200\end{bmatrix}\). Construct an OLS model using the S2 IDE.

```
// %use s2 loads the S2 library; it appears at the top of every S2 program
%use s2

// Runs an OLS regression on the given problem and prints the results
fun printOLSResults(problem: LMProblem?) {
    val ols = OLSRegression(problem)
    val olsResiduals: OLSResiduals = ols.residuals()
    // beta hat: estimated coefficients (here the slope first, then the intercept)
    // stderr: standard error of each coefficient
    // t: t-statistic of each coefficient
    // residuals: difference between each observed value and the model value
    println("beta hat: ${ols.beta().betaHat()}\nstderr: ${ols.beta().stderr()},\nt: ${ols.beta().t()},\nresiduals: ${olsResiduals.residuals()}")
}

// Defining the vector Y
val Y: Vector = DenseVector(arrayOf(1.0, 2.0, 4.0, 5.0, 10.0))
// Defining the matrix X
val X: Matrix = DenseMatrix(
    arrayOf(
        doubleArrayOf(25.0),
        doubleArrayOf(35.0),
        doubleArrayOf(60.0),
        doubleArrayOf(75.0),
        doubleArrayOf(200.0)
    )
)
// true: an intercept term is estimated as well
val intercept = true
val problem1 = LMProblem(Y, X, intercept)
printOLSResults(problem1) // calling the function
```

Output:

beta hat: [0.048918, 0.535481]

stderr: [0.005264, 0.532022] ,

t: [9.293036, 1.006500]

residuals: [-0.758430, -0.247609, 0.529441, 0.795672, -0.319074]
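The t values in this output are related to the other two rows: each t-statistic is the coefficient estimate divided by its standard error. The sketch below checks this with the printed numbers; `tStatistic` is a hypothetical helper of our own, not an S2 function, and the small differences from the printed t values come from the printed inputs being rounded to six decimals.

```kotlin
// t-statistic for testing whether a coefficient differs from zero
fun tStatistic(estimate: Double, standardError: Double): Double = estimate / standardError

fun main() {
    // Values copied from the printed output above
    val betaHat = listOf(0.048918, 0.535481)
    val stderr = listOf(0.005264, 0.532022)
    betaHat.zip(stderr).forEach { (b, se) ->
        println("estimate = $b, stderr = $se, t = ${tStatistic(b, se)}")
    }
}
```

The slope's t value of about 9.29 is large, while the intercept's value of about 1.01 is small, which suggests the slope is estimated far more precisely relative to its size than the intercept.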

Fit of the Regression Model

So far we have covered and coded the fundamental concepts of simple linear regression. In this section, we will learn to evaluate how well a regression line fits the data it models.

A regression model is unique to the data it represents. Once a regression model is built, the sum of squared errors (or residuals) is calculated using the regression line. So far we have calculated the regression line using the least squares method. Now, we will find the errors and residuals.

In statistics, the residual and the error are not the same. The error is the difference between the observed value and the true (unobserved) value, whereas the residual is the difference between the observed value and the model (predicted) value.

**Error Calculation:**

Recall the example of house prices we considered. We already discussed and found the equation of linear regression on selected data and how the line passes through the centroid. Now let’s move on to the part where we calculate predicted prices for each house with the help of the equation we derived earlier.

Area (square feet) | Price (thousand $) | \(y=0.14526*x+150.64\) | \(y\) (predicted prices) |
---|---|---|---|
1500 | 340 | \(y=0.14526*(1500)+150.64\) | 368.53 |
1750 | 390 | \(y=0.14526*(1750)+150.64\) | 404.845 |
2600 | 550 | \(y=0.14526*(2600)+150.64\) | 528.316 |
3000 | 565 | \(y=0.14526*(3000)+150.64\) | 586.42 |
3200 | 610 | \(y=0.14526*(3200)+150.64\) | 615.472 |
3600 | 680 | \(y=0.14526*(3600)+150.64\) | 673.576 |
4000 | 725 | \(y=0.14526*(4000)+150.64\) | 731.68 |
2000 | 490 | \(y=0.14526*(2000)+150.64\) | 441.16 |

In the above table we simply substitute each area value in place of x. Evaluating these equations gives the predicted price in the last column. Note that in this case our training data and testing data are the same.

**Note:** Instead of grabbing a calculator for the above calculations, open S2 and code them for each data point.

```
val y = 0.14526 * 1500 + 150.64 // calculating for the first row
println(y)
```

Output: 368.53
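Rather than repeating that line for every row, the whole column of predicted prices can be produced in one pass. This is a small sketch in plain Kotlin; `predict` and `areas` are names of our own.

```kotlin
// Areas from Table-1, in square feet
val areas = listOf(1500.0, 1750.0, 2600.0, 3000.0, 3200.0, 3600.0, 4000.0, 2000.0)

// Regression equation with the rounded coefficients used in the table (result in thousand $)
fun predict(x: Double): Double = 0.14526 * x + 150.64

fun main() {
    // One predicted price per row of the table
    areas.forEach { area -> println("area = $area, predicted price = ${predict(area)}") }
}
```

The printed values may carry a few extra trailing digits from floating-point arithmetic, but they agree with the last column of the table.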

**Observations:** For a house of 1500 square feet, the price was 340 thousand $ and according to our regression equation we predicted 368.53 thousand $. Thus, you can observe the price of houses is not exactly the same as our predicted value. This discrepancy in values is what we refer to as an error.

So let’s find the difference in the original price and the predicted price for all the data points we have.

Area (square feet) | Price (thousand $) | \(y\) (predicted prices) | Error = Observed – Predicted | Squared Error |
---|---|---|---|---|
1500 | 340 | 368.53 | -28.53 | 813.9609 |
1750 | 390 | 404.845 | -14.845 | 220.3740 |
2600 | 550 | 528.316 | 21.684 | 470.1958 |
3000 | 565 | 586.42 | -21.42 | 458.8164 |
3200 | 610 | 615.472 | -5.472 | 29.9428 |
3600 | 680 | 673.576 | 6.424 | 41.2678 |
4000 | 725 | 731.68 | -6.68 | 44.6224 |
2000 | 490 | 441.16 | 48.84 | 2385.3456 |

Squaring these differences gives the values in the last column of the above table. Adding them up gives the Sum of Squared Errors (SSE).

Sum of Squared Error = SSE = \( \displaystyle\sum_{i=1}^n(Observed-Predicted)^2\) = 4464.5257
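The same sum can be reproduced in code from the observed and predicted columns. This is a minimal sketch; `sumOfSquaredErrors` is a name of our own.

```kotlin
val observed = listOf(340.0, 390.0, 550.0, 565.0, 610.0, 680.0, 725.0, 490.0)
val predicted = listOf(368.53, 404.845, 528.316, 586.42, 615.472, 673.576, 731.68, 441.16)

// SSE = sum over all points of (observed - predicted)^2
fun sumOfSquaredErrors(obs: List<Double>, pred: List<Double>): Double {
    require(obs.size == pred.size) { "lists must have the same length" }
    return obs.zip(pred).sumOf { (o, p) -> (o - p) * (o - p) }
}

fun main() {
    println("SSE = ${sumOfSquaredErrors(observed, predicted)}") // about 4464.5257
}
```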

In the scatter graph of Table-1, the error is the vertical distance from the regression line to each observed data point. The squares of these errors, which add up to the SSE, can be represented on the graph as –

**Residual Analysis:**

Residual analysis is important to assess the appropriateness of a linear regression model. It is the second major step towards validating a model. If the model assumptions are not satisfied, residual analysis often suggests ways for improving the model in order to obtain better results.

To understand this, recall the dataset and scatter plot of Table-1. Now, construct a line parallel to the X-axis at the value of ymean, that is, 543.75.

**Interpretation:** In the above graph, observe how the distance from the line ymean to an observed data point can be divided into exactly two distances. One is the distance between the observed point and the regression line, which feeds into the SSE we discussed in the previous section; the other is the distance between the regression line and the line ymean, which feeds into the SSR.

The Sum of Squares due to Regression (SSR) is a statistical way to study how much of the variance the regression model explains. It is the sum of the squared differences between the predicted values and ymean. It is not the sum of the squared residuals, which is the SSE. Graphical interpretation will be –

SSR = SST – SSE

We have already calculated SSE, so now let’s calculate the Total Sum of Squares (SST). Consider the following diagram –

**Note:** In this case, we consider only the dependent variable, which is the price of the house. We figured out its mean and marked a horizontal line at 543.75 thousand $.

We square the distance from each observed data point to the mean line and then add the squares up.

Area (square feet) | Price (thousand $) | Price difference wrt ymean | Squared |
---|---|---|---|
1500 | 340 | -203.75 | 41514.0625 |
1750 | 390 | -153.75 | 23639.0625 |
2600 | 550 | 6.25 | 39.0625 |
3000 | 565 | 21.25 | 451.5625 |
3200 | 610 | 66.25 | 4389.0625 |
3600 | 680 | 136.25 | 18564.0625 |
4000 | 725 | 181.25 | 32851.5625 |
2000 | 490 | -53.75 | 2889.0625 |

In the last column, on adding the values we get the Total Sum of Squares (SST).

Total Sum of Squares = SST = \( \displaystyle\sum_{i=1}^n(Observed(dependent) -ymean)^2\) = 124337.5
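SST depends only on the observed prices, so it can be computed without the regression line at all. A minimal sketch, with `totalSumOfSquares` as a name of our own:

```kotlin
val prices = listOf(340.0, 390.0, 550.0, 565.0, 610.0, 680.0, 725.0, 490.0)

// SST = sum over all points of (observed - mean of observed)^2
fun totalSumOfSquares(ys: List<Double>): Double {
    val ym = ys.average()
    return ys.sumOf { (it - ym) * (it - ym) }
}

fun main() {
    println("SST = ${totalSumOfSquares(prices)}") // about 124337.5
}
```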

Recall the relation between SSR, SSE, and SST.

SSR = SST – SSE = 124337.5 – 4464.5257 = 119872.9743

Thus, we can say that “Residual is the estimation of error”.

**Coefficient of Determination:** Now that we have the values of SSR, SSE, and SST, let’s go ahead and find the coefficient of determination, represented as \(R^2\). It is a statistical measure of goodness of fit.

R-squared is generally interpreted as a percentage. If SSR is large, the regression accounts for more of SST, and thus SSE is smaller relative to the total.

Mathematically, it is the regression sum of squares divided by the total sum of squares.

\(R^2 = \frac{\text{Sum of Squares due to Regression}}{\text{Total Sum of Squares}} = \frac{SSR}{SST} \)

**Note-1:** For a least squares line with an intercept, the value is always between 0 (0%) and 1 (100%).

**Note-2:** Larger values of \(R^2\) suggest that our linear model is a good fit for the data we provided.

In our case,

\(R^2 = \frac{119872.9743}{124337.5} = 0.9641\) or 96.41%

**Conclusion:** We can conclude that 96.41% of the total sum of squares is explained by our regression equation for predicting house prices. The unexplained share is less than 4%, which implies a GOOD FIT for our model.
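Since SSR = SST – SSE, the ratio SSR/SST can equivalently be written as 1 – SSE/SST. The sketch below uses that form with the sums computed above; `rSquared` is a name of our own.

```kotlin
// R^2 = SSR / SST = (SST - SSE) / SST = 1 - SSE / SST
fun rSquared(sse: Double, sst: Double): Double {
    require(sst > 0.0) { "SST must be positive" }
    return 1.0 - sse / sst
}

fun main() {
    val sse = 4464.5257 // sum of squared errors from the error table
    val sst = 124337.5  // total sum of squares
    println("R2 = ${rSquared(sse, sst)}") // about 0.9641
}
```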

While coding in S2, we include the following line to print the coefficient of determination.

```
println("R2: ${olsResiduals.R2()}")
```

**Mean Square Error:** Represented as \(s^2\), it tells us how spread out the data points are from the regression line. \(s^2\) is the estimate of sigma squared, that is, the variance of the errors.

MSE is SSE divided by its degrees of freedom, which is n - 2: we lose 2 degrees of freedom because we are estimating two parameters, the slope and the intercept. In our case n = 8, so there are 6 degrees of freedom.

**Note-1:** In simple linear regression, the number of estimated parameters is always 2, so the degrees of freedom are always n - 2.

**Note-2:** MSE is not simply the average of the squared residuals, because we divide by n - 2 rather than n.

\(s^2 = \frac{SSE}{n-2} = \frac{4464.5257}{6} = 744.09\)

**Standard Error:** It is the standard deviation of the overall error, represented as \(s\). It can be read as the typical distance an observation falls from the regression line, in units of the dependent variable.

\(s = \sqrt{MSE} = \sqrt{744.09} = 27.278\)

\(s\) is a measure of how well the regression model makes predictions.
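Both quantities follow directly from SSE and the number of observations. The sketch below assumes two estimated parameters (slope and intercept); the function names are our own.

```kotlin
import kotlin.math.sqrt

// MSE = SSE / (n - number of estimated parameters); 2 parameters in simple linear regression
fun meanSquareError(sse: Double, n: Int, estimatedParameters: Int = 2): Double {
    require(n > estimatedParameters) { "need more observations than parameters" }
    return sse / (n - estimatedParameters)
}

// Standard error s = square root of MSE
fun standardError(sse: Double, n: Int): Double = sqrt(meanSquareError(sse, n))

fun main() {
    val sse = 4464.5257
    val n = 8 // number of houses in Table-1
    println("MSE = ${meanSquareError(sse, n)}") // about 744.09
    println("s = ${standardError(sse, n)}")     // about 27.278
}
```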

When coding in S2, the following line prints the standard error and the F-statistic –

```
println("standard error: ${olsResiduals.stderr()}, f: ${olsResiduals.Fstat()}") // standard error and F-statistic
```

Testing Data Implementation

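Consider the same question we coded previously in the “Building Regression Model in S2” section. Here, we will predict values and find the errors associated with the data. Consider the vector for testing, named W in the code: \(W = \begin{bmatrix}1.2\\2.4\\4.1\\5.3\\9.9\end{bmatrix}\). Note that W is passed as the fourth argument of \(LMProblem\); in the underlying library this argument appears to be a vector of weights, so the second fit below is best read as a weighted regression on the same data.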

```
%use s2

// Runs an OLS regression on the given problem and prints the fit and its diagnostics
fun printOLSResults(problem: LMProblem?) {
    val ols = OLSRegression(problem)
    val olsResiduals: OLSResiduals = ols.residuals()
    println("beta hat: ${ols.beta().betaHat()},\nstderr: ${ols.beta().stderr()}, \nresiduals: ${olsResiduals.residuals()}")
    println("R2: ${olsResiduals.R2()}, standard error: ${olsResiduals.stderr()}, f: ${olsResiduals.Fstat()}")
    println("fitted values: ${olsResiduals.fitted()}")
    println("sum of squared residuals: ${olsResiduals.RSS()}")
    println("total sum of squares: ${olsResiduals.TSS()}")
    println()
}

val Y: Vector = DenseVector(arrayOf(1.0, 2.0, 4.0, 5.0, 10.0))
val X: Matrix = DenseMatrix(
    arrayOf(
        doubleArrayOf(25.0),
        doubleArrayOf(35.0),
        doubleArrayOf(60.0),
        doubleArrayOf(75.0),
        doubleArrayOf(200.0)
    )
)
val intercept = true
val problem1 = LMProblem(Y, X, intercept)
printOLSResults(problem1)

// Testing data vector
val W: Vector = DenseVector(arrayOf(1.2, 2.4, 4.1, 5.3, 9.9))
val problem2 = LMProblem(Y, X, intercept, W)
printOLSResults(problem2)
```

**Output:**

beta hat: [0.048918, 0.535481] ,

stderr: [0.005264, 0.532022] ,

residuals: [-0.758430, -0.247609, 0.529441, 0.795672, -0.319074]

R2: 0.9664281242711773, standard error: 0.7420099473407974, f: 86.36051188299813

fitted values: [1.758430, 2.247609, 3.470559, 4.204328, 10.319074]

sum of squared residuals: 1.6517362858580784

total sum of squares: 49.2

beta hat: [0.045201, 1.055123] ,

stderr: [0.003621, 0.504376] ,

residuals: [-1.185147, -0.637157, 0.232818, 0.554804, -0.095319]

R2: 0.9811093504747105, standard error: 1.23873309823374, f: 155.80872682454918

fitted values: [2.185147, 2.637157, 3.767182, 4.445196, 10.095319]

sum of squared residuals: 4.60337906597928

total sum of squares: 243.68558951965065
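The summary numbers in this output tie back to the formulas from the previous section. As a cross-check in plain Kotlin, the sketch below recomputes R2, the standard error and the F-statistic of the first fit from its printed sum of squared residuals and total sum of squares, assuming n = 5 observations and one explanatory variable plus an intercept. The function names are our own and are not S2 functions.

```kotlin
import kotlin.math.sqrt

// R^2 = 1 - RSS / TSS
fun rSquared(rss: Double, tss: Double): Double = 1.0 - rss / tss

// s = sqrt(RSS / (n - 2)) for one predictor plus an intercept
fun standardError(rss: Double, n: Int): Double = sqrt(rss / (n - 2))

// F = (explained sum of squares / 1) / (RSS / (n - 2))
fun fStatistic(rss: Double, tss: Double, n: Int): Double = (tss - rss) / (rss / (n - 2))

fun main() {
    // Values copied from the first block of the output above
    val rss = 1.6517362858580784
    val tss = 49.2
    val n = 5
    println("R2 = ${rSquared(rss, tss)}")
    println("standard error = ${standardError(rss, n)}")
    println("f = ${fStatistic(rss, tss, n)}")
}
```

The recomputed values should agree with the printed R2, standard error and f to within floating-point rounding.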

Outliers

We have now covered the most important parts of regression. In this section, we will look at the concept of outliers.

Sometimes when you make a scatter plot, one or more data points just don’t look right. Consider the following scatter plot for example.

Here, you will notice that the orange point looks way out of place. For one of the variables considered, a value can fall well outside the norm. A point with such a large residual, even though it lies within the range of the other variable, is termed an outlier.

**Note:** An outlier affects the slope of the regression line because it falls outside the general pattern of the data.
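To see this effect in numbers, the sketch below fits the Table-1 house data twice: once as given, and once with a single made-up outlier added. The point (3800 square feet, 200 thousand $) is purely hypothetical and is not part of the original data; `Line` and `fitLeastSquares` are names of our own.

```kotlin
data class Line(val slope: Double, val intercept: Double)

// Least squares fit over (x, y) pairs, using the slope and intercept formulas from earlier
fun fitLeastSquares(points: List<Pair<Double, Double>>): Line {
    require(points.size >= 2) { "need at least two points" }
    val xm = points.map { it.first }.average()
    val ym = points.map { it.second }.average()
    val num = points.sumOf { (x, y) -> (x - xm) * (y - ym) }
    val den = points.sumOf { (x, _) -> (x - xm) * (x - xm) }
    val m = num / den
    return Line(m, ym - m * xm)
}

// Table-1: area in square feet to price in thousand $
val table1 = listOf(
    1500.0 to 340.0, 1750.0 to 390.0, 2600.0 to 550.0, 3000.0 to 565.0,
    3200.0 to 610.0, 3600.0 to 680.0, 4000.0 to 725.0, 2000.0 to 490.0
)

// Hypothetical outlier: a large house with an unusually low price (not in the original data)
val outlier = 3800.0 to 200.0

fun main() {
    val clean = fitLeastSquares(table1)
    val withOutlier = fitLeastSquares(table1 + outlier)
    println("slope without outlier: ${clean.slope}")
    println("slope with outlier: ${withOutlier.slope}")
}
```

A single point far below the general pattern at a large area pulls the fitted slope down noticeably, which is why outliers deserve a closer look before a regression line is trusted.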

*We now conclude the topic, “Simple Linear Regression”.*

*Thank you for spending time on this course! Hope you learned a lot.*