

 Linear Regression and Finding the Line of Best Fit 


 

Lesson Summary

In this lesson, students will explore the concept of linear regression, a statistical method used to model the relationship between two quantitative variables. Through hands-on activities, students will learn to create scatter plots, determine the line of best fit, interpret the slope and y-intercept in context, and utilize linear models to make predictions. The lesson emphasizes the practical applications of linear regression in various fields and discusses its limitations.

Lesson Objectives

  • Understand the concept of linear regression
  • Use linear regression to find the line of best fit
  • Interpret slope and y-intercept of the regression line
  • Apply the linear model to make predictions
  • Understand the limitations of linear regression models

Common Core Standards

  • S.ID.6: Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
  • S.ID.7: Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
  • S.ID.8: Compute (using technology) and interpret the correlation coefficient of a linear fit.

Prerequisite Skills

  • Understanding linear functions and equations
  • Plotting points on the Cartesian plane
  • Calculating slope and y-intercept

Key Vocabulary

  • Linear Regression: A statistical technique that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
  • Line of Best Fit: The straight line that best represents the data on a scatter plot, minimizing the sum of the squared differences between observed and predicted values.
  • Scatter Plot: A graphical representation of the relationship between two quantitative variables, where each point represents an observation.
  • Correlation Coefficient (r): A numerical measure ranging from -1 to 1 that indicates the strength and direction of the linear relationship between two variables.
  • Dependent Variable: The outcome variable that researchers are trying to explain or predict; it depends on the independent variable.
  • Independent Variable: The variable that is manipulated or selected by the researcher to determine its relationship to the dependent variable.
  • Residual: The difference between the observed value and the value predicted by the regression line; it indicates the error in the prediction.
  • Extrapolation: The process of estimating values beyond the range of the observed data using the regression line.
  • Interpolation: The process of estimating values within the range of the observed data using the regression line.

 


 

Warm Up Activities

Choose one or more of the following activities.

Activity 1: Graphing Functional Data on the Coordinate Plane

Before introducing linear regression, students will review how to plot functional data points on a coordinate plane. This activity reinforces the concept of visualizing relationships between two variables.

Instructions:

  1. Provide students with the following data set representing the number of hours studied (\( x \)) and the corresponding test scores (\( y \)):
  2. Ask students to:
    • Plot the points on a coordinate plane.
    • Identify whether the data suggests a linear relationship.
    • Discuss how an increase in study hours appears to affect test scores.

 

| Hours Studied (\( x \)) | Test Score (\( y \)) |
|---|---|
| 1 | 60 |
| 2 | 68 |
| 3 | 75 |
| 4 | 80 |
| 5 | 90 |

 


This activity sets the stage for understanding scatter plots and the need for a line of best fit. Have students use the Desmos graphing calculator or a spreadsheet to graph the data set.
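One quick way to judge whether the data suggest a linear relationship is to compare the rate of change between neighboring points. The Python sketch below is only an illustration; the helper name `consecutive_slopes` is hypothetical and not part of the lesson materials.

```python
# Hours studied (x) and test scores (y) from the table above.
hours = [1, 2, 3, 4, 5]
scores = [60, 68, 75, 80, 90]

def consecutive_slopes(xs, ys):
    """Return the rate of change between each pair of neighboring points."""
    return [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]

slopes = consecutive_slopes(hours, scores)
print(slopes)  # [8.0, 7.0, 5.0, 10.0]
```

The rates are all positive and similar in size but not identical, so the data are roughly, not exactly, linear.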

Activity 2: Estimating the Line of Best Fit from a Graph

Students will learn to estimate the line of best fit by analyzing a scatter plot.

Instructions:

  1. Provide the following scatter plot data, representing the number of weeks a product has been on sale (\( x \)) and its price (\( y \)):
  2. Have students:
    • Plot the data points on a graph.
    • Draw a straight line that best represents the trend of the data.
    • Estimate the equation of the line using two points close to the line.

 

| Weeks on Sale (\( x \)) | Price (\( y \)) |
|---|---|
| 1 | $50 |
| 2 | $47 |
| 3 | $45 |
| 4 | $42 |
| 5 | $40 |
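
The two-point estimate described in the instructions can be sketched in Python as follows. The choice of points here (week 1 and week 5) is an assumption made for illustration, and `line_through` is a hypothetical helper name; students may reasonably pick other points near their drawn line.

```python
def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Assumed choice: the first and last rows of the table (week 1 and week 5).
m, b = line_through((1, 50), (5, 40))
print(f"y = {m}x + {b}")  # y = -2.5x + 52.5
```

A different pair of points gives a slightly different equation, which is a useful discussion point before introducing regression.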

 

Activity 3: Looking at Trends

Display a scatter plot of real-world data showing the relationship between hours spent studying per week and SAT scores. Use the following dataset (source: College Board, 2019 SAT Suite of Assessments Annual Report):

 

| Hours Studied Per Week | Average SAT Score |
|---|---|
| 0 | 1050 |
| 1-5 | 1090 |
| 6-10 | 1150 |
| 11-15 | 1190 |
| 16-20 | 1220 |
| More than 20 | 1240 |

 

Note: For graphing purposes, we'll represent each range by its midpoint (for example, 3 hours for the 1-5 range) and use 25 hours as an approximation for the "More than 20" category.

Ask students to describe the relationship they observe between the variables and sketch a line that best fits the data. Discuss how this line could be used to make predictions about SAT scores based on study time.

Prompt students with questions such as:

  • What trend do you notice in the data?
  • How does the average SAT score change as study time increases?
  • Is the relationship between study time and SAT scores perfectly linear? Why or why not?
  • If a student studies for 8 hours per week, what SAT score might you predict based on this data?

 


 

Teach

Introduce the concept of linear regression as a statistical method for finding the line of best fit for a set of data points. Explain that this line minimizes the sum of squared vertical distances from the data points to the line.
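
For reference, the least-squares idea can be written out as a formula. The notation below (\( m \) for slope, \( b \) for y-intercept, \( \bar{x} \) and \( \bar{y} \) for the means of the data) is chosen here for illustration; the lesson itself relies on technology to carry out the computation.

```latex
% Quantity being minimized: the sum of squared vertical distances
\[
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
\]

% Slope and intercept that minimize S
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
\]
```

Each term \( y_i - (m x_i + b) \) is the vertical distance from a data point to the line, so the line of best fit is the one that makes the total of these squared distances as small as possible.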

Definitions

  • Linear regression: A statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
  • Line of best fit: The straight line that best represents the relationship between two variables in a scatter plot.
  • Scatter plot: A graph that shows the relationship between two variables as a collection of points.
  • Correlation coefficient: A measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1.

Use this slide show to review these definitions:

https://www.media4math.com/library/slideshow/linear-regression-vocabulary

Developing a Linear Regression Model

Use the following dataset for average height vs. average weight for American males (source: Centers for Disease Control and Prevention, National Health and Nutrition Examination Survey, 2015-2018):

 

| Height (inches) | Weight (pounds) |
|---|---|
| 67 | 160 |
| 68 | 165 |
| 69 | 170 |
| 70 | 175 |
| 71 | 180 |
| 72 | 185 |
| 73 | 190 |
| 74 | 195 |

 

Walk through the process of graphing the data set and performing linear regression using the Desmos graphing calculator:

  1. Go to www.desmos.com/calculator
  2. Click on the + button in the upper left corner and select "Table"
  3. Enter the height data in column 1 and the weight data in column 2
  4. Click on the + button again and select "f(x) expression"
  5. Enter y1 ~ mx1 + b in the expression line (Desmos displays this as \( y_1 \sim m x_1 + b \)).
  6. Desmos will display the scatter plot and the line of best fit, along with other statistics.

Here is a Desmos activity you can use with this data set:

https://www.desmos.com/calculator/axs0bosbvm 

After performing the linear regression, present the following table with the results:

 

| Aspect | Value |
|---|---|
| Linear function equation | y = 5x - 175 |
| y-intercept | -175 |
| Slope | 5 |
| Correlation coefficient | 1 |

 

Explain how to interpret each of these values:

  • The linear function equation represents the line of best fit.
  • The y-intercept (-175) represents the theoretical weight when height is 0 (not meaningful in this context).
  • The slope (5) indicates that for each inch increase in height, weight increases by 5 pounds on average.
  • The correlation coefficient (1) shows a perfect positive linear relationship between height and weight in this dataset.
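
As a cross-check on the Desmos output, the same values can be computed by hand. The Python sketch below applies the standard least-squares formulas to the height and weight table; the function name `least_squares` is a hypothetical helper, not something Desmos or the lesson provides.

```python
from math import sqrt

heights = [67, 68, 69, 70, 71, 72, 73, 74]          # inches
weights = [160, 165, 170, 175, 180, 185, 190, 195]  # pounds

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

m, b, r = least_squares(heights, weights)
print(round(m, 4), round(b, 4), round(r, 4))  # 5.0 -175.0 1.0
```

The correlation coefficient is exactly 1 because every point in this table lies on the line; measured data would normally show some scatter.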

Using the Linear Model for Predictions

Demonstrate how to use the linear function equation y = 5x - 175 to predict the weight for someone whose height falls between values in the table (interpolation).

Example: Calculate the predicted weight for someone who is 70.5 inches tall.

y = 5(70.5) - 175
y = 352.5 - 175
y = 177.5

Explain that the model predicts a weight of 177.5 pounds for someone who is 70.5 inches tall.

Limitations of the Linear Regression Model

Discuss the following limitations:

  • Extrapolation: The model may not be accurate for heights outside the range of the data (67-74 inches). Using it to predict weights for very short or very tall individuals could lead to unrealistic results.
  • Assumption of linearity: The model assumes a perfectly linear relationship, which may not always be true in real-world scenarios.
  • Oversimplification: The model only considers height as a factor in determining weight, ignoring other important variables like age, gender, body composition, and lifestyle factors.
  • Data quality: The accuracy of the model depends on the quality and representativeness of the data used to create it.
  • Correlation vs. causation: While there's a strong correlation between height and weight, the model doesn't imply that height causes weight or vice versa.

Emphasize that understanding these limitations is crucial for appropriate use and interpretation of linear regression models.
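
The extrapolation limitation can be made concrete with a short Python sketch. The function name `predict_weight` and the 30-inch test height are assumptions chosen for illustration; the 67 to 74 inch range comes from the data table.

```python
def predict_weight(height_in, lo=67, hi=74):
    """Apply y = 5x - 175 and flag whether the input lies inside the data range."""
    weight = 5 * height_in - 175
    kind = "interpolation" if lo <= height_in <= hi else "extrapolation"
    return weight, kind

print(predict_weight(70.5))  # (177.5, 'interpolation')
print(predict_weight(30))    # (-25, 'extrapolation')
```

A predicted weight of -25 pounds is impossible, which shows students why the model should not be trusted far outside the heights it was built from.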

 


 

Review

Provide students with the following real-world dataset showing the relationship between age and average income for US adults (source: U.S. Bureau of Labor Statistics, Current Population Survey, 2020):

 

| Age Group | Average Annual Income ($) |
|---|---|
| 20-24 | 33,280 |
| 25-34 | 52,052 |
| 35-44 | 67,340 |
| 45-54 | 70,356 |
| 55-64 | 70,616 |
| 65+ | 56,632 |

 

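Guide students through the process of creating a scatter plot, finding the regression line, and interpreting the results using Desmos or a graphing calculator. Because the data are grouped, students first need to choose a single numeric value to represent each age group (for example, the midpoint of the range). Have students work in pairs to perform linear regression and answer questions about the meaning of the slope and y-intercept.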

After completing the linear regression, present the following summary:

 

| Aspect | Value |
|---|---|
| Linear function equation | y = 1057.8x + 33,957 |
| Slope | 1057.8 |
| y-intercept | 33,957 |
| Correlation coefficient | 0.82 |

 

Discuss the interpretation of these values:

  • The linear function equation represents the line of best fit for the age-income relationship.
  • The slope (1057.8) indicates that, on average, annual income increases by $1,057.80 for each one-year increase in age.
  • The y-intercept (33,957) represents the theoretical average annual income at age 0 (not meaningful in this context).
  • The correlation coefficient (0.82) shows a strong positive linear relationship between age and income, although it's not perfect.

Have students discuss the limitations of this model, such as:

  • The use of age groups instead of individual ages may affect the precision of the model.
  • The model doesn't account for factors other than age that might influence income.
  • The relationship may not be truly linear, especially at the upper end of the age range.
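
The first and third limitations can be explored with a short Python sketch. It rests on a loud assumption: each age group is represented by its midpoint, and the open-ended "65+" group by 70. The lesson does not say which x-values were used, and the fitted slope and correlation coefficient depend on that choice, so they may not match the summary values above.

```python
from math import sqrt

# ASSUMPTION: each age group is represented by its midpoint, and the
# open-ended "65+" group by 70. The lesson does not specify these values.
ages = [22, 29.5, 39.5, 49.5, 59.5, 70]
incomes = [33280, 52052, 67340, 70356, 70616, 56632]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(incomes) / n
sxx = sum((x - x_bar) ** 2 for x in ages)
syy = sum((y - y_bar) ** 2 for y in incomes)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / sqrt(sxx * syy)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(ages, incomes)]

print(f"slope={slope:.1f}, intercept={intercept:.0f}, r={r:.2f}")
for x, e in zip(ages, residuals):
    print(f"age {x}: residual {e:+.0f}")
```

Under these assumed x-values, the residuals are negative for the youngest and oldest groups and positive for the groups in between. That curved pattern is the kind of evidence students can cite when arguing that the relationship is not truly linear.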

 


 

Quiz

Answer the following questions.

 

  1. What is the purpose of linear regression?


     
  2. How is the line of best fit determined in linear regression?


     
  3. What does the slope of a regression line represent?


     
  4. What does the y-intercept of a regression line represent?


     
  5. What does a correlation coefficient of 0.95 indicate about the relationship between variables?


     
  6. Given the regression equation y = 2.5x + 10, what is the slope?


     
  7. In the equation y = 2.5x + 10, what does the 10 represent?


     
  8. True or False: A negative slope in a regression line always indicates a weak correlation.


     
  9. What is the range of possible values for the correlation coefficient?


     
  10. When might linear regression not be an appropriate method for analyzing a dataset?

Answer Key

  1. To find the best-fitting straight line through a set of data points
  2. By minimizing the sum of squared vertical distances from data points to the line
  3. The rate of change in the dependent variable for each unit change in the independent variable
  4. The predicted value of the dependent variable when the independent variable is zero
  5. A very strong positive linear relationship between the variables
  6. 2.5
  7. The y-intercept
  8. False (the sign of the slope shows the direction of the relationship, not its strength)
  9. -1 to 1
  10. When the relationship between variables is not linear

 
