Lesson 6 - Regression and Curve Fitting

 

It is common in the sciences, and in other fields such as economics and political science (and increasingly in other social sciences), either to use data to develop a mathematical model of some phenomenon, or to ask how well a set of data is described by an existing model. In both instances you are interested in how well the data fit the model over the range in which the data were measured, and also in how well you can use the results of fitting your data to a model to predict additional values inside the range of your measurements (interpolation) or outside of the range of your measurements (extrapolation).

A very common approach to this fitting of equations to data is known as regression analysis. We will begin with the special case of linear regression, where the goal is to fit a straight line to the data. A straight line, as you remember, can be generally written as

y = mx + b

Most commonly we read this equation as follows: the value of the dependent variable y equals the slope m times the value of the independent variable x, plus the intercept b, which is the value of y when x is zero. If you were describing a population of cells that started at 5000 cells and grew at a rate of 400 cells per hour, the initial value of 5000 would correspond to the intercept b, and the growth rate of 400 cells per hour would correspond to the slope m. The equation could then be written as

population = 400x + 5000

where x is the time in hours.
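As a quick check of how the slope and intercept enter the model, here is a minimal Python sketch that evaluates the cell population equation at a few times. The function name `population` and the sample times are chosen only for illustration.

```python
def population(hours, slope=400.0, intercept=5000.0):
    """Linear growth model: number of cells present after `hours` hours."""
    return slope * hours + intercept

# Sample times (in hours) chosen for illustration only.
for t in (0, 1, 10):
    print(t, population(t))
```

At x = 0 the function returns the intercept, and each additional hour adds one slope's worth of cells.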

Here are some measurements of the free energy change for a particular reaction at four different temperatures.

Temperature (K)    delta G (kcal)
100                4.4
200                6.8
300                8.4
400                10.8

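Free energy is defined by the expression

delta G = delta H - T delta S

where delta H is the enthalpy change, delta S is the entropy change, and T is the absolute temperature. If delta H and delta S are roughly constant over the temperature range, delta G is a linear function of T.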

Here is a plot of the free energy change versus temperature

Note that the data do not seem to lie on a perfect straight line. Here is the same graph in which several straight lines have been drawn.

The question we have to ask, and answer, is this: what is the best straight line that we can draw through these four data points? When you ask a group of people which is the best straight line, you usually get a variety of answers, including "the one which goes through the most data points" or "the one about which the data seem most evenly distributed". The linear regression method, also known as the method of least squares, finds the line for which the sum of the squares of the vertical deviations of all data points from that line is minimized. This is equivalent to minimizing the standard deviation of the y values about the line. We won't go through the derivation of the least squares method, but here are the expressions for the best values of m and b for n data points, where each sum runs over all of the points:

m = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)

b = (Σy - m Σx) / n

Use these expressions to calculate m and b for the data in the table above, and write the equation for the free energy change for this reaction as a function of temperature.

As you can see, this procedure becomes tedious if you have more than a few data points. I would suggest using Excel to perform the calculations: enter the x and y values, and then define formulas to compute the sums and the values of m and b (though for only four points it might be almost as fast to do it by hand). Report your values of the slope and intercept on the submission form.
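If you would rather script the sums than build them in a spreadsheet, the same arithmetic can be sketched in Python. This is only an illustration using the standard textbook least squares expressions for a straight line; the function name `least_squares_line` is made up for this example, and the data are the four points from the table above.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope and intercept from the sums over all data points.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

temperature = [100, 200, 300, 400]   # K
delta_g = [4.4, 6.8, 8.4, 10.8]      # kcal

m, b = least_squares_line(temperature, delta_g)
print(f"slope m = {m:.4f} kcal/K, intercept b = {b:.2f} kcal")
```

Running it gives you a way to check the values you worked out by hand or in Excel.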

Fortunately, there are faster ways of doing regression. Excel has several different forms of regression built in, including one that plots the best line on a graph together with its equation. Excel calls the regression line a trendline (after our analogy to a growth model). Once you have a graph on the screen, double click on the graph, click on one of the data points, and choose Trendline from the Insert menu. This brings up the trendline dialog box. The linear fit probably will be highlighted. Click on the Options tab and check the boxes to show the equation and r² (see below for an explanation of r²). If you wish, you can have the trendline extend beyond the data. Then go back to the Type tab and click on OK (as long as the linear trendline is highlighted). Do this for the data we have been working with. How do the values of m and b compare with what you calculated?

The parameter r² is the square of the correlation coefficient r (it is also called the coefficient of determination), and it is a measure of how closely the two variables are correlated. The closer r² gets to a value of 1, the "better the fit". The r² value is actually a property of the data set, not of the line that is drawn by the least squares criterion. Though r² is probably the most common fitting parameter associated with linear regression, it is more useful to have some idea of the standard deviation associated with the slope and intercept. Excel does not show this on the graph. (EasyPlot displays the standard error for these properties, which is an estimate of the standard deviation of the fitted slope or intercept. Press the button below to see a graph of our data plotted in EasyPlot and then fit with a straight line.)
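To see where the r² value comes from, the sketch below computes the correlation coefficient r directly from the data and squares it. It uses only the x and y values, never the fitted line, which is the sense in which r² is a property of the data set. The function name `r_squared` is an assumption made for this illustration.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

temperature = [100, 200, 300, 400]   # K
delta_g = [4.4, 6.8, 8.4, 10.8]      # kcal

r2 = r_squared(temperature, delta_g)
print(f"r squared = {r2:.4f}")
```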

You can obtain this information in Excel by using the Regression tool under Data Analysis in the Tools menu. Clicking on this brings up a dialog box. Normally you would want to fill out the input ranges, choose a cell to the right of your work for the output, and select a confidence level, leaving the rest of the form blank. (If the Labels box is checked, your input range must include a row of labels as the first row. If it does not, Excel treats the first row of data as labels and leaves it out of the fit!) Clicking OK will result in a large amount of output, similar to that shown here

While all of this information is statistically meaningful, we will focus on only selected portions of it. The r² value is given in the first group of information. The values of the slope and intercept are in the last group (referred to as the coefficients), together with their standard errors. There is also information here on confidence intervals at the level of confidence you chose in the dialog box.
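For reference, the standard errors of the slope and intercept can be sketched from the usual textbook formulas for a straight-line fit, which are based on the scatter of the points about the line with n - 2 degrees of freedom. This illustrates those formulas only; it is not a description of how Excel computes its output internally, and the function name `line_fit_with_errors` is hypothetical.

```python
import math

def line_fit_with_errors(xs, ys):
    """Return (m, b, se_m, se_b) for the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Scatter of the points about the fitted line, n - 2 degrees of freedom.
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    se_m = s / math.sqrt(sxx)
    se_b = s * math.sqrt(1 / n + mean_x ** 2 / sxx)
    return m, b, se_m, se_b

temperature = [100, 200, 300, 400]   # K
delta_g = [4.4, 6.8, 8.4, 10.8]      # kcal

m, b, se_m, se_b = line_fit_with_errors(temperature, delta_g)
print(f"m = {m:.4f} +/- {se_m:.4f}")
print(f"b = {b:.2f} +/- {se_b:.2f}")
```

The fit needs at least three points, since with only two the line passes through both and there is no scatter left to estimate.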

Clicking the button below will take you to an interactive spreadsheet that illustrates the least squares approach - this should be very useful in helping you see and understand what the method is based on.
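The idea behind the least squares approach can also be sketched in a few lines of Python: pick some candidate lines, add up the squared vertical deviations for each, and see which gives the smallest total. The two hand-drawn lines and the grid ranges below are arbitrary choices made for illustration (they are not from the lesson), and `sum_squared_deviations` is a made-up helper name.

```python
def sum_squared_deviations(m, b, xs, ys):
    """Sum of squared vertical deviations of the points from y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

temperature = [100, 200, 300, 400]   # K
delta_g = [4.4, 6.8, 8.4, 10.8]      # kcal

# Two lines drawn "by eye": one through the first two points,
# one through the middle two points.
for m, b in [(0.024, 2.0), (0.016, 3.6)]:
    print(m, b, sum_squared_deviations(m, b, temperature, delta_g))

# Crude grid search: slopes 0.0150 to 0.0250 in steps of 0.0001,
# intercepts 1.0 to 4.0 in steps of 0.1.
grid = [(i / 10000, j / 10) for i in range(150, 251) for j in range(10, 41)]
best_m, best_b = min(
    grid,
    key=lambda mb: sum_squared_deviations(mb[0], mb[1], temperature, delta_g),
)
print("smallest sum of squares at m =", best_m, "b =", best_b)
```

A grid search like this is far slower than the closed-form expressions, but it makes the "smallest sum of squares" criterion concrete: no line on the grid does better than the one the least squares method picks.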


Return to the volume/temperature data in lesson 1. Using Excel, plot volume versus temperature and perform a linear regression. What is the equation for the line? What is r²? Perform a regression analysis and determine the standard error for the slope and intercept. Submit your results to your laboratory instructor.

 

  • Flick Coleman wcoleman@wellesley.edu
  • Dept. of Chemistry
  • Date Created: Aug 13, 1997
  • Last Modified: Aug 2, 1998
  • Expires: Aug 1, 2000
  • copyright by W.F. Coleman - 1997