Lesson 5 - Confidence Intervals, Significance Testing and Criteria for the Rejection of Data

 

OK, you have collected a set of data, and have calculated various measures such as the mean and standard deviation. The question now is, how meaningful are the data? There are many ways to think about this question, and we will discuss only a few. One way would be to ask how confident are you that the true value of the property you are measuring lies within some range defined by your data. In the previous lesson you calculated that for a normally distributed data set, 95% of values will lie within approximately 2 (1.96, to be more precise) standard deviations of the mean. While that is a useful number to keep in mind, it is based on a population (or large sample >30), and not a small sample of data. As has been mentioned earlier, it is far more likely that you will have fewer than 30 data points. It seems logical that you would have to widen the range in which you would expect to find the true value (with a given degree of confidence) as a consequence of having fewer data points. There are several ways to address this issue.
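As a quick check of the factor of 1.96 quoted above, the two-sided 95% multiplier for a normal distribution can be computed directly. This is only a minimal sketch, assuming Python 3.8 or later (for the standard-library `statistics.NormalDist` class); the variable names are illustrative and are not part of the lesson.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# A two-sided 95% range leaves 2.5% in each tail, so we need the
# 97.5th percentile of the standard normal distribution.
z_95 = standard_normal.inv_cdf(0.975)

# Fraction of values lying within z_95 standard deviations of the mean.
coverage = standard_normal.cdf(z_95) - standard_normal.cdf(-z_95)

print(f"z for 95%: {z_95:.2f}")      # prints 1.96
print(f"coverage:  {coverage:.3f}")  # prints 0.950
```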

Student's t-test

This approach is named after the statistician W. S. Gosset, who wrote under the pseudonym Student. Gosset defined a distribution which approaches a normal distribution when the number of samples, n, is large, but becomes wider than a normal curve as n decreases. A confidence interval for a true mean is then defined as

$$\mu = \bar{x} \pm \frac{t\,s}{\sqrt{n}}$$

where mu ($\mu$) is the true value, xbar ($\bar{x}$) the calculated mean, s the calculated standard deviation (of the sample), n the number of data points and t a value defined by Gosset.

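We will use t values calculated using the computer algebra system Mathcad, but you can also find tables of t values in any statistics text, and can calculate them in Excel. Here are two-sided t values for the 90, 95 and 99% confidence levels, for sample sizes ranging from 2 to 11. Note that they are given for n-1, not n.

n-1     90%      95%      99%
 1     6.314   12.706   63.657
 2     2.920    4.303    9.925
 3     2.353    3.182    5.841
 4     2.132    2.776    4.604
 5     2.015    2.571    4.032
 6     1.943    2.447    3.707
 7     1.895    2.365    3.499
 8     1.860    2.306    3.355
 9     1.833    2.262    3.250
10     1.812    2.228    3.169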

How might you use the t-test? Suppose that you had measured the wavelength of maximum absorption (in nm) of a particular substance 10 times, and found the following values:

205.0   205.0   205.4   205.2   205.0
204.9   205.1   204.7   204.9   204.9

Calculate the mean and standard deviation of this sample. Using the t tables, calculate the 95% confidence interval for the true value. (Remember that you will want to use the t value for n-1 = 9 at the 95% level.) Submit your mean, standard deviation and confidence interval to your laboratory instructor.
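If you would like to check your hand calculation, the confidence-interval recipe can be sketched in a few lines of Python. This is an illustrative sketch rather than the method used in the lesson: the function name `confidence_interval` and the dictionary `T_95` are assumptions introduced here, and the t values are the two-sided 95% entries from standard t tables.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided Student's t values at the 95% confidence level, indexed by
# degrees of freedom (n - 1). Values are taken from standard t tables.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def confidence_interval(data, t_table):
    """Return (xbar, s, half_width) for mu = xbar +/- t*s/sqrt(n)."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)         # sample standard deviation (n - 1 in the denominator)
    t = t_table[n - 1]      # t is looked up for n - 1, not n
    half_width = t * s / sqrt(n)
    return xbar, s, half_width


# The ten wavelength measurements (in nm) listed above.
wavelengths = [205.0, 205.0, 205.4, 205.2, 205.0,
               204.9, 205.1, 204.7, 204.9, 204.9]

xbar, s, half_width = confidence_interval(wavelengths, T_95)
print(f"mean = {xbar:.2f} nm, s = {s:.2f} nm")
print(f"95% confidence interval: {xbar:.2f} +/- {half_width:.2f} nm")
```

Note that `statistics.stdev` is the sample standard deviation, which is the quantity s in the confidence-interval expression; `statistics.pstdev` (the population form) would give a slightly smaller value.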

Another application of the t-test is to compare your data to a known "true value" and ask whether it is likely that there is some systematic error in your measurements. To do this you calculate a t value using the expression

$$t = \frac{|\bar{x} - \mu|\,\sqrt{n}}{s}$$

where the symbols have their customary meanings. Compare this value of t to the theoretical value (from tables) for your chosen confidence level and sample size. If the computed value is larger than the theoretical one, then you might suspect a systematic error (of course, how justified you are in doing this depends on the confidence level you have chosen). Calculate a value of t for the data above, assuming that the true value of the wavelength is 204.6 nm. Compare it to the theoretical t at the 99% confidence level. Is your t larger or smaller? If it is larger, then you would say that you are 99% confident that there is a systematic error in your measurements, since the "true value" lies outside the 99% confidence interval. What do you conclude for this experiment? Submit your calculated t value and conclusion (in complete sentences that explain your conclusion) to your laboratory instructor.
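The comparison against a "true value" can be sketched the same way. The sketch below assumes the usual one-sample form t = |xbar - mu| sqrt(n) / s; the function name `t_versus_true_value` is hypothetical, and the tabulated t (3.250 for 9 degrees of freedom at the 99% level) is the standard two-sided table value.

```python
from math import sqrt
from statistics import mean, stdev


def t_versus_true_value(data, true_value):
    """Return |xbar - mu| * sqrt(n) / s for a sample and an accepted value."""
    n = len(data)
    return abs(mean(data) - true_value) * sqrt(n) / stdev(data)


# The ten wavelength measurements (in nm) from the example above.
wavelengths = [205.0, 205.0, 205.4, 205.2, 205.0,
               204.9, 205.1, 204.7, 204.9, 204.9]

# Two-sided 99% t value for n - 1 = 9, from standard t tables.
T_CRIT_99_DF9 = 3.250

t_calc = t_versus_true_value(wavelengths, 204.6)
suspect_systematic_error = t_calc > T_CRIT_99_DF9

print(f"calculated t = {t_calc:.2f}, tabulated t = {T_CRIT_99_DF9}")
print("systematic error suspected" if suspect_systematic_error
      else "no evidence of systematic error at this level")
```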

 

The t-test and Hypothesis Testing - the Comparison of Two Means

You have been hired to oversee the Institutional Technology Department of a major liberal arts college. One of your tasks is to advise individuals on the purchase of various forms of electronic technology. In looking at one particular component, you have data from a random sample of 10 units of one company's version of that component, which show a mean time to failure (MTTF) of 40000 hr with a standard deviation of 2000 hr, and a sample of 8 units of another manufacturer's version of the same component, which show an MTTF of 43000 hr with a standard deviation of 2500 hr. The question that you wish to answer is whether or not these two brands actually differ statistically, at a particular level of confidence.

In statistics, this is known as hypothesis testing. The so-called null hypothesis in this case would be that the two means and, by implication, the standard deviations are statistically indistinguishable from one another. (In all of this you are assuming that the lifetimes of the components are normally distributed.) You can then use the t-test to decide whether you should accept or reject the null hypothesis. Rejecting the null hypothesis is the same as saying that the two means are different: one brand of component has a longer MTTF than the other.

This hypothesis is tested in the following manner (let's do the test at the 99% confidence level).

First, a theoretical t is found. It is derived from the 18 observations in the pooled sample, but it is looked up for 16 (not 17) degrees of freedom, because each of the data sets has only n-1 degrees of freedom. (If you know the mean for a set of data, you need only know n-1 of the actual values, since the mean will determine the remaining value; t is based on the number of degrees of freedom of the data.) You could write that degrees of freedom = (n1 - 1) + (n2 - 1) = n1 + n2 - 2, where n1 and n2 are the numbers of data in the two sets. For this case, at the 99% level, t from tables is 2.921.

Now we need to calculate a t value from the data to see whether it falls within the range defined by the theoretical value we just found. To do this we need to calculate what is known as the pooled standard deviation. The complete derivation is not given here, but in rough outline it is as follows. The variance (the square of the standard deviation) of the pooled data set is given by

$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$

where s1 and s2 are the standard deviations of the two data sets. Assuming that both data sets have the same standard deviation (or variance), you can calculate the t value from

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
For our problem, t = -2.83 (verify that you get the same thing; the variance of the pooled data works out to 4984375). Its magnitude is smaller than the theoretical value of 2.921, so it falls within our critical range. This means that, at the 99% confidence level, the two means are statistically indistinguishable, so we accept the null hypothesis, and you could base your purchasing decision on factors such as cost. There are also statistical tests that would allow us to test our assumption that the two standard deviations are the same, but we will not go into them here.
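The numbers quoted above can be checked with a short script. This sketch assumes the standard pooled-variance form, s_p^2 = ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2) and t = (xbar1 - xbar2) / (s_p sqrt(1/n1 + 1/n2)), which reproduces the values given in the text; the function name `pooled_t` is an illustrative choice.

```python
from math import sqrt


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (pooled variance, t, degrees of freedom) from summary statistics."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / dof
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, dof


# MTTF data from the text: 10 units of brand 1 and 8 units of brand 2.
pooled_var, t_calc, dof = pooled_t(40000, 2000, 10, 43000, 2500, 8)

# Theoretical t at the 99% level for 16 degrees of freedom (from the text).
T_CRIT_99_DF16 = 2.921

# The null hypothesis is rejected only if |t| exceeds the tabulated value.
means_differ = abs(t_calc) > T_CRIT_99_DF16

print(pooled_var)          # prints 4984375.0
print(round(t_calc, 2))    # prints -2.83
print(dof, means_differ)   # prints 16 False
```

The sign of t only reflects which brand was labeled "1"; the comparison with the tabulated value uses the magnitude.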

 

Here is a problem where you can apply the various uses of the t-test described above. You and your laboratory partner have each synthesized what you think is the same compound. You each have performed five replicate experiments to determine the molecular weight of the compound. The data are shown here:

               run 1   run 2   run 3   run 4   run 5
you            154.2   148.0   153.5   152.9   154.5
your partner   149.6   152.0   148.4   154.8   151.2

Start by considering these as two data sets. Calculate the mean, standard deviation and the 95% and 99% confidence intervals for the mean of each set. Now formulate a null hypothesis about the means of the two samples and test this hypothesis at the 95% and 99% levels. What do you conclude? Submit all of your results to your laboratory instructor.
 

The Rejection of Data

We are now entering murky waters. How do we decide if a particular data point which appears to be an "outlier" should be retained or rejected? This is a point of some controversy. If you have a large number of data purporting to measure the same thing, then you can use the t-test to determine whether or not the suspect data point lies outside of the range calculated at any confidence level. Also, with a large number of data, an outlier will have a relatively negligible effect on the mean and standard deviation. The issue is more critical when you have, as you typically do in science, just a few measurements. There are several empirical schemes for testing outliers in small data sets, and we'll only look at one here, the Q test. Application of the Q test is relatively straightforward. You take the ratio of the difference between the suspect data point and its nearest neighbor to the total range of the data, that is, Q = |suspect value - nearest neighbor| / (largest value - smallest value). This is your calculated Q. Now compare that to a tabulated value. Here are the 90% confidence level values for 3-10 measurements.

n         3      4      5      6      7      8      9      10
Q(90%)    0.94   0.76   0.64   0.56   0.51   0.47   0.44   0.41

If your calculated Q is greater than the tabular value then, by this test, you can reject the suspect point. The Q test is relatively stringent. Note that for 5 or fewer values, the suspect point must lie quite far from the others. In fact, for 3 or 4 data points it is perhaps only valid to reject a value if you can establish that there was determinate error associated with that value which was not present for the others ("I spilled half of my sample on the lab bench."). For 3-5 data it is probably best, when there appears to be a suspect point, to repeat the experiment until you have a more statistically valid sample size.
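The Q test described above is easy to script. The following is a minimal sketch, not a definitive implementation: the function name `q_test` and its `suspect` argument are assumptions made for illustration, and the tabulated values are the 90% entries from the table above. It is applied here to the ten wavelength measurements from the t-test example, where the highest value (205.4) might look suspicious.

```python
# Tabulated Q values at the 90% confidence level, keyed by the number of
# measurements n (taken from the table above).
Q_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56,
        7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}


def q_test(data, suspect, q_table=Q_90):
    """Return (calculated Q, tabulated Q, reject?) for the lowest or highest value.

    suspect is "low" to test the smallest value or "high" to test the largest.
    """
    ordered = sorted(data)
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    if suspect == "low":
        gap = ordered[1] - ordered[0]      # distance to nearest neighbor
    elif suspect == "high":
        gap = ordered[-1] - ordered[-2]
    else:
        raise ValueError("suspect must be 'low' or 'high'")
    q_calc = gap / data_range
    q_tab = q_table[len(ordered)]          # KeyError if n is outside 3-10
    return q_calc, q_tab, q_calc > q_tab


# The ten wavelength measurements (in nm) from the t-test example.
wavelengths = [205.0, 205.0, 205.4, 205.2, 205.0,
               204.9, 205.1, 204.7, 204.9, 204.9]

q_calc, q_tab, reject = q_test(wavelengths, "high")
print(f"Q = {q_calc:.2f}, tabulated Q = {q_tab}, reject: {reject}")
# prints: Q = 0.29, tabulated Q = 0.41, reject: False
```

Only the most extreme value at either end can be tested this way, which is why the function takes "low" or "high" rather than an arbitrary data point.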

Go back to the data on the molecular weight of the compound you and your lab partner synthesized. Look at your data. Does any value seem out of line? If so, apply the Q test and see if you are justified in rejecting that value. Now pool your data with that of your partner (why can you do this?). Can you reject any data now, using the Q test at the 90% confidence level?

One of the reasons that rejection of data is so controversial for small data sets is that you are trying to balance between two types of error, what is known as error of the first kind - the rejection of valid data points - and error of the second kind - retention of invalid data.

 

 

  • Flick Coleman wcoleman@wellesley.edu
  • Dept. of Chemistry
  • Date Created: Aug 12, 1997
  • Last Modified: Aug 2, 1998
  • Expires: Aug 1, 2000
  • copyright by W.F. Coleman - 1997