SPC With The Nile River Dataset

SPC
time series
Author

Ben Jepson

Published

July 19, 2024

A quick control chart

Some people are familiar with control charts for manufacturing, but they can be useful for tracking many things that occur over time. Something that occurs over time can be called a process.

With a control chart we use data we have observed to establish the location (average) of the process and its expected variation (spread). We use the chart to ask our data some questions (analysis!).

Based on what we know so far (observed data):

  • is the process stable, or changing (moving up or down)?

  • what range do we expect future values to be within?

  • if/when something changes, how might we know?

I’m using R with the ggQC package, which makes control charts easy. For this project we’ll use an individuals chart (XmR, also known as ImR).

We’ll load the built-in dataset Nile, which contains measurements of the annual flow of the Nile river at Aswan from 1871 to 1970.

data(Nile)

# Convert the ts time index to plain numbers so ggplot2 treats Year as continuous
nile_df <- data.frame(Year = as.numeric(time(Nile)), Flow = as.vector(Nile))

Baseline chart and limits

The baseline chart is created using the data we have. The Nile dataset has 100 observations, but let’s try to simulate a real-world SPC situation - we have some observations and we want to start tracking to see if something changes.

To set a baseline for our chart, we’ll use the first 20 observations of the dataset (1871-1890). We’ll calculate control limits from them, then monitor subsequent values to see if anything has changed.

library(tidyverse)
library(ggQC)
theme_set(theme_classic())
nile_df[1:20, ] %>% 
    ggplot(aes(x = Year, y = Flow))+
    geom_point()+
    geom_line()+
    stat_QC(method = "XmR", auto.label = TRUE, linewidth = 1)+
    labs(title = "Nile first 20 years baseline XmR control chart", 
         subtitle = "No signals detected in baseline period. We expect future values to be within the limits")
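
To see where the lines on this chart come from, here is a minimal base R sketch of the usual XmR arithmetic: the centerline is the mean of the baseline values, and the limits sit 2.66 average moving ranges on either side. It assumes ggQC follows this standard recipe, and the variable names (`baseline`, `moving_range`, `center`, `ucl`, `lcl`) are my own.

```r
# Baseline: the first 20 annual flows (1871-1890) from the built-in Nile series
baseline <- as.vector(Nile)[1:20]

# Moving range: absolute difference between each value and the one before it
moving_range <- abs(diff(baseline))

# Centerline and limits; 2.66 is the standard XmR constant (3 / 1.128)
center <- mean(baseline)
ucl <- center + 2.66 * mean(moving_range)
lcl <- center - 2.66 * mean(moving_range)

round(c(LCL = lcl, Center = center, UCL = ucl), 1)
```

If the assumption holds, the result should land close to the labels that stat_QC prints on the chart.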

Control charts use rules (“run rules”) to help the user decide if something has happened. When something unexpected is detected, this is called a signal. Anytime a signal is detected, it should prompt investigation to see if something has really changed, and what can be done about it.

The most useful signal is any single point outside the control limits (the red lines in this control chart). Another useful rule is 8 or 9 consecutive points on one side of the average (blue line). These rules are used to reduce overreactions - if a signal is detected, we can be pretty certain something unusual has happened or something has changed.

In this baseline period of the first 20 years, no signals are detected.
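
Both rules can also be checked without looking at the chart. The sketch below is one way to do it in base R, using the rounded baseline values from this article (1517.7, 1070.8, 624) and a run length of 8. The names `outside_limits` and `longest_run` are hypothetical, not part of ggQC.

```r
flow <- as.vector(Nile)[1:20]   # baseline period, 1871-1890

# Rounded baseline limits and centerline from this article
ucl <- 1517.7; center <- 1070.8; lcl <- 624

# Rule 1: any single point outside the control limits
outside_limits <- flow > ucl | flow < lcl

# Rule 2: a long run on one side of the centerline.
# rle() collapses the above/below sequence into runs and their lengths.
side <- ifelse(flow > center, "above", "below")
longest_run <- max(rle(side)$lengths)

any(outside_limits)   # TRUE would mean a rule 1 signal
longest_run >= 8      # TRUE would mean a rule 2 signal
```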

Ongoing monitoring

Now that we have a baseline average (centerline) and upper/lower control limits, we can continue to monitor future values.

In the real world, we would get one new data point at a time, and compare it to the average, control limits, and previous points. Someday I’ll put together a simulation where you can test this out and look point by point. For this quick article we’ll skip ahead several years and see if anything interesting has happened yet.

nile_df[1:40,] %>% 
    ggplot(aes(x = Year, y = Flow))+
    geom_point()+
    geom_line()+
    geom_hline(yintercept = c(1517.7, 1070.8, 624), color = c("red", "blue", "red"), linewidth = 1)+
    scale_x_continuous(n.breaks = 20)+
    labs(title = "1899-1906 has 8 points below average - signal of change!")

We see a sudden drop around 1899 - but is this evidence of a change in the process? By 1906 there are 8 points below the average, indicating that this probably is a real change. The points I chose to display here continue to be below the average - it looks like this shift lower is real.
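
As a rough stand-in for watching the data arrive one point at a time, the loop below feeds in 1891-1910 year by year and stops at the first signal. It is a sketch under assumptions: the limits are the rounded baseline values from the chart above, the run rule uses 8 points, and the run counter starts fresh after the baseline. The names `run_side`, `run_len` and `first_signal` are my own.

```r
years <- as.numeric(time(Nile))
flow  <- as.vector(Nile)

# Baseline limits from the first 20 years, as used in the chart above
ucl <- 1517.7; center <- 1070.8; lcl <- 624

run_side <- ""      # side of the centerline for the current run
run_len  <- 0       # number of consecutive points on that side
first_signal <- NA  # year of the first signal, if any

# Feed in 1891-1910 one year at a time, as if each value had just arrived
for (i in 21:40) {
  side <- if (flow[i] > center) "above" else "below"
  run_len  <- if (side == run_side) run_len + 1 else 1
  run_side <- side

  # Signal: a point outside the limits, or 8 in a row on one side
  if (flow[i] > ucl || flow[i] < lcl || run_len >= 8) {
    first_signal <- years[i]
    break
  }
}

first_signal
```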

At this point we could continue to plot more points or, since this is probably a real change, we could recenter the chart and compute new limits to reflect this new reality.

As a side note - 1890-1896 has 7 points above the average. I can’t say if this is unusual or not relative to the baseline. There’s a good reason 8 or 9 points on one side are used to detect change on a control chart - this kind of thing happens a lot even with randomly generated numbers. A run of 7 numbers on one side can happen easily.

That’s not to say nothing happened in this period - we just don’t know! We would need more information to conclude there was a change in 1890-1896.
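
One way to get a feel for how often long runs appear by chance is to simulate stable processes and count them. This is only a sketch: the series length of 20, the normal distribution, the number of simulations and the seed are all assumptions, and `longest_run` is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Longest run on one side of the mean in a series of values
longest_run <- function(x) max(rle(x > mean(x))$lengths)

# 10,000 simulated "stable" processes of 20 points each
runs <- replicate(10000, longest_run(rnorm(20)))

mean(runs >= 7)  # share of stable series with a run of 7 or more
mean(runs >= 8)  # share of stable series with a run of 8 or more
```

Comparing the two shares shows how much the false alarm rate drops when the rule asks for one more point.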

New chart limits

We’ll set up new limits based on 1899-1910. The new chart will be asking our questions relative to this time period.

nile_df[29:40, ] %>% 
    ggplot(aes(x = Year, y = Flow))+
    geom_point()+
    geom_line()+
    stat_QC(method = "XmR", auto.label = TRUE, linewidth = 1)+
    scale_x_continuous(n.breaks = 10)+
    labs(title = "Nile 1899 - 1910 new baseline XmR control chart", 
         subtitle = "No signals detected in baseline period. We expect future values to be within the limits")

This new chart has a lower average (858.6 vs 1070.8) and new control limits. Based on this we expect future values to be within about 461 to 1256. The width of the old control limits was 1517.7 - 624 = 893.7; the width of the new limits is 1255.9 - 461.3 = 794.6, about 11% narrower. Though the average has shifted lower, the spread, or amount of space the process fills, is roughly the same!
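
The comparison between the two baselines can be reproduced with a small helper. This sketch assumes the standard XmR formula (mean plus or minus 2.66 times the average moving range); `xmr_limits` is a hypothetical function, not something ggQC provides.

```r
# Hypothetical helper: XmR centerline, limits and limit width for a vector
xmr_limits <- function(x) {
  center <- mean(x)
  spread <- 2.66 * mean(abs(diff(x)))  # 2.66 = standard XmR constant
  c(LCL = center - spread, Center = center, UCL = center + spread,
    Width = 2 * spread)
}

flow <- as.vector(Nile)
old <- xmr_limits(flow[1:20])    # 1871-1890 baseline
new <- xmr_limits(flow[29:40])   # 1899-1910 baseline

round(rbind(old, new), 1)
new["Width"] / old["Width"]      # ratio of new to old limit width
```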

Ongoing monitoring with new limits

Let’s finish by displaying another 20 years:

nile_df[29:60,] %>% 
    ggplot(aes(x = Year, y = Flow))+
    geom_point()+
    geom_line()+
    geom_hline(yintercept = c(1255.9, 858.6, 461.3), color = c("red", "blue", "red"), linewidth = 1)+
    scale_x_continuous(n.breaks = 20)+
    labs(title = "1913 signal - a point outside the limits")

In 1913 we observe another signal: this point is below the lower control limit. This indicates either an unusual year or that the process average has shifted lower. It looks like there might be a period of decrease starting around 1910, but we can’t be sure. Patterns like this show up frequently by chance, and we would need more information to know whether it was a real decrease. My hunch is that it probably was, and the point outside the limits in 1913 adds to the evidence.

In any case, the process was back higher in 1914.
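
The same check can be done in code by listing any years in the monitoring period that fall outside the new limits. The sketch uses the rounded limits drawn on the chart above, and `monitor` and `outside` are names I chose for illustration.

```r
years <- as.numeric(time(Nile))
flow  <- as.vector(Nile)

# New limits from the 1899-1910 baseline, as drawn on the chart above
ucl <- 1255.9; lcl <- 461.3

# Look only at the 20 years after the new baseline (1911-1930)
monitor <- years >= 1911 & years <= 1930
outside <- monitor & (flow > ucl | flow < lcl)

data.frame(Year = years[outside], Flow = flow[outside])
```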

Conclusions for now

Control charts are useful for tracking things that happen over time. The control limits are calculated to reduce false positives and overreactions - if a signal is detected, even with all the variation observed, we can be pretty sure it is real. The charts help put data in context and can be used to help make decisions.

Have you thought of something you would like to track over time that this might be useful for? Weight loss/fitness tracking, sales per quarter?