---
title: "SPC With The Nile River Dataset"
author: "Ben Jepson"
date: "2024-07-19"
categories: [SPC, time series]
execute:
  warning: false
format:
  html:
    code-fold: true
    code-summary: "code"
    code-tools: true
    fig-width: 10
    fig-height: 6
---
## A quick control chart
Some people are familiar with control charts for manufacturing, but they can be useful for tracking many things that occur over time. Something that occurs over time can be called a *process*.
With a control chart we use the data we have observed to establish the location (average) of the process and its expected variation (spread). We then use the chart to ask our data some questions (analysis!).
*Based on what we know so far* (observed data):
- is the process stable, or changing (moving up or down)?
- what range do we expect future values to be within?
- if/when something changes, how might we know?
I'm using R with the [ggQC](http://rcontrolcharts.com/) package, which makes control charts easy. For this project we'll use an individuals chart (XmR, aka ImR).
We'll load the built-in dataset *Nile*, which contains measurements of the annual flow of the Nile River.
```{r}
data(Nile)
# Convert the time series to a plain data frame: one row per year
nile_df <- data.frame(Year = as.numeric(time(Nile)), Flow = as.vector(Nile))
```
### Baseline chart and limits
The baseline chart is created using the data we have. The Nile dataset has 100 observations, but let's simulate a real-world SPC situation: we have some observations and we want to start tracking to see if something changes.
To set a baseline for our chart, we'll use the first 20 observations of the dataset. We'll calculate control limits, then monitor subsequent values to see if anything has changed.
```{r}
library(tidyverse)
library(ggQC)
theme_set(theme_classic())
```
```{r}
# Baseline: the first 20 observations (1871-1890)
nile_df[1:20, ] %>%
  ggplot(aes(x = Year, y = Flow)) +
  geom_point() +
  geom_line() +
  stat_QC(method = "XmR", auto.label = TRUE, linewidth = 1) +
  labs(title = "Nile first 20 years baseline XmR control chart",
       subtitle = "No signals detected in baseline period. We expect future values to be within the limits")
```
Control charts use rules ("run rules") to help the user decide if something has happened. When something unexpected is detected, this is called a *signal*. Any time a signal is detected, it should prompt investigation to see if something has really changed, and what can be done about it.
The most useful signal is *any single point outside the control limits* (the red lines in this control chart). Another useful rule is *8 or 9 consecutive points on one side of the average* (blue line). These rules are used to reduce overreactions - if a signal is detected, we can be pretty certain something unusual has happened or something has changed.
In this baseline period of the first 20 years, no signals are detected.
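If you're curious where the centerline and limits come from, here is a minimal base-R sketch of the conventional XmR calculation: the centerline is the mean, and the limits sit 2.66 average moving ranges on either side of it. The helper name `xmr_limits` is my own and is not part of ggQC, and I'm assuming `stat_QC()` follows this same convention. It does land on the baseline values I use for the monitoring chart (about 624, 1070.8 and 1517.7), to rounding.

```{r}
# Hypothetical helper (not part of ggQC): conventional XmR limits
xmr_limits <- function(x) {
  moving_range <- abs(diff(x))             # absolute change between consecutive points
  center <- mean(x)
  half_width <- 2.66 * mean(moving_range)  # 2.66 is approximately 3 / 1.128
  c(lcl = center - half_width, center = center, ucl = center + half_width)
}

data(Nile)
baseline <- as.vector(Nile)[1:20]          # 1871-1890
round(xmr_limits(baseline), 1)
```

The moving range is used instead of the overall standard deviation so that a shift inside the baseline period inflates the limits less.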
### Ongoing monitoring
Now that we have a baseline average (centerline) and upper/lower control limits, we can continue to monitor future values.
In the real world, we would get one new data point at a time, and compare it to the average, control limits, and previous points. Someday I'll put together a simulation where you can test this out and look point by point. For this quick article we'll skip ahead several years and see if anything interesting has happened yet.
```{r}
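# Monitor 1871-1910 against the limits from the 1871-1890 baseline
nile_df[1:40, ] %>%
  ggplot(aes(x = Year, y = Flow)) +
  geom_point() +
  geom_line() +
  # Upper limit, centerline and lower limit taken from the baseline chart
  geom_hline(yintercept = c(1517.7, 1070.8, 624), color = c("red", "blue", "red"), linewidth = 1) +
  scale_x_continuous(n.breaks = 20) +
  labs(title = "1899-1906 has 8 points below average - signal of change!")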
```
We see a sudden drop around 1899 - but is this evidence of a change in the process? By 1906 there are 8 points below the average, indicating that this probably is a real change. The points I chose to display here continue to be below the average - it looks like this shift lower is real.
At this point we could continue to plot more points or, since this is probably a real change, recenter the chart and compute new limits to reflect this new reality.
As a side note - 1890-1896 has 7 points *above* the average. I can't say if this is unusual or not relative to the baseline. There's a good reason 8 or 9 points on one side are used to detect change on a control chart - this kind of thing happens a lot even with randomly generated numbers. A run of 7 numbers on one side can happen easily.
That's not to say nothing happened in this period - we just don't know! We would need more information to conclude there was a change in 1890-1896.
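The run rule can also be checked point by point, the way we would in real life as each new year arrives. This is a rough base-R sketch, not something ggQC provides: `first_run_signal` is a name I made up, and it simply reports the year in which a run on one side of the centerline first reaches the required length.

```{r}
# Hypothetical helper: year in which a run on one side of the centerline
# first reaches `run_length` points (NA if it never does)
first_run_signal <- function(years, values, center, run_length = 8) {
  side <- sign(values - center)   # +1 above, -1 below, 0 exactly on the line
  streak <- 0
  for (i in seq_along(values)) {
    if (i > 1 && side[i] != 0 && side[i] == side[i - 1]) {
      streak <- streak + 1        # same side as the previous point
    } else {
      streak <- if (side[i] != 0) 1 else 0   # a new run starts (or resets on the line)
    }
    if (streak >= run_length) return(years[i])
  }
  NA
}

data(Nile)
years  <- as.numeric(time(Nile))[1:40]   # 1871-1910
flow   <- as.vector(Nile)[1:40]
center <- mean(flow[1:20])               # baseline centerline, about 1070.8

first_run_signal(years, flow, center)                  # 8-point rule
first_run_signal(years, flow, center, run_length = 7)  # a looser 7-point rule
```

With the 8-point rule the first signal is 1906, matching the chart. Loosening the rule to 7 points would have flagged 1896 instead, which is exactly the run I'm not willing to call a change.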
### New chart limits
We'll set up new limits based on 1899-1910. The new chart will be asking our questions relative to this time period.
```{r}
# New baseline: rows 29-40 are the years 1899-1910
nile_df[29:40, ] %>%
  ggplot(aes(x = Year, y = Flow)) +
  geom_point() +
  geom_line() +
  stat_QC(method = "XmR", auto.label = TRUE, linewidth = 1) +
  scale_x_continuous(n.breaks = 10) +
  labs(title = "Nile 1899 - 1910 new baseline XmR control chart",
       subtitle = "No signals detected in baseline period. We expect future values to be within the limits")
```
This new chart has a lower average (858.6 vs 1070.8) and new control limits. Based on this we expect future values to be within about 461 to 1256. The width of the old control limits was 1517.7 - 624 = *893.7*, and the width of the new limits is 1255.9 - 461.3 = *794.6*, only about 11% narrower. Though the average has shifted lower, the spread, or amount of space the process fills, is about the same!
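To double-check that comparison without reading numbers off the charts, here is a small base-R sketch. It assumes the conventional XmR constant of 2.66, so the distance between the upper and lower limits is 2 × 2.66 × the average moving range. `limit_width` is just an illustrative name of mine.

```{r}
data(Nile)
flow <- as.vector(Nile)

# Distance between the upper and lower XmR limits (conventional 2.66 constant)
limit_width <- function(x) 2 * 2.66 * mean(abs(diff(x)))

old_width <- limit_width(flow[1:20])    # 1871-1890 baseline
new_width <- limit_width(flow[29:40])   # 1899-1910 baseline

round(c(old = old_width, new = new_width, ratio = new_width / old_width), 2)
```

A ratio close to 1 means the process is varying about as much as before, just around a lower average.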
### Ongoing monitoring with new limits
Let's finish by displaying another 20 years:
```{r}
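# Monitor 1899-1930 against the limits from the 1899-1910 baseline
nile_df[29:60, ] %>%
  ggplot(aes(x = Year, y = Flow)) +
  geom_point() +
  geom_line() +
  # Upper limit, centerline and lower limit taken from the new baseline chart
  geom_hline(yintercept = c(1255.9, 858.6, 461.3), color = c("red", "blue", "red"), linewidth = 1) +
  scale_x_continuous(n.breaks = 20) +
  labs(title = "1913 signal - a point outside the limits")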
```
In 1913 we observe another signal: this point is outside the lower control limit. This indicates either an unusual year, or that the process average has shifted lower. It looks like there might be a period of decrease starting around 1910, but we're not sure - these patterns really can show up frequently, and we would need more information to know if that was a period of decrease. My hunch is that it probably was decreasing, and I feel the point outside the limits in 1913 provides more evidence.
In any case, the process was back inside the limits in 1914.
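The "single point outside the limits" rule is easy to check without a chart. This base-R sketch uses the new limits quoted above and lists any of the 20 monitored years (1911-1930) that fall outside them. The variable names are only illustrative.

```{r}
data(Nile)
years <- as.numeric(time(Nile))
flow  <- as.vector(Nile)

# Limits from the 1899-1910 baseline, as quoted above
lcl <- 461.3
ucl <- 1255.9

monitored <- 41:60   # 1911-1930, the points after the new baseline
outside <- flow[monitored] < lcl | flow[monitored] > ucl

years[monitored][outside]   # years with a point outside the limits
```

Only 1913 comes back, and it sits just under the lower limit, which is part of why I'm cautious about reading too much into it.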
### Conclusions for now
Control charts are useful for tracking things that happen over time. The control limits are calculated to reduce false positives and overreactions - if a signal is detected, even with all the variation observed, we can be pretty sure it is real. The charts help put data in context and can be used to help make decisions.
Have you thought of something you would like to track over time that this might be useful for? Weight loss/fitness tracking, sales per quarter?