# Information Overload

Analysis is about converting data into information to help make decisions. But how much information do we need? This article concerns the problem of information overload.

If we pour water into a bucket, there comes a time when the water overflows. If the bucket of water is used to put out a fire, then the excess water serves no purpose for that task. It goes to waste.

Information overload is when excessive and useless information has been provided for the objective. This is best demonstrated with an example.

Consider a process performance analysis, commonly used in Quality Assurance, but applicable in a wide variety of other areas, such as blood pressure monitoring. The following output has been observed on several process performance reports.

**Table I**

| Summary | Central Tendency | Dispersion | Conformance | Indexes | Significance | Distribution |
|---|---|---|---|---|---|---|
| Total Count | Mean | Standard Deviation | Expected Total | Cp | Anderson Darling | Skewness |
| Subgroups | Mode | Variance | Expected Less Than | Cpk | KS | Kurtosis |
| Subgroup Size | Median | Coefficient of Variation | Expected Greater Than | Cpm | Chi Squared Test | ±2 sd |
| Maximum | Harmonic Mean | MAD | Actual Total | Pp | R Squared | ±3 sd |
| Minimum | Geometric Mean | Range | Actual Less Than | PPK | Confidence Intervals on Mean | 1 |
| Within Subgroup Sd | | | Actual Greater Than | PPM | Confidence Interval on Sd | |
| Between Subgroup Sd | | | Z-US | Z-Bench | Confidence Interval on cp, cpk, cm, cr | |
| Lower Quartile | | | Z-LS | Cr | Confidence Intervals on pp, ppk, ppr, ppm | |
| Upper Quartile | | | Z-Target | Pr | P Values | |
| Quartile Range | | | | CPL | 1 | 1 |
| | | | | PPL | | |
| Sub-Group Range | | | | PPU | | |
| | | | | CPU | | |

Some analysis outputs report even more statistics. This volume of information may appeal to those in an organization who like as much information as possible, but true specialists would question the need for all of it. The risk is that everyday users become confused and overwhelmed, and miss the important output of the performance analysis.

Information overload can be defined as providing unnecessary information that does not perform the task it is meant to perform. The statistics reported above may be impressive in quantity, but are they necessary? What purpose do they fulfill? Is this purpose consistent with the objective? What is the minimum information that should be reported? What is the maximum? Unfortunately, there are no formulae to determine the optimum balance. Each situation needs to be evaluated on its own merits, taking many factors into consideration. The process of deciding what information is useful and what is not is discussed below using the above example.

In statistics there is a concept called ‘sufficiency’. A statistic is sufficient for a parameter if no other statistic that can be calculated from the same sample provides additional information about the value of that parameter. For example, for normally distributed data with a known standard deviation, the sample average is a sufficient statistic for the population mean. No other statistic from the same data provides additional information about that mean.

This principle can be partially applied to determine whether the information is sufficient for the objective. Any additional information is information overload. In this case we are not talking about a statistic but about information to meet an objective, which can include a chart. (Defining the objective adds an element of complication, because there can be many objectives, as will become apparent.)

At first glance, the objective of a process performance study is to see how well the current process performs with regard to producing non-conformances. What is the minimum information required to meet this objective?

A histogram, such as the one shown below, will provide information on performance. At a glance it is possible to determine how well the process is performing. Is it sufficient or insufficient?

Figure 1: Histogram example

If the only objective is to see how well the process is performing relative to specifications, and if the number of samples is large, then the histogram with specification limits drawn is sufficient. No other statistical information is required. Any additional information will result in information overload.
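As a rough illustration of a histogram read against specification limits, the sketch below prints a text histogram and flags the bins that fall outside the limits. The function name `text_histogram`, the measurements, the limits and the bin width are all hypothetical values chosen for the example, not taken from any real report.

```python
from collections import Counter


def text_histogram(values, lsl, usl, bin_width):
    """Return histogram lines, flagging bins that lie outside the specification limits."""
    # Assign each value to the bin whose lower edge is a multiple of bin_width.
    counts = Counter(int(v // bin_width) for v in values)
    lines = []
    for index in range(min(counts), max(counts) + 1):
        low = index * bin_width
        high = low + bin_width
        # A bin is flagged when it lies wholly below the LSL or starts at/above the USL.
        flag = "  <- out of spec" if high <= lsl or low >= usl else ""
        lines.append(f"{low:6.2f} to {high:6.2f} | {'#' * counts[index]}{flag}")
    return lines


# Hypothetical measurements and specification limits, for illustration only.
measurements = [9.6, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 10.7]
for line in text_histogram(measurements, lsl=9.5, usl=10.5, bin_width=0.25):
    print(line)
```

With a large sample, a display of this kind already shows where the process sits relative to the limits, which is the point made above.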

But if the number of samples is small then the information is insufficient. The reason is sampling error. Sampling error can inflate or deflate the estimate of the true performance in terms of non-conformance. We hence need confidence intervals on the proportion of defectives to place perspective on the outcome in terms of reliability.
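To make the effect of sampling error concrete, here is a minimal sketch of a confidence interval on the proportion of defectives. The article does not say which interval method is intended; the Wilson score interval is used here as one common choice, and the function name `wilson_interval` and the sample counts are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(defectives, n, confidence=0.95):
    """Wilson score interval for a proportion (an assumed choice of method)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    # Two-sided standard normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = defectives / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical result: zero defectives observed in a small and in a large sample.
for sample_size in (30, 3000):
    low, high = wilson_interval(0, sample_size)
    print(f"n={sample_size}: {100 * low:.2f}% to {100 * high:.2f}% defective")
```

The same observed result of zero defectives supports a much wider range of true defect rates when the sample is small, which is why the interval, rather than the point estimate alone, is needed in that case.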

If sample size is small, then we do not even need a histogram. The confidence intervals are sufficient to meet the objective of knowing how the process has performed in relation to non-conformance. No other information or statistic will answer the question better.

If the objective is not only to know how the process performed in terms of non-conformance but also to gain insight into why, then a histogram is also required. The histogram will show whether the non-conformance is due to bad centering or bad process capability. It will also show whether zero non-conformance may be due to operators flinching. Flinching is when operators falsify out-of-specification results to avoid the need to adjust the process.

So, a process histogram and confidence intervals on non-conformances are sufficient for meeting the objective of knowing how the process performed in terms of non-conformance, and to obtain an insight into why. No additional information is required to provide this information. Additional information would be information overload.

If the objective is additionally to obtain estimates of non-conformance, for example to determine expected warranty costs, then confidence intervals, though important, are inappropriate. Single-point sample estimates will be required for accounting reasons. But the estimate based on observed non-conformances is insufficient. If the estimate is zero and the sample size is low, then this result may be misleading. Additional information is required, and this can be obtained from the theoretical distribution. By comparing the observed and theoretical estimates, an average can be obtained.

The theoretical estimate is also not sufficient. The result depends on the fitted distribution which is affected by sample size. Both observed and theoretical estimates are required to obtain a better overall estimate.
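The difference between the two estimates can be sketched as follows. This example assumes, purely for illustration, that a normal distribution fitted from the sample mean and standard deviation is the theoretical distribution; the function name `nonconformance_estimates`, the data and the limits are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def nonconformance_estimates(values, lsl, usl):
    """Return (observed, theoretical) proportions outside the specification limits."""
    # Observed estimate: the fraction of measurements actually outside the limits.
    observed = sum(1 for v in values if v < lsl or v > usl) / len(values)
    # Theoretical estimate: tail areas of a fitted normal distribution (an assumption).
    fitted = NormalDist(mean(values), stdev(values))
    theoretical = fitted.cdf(lsl) + (1 - fitted.cdf(usl))
    return observed, theoretical


# Hypothetical small sample in which no measurement is out of specification.
measurements = [9.6, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 10.4]
observed, theoretical = nonconformance_estimates(measurements, lsl=9.5, usl=10.5)
print(f"observed:    {100 * observed:.2f}%")
print(f"theoretical: {100 * theoretical:.2f}%")
```

Here the observed estimate is zero while the fitted distribution still places some probability outside the limits, so reporting only one of the two would give an incomplete picture.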

Thus, if the objective is to determine the performance with estimates of non-conformance and the sample size is small, then the histogram, confidence intervals on observed non-conformance, and estimates of both theoretical and observed non-conformance are required. What is not required are the Zu and ZL values, which are intermediate output used when calculating expected non-conformance from the assumed theoretical distribution. Similarly, PPL and PPU are only intermediate information used to obtain the PPK value. Intermediate output serves no purpose and only adds to information overload.
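The role of the intermediate values can be seen in a short sketch using the conventional formulas, in which the Z values are distances from the mean to each limit in standard deviations, PPU and PPL are those distances divided by three, and PPK is the smaller of the two. The function name `ppk_with_intermediates` and the data are hypothetical, and the overall sample standard deviation is assumed as the spread estimate.

```python
from statistics import mean, stdev


def ppk_with_intermediates(values, lsl, usl):
    """Compute Ppk and the intermediate values it is built from."""
    centre = mean(values)
    spread = stdev(values)  # overall sample standard deviation
    z_upper = (usl - centre) / spread  # intermediate: distance to USL in sd units
    z_lower = (centre - lsl) / spread  # intermediate: distance to LSL in sd units
    ppu = z_upper / 3  # intermediate
    ppl = z_lower / 3  # intermediate
    return {"z_upper": z_upper, "z_lower": z_lower,
            "ppu": ppu, "ppl": ppl, "ppk": min(ppu, ppl)}


# Hypothetical measurements and specification limits.
measurements = [9.6, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 10.4]
result = ppk_with_intermediates(measurements, lsl=9.5, usl=10.5)

# Only the final index is reported; the intermediates stay inside the calculation.
print(f"Ppk = {result['ppk']:.2f}")
```

Every intermediate value is fully determined by the data and the limits, so printing them alongside the final index adds length to the report without adding information.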

Process performance is rarely a snapshot analysis. In practice, process performance analysis is performed frequently to detect changes in process performance. The histogram, powerful for snapshot analysis, is inappropriate for comparisons over time or with other products. For this we need to quantify performance. Percent non-conformance can be monitored over time and compared, but it provides no information on how tight, how centered or how well targeted the process is. The pp, pr and ppk process performance indexes are more appropriate, but not sufficient. Confidence intervals are required when computable. In the case of non-normal distributions this is not always possible.

Additionally, the output of a distribution analysis is required for sufficiency because these indexes are reliant on the distribution fitted. To validate the performance indexes there must be strong evidence that the correct curve has been fitted. The histogram with a fitted curve can appear to provide sufficient information for this, but in fact is not sufficient. Consider the curve in the image below.
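As a sketch of an index reported together with an interval, the example below computes pp as the specification width divided by six standard deviations, pr as its reciprocal, and a percentile bootstrap interval on pp. The bootstrap is a substitute technique chosen here because it needs only the standard library; the article does not state which interval method it has in mind. The function names, data, limits and seed are hypothetical.

```python
import random
from statistics import stdev


def pp_index(values, lsl, usl):
    """Pp = (USL - LSL) / (6 * overall sample standard deviation)."""
    return (usl - lsl) / (6 * stdev(values))


def bootstrap_interval(values, lsl, usl, confidence=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval on Pp (an assumed, illustrative method)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = []
    for _ in range(resamples):
        resample = rng.choices(values, k=len(values))
        try:
            estimates.append(pp_index(resample, lsl, usl))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero spread
    estimates.sort()
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[int((1 - tail) * len(estimates)) - 1]
    return lower, upper


# Hypothetical measurements and specification limits.
measurements = [9.6, 9.7, 9.8, 9.8, 9.9, 9.9, 10.0, 10.0, 10.0, 10.1,
                10.1, 10.1, 10.2, 10.2, 10.3, 10.3, 10.4, 10.0, 9.9, 10.1]
pp = pp_index(measurements, lsl=9.5, usl=10.5)
low, high = bootstrap_interval(measurements, lsl=9.5, usl=10.5)
print(f"pp = {pp:.2f}, pr = {1 / pp:.2f}, 95% interval on pp: {low:.2f} to {high:.2f}")
```

Tracking the index together with its interval over successive periods makes it easier to judge whether an apparent change in performance is larger than the sampling error.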

Figure 2: A fitted Weibull curve to strongly skewed data

The fit looks good visually, but is it? Compare the fit in the histogram below!

Figure 3: A fitted Normal curve to a symmetrical distribution

Does it fit better? Other information is required to answer the question.

A probability plot is one such piece of information. Compare the two probability plots below. The first corresponds to Figure 2 and the second to Figure 3.

Figure 4a: Comparison of two probability plots

Figure 4b: Comparison of two probability plots

According to the two probability plots, the second is a much better fit. This is confirmed by the p-value from the Anderson-Darling statistic. The poor fit at the extremes means that theoretical estimates of the proportion of defectives at the extremes must be treated with some caution.
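One simple way to put a number on what a probability plot shows is the squared correlation between the ordered data and the matching quantiles of the candidate distribution. The sketch below does this for a normal fit only. It is not the Anderson-Darling test, and the plotting positions used (Blom's), the function name and both data sets are assumptions made for illustration.

```python
from statistics import NormalDist, mean


def probability_plot_r_squared(values):
    """Squared correlation between sorted data and normal scores."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    # Blom plotting positions: an assumed convention, others are also in use.
    scores = [standard_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_x, mean_y = mean(scores), mean(ordered)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ordered))
    sxx = sum((x - mean_x) ** 2 for x in scores)
    syy = sum((y - mean_y) ** 2 for y in ordered)
    return sxy * sxy / (sxx * syy)


# Hypothetical data: one roughly symmetrical set and one strongly skewed set.
symmetrical = [10 + 0.2 * NormalDist().inv_cdf((i - 0.5) / 30) for i in range(1, 31)]
skewed = [1.3 ** i for i in range(30)]
print(f"symmetrical data: R squared = {probability_plot_r_squared(symmetrical):.3f}")
print(f"skewed data:      R squared = {probability_plot_r_squared(skewed):.3f}")
```

A value close to 1 means the points lie near a straight line on the plot. Because the extremes carry few points, a high overall value can still hide a poor fit in the tails, so the plot itself remains worth inspecting.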

We can conclude that to perform a multi-objective process performance analysis the minimum sufficient information that is required is:

- A Histogram
- A probability plot
- A significance test for the fitted curve
- Observed non-conformance and theoretical (expected) non-conformance
- The pp (and related pr), ppk indexes
- Confidence intervals on these

This meets the following objectives:

- To obtain a snapshot of current process performance
- To obtain quantifiable information and ‘accuracy’ on non-conformance, centering and degree of spread for comparisons, monitoring and reporting
- Confidence in the analysis

Deviation from target can be added to the above when it is also important to target the process. Ppm is another statistic that measures deviation from target, but it is more difficult to understand and can only be used for normally distributed data. It is not robust information. For this reason, BISNET Analyst uses deviation from target and percent deviation from target relative to the specification range.
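A minimal sketch of the two targeting measures follows. It assumes that deviation from target means the sample mean minus the target, and that the percentage is that deviation divided by the width of the specification range; this is one plausible reading of the description above, not a documented BISNET Analyst formula. The function name and data are hypothetical.

```python
from statistics import mean


def deviation_from_target(values, target, lsl, usl):
    """Return (deviation, deviation as a percentage of the specification range)."""
    deviation = mean(values) - target  # assumed definition: mean minus target
    percent_of_range = 100 * deviation / (usl - lsl)
    return deviation, percent_of_range


# Hypothetical measurements, target and specification limits.
measurements = [10.1, 10.2, 10.3]
deviation, percent = deviation_from_target(measurements, target=10.0, lsl=9.5, usl=10.5)
print(f"deviation from target: {deviation:+.2f} ({percent:+.1f}% of the specification range)")
```

Neither quantity depends on a fitted distribution, which fits the article's preference for it over an index that is limited to normally distributed data.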

Most output provides additional summary information including the average, standard deviation, maximum, minimum, range and count. Although some of this information may be used in the calculations, this information is not required to be reported to meet the objectives.

The remaining information in the above table, i.e. mode, median, geometric mean, lower quartile, upper quartile, quartile range, mean absolute deviation, sub-group size, sub-group number, within sub-group sd, between sub-group sd, Zl, Zu, CpL, CpU, cp, cpk, cpm, additional statistical tests such as the Chi Squared and KS tests, skewness, kurtosis, plus or minus 2 sigma and plus or minus 3 sigma, is not required to meet the objective and hence contributes to information overload, confusing readers and distracting from the most important pieces of information. These statistics have their uses for other objectives.

BISNET Analyst does not subscribe to the school of thought that believes the more information the better. BISNET Analyst will provide the least amount of sufficient information to meet the most important objectives.
