Recently, I have been working a lot with scientists, and I have noticed that they often use a simple method to visualize data and carry out basic analysis.

This method is called the “regressogram” in statistics. Simply put, Regressogram = Regression + Histogram. Here’s an example of a regressogram:

This is an analysis of astronomy data. The X-axis is the distance from a galaxy to some cosmological structure, and the Y-axis is the correlation of some features of that galaxy. We bin the data according to galaxy distance, take the mean within each bin as a landmark (or summary), and show how this landmark changes with galaxy distance. This visualizes the trend in the data in an obvious way.

In fact, the original data is very ugly! Here’s the scatter plot for the original data:

Note that the range of Y is now (0, 1), whereas in the regressogram the range is (0.7, 0.8). If you want to visualize the data, I don’t think this scatter plot is very helpful. The regressogram, however, is a simple way to reveal hidden structure within this complicated data.

Here are the steps for constructing a regressogram. First, we bin the data along the X-axis (shown by red lines):

Then we compute the mean within each bin (shown by the blue points):

We can show only the blue points (and the blue curve, which just connects the points) so that the result looks much more concise:

However, since the range of the Y-axis is too large, this does not show the trend. So we zoom in and compute the error of the estimated mean within each bin, shown as error bars. This gives the first plot we saw:

The advantage of the regressogram is its simplicity. Since we summarize the whole data set by points representing the mean within each bin, the interpretation is very straightforward. One can easily understand a regressogram without any deep knowledge of statistics. It also shows the trend (with error bars) in the data, so we get a rough idea of what’s going on. Moreover, no matter how complicated the original plot is, the regressogram uses only a few statistics (the mean within each bin) to summarize the whole data set. Notice that we do not make any assumption about the distribution of the data (such as normality); thus, the regressogram is a non-parametric method.
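To make the construction concrete, here is a minimal sketch in Python with NumPy. It is not the code behind the plots in this post: the function name `regressogram`, the equal-width bins, the use of the standard error of the mean for the error bars, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def regressogram(x, y, n_bins=10):
    """Bin x into equal-width bins; return bin centers, per-bin means of y,
    and the standard error of each mean (one common choice of error bar)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Equal-width bin edges spanning the observed range of x.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Bin index 0..n_bins-1 for each point; passing only the interior
    # edges places x.max() in the last bin.
    idx = np.digitize(x, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_bins, np.nan)
    errors = np.full(n_bins, np.nan)
    for b in range(n_bins):
        yb = y[idx == b]
        if yb.size > 0:
            means[b] = yb.mean()
        if yb.size > 1:
            # Standard error of the mean: sample std / sqrt(n).
            errors[b] = yb.std(ddof=1) / np.sqrt(yb.size)
    return centers, means, errors

# Synthetic, made-up data: a weak trend buried in noise, with Y in (0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)
y = np.clip(0.75 + 0.005 * x + rng.normal(0, 0.2, x.size), 0, 1)
centers, means, errors = regressogram(x, y, n_bins=10)
```

With matplotlib, something like `plt.errorbar(centers, means, yerr=errors, fmt="o-")` would draw the points, the connecting line, and the error bars. Empty bins are left as NaN here rather than dropped, which is just one possible design choice.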

However, in statistics, the regressogram is barely mentioned, and very few people actually use it in research. The main reason is that the regressogram is a method for non-parametric regression and it is not optimal. To predict Y given X, the regressogram finds the bin in which X lies and uses the sample mean within that bin as the predictor for Y. This prediction is suboptimal, and there are many alternative methods, such as local regression and kernel regression, that have much better prediction accuracy. (The main problem with the regressogram is its huge bias. We predict the same value everywhere within a bin, which makes the prediction inflexible.)
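The difference can be sketched in a few lines of Python. The regressogram predictor below follows the description above; for the comparison I use the Nadaraya-Watson estimator with a Gaussian kernel, which is only one form of kernel regression. The function names, the equal-width bins, and the default bandwidth are assumptions chosen for illustration.

```python
import numpy as np

def regressogram_predict(x_train, y_train, x_new, n_bins=10):
    """Predict y at x_new as the mean of y_train in the bin containing x_new."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    edges = np.linspace(x_train.min(), x_train.max(), n_bins + 1)
    train_idx = np.digitize(x_train, edges[1:-1])
    # One mean per bin; empty bins get NaN.
    bin_means = np.array([
        y_train[train_idx == b].mean() if np.any(train_idx == b) else np.nan
        for b in range(n_bins)
    ])
    # Points outside the training range fall into the first or last bin.
    new_idx = np.digitize(x_new, edges[1:-1])
    return bin_means[new_idx]

def kernel_regression_predict(x_train, y_train, x_new, bandwidth=0.5):
    """Nadaraya-Watson estimate: a kernel-weighted average of y_train."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # weights[i, j] = Gaussian kernel of (x_new[i] - x_train[j]) / bandwidth
    u = (x_new[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)
    return (weights @ y_train) / weights.sum(axis=1)
```

Two query points in the same bin always get the same regressogram prediction, however far apart they sit within that bin, whereas the kernel estimate changes smoothly with the query point. That piecewise-constant behavior is where the bias comes from.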

Despite not being an optimal method for prediction, the regressogram is still very attractive because of its simplicity. Especially when we just want to grasp a rough trend in the data, trading some accuracy for simplicity is often preferable. Hence, the regressogram is still very popular in science.

I will end this article with an interesting observation. Although the regressogram is widely used in scientific research, very few scientists know the name of the method. Maybe the regressogram is so intuitive that most scientists do not care what it is called.