Data Visualization and Statistics

Data visualization (and information visualization) is now an active and important topic in both academia and industry.

Data visualization is concerned with how to display complicated data. One of its major goals is to present important features in an intuitive and clear way so that people without professional knowledge can understand them. Some clear patterns can be observed directly after visualization, without conducting a sophisticated analysis. This is particularly useful for scientists who want to present their work to a general audience and to potential collaborators.

Moreover, data visualization serves as a tool for exploratory data analysis. That is, we can visualize the data first and then choose how to analyze them according to the structure we observe. This technique is especially useful when the dimension of the data (the number of variables) is greater than 3, since such data cannot be plotted directly.

Statistics and data visualization could interact more than they currently do. With proper cooperation, each field can help solve problems that arise in the other.

In data visualization, one problem is that we discard part of the information when we visualize the data. If the information we throw away is critical to our research, we will get into trouble. Thus, there is a need to study the information that each visualization approach discards, and statisticians are well suited to this job. Many visualization tools compute summary statistics and display only those features; a statistical analysis of these summaries allows us to understand what kind of information each summary provides and what kind it ignores.
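As a small illustration of how a summary can hide structure, the sketch below compares two toy samples. The data and the helper names (`summary`, `text_histogram`) are made up for this example and do not come from any particular visualization tool. Both samples have mean 3 and population variance 4, so any display built only on those two numbers cannot tell them apart, while a simple histogram can.

```python
from collections import Counter
from statistics import mean, pvariance

# Hypothetical toy samples, chosen so that mean and variance coincide.
bimodal = [1, 1, 1, 1, 5, 5, 5, 5]    # two separated groups
peaked = [-1, 3, 3, 3, 3, 3, 3, 7]    # one central spike plus two outliers


def summary(sample):
    """Return the (mean, population variance) pair of a sample."""
    return mean(sample), pvariance(sample)


def text_histogram(sample):
    """Draw one row of '#' marks per distinct value, in increasing order."""
    counts = Counter(sample)
    return "\n".join(
        f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)
    )


# The summaries agree, but the histograms look nothing alike.
print(summary(bimodal), summary(peaked))
print(text_histogram(bimodal))
print(text_histogram(peaked))
```

Here the discarded information is the shape of the distribution: the mean and variance are blind to whether the data form one group or two.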

In statistics, a common problem is that we cannot see the result of our analysis. For instance, when estimating a multivariate function or a “set” in high dimensions, such as a region of interest, we cannot see the result. A more concrete example is “clustering” in dimension greater than 3; it is hard to really see clusters in high dimensions. This problem is especially severe in nonparametric statistics, where the “parameter of interest” is often infinite dimensional and it is hard for a statistician to “see” the estimator. However, tools from data visualization may help with this problem. We can use approaches from data visualization to display our result. Although we may lose some information, we can get a rough idea of what our estimator looks like and fine-tune our analysis accordingly.
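One simple way to get such a rough picture is to project the data onto a plane. The sketch below uses a projection onto the top two principal components; this is only one possible choice and is not the method of the papers listed below. It assumes NumPy is available, and the data (two Gaussian clusters in 5 dimensions) and the function name `project_to_2d` are hypothetical.

```python
import numpy as np


def project_to_2d(points):
    """Project the rows of `points` onto their top two principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, so the first two rows span the best-fitting plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)

# Hypothetical data: two well-separated Gaussian clusters in 5 dimensions.
cluster_a = rng.normal(loc=np.zeros(5), scale=0.5, size=(50, 5))
cluster_b = rng.normal(loc=np.full(5, 4.0), scale=0.5, size=(50, 5))
data = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 50 + [1] * 50)

# Each row of `embedded` is a 2-D coordinate that can be drawn in a scatter plot.
embedded = project_to_2d(data)
print(embedded.shape)
```

The two columns of `embedded` can be passed to any scatter-plot routine, with `labels` used for color. The projection keeps only two directions of the original five, so whatever structure lies in the remaining directions is lost; this is the kind of discarded information discussed above.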

The following two papers are examples of combining data visualization and statistics:

1. Gerber, Samuel, and Kristin Potter. “Data analysis with the Morse-Smale complex: The msr package for R.” Journal of Statistical Software (2011). URL: http://www.jstatsoft.org/v50/i02/paper

2. Chen, Yen-Chi, Christopher R. Genovese, and Larry Wasserman. “Enhanced mode clustering.” arXiv preprint arXiv:1406.1780 (2014). URL: http://arxiv.org/abs/1406.1780

Here are some useful links about data visualization (thanks to Yen-Chia Hsu at the CMU Robotics Institute):

http://senseable.mit.edu/
http://datavisualization.ch/
http://www.nanocubes.net/
http://www.informationisbeautifulawards.com/
http://labs.juiceanalytics.com/vizwelike/index.html
http://vis.cs.ucdavis.edu/Publications/
http://idl.cs.washington.edu/papers
http://vis.berkeley.edu/papers/
http://www.cs.ubc.ca/group/infovis/publications.shtml
http://vis.berkeley.edu/