Data Visualization and Statistics

Data visualization (and information visualization) is now a hot and important topic in both academics and industry.

Data visualization is concerned with how to visualize complicated data. One of the major goal for data visualization is to display important features in an intuitive and clear ways so that people without professional knowledge are able to understand. Without conducting a sophisticated analysis, some clear patterns can be directly observed after visualization. This is particularly useful for scientists to promote their work to general audience and potential collaborators.

Moreover, data visualization serves as a tool for exploratory data analysis. That is, we can visualize the data first and according to the structure we observe from the visualization, we choose how to analyze the data. When the dimension of the data (number of variables) is greater than 3, this technique is very useful.

Statistics and data visualization can have more interplay. With proper cooperation, statistics and data visualization can help solving problems from each other.

In data visualization, a problem is that we discard part of the information when we visualize the data. If the information we throw away is critical to our research, we will get into trouble. Thus, there is a need to study the information that each visualization approach discards and statisticians are perfect to do this job. Many visualization tools use some summary statistics and keep track of these features when visualizing; statistical analysis for these summaries allows us to understand what kind of information the summary provide and what type of information is ignored.

For statistics, a common problem for statistical analysis is that we cannot see the result we have analyzed. For instance, when estimating a multivariate function or a “set” in high dimensions, like a region of interest, we cannot see the result. A more concrete example is “clustering” at dimension greater than 3; it is hard to really see clusters in high dimensions. This problem is especially severe in nonparametric statistics; the “parameter of interest” is often infinite dimensional and it’s hard for statistician to “see” the estimator. However, tools from data visualization may provide helps for this problem. We can use the approaches from data visualization to display our result. Despite the fact that we may loss some information, we can have a rough idea how does our estimator look like and we can fine-tune our analysis accordingly.

The following two papers are examples for combining data visualization and statistics:

1. Gerber, Samuel, and Kristin Potter. “Data analysis with the morse-smale complex: The msr package for r.” Journal of Statistical Software (2011). URL: http://www.jstatsoft.org/v50/i02/paper

2. Chen, Yen-Chi, Christopher R. Genovese, and Larry Wasserman. “Enhanced mode clustering.” arXiv preprint arXiv:1406.1780 (2014). URL: http://arxiv.org/abs/1406.1780

Here are some useful links about data visualization (thanks to Yen-Chia Hsu@CMU – Robotic Institute):

http://senseable.mit.edu/
http://datavisualization.ch/
http://www.nanocubes.net/
http://www.informationisbeautifulawards.com/
http://labs.juiceanalytics.com/vizwelike/index.html
http://vis.cs.ucdavis.edu/Publications/
http://idl.cs.washington.edu/papers
http://vis.berkeley.edu/papers/
http://www.cs.ubc.ca/group/infovis/publications.shtml
http://vis.berkeley.edu/

Advertisements

Random thoughts: Statistics vs Statistical engineering

Recent days I attended many talks given by people from statistics and statistical engineering (machine learning, data mining…etc).

I notice that people doing theories in statistical engineering is quiet similar to people in statistics. We do lots of statistical analysis on the method/algorithm and build some useful bounds for the convergence rate.

However, I just found that there’s a feature for people in statistics that people in theoretical engineering usually do not have: seeking for the asymptotic distributions. It is true that many people in statistical engineering try to find the bounds on convergence rate. The bounds are like their destination; they usually not go further for the distribution. In contrast, people in statistics will not stop at the rate; statisticians are targeting at the asymptotic distributions.

The reason why statisticians care about asymptotic distribution may be related to the statistical inference. The statistical inference such as confidence intervals, hypothesis tests, requires knowledge about the distribution of a certain statistics. Knowing the bounds is not sufficient for carrying out the inference. Both confidence intervals (or more general, confidence sets) and hypothesis test require the distributions.

This might also be the reason why courses in ML emphasizes more on the Hoeffding’s inequality, Bernstein’s inequality while in statistics, the courses focus more on the central limit theory and chi-square approximation.

Usually, finding the bounds on convergence rate is much easier than finding the true distribution. This might be a reason why many popular methods in statistical engineering are not so welcomed in statistics. The lack of asymptotic distribution reduces popularity from statisticians. However, many methods though have no asymptotic distribution, are still very useful in prediction, especially those with guarantees from probability bounds. Maybe we statisticians should not limit ourselves to those methods that are capable of statistical inference.

Anyway, I just discovered the feature for statisticians on deriving the asymptotic distributions. Maybe this is just my bias sample or maybe it is the truth. I’ll keep using this feature as a predictor to the future talks.