Data Visualization and Statistics

Data visualization (and information visualization) is now a hot and important topic in both academia and industry.

Data visualization is concerned with how to visualize complicated data. One of the major goals of data visualization is to display important features in an intuitive and clear way so that people without professional knowledge can understand them. After visualization, some clear patterns can be observed directly, without conducting a sophisticated analysis. This is particularly useful for scientists promoting their work to a general audience and to potential collaborators.

Moreover, data visualization serves as a tool for exploratory data analysis: we visualize the data first, and based on the structure we observe, we choose how to analyze the data. This technique is especially useful when the dimension of the data (the number of variables) is greater than 3.
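As a minimal sketch of this idea, here is a Python example (using NumPy, scikit-learn, and matplotlib on synthetic data invented for illustration; it does not come from any particular study) that projects 10-dimensional data onto its first two principal components so the structure can be inspected by eye:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data: two groups with different means.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 10)),
    rng.normal(3.0, 1.0, size=(200, 10)),
])

# Project onto the first two principal components; the plot is only
# a 2-D shadow of the data, but group structure often shows up in it.
X2 = PCA(n_components=2).fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("2-D projection of 10-D data")
plt.show()
```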

Statistics and data visualization can have even more interplay. With proper cooperation, each can help solve the other's problems.

In data visualization, one problem is that we discard part of the information when we visualize the data. If the information we throw away is critical to our research, we will get into trouble. Thus, there is a need to study the information that each visualization approach discards, and statisticians are well-suited to this job. Many visualization tools compute some summary statistics and keep track of these features when visualizing; statistical analysis of these summaries allows us to understand what kind of information a summary provides and what type of information it ignores.
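As a concrete (and entirely hypothetical) illustration of discarded information, consider the five-number summary behind a boxplot: two samples can produce similar-looking boxplots while having very different shapes. The following Python sketch uses synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic samples, both symmetric around zero.
unimodal = rng.normal(0.0, 1.0, size=1000)
bimodal = np.concatenate([rng.normal(-1.2, 0.3, size=500),
                          rng.normal(1.2, 0.3, size=500)])

# The five-number summary (what a boxplot displays) for each sample.
for name, x in [("unimodal", unimodal), ("bimodal", bimodal)]:
    summary = np.percentile(x, [0, 25, 50, 75, 100])
    print(name, np.round(summary, 2))

# Both boxplots are symmetric boxes centered near zero, but the
# five-number summary cannot reveal that the second sample has two
# modes; that information is discarded by this visualization.
```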

For statistics, a common problem is that we cannot see the result we have computed. For instance, when estimating a multivariate function or a “set” in high dimensions, such as a region of interest, we cannot see the result. A more concrete example is clustering in dimension greater than 3; it is hard to really see clusters in high dimensions. This problem is especially severe in nonparametric statistics, where the “parameter of interest” is often infinite-dimensional and it is hard for statisticians to “see” the estimator. Tools from data visualization may help with this problem: we can use visualization approaches to display our results. Even though we may lose some information, we get a rough idea of what the estimator looks like and can fine-tune our analysis accordingly.
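For instance, here is a hedged Python sketch (scikit-learn; the data and parameter choices are invented for illustration, and this is not the method of any particular paper): cluster in the full dimension, then project to 2-D only for display.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic 20-dimensional data with three latent groups.
rng = np.random.default_rng(2)
centers = rng.normal(0.0, 4.0, size=(3, 20))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(150, 20)) for c in centers])

# Cluster in the full 20-D space...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ...then project to 2-D only for viewing. The projection loses
# information, but it gives a rough picture of the clustering result.
X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10, cmap="viridis")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("High-dimensional clusters viewed in a 2-D projection")
plt.show()
```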

The following two papers are examples of combining data visualization and statistics:

1. Gerber, Samuel, and Kristin Potter. “Data analysis with the Morse-Smale complex: The msr package for R.” Journal of Statistical Software 50(2) (2011). URL: http://www.jstatsoft.org/v50/i02/paper

2. Chen, Yen-Chi, Christopher R. Genovese, and Larry Wasserman. “Enhanced mode clustering.” arXiv preprint arXiv:1406.1780 (2014). URL: http://arxiv.org/abs/1406.1780

Here are some useful links about data visualization (thanks to Yen-Chia Hsu @ CMU Robotics Institute):

http://senseable.mit.edu/
http://datavisualization.ch/
http://www.nanocubes.net/
http://www.informationisbeautifulawards.com/
http://labs.juiceanalytics.com/vizwelike/index.html
http://vis.cs.ucdavis.edu/Publications/
http://idl.cs.washington.edu/papers
http://vis.berkeley.edu/papers/
http://www.cs.ubc.ca/group/infovis/publications.shtml
http://vis.berkeley.edu/

Statistical Engineering

In my opinion, machine learning, data mining, pattern recognition, etc. are branches of ‘statistical engineering’. I find the relationship between these disciplines and statistics very similar to that between engineering and science.

In engineering, people focus on prediction, real-world performance, and optimization of a process/procedure/algorithm. Theoretical analysis is not as important to engineers as the empirical performance of a method, and knowing how to use a method to solve practical problems matters more than understanding how it works. This is the case in machine learning, data mining, and pattern recognition.

For instance, when a new method is proposed, it becomes popular in machine learning or data mining once its empirical performance is good; a method is classified as a good one based on its performance on a variety of datasets. In addition, those doing machine learning or data mining prefer learning how to implement a method to understanding why the method works.

On the contrary, scientific research emphasizes constructing a general rule/model to explain phenomena. Understanding a phenomenon is usually more important than knowing how to apply the outcome to real problems. For instance, astronomers develop many theories to explain the orbits and motions of planets, yet they do not care much about how this knowledge can be practically used in daily life.

In data analysis, the phenomena to be explained are the results of a statistical method, such as the error of an estimator. For example, when a new method is proposed, it attracts statisticians’ attention once its theoretical performance is good. When there is no theoretical guarantee for a method, statisticians will try to construct theories to explain how it works. Moreover, statisticians usually prefer understanding how a method works to learning how to implement it.

One can see that statistics versus machine learning/data mining/pattern recognition is nearly the same as science versus engineering. That’s why I use the term “statistical engineering” for these disciplines.