Monday, February 23, 2009

GGobi: Another Data Explorer

While exploring the links from the Feb. 17 lecture (Introduction to Data Mining), I came across another visualization tool for high-dimensional data that I like very much.

It's not useful for making movies or detailed simulations, but I think it's going to be useful for exploring data sets so large you don't even know what you have (or expect!). There are a bunch of extensions to it in R, and the website says you can also extend it in Python and a few other languages.

It's called GGobi and you can find out more about it here: http://www.ggobi.org/

It's open-source and, therefore, free.

Friday, February 20, 2009

Viz 3 Assignment Response - from Toni

[Figure from Correa et al. 2007: a deformation applied to a volume data set, highlighting relevant interior features.]

I found Correa et al. 2007 (http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4376157&isnumber=4376125) interesting and directly applicable to what I currently do in the lab. The article explores methods of letting a viewer physically manipulate a data set as a means of data exploration. Such probing would be similar to touching and deforming an object with one's hands. The general scheme for deformation involves displacing data points through a displacement map or procedure, then sampling these points according to a volume representation. It may be possible to carry out such an operation on data using software such as VolView. The figure above demonstrates such a deformation and how it can be useful in highlighting relevant features in a data set. I work with macromolecular structures that might benefit from similar manipulations. For example, atomic-resolution structural data of very large protein complexes may be easier to navigate if one can easily deform surrounding atoms to view buried subunits, ligands, metals, etc.
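
Here is a rough sketch, in Python with NumPy, of the displace-then-resample idea described above. This is my own toy illustration (not the authors' code): points near a hypothetical probe position are pushed outward along the probe-to-point direction, and the displaced points are then binned onto a voxel grid as a crude volume representation.

    import numpy as np

    def radial_displacement(points, probe, radius, strength):
        """Push points within `radius` of `probe` outward along the
        probe-to-point direction; points farther away are untouched."""
        offsets = points - probe
        dist = np.linalg.norm(offsets, axis=1)
        # Falloff: full strength at the probe, zero at the radius boundary.
        falloff = np.clip(1.0 - dist / radius, 0.0, None)
        # Guard against division by zero for points exactly at the probe.
        dirs = offsets / np.maximum(dist, 1e-12)[:, None]
        return points + strength * falloff[:, None] * dirs

    def sample_to_volume(points, shape=(64, 64, 64)):
        """Crude volume representation: bin the points into voxels."""
        lo, hi = points.min(axis=0), points.max(axis=0)
        vol, _ = np.histogramdd(points, bins=shape, range=list(zip(lo, hi)))
        return vol

    # Example: deform a synthetic cloud of "atoms" around a buried site.
    atoms = np.random.rand(5000, 3)
    deformed = radial_displacement(atoms, probe=np.array([0.5, 0.5, 0.5]),
                                   radius=0.3, strength=0.2)
    volume = sample_to_volume(deformed)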

Thursday, February 19, 2009

Random Useful References

Some references that might interest people taking this course.

- NetworkX.

A Python package for modelling graphs and networks:

http://networkx.lanl.gov/
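
Getting started takes only a few lines. A minimal sketch (my own example): build a small graph, then query it.

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"),
                      ("a", "d"), ("b", "d")])

    print(G.number_of_nodes(), G.number_of_edges())   # 4 5
    print(nx.shortest_path(G, "a", "c"))              # e.g. ['a', 'b', 'c']
    print(nx.degree_histogram(G))                     # [0, 0, 2, 2]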


- Graphviz.

A graph visualization tool (works with NetworkX):

http://www.graphviz.org/
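
One simple way to hand a NetworkX graph to Graphviz is to write a DOT file and render it with the dot command. NetworkX also ships write_dot helpers (they require pygraphviz or pydot to be installed); writing the DOT text by hand, as in this sketch, avoids the extra dependency.

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c")])

    # Emit the graph in Graphviz's DOT format.
    with open("graph.dot", "w") as f:
        f.write("graph G {\n")
        for u, v in G.edges():
            f.write('  "%s" -- "%s";\n' % (u, v))
        f.write("}\n")

    # Then render from the shell:
    #   dot -Tpng graph.dot -o graph.png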


- A line-by-line Python profiler

A Python profiler that gives you performance information about each line of your code, so you can attack the main bottlenecks first.

http://www.enthought.com/~rkern/cgi-bin/hgwebdir.cgi/line_profiler/summary
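
Typical usage, as I understand it: decorate the function you care about with @profile (the decorator is injected by the kernprof script at runtime, not imported), then run kernprof.py -l -v on your script to get per-line hit counts and timings. A minimal sketch:

    @profile          # provided by kernprof at runtime; not an import
    def slow_sum(n):
        total = 0
        for i in range(n):
            total += i * i      # the profiler reports time spent per line
        return total

    if __name__ == "__main__":
        slow_sum(10 ** 6)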


- "The visual display of quantitative data", a classic book by Edward Tufte.

http://www.edwardtufte.com/tufte/books_vdqi


- The NEOS server and AMPL

A server that accepts optimization jobs over the web. High-end hardware and high-quality solvers for free! Accepts jobs in the AMPL language.

http://www-neos.mcs.anl.gov/
http://www.ampl.com

- The Stony Brook Algorithm Repository

Search for algorithms by problem or by language. The website also provides a ranking of the available implementations (presumably which ones are "best"). For example, I needed an algorithm for matching in bipartite graphs. I just searched under "Graph problems - Polynomial-time problems" -> "Matching" and found a list of them, with their ranking.

http://www.cs.sunysb.edu/~algorith/
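
For the curious, the textbook augmenting-path algorithm for maximum bipartite matching (the kind of routine the repository indexes) fits in a few lines of Python. A minimal sketch, with a made-up example graph:

    # `graph` maps each left-side vertex to the right-side vertices
    # it can be matched with.
    def max_bipartite_matching(graph):
        match = {}  # right vertex -> left vertex currently matched to it

        def try_assign(u, seen):
            # Try to match u, recursively re-matching conflicting vertices.
            for v in graph[u]:
                if v in seen:
                    continue
                seen.add(v)
                if v not in match or try_assign(match[v], seen):
                    match[v] = u
                    return True
            return False

        for u in graph:
            try_assign(u, set())
        return match

    graph = {"x1": ["y1", "y2"], "x2": ["y1"], "x3": ["y2", "y3"]}
    print(max_bipartite_matching(graph))
    # {'y1': 'x2', 'y2': 'x1', 'y3': 'x3'}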


- "Large Scale Data Analysis Challenges" (talk at Caltech next week)

Tuesday, February 24th
12:00 - 1:00pm
74 Jorgensen

*Lunch will be provided*

SPEAKER:
Dan Meiron
Fletcher Jones Professor of Applied & Computational Mathematics and
Computer Science

TITLE:
Large Scale Data Analysis Challenges

ABSTRACT:
JASON, a scientific advisory group, was asked by representatives of the
Department of Defense (DOD) and the Intelligence Community (IC) to
recommend ways in which the DOD/IC can handle present and future
sensor data in fundamentally different ways, taking into account the
state of the art, the potential for advances in areas such as data
structures, the shaping of sensor data for exploitation, and
methodologies for data discovery.

In this presentation we will examine the challenges associated with the
analysis of large data sets and in particular compare DOD/IC requirements
to those of several data-intensive fields such as high energy physics and
astronomy. The conclusion is that while DOD/IC data requirements are
certainly significant, they are not unmanageable given the capabilities
of current and projected storage technology.

The key challenge will be to adequately empower DOD and IC analysts by
matching analysis needs to data delivery modalities. At a very cursory level,
we will examine some current approaches that could enable better information
fusion. We'll also propose various grand challenges that could be used to
assess and prioritize future research efforts in data assimilation and fusion.