Data Fish

I read with great delight Kevin Kelly’s recent post over at the Technium, "The Google Way of Science".

Mr. Kelly is a big fan of intellectual broadsides - technological Zen koans - and this one is a doozy.

The elevator pitch is this: as we accumulate more and more data, petabytes of the stuff, scientists can knock off the difficult and time-consuming process of coming up with a hypothesis, testing it, peer reviewing it, and testing it again. Instead, they can simply look through these vast pools of data with special tools (petascopes?) for previously undiscovered correlations, which with enough data points become as good as natural law.

Imagine you had a database with a vast number of entries detailing how long it took objects to fall to the ground from varying heights.

Using the field of Correlative Analytics, you could simply query the database, asking how long it takes for an object to fall 25 meters. With enough data points, you could get answers as accurate as if you actually had a theory of gravitation, though the database and related search software have no implicit theory of gravity built into them.
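To make the falling-object example concrete, here is a minimal sketch in Python of what such a query might look like. Everything in it is an assumption for illustration: the `observations` table is synthetic (a formula plus noise stands in for real measurements), and `estimate_fall_time` is a hypothetical helper that simply averages the recorded times of the nearest heights. The lookup itself never sees the formula.

```python
import math
import random

# Synthetic stand-in for a table of real measurements: (height_m, fall_time_s).
# The formula is used ONLY to fabricate sample data; the query never sees it.
rng = random.Random(0)
observations = []
for i in range(2, 201):  # heights 1.0 m to 100.0 m in 0.5 m steps
    height = i * 0.5
    true_time = math.sqrt(2 * height / 9.81)
    observations.append((height, true_time + rng.gauss(0, 0.02)))  # timing noise


def estimate_fall_time(height_m, data, k=5):
    """Average the recorded times of the k observations closest in height."""
    nearest = sorted(data, key=lambda row: abs(row[0] - height_m))[:k]
    return sum(time for _, time in nearest) / len(nearest)


estimate = estimate_fall_time(25.0, observations)
print(f"Estimated fall time from 25 m: {estimate:.2f} s")
```

The more densely the heights are sampled, the closer the averaged answer gets to what a theory would predict, which is the whole pitch in miniature.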

As it turns out, this is much the way Google's language translation services work. They have no theory of French, English, or Chinese; they simply have very large amounts of bilingual translations they can use to look for correlations. It is science by Bayesian filter.

Except that it isn’t.

My issue with this is that without understanding, we may have some pragmatically useful results, but no underlying knowledge or wisdom about how we got them.

Now, I’m a big fan of pragmatism, and as it turns out, I have used exactly this method myself for the most popular piece of software on this site, my Protein Database Reader for Lightwave 3D (more on that in a minute). But it stinks as science.

But that doesn’t matter, because it’s the idea, not the implementation that is powerful. I imagine that for the research scientist, the work flow would look something like:

  1. Have a hunch (like a theory, but much less developed)
  2. Look through the Godcloud for correlative data
  3. Using that information, refine hunch into theory
  4. Iterate steps 2 and 3 as necessary
  5. Proceed with boring parts of science (experimenting, peer review)…or not

 

My experience with Correlative Analytics

 

A long time ago, I built a very simple program for reading Protein Data Bank (PDB) files and plotting the atoms in 3D. I used this simple program to do the visual effects for the PBS series "The Nobel Legacy."

I set it aside for a couple of years. Coming back to it, I saw much room for improvements, both in user interface and in the options for visualization. One of the most requested features by far was the ability to render "Ball and Stick" models. In this mode, atoms are drawn as balls, and the bonds between them are drawn as sticks.

Which is cool, but at the end of the day, my AP Chemistry class in high school, lo those many years ago, hadn’t given me enough of a theory of molecular bonding to give my program the smarts to draw those connections. What to do?

I didn’t have a theory, but I did have data. I started out with 64 (ok, not exactly a petabyte, but this was the early 90s) small PDB files that had the molecular bonds explicitly listed out.

Small Model with Explicit Bonds

The files would list bonds this way for small models (benzene, caffeine, etc.), but it was either too expensive or too unwieldy to include them in the larger models of proteins, DNA and the like.

So I took my reader, and modified it to do two things:

  1. Read all files in a folder
  2. Write a CSV file with a row for every explicitly connected pair of atoms

For each pair, the atom pair type was listed (HO, HC, OP, CaH) along with the distance between the two atoms' points in space.
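The export step can be sketched roughly as follows, in Python rather than the original LScript. This is an illustration under stated assumptions, not the original code: `pair_key` and `bond_rows` are hypothetical names, atoms are assumed to be already parsed into `(element, x, y, z)` tuples with bonds as index pairs, and sorting the two element symbols to get one label per pair type is a choice made here for convenience.

```python
import csv
import io
import math


def pair_key(element_a, element_b):
    """Order-independent label, so an O-H bond and an H-O bond share one key."""
    return "".join(sorted((element_a, element_b)))


def bond_rows(atoms, bonds):
    """Yield (pair label, distance) for every explicitly listed bond.

    atoms: list of (element, x, y, z); bonds: list of (index_a, index_b).
    """
    for a, b in bonds:
        elem_a, *pos_a = atoms[a]
        elem_b, *pos_b = atoms[b]
        yield pair_key(elem_a, elem_b), math.dist(pos_a, pos_b)


# Hypothetical water molecule, coordinates in angstroms.
atoms = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
bonds = [(0, 1), (0, 2)]

out = io.StringIO()  # swap for open("bonds.csv", "w", newline="") to write a file
writer = csv.writer(out)
writer.writerow(["pair", "distance"])
for pair, distance in bond_rows(atoms, bonds):
    writer.writerow([pair, f"{distance:.3f}"])
print(out.getvalue())
```

Running the same loop over every file in a folder and appending to one CSV would give the kind of pair-and-distance table described above.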

This gave me a database of about 1300 connections, from which I could derive the average distance between atoms of specific types, and the minimum and maximum distances between all pairs of that type. It also gave me another value, the maximum distance between any bonded pair.
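As a rough sketch of that summary step (Python again, with made-up sample rows rather than the original data), the per-type average, minimum and maximum, plus the single largest bonded distance, might be computed like this. `summarize` is a hypothetical helper name.

```python
from collections import defaultdict


def summarize(rows):
    """Return per-pair (mean, min, max) distances and the overall maximum."""
    by_pair = defaultdict(list)
    for pair, distance in rows:
        by_pair[pair].append(distance)
    stats = {
        pair: (sum(ds) / len(ds), min(ds), max(ds))
        for pair, ds in by_pair.items()
    }
    overall_max = max(high for _, _, high in stats.values())
    return stats, overall_max


# Made-up sample rows, not values from the original data set.
rows = [("HO", 0.96), ("HO", 0.98), ("CH", 1.08), ("CH", 1.10), ("CC", 1.54)]
stats, overall_max = summarize(rows)
for pair, (mean, low, high) in sorted(stats.items()):
    print(f"{pair}: mean={mean:.2f} min={low:.2f} max={high:.2f}")
print(f"largest bonded distance: {overall_max:.2f}")
```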

With this information, I could march through the data in cubes the size of that max distance* and look at the atoms in that cube, see if any of the pair types were within bonding distance, and if they were, add a link.
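One way to sketch this search in Python is with a grid of cubes keyed by integer cell coordinates. Note that this swaps in a slightly different technique from the one described: instead of marching through overlapping cubes, each atom is dropped into a cube of side `cell_size` and compared against atoms in its own and the 26 neighboring cubes, which covers the same edge cases. The names `infer_bonds`, `pair_key`, and the `ranges` table are hypothetical, and the distance range is invented for the example.

```python
import math
from collections import defaultdict
from itertools import product


def pair_key(element_a, element_b):
    """Order-independent label for a pair of element symbols."""
    return "".join(sorted((element_a, element_b)))


def infer_bonds(atoms, ranges, cell_size):
    """Guess bonds from distances alone.

    atoms: list of (element, x, y, z); ranges: {pair label: (min, max)};
    cell_size: the largest bonded distance seen in the reference data.
    """
    grid = defaultdict(list)
    for index, (_, x, y, z) in enumerate(atoms):
        cell = tuple(math.floor(c / cell_size) for c in (x, y, z))
        grid[cell].append(index)

    bonds = set()
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i >= j:
                        continue  # visit each unordered pair once
                    limits = ranges.get(pair_key(atoms[i][0], atoms[j][0]))
                    if limits is None:
                        continue  # pair type never seen bonded
                    distance = math.dist(atoms[i][1:], atoms[j][1:])
                    if limits[0] <= distance <= limits[1]:
                        bonds.add((i, j))
    return bonds


# Hypothetical water molecule plus one stray hydrogen far away.
atoms = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0),
         ("H", -0.24, 0.93, 0.0), ("H", 5.0, 5.0, 5.0)]
ranges = {"HO": (0.90, 1.05)}  # invented range, for illustration only
print(sorted(infer_bonds(atoms, ranges, cell_size=1.05)))
```

Because no atom pair farther apart than `cell_size` can be bonded, checking only adjacent cubes is enough, and the work stays roughly proportional to the number of atoms instead of the number of pairs.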

And this is how, without so much as cracking a chemistry book, I made a fully functional molecular modeler.

It is still available in the downloads section, and as it is written in the C-like LScript scripting language, you can poke through the five-year-old source code if you like.

* If I recall, in order to avoid edge cases, I marched through in half-cube increments.

Large model with statistically likely bonds

So, yeah. I think Correlative Analytics is a cool and very powerful tool. It did not, however, teach me much about chemistry.

If the world is a lake, and fish are useful scientific theorems, Correlative Analytics is going to be a great way to find new fishing holes. That’s my hunch…we’ll have to see if the data bear it out.

File this under: Correlation is not causation, but that’s how the smart money bets.