Wednesday, April 1, 2009

Building Query Tool/Statistical Analysis software

A lot of custom software involves reinventing the wheel: writing things that have been written before and solving problems that have been solved before. I'm a fan of not reinventing the wheel but, as with so many things in life, even this heuristic needs to be applied in moderation. There are times when reinventing the wheel is the lesser of evils; in some cases it's even instructive.

While working on software to do statistical analysis on data extracted under a certain set of circumstances, I've run across some micropatterns that I hope will save me some work someday.

#1 - Take the time to extract the data points into a class/standard form. If it's time series data then every data point should be associated with its underlying series. In the object model there should always be a path from the data point back to the underlying data source (e.g., the file it was taken from).

In the beginning this might seem like overkill. Why bother with the overhead of wrapping a double in a class that has a reference to the underlying source of data? Eventually you're going to want to know more about the data point than can be represented by its value. You'll definitely want to know where it came from. But you may also want its ordinal position in the set of data points. What about any special circumstances that were in effect at the time? If they vary over time then you might have a set of data all from the same source but with different configurations in effect. When your users want to query based on those configurations it's a lot easier if this information is readily accessible from any given data point.
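Something like this minimal sketch is what I have in mind. The names and properties are just placeholders, not from any particular codebase; DataSource is sketched under #2 below, and the configuration is kept loose here as a simple settings dictionary:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of a wrapped data point. The whole point is that a point
// carries more than its value: where it came from, when, and under what settings.
public class DataPoint
{
    public double Value { get; set; }           // the raw measurement itself
    public int Ordinal { get; set; }            // position within its series
    public DateTime Timestamp { get; set; }     // when it was recorded
    public DataSource Source { get; set; }      // path back to the file it came from
    public IDictionary<string, string> Configuration { get; set; }  // settings in effect at acquisition
}
```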

#2 - Take the time to abstract the source of the data points into a standard form. This is usually a class that stores, at the very least, the name of the file that contained the data.

This class might start out storing very little, perhaps just the name of the file and the date it was acquired. Believe me, it'll grow. Over time more data sources will come online: new hardware, new algorithms, new "meta" information about the data. If this class already exists, it'll be the natural repository for this information.
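Again, a minimal, hypothetical sketch; the properties are just the kinds of things that tend to accumulate:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data source abstraction: starts small, grows as new hardware,
// new algorithms, and new metadata come online.
public class DataSource
{
    public string FileName { get; set; }        // the file the data was extracted from
    public DateTime AcquiredOn { get; set; }    // when the data was acquired
    public string HardwareId { get; set; }      // added later: which instrument produced it
    public IDictionary<string, string> Metadata { get; set; }  // catch-all for "meta" information
}
```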

#3 - Encapsulate common statistical methods into a class that references a collection of the underlying data points. By common I mean things like mean, standard deviation, variance, and so on. Make sure this encapsulating class has a reference to its underlying data sources.

This really comes in handy as new sources of data become available. The code to visually represent aggregate statistics will already exist; new data sources can just be fed into existing encapsulating classes or, in rare cases, derivative classes.
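A rough sketch of what I mean, building on the hypothetical DataPoint class from #1 (the statistics shown are the population versions; adjust to taste):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper over a collection of data points. Because each point
// carries a reference to its DataSource, the aggregate statistics do too.
public class SeriesStatistics
{
    private readonly IList<DataPoint> points;

    public SeriesStatistics(IList<DataPoint> points)
    {
        this.points = points;
    }

    // Keep a path back to the underlying data points (and their sources).
    public IList<DataPoint> Points
    {
        get { return points; }
    }

    public double Mean
    {
        get { return points.Average(p => p.Value); }
    }

    public double Variance
    {
        get
        {
            double mean = Mean;
            // Population variance; divide by (Count - 1) for the sample version.
            return points.Sum(p => (p.Value - mean) * (p.Value - mean)) / points.Count;
        }
    }

    public double StandardDeviation
    {
        get { return Math.Sqrt(Variance); }
    }
}
```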

#4 - Do as little calculation at the User Interface level as possible. Ideally the UI should be handed nothing but data-containing classes that get rendered. I'm not a total purist, though; I'll occasionally do unit conversion at the UI (though with C# properties this is less justifiable).

Initially it's often easier to do some final calculations at the User Interface than it is to do them earlier, but this not only complicates the UI, it also hardwires the UI to a specific calculation from specific data points. When the users decide they want to visualize the data in a different way, that will be a lot harder to do if a lot of calculation was being done at the UI level.
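For example, something like this hypothetical, ready-to-render class is all I'd want the UI to see; the mean is computed upstream, and the only thing left is a trivial unit conversion expressed as a property:

```csharp
// Hypothetical container handed to the UI. All the real calculation happens
// before this object is constructed.
public class SeriesView
{
    public string SourceFileName { get; set; }   // where the series came from
    public double MeanMeters { get; set; }       // computed upstream, not in the UI

    // The one "calculation" the UI layer gets: a unit conversion as a property.
    public double MeanFeet
    {
        get { return MeanMeters * 3.28084; }
    }
}
```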
