Tuesday, April 07, 2009

Cables and Data analysis projects

When I was a kid, I realized I really liked cables. My first real
memory is of my Dad hooking up the TV, the VCR, our TI-99/4A computer,
and the aerial TV antenna -- I realized that the cables were
different, and had to be hooked up in a specific way. Some of them
carried signal, and others carried power. And, in general, they
carried their good stuff in one direction only. I guess I was 5 or 6
years old.

I still like cables. They just make sense. When I teach classes to IP
network technicians, I often recommend that each cable should get its
own identity -- a name like "C1", "C52", etc. Cables are people too,
after all.

But second to connecting cables, I've found another type of task I
really like: data correlation and retrieval. I'm working on a project
right now where I'm collecting data from a BroadWorks TimesTen
(Oracle) database, some log files with SIP that I have to parse, and
the database in an Acme Packet Session Border Controller (SBC). If all
this data were in one database, you could run a big query and get it
all out. But it's not in one big database, and it might not be worth
putting into one database. So I build custom data-access tools to get
it out. Hi, I'm Mark, your report builder.

It's reasonable to ask whether it makes sense to load this data into
one big database to allow general-purpose searching with a better
query language. There are a few downsides: the need to set up a DBMS
(like MySQL), and the need to parse the data to get it into the
database in a clean way. I can get by with cheap-and-dirty parsing via
regex; to load the data into a database I'd have to slice apart the
data in the log files to put it in meaningful fields. For example, if
I don't parse the SIP message before I put it into the database, then
I'd just be doing full text searches on the SIP messages. I'm not
completely convinced either way here.
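
To make the trade-off concrete, here is a minimal sketch of the
cheap-and-dirty regex approach, written in Python for illustration.
The sample SIP message, the list of headers, and the function name
extract_fields are all assumptions made up for this example; real log
files will need their own patterns. It ignores compact header forms
and folded header lines, which is exactly the kind of corner a proper
parser would have to handle before loading a database.

```python
import re

# Hypothetical sample message; a real log layout will differ.
SAMPLE_SIP = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    'From: "Alice" <sip:alice@example.com>;tag=1928301774\r\n'
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.10\r\n"
    "CSeq: 314159 INVITE\r\n"
    "\r\n"
)

# Match only the headers we care about, one per line.
# Assumption: long-form header names, no line folding.
HEADER_RE = re.compile(
    r"^(Call-ID|From|To|CSeq)[ \t]*:[ \t]*(.*?)[ \t\r]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_fields(message):
    """Return a dict of lower-cased header name -> first value seen."""
    fields = {}
    for name, value in HEADER_RE.findall(message):
        fields.setdefault(name.lower(), value)
    return fields


fields = extract_fields(SAMPLE_SIP)
print(fields["call-id"])
```

Pulling out just a handful of headers like this is enough for a
one-off report. Turning every header into a clean database column is
the larger job described above.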

In one sense, it's just work -- dump data out, do searches, etc. But
in this kind of project, nobody cares if I use shell scripts, or Perl,
or a "real programming language" to handle it. My vocabulary is
normally awk, grep, sed, cut, tr, bash, perl, head, tail, gawk. Nobody
complains to me that it's inefficient. If I get the information
out, that's all they care about. And nobody else I know likes to do
this kind of project, so it's a sort of niche.

Cables have something in common with data analysis and correlation.
With cables, you have two endpoints connected to each other by some
common medium. With data analysis, you can glue data together using
common (correlating) keys. Yeah, the query language is complicated.
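
Gluing data together on a correlating key can be sketched in a few
lines of Python. The record layouts, the call_id field, and the
correlate function here are hypothetical stand-ins (nothing below
comes from the real BroadWorks or SBC data). The idea is just to
index one source by the key and look up each record from the other,
like plugging both ends of a cable into matching ports.

```python
from collections import defaultdict


def correlate(left, right, key):
    """Join two lists of dicts on a shared key (an inner join)."""
    # Index the right-hand records by the correlating key.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Pair each left-hand record with every right-hand match.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Made-up records standing in for a database dump and parsed log lines.
db_rows = [
    {"call_id": "abc@192.0.2.10", "subscriber": "alice"},
    {"call_id": "def@192.0.2.11", "subscriber": "bob"},
]
log_rows = [
    {"call_id": "abc@192.0.2.10", "method": "INVITE"},
    {"call_id": "abc@192.0.2.10", "method": "BYE"},
    {"call_id": "zzz@192.0.2.99", "method": "INVITE"},
]

result = correlate(db_rows, log_rows, "call_id")
for row in result:
    print(row["subscriber"], row["method"])
```

Records with no partner on the other side are dropped here. Keeping
them (an outer join) would be a small change, and is often what you
want when hunting for calls that show up in one system but not the
other.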
