Complex Systems

I hate Windows.  It seems that all my problems at work come from having to deal with Windows.  And Mac OS X: I hate Mac OS X as well, for the same reason.

Actually, I don’t really hate those operating systems, but it got your attention.  I actually think they are both perfectly fine operating systems.  But they do cause all my headaches at work.  I’m a Linux user by default, and venturing into the realm of Windows and OS X always seems to give me headaches.

And it is not really the operating systems that cause me the headaches.  The real issue is the complexity of the systems that I have to work with.  As the main part of my job, I maintain and (try to) enhance and extend two fairly complex systems.  One is the public data server for data from the primary instrument on a NASA satellite mission, the other is the software build system for the primary instrument team for that mission.

Both of these systems suffer to some extent from the second-system effect described by Fred Brooks in The Mythical Man-Month, as both are follow-ons to earlier systems that worked quite well.  And each second system was written by the author of its first system.

In the case of the data server, I only have myself to blame, since I am the original author.  I did all the trade studies, wrote the requirements and design documents, and implemented the system.  In fact, knowing about the second-system effect, I tried really hard to avoid suffering from it.  And for the most part, I think I succeeded.  It’s a relatively small, focused system that does one thing really fast.

But it is still complex.  And it still gives me headaches when things go wrong.  And I wrote it.  I understand intuitively what it is supposed to be doing and how it works.  I can only imagine the headaches it gave the guy who maintained it for the year I was off working on a different project.

The other system, on the other hand, was not written by me, and I don’t have the intuitive grasp of it that the original developer did, although I’m getting a better feel for it every day.  And in many ways, this system is much more complex than the data server.  It’s an automated build system.  When a user checks in and tags new code, the build system launches a series of processes that check out the code, build it, run all the associated tests, bundle up user, developer and source distributions, and publish all the results (including e-mailing developers about any of their packages that failed to compile or pass their tests).

It’s a fairly standard build system.  Except that it all has to run on seven different operating systems.  With six different compilers.  And it runs on a batch queuing system and talks to four different databases on two different MySQL servers.  Did I mention it was fairly complex?

Just to enumerate, the operating systems we currently support are 32- and 64-bit Red Hat Enterprise Linux 4 and 5, Mac OS X 10.6 (Snow Leopard), Mac OS X 10.4 (Tiger, going away as soon as the Snow Leopard support is fully functional) and Windows XP (with Windows 7 support looming).  The compilers we currently support are four versions of gcc (3.4, 4.0, 4.1 and 4.2) and two versions of Visual Studio (2003 and 2008).  It’s not actually as bad as it sounds.  With the exception of the two versions of Visual Studio running on Windows XP, there is only one compiler supported per *nix-style OS.  This variety is actually a good thing, as it helps keep the codebase clean since it has to work everywhere.

The real trouble comes from the infrastructure supporting the system and the ways it interacts (or doesn’t) with these different operating systems.

The programs that run the build system were written in C++ using the Qt library.  Now, I didn’t know anything about Qt when I acquired responsibility for the project, but after sifting through the code, I think I can understand why it was chosen.  One of the main reasons was its timer and process control functionality, used both to launch checks at specific intervals and to kill build or, more importantly, test processes that have hung and are taking too long.  Only the latter doesn’t seem to work on Snow Leopard, as we found out when one of our packages was segfaulting in the tests and, instead of dying, going into an infinite loop.  And since the build system code didn’t properly kill it, the entire system hung for that OS.  And right now I can’t tell if the problem is Qt, the underlying OS, how we’re applying it, or some combination of the three.  Complexity.
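
For what it’s worth, the general pattern here (start a process, wait up to some limit, kill it if it is still running) can be sketched in a few lines.  To be clear, this is not the actual build system code, and it deliberately uses plain POSIX calls rather than Qt; the names run_with_timeout and RunResult are made up for illustration, and the polling interval is an arbitrary choice.

```cpp
// Illustrative sketch only: plain POSIX, not the Qt classes the real
// build system uses. All names here are hypothetical.
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

struct RunResult {
    bool timed_out;  // true if the child was still running at the deadline
    int exit_code;   // exit status on normal exit, otherwise -1
};

// Run argv[0] with its arguments; kill it if it outlives `timeout`.
RunResult run_with_timeout(const std::vector<std::string>& argv,
                           std::chrono::milliseconds timeout) {
    assert(!argv.empty() && "need at least a program name");

    // Build the NULL-terminated argument array that execvp expects.
    std::vector<char*> args;
    for (const auto& a : argv) args.push_back(const_cast<char*>(a.c_str()));
    args.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return {false, -1};  // fork failed
    if (pid == 0) {
        execvp(args[0], args.data());
        _exit(127);  // only reached if exec itself failed
    }

    // Parent: poll for exit until the deadline passes.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) {
            int code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
            return {false, code};
        }
        if (done < 0) return {false, -1};  // waitpid error
        if (std::chrono::steady_clock::now() >= deadline) break;
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }

    // Deadline passed: SIGKILL cannot be caught or ignored by the child.
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);  // reap it so no zombie is left behind
    return {true, -1};
}
```

Something like run_with_timeout({"sleep", "5"}, std::chrono::milliseconds(200)) should come back with timed_out set.  Note that this only kills the direct child; a test that spawns its own children would need something more, such as killing a whole process group.  That is exactly the kind of detail that can differ between operating systems.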
