Tuesday, October 20, 2009

Software Complexity Crisis

[This rather large post is almost entirely a personal rant. I complain a lot about wxWidgets in here, because it's fresh in my mind and a very good example of a specific type of software engineering crisis. Note that I'm still using the WX libraries for the Alter Aeon client project; clearly it has value to me in spite of its faults, and I appreciate the effort the WX team has put in. That said, if I could easily move to any other library that met my constraints, I would do it in a heartbeat.]

Various recent attempts on my part to use large software libraries have made me re-examine the issue of the general software-engineering crisis. I've run up against typical software crisis problems many times in the past, so I tend to keep my eyes open when I see material related to the topic. This is one of the reasons that Vernor Vinge's book "A Deepness in the Sky" caught my attention.

One of the basic premises of this book is that complexity failure can be sufficient to bring down entire societies. This was put very succinctly in a blog posting by Jeremy Bowers on his iRi Blog, part of which I quote here:

"One of the less well known concepts which informs his sci-fi writings is one possible fate of societies that do not or can not end in a "singularity", which is the eventual unavoidable collapse of the society in a cascading failure state brought on by excessive, uncontrollable complexity in the ever-more-sophisticated systems that drive the society. In this case, take "system" in the broad sense, including not just software, but business practices, government, and societal mores. A failure occurs somewhere, which brings down something else, which brings down two other something elses, and perhaps quite literally in the blink of an eye, you are faced with a growing complex of problems beyond the ability of any one human to understand or contain."

This relates to software in that I'm beginning to see more and more examples of how this can occur. Two software platforms in particular come to mind - IBM's WebSphere, which I was peripherally involved with a decade ago, and wxWidgets, which I am involved with today.

Both of these platforms are very complex. Both form an abstraction layer which builds on top of other layers - in the case of wxWidgets, there is a huge amount of API reuse from the lower layers of Windows, or GTK, or X, depending on which configuration it's built for. Each of these layers is built upon other layers, and other layers, sometimes with very deep call trees.

The most egregious example of this kind of layering that I can recall was in WebSphere. A friend had asked me to take a look at a stack trace from a WebSphere crash; somewhere deep in Java land, a 'null object exception' had been thrown. (Thank god it wasn't a NULL pointer, that would have been much worse!) The exception handler that caught it was basically the main loop, because apparently no other layer could be bothered to check for failure conditions along the way.

There were over 160 stack frames to walk through. Not one of them was due to a recursive algorithm or function. I don't know about you, but that level of stack depth is quite frankly beyond my ability to manage or debug. I don't care what it does.

wxWidgets is clearly beginning to show stress of its own, of a different character: it's becoming harder and harder to guarantee consistency across platforms. The WX guys have made tremendous progress in this regard, so that most of the core features work right, but there are simply too many details to keep track of and too many paths that will never be tested.

Here's a couple of examples of this, one of which I'm STILL fighting:

-------------------------------------------

When I first started switching the Alter Aeon client project over to WX, I initially used the wxTextCtrl class, which is built out of Microsoft system libraries in the Win32 world, or built on the GTK libraries in my development environment. I had hoped to use the class for both the input window, and for the main display; with a small amount of effort, I got the client running and working, but there were minor, very persistent issues.

The first of these was the input window. Various events, such as backspacing when empty, cause a system beep/bell under Windows. They don't cause a bell under Linux. And further, there's no way to disable this. I don't know about you, but I'll be damned if I'm going to ship a product that beeps every time someone hits backspace.

I managed to take care of some of this by trapping out various keystrokes in the CHAR handler. It seemed like a poor hack at the time, but at least it helped. However, it didn't help enough; a number of keystrokes simply don't generate CHAR events, yet they still fucking beep. I finally ended up writing a raw keyboard event handler, which tracks nearly all of the keyboard state, to trap out events that would generate a beep when passed to the lower layer. In the time it took me to disable beeping, I could have written and debugged a keyboard handler from scratch, with exactly the desired behaviour.
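To make the idea concrete, here is a minimal, toolkit-independent sketch of that kind of filter. It is not the actual client code. The key names, the `InputState` class, and the list of beep triggers are all assumptions for illustration; the only trigger named above is backspacing when empty, and the real list depends on the platform.

```python
from dataclasses import dataclass

# Hypothetical key names; a real toolkit supplies its own key codes.
BACKSPACE, DELETE, LEFT, RIGHT = "backspace", "delete", "left", "right"


@dataclass
class InputState:
    """The part of the input window's state the filter needs to track."""
    text: str = ""
    cursor: int = 0  # insertion point, 0..len(text)


def would_beep(state: InputState, key: str) -> bool:
    """Guess whether the native control would beep for this key.

    Assumption: the control beeps when an editing or movement key has
    nothing to act on. The real trigger list is platform-specific.
    """
    at_start = state.cursor == 0
    at_end = state.cursor == len(state.text)
    if key in (BACKSPACE, LEFT):
        return at_start
    if key in (DELETE, RIGHT):
        return at_end
    return False


def filter_key(state: InputState, key: str) -> bool:
    """Return True if the key is safe to pass on to the lower layer."""
    return not would_beep(state, key)
```

In a wxWidgets key handler, swallowing a key would roughly correspond to not calling `Skip()` on the event, so the default processing never sees it. The catch is the one described above: the filter is only as good as its copy of the control's state, which is why the handler ends up tracking nearly everything.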

While beeping has largely been taken care of, other issues with this so-called standard class have not. The biggest one is that the color of text displayed in the class occasionally reverts to black, depending on the versions of the system DLLs for the particular Windows installation. My first attempts at fixing this were effective on all my development environments, but failed on about half of the release environments - text typed in the window would occasionally simply vanish.

By adding forced color setting in various places where it shouldn't be needed, I eventually managed to fix this problem for about 90% of my users. Out of sheer disgust at this point, I did some extremely vicious forced color setting in various event handlers, and this appears to have fixed 'most' of the problem. I still am receiving sporadic reports of it happening on current client builds, but at least the problem goes away now and seems to be triggered at random.

When attempting to use this same class for the main window display, I ran into what seemed to be minor issues regarding the scroll bars and scrolling of text in the window. No matter what I tried, I never did find a way to get reliable scroll positioning for this class across all platforms.

After fighting this off and on for several months, I became desperate. I finally wrote my own text display class from the ground up, using nothing more than bitmaps and font drawing routines. The total from-scratch implementation time was less than the time I had previously wasted trying to get the scrollbars to work properly. It's also faster, especially for very large data sets.
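For a sense of how little state a from-scratch display needs to get scroll positioning right, here is a rough sketch of the bookkeeping side only. It is not the client's implementation. The `Scrollback` class and its method names are made up for illustration, and the actual drawing with bitmaps and font routines is left out entirely.

```python
from collections import deque
from typing import List


class Scrollback:
    """Toolkit-independent model of a scrolling text display."""

    def __init__(self, max_lines: int, visible_rows: int):
        self.lines = deque(maxlen=max_lines)  # oldest lines drop off the top
        self.visible_rows = visible_rows
        self.offset = 0  # rows scrolled up from the bottom; 0 = newest text

    def _max_offset(self) -> int:
        return max(0, len(self.lines) - self.visible_rows)

    def append(self, line: str) -> None:
        self.lines.append(line)
        # While scrolled back, keep the view anchored on the same text
        # instead of letting new output drag it downward.
        if self.offset > 0:
            self.offset = min(self.offset + 1, self._max_offset())

    def scroll(self, delta: int) -> None:
        """Positive delta scrolls up (older text), negative scrolls down."""
        self.offset = max(0, min(self.offset + delta, self._max_offset()))

    def visible(self) -> List[str]:
        """Return the lines that should currently be drawn, top to bottom."""
        end = len(self.lines) - self.offset
        start = max(0, end - self.visible_rows)
        return list(self.lines)[start:end]
```

Because the scroll position is a single integer that this code owns, it behaves the same on every platform; the toolkit is only asked to draw whatever `visible()` returns.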

-------------------------------------------

This, my friends, is the software engineering crisis in action. Each layer, while hiding some of the problems of the lower layers, introduces its own; the overall result is a system with fewer catastrophic issues, but exponentially more minor issues.

Those minor issues are surely tolerable, are they not? To an extent, yes - but at what point do you die the death of a thousand cuts?

Catastrophic issues might be catastrophic and obvious failures, but that's one of the best things about them: they're catastrophic and obvious. They HAVE to be fixed. They must be understood, they must be cleaned up and dealt with. The minor problems on the other hand, can just keep accumulating. They just keep getting worse, they just keep getting more obscure, more complicated, more difficult to find and rectify. And worse, they compound each other.

In a good scenario over the long term, they become so prevalent that the system is no longer usable. In a bad scenario, the system becomes critical and unmaintainable. It's a Swiss cheese of buggy modules and misunderstood patches.

Is that really what you want to build critical infrastructure out of? Is this where we're headed? I certainly hope not.

1 comment:

Locane said...

You know, none of what your blog says is a big surprise. You can apply this lack of quality and thoroughness concept to anything - not just society and software engineering.

I think the main fault here is that when a builder, worker, or engineer runs their internal "how much loss for how much gain" calculation, the overall total of minimal losses is never even considered, or in the rare case that it is, it's labeled as "not their problem" or "too much work to fix".

Another factor that comes to mind is that programming and computers in general are still viewed as largely hocus-pocus and magic to the general population. The type of people, and further the people with the expertise and capability to do the kinds of extremely complex software engineering jobs that are out there, are few and far between. It seems natural then, that if you have a team of 5 when you need a team of 55, corners are going to get cut. You're forced to choose between shipping something that is 90% functional but buggy, or not shipping at all.

I run into this in MY daily work, which is as a cable technician. I'm dealing with an intermittent internet problem because the last 4 techs who were out to "fix" it before jury-rigged, spliced, and band-aided the situation until it turned into an unreliable mishmash of cable and splitters, scattered throughout different areas of the business or residence. People who by default do a clean and fundamentally sound job the first time are worth their weight in gold, even if it takes them longer to accomplish it.

I'm reminded of a saying I heard while working construction, where these issues were constant:

"There's nothing more permanent than temporary"

Which I've always liked. It seems like an oxymoron, but it's also largely correct.