Sunday, March 1, 2009

Chaos seems to reign

The field of software arguably contains some of the largest and most complex functional systems ever created by man. Even relatively simple systems can demonstrate incredibly complex behavior, and can hide potentially major flaws through indefinitely long periods of use, only to demonstrate them at inconvenient times. The Alter Aeon server codebase is no exception.

Pretty much the most complicated discrete object in the server is the socket stack. It handles a huge number of protocols and a bunch of different filtering layers, with hooks in various places. The last major redesign of the socket stack was around 1998; it is a testament to the design of that module that it has required no major modification in that entire time period, though many minor modifications have been made to it.

Unfortunately, the server is not made of such lovely discrete objects. Often, obscure bugs lurk in the sea of wild code that makes up the substructure of the system. This weekend I had the incredible good luck to catch and haul two of these bugs to the surface.

For years, there have been a handful of issues that occurred seemingly at random. In the early days, some of these would even crash the server and reboot the game; but since we were unable to find and fix them, we gradually protected against the crashes and found ways to ignore these spurious events.

The two most common of these are deceptively simple: sometimes, the get_room function would be asked to find a room but be given no instructions on how to do so, and other times a character (a monster or player, it was difficult to tell) would die and simply 'get stuck'.

In reality, these are monstrously difficult. The get_room function lacking instruction was mind-boggling: it happened perhaps twice a year, the reports were never the same, and there were thousands of places that could be calling the function. Even as our debug facilities improved, no progress was made. There just seemed to be no rhyme or reason to it.

The dead character bug was just as bad. The entire destruct and event handling process was revisited and inspected several times, but no holes were ever found. Until Thursday.

---------------------------------------------------

In the course of trying to add some new restrictions to the 'charm' spell, I noticed that the destruct sequences for the charm, possession, and entangling roots spells looked a little goofy. We didn't check or clear certain important things, but there were comments indicating that we didn't need to because something else over there would take care of it. Keep in mind that this was entirely my code; I had written this well over a decade ago.

In the course of looking at this, I suddenly had a realization: there were no 'holes' in the logic for charm or possession, but maybe, just maybe, there was a hole in entangling roots. Entangling roots did come later, and quite frankly the code for it was a complete hackjob.

Within ten minutes, I had my answer. There was indeed a conflict between entangling roots and the 'special' code that made possession and charm work. The problem then became, how exactly do you fix such a mess? After about two hours of thinking about it and five more of carefully backing things out and reorganizing, I got what appears to be a stable fix. There's still some debug logging in it, but this bug appears to be properly killed. The new code is simpler, the checks are stronger, and we don't rely on obscure handlers to clean up messes. I hope.

---------------------------------------------------

I thought that was the end of my troubles in the short term, but then Glorida showed up with an obscure problem of his own. He had been working on mob programs, and had built something rather complex that simply was not working. Not only was it not working, but it was doing something weird, and it was doing it reliably.

This is another one of those 'fairly complex' pieces of the system. It took me about two hours off and on to get it loaded into my brain so I could really think about it. It then took me probably another hour to really understand what was going on. And then it occurred to me:

This explains a lot of those debug log reports over the years!

The symptoms he uncovered showed up as a very unusual sort of 'doing things before other things have completed' recursive issue. One example of it is that a monster would 'say' something to trigger another monster, and the second monster would perform its action before the first monster could fully complete its 'say' command. In obscure cases, this chaining could be several layers deep.

For simple things like monsters talking to each other, the worst that can happen is that some things get out of order, and you might not understand why it works sometimes but not others. But monsters do substantially more than just talk to each other.

And this is where the problem arose. In the course of doing more complicated actions, those actions could be interrupted mid-stream by other monsters trying to complete their triggers. One such set of actions would cause the get_room function to be passed trash. Another such set of actions would damage one of the monsters so that it could never properly die. Both of these sets of actions, and a number of others with similar strange effects, were possible and implemented in monsters on the game; and they were sufficiently rare to explain the infrequent bug reports.

This bug was easier to fix than the first, taking only about three hours to really think about and put together a proper solution. This fix also appears to work, though it does break a dozen or so special monsters that relied on the old behaviour to function.
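To make the failure mode concrete, here is a minimal Python sketch of the 'say triggers another monster' chain described above. It is not the Alter Aeon code, and the post does not say how the real fix works; the World class, its on_hear/say methods, and the deferred flag are all hypothetical names invented for illustration. It contrasts running a trigger immediately (inside the speaker's unfinished command) with one common remedy, queueing triggers and running them only after the current command has completed.

```python
from collections import deque


class World:
    """Toy trigger dispatcher. All names are hypothetical, not from the server."""

    def __init__(self, deferred):
        self.deferred = deferred  # True: queue triggers; False: run them at once
        self.listeners = []       # list of (trigger_word, callback) pairs
        self.pending = deque()    # triggers waiting for the current command to end
        self.draining = False     # True while the outermost say() runs the queue
        self.log = []

    def on_hear(self, word, callback):
        self.listeners.append((word, callback))

    def say(self, speaker, text):
        self.log.append(f"{speaker} begins: {text}")
        for word, callback in self.listeners:
            if word in text:
                if self.deferred:
                    self.pending.append(callback)
                else:
                    # Reentrant: the listener acts before the speaker is done.
                    callback()
        self.log.append(f"{speaker} ends: {text}")
        if self.deferred:
            self._drain()

    def _drain(self):
        if self.draining:
            return  # an outer say() is already working through the queue
        self.draining = True
        try:
            while self.pending:
                self.pending.popleft()()
        finally:
            self.draining = False


def run(deferred):
    world = World(deferred)
    world.on_hear("hello", lambda: world.say("guard", "halt"))
    world.on_hear("halt", lambda: world.say("dog", "woof"))
    world.say("merchant", "hello")
    return world.log


if __name__ == "__main__":
    for mode in (False, True):
        print("deferred" if mode else "immediate")
        for line in run(mode):
            print("   ", line)
```

In immediate mode the guard and the dog both begin and end inside the merchant's unfinished 'say', nested two layers deep. In deferred mode each command runs to completion before the next trigger fires. The cost is the one noted above: anything that depended on the old nested ordering behaves differently.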

---------------------------------------------------

There are several morals to this story, which all software engineers worth their salt will immediately recognize:

1) Never assume you'll remember anything about code you've written. When you start working on objects so complicated that it takes you an hour every day to load them into your head so you can go to work, what makes you think you'll remember every detail after a year?

2) Think about the design of any halfway complex system and stick with it. The charm/entangling roots bug was caused entirely by undesigned/spaghetti code for the character destruct process. It was never properly designed because I wasn't experienced enough at the time to know how. It's better now.

3) Never underestimate the power of race conditions and call trees. When even simple/obvious actions can invoke arbitrarily complicated effects capable of invoking other effects, you're walking on very, very dangerous ground.

Software continues to become more complex as time goes on. It pushes at the limits of our minds, bringing programmers to the limit of what they can understand and then begging them to add one more thing. It allows arbitrary expression, but with that comes the cost that our comprehension is limited even for structured objects, to say nothing of more arbitrary and abstract ones.

Where does the future of software lie? Undoubtedly toward increasing complexity. But we will need either better tools, or better brains, to be able to manage it. We are such a young species.

1 comment:

Shawn said...

Debugging software is inherently more difficult than writing software. So if you write your code as cleverly as possible, by definition, you are not smart enough to debug it.

One of my favorite programming quotes.