Wednesday, June 8, 2011

Everything Breaks, All the Time.


(This is the second of a three-part series about the black art of doing tech support. The first part is here.)

Anyone who ever has to do tech support (or who is trying to get a broken program to function) must first internalize one key, vastly important fact:


You can take a flawlessly written program, install it on a new, factory-fresh, basically functional computer, run it, and find that it doesn't work.

When you understand why this is, tech support, giving and receiving, becomes ever so much easier.

Computers Are Mechanical Devices

Computers are so close to magic that it is easy to forget that they are machines. Incredibly, brain-breakingly complex machines that record and recover millions of bits of information a second (in RAM or on your hard drive), storing those details as tiny electrical charges or in the magnetic fields of microscopically small bits of matter. So much is done, so quickly, on such a small scale (small enough that quantum mechanics becomes relevant) that I'm amazed any computer ever manages to work at all, ever.

When data is recorded on the hard drive, errors can happen. There are guards in place (called checksums, for what it's worth) to help keep the errors under control, but there are still many, many ways that incomplete and incorrect chunks of data can be recorded. The longer you operate your computer, the more errors there will be.
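For the curious, here is a minimal Python sketch of what a checksum buys you. It assumes CRC-32 as a stand-in (real drives use their own error-correcting codes in hardware), and the sample data and the `flip_bit` helper are made up for illustration.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return a CRC-32 checksum, used here as a stand-in for a drive's own guards."""
    return zlib.crc32(data)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a recording error by flipping a single bit (hypothetical helper)."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


# Made-up example data: what the program meant to write.
original = b"saved game: party has 1200 experience"
stored_checksum = checksum(original)

# What actually ended up recorded: one bit is wrong.
damaged = flip_bit(original, 42)

# Recomputing the checksum exposes the mismatch.
is_intact = checksum(damaged) == stored_checksum
print("data intact:", is_intact)
```

A checksum like this only tells you that something changed. It can't tell you what the data was supposed to be, which is one reason reinstalling is so often the cure.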

Most of the time, when these errors occur, you never find out. They happen in bits of the operating system or in programs that you don't use, or the error introduced is so minor you just ignore it. But sometimes the error happens in a graphics driver, or your saved game, or the bit of my RPG that determines whether your characters get experience or not, and suddenly there is a problem.

So What Does This Mean?

It means that even the best-written program will have a ton of problems out in the field that aren't the developer's fault. Problems that need to be fixed by rebooting the computer and relaunching the program (to fix any error in memory) or by reinstalling whatever part of the software (the game, the drivers, the operating system) has become broken.

If the problem is in the game, your characters might stop doing damage, or you might lose the ability to enter new places, or the game just might start crashing like crazy. Corrupted file in the display drivers? The graphics might be drawn funny, or the screen might always be black, or the game just might start crashing like crazy. Corrupted file in the operating system? The game might stop being able to save, or the settings file (that contains the registration) might disappear, or the game just might start crashing like crazy.

I'm not just blowing smoke to distract from my own errors. These problems happen all the time.

Of course, when users report these problems, they will pretty much always assume that it is your fault and you are an idiot. I have gotten multitudes of bug reports along the lines of, "Whenever I try to start a new game, the program crashes. This is a terrible bug and you should fix it right away!" When I get these messages, what I want to respond (but don't) is, "If my game had a problem this serious, don't you think I would drop everything this instant to fix it? You think I want to sell games that are never usable by anyone? What turnip truck do you think I just rolled off of?"

That's what I don't say. What I do is send them my standard list of tech support steps, and, 99% of the time, problem solved.

My Rule For When I Start To Hunt For a Bug

It's a simple one.

I never even consider that a problem someone reports is a bug in my code until two people report the exact same problem.

Sometimes, if the report is vague enough, I wait for three people. It can be maddening to get reports of catastrophic problems and not act on them, but it's worse to waste your limited, precious time hunting for gremlins.

We Live In a World Of Frustrations

I know that, every time I release a new game, thousands of people will get the demo, run it, and it won't work because of the reasons outlined above. They delete the game, write me off as a bonehead, and never send me teh moneyz. This is hugely frustrating. Nobody wants to be thought an idiot, and everyone wants the aforementioned moneyz. It's sad, but it's part of the business of writing games for computers.

It's even worse when they then go online and write about what a bonehead you are. Recently, a site called Platform Nation reviewed our newest game, Avadon. The reviewer got stuck with a horrible glitch that teleported his character into nothingness. He proceeded to excoriate me for writing such a terribly buggy game. Please believe me when I say that nobody, and I mean nobody, besides the reviewer has ever reported this problem. Don't believe me? Our support and Avadon forums have never had a mention of it. But the reviewer still called the game "wrong or broken" and "unforgivable" and gave it 1/10.

(Interestingly, the review has disappeared from the main site, and the only remaining copy is on their forums. I can therefore neglect expressing any other opinions about the reviewer's level of professionalism.)

Of course, if this sort of horrible game-breaking behavior were a bug, I would do everything I could to fix it. But that's not how things work.

Game Development Isn't For Wimps

Many people will get a game that breaks, and most of them will simply disappear and never try your product again. But some of them, happily, will come to you for help. When they do, you should smile, take a deep breath, and do what you can to make them happy. When the problem is a weird one I've never heard of, I will first send them my magic troubleshooting checklist that solves all problems. I'll post that next week, and everything will be better for everyone forever and always.

32 comments:

  1. "I never even consider that a problem someone reports is a bug in my code until two people report the exact same problem."

    I presume that's only for post-beta code, right? :-) I've done some beta-testing for Spiderweb, and you've generally been really good about paying heed to my reports -- even when I'm fairly sure I was the only one who had that problem. (Then again, I only tend to report bugs after I've made every effort I can to consistently replicate them...)

  2. Beta too. I have had beta testers report weird glitches that only they ever got about 1000000 times. An odd, unexplainable problem only gets attention when a second tester finds it. And, if it's a real bug, a second tester pretty much always gets it eventually.

    - Jeff Vogel

  3. To wit, when testing Avernum I had the game crash as soon as I opened a chest after defeating a horde of sliths in the temple where part of Demonslayer was stored. I distinctly remember our (Jeff's and my) exchange regarding the bug.

    I tried and tried to reproduce it; I could not. I always figured it was sun spots. Or possibly evil robots.

  4. It's misleading to attribute bugs to "quantum magic and disk errors". That's like saying "beaches are deadly because tsunami".

    Yes, it is amazing that computers work at all, and deterministically for all practical purposes. Crud does accumulate from previous errors and may lurk silently for eons.

    However, the vast majority of bugs - even those seemingly "one-time"-glitches - are actual bugs, and your code is involved.

    I've hunted down hundreds of bugs where the mechanism that makes them occur is downright amazing, found sometimes only due to sheer luck, persistence and curiosity.

    And yes, I also have diagnosed faulty disks, unreliable USB cables, etc. Less than ten.

    There are many mechanisms by which a bug can show up only rarely. Just because you can't reproduce it doesn't mean it's not there.

    Don't blame the witch, don't teach people it's ok to blame the witch.

  5. I've run into many tech support people who believe in random evil spirits and cargo cult cures. It's not that problems out there are mysterious, it's that some tech support people are ill-equipped by nature and training to understand them. All issues are in one or both of two categories: User error or bug.

    This is complete twaddle: "...even the best-written program will have a ton of problems out in the field that aren't the developer's fault. Problems that need to be fixed by rebooting the computer and relaunching the program (to fix any error in memory) or by reinstalling whatever part of the software (the game, the drivers, the operating system) that have become broken."

    Problems that need to be fixed by rebooting, relaunching or reinstalling are due to bugs!

  6. "For home-based personal computers, the CPU has ranged from approximately 1 megahertz in the late 1970s (Atari, Commodore, Apple computers) to up to 6 GHz in the present (IBM POWER processors)."

    Are you from the '70s? 1 megahertz is 1 million cycles per second, at 1 instruction per cycle.

  7. Your "key vastly important fact" contains a false premise, in that "flawlessly written programs" do not actually exist. What do exist are programs full of bugs that don't matter for most users most of the time, some of which can be safely ignored.

    Yeah, it's theoretically possible that the guy who hit some weird bug had a corrupted hard drive, but it is much much more likely they had a particular way of using your program that took them down a particular code path which made bugs show up for them that most users don't see. This is why you have QA. Some people ("testers") have a knack for reproducibly finding those problem code paths, while other people ("developers") are good at instinctively avoiding them.

  8. To those who insist that it's always code and that the random disk errors are so infrequent - that's just as general and insensitive as Jeff's original point, except his is based on far more experience. Sure, there are lots of delicate codes that can get messed up, but that doesn't mean the code is at fault. You can get a corrupted file and say "the code should have been made this way, not that way. that would have fixed it." But the truth is, computer glitches happen all the time. That is why rebooting solves the majority of "bugs."

    If rebooting works, the problem is specifically *not* with the code.

  9. Over the course of my programming career, I've come to believe strongly that the best developers are the ones who *first* assume that any reported problem is their own fault. Then they go and spend some time trying to figure out what they did wrong, and only after producing genuine evidence and defensible hypotheses that it was in fact not their fault, do they approach others or openly suggest that someone or something else is to blame.

    I've met many engineers who rely on voodoo to short-circuit their own responsibility and laziness, and ultimately those people should be fired from an engineering team.

    I'm not saying your arguments don't have merit. In the wild west of real people's computers, there are many things that happen which are not your fault as a developer. And even if something was your fault, if only one person in a million is seeing it, it might not be worth your time to fix it (maybe.) But that is just a prioritization decision, not an assignment of responsibility.

    On the other hand, blog posts like this give ammunition and confidence to hordes of B-player, lazy developers who really don't deserve it. I would like to hear a little more to underscore the need for personal follow-up and responsibility for programming mistakes (which are the culprit behind most problem reports).

  10. I expected responses like these. After all, I am challenging established dogma that developers are lazy and incompetent and we always ship shoddy products and all problems are our fault and we should all go around wearing hair shirts or something.

    But this does not line up with Reality As It Is Lived.

    Or, to put it another way. Frequently one of my users reports a nasty problem nobody else is reporting (usually a repeatable crash or some game system logic gone crazy). When this happens, the advice that fixes the problem most often BY FAR is, "Uninstall. Redownload. Reinstall."

    Now, if a program is repeatably failing, even after rebooting, you reinstall the SAME program, and the problem goes away, the only explanation I see here is file corruption.

    It happens. It's not gremlins, or magic, or "random evil spirits". It's a side effect of how these delicate mechanical devices work, it happens ALL THE TIME, it is a True Fact, and I am entirely justified discussing it.

    - Jeff Vogel

  11. Not to diss you, but based solely on your description so far, another option is that a config file got corrupted by a rarely-hit bug in your code, and reset during the reinstall. Or your save file ended up in a pickle, and your user started a new game after deleting everything.

    I'm not saying it's not externally-driven file corruption, but it's certainly not the only viable explanation going.

    ReplyDelete
  12. Can't someone easily diagnose file corruption with tools like md5sum?
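As a rough illustration of that idea, here is a Python sketch that hashes every file in an install folder and compares the results against a list of known-good hashes. It uses SHA-256 from the standard library in place of the `md5sum` tool, and the function names and the manifest format are assumptions made for this example, not anything Spiderweb ships.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash one file in chunks so large data files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(install_dir: Path) -> dict[str, str]:
    """Record a known-good hash for every file (run once on a clean install)."""
    return {
        path.relative_to(install_dir).as_posix(): file_digest(path)
        for path in sorted(install_dir.rglob("*"))
        if path.is_file()
    }


def find_damaged_files(install_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose contents no longer match."""
    damaged = []
    for relative_name, expected in sorted(manifest.items()):
        path = install_dir / relative_name
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(relative_name)
    return damaged
```

The catch is that someone has to publish the manifest and the user has to run the check, which is more work than "Uninstall. Redownload. Reinstall."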

  13. @Chris: Reasonable suggestions, but probably not the case here.

    My programs and installers are very simple creatures. The installer dumps a folder in Programs, and that's it. Any config file that I edit is not touched by the demo installer (which is what I have them use).

    As for the saved game thing, starting a new game is something I suggest if the reinstall doesn't help. I'm sure people restarting and not telling me would explain some of the fixes, but I'm sure not all.

    Also, saved games get corrupted too. I know. I know. Every time a file gets truncated or incorrectly transcribed, it's my fault, always, even when it isn't. But still.

    - Jeff Vogel

  14. If you're seeing "random disk errors" or "random memory" errors often, the most likely explanation is that your code is somehow *producing* these errors. (I base this conclusion on ~20 years of doing software QA, during which I tracked down a great many memory corruption errors in things like text editors)

    "the truth is, computer glitches happen all the time."

    If your code or the environment it runs in is buggy they do, yes. If your code is more robust they tend to happen much less often. For instance, if you were to *sanity-check* that saved-game file and not just crash if it "gets truncated or incorrectly transcribed", people might not have to uninstall/reinstall to recover from the problem.
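A sketch of what such a sanity check could look like, in Python. The file layout here (a `SAVE` marker, a payload length and a CRC-32 in a small header) is invented for illustration and is not the format any Spiderweb game uses.

```python
import struct
import zlib

MAGIC = b"SAVE"                   # hypothetical file marker
HEADER = struct.Struct("<4sII")   # marker, payload length, CRC-32 of payload


class CorruptSaveError(Exception):
    """Raised instead of crashing when a save file fails its sanity check."""


def pack_save(payload: bytes) -> bytes:
    """Prefix the saved-game bytes with a header describing them."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def unpack_save(blob: bytes) -> bytes:
    """Return the payload, or raise CorruptSaveError with a readable reason."""
    if len(blob) < HEADER.size:
        raise CorruptSaveError("file is too short to hold a header")
    magic, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise CorruptSaveError("file is not a saved game")
    if len(payload) != length:
        raise CorruptSaveError("file is truncated or has extra data")
    if zlib.crc32(payload) != crc:
        raise CorruptSaveError("contents do not match the stored checksum")
    return payload
```

With a check like this the game can catch `CorruptSaveError`, tell the player the save is damaged, and offer an older save rather than falling over.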

    "If rebooting works, the problem is specifically *not* with the code."

    If you have a memory leak, your program can get slower and slower leading to timing-related issues...which rebooting fixes. If you run off the end of an array and corrupt memory, your program can crash or behave oddly either right then or at some later time completely unrelated to the action that caused the corruption...and rebooting fixes it. Rebooting is a temporary fix; actually finding the bug and getting rid of it is the permanent fix.

    Concrete example: when I was testing Newton 2.0 handwriting recognition I found recognition would reliably crash if I hand-wrote the poem "Jabberwocky", making corrections as I went along. It didn't crash for other texts and nobody else had noticed the problem before. Actual cause: there was a fixed length buffer for adding new words to the dictionary and an off-by-one error so that if you filled the buffer it caused a crash; the buffer was cleaned out by a background process *just often enough* that for most texts written by most people most of the time it wouldn't ever entirely fill. Too many unfamiliar words entered per unit time in a particular way/context caused the crash. That sort of bug looks like "gremlins" when encountered by a normal user - you could hit it a few times in a row and then by pure chance never see it again because you were doing things differently.

  15. I administered thousands of servers at Google. Lots of random things would go wrong. Those machines didn't have parity checking memory, and we absolutely, provably, got lots of corrupted memory errors. Google builds its own hardware now, and it's a lot more reliable than those old machines, but still, when the numbers are great enough you will get random errors.

    If we pretended that every time anything broke it was a bug in our code, we never would have gotten anything done.

    That said, hardware errors aren't the most likely explanation for the kinds of glitches that Jeff is describing. It's more likely to have a version of the display device driver that is subtly incompatible with another driver on the system, or something like that. Every machine has a slightly different configuration, even if they are all supposed to be running exactly the same OS, and some of them will produce weird results only when certain code is executed. That doesn't mean the code is wrong and it doesn't mean the problem can ever be reproduced by someone who doesn't have access to that particular system. Jeff is entirely right about that.

  16. You never EVER really did tech support, did you?
    'Cause if so you would know the 3 magic answers:

    1) Try again later.

    Huh, Still doesn't work.

    2) Don't worry, it's a known problem.

    Yeah! Sure but WTF do I do?

    3) It'll be fixed in next release.

    ...

  17. "I am challenging established dogma that developers are lazy and incompetent"

    Er, you're responding to comments by a bunch of developers. You've never spent hours or days tracking down a weird, nigh-impossible-to-reproduce bug in your code? Really? Really?

    I've been a professional programmer for ten years now. My code runs constantly on thousands of machines, from Windows 2000 on up. It's generally quite solid. And there's still the occasional bug report that comes in related to old, highly-tested code. And it's almost always my fault. Even when there's a weird configuration issue on their end, it's still usually my fault, because I used an API slightly incorrectly in a way that would work 95% of the time, or didn't handle an error condition.

    If none of this sounds familiar, I don't really know what to say. Your code may run perfectly 99.999% of the time. It may pass vigorous QA testing. That doesn't mean it's free of bugs.

  18. This comment has been removed by a blog administrator.

  19. @JavaJack, good question.

    I would also suggest using some kind of logging mechanism which, when the application crashes or when a user reports a bug, would gather all the internal game information (function results, error codes, etc.) and export it to a ZIP file. The user could then send this ZIP to Jeff for analysis.
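A minimal sketch of that kind of crash report, in Python. Everything here is hypothetical: the `log_event` and `write_crash_report` names, the 200-event limit and the file names inside the archive are choices made for the example.

```python
import json
import platform
import traceback
import zipfile
from collections import deque
from pathlib import Path

# Keep only the most recent events so the report stays small (limit is arbitrary).
recent_events = deque(maxlen=200)


def log_event(message: str) -> None:
    """Remember something the game just did, e.g. 'opened a chest'."""
    recent_events.append(message)


def write_crash_report(error: BaseException, report_path: Path) -> Path:
    """Bundle the error, basic system details and recent events into one ZIP."""
    details = {
        "system": platform.system(),
        "python": platform.python_version(),
        "error": repr(error),
    }
    trace = "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )
    with zipfile.ZipFile(report_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("system.json", json.dumps(details, indent=2))
        archive.writestr("traceback.txt", trace)
        archive.writestr("events.log", "\n".join(recent_events))
    return report_path
```

The game would call `log_event` as it runs and call `write_crash_report` from a top-level exception handler, then ask the user to attach the resulting file to the bug report.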

  20. @Naysayers: Ladies and Gentlemen, I present to you the age old programmer mistake of confusing the kind of issues you find with the customer's PC with the kind of issues you find IN YOUR TEST ENVIRONMENT.

    You always assume a bug that you find in your system is your fault because you can constrain the side-effects.
    You start with your own code till you reach the point where you've literally narrowed the bug down to the space between function calls. Because you can use a debugger, or a test app, or a thousand different types of Test System controls.

    Once it's the customer's PC?
    You assume that their PC is packed with insulating material and that they're using the CD drive as a cupholder. And that they're running on their own special Homebrew "BetterWindoez" OS.

    Remember, "Perfect is the enemy of Good."
    Your job is to fix the maximum number of bugs with the minimum of work, so you can continue developing your next project.
