Wednesday, June 8, 2011

Everything Breaks, All the Time.


(This is the second of a three-part series about the black art of doing tech support. The first part is here.)

Anyone who ever has to do tech support (or who is trying to get a broken program to function) must first internalize one key, vastly important fact:


You can take a flawlessly written program, install it on a new, factory-fresh, basically functional computer, run it, and find that it doesn't work.

When you understand why this is, tech support, giving and receiving, becomes ever so much easier.

Computers Are Mechanical Devices

Computers are so close to magic that it is easy to forget that they are machines. They are incredibly, brain-breakingly complex machines that record and recover millions of bits of information a second (in RAM or on your hard drive), etching those details into the magnetic fields of microscopically small bits of matter. So much is done, so quickly, and on a scale so small that quantum mechanics becomes relevant, that I'm amazed any computer ever manages to work at all.

When data is recorded on the hard drive, errors can happen. There are guards in place (called checksums, for what it's worth) to help keep the errors under control, but there are still many, many ways that incomplete and incorrect chunks of data can be recorded. The longer you operate your computer, the more errors there will be.
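To make the checksum idea concrete, here is a minimal sketch in Python. It is an application-level illustration only: a drive's own error checking happens in hardware and firmware, and nothing here is taken from any actual game's code. The names `checksum` and `is_intact` and the saved-game contents are made up for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 fingerprint of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected: str) -> bool:
    """True if the data still matches the fingerprint recorded earlier."""
    return checksum(data) == expected


# Hypothetical saved-game contents; record the fingerprint when writing them.
original = b"party_gold=1200;party_xp=5400"
recorded = checksum(original)

# Simulate a single flipped bit somewhere in storage.
damaged = bytearray(original)
damaged[0] ^= 0x01

print(is_intact(original, recorded))        # True
print(is_intact(bytes(damaged), recorded))  # False
```

A check like this can only detect damage, not repair it, which is why the cure is still to replace the damaged file with a good copy.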

Most of the time, when these errors occur, you never find out. They happen in bits of the operating system, or in programs that you don't use, or the error introduced is so minor you just ignore it. But sometimes the error happens in a graphics driver, or your saved game, or the bit of my RPG that determines whether your characters get experience or not, and suddenly there is a problem.

So What Does This Mean?

It means that even the best-written program will have a ton of problems out in the field that aren't the developer's fault. Problems that need to be fixed by rebooting the computer and relaunching the program (to fix any error in memory) or by reinstalling whatever part of the software (the game, the drivers, the operating system) has become broken.

If the problem is in the game, your characters might stop doing damage, or you might lose the ability to enter new places, or the game just might start crashing like crazy. Corrupted file in the display drivers? The graphics might be drawn funny, or the screen might always be black, or the game just might start crashing like crazy. Corrupted file in the operating system? The game might stop being able to save, or the settings file (that contains the registration) might disappear, or the game just might start crashing like crazy.

I'm not just blowing smoke to distract from my own errors. These problems happen all the time.

Of course, when users report these problems, they will pretty much always assume that it is your fault and you are an idiot. I have gotten multitudes of bug reports along the lines of, "Whenever I try to start a new game, the program crashes. This is a terrible bug and you should fix it right away!" When I get these messages, what I want to respond (but don't) is, "If my game had a problem this serious, don't you think I would drop everything this instant to fix it? You think I want to sell games that are never usable by anyone? What turnip truck do you think I just rolled off of?"

That's what I don't say. What I do is send them my standard list of tech support steps, and, 99% of the time, problem solved.

My Rule For When I Start To Hunt For a Bug

It's a simple one.

I never even consider that a problem someone reports is a bug in my code until two people report the exact same problem.

Sometimes, if the report is vague enough, I wait for three people. It can be maddening to get reports of catastrophic problems and not act on them, but it's worse to waste your limited, precious time hunting for gremlins.

We Live In a World Of Frustrations

I know that, every time I release a new game, thousands of people will get the demo, run it, and it won't work because of the reasons outlined above. They delete the game, write me off as a bonehead, and never send me teh moneyz. This is hugely frustrating. Nobody wants to be thought an idiot, and everyone wants the aforementioned moneyz. It's sad, but it's part of the business of writing games for computers.

It's even worse when they then go online and write about what a bonehead you are. Recently, a site called Platform Nation reviewed our newest game, Avadon. The reviewer got stuck with a horrible glitch that teleported his character into nothingness. He proceeded to excoriate me for writing such a terribly buggy game. Please believe me when I say that nobody, and I mean nobody, besides the reviewer has ever reported this problem. Don't believe me? Our support and Avadon forums have never had a mention of it. But the reviewer still called the game "wrong or broken" and "unforgivable" and gave it 1/10.

(Interestingly, the review has disappeared from the main site, and the only remaining copy is on their forums. I can therefore neglect expressing any other opinions about the reviewer's level of professionalism.)

Of course, if this sort of horrible game-breaking behavior was a bug, I would do everything I could to fix it. But that's not how things work.

Game Development Isn't For Wimps

Many people will get a game that breaks, and most of them will simply disappear and never try your product again. But some of them, happily, will come to you for help. When they do, you should smile, take a deep breath, and do what you can to make them happy. When the problem is a weird one I've never heard of, I will first send them my magic troubleshooting checklist that solves all problems. I'll post that next week, and everything will be better for everyone forever and always.

66 comments:

  1. "I never even consider that a problem someone reports is a bug in my code until two people report the exact same problem."

    I presume that's only for post-beta code, right? :-) I've done some beta-testing for Spiderweb, and you've generally been really good about paying heed to my reports -- even when I'm fairly sure I was the only one who had that problem. (Then again, I only tend to report bugs after I've made every effort I can to consistently replicate them...)

  2. Beta too. I have had beta testers report weird glitches that only they ever got about 1000000 times. An odd, unexplainable problem only gets attention when a second tester finds it. And, if it's a real bug, a second tester pretty much always gets it eventually.

    - Jeff Vogel

  3. To wit, when testing Avernum I had the game crash as soon as I opened a chest after defeating a horde of sliths in the temple where part of Demonslayer was stored. I distinctly remember our (Jeff's and my) exchange regarding the bug.

    I tried and tried to reproduce it; I could not. I always figured it was sun spots. Or possibly evil robots.

  4. It's misleading to attribute bugs to "quantum magic and disk errors". That's like saying "beaches are deadly because tsunami".

    Yes, it is amazing that computers work at all, and deterministically for all practical purposes. Crud does accumulate from previous errors and may lurk silently for eons.

    However, the vast majority of bugs - even those seemingly "one-time"-glitches - are actual bugs, and your code is involved.

    I've hunted down hundreds of bugs where the mechanism that makes them occur is downright amazing, found sometimes only due to sheer luck, persistence and curiosity.

    And yes, I also have diagnosed faulty disks, unreliable USB cables, etc. Less than ten.

    There are many reasons why a bug rarely shows up. Just because you can't reproduce it doesn't mean it's not there.

    Don't blame the witch, don't teach people it's ok to blame the witch.

  5. I've run into many tech support people who believe in random evil spirits and cargo cult cures. It's not that problems out there are mysterious, it's that some tech support people are ill-equipped by nature and training to understand them. All issues are in one or both of two categories: User error or bug.

    This is complete twaddle: "...even the best-written program will have a ton of problems out in the field that aren't the developer's fault. Problems that need to be fixed by rebooting the computer and relaunching the program (to fix any error in memory) or by reinstalling whatever part of the software (the game, the drivers, the operating system) that have become broken."

    Problems that need to be fixed by rebooting, relaunching or reinstalling are due to bugs!

  6. "For home-based personal computers, the CPU has ranged from approximately 1 megahertz in the late 1970s (Atari, Commodore, Apple computers) to up to 6 GHz in the present (IBM POWER processors)."

    Are you from the '70s? 1 megahertz is 1 million cycles per second, at 1 instruction per cycle.

  7. Your "key vastly important fact" contains a false premise, in that "flawlessly written programs" do not actually exist. What do exist are programs full of bugs that don't matter for most users most of the time, some of which can be safely ignored.

    Yeah, it's theoretically possible that the guy who hit some weird bug had a corrupted hard drive, but it is much much more likely they had a particular way of using your program that took them down a particular code path which made bugs show up for them that most users don't see. This is why you have QA. Some people ("testers") have a knack for reproducibly finding those problem code paths, while other people ("developers") are good at instinctively avoiding them.

  8. To those who insist that it's always code and that the random disk errors are so infrequent - that's just as general and insensitive as Jeff's original point, except his is based on far more experience. Sure, there is lots of delicate code that can get messed up, but that doesn't mean the code is at fault. You can get a corrupted file and say "the code should have been made this way, not that way. That would have fixed it." But the truth is, computer glitches happen all the time. That is why rebooting solves the majority of "bugs."

    If rebooting works, the problem is specifically *not* with the code.

  9. Over the course of my programming career, I've come to believe strongly that the best developers are the ones who *first* assume that any reported problem is their own fault. Then they go and spend some time trying to figure out what they did wrong, and only after producing genuine evidence and defensible hypotheses that it was in fact not their fault, do they approach others or openly suggest that someone or something else is to blame.

    I've met many engineers who rely on voodoo to short-circuit their own responsibility and laziness, and ultimately those people should be fired from an engineering team.

    I'm not saying your arguments don't have merit. In the wild west of real people's computers, there are many things that happen which are not your fault as a developer. And even if something was your fault, if only one person in a million is seeing it, it might not be worth your time to fix it (maybe.) But that is just a prioritization decision, not an assignment of responsibility.

    On the other hand, blog posts like this give ammunition and confidence to hordes of B-player, lazy developers who really don't deserve it. I would like to hear a little more to underscore the need for personal follow-up and responsibility for programming mistakes (which are the culprit in most problem reports).

  10. I expected responses like these. After all, I am challenging established dogma that developers are lazy and incompetent and we always ship shoddy products and all problems are our fault and we should all go around wearing hair shirts or something.

    But this does not line up with Reality As It Is Lived.

    Or, to put it another way: frequently, one of my users reports a nasty problem nobody else is reporting (usually a repeatable crash or some game system logic gone crazy). When this happens, the advice that fixes the problem most often BY FAR is, "Uninstall. Redownload. Reinstall."

    Now, if a program is repeatably failing, even after rebooting, you reinstall the SAME program, and the problem goes away, the only explanation I see here is file corruption.

    It happens. It's not gremlins, or magic, or "random evil spirits". It's a side effect of how these delicate mechanical devices work, it happens ALL THE TIME, it is a True Fact, and I am entirely justified discussing it.

    - Jeff Vogel

  11. Not to diss you, but based solely on your description so far, another option is that a config file got corrupted by a rarely-hit bug in your code, and reset during the reinstall. Or your save file ended up in a pickle, and your user started a new game after deleting everything.

    I'm not saying it's not externally-driven file corruption, but it's certainly not the only viable explanation going.

  12. Can't someone easily diagnose file corruption with tools like md5sum?

  13. @Chris: Reasonable suggestions, but probably not the case here.

    My programs and installers are very simple creatures. The installer dumps a folder in Programs, and that's it. Any config file that I edit is not touched by the demo installer (which is what I have them use).

    As for the saved game thing, starting a new game is something I suggest if the reinstall doesn't help. I'm sure people restarting and not telling me would explain some of the fixes, but I'm sure not all.

    Also, saved games get corrupted too. I know. I know. Every time a file gets truncated or incorrectly transcribed, it's my fault, always, even when it isn't. But still.

    - Jeff Vogel

  14. If you're seeing "random disk errors" or "random memory" errors often, the most likely explanation is that your code is somehow *producing* these errors. (I base this conclusion on ~20 years of doing software QA, during which I tracked down a great many memory corruption errors in things like text editors)

    "the truth is, computer glitches happen all the time."

    If your code or the environment it runs in is buggy they do, yes. If your code is more robust they tend to happen much less often. For instance, if you were to *sanity-check* that saved-game file and not just crash if it "gets truncated or incorrectly transcribed", people might not have to uninstall/reinstall to recover from the problem.

    "If rebooting works, the problem is specifically *not* with the code."

    If you have a memory leak, your program can get slower and slower leading to timing-related issues...which rebooting fixes. If you run off the end of an array and corrupt memory, your program can crash or behave oddly either right then or at some later time completely unrelated to the action that caused the corruption...and rebooting fixes it. Rebooting is a temporary fix; actually finding the bug and getting rid of it is the permanent fix.

    Concrete example: when I was testing Newton 2.0 handwriting recognition I found recognition would reliably crash if I hand-wrote the poem "Jabberwocky", making corrections as I went along. It didn't crash for other texts and nobody else had noticed the problem before. Actual cause: there was a fixed-length buffer for adding new words to the dictionary and an off-by-one error, so that if you filled the buffer it caused a crash; the buffer was cleaned out by a background process *just often enough* that for most texts written by most people most of the time it wouldn't ever entirely fill. Too many unfamiliar words entered per unit time in a particular way/context caused the crash. That sort of bug looks like "gremlins" when encountered by a normal user - you could hit it a few times in a row and then by pure chance never see it again because you were doing things differently.

  15. I administered thousands of servers at Google. Lots of random things would go wrong. Those machines didn't have parity checking memory, and we absolutely, provably, got lots of corrupted memory errors. Google builds its own hardware now, and it's a lot more reliable than those old machines, but still, when the numbers are great enough you will get random errors.

    If we pretended that every time anything broke it was a bug in our code, we never would have gotten anything done.

    That said, hardware errors aren't the most likely explanation for the kinds of glitches that Jeff is describing. It's more likely to have a version of the display device driver that is subtly incompatible with another driver on the system, or something like that. Every machine has a slightly different configuration, even if they are all supposed to be running exactly the same OS, and some of them will produce weird results only when certain code is executed. That doesn't mean the code is wrong and it doesn't mean the problem can ever be reproduced by someone who doesn't have access to that particular system. Jeff is entirely right about that.

  16. You never EVER really did tech support, did you?
    'Cause if so you would know the 3 magic answers:

    1) Try again later.

    Huh, Still doesn't work.

    2) Don't worry, it's a known problem.

    Yeah! Sure but WTF do I do?

    3) It'll be fixed in next release.

    ...

  17. "I am challenging established dogma that developers are lazy and incompetent"

    Er, you're responding to comments by a bunch of developers. You've never spent hours or days tracking down a weird, nigh-impossible-to-reproduce bug in your code? Really? Really?

    I've been a professional programmer for ten years now. My code runs constantly on thousands of machines, from Windows 2000 on up. It's generally quite solid. And there's still the occasional bug report that comes in related to old, highly-tested code. And it's almost always my fault. Even when there's a weird configuration issue on their end, it's still usually my fault, because I used an API slightly incorrectly in a way that would work 95% of the time, or didn't handle an error condition.

    If none of this sounds familiar, I don't really know what to say. Your code may run perfectly 99.999% of the time. It may pass rigorous QA testing. That doesn't mean it's free of bugs.

  18. This comment has been removed by a blog administrator.

  19. @JavaJack, good question.

    I would also suggest some kind of logging mechanism that, when the application crashes or the user reports a bug, gathers the game's internal information (function results, error codes, and so on) and exports it to a ZIP file. The user could then send this ZIP to Jeff for analysis.

  20. @Naysayers: Ladies and Gentlemen, I present to you the age old programmer mistake of confusing the kind of issues you find with the customer's PC with the kind of issues you find IN YOUR TEST ENVIRONMENT.

    You always assume a bug that you find in your system is your fault because you can constrain the side-effects.
    You start with your own code till you reach the point where you've literally narrowed the bug down to the space between function calls. Because you can use a debugger, or a test app, or a thousand different types of Test System controls.

    Once it's the customer's PC?
    You assume that their PC is packed with insulating material and that they're using the CD drive as a cupholder. And that they're running on their own special Homebrew "BetterWindoez" OS.

    Remember, "Perfect is the enemy of Good"
    Your job is to fix the maximum number of bugs with the minimum of work, so you can continue developing your next project.
