(you would know this if you actually read http://tns.u13.net/?p=13 before posting)
You can't run game servers in a cluster unless the game server software itself is designed to fail over.
Yeah, I was impatient and just posted without reading. Sorry bout that.
I am sure there is a way to cluster this app. I have not attempted it yet, but I may just have to try it in my lab and see how well it goes. I have been away from fun like that for some time, and it will give me something to do.
Even then, you were way off in saying that using ECC RAM would protect against failing modules. ECC is there to protect against the kinds of errors that come from cosmic rays and other stray things, causing bits to flip once in a very long while (probably once every trillion memory address reads, if not more). ECC won't do **** about a RAM module completely flaking out, or even just starting to
ECC RAM can detect errors of 1 or 2 bits, and in the case of a 1-bit error, can correct it. That limit applies per ECC word: on a standard ECC DIMM, that's 64 data bits protected by 8 check bits. So in the best case, where the memory is failing at a rate of exactly 1 bit per word (non-averaged), every error can be corrected. If more than 1 bit is off in a single word, it cannot be corrected. Running a 64-bit OS doesn't change that, since the word size is set by the memory hardware and not the OS: still only 1 bit per word can be corrected.
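For anyone who wants to see the "correct 1 bit, detect 2 bits" behaviour in action, here is a rough Python sketch. It is a toy: it uses an extended Hamming code on 4 data bits (8 bits stored), whereas real ECC modules use a much wider code in hardware. The function names (encode, decode, flip) are made up for this illustration and are not from any real library.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).
    Index 0 is an overall parity bit; indices 1, 2, 4 are Hamming parity bits."""
    code = [0] * 8
    # Data bits live at the non-power-of-two positions.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (status, nibble). Status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the indices of all set bits
    odd_flips = sum(code) % 2      # 1 if an odd number of bits changed
    fixed = list(code)
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        # Treated as a single-bit error; syndrome 0 means bit 0 itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks even but the syndrome is non-zero: two bits flipped.
        status = "uncorrectable"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
print(decode(word))                 # clean word
print(decode(flip(word, 5)))        # one flipped bit
print(decode(flip(word, 5, 6)))     # two flipped bits
print(decode(flip(word, 3, 5, 6)))  # three flipped bits
```

The last case is the interesting one for a module that is flaking out: with three flipped bits this toy decoder reports "corrected" but hands back the wrong data, because it can only assume a single bit went bad.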
Of course, ECC RAM is usually built to tight specs, since it's designed for reliability, so the chance that an error will happen in the first place is very low, and the chance that it will be an error of more than 1 bit is far lower still.
As for the server, there is no way that ECC could have helped here. Essentially the entire module was erroring out. Like Flies said, ECC isn't some kind of saving grace that can rescue you from all sorts of errors; it's designed for mission-critical systems where reliability depends on the integrity of data, and where the ability to correct the rare error, if one ever occurs in the first place, is worth the extra cost.