Topic: Catastrophic Hardware Failure - Going to Chicago

Offline chrisgbk

  • Inactive Staff
  • Veteran
  • *****
  • Posts: 1739
Re: Catastrophic Hardware Failure - Going to Chicago
« Reply #60 on: May 26, 2007, 02:00:17 am »
(you would know this if you actually read http://tns.u13.net/?p=13 before posting)

You can't run game servers in a cluster without software for the game server that is intended to fail-over.
Yeah, I was impatient and just posted without reading. Sorry bout that.

I am sure there is a way to cluster this app. I have not attempted it yet, but I may just have to in my lab and see how well it goes. I have been away from fun like that for some time and it will give me something to do.
even then, you were way off in saying that using ECC RAM would protect against failing modules.  ECC is there to protect against the kinds of errors that would come from cosmic rays and other stray things, causing bits to flip once in a very long while (probably once every trillion memory address reads, if not more).  ECC won't do **** about a RAM module completely flaking out, or even just starting to

ECC RAM can detect errors of 1 or 2 bits, and in the case of a 1-bit error, can correct it. That's 1 bit per protected word (on typical ECC modules, 64 data bits plus 8 check bits), so in the best case, where memory is failing at a rate of exactly 1 bit per word (non-averaged), it can be corrected. If more than 1 bit is off in a single word, it cannot be corrected. Running a 64-bit OS doesn't help either: the limit is still 1 correctable bit per word.
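To make the "correct 1 bit, detect 2 bits" behaviour concrete, here is a toy sketch in Python. It is not how any particular memory controller works: it uses a small extended Hamming code over 4 data bits (8 bits per codeword) purely for illustration, and the names `encode` and `decode` are made up for this example. Real ECC modules apply the same idea to much wider words.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1 bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; positions 1, 2, 4 are Hamming parity.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 is an overall parity bit, making the total number of 1s even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1            # a single stray bit flip
print(decode(one_flip))     # status is 'corrected', data is recovered

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1           # two flips in the same word
print(decode(two_flips))    # status is 'uncorrectable', data is lost
```

Three or more flips in one word can slip past a code like this or be "corrected" into the wrong value, which is why it does nothing for a module that is erroring out across the board.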

Of course, the RAM is usually built within tight specs if you are buying ECC RAM, since it's designed for reliability, so the chance that an error will happen in the first place is very low; and the chance that it will be an error of more than 1 bit is exponentially smaller.

As in the case with the server, there is no way that ECC could have helped here. Essentially the entire module was erroring out. Like Flies said, ECC isn't some kind of saving grace that can save you from all sorts of errors; it's designed for mission critical systems where reliability depends on the integrity of data, and the very small chance to correct an error, if one ever occurs in the first place, outweighs the cost.
« Last Edit: May 26, 2007, 02:36:46 am by chrisgbk »

Offline FliesLikeABrick

  • Administrator
  • Flamebow Warrior
  • *****
  • Posts: 6144
    • Ultimate 13 Soldat
Re: Catastrophic Hardware Failure - Going to Chicago
« Reply #61 on: May 26, 2007, 12:52:35 pm »
'zactly chris <3

Offline chrisgbk

  • Inactive Staff
  • Veteran
  • *****
  • Posts: 1739
Re: Catastrophic Hardware Failure - Going to Chicago
« Reply #62 on: May 28, 2007, 08:32:13 am »
Then will you be getting a different type of RAM?

He is going with a different brand; as Corsair still hasn't been able to deliver the proper RAM to him, he ordered some Kensington RAM and is getting a refund for the Corsair.

Offline PureGrain

  • Major(1)
  • Posts: 10
Re: Catastrophic Hardware Failure - Going to Chicago
« Reply #63 on: August 19, 2007, 08:49:16 pm »
You can't run game servers in a cluster without software for the game server that is intended to fail-over.
I have managed to get Soldat to fail over in a 2-node cluster. It fails over without a hitch and fails back without a hitch; however, it has to be failed back manually right now. :o)

Offline FliesLikeABrick

  • Administrator
  • Flamebow Warrior
  • *****
  • Posts: 6144
    • Ultimate 13 Soldat
Re: Catastrophic Hardware Failure - Going to Chicago
« Reply #64 on: August 20, 2007, 02:51:49 am »
"wow look at the size of my e-wang as I go back and bump a thread that is more than 2 months old just to try and prove a point" ?


Locked.