Category: Work

  • When Good Servers Go Bad

    Let me tell you how the sequence of events went down. (“Down” being the operative, or inoperative, word.)

    1) Something Happens at about 5:30 this morning. (All times Pacific, thank you.) Perhaps one Audicy session too many was opened, perhaps it was just that the server had had enough. Either way, the Audicy server and (probably) the two workstations began flooding the network with traffic. Also, the Audicy server’s root partition is chock-full, which is likely the cause of the commotion (as the clients madly try to read and write data on the server).

    2) A bit after 6:00, my phone beeps to let me know that we can no longer communicate with Entercom Corporate over the WAN. This happens fairly often, and usually clears itself up, so I fail to panic and instead go back to bed.

    3) A bit after 7:00, my phone rings. It’s John Graefe, my boss at Corporate, asking why our WAN router is incommunicado. Whoopsie. I tell him I’ll jet right down to the office and Deal With It.

    4) By 8:00, I’ve burped the WAN router to no avail.

    5) At about 8:30, I notice that the main office network switch is saturated. Instead of happily blinking busy little activity lights, the whole display is lit solid. This worries me. I ponder burping the switch, but that would kick everybody in the building off the network, causing all kinds of chaos. Instead I start pulling plugs, briefly, one by one until the lights go dark(er).

    6) I discover that unplugging the cable that connects to the Audicy server makes Everything All Better. Upon investigation of the server, I see that the root partition is full. Unloading the Netware emulation code frees up space (there’s a sketch after this list of the sort of free-space check that would have caught this earlier). I decide to reboot the machine. This is great, but Linux insists on checking all the partitions since the machine hasn’t been rebooted in ages.

    7) While the 60-plus gigabyte partition is being slowly fsck’d (and that’s a technical term, thank you), I look in horror at the central switch, which has again lit up solid.

    8) After unplugging a series of cables around the server room, kicking half the building off the network in the process, I discover that a machine in Production Room 3 is the culprit. Hmm, what’s in there? Oh yes. An Audicy workstation. The workstation in question is powered down, which makes Everything All Better Yet Again.

    9) I check the Audicy server; all seems well. For grins and chuckles (since both workstations are turned off) I upgrade the kernel from 2.4.0-test10 to 2.4.20, mostly in the vain hope that the memory/space leak will go away.

    10) The Audicy server is rebooted on the new kernel, and the Production Room 1 workstation is brought back online and tested for network connectivity. All is well. The Production Room 3 workstation, already due to be removed since there’s now a ProTools workstation in place, is completely disconnected pending physical removal from the room.
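
    Incidentally, the sort of thing that would have caught that full root partition before the clients started thrashing is a simple free-space watchdog. Here’s a rough sketch in Python; the mount point, the 90% threshold, and the plain print-style reporting are placeholders for illustration, not what actually runs on the Audicy server.

      # Free-space watchdog (illustrative sketch, not the real script).
      import os
      import shutil

      MOUNT_POINT = "/"     # partition to watch (assumed; use the server's root)
      THRESHOLD = 0.90      # complain when the partition is 90% full (arbitrary)

      def partition_usage(path):
          """Return (fraction_used, bytes_free) for the filesystem holding path."""
          usage = shutil.disk_usage(path)
          return usage.used / usage.total, usage.free

      def biggest_entries(path, count=5):
          """Rank the immediate children of path by total size, largest first."""
          sizes = []
          for name in os.listdir(path):
              full = os.path.join(path, name)
              total = 0
              if os.path.isfile(full):
                  total = os.path.getsize(full)
              elif os.path.isdir(full):
                  for root, _, files in os.walk(full, onerror=lambda err: None):
                      for f in files:
                          try:
                              total += os.path.getsize(os.path.join(root, f))
                          except OSError:
                              pass  # files can vanish mid-walk; skip them
              sizes.append((total, full))
          return sorted(sizes, reverse=True)[:count]

      if __name__ == "__main__":
          used, free = partition_usage(MOUNT_POINT)
          if used >= THRESHOLD:
              print(f"WARNING: {MOUNT_POINT} is {used:.0%} full ({free} bytes free)")
              for size, entry in biggest_entries(MOUNT_POINT):
                  print(f"{size:>15,}  {entry}")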

    There’s my morning in a nutshell… an appropriate receptacle. Now, if you’ll excuse me, I’m going to attempt to rehabilitate the general-purpose computer in Production Room 5, which seems to have suffered a catastrophic failure for unknown reasons…

  • I just wanted to fix the printer, honest!

    Dr. Doug and Skippy emailed me this afternoon asking me to look into the ongoing printer problem in the Rosey studio. “Sure thing,” I said, “I’ll just pay Sheryl a visit.”

    I went in, looked at the printer, power-cycled it and was about to run a test print when Sheryl Stewart said, “Hey, I need you to read something.”

    “Uh, I don’t read very well.”

    “That’s okay,” she replied, “It’s supposed to sound bad.”

    “Okay, then.”

    And so I took part in an on-air contesting bit. The evidence is right here. (Nine hundred fifteen kilobytes’ worth of evidence, that is.)

  • Snap! goes the Snap! server

    I have two unpleasant things to write about this morning, and I decided to pick the one that’s foremost in my thoughts right this moment. Don’t worry, the other one is much more depressing.

    Remember the Quantum Snap! Server? I’ve ranted about it before. I’ll probably rant about it again before it’s gone, too.

    It crashed. 10:00am, and it crashed. The same error messages, same symptoms. This thing has crashed once a month in three of the past four months. I’m tired of it, and I hope I can convince Corporate of two things:

    1) Karel was wrong about buying the Snap! server, and what we should really do is upgrade the capacity of our existing Netware server, the one that performed so well for us over the years but simply ran out of space.

    2) Karel, while wrong, isn’t such a freakin’ moron that we should find someone else to do his job. He just made an honest mistake, and that shouldn’t be held against him.

    Wish me luck.

  • Not only do I not suck, I also don’t blow.

    Without going into too much detail, my first-ever (yes, Entercom’s a bit behind the times in some ways) official Employee Review went fairly well. That’s one less thing to worry about in my permanent record, I suppose. It has been noted that I need to work on my organizational and communication skills, and that my attitude sometimes leaves a bit to be desired. All in all, though, it was a positive review. Yay!

    The really big news of the day is that our problems with The Beast are all over. You see, it finally dawned on me that I should contact 3Ware about their RAID controller. After a couple of email exchanges, this is what I was told:

    “If you have a 5xxx series controller, pressing F9 instead of ENTER will accept dissimilar drives into an array.”

    Aaaaaargh! It was that easy all along? How about a hint on the screen that ENTER isn’t the only keypress available? How about something in the printed docs? Huh?

    I don’t care too much about documentational and user-interface stupidity right now, though, because The Beast finally has a working 500 gigabyte array onto which Enco files are being copied as I type this. Hot diggity damn. That’s one major problem off my back. Two, if you count the Employee Review…

  • Stupid 3Ware. Stupid stupid stupid.

    You may recall the problem we had this past weekend with our Enco servers. While we did get the main server back online with a surprising minimum of fuss, our standby server is still awaiting the correct replacement drive. The main cause of our problem is 3Ware’s insistence on having only 100% identical hard drives in any given array, a requirement quite out of step with most sensible RAID controller manufacturers.

    Fine then. Since we don’t want to completely rebuild The Beast as a SCSI-based machine (bloody expensive proposition, that) we figured we’d hunt down another drive of the exact same model number as the others already in the machine. This way we can get a bit more mileage out of our lamebrained investment in 3Ware’s IDE-RAID technology. Great idea, right?

    Wrong!

    Same model number, different month, slightly different capacity.

    It was no surprise that when I tested the new drive anyway (Gary would have insisted on it), I saw this:

    Delightful. Just delightful. It looks like we’re going to have to spend just shy of $1000 on hard drives to replace the entire bloody damned set. And that means at least two more days before our standby server is once again on standby.

    I hate working without a ‘net…

  • When good drive arrays go bad.

    There I am, settling in for a quiet Saturday of housekeeping, websurfing and bookreading. And the phone rings.

    The main Enco server is down. This is bad news ordinarily, but it becomes exceptionally bad when you remember that the standby server went down a week ago and you haven’t yet received the replacement hard drive you need. Uh oh.

    So I hop in the shower, hurry to the office, and discover that one of the drives in the external chassis has gone south. Oh no. I grab the spare (yes, we do keep a spare for the main server, just not the standby), put it into an enclosure and swap it into place, all the while expecting a long overnight as I babysit the restoration of files to the new Netware volume I’m doomed to have to create.

    And the new drive exhibits exactly the same problem as the old. Aw, hell…

    (changing verb tenses, just a moment please.)

    It took Gary and me about three hours to get everything running again. How could we possibly have rebuilt a RAID 0 array and restored the data in such a short time? Piece of cake. Turns out the drive itself didn’t die, just the receive bay in the hot-swap drive chassis.

    And the boxed spare also turned out to be flakey. We tried every combination of enclosure, receive bay and LVD add-on board we had… except one. In a flash of desperate inspiration I decided to look up on one of the shelves in the engineering shop. Under a pair of old hard drives and other assorted detritus I found one more receive bay. We attached an LVD add-on board and set the SCSI drive ID to match the old bay so the RAID controller would hopefully recognize the original drive and spare us the need to create a new array. Lo and behold, it worked!

    Yay, we got our array back. The main Enco server is once again alive and kicking. We made a list of spare parts we need to order, since it’s just a matter of time before that slot fails again. (Turns out that we’ve lost two receive bay units in the same chassis position since putting the Enco system into service. This does not fill us with confidence.) I then turned my attention to the standby server for which we’d received the replacement drive yesterday, naturally on the day I couldn’t make it to the office.

    There’s a standard principle followed by almost every RAID-controller manufacturer in the business: All drives in an array will be treated as if they were the same size as the smallest drive in the array. It’s difficult to replace a single dead drive with an exact duplicate, especially two years down the road, so RAID controllers (usually) allow you to use a replacement drive slightly larger than the original. Yet, for some asinine reason, the folks at 3Ware decided that all drives on one of their IDE RAID controllers must always be exactly the same to be included in a single array.
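
    To put numbers on the difference, here’s a toy Python sketch of the usual round-down-to-the-smallest-member rule versus the all-or-nothing behavior we ran into. The drive count and sizes below are made up for illustration, not an inventory of The Beast.

      # Usable capacity under the common "round down to the smallest member" rule.
      def usable_capacity_gb(drive_sizes_gb, level):
          n = len(drive_sizes_gb)
          smallest = min(drive_sizes_gb)
          if level == 0:
              return smallest * n          # RAID 0: pure striping, no redundancy
          if level == 5:
              return smallest * (n - 1)    # RAID 5: one drive's worth of parity
          raise ValueError("only RAID 0 and RAID 5 in this sketch")

      # Six 75 GB originals plus a replacement that's a shade bigger (hypothetical).
      drives = [75.0] * 6 + [76.9]

      # Most controllers would quietly clip the odd drive down to 75 GB and carry on:
      print(usable_capacity_gb(drives, 0))   # 525.0 GB usable across seven drives
      print(usable_capacity_gb(drives, 5))   # 450.0 GB usable, one drive's worth of parity

      # The 3Ware controller, at least as it behaved for us, refuses the mismatch outright:
      if len(set(drives)) > 1:
          print("3Ware says no: members are not identical, so no array at all")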

    Of course, the replacement drive we purchased, while the same manufacturer (IBM) and basic type (IDE, 7200 RPM), was just a wee bit larger than the others, and therefore different enough that the 3Ware controller refused to include it in the new array. And so, we cannot bring The Beast back online until we either find another DTLA-307075 or buy six or seven identical replacement drives for the new array.

    I suppose you can’t win ’em all.