Looking For Quacks In The Pavement

Enco Wars

First thing Tuesday morning, the main Enco digital audio storage server died. After hours of troubleshooting and tinkering, we brought it back online… minus all of its audio data, which was lost when one of the drives in its array died. (We’re running RAID 0, which has no redundancy, because it’s the only way we can get enough capacity out of the array.)

We switched everybody over to the standby server, and told that server to restore its audio data to the main. No problem, right?

Wrong.

Shortly before midnight last night, the standby server decided to hang. We lost ten hours out of our restoration process because the server didn’t crash enough to come to the attention of my server monitor. Argh. (It was reporting X amount of available drive space… on a drive that was no longer responding. *grumble*)
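A monitor that only reads reported free space can be fooled in exactly this way. Here is a minimal sketch of a more active check, assuming the monitor can run a small script against the volume: it writes a tiny file, reads it back, and treats a missed deadline as a failure. The function name `drive_responds` and the ten-second timeout are hypothetical choices for illustration, not part of any monitoring product.

```python
import os
import tempfile
import threading


def drive_responds(path, timeout_s=10.0):
    """Return True if a small write/read round trip on `path` finishes in time."""
    result = {"ok": False}

    def probe():
        try:
            # Create, flush, and read back a tiny file on the target volume.
            with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as fh:
                fh.write(b"ping")
                fh.flush()
                os.fsync(fh.fileno())
                fh.seek(0)
                result["ok"] = fh.read() == b"ping"
        except OSError:
            result["ok"] = False

    # Run the probe in a daemon thread so a hung drive cannot hang the monitor.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

If the drive has stopped answering, the probe thread never finishes and the function returns False once the timeout passes, whatever the free-space counter happens to say.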

The plot thickens: An attempt to restart the “rsync” file transfer process revealed another problem. Rsync was starting over from the very first file… even though no differences between the source and destination files could be determined! I’ve come to believe that using rsync across operating systems is a Very Bad Idea. (I’m having a similar problem backing up our main office server. Le sigh.)
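One possibility worth ruling out (an assumption, not a diagnosis of this particular setup): by default rsync decides whether to skip a file by comparing size and modification time, and different filesystems store timestamps at different granularity. A file that looks identical to a human can still fail that check by a second or so. The sketch below mimics the comparison with a hypothetical helper, `quick_check_differs`, whose `window_s` parameter plays the same role as rsync's `--modify-window` option.

```python
import os


def quick_check_differs(src, dst, window_s=0.0):
    """Mimic a size-plus-mtime comparison; True means 'would be re-sent'."""
    try:
        s, d = os.stat(src), os.stat(dst)
    except FileNotFoundError:
        return True  # missing on either side counts as different
    if s.st_size != d.st_size:
        return True
    # Timestamps within the window are treated as equal.
    return abs(s.st_mtime - d.st_mtime) > window_s
```

Running this over a handful of source and destination pairs, first with `window_s=0` and then with a window of a second or two, would show whether timestamp rounding is what makes everything look "changed".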

So now I get to manually copy batches of audio data from the standby to the main server. This should only take, oh, another dozen hours or so. Never mind that the copy command does a slower job than rsync, and never mind the headache of updating to catch deleted files. I’m probably going to have to perform evil involving something like DirComp from a Windows workstation mapping both servers.
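The "catch deleted files" step boils down to a set difference between two directory trees. This is a minimal sketch, assuming both servers are reachable as ordinary paths (mapped drives, for instance); `relative_files` and `stale_files` are hypothetical names, and this is not how DirComp itself works.

```python
import os


def relative_files(root):
    """Collect every file under `root` as a path relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def stale_files(source_root, dest_root):
    """Files present on the destination but gone from the source."""
    return sorted(relative_files(dest_root) - relative_files(source_root))
```

The comparison is by exact relative path, so it is case-sensitive. If the two servers disagree about filename case, the paths would need to be normalized before the sets are compared.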

All of this adds up to no vacation day tomorrow, and I’ll probably be here Saturday and Sunday as well. I’m going to be gone Monday, no matter what: No way am I missing out on Two Towers!

UPDATE: It’s just about 9pm, and I’m just about to head back to the office. Why? Because the file transfer has stopped again. This graph shows the traffic on the network port through which the file transfer is being done. See that sudden stop right around 5:30? Right about the time I was leaving the building? Yeah, that’s just peachy.

Near as I can tell (from shelling in to the standby server from home), this time the stoppage isn’t because the RAID controller in the standby server is a piece of crap. (That was the cause of last night’s crash.) I don’t know why it’s stopped, really. I only know that I have to go back down there and start it up again. Le sigh.

At this rate I really will still be working on this “project” come Saturday. Argh. I love my job, really I do… but sometimes there are parts of my job I could do without.

1 Comment

  1. Gustav

    GreyDuck… forgive my coy smile at your Enco woes. For what seems to be the first time I can recall, you detail what goes wrong with the Digital Empire within the building and I am not there to share your pain! Argh.

    I always liked to look at it this way: Enco is really the WOPR from “War Games.” Just bring in Professor Falken, have the machine play itself in a game of Tic Tac Toe, and the free world will be saved…

© 2023 greyduck.net
