Clusterf***

Previously on "Clusterf***"

  • suityou01
    replied
    Oi, where's my tags gone?

  • NotAllThere
    replied
    No, it was Bob who couldn't work it out. Not Uhuru.

  • MarillionFan
    replied
    Originally posted by NotAllThere
    I recall when our DB server stopped responding. But the failover wouldn't trigger, because the controller was trying to do a clean shutdown, and couldn't because a process was hanging. So we told our good friend Bob to force a shutdown of the DB server, which Bob was reluctant to do, as it wasn't in the SOP. Eventually, the op manager persuaded him - "Pull the bloody plug out if you have to - JFDI".

    Only Bob didn't. He shut down ALL the servers.

    Oh, and naturally, when we tried to bring it all back, the failover DB server wouldn't come up. Uhuru just couldn't work it out. Fortunately, Scotty worked out that one of the network cards had failed, knew how to get in via one of the others, and got that back online somewhat faster than the four hours the datacentre were quoting.

    13,000 users, around the world, unable to log on for an hour. How we laughed.
    WTF. Are you Captain Kirk?

  • NotAllThere
    replied
    I recall when our DB server stopped responding. But the failover wouldn't trigger, because the controller was trying to do a clean shutdown, and couldn't because a process was hanging. So we told our good friend Bob to force a shutdown of the DB server, which Bob was reluctant to do, as it wasn't in the SOP. Eventually, the op manager persuaded him - "Pull the bloody plug out if you have to - JFDI".

    Only Bob didn't. He shut down ALL the servers.

    Oh, and naturally, when we tried to bring it all back, the failover DB server wouldn't come up. Bob just couldn't work it out. Fortunately, Scotty worked out that one of the network cards had failed, knew how to get in via one of the others, and got that back online somewhat faster than the four hours the datacentre were quoting.

    13,000 users, around the world, unable to log on for an hour. How we laughed.

  • MarillionFan
    replied
    Originally posted by suityou01


    You patronising b******

    HTH BIDI
    No worries.

    I'm giving a lesson on how to suck eggs later next week, if you have a granny you'd like to enroll?

  • suityou01
    replied
    Originally posted by MarillionFan
    With databases, SY, it is normally necessary to update the Live server.

    But if you do, the process is to back up right before, notify and disconnect all users, take down application services, update, then bring it all back up after you've run through the process on test.

    Anyway, that's what I made one client attempt to do this week. Their test box is still down because they can't work out how to restart the service. I'm bloody glad I didn't go gung ho and try my changes on live first.


    You patronising b******

    HTH BIDI
    Last edited by suityou01; 13 November 2010, 12:42.

  • MarillionFan
    replied
    With databases, SY, it is normally necessary to update the Live server.

    But if you do, the process is to back up right before, notify and disconnect all users, take down application services, update, then bring it all back up after you've run through the process on test.

    Anyway, that's what I made one client attempt to do this week. Their test box is still down because they can't work out how to restart the service. I'm bloody glad I didn't go gung ho and try my changes on live first.

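    A minimal sketch of the routine MarillionFan describes, assuming a MySQL back end driven from Python; the database name, service name and change script are hypothetical:

        # Sketch of the process above: back up right before, notify and
        # disconnect users, take the application down, apply the change,
        # bring it all back up. Assumes mysqldump, mysql and systemctl are
        # on the PATH; "appdb", "appsvc" and update.sql are made-up names.
        import subprocess

        DB = "appdb"            # hypothetical database name
        APP_SERVICE = "appsvc"  # hypothetical application service

        # 1. Back up right before the change.
        with open(f"{DB}-pre-update.sql", "w") as dump:
            subprocess.run(["mysqldump", "--single-transaction", DB],
                           stdout=dump, check=True)

        # 2. Take down application services (users notified beforehand).
        subprocess.run(["systemctl", "stop", APP_SERVICE], check=True)

        # 3. Apply the update, already rehearsed on the test box.
        with open("update.sql") as change:
            subprocess.run(["mysql", DB], stdin=change, check=True)

        # 4. Bring it all back up.
        subprocess.run(["systemctl", "start", APP_SERVICE], check=True)
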
  • MarillionFan
    replied
    Originally posted by NotAllThere
    Oo do tell. You can change the names to protect the guilty.
    Oh, you're terrible.

  • suityou01
    replied
    Originally posted by NotAllThere
    Oo do tell. You can change the names to protect the guilty.
    DBA's story:

    1) I took a live backup which failed.
    2) I noticed a lock on a table which I cleared.
    3) I took the live backup which worked.

    Transaction log says:

    1) I dropped the whole database and tried recreating it, badly. (While the system was live)
    2) I failed to repopulate the table in question. (While the system was live)
    3) The system limped along like this for 25 hours, which caused further problems.
    4) I then tried to rebuild the table in question again and it worked this time. (While the system was live)

    So when we started investigating, the data all looked hunky-dory, well for the most part anyway.

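    A minimal sketch of the transaction-log forensics that could surface this, assuming the server keeps a binary log and mysqlbinlog is on the PATH; the log file name and time window are hypothetical:

        # Sketch: decode the MySQL binary log for the incident window and
        # flag statements that should never have run against live. The file
        # name and datetime window below are hypothetical.
        import subprocess

        events = subprocess.run(
            ["mysqlbinlog", "--verbose", "--base64-output=DECODE-ROWS",
             "--start-datetime=2010-11-10 00:00:00",
             "--stop-datetime=2010-11-11 02:00:00",
             "mysql-bin.000042"],
            capture_output=True, text=True, check=True,
        ).stdout

        # Dropping or recreating a database on live should leap out here.
        for line in events.splitlines():
            if any(kw in line.upper() for kw in ("DROP DATABASE", "CREATE DATABASE", "TRUNCATE")):
                print(line)
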
  • NotAllThere
    replied
    Oo do tell. You can change the names to protect the guilty.

  • suityou01
    replied
    Originally posted by suityou01
    Good work and thanks for your swift reply. If the MySQL binary transaction log (exhibit A) records session information then he won't be able to wriggle out of anything. I have not taken "sabotage" - deliberate or mistaken - out of the equation. It is on my list of possibilities. Transaction log analysis is high up my list of next tasks.
    And how true that turned out to be. The transaction log made very interesting reading.

    Taxi for the DBA.

  • NotAllThere
    replied
    Originally posted by suityou01
    If the MySQL binary transaction log (exhibit A) records session information then he won't be able to wriggle out of anything. I have not taken "sabotage" - deliberate or mistaken - out of the equation. It is on my list of possibilities. Transaction log analysis is high up my list of next tasks.
    Can't help you there. I know people and business, not databases. I'm just an 'umble programmer.

  • suityou01
    replied
    Good feedback.

    To clarify a couple of points:

    The area of the system in question has been running for 10 months without problems.
    The area of the system in question has had no changes in 10 months.
    The guy with amnesia is not a developer. He is support. They (developers) are not allowed near the live system.
    This is not a witch hunt. Admittedly the guy with amnesia has pissed me off, but I have brushed that to one side as the only witch I am hunting is the technical root cause.
    The system will now be replaced, I suspect, as the politics dictate, but this is not my concern. My concern is to troubleshoot, fix, resurrect and give guarantees. Then they can replace at leisure.

    I think, given that the people I interviewed have changed their story, the only choice I have is to coordinate load tests on the UAT environment and meanwhile do a forensic-level check on the transaction logs.

    The truth will out.

    Stay tuned folks.

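    A minimal sketch of the kind of UAT load test described above, assuming the mysql-connector-python package; the connection details and probe query are hypothetical:

        # Sketch: open many concurrent sessions against the UAT database and
        # report which fail, to see whether load reproduces the fault.
        # Assumes mysql-connector-python; host, credentials and the probe
        # query are hypothetical.
        from concurrent.futures import ThreadPoolExecutor
        import mysql.connector

        UAT = dict(host="uat-db", user="loadtest", password="...", database="appdb")

        def session(n):
            try:
                conn = mysql.connector.connect(**UAT)
                cur = conn.cursor()
                cur.execute("SELECT COUNT(*) FROM audit_log")  # hypothetical probe
                cur.fetchall()
                conn.close()
                return f"session {n}: ok"
            except mysql.connector.Error as exc:
                return f"session {n}: FAILED - {exc}"

        with ThreadPoolExecutor(max_workers=50) as pool:
            for result in pool.map(session, range(500)):
                print(result)
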
  • norrahe
    replied
    Firstly: check what was delivered and what is at fault
    Secondly: check against the business spec
    Thirdly: check if system testing and UAT were actually carried out correctly
    Fourthly: check if said developer who is "fixing on the spot" delivered to said spec

    You will often find that a developer has decided that what was asked for and what he delivered were two different things. You may also find that whoever sorted out the system testing and UAT was a tad lax in their testing.

    If all else fails, duck and cover or sit back and laugh.

  • Spacecadet
    replied
    Originally posted by xoggoth
    In the good old days we just blamed the hardware guys.
    "Network faults" has served me well for the past 6 or more years
