Deleted all shared libs on a production Solaris 10 system.

Was able to rebuild the config using the equivalent of Linux "ldconfig" and rebooted the machine before anyone realized. Solaris ran fine after that.

I also did an "rm -rfv .", not realizing I had changed to ~ instead of a subdir. Lost quite a lot of important files that way. Yes, no backups.

@thegibson We spent probably 20 minutes writing a query to carefully select a subset of production servers upon which to execute a handful of *very* resource intensive commands, and then to disable a bunch of critical services.

We stored that data in a variable.

We referenced a different variable when we ran that command in our production datacenter.

It ran for about 10 minutes before I said "Gee, this is taking a really long time," and then my co-worker went into a panic and reverted our changes.

Went from several million active users to fewer than 500k in about 5 minutes, then back up to several million in less than 3.

Happened so fast that it didn't even trigger an alert.
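
A size check on the target set is one way to catch this kind of variable mix-up before anything runs. A minimal sketch in Python; `run_on_targets`, `max_targets`, and the host names are hypothetical and are not the tooling from the story above.

```python
def run_on_targets(targets, action, max_targets, dry_run=True):
    """Apply `action` to each target, refusing if the set is larger than expected."""
    targets = list(targets)
    if len(targets) > max_targets:
        # A carefully selected subset should be small; a huge list suggests
        # the wrong variable was passed in.
        raise ValueError(
            f"refusing to run: {len(targets)} targets exceeds limit of {max_targets}"
        )
    if dry_run:
        # Only report what would be touched; nothing is executed.
        return [f"would run on {t}" for t in targets]
    return [action(t) for t in targets]


# Hypothetical usage: the subset is checked and previewed before the real run.
subset = ["web-01", "web-02"]
preview = run_on_targets(subset, str.upper, max_targets=5)
```

Defaulting to `dry_run=True` means the destructive path has to be asked for explicitly.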

@TheGibson Was managing a terrible HR application and got overconfident while applying patches. Went for the next one without a snapshot and it failed miserably (it hadn't in our test env).

The HR department (hundreds of employees) was left unable to do any work for half a day. We didn't know at the time if it was going to be an easy fix or something that would leave us struggling with our backups for *days*.

Been treating prod environments like unexploded bombs since then, which is what they are.

rm -rf / * root my first real job NASA

@thegibson Wired about 10 lightbulb sockets backwards because at the time I thought 110/120 V AC worked the same way as 220/240 V AC, so it wouldn't matter which direction I hooked them up. I connected them randomly.

@TheGibson I sure did DoS the provider's API three times before noticing that there was no exit from the retry loop in the happy path.
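
For comparison, a retry loop that does exit on the happy path could look roughly like this. A minimal sketch in Python; `call_with_retry` and its parameters are hypothetical, and the backoff policy is an assumption, not the code from the story.

```python
import time


def call_with_retry(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `request()` until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            # Happy path: returning here is what exits the loop.
            return request()
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    # Bounded: give up instead of hammering the provider forever.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The two things that matter are the `return` on success and the hard cap on attempts; the backoff just makes the failures cheaper for the other side.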

@thegibson I crippled my org's entire infrastructure by writing a script to do user cleanup on every server by checking a centralized source and upon a mismatch, killing that user's processes then removing the user.

My central source did not include root.

I learned about the protections a modern Linux box has against some major foot bullets that day, but we still had to reboot every box in our infrastructure because I chopped half the load-bearing services off at the knees.

That was my first major mistake in a decision-making IT role and I'm incredibly fortunate my boss didn't make it my last.

It's uncanny because my major task right now at a different org is centralized user management. I think about this mistake often.
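
The mismatch check described above can be made safer by never considering root or system accounts at all. A minimal sketch in Python; `users_to_remove` is a hypothetical helper, and the `min_uid=1000` cutoff is an assumption (a common default for regular users on many Linux distributions, not a universal rule).

```python
def users_to_remove(local_users, central_users, min_uid=1000, protected=("root",)):
    """Return names of local users that are absent from the central source.

    `local_users` maps user name -> UID. System accounts (UID below `min_uid`)
    and anything in `protected` are never returned, even on a mismatch.
    """
    central = set(central_users)
    keep = set(protected)
    return sorted(
        name
        for name, uid in local_users.items()
        if name not in central and name not in keep and uid >= min_uid
    )


# Hypothetical data: the central source lists only alice.
local = {"root": 0, "daemon": 1, "alice": 1000, "bob": 1001}
doomed = users_to_remove(local, ["alice"])
```

Computing the list first and reviewing it before killing any processes keeps the destructive step separate from the comparison.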

@thegibson @djsundog Backin' up was easy lawd, when he sang the blues. Backin' up was good enough for me.

I ruined a motherboard once by just dropping a screw on it, causing a short. Happened so fast!


@thegibson Completely destroyed a production Kafka cluster; more senior Ops engineers managed to fix it. But the very next day I did it again.

Backups are for people who are more worried about integrity than confidentiality.

@TheGibson "real men" don't back up.

because "real men" aren't afraid of crying.

Backups are for people who have something to lose.

