@thegibson@hackers.town have you not SEEN the crap i post??

i post it because it's a learning experience -- i just hope someone else is able to learn along with me

@thegibson Failed so far at being a well-adjusted-to-society "normal" adult with "normal" goals and expectations in life.

Not sure that's such a bad thing though.

@TheGibson I knock out our SharePoint farm on a fairly regular basis. 90% of the time it’s because I forgot to rate limit a script and just told it to move/delete 10k files in the space of a few seconds. It will happily accept the requests and start processing in the background until it can’t any more. Then it goes splat…
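
A minimal sketch of the rate limiting that got forgotten here, assuming a generic shell job rather than any real SharePoint tooling. The names `throttled_run` and `move_one_file`, and the batch size and pause values, are made up for illustration.

```shell
# Hypothetical sketch: run an action over many items, pausing after each batch.
# throttled_run, its arguments, and the action command are illustrative names.
throttled_run() {
  _batch_size=$1
  _pause=$2
  _action=$3
  shift 3
  _count=0
  for _item in "$@"; do
    "$_action" "$_item"
    _count=$((_count + 1))
    # After every full batch, wait so the server can work through its queue.
    if [ $((_count % _batch_size)) -eq 0 ]; then
      sleep "$_pause"
    fi
  done
  echo "processed $_count items"
}

# Example: handle files 100 at a time with a 5 second pause between batches.
# move_one_file is a placeholder for whatever does the real work.
# throttled_run 100 5 move_one_file ./export/*
```

The right batch size and pause depend on how fast the backend drains its queue, so treat these numbers as starting points only.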

@thegibson Small business customer in a remote site, 15 years ago. Had a linux server driving their network for email and filestore. RAID card failed, and no replacement cards were available, so we loaned them a spare box & copied from the one working drive. All good so far ...
Then a few weeks later we got the replacement card, so the old server was reinstalled & I took it to them (a 4 hour drive, the company's owner took me there).
In a small cramped space, set up the new server, and ran an rsync from the old one to copy the last few weeks data across.
It ran really fast.
Really really fast.
And didn't list any files ...
Because there weren't any.
I'd sync'd the new blank machine onto the old "full of customer data" one.
There were no backups of course, because the customer didn't have a backup facility, they had RAID cards ...
The owner drove me back home that evening, another 4 hours. It felt much much longer ...
@lightweight might know if I was ever forgiven ... I was working for him at the time, it was his customer ...

@yojimbo @TheGibson hah! It's been a weird day of unexpected memories. This one was pretty memorable, for sure. I felt very bad for you, as I think you did take a lot of precautions.. and I felt very bad for the customer. We continued working with them until I sold the business quite a few years later. I don't think it was discussed much subsequently, but I don't think it affected them too badly. :). I hope it hasn't troubled you overmuch since.

@lightweight @thegibson Luckily it was off-season for them, so there were no customer enquiries to track; and the only thing they'd been using email for was organising new uniforms from a supplier; so the supplier would have had all the copies of correspondence. So they were as mellow as possible under the circumstances!

In a technical interview a few years later I was asked "what's the worst thing you've done to a customer?" so I told the story. "What would you do differently?" was the next question, so I said I'd change the prompt on the old and new server using something like the MAC address to check (as that's the only unique identifier). The unix beard nodded, and said "yes, that's what I did when I had a job like that; but I got the prompts the wrong way around ..."

So I got that job :-)

I keep that question for interviews with people who have had sysadmin responsibility, it's a good one.
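
One way to sketch the "check which box you are on" idea from that interview answer, using the hostname rather than the MAC address. `require_host`, the host names, and the paths are placeholders; the optional second argument exists only so the check can be tried without changing machines.

```shell
# Hypothetical guard: refuse to continue unless we are on the expected machine.
require_host() {
  _expected=$1
  # The second argument is an assumption for testing; by default ask the system.
  _actual=${2:-$(hostname)}
  if [ "$_actual" != "$_expected" ]; then
    echo "refusing: this is '$_actual', expected '$_expected'" >&2
    return 1
  fi
}

# Usage (host names and paths are placeholders). The --dry-run pass lists
# what rsync would transfer without changing anything:
# require_host newserver && rsync -a --dry-run oldserver:/srv/data/ /srv/data/
```

A dry run that lists no files at all is a good hint that the direction is wrong.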

@TheGibson one of my first projects at Amazon (in 2000) was to put a wishlist button everywhere there was an add to cart button. I reversed the “add to cart” v “preorder” logic on music search, probably cost ~$100k in sales over 24hr. More than twice my annual salary at the time.

@tithonium @thegibson I too have caused a NA Order Drop!

Me: *clicks button*
TOS: Order drop!
Me: …was that me?
Coworkers: Nope. We do this all the time. It’s good.
Me: Ok… *continues*
TOS: *continues debugging order drop*
Me, 15 minutes later: *finishes, clicks button again*
TOS: Order recovery!
Me: …
*30 minutes pass*
Me: I need to click the button again.
Coworkers: It’s cool, wasn’t you.
Me: *clicks button*
TOS: Order drop!
Me: *clicks button again, steps away*

@TheGibson Building a datacenter network on Juniper Virtual Chassis, after initial testing seemed promising.

It saved us a ton of work at a time where network automation was not yet a thing, and brought us into the 10G (EX) and 40G (QFX) era... But after half an hour of downtime during one of the "non-stop software upgrade" procedures (that was the final straw, there were problems every time before), no one dared touch the things for another upgrade for years until the platform was replaced.

@TheGibson I got a job offer after spending some (too much) time on writing a Perl script that did traffic accounting from TIS FWTK log files, 1995-ish. It worked quite well, until one day it didn't, and all the output was garbage.

I didn't know about integer overflows.

Luckily for me, Math::BigInt showed up somewhere around the time that happened.

@thegibson updated around 4 million records in production with bad data. We have robust rollback mechanisms, so it only took about a day to fix.

@TheGibson In the first few months at my first job (at an online payment provider) I accidentally sent around 110000€ to the wrong people

I think we got like 107000€ of it back

@thegibson i have 2

once i broke diagnosis history tracking in a large medical record system and the bug made it to production ; they lost 3 months of diagnostic history summaries due to the bug (this is v. bad for those not in healthcare)

one day i got a panicked phone call from a customer because the 'fix' i had just deployed to production broke /all/ medication dispense data entry in the system ; the entire nursing staff had to cancel their appointments for the day

@kemonine This is why I refuse to work in healthcare or any other life-critical systems. I just can't cope with that kind of pressure. @thegibson

@nomad

to be fair to my employer (and others) i wasnt just handed that level of pressure or responsibility

i only got that role because i could handle it and manage the whole 'oh mah [deity of choice here]' moments with customers

i was offered opportunity to get more involved in that area of 'work stuff' and they did a solid job of mentoring and filtering ahead of me being left alone as 'trusted'

not everyone i work with made it through the process and thats ok

@thegibson

@thegibson
Froze backups across my entire section, because our workstations were connected as a Beowulf cluster and nobody told me that saving data continually would keep the disks active, preventing backup. Only found out because I had no cubicle so was in the main room when the very frazzled admin came in trying to solve it.

@thegibson
My biggest fail is that I have so few fails because despite the passage of time I have relatively little experience.

@thegibson
Outside of tech... The entire recent situation. All of it. Worst mistakes of my life contained therein.

@thegibson First day on a gig, I broke some libraries that a bit over a hundred people were depending on because I didn’t understand their dev process. Then went to orientation for the next two hours and was unreachable while they tried to figure out what just happened :). Unbroke it in five minutes when I found out at least

@TheGibson fucked up a raid and lost a week's worth of accounting data for three companies.
I still have my job and have had the chance to redeem myself

@xorowl @thegibson the sign of a good company.

Good decisions come from experience. Experience comes from bad decisions.

You just gained a bit of experience. :-)

@smitty @TheGibson exactly! i've helped to cultivate exactly such a culture at my work.

@thegibson

source local.env
bash scripts/replace_db_with_dumpfile.sh

"Huh... it's taking longer than usual"
...
"Huh, the local env still has the old data... wtf?"

export
ENVIRONMENT=production

"Well, that's probably not great..."
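
A guard like the following at the top of a destructive script might have caught this. It is only a sketch: the function name `assert_local_env` and the expected value `local` are assumptions, not something taken from the script in the post.

```shell
# Hypothetical guard: abort unless ENVIRONMENT is explicitly set to "local".
# An unset variable is treated the same as a wrong one.
assert_local_env() {
  if [ "${ENVIRONMENT:-unset}" != "local" ]; then
    echo "refusing to run: ENVIRONMENT=${ENVIRONMENT:-unset}" >&2
    return 1
  fi
}

# Intended use as the first line of the destructive script:
# assert_local_env || exit 1
```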

@thegibson i directly edited database on production server and gave the client an unexpected april’s fool

@xarvos @thegibson it was in the first week of my internship. I got a bash script to fix - I had no idea about bash.
I did not know that a nonexistent variable is not an error but an empty string.
I ran this script, where one of the lines was:
rm -rf ${myVar}/
And I removed myVar.

Fortunately it was just a test env
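
For anyone who has not hit this yet, a small sketch of the pitfall and one standard safeguard. `checked_path` is a hypothetical helper name; the `${var:?message}` expansion itself is standard shell and stops the script when the value is unset or empty.

```shell
# The pitfall: an unset or empty variable silently expands to nothing.
myVar=""
echo "would run: rm -rf ${myVar}/"   # prints: would run: rm -rf /

# Safeguard: ${var:?message} stops the script if var is unset or empty.
# checked_path is a hypothetical helper name.
checked_path() {
  printf '%s/\n' "${1:?refusing to build a path from an empty value}"
}

checked_path /tmp/scratch   # prints: /tmp/scratch/
```

Putting `set -u` at the top of a script also turns any reference to an unset variable into an error, though it does not catch a variable that is set but empty.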

@TheGibson Early in my writing career at a moderately popular Apple news site (TUAW) I wrote about what I thought was a silly little piece of shareware that opened the CD drive on Mac towers.

Turns out it also ran a script that moved ALL your calendar events back by two weeks. Readers were not happy.

The person who made it never intended for it to get picked up in the news. They gave me a script readers could use to fix it all.

@chartier @thegibson

Used to check TUAW daily when I had my first Mac. (Aluminum PowerBook G4) :blobthumbsup:

@thegibson One time, at my $dayJob, I completely wiped out a website without backing it up first.

Luckily, the host does a backup every 24 hours so we were able to get it back up in just a couple hours.

Why didn't I back it up before I did what I did to wipe it? I don't know.

Overconfidence in my abilities I guess.

@thegibson Maybe too esoteric, but I once spent weeks (maybe a month) trying to track down a bug in my verilog code for an FPGA on a new board bringup. I was stressed, the team was stressed. I was rewriting entire modules and theorizing strange metastability bugs. In the end, the problem was that I had neglected to check a particular box in the configuration for the project so that the generated bitstream would kick the FPGA out of its initialization sequence after it loaded the bitstream. :/

@thegibson I also once walked across a carpeted lab, reached out my hand to push a reset button on an expensive electronic board I was working on, and a nice static discharge arced from my finger before I could hit the button. ALL the smoke started to come out of that board. 😆

@thegibson not my worst one, but the funniest one:

Fucked up the computers of a thousand+ of our lanparty guests, since we set DHCP option 46 to only use WINS and not broadcast for NetBIOS name resolution (p-node).

Windows set that flag permanently and people were not able to find other computers back at home anymore.

The solution was of course editing something in the registry by hand to restore the original behavior. Took us a week to figure that out, people were pretty pissed 😅

@thegibson

Kind of tech related:

Coworker had spent days trying to make replacement radio transmitter work. The receivers were all acting like they were expecting PL (sub-audible tone), but he was insistent that the old transmitter wasn't configured for PL. Boss sends me down. I read all the docs. I try *everything.* Days pass. No progress made. Boss finally arrives, pulls out service monitor, in a matter of seconds "Look, there's PL on the carrier, right there"

Apparently co-worker was looking in the wrong spot in the programming of the old transmitter and I just assumed he knew WTF he was talking about. Lesson learned.

@thegibson

Not tech related:

Roommate/homeowner wants to tile his countertops. We measure everything carefully. Note down the sizes of each section in inches. Add them all up. OK, this tile is sold in square feet. No problem, just divide by 12...

We had a LOT of extra tile.
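
The arithmetic, as a quick sketch, assuming the added-up figure was an area in square inches: a square foot is 12 in x 12 in = 144 square inches, so the divisor should have been 144, not 12. The function name is made up for illustration.

```shell
# Sketch: convert an area in square inches to whole square feet, rounding up.
# 1 square foot = 12 in x 12 in = 144 square inches, so the divisor is 144.
sq_in_to_sq_ft() {
  echo $(( ($1 + 143) / 144 ))
}

sq_in_to_sq_ft 2880   # prints 20; dividing by 12 instead would suggest 240
```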

@SetecAstronomy @thegibson lol there is no SOP on housing projects. I have had 4 houses now and each one of them was a pain in the ass to replace components.

@TheGibson I ran svn up on our production server during working hours.

@thegibson IBM cash registers of old used a version of Token Ring called Store Loop. I was replacing an IBM cash register base on Black Friday and instead of unplugging it from the IBM data connector on the wall (which would self-short to prevent opening the loop) I unplugged it from the back of the register.

Every cash register in the store immediately stopped. Hard.

All the cashiers went "Hey, did your register stop?" I said "Yes, not sure what happened." as I hastily plugged it back in.

@ColinTheMathmo @thegibson Huh, interesting stuff!

I wonder if a backpressure mechanism would work too. Fibre Channel had a backpressure mechanism that would say "stop, I'm full" when the buffers hit max, because FC was a lossless protocol.

@gedvondur The problem in this case was that the comms system couldn't handle the traffic being put in the front end. So somehow either the processing capacity needed to be increased -- which wasn't possible -- or the input needed to be reduced. Slowing it gently across the entire store worked well and was implemented elegantly.

@TheGibson

@gedvondur The underlying user-facing problem was that the flow of data through the system had to be continuous and couldn't be seen to stutter at any point. So making the tills run a bit slower seemed the only thing to do.

@TheGibson
