The Worst On-Call

Most people working in IT are familiar with the on-call rotation: after-hours requests or urgent trouble tickets might require their attention, and for the duration of their “shift” they can’t be more than a few minutes away from a place they can work. Often this means no going out for meals or drinks, no visiting the theaters, and no leaving the laptop at home if one does decide to go out. Most employers treat the on-call rotation as an integral duty of the job and provide no additional compensation for time spent waiting for alerts, but a rare few companies do give the employee a small bonus, extra PTO, or the like in exchange for their time.

Today I bring you all the way back to 2009, to a story about the worst on-call experience I’ve ever had as a computer engineer. I share this story both as a warning to front-line employees and as a cautionary tale for employers on how to treat their staff. Surprisingly this story isn’t about the six months I was the only on-call person for a regional internet service provider, and how I was functionally chained to a desk during all non-work hours, but that was a close second.

This is as much a story about being on-call as it is about gratitude, or the lack thereof.

At the time I was working for an online media distribution firm; we acted as a middleman between small, independent recording labels and the major online platforms like Apple, Spotify, and so forth. We took the labels’ music, transcoded it into a variety of file and quality formats, and then acted on our clients’ behalf to get their artists listed on the internet services. They individually may not have been big enough to work out good deals with the major players, but under our umbrella we could help them get the best pricing and revenue. It was a very niche service, and one that isn’t as relevant or necessary today, but it was a fascinating time both in technology and in my career, and it was an opportunity I was very grateful to have had.

My role at the company was two-fold. On one hand I was a junior system administrator, helping with the monitoring, management, and maintenance of our vast array of servers (virtualization wasn’t nearly as prevalent as it is today, and neither were reliable cloud compute services), while on the other I was leading our internal technical support team, which provided break-fix service to our employees located across the globe. In a single shift I could start the day talking to our London office, work on servers and local support needs through the midday, and finish up by walking agents in Sydney through how to get their clients connected to the system.

We were a small IT team, and as such each of us had long stretches holding the on-call pager—it was an actual pager—to alert us of infrastructure issues that popped up overnight, or internal tickets that were marked as “Critical” priority, usually meaning that an entire office was down or a major VIP needed immediate assistance.

The office in San Francisco was roughly two hours away from my home, each way, so my workweek was exhaustively long, but when the pager went off at 9pm on a Tuesday, it was my duty to respond. One of our distribution servers—critical to our services—had gone offline, and while it was a smaller server, its core functionality was to aid the staff in Europe. 9pm on the West Coast meant our agents in London, Bristol, and Paris would be waking up soon, so a fix had to be found, and found fast.

After confirming the server was indeed down, and no amount of remote power-cycling would bring it back up, I looped in my boss, who quickly brought in the head of content delivery. We all agreed that, for service to resume, the box would have to be rebuilt from scratch. Wanting to push and prove myself, I told them I was on it. We found a spare server that was suitable—again, this is in the era before virtualization, so this was an actual, physical server sitting in our datacenter for just such an emergency—and I went to work.

The documentation left behind by the engineer who previously built Europe’s content delivery server was … lacking. I would describe it more as a rough skeleton of an idea of the steps one would have to take to set up the machine. What’s worse, it was written for a version of Linux we no longer used, included references to packages that were no longer being maintained, and was years out of date across the board. With enthusiasm overcoming my fear of failure, I still dove into the task, keeping my boss apprised through email as I worked, relying on my experience with Linux, WAN connectivity, and service delivery protocols to piece together how such a server could operate and integrate into our overall system.

By 6am—seven full hours after I started working on the blank server—I had sample content being successfully served to Europe. I called my boss who verified that my methodology was “ugly, but working”, and we rang the head of content delivery. He verified that he could see data coming across the wire, and so I flipped the switch to start serving live content. Records started flooding in, and by all appearances the new server was up and running. Slower than the old one, but it was functional—if nothing else it would allow us to keep the doors open while we developed a longer-lasting fix.

Having worked a full shift the day before, plus more than four hours commuting, and nine hours overnight since the page first came in, I told my boss I would see him on Thursday. As a salaried employee I wasn’t due any extra compensation for my efforts, but he agreed that at the very least I should take the day off. Pleased with myself and how I had saved the day, I went to sleep.

That sleep did not last long.

At 10am I got a phone call from the VP of the San Francisco office, sardonically wondering if I had forgotten about the concept of weekdays. Not understanding, and more than a little sleep-deprived, I mumbled my confusion. “It’s Wednesday and you’re not at your desk,” he said, now irritated. “I need you to look at my computer.”

When I explained that I had the day off after working on a critical server outage all night, he snidely replied, “That won’t fix my desktop. Get down here,” and hung up the phone.

And, true to form, his issue was that he had changed his network password that morning but was no longer receiving email on his phone. He hadn’t thought to enter his new password. This was the issue he had me drive back down to San Francisco to resolve, because he didn’t want to bother with anyone else on the helpdesk team—in his mind, he was important enough that he could only be helped by the helpdesk manager, me, in person.

My boss didn’t have the political capital to tell the VP he needed to follow standard helpdesk procedures, or that he couldn’t have me come in on my rest day, but ultimately I wasn’t upset at my boss not having my back. Sometimes you can only control the things you can control, and wisdom is knowing what is out of your hands. To his credit he did give me Friday off, so at least I got a three-day weekend from spending 25 hours working and 8 hours driving over two days.

After that event, and with the founding of my own IT service company some months later, I vowed that I wouldn’t let my time, or the time of the employees I managed, be abused like that ever again. While of course as a small business owner I put in a great many long hours trying to bring everything together, one aspect of my managerial style that has remained consistent is that I will always reinforce the expectation to my team that work comes second to their home life, and if ever things start to feel out of balance, I am open and willing to talk to them about how we can work together on making improvements.

The conversation around what it means to have an on-call rotation, what the compensation for such a rotation should be, and what the general expectations are (both inside and outside the company) is one that every IT service firm or department has to grapple with, occasionally multiple times. Will additional compensation be provided, or is it explicitly listed as an expectation on the offer letter and job description? Is there a clear process for deciding who is on call, and what happens if that person isn’t available? Is the manager of that department expected to pick up the slack, and if so, what considerations does that entail? What do the people calling the on-call number think it is for: emergencies only, or any time they need help with anything? There are many facets of an on-call rotation that people who haven’t had to hold the (now virtual) pager too often don’t think to consider.

For those working in IT support, I sincerely hope your managers understand the value you bring to the organization and the toll that being on-call can take, even if the calls you get are few or infrequent. For those who manage teams or are otherwise removed from the day-to-day (night-to-night?) operations of the emergency pager, please know that your front-line staff bear a great deal of opportunity cost when it comes to being on-call; those are whole evenings or weekends they can’t go out with friends, spend time with family, or even go for a walk to clear their head. Being on-call, let alone answering the call, can be stressful and limiting in a number of ways, and I very enthusiastically encourage you to consider what your employees are giving up in order to support your larger team.

I haven’t thought about this moment for a while, but a recent discussion about whether it should be “expected” that IT people be part of an on-call rotation brought to mind this anecdote from my own past, from which I hope others can learn and gain the benefit of my experience.

May your ticket queues be quiet and the pager silent.