You Can’t Automate Expectations


About the episode

Establishing consistent automation habits helps keep those skills sharp and gets the systems set up promptly. But getting to that point takes time. And even when automating processes becomes second nature, you can still overlook potential pitfalls.

Joshua Bradley of Cox Edge describes what it's like to manage the expectations that teams and stakeholders have about automating infrastructure. The systems may be more complex than expected. Timelines may be longer. And even when you leave detailed instructions, users may still make mistakes. It just means you need to keep adjusting until you get it right.

Code Comments Team | Red Hat original show


Transcript

Automation can make things as easy as clicking a button, but behind that simplicity there's usually a mountain of work, and if you don't get it right, that mountain could be a volcano waiting to erupt. You want to get it right; you're not looking to make life harder down the line. So you want to be picky about the tooling and take the time to learn to do things the right way, but then you actually have to do the thing on a regular basis. This episode, we hear the story of an engineer at Cox Edge and the work he and his team have done to make automation almost second nature. It requires taking a hard look at what their actual processes are, blocking out the steps, scripting them, accounting for unusual cases, and most of all, accounting for human error, because nothing is ever truly completely automatic.

Automation is wonderful because it will update thousands of things at a time. Automation is also very scary because it will update thousands of things at a time.

Josh Bradley is the Director of Technology and Security Operations for Cox Edge. He's been at Cox for 25 years, and in that time has helped get their automation program up and running. People know what's possible with automation, and so when it's absent from processes, the ensuing delays can seem interminable.

The developers were always thinking we were taking a long time because it would take 2 or 3 weeks before we could even start our work. So what we were really trying to do, since we didn't have control over that realm, is take control over our realm and really automate and streamline anything that we could for our side.

Josh was setting up the infrastructure for developers to be able to do their own work. It was complicated and depended on the cooperation of other teams, which slowed things down. Over time, automation helped with that, but it takes a lot of effort to get to that consistency. Meeting compliance requirements and improving security requires being thorough, which takes time. Unfortunately, not everyone knows how complicated setting up an automation framework can be.

Generally, someone's introduction to automation is, "Okay. You're going to automate this thing? Well, why has it taken several weeks longer than when you didn't automate it? I thought automation was faster." So it's really just making sure that your customer understands that end goal of yes, this will take longer upfront, but I promise you down the line you'll save hours and days on things. Yes, the first time is going to be a little longer and painful, but as you get that automation practice down and you get the flow of knowing what needs to be automated and in what order, then all that stuff gets streamlined.

As with many skills, there's a learning curve to adopting automation. It can take a while to get the hang of it. Hopefully you spend less and less time learning the tool on each subsequent project, but at first you won't save any time when automating processes. That time saved comes later, and in the meantime, there's a heap of preparation to do.

Yeah. Automation takes longer the first time because you're not only doing the manual config, you're then having to turn around and automate that. So there was some initial hesitancy around, "Wait a second, you said you were going to save us time, but now you're adding time to be able to do that." But what really came into play is when we were able to do the additional environments, where they said, "Okay. Now I need a production environment." Well, 2 hours later, they had that environment.
Setting up an environment goes from taking weeks to taking a couple of hours. It's difficult to argue against that kind of time-saving work. It's just that the road to get there may have been a little longer and a little bumpier than expected.

That's really when the value came into play and really where it became a requirement, because people no longer wanted to wait for those additional environments. They wanted that push-button experience. So we got ourselves into that situation by providing that service that people wanted to use. But again, automation is one of those things that most everybody gets value out of. So after you get over that initial standup hurdle, it's pretty easy to sell.

Getting over that initial standup hurdle is the trick, as is getting everyone to understand that the standup is complex, unforgiving, and potentially treacherous. Okay—yes, that was a little dramatic, but a little caution will help in the long run when automating updates for thousands of components at once. There's a lot to account for, even for temporary requests—especially for temporary requests.

What's really valuable is we can turn those into additional environments. So for example, if a QA [quality assurance] team comes and says, "Hey, I need a load-test environment. I only need it for two weeks." Two hours later, we've got them an environment stood up. They can use it for as long as they need it, and then when they're done with it, we just reclaim the resources and add them back to the pool.

Standing down environments and reclaiming those resources can be easily overlooked. How many times has an operations team had to investigate which of the cost-draining environments are still in use and which are sitting idle, running up unnecessary bills? Making it easy to turn those off is an easy win, and it's one of many.
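The episode doesn't name the tooling behind this, so here is only a minimal Python sketch of the reclaim step Josh describes: each temporary environment carries a lease date, and anything past its lease gets flagged for teardown so the resources go back into the pool. The environment names, fields, and the teardown step itself are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Environment:
    name: str
    owner: str
    lease_ends: date  # the agreed date for handing the resources back


def expired(envs: list[Environment], today: date) -> list[Environment]:
    """Return the environments whose lease has already run out."""
    return [env for env in envs if env.lease_ends < today]


if __name__ == "__main__":
    today = date(2024, 6, 1)
    envs = [
        Environment("qa-load-test", "qa-team", today - timedelta(days=3)),
        Environment("dev-feature-x", "dev-team", today + timedelta(days=10)),
    ]
    for env in expired(envs, today):
        # A real pipeline would call its teardown job here; the sketch only reports.
        print(f"reclaim {env.name} (owner {env.owner}, lease ended {env.lease_ends})")
```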
After years of experience, Josh's team has a simple rule to rack up those easy wins.

Our rule is, if we have to do it more than 3 times, it gets automated. Once or twice—not worth the lift really, but 3 times, that's when we get the value out of it. We do the manual-once approach just to help with complexity. Automation adds time and complexity on things. So we do the manual approach first. Get it all working, prove it out. We don't really want to waste time automating something that won't be part of the final product.

Talk about having automation become part of your work culture. They not only identify when a process is going to be repeated, but also quantify whether that repeated process is worth the effort of automation. It takes experience to make that determination. Josh shared what automating a typical process looks like.

Days to a week, just every little component, and you get that and make sure that gets working, and then you move on to the next component and make sure that gets working and then move on. And the difficulty of the task depends on the makeup of the system. The more components and the more wiring of those components, the more complex it is just because, again, you have to get all those components working first before you can start the automation.

That may seem like a straightforward idea, but the relationship between complexity and components is not necessarily linear, and getting all of those components working with each other can be a challenge on its own. That's especially true of distributed systems. These days, cloud-native environments often run on Kubernetes.

Josh and his team set up an automation process they could reuse for a multitude of cloud-native projects. With the tech world moving so much to the cloud, that reusability is a big time saver. And if you're not familiar with containerization, container orchestration, and microservices in general, here's a quick peek at the components involved.

All the complexity of standing up Kubernetes, we didn't really have to tackle that as part of that automation. We did do that as part of a previous automation. So we worked really hard. As you know, Kubernetes is a very complex environment, and really, when people say Kubernetes, there are like 12 different technologies underneath that make it up: the Dockers and Grafana and all those bits that are needed to power a Kubernetes environment.

Knowing all the different components needed is no small feat, let alone which options to choose, how they interact with each other, what configurations are needed for them to play nice, making sure the security policies are set up right, and on and on. You can see how automation would come in handy, but also what could go wrong.

So those were taking us days and days and days to do. So knowing that—hey, we want a repeatable process, we want our dev environment to look just like our QA environment, to look just like our production environment. We automated Kubernetes to the extreme, and we really had it down to 2 hours, and that was literally from the time that we would get the VM to production-level monitoring. Like, integrated with the NOC [network operations center], full monitoring, we could see exactly what that cluster was doing. But we had that down to 2 hours, and most of that really was the patching of the operating system.

Cloud-native environments are particularly tricky to set up, especially when specialized hardware with a very particular set of skills is needed.

A good example is our Cox Media arm had to do some video transcoding, and they had a very specialized processor to meet these exact streaming demands. So those boxes were like $30,000 apiece, so pretty expensive to have a large SDLC [software development life cycle] environment. So what we did is we worked with them and my team set up some automation, and again, it was Kubernetes focused. So we turned all the expensive nodes into worker nodes, and then when we needed to do some development testing, we actually had automation that did a role swap.

Changing the role of that hardware was costly and time-consuming, which meant development for that part of the business was limited too. With Kubernetes and an automation framework, Josh and his team were able to make those transitions faster and more reliable.

So literally the customer could go in and click a button and put in a number. They had like 30-something boxes, so they could say, "I want 6 of these to be in my QA environment and these 6 to be in my development environment, the rest of them to be in prod." And they would enter those numbers and click "go," and literally the pipeline would go execute. It would take the worker nodes, put them in the right environment, make sure they were in the right network, had the right user access, and then really complete that deployment, integrated with monitoring and everything.

With a couple of clicks, that web of complex Kubernetes components is rearranged to meet the demands of the development team without interrupting uptime. And threats to that uptime could be dealt with efficiently too.

And then when the user needed to—all of a sudden they had a big production demand where they needed to get all those workloads back. They didn't need to talk to us at all. They literally just went into the pipeline, put in the number, hit "go," and then all of a sudden they had the production environment sized to what they needed.

From production to development and back again, a successful end result of automation, which took time and deliberation to get right.
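The transcript doesn't say what Cox Edge's pipeline runs under the hood, so the following Python sketch only illustrates the shape of that role swap: the user supplies a count per environment, the script partitions the worker-node list, and it prints the kind of relabel commands a pipeline might execute. Real deployments would also handle networking, access, and monitoring, as Josh notes; the node names and the "env" label key are invented for illustration.

```python
# Rough sketch of a worker-node role swap, not the actual Cox Edge pipeline.

NODES = [f"worker-{i:02d}" for i in range(1, 31)]  # e.g. 30 expensive transcoding boxes


def plan(requests: dict[str, int], nodes: list[str]) -> dict[str, list[str]]:
    """Split the node list into per-environment groups; leftovers stay in prod."""
    if sum(requests.values()) > len(nodes):
        raise ValueError("more nodes requested than are available")
    assignment, cursor = {}, 0
    for env, count in requests.items():
        assignment[env] = nodes[cursor:cursor + count]
        cursor += count
    assignment["prod"] = nodes[cursor:]  # everything not claimed goes back to production
    return assignment


if __name__ == "__main__":
    # "I want 6 of these in my QA environment and 6 in my development environment."
    for env, group in plan({"qa": 6, "dev": 6}, NODES).items():
        for node in group:
            # Printing instead of executing keeps the sketch side-effect free.
            print(f"kubectl label node {node} env={env} --overwrite")
```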
But no automation system is completely infallible. Even when set up correctly, there's always the chance for user error. When we come back—examples of when things can go wrong.

You can spend hours writing the docs and more setting up the scripts, but you can't always predict how people will use them, if at all. Josh learned that lesson the hard way.

So we had one service, and it was multiple different technologies. It was very complex. So the doc was admittedly very long just due to the complexity of the service. So I put it together, very detailed, and I handed it over to the operations team.

People complain about the lack of documentation, or about its quality or lack of detail. Josh put in the work to give the people what they want and need.

Well, a week or so goes by, and so one of the Ops teams stops by my desk and says, "Hey, listen, can you spare a day or so to come help us do this install?" So I was like, "Well, I'm happy to help, and I can share the install doc that should walk you through everything. If there's any issues, let me know." And they said, "No, no, no. I've got the install doc, but that thing's way too long. I'm not going to sit down there and read that doc and follow that." So it just put me in a spot of, we do all this work to get this in a spot to where it's great for them, but then all of a sudden we don't want to turn it into a lot more work for them to do.

Apparently it was too much to do. They didn't have the time to work through the doc and set the system up themselves. So instead of saving time, Josh ended up having to help them through it anyway, and that led to a pivotal decision.

That led us into the first automation. So we had to run with that service as it is, but learned a lesson there. So it moved into the next service deployment. So at that point, I knew I was going to automate that thing to make it easier for me and the operations team in the long run. So I wrote everything in vi at the time; we didn't have the nice Ansible tools like we have now. So I wrote it all, had scripts out there, and I had an initial configuration doc that they needed to follow to set up the initial environments and put in the config files. Production has different IP addresses than non-prod, so there were some things that resource would have to update.

That was the beginning of his automation journey, but he wasn't done learning about the pitfalls of user error. Some time later, the operations team was trying to run through the process Josh had set up for them. He thought he had left them everything they needed to succeed.

Of course, I was out on PTO [paid time off], and the production team got the VMs to do their install, and I got a frantic call, "Help, help. We got a deployment, your automation's not working. It's failing all around." So I always bring my work computer with me, of course.

Of course.

I jumped online to help them. So we went through and I opened up the config files, and they were all blank. So immediately I was like, "Well, I think I figured out what's going on."
I was like, "Well, you missed doing the config files." And they were adamant, "No, we followed that doc exactly. We updated every config file. I put it in there. There's something wrong with the automation."

Hear what he said? They followed the doc exactly. Let's find out where the stumbling block was.

So I was like, "Well, let's go through the doc and fill out one of these config files." So the resource went through, they added the IPs, added everything exactly as they were supposed to. Said, "All right. Looks good, let's move on to the next file." So they closed the first file, and the window pops up and it says, "Would you like to save what you're doing?" They clicked no and shut the file down. I was like, "Well, wait a second. Hold on. You didn't save that file?" And they go, "Well, your doc didn't say to save the file."

I couldn't believe it when I heard it, and I'm sure Josh couldn't either. Maybe you can, if you've been in this industry for long enough. Josh realized he had a little more work to do, because you cannot account for everything.

Automation is great, but you've really got to be careful with the instructions that you provide to both the executor and the automation itself, and to be very particular.

Amen to that. There are so many things we take for granted that we overlook in the instructions. It might not be as obvious as hitting "save" on a file, but a lot of things that are obvious to you may not be obvious to the end user of the process you put together. So when we say the process needs to be deliberate and detailed, this is a big part of that.

It made me really take a look at the automation that we were doing, and yes, we thought this was very user-friendly, but it wasn't as user-friendly as I thought it was. That's where we really took it to: bare minimum user input for anything, only the exact required bits, and then the scripts took care of everything else, including saving the files.
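As a hypothetical illustration of that bare-minimum-input pattern, here is a small Python sketch using the standard library's string templating: the operator supplies only the values that differ per environment, and the script renders and writes every config file itself, so saving the file is no longer a step anyone can forget. The file names, fields, and addresses are made up, not taken from the episode.

```python
from pathlib import Path
from string import Template

# Hypothetical templates; only the per-environment values are left as inputs.
TEMPLATES = {
    "service.conf": Template("listen_addr = $service_ip\nlog_level = info\n"),
    "db.conf": Template("db_host = $db_ip\ndb_port = 5432\n"),
}


def render(values: dict[str, str], out_dir: Path) -> None:
    """Render every template and write it to disk immediately."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, template in TEMPLATES.items():
        path = out_dir / name
        path.write_text(template.substitute(values))  # saved by the script, not the user
        print(f"wrote {path}")


if __name__ == "__main__":
    # Production and non-prod differ only in these addresses.
    render({"service_ip": "10.0.0.10", "db_ip": "10.0.0.20"}, Path("rendered/prod"))
    render({"service_ip": "192.168.1.10", "db_ip": "192.168.1.20"}, Path("rendered/dev"))
```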
The more sophisticated our automation tools become, and the scripts we write with them, the less we need to account for user error. But this is a reminder that there is always something the user can miss, misunderstand, or straight up get wrong. That's when one process in particular can come in handy—and hopefully you've got that process automated as well.

That's where we really tout rollbacks. Working on something at 3:00 AM is very different than working on something at 3:00 PM. So we really try and tell the production teams, "Hey, if you do run into an issue during this deployment, roll it back and we'll all get together the next day." Get multiple people in the room, not just whoever happened to be on call, to be able to troubleshoot that. Of course, if it's a production service that has to absolutely be deployed, then of course we're happy to get on and troubleshoot, but we really still need somebody that knows the automation side to be able to help troubleshoot those. But again, once you've really QA-ed your automation, it's generally some environment difference from a manual update, or the code issues that sometimes happen.

Automation is great. Automation is complex. And automation can absolutely go wrong, even when it's gone right many times before. Some change somewhere can have an unintended effect. Have that rollback option set up to be rock solid, just in case. Josh and his team are by now experts in automation, and they're able to share that expertise more widely.

We collaborate a bunch, but we're not like evangelists that go out. But my previous team, we were Kubernetes focused, and so we were very, very automation-heavy on that. So we did set a lot of the standards for some other teams, and as part of that, they were seeing some of the cool things that we were doing. So we got some adoption through that way, but it's really just collaboration across the company that really got us to where we are.

We just spent an episode talking about the importance of getting automation processes set up correctly to minimize the chances of human error. Josh and his team refine their automation processes, but they're one small team in a large organization. They can't take on the whole company's automation projects. Next time on Code Comments, we hear from a small team of internal consultants at Ulta Beauty and how they help the rest of their company help themselves.

You can read more at redhat.com/codecommentspodcast or visit redhat.com to find out more about our automation solutions. Many thanks to Josh Bradley for being our guest, and thank you for joining us.

This episode was produced by Johan Philippine, Kim Huang, Caroline Creaghead, and Brent Simoneaux. Our audio engineer is Elisabeth Hart. The audio team includes Leigh Day, Stephanie Wonderlick, Mike Esser, Nick Burns, Aaron Williamson, Karen King, Jared Oates, Rachel Ertel, Carrie da Silva, Mira Cyril, Ocean Matthews, Paige Stroud, Alex Traboulsi, and Victoria Lawton. I'm Jamie Parker, and this has been Code Comments, an original podcast from Red Hat.

About the show

Code Comments

On Code Comments, we speak with experienced professionals about the challenges they face along the way from whiteboard to deployment.