The Machine Config Operator (MCO) in Red Hat OpenShift has been able to perform disruptionless updates for select changes since version 4.7. Those select changes were hardcoded in the MCO. To make this process more user-friendly and customizable, the MCO team is introducing node disruption policies.
This blog post covers the context behind node disruption policies, how the MCO uses them during a MachineConfig update, and important points to be aware of while using them.
Why hand over node disruption control to administrators?
Disruptions can be very expensive for customers, especially those running resource-constrained environments. We’ve had several requests from customers wanting to reduce disruption when deploying MachineConfig updates, and for good reason: only they can weigh the tradeoff between node disruptions and the cost to their business. While we could keep hardcoding these exceptions into the back end of the MCO, that is not a practical solution. Without transparency, customers may have different expectations about which changes should (or should not) cause a disruption (e.g., drains, reboots). This created a clear need for users to easily query, define, and customize how these changes are deployed.
How MachineConfig updates used to work
Before we dive into how the policies work, here is a simplified primer on how MC updates are deployed:
- You update or create a new MachineConfig object to customize a target MachineConfigPool
- The MCO’s machine-config-controller, which is watching all MachineConfig objects, merges and generates a new rendered MachineConfig for your targeted pool
- The machine-config-controller then starts deploying these rendered MachineConfigs to nodes of that pool, respecting that pool’s maxUnavailable
Then the machine-config-daemon on the node comes into play:
- The machine-config-daemon realizes that the current and desired MachineConfig of its node no longer match
- It then calculates the difference between them and determines whether this update requires a reboot or one of the special hardcoded actions mentioned earlier
- The machine-config-daemon then begins to execute the actions on the node.
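To make the starting point of this flow concrete, here is a minimal sketch of a MachineConfig that writes a file to nodes in the worker pool. The file path, name, and contents are hypothetical; the role label is what ties it to the targeted MachineConfigPool:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-example-file
  labels:
    # Targets the "worker" MachineConfigPool
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/example.conf   # hypothetical file
          mode: 0644
          overwrite: true
          contents:
            source: data:,example%20setting
```

Applying an object like this causes the machine-config-controller to generate a new rendered MachineConfig for the worker pool and begin rolling it out.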
Let’s explore how node disruption policies fit into this workflow.
The MCO now has a new home for its knobs and dials
In Red Hat OpenShift 4.17, the Machine Config Operator began using a new API type called MachineConfiguration as a central control point for MCO-specific features. Only one object (named “cluster”) of this type is allowed in an OpenShift installation. (Note: The MachineConfiguration type should not be confused with the MachineConfig type, which is the type we typically use for node customization.)
Ok, so how do I use this knob?
As mentioned earlier, only one object of the MachineConfiguration type is allowed—the MCO will automatically create the object if one is not provided during installation or if the existing object is deleted. This object has two fields to pay attention to:
- .spec.nodeDisruptionPolicy: This is where any user-defined node disruption policies go. Although I won’t get into the specifics about how to define policies in this blog post, you can read our Red Hat documentation for more information
- .status.nodeDisruptionPolicyStatus: This is your cluster’s “effective” node disruption policies. It combines the user-defined policies and the cluster’s default policies. When the spec.nodeDisruptionPolicy is updated, the MCO will calculate the new “effective” policies and update this field
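For illustration, a user-defined policy in spec.nodeDisruptionPolicy might look like the following sketch. The file path and service name are hypothetical; consult the documentation for the full schema and the available action types:

```yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
      - path: /etc/example.conf            # hypothetical file managed by a MachineConfig
        actions:
          - type: Restart
            restart:
              serviceName: example.service # hypothetical systemd unit to restart instead of rebooting
```

With a policy like this in place, changes to that file would restart the named service rather than falling back to the default action for the change.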
The new flow
Most of the original flow at the cluster level remains the same. The actual changes appear at the node level:
- The machine-config-daemon becomes aware that the current and desired MachineConfig of its node no longer match
- It then calculates the difference between the two
- It checks that the MachineConfiguration status is updated, ensuring any user changes to the .spec.nodeDisruptionPolicy field have been accounted for in .status.nodeDisruptionPolicyStatus
- The machine-config-daemon then compares the difference against .status.nodeDisruptionPolicyStatus (instead of the legacy hardcoded list) and generates a queue of actions
- The machine-config-daemon then begins to execute the queue of actions on the node
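You can inspect the effective policies with `oc get machineconfiguration cluster -o yaml`. The status might contain entries like the following illustrative snippet, merging a user-defined file policy with cluster defaults (the exact defaults on your cluster may differ):

```yaml
status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
        - path: /etc/example.conf          # hypothetical user-defined policy
          actions:
            - type: Restart
              restart:
                serviceName: example.service
      sshkey:
        actions:
          - type: None                     # example default: SSH key changes need no disruption
```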
Some things to consider:
When spec.nodeDisruptionPolicy is empty, status.nodeDisruptionPolicyStatus will reflect the default cluster policies. These default policies are based on the legacy hardcoded actions and should be considered the MCO’s best recommendation for a given MachineConfig change. As a cluster administrator, you’re free to override them as you see fit. But I do want to emphasize that:
- The MCO does not check your policy actions for correctness. Aside from some basic validation, there isn’t a way for the MCO to ensure that the actions you define will leave your node in a healthy state post update. If in doubt, it is highly recommended to verify that your policy has the desired effect before deploying it across your clusters
- The policies apply to any sort of files/units difference between the currentConfig and desiredConfig. This includes additions, updates and removals of the files and units in question
- The default action for an unspecified change is a reboot. If any of the changes require a reboot, all other actions will be skipped as redundant
Summary
The addition of node disruption policies provides a powerful mechanism to control disruptions to your cluster. This change empowers cluster administrators to balance stability and customization during updates, especially in resource-constrained environments. We’re excited to see how users will take advantage of this new feature, and while we consider it complete for the moment, we’re open to suggestions to make it even better. Please contact your account team or the Red Hat Support team with your feedback.
About the author
David joined Red Hat in 2022. He enjoys working with Go and tackling longstanding challenges in OpenShift. Outside of work, he loves diving into hard sci-fi—both books and movies—and has a passion for exploring new countries.