Getting started with node disruption policies

July 9, 2025 | David Joshy | 3-minute read

The Machine Config Operator (MCO) in Red Hat OpenShift has been able to perform disruptionless updates on select changes since version 4.7. These select changes were hardcoded in the MCO. To make this process more user-friendly and customizable, the MCO team is introducing node disruption policies.

This blog post offers context behind node disruption policies, explains how the MCO uses them during a MachineConfig update, and highlights important points to be aware of while using them.

Why hand over node disruption control to administrators?

Disruptions can be very expensive for customers, especially for those running resource-constrained environments. We’ve had several requests from customers wanting to reduce disruption when deploying MachineConfig updates, and for good reason: only they can determine the tradeoffs between node disruptions and the cost to their business. We could keep hardcoding more exceptions into the back end of the MCO, but that is neither practical nor transparent: a customer may have different expectations about which changes should cause a disruption (for example, a drain or a reboot) and which should not. This created a clear need for users to easily query, define, and customize how these changes are deployed.

How MachineConfig updates used to work

Before we dive into how the policies work, here is a simplified primer on how MC updates are deployed:

You update or create a new MachineConfig object to customize a target MachineConfigPool
The MCO’s machine-config-controller, which is watching all MachineConfig objects, merges and generates a new rendered MachineConfig for your targeted pool
The machine-config-controller then starts deploying these rendered MachineConfigs to nodes of that pool, respecting that pool’s maxUnavailable

Then the machine-config-daemon on the node comes into play:

The machine-config-daemon realizes that the current and desired MachineConfig of its node no longer match
It then calculates the difference between them and determines if this update requires a reboot or one of those special hardcoded actions mentioned earlier
The machine-config-daemon then begins to execute the actions on the node.
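To make the "calculates the difference" step concrete, here is a minimal Python sketch. It is not the MCO's code (the MCO is written in Go). It assumes a heavily simplified view of a rendered config as a mapping from file path to contents, and the function name changed_paths is invented for illustration.

```python
def changed_paths(current: dict, desired: dict) -> list:
    """Return the paths that were added, updated, or removed."""
    paths = set(current) | set(desired)
    # A path counts as changed when it exists on only one side
    # or when its contents differ between the two configs.
    return sorted(p for p in paths if current.get(p) != desired.get(p))


# Hypothetical file contents, for illustration only.
current = {
    "/etc/chrony.conf": "pool a.example iburst",
    "/etc/motd": "hello",
}
desired = {
    "/etc/chrony.conf": "pool b.example iburst",  # updated
    "/etc/example.conf": "new file",              # added
    # /etc/motd is absent here, so it counts as removed
}

print(changed_paths(current, desired))
```

In the legacy flow, the resulting list would be checked against the hardcoded exceptions, and anything not covered would lead to a reboot.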

Let’s explore how node disruption policies fit into this workflow.

The MCO now has a new home for its knobs and dials

In Red Hat OpenShift 4.17, the Machine Config Operator began using a new API type called MachineConfiguration as a central control point for MCO-specific features. Only 1 object (named “cluster”) of this type is allowed in an OpenShift installation. (Note: The MachineConfiguration type should not be confused with the MachineConfig type, which is the type we typically use for node customization.)

Ok, so how do I use this knob?

As mentioned earlier, only 1 object of the MachineConfiguration type is allowed. The MCO will automatically create one if it is not provided during installation or if the existing object is deleted. This object has 2 fields to pay attention to:

.spec.nodeDisruptionPolicy: This is where any user-defined node disruption policies go. Although I won’t get into the specifics about how to define policies in this blog post, you can read our Red Hat documentation for more information.
.status.nodeDisruptionPolicyStatus: This is your cluster’s “effective” node disruption policies. It combines the user-defined policies and the cluster’s default policies. When .spec.nodeDisruptionPolicy is updated, the MCO calculates the new “effective” policies and updates this field.
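As a rough illustration of how the 2 fields relate, the Python sketch below builds a MachineConfiguration object as a plain dictionary and combines its file policies with a default list. The field layout (files, path, actions, type, restart.serviceName) follows my reading of the product documentation and should be checked against it. The rule that a user entry replaces a default with the same path is an assumption, effective_file_policies is a hypothetical helper, and the default entry is a placeholder rather than a real cluster default.

```python
import json

# User-defined policy: restart chronyd instead of rebooting when
# /etc/chrony.conf changes. The path and service are examples only.
machine_configuration = {
    "apiVersion": "operator.openshift.io/v1",
    "kind": "MachineConfiguration",
    "metadata": {"name": "cluster"},
    "spec": {
        "nodeDisruptionPolicy": {
            "files": [
                {
                    "path": "/etc/chrony.conf",
                    "actions": [
                        {
                            "type": "Restart",
                            "restart": {"serviceName": "chronyd.service"},
                        },
                    ],
                },
            ],
        },
    },
}


def effective_file_policies(defaults, user):
    """Combine file policies; a user entry wins on the same path (assumed)."""
    merged = {policy["path"]: policy for policy in defaults}
    merged.update({policy["path"]: policy for policy in user})
    return [merged[path] for path in sorted(merged)]


# Placeholder default, not the cluster's real default list.
defaults = [{"path": "/etc/example-default.conf", "actions": [{"type": "None"}]}]
user = machine_configuration["spec"]["nodeDisruptionPolicy"]["files"]

print(json.dumps(effective_file_policies(defaults, user), indent=2))
```

On a real cluster you never compute this yourself: the MCO does it and publishes the result in the status field.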

The new flow

Most of the original flow at the cluster level remains the same. The actual changes appear at the node level:

The machine-config-daemon becomes aware that the current and desired MachineConfig of its node no longer match
It then calculates the difference between the two
It checks that the MachineConfiguration status is updated, ensuring any user changes to the .spec.nodeDisruptionPolicy field have been accounted for in .status.nodeDisruptionPolicyStatus
The machine-config-daemon then compares the difference against .status.nodeDisruptionPolicyStatus (instead of the legacy hardcoded list) and generates a queue of actions
The machine-config-daemon then begins to execute the queue of actions on the node
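One plausible way to implement the status check in the flow above is the common Kubernetes convention of comparing generations. The sketch below assumes the object exposes metadata.generation and status.observedGeneration. Whether the MCO performs exactly this comparison is an assumption on my part, and status_is_current is a hypothetical helper.

```python
def status_is_current(obj: dict) -> bool:
    """True when the status was computed from the latest spec (assumed check)."""
    generation = obj.get("metadata", {}).get("generation")
    observed = obj.get("status", {}).get("observedGeneration")
    # A missing generation is treated as "not current" to stay on the safe side.
    return generation is not None and generation == observed


# The spec was edited (generation 3) but the status still reflects generation 2.
stale = {"metadata": {"generation": 3}, "status": {"observedGeneration": 2}}
# The status has caught up with the spec.
fresh = {"metadata": {"generation": 3}, "status": {"observedGeneration": 3}}

print(status_is_current(stale), status_is_current(fresh))
```

The point of the check is ordering: the daemon should not act on policies that predate the administrator's most recent edit.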

Some things to consider:

When .spec.nodeDisruptionPolicy is empty, .status.nodeDisruptionPolicyStatus reflects the default cluster policies. These default policies are based on the legacy hardcoded actions and should be considered the MCO’s best recommendation for a given MachineConfig change. As a cluster administrator, you’re free to override them as you see fit. But I do want to emphasize that:

The MCO does not check your policy actions for correctness. Other than some basic validation checks, there isn’t a way for the MCO to ensure that the actions you define will leave your node in a healthy state after the update. If in doubt, verify that your policy has the desired effect before deploying it across your clusters
The policies apply to any sort of files/units difference between the currentConfig and desiredConfig. This includes additions, updates and removals of the files and units in question
The default action for an unspecified change is reboot. If any of the changes require a reboot, all other actions will be skipped to avoid redundancy
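The lookup and the reboot rules above can be sketched in a few lines of Python. This is a simplified model, not the daemon's implementation: it handles only file policies, matches paths exactly, and drops duplicate actions, all of which are assumptions made for illustration. The name build_action_queue is hypothetical.

```python
def build_action_queue(changed, file_policies):
    """Map changed file paths to policy actions, defaulting to a reboot."""
    by_path = {policy["path"]: policy["actions"] for policy in file_policies}
    queue = []
    for path in changed:
        # A change with no matching policy falls back to a reboot.
        for action in by_path.get(path, [{"type": "Reboot"}]):
            if action not in queue:  # skip duplicates (assumed behavior)
                queue.append(action)
    # A reboot makes every other action redundant, so keep only that.
    if any(action["type"] == "Reboot" for action in queue):
        return [{"type": "Reboot"}]
    return queue


# Example policy; the path and service name are illustrative.
policies = [
    {
        "path": "/etc/chrony.conf",
        "actions": [
            {"type": "Restart", "restart": {"serviceName": "chronyd.service"}},
        ],
    },
]

# Only a covered file changed: the node restarts one service.
print(build_action_queue(["/etc/chrony.conf"], policies))
# An uncovered file changed as well: everything collapses into a reboot.
print(build_action_queue(["/etc/chrony.conf", "/etc/motd"], policies))
```

Note how a single uncovered change is enough to bring back the reboot, which is why it pays to review the full set of files and units an update touches before relying on a policy.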

Summary

The addition of node disruption policies provides a powerful mechanism to control disruptions to your cluster. This change empowers cluster administrators to balance stability and customization during updates, especially in resource-constrained environments. We’re excited to see how users will take advantage of this new feature, and while we consider it complete for the moment, we’re open to suggestions to make it even better. Please contact your account team or the Red Hat Support team with your feedback.

About the author

David Joshy

Software Engineer

David joined Red Hat in 2022. He enjoys working with Go and tackling longstanding challenges in OpenShift. Outside of work, he loves diving into hard sci-fi—both books and movies—and has a passion for exploring new countries.
