The Machine Config Operator (MCO) in Red Hat OpenShift has been able to perform disruptionless updates on select changes since version 4.7. These select changes were hardcoded in the MCO. To make this process more user-friendly and customizable, the MCO team is introducing node disruption policies.

This blog post offers context behind node disruption policies, explains how the MCO uses them during a MachineConfig update, and highlights important points to be aware of when using them.

Why hand over node disruption control to administrators?

Disruptions can be very expensive for customers, especially those running resource-constrained environments. We’ve had several requests from customers wanting to reduce disruption when deploying MachineConfig updates, and for good reason: only they can determine the tradeoffs between node disruptions and the cost to their business. While we could keep hardcoding these decisions into the back end of the MCO, this is not a practical solution. Without transparency, a customer may have different expectations about what should or should not cause a disruption (e.g., drains or reboots). This created a clear need for users to easily query, define and customize the deployment of these changes.

How MachineConfig updates used to work

Before we dive into how the policies work, here is a simplified primer on how MachineConfig (MC) updates are deployed:

  1. You update or create a new MachineConfig object to customize a target MachineConfigPool
  2. The MCO’s machine-config-controller, which is watching all MachineConfig objects, merges and generates a new rendered MachineConfig for your targeted pool
  3. The machine-config-controller then starts deploying these rendered MachineConfigs to nodes of that pool, respecting that pool’s maxUnavailable

Then the machine-config-daemon on the node comes into play:

  1. The machine-config-daemon realizes that the current and desired MachineConfig of its node no longer match
  2. It then calculates the difference between them and determines if this update requires a reboot or one of those special hardcoded actions mentioned earlier
  3. The machine-config-daemon then begins to execute the actions on the node.
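The "calculate the difference" step above can be pictured with a minimal Python sketch. This is an illustration only, not the MCO's actual code (the MCO is written in Go): the function name `diff_entries` and the sample paths and contents are hypothetical, and each config is simplified to a mapping from file path (or unit name) to contents.

```python
def diff_entries(current: dict[str, str], desired: dict[str, str]) -> dict[str, list[str]]:
    """Classify entries by how they differ between the current and desired config.

    Both arguments map a file path (or unit name) to its contents.
    This flat mapping is a simplification chosen for illustration.
    """
    added = sorted(key for key in desired if key not in current)
    removed = sorted(key for key in current if key not in desired)
    changed = sorted(
        key for key in current if key in desired and current[key] != desired[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical file sets for a node's current and desired configs.
current_files = {"/etc/example/a.conf": "v1", "/etc/example/old.conf": "x"}
desired_files = {"/etc/example/a.conf": "v2", "/etc/example/new.conf": "y"}

file_diff = diff_entries(current_files, desired_files)
```

In the legacy flow, the result of a diff like this was matched against the hardcoded list to decide whether a reboot or one of the special actions was needed.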

Let’s explore how node disruption policies fit into this workflow.

The MCO now has a new home for its knobs and dials

In Red Hat OpenShift 4.17, the Machine Config Operator began using a new API type called MachineConfiguration as a central control point for MCO-specific features. Only 1 object (named “cluster”) of this type is allowed in an OpenShift installation. (Note: The MachineConfiguration type should not be confused with the MachineConfig type, which is the type we typically use for node customization.)

Ok, so how do I use this knob?

As mentioned earlier, only 1 object of the MachineConfiguration type is allowed—the MCO will automatically create an object if 1 is not provided during installation or if the existing object is deleted. This object has 2 fields to pay attention to:

  • .spec.nodeDisruptionPolicy: This is where any user-defined node disruption policies go. Although I won’t get into the specifics about how to define policies in this blog post, you can read our Red Hat documentation for more information
  • .status.nodeDisruptionPolicyStatus: This is your cluster’s “effective” node disruption policies. It combines the user-defined policies and the cluster’s default policies. When the spec.nodeDisruptionPolicy is updated, the MCO will calculate the new “effective” policies and update this field
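To make the relationship between the two fields concrete, here is a small Python sketch of how "effective" policies could be derived. It rests on assumptions that this post does not spell out: policies are simplified to a mapping from file path to a list of action labels, a user-defined entry replaces a default entry for the same path, and every path, service name and label shown is hypothetical. The MCO's real merge logic and default policies may differ.

```python
# Hypothetical defaults; the real cluster defaults are whatever your cluster
# reports in .status.nodeDisruptionPolicyStatus and are not reproduced here.
default_policies = {
    "/etc/example/default.conf": ["None"],
    "/etc/example/shared.conf": ["Reboot"],
}

# Hypothetical user-defined entries, standing in for .spec.nodeDisruptionPolicy.
user_policies = {
    "/etc/example/shared.conf": ["Restart:example.service"],
    "/etc/example/custom.conf": ["DaemonReload"],
}


def effective_policies(
    defaults: dict[str, list[str]], user_defined: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Combine defaults with user-defined policies.

    Assumption: for the same path, the user-defined entry wins.
    """
    merged = dict(defaults)      # copy, so the defaults are left untouched
    merged.update(user_defined)  # user entries override matching defaults
    return merged


effective = effective_policies(default_policies, user_policies)
```

With an empty `user_policies`, the result is simply the defaults, which matches the behavior described for an empty spec field later in this post.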

The new flow

Most of the original flow at the cluster level remains the same. The actual changes appear at the node level: 

  1. The machine-config-daemon becomes aware that the current and desired MachineConfig of its node no longer match
  2. It then calculates the difference between the two
  3. It checks that the MachineConfiguration status is updated, ensuring any user changes to the .spec.nodeDisruptionPolicy field have been accounted for in .status.nodeDisruptionPolicyStatus
  4. The machine-config-daemon then compares the difference against .status.nodeDisruptionPolicyStatus (instead of the legacy hardcoded list) and generates a queue of actions
  5. The machine-config-daemon then begins to execute the queue of actions on the node
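Step 3 above is a freshness check. One common Kubernetes convention for this kind of check is to compare an object's metadata.generation with an observedGeneration recorded in its status. The sketch below assumes the MachineConfiguration object follows that convention; the post does not confirm the exact mechanism, so treat the field names and the helper `status_is_current` as assumptions made for illustration.

```python
def status_is_current(obj: dict) -> bool:
    """Return True when the status reflects the latest spec.

    Assumption: the object carries metadata.generation and
    status.observedGeneration, as many Kubernetes objects do.
    """
    generation = obj.get("metadata", {}).get("generation")
    observed = obj.get("status", {}).get("observedGeneration")
    return generation is not None and generation == observed


# A hypothetical object whose spec was edited after the status was last written.
stale = {"metadata": {"generation": 4}, "status": {"observedGeneration": 3}}

# The same object once the status has caught up with the spec.
fresh = {"metadata": {"generation": 4}, "status": {"observedGeneration": 4}}

stale_ok = status_is_current(stale)
fresh_ok = status_is_current(fresh)
```

A daemon following this pattern would wait and retry while the check is false, so that a policy edit made moments before an update is not silently ignored.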

Some things to consider:

When spec.nodeDisruptionPolicy is empty, status.nodeDisruptionPolicyStatus will reflect the default cluster policies. These default policies are based on the legacy hardcoded actions and should be considered the MCO’s best recommendation for a given MC change. As a cluster administrator, you’re free to override them as you see fit. But I do want to emphasize that:

  • The MCO does not check your policy actions for correctness. Other than some confidence checks, there isn’t a way for the MCO to ensure that the actions you define will leave your node in a healthy state after the update. If in doubt, we highly recommend verifying that your policy has the desired effect before deploying it across your clusters
  • The policies apply to any sort of files/units difference between the currentConfig and desiredConfig. This includes additions, updates and removals of the files and units in question
  • The default action for an unspecified change is reboot. If any of the changes require a reboot, all other actions will be skipped to avoid redundancy
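The last point can be sketched in a few lines of Python. This is a simplified model rather than the MCO's implementation: policies are a mapping from exact file path to a list of action labels, the labels and paths are illustrative, and matching by exact path is an assumption made to keep the example short.

```python
REBOOT = "Reboot"


def build_action_queue(
    changed_paths: list[str], policies: dict[str, list[str]]
) -> list[str]:
    """Turn a list of changed paths into an ordered queue of actions.

    Rules modelled here, taken from the points above:
    - a change with no matching policy defaults to a reboot
    - if anything requires a reboot, all other actions are skipped
    """
    queue: list[str] = []
    for path in changed_paths:
        actions = policies.get(path, [REBOOT])  # unspecified change -> reboot
        if REBOOT in actions:
            return [REBOOT]  # a reboot makes every other action redundant
        for action in actions:
            if action not in queue:  # avoid queueing the same action twice
                queue.append(action)
    return queue


# Hypothetical effective policies for two files.
policies = {
    "/etc/example/a.conf": ["Restart:example.service"],
    "/etc/example/b.conf": ["DaemonReload", "Restart:example.service"],
}

no_reboot = build_action_queue(
    ["/etc/example/a.conf", "/etc/example/b.conf"], policies
)
with_unknown = build_action_queue(
    ["/etc/example/a.conf", "/etc/example/unlisted.conf"], policies
)
```

Here `no_reboot` contains only the restart and daemon-reload labels, while `with_unknown` collapses to a single reboot because one changed file has no policy.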

Summary

The addition of node disruption policies provides a powerful mechanism to control disruptions to your cluster. This change empowers cluster administrators to balance stability and customization during updates, especially in resource-constrained environments. We’re excited to see how users will take advantage of this new feature, and while we consider it complete for the moment, we’re open to suggestions to make it even better. Please contact your account team or the Red Hat Support team with your feedback.

About the author

David joined Red Hat in 2022. He enjoys working with Go and tackling longstanding challenges in OpenShift. Outside of work, he loves diving into hard sci-fi—both books and movies—and has a passion for exploring new countries.
