Resilience and availability for telcos moving to the cloud

March 3, 2025, by Amy Fredj (3-minute read)

In my previous article, I discussed the typical IT journey of a telco company and the architectural requirements of the telco industry. In this article, I explore two key concerns for the telco cloud: availability and resilience.

While some enterprises can tolerate some loss of service, the telco world tolerates almost none: the standard target is "five nines" (99.999%) availability. To meet this requirement, a telco maintains software backups that can be restored to specific points in time, spare hardware capacity, geographical redundancy, and support for quick recovery from disasters.
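To make the availability target concrete, the short sketch below converts an availability percentage into a yearly downtime budget. The function name and the choice of an average 365.25-day year are assumptions made for illustration, not part of any telco standard.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year length, an illustrative assumption


def downtime_budget_minutes(availability: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, which is why every layer described in this article needs redundancy.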

Resiliency and the telco cloud

End-to-end resiliency combines the capabilities of the cloud platform, the applications that run on it, and the underlying hardware. The prerequisites of the cloud platform become inherited requirements for the other components in the system. Applications must be designed to be resilient and must have application-specific capabilities. Redundancy must be built into the application software architecture, and geo-redundancy must also be supported.

Beyond the software, the underlying hardware must also be designed to be resilient and highly available. Hardware redundancy is one way to achieve this.
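As a rough illustration of why hardware redundancy helps, the sketch below estimates the combined availability of several redundant units when any single unit can carry the load. It assumes that failures are independent, which real hardware sharing a rack, power feed, or firmware version often violates, so treat the result as an upper bound. The function name is hypothetical.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant components when any one can carry the load.

    Assumes independent failures (an idealization): the system is down only
    when every unit is down at the same time.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{parallel_availability(0.99, count):.6f}")
```

With these assumptions, two units that are each 99% available combine to about 99.99%, and each additional unit adds roughly two more nines.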

High availability and the telco cloud

All components of the cloud architecture must be designed with a policy of high availability.

High availability is essential for customers, especially in rainy-day cases such as an abrupt node reboot or a power outage. A highly available cluster is robust enough to keep working normally in such circumstances, and it can be recovered afterward. In a Kubernetes system, high availability must be offered at the following levels:

Kubernetes cluster high availability
etcd cluster high availability
keepalived floating IP, to avoid a single point of failure
Local registry redundancy
Chart repository redundancy
Persistent storage redundancy
Service discovery high availability

In a cloud based on Kubernetes, redundancy of the controller managers and the Kubernetes nodes is built into the system. In a deployment with three controller nodes, each node runs the controller manager, scheduler, and proxy, so redundancy is achieved across the three servers.
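The choice of three controller nodes is not arbitrary. etcd, listed above, needs a majority of its members (a quorum) to agree before it accepts a write. The sketch below works out how many member failures a cluster of a given size can survive. The function names are hypothetical, and whether etcd runs on the controller nodes themselves depends on the deployment.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Number of members that can fail while a majority remains available."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} members: quorum {quorum_size(size)}, "
              f"tolerates {failures_tolerated(size)} failure(s)")
```

Three members tolerate one failure, and so do four, which is why odd cluster sizes are the usual choice: an even-sized cluster adds a member without adding any failure tolerance.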

The nodes on which workloads run are also designed with redundancy. In addition, a robust and secure architecture often designates certain worker nodes as the only ones that can communicate with the outside world. These nodes, often referred to as edge nodes, are redundant with each other and provide a highly available interface between the outside world and the internal network.

Self-healing environments

When an issue occurs in a specific area of a system, it can often be mitigated by a self-healing environment. This can include:

Process-level healing: Process supervision triggers a correction in the event of a failure.
Container-level healing: A restart policy is defined, and liveness probes detect failures. When needed, the container is restarted.
Pod-level healing: If a host becomes unavailable, the affected pods are rescheduled on another host.
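The container-level case can be sketched as a small state machine: a container is restarted only after several consecutive failed liveness probes, so a single slow response does not trigger a restart. This is a toy model that does not call the Kubernetes API. The `Container` class and `record_probe` function are hypothetical names, and the default threshold of 3 is chosen to match the Kubernetes `failureThreshold` default.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Toy stand-in for a running container (hypothetical, not a Kubernetes type)."""
    name: str
    restarts: int = 0
    consecutive_failures: int = 0


def record_probe(container: Container, healthy: bool, failure_threshold: int = 3) -> bool:
    """Record one liveness probe result and return True if it triggers a restart."""
    if healthy:
        # Any successful probe resets the failure streak.
        container.consecutive_failures = 0
        return False
    container.consecutive_failures += 1
    if container.consecutive_failures >= failure_threshold:
        # Simulated restart: count it and start a fresh failure streak.
        container.restarts += 1
        container.consecutive_failures = 0
        return True
    return False


if __name__ == "__main__":
    app = Container("billing")
    for result in [True, False, True, False, False, False, True]:
        if record_probe(app, result):
            print(f"{app.name} restarted (total restarts: {app.restarts})")
```

In this run the isolated failure is ignored, and only the streak of three consecutive failures leads to a restart.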

Disasters can happen

Business continuity plans must account for potential disasters, ranging from loss of data to a natural disaster that wipes out an entire local system. A good telco operator has a business continuity plan and practices it periodically. The plan addresses all aspects of recovery, from the moment a disaster is declared, through bringing up a geo-redundant site, until the original site can be restored. The recovery plan spans the bring-up of the hardware and the cloud platform, continues until all applications hosted by the cloud are running, and ends when continuity of operations is confirmed from the alternate site.

The strength of a recovery plan depends on proper data retention, meaning periodic backups that capture the system at points in time. It also depends on how long the system takes to recover.

As part of a disaster recovery program, a platform must also be able to recover from a disaster using the backups that have been performed. The customer can define a cadence of backup snapshots and, in the event of a disaster, roll back to the relevant point in time.
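Rolling back to "the relevant point in time" usually means picking the newest snapshot taken at or before the moment the operator wants to return to. The sketch below shows one way to make that selection. The function name is hypothetical, and it assumes the snapshot timestamps are already sorted in ascending order.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence


def snapshot_for_rollback(snapshots: Sequence[datetime], target: datetime) -> Optional[datetime]:
    """Return the newest snapshot taken at or before `target`, or None if there is none.

    `snapshots` must be sorted in ascending order (an assumption of this sketch).
    """
    index = bisect_right(snapshots, target)
    return snapshots[index - 1] if index > 0 else None


if __name__ == "__main__":
    # Hypothetical daily snapshots taken at midnight.
    daily = [datetime(2025, 3, 1), datetime(2025, 3, 2), datetime(2025, 3, 3)]
    print(snapshot_for_rollback(daily, datetime(2025, 3, 2, 12, 0)))
```

The gap between the chosen snapshot and the target time is the data that cannot be recovered, so a shorter snapshot cadence directly reduces potential data loss.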

A backup plan includes successful snapshots of all releases of all applications in the cloud, as well as a backup of the underlying infrastructure and configurations. Recovering the platform from data corruption or other logical errors may not require reinstallation and may not cause downtime for unaffected elements, depending on the recovery strategy of the applications and the hardware infrastructure.

Aiming for zero downtime

The goal of near-zero downtime is achievable with a well-designed, highly available architecture. When one component fails, traffic transitions smoothly to a redundant component, giving the customer time to restore the failed one.

A geo-redundant site contributes to high availability because, in the event of a disaster, customers can fail over to it. A well-defined recovery plan means the original site can be reinstated after its hardware has been restored.
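The failover and reinstatement logic described above can be sketched as a simple preference-ordered selection: serve from the first healthy site in the list, and return to the original site once it reports healthy again. The site names and the function are hypothetical. In practice, the health status would come from monitoring, and returning to the original site is often a deliberate operator decision rather than an automatic one.

```python
from typing import Mapping, Optional, Sequence


def choose_active_site(preference: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first site in `preference` that is healthy, or None if all are down.

    Sites missing from `healthy` are treated as unhealthy.
    """
    for site in preference:
        if healthy.get(site, False):
            return site
    return None


if __name__ == "__main__":
    sites = ["primary", "geo-redundant"]  # hypothetical site names
    print(choose_active_site(sites, {"primary": True, "geo-redundant": True}))
    print(choose_active_site(sites, {"primary": False, "geo-redundant": True}))
```

Because the original site stays first in the preference list, the same function covers reinstatement after the hardware has been restored.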

For more information about Red Hat's telco services, visit our Telco industry page.

About the author

Amy Fredj

Product Manager

With over two decades of experience in the telco world, in positions ranging from software engineer and system engineer to marketing and product management, Amy has a broad perspective on where the wind blows in the telco world. She has grown with the industry from legacy systems, through virtualization, to the cloud. In the past few years, Amy has developed a keen interest in security in the real world. She has lectured in different venues and across diverse fields. A curious person, she is always open to meeting new people and hearing new ideas.
