In my previous article, I discussed the typical IT journey of a telco company and the architectural requirements of the telco industry. In this article, I explore two key concerns for the telco cloud: availability and resilience.
While some enterprises can tolerate some loss of service, the telco world has almost no tolerance for it. The usual target is "five 9's" availability, or 99.999% uptime, which leaves only about five minutes of downtime per year. To address the requirement for high availability, a telco maintains software backups with the ability to recover to points in time, spare hardware capacity, geographical redundancy, and support for quick recovery from disasters.
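To make the "five 9's" figure concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365.25-day year are illustrative assumptions, not part of any telco standard.

```python
# Assumption: an average year of 365.25 days is used for the budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the maximum downtime per year, in minutes, for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare a few common targets.
for label, target in [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.2f} minutes per year")
```

Each additional 9 cuts the budget by a factor of ten, which is why the last step to five 9's is hard to reach without redundancy at every layer.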
Resiliency and the telco cloud
End-to-end resiliency is a combination of the capabilities of a cloud platform, the applications that run on it, and the underlying hardware. The prerequisites of the cloud platform become inherited requirements for the other components in the system. Applications must be designed to be resilient and have application-specific capabilities. Redundancy must be built into the application software architecture, and geo-redundancy capabilities must also be supported.
Outside of software, the underlying hardware must also be designed to be resilient and highly available. Hardware redundancy is one way to achieve this.
High availability and the telco cloud
All components of the cloud architecture must be designed with a policy of high availability.
High availability is essential for customers, especially in rainy-day cases such as an abrupt node reboot or a power outage. A highly available cluster can be recovered from these events, and it is robust enough to keep working normally while they occur. High availability in a Kubernetes system must be offered at the following levels:
- Kubernetes cluster high availability
- etcd cluster high availability
- keepalived floating IP, to avoid a single point of failure
- Local registry redundancy
- Chart repository redundancy
- Persistent storage redundancy
- Service discovery high availability
In a cloud based on Kubernetes, redundancy of the controller managers and the Kubernetes nodes is built into the system. Each of the three controller nodes includes the controller manager, scheduler, and proxy, so redundancy is achieved across the three servers.
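One reason three controller nodes is a common choice: etcd, listed above, needs a majority of its members (a quorum) to stay available. The sketch below shows the arithmetic, assuming one etcd member per controller node. The function names are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


# Assumption: one etcd member runs on each controller node.
for members in (1, 3, 4, 5):
    print(f"{members} members: quorum {quorum_size(members)}, "
          f"tolerates {tolerated_failures(members)} failure(s)")
```

Under this model a three-member cluster survives the loss of one node, and a fourth member adds no extra tolerance, which is why odd cluster sizes are usually preferred.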
The nodes on which workloads run are also designed with redundancy. In addition, in a robust and secure architecture, some worker nodes are often designated as the interface to the network, so that only they communicate with the outside world. These nodes, often referred to as edge nodes, are redundant with each other and provide a highly available interface between the outside world and the internal network.
Self-healing environments
When an issue occurs in a specific area of a system, it can often be mitigated by a self-healing environment. This can include:
- Process-level healing: Process supervision triggers correction in the event of a failure
- Container-level healing: A restart policy is defined, and liveness probes detect failures. When needed, the container is restarted
- Pod-level healing: If a host becomes unavailable, affected pods are rescheduled and restarted on another host
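The container-level case can be sketched as a small simulation. This is a simplified model and not Kubernetes code: it assumes the container is restarted once a liveness probe fails a set number of times in a row (the Kubernetes `failureThreshold` setting, which defaults to 3), and it ignores probe timing and restart back-off. The function name is hypothetical.

```python
from typing import Iterable


def count_restarts(probe_results: Iterable[bool], failure_threshold: int = 3) -> int:
    """Count restarts triggered by runs of consecutive failed liveness probes.

    probe_results holds one boolean per probe: True for healthy, False for failed.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    restarts = 0
    consecutive_failures = 0
    for healthy in probe_results:
        if healthy:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            # Assumption: a restarted container starts with a clean count.
            consecutive_failures = 0
    return restarts


# Two failures followed by a success do not trigger a restart; three in a row do.
history = [True, False, False, True, False, False, False, True]
print(count_restarts(history))
```

The threshold is a trade-off: a low value reacts quickly to real failures, while a higher value avoids restarting a container because of a single slow response.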
Disasters can happen
Business continuity plans must account for potential disasters, ranging from loss of data to a natural disaster that wipes out an entire local system. A good telco operator has a business continuity plan and rehearses it periodically. The plan addresses all aspects of recovery, from the moment a disaster is declared, through bringing up a geo-redundant site, until the original site can be restored. The recovery plan starts with bringing up the hardware, continues with the cloud platform and every application it hosts, and ends when continuity of operations is confirmed from the alternate site.
The strength of a recovery plan depends on proper data retention, meaning periodic backups that allow recovery to specific points in time. It also depends on how long the system takes to recover.
As part of a disaster recovery program, a platform must also be able to recover from disaster based on the backups that have been performed. The customer can define a cadence of backup snapshots and, in the event of a disaster, roll back to the relevant point in time.
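Choosing "the relevant point in time" usually means picking the most recent snapshot taken at or before the moment you want to return to. Here is a minimal sketch of that selection. The function name and the six-hour cadence are illustrative assumptions, not features of a specific backup tool.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Iterable, Optional


def snapshot_for_rollback(snapshot_times: Iterable[datetime],
                          target: datetime) -> Optional[datetime]:
    """Return the most recent snapshot taken at or before target, or None."""
    ordered = sorted(snapshot_times)
    # bisect_right counts the snapshots whose timestamp is <= target.
    index = bisect_right(ordered, target)
    return ordered[index - 1] if index else None


# Assumption: snapshots are taken every six hours.
snapshots = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
last_known_good = datetime(2024, 1, 1, 9, 30)
print(snapshot_for_rollback(snapshots, last_known_good))
```

The gap between the chosen snapshot and the target time is the data that cannot be recovered, so the backup cadence directly bounds how much work a rollback can lose.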
A backup plan includes successful snapshots of all releases of all applications in the cloud, and a backup of underlying infrastructure and configurations. Restoring the platform from data corruption or other logical errors may not require re-installation and may not result in downtime for unaffected elements, depending on the recovery strategy of the applications and the hardware infrastructure.
Aiming for zero downtime
The goal of near-zero downtime is achievable with a well-designed, highly available architecture. In the event of failure in one component, there's a smooth transition to a redundant component, which keeps the service running while the customer restores the failed component.
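A rough way to see why redundancy helps is the standard formula for components in parallel. The sketch below assumes that replicas fail independently and that failover is instantaneous. Neither is fully true in practice, so treat the result as an upper bound. The function name is illustrative.

```python
def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group that is up as long as at least one replica is up.

    Assumption: replicas fail independently of each other.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


for replicas in (1, 2, 3):
    print(f"{replicas} replica(s) at 99% each: {redundant_availability(0.99, replicas):.6f}")
```

Shared failure modes, such as a common power feed or a site-wide outage, break the independence assumption, which is one reason geo-redundancy matters.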
A geo-redundant site contributes to high availability because, in the event of a disaster, customers can fail over to it. A well-defined recovery plan means the original site can be reinstated after hardware has been restored.
For more information about Red Hat's telco services, visit our Telco industry page.
About the author
With over two decades of experience in the telco world, spanning positions ranging from software engineer, system engineer, marketing and product management, Amy has a broad perspective of where the wind blows in the telco world. She has grown with the industry from legacy systems, through virtualization and to the cloud. In the past few years, Amy has developed a keen interest in security in the real world. She has lectured in different venues and across diverse fields. A curious person, she is always open to meeting new people and hearing new ideas.