AI-Driven Self-Healing Cloud Systems: Enhancing Reliability and Reducing Downtime through Event-Driven Automation
Authors: Rajeev Kumar Arora, Anoop Kumar, Arpita Soni and Aniruddh Tiwari
Publishing Date: 02-01-2025
ISBN: 978-81-955020-9-7
Abstract
This study designs and implements a self-healing cloud system built on an event-driven automation framework that applies the if-this-then-that principle to incident management and recovery. It presents a recovery engine with Artificial Intelligence (AI)-based decision-making that selects the best remedial actions from a pre-established catalogue in order to maximise system reliability and minimise downtime. The system is tested on an OpenStack-based video-on-demand service, where multiple faults are reproduced to assess the effectiveness of various recovery actions and workflows. The decision-making module of the recovery engine analyses data from these experiments to determine the most effective remedial actions, taking into account their impact on quality of service and other factors. The recovery engine is designed to need human input only for parameterising and optimising the decision models at particular points in time. The study presents and evaluates the results of these AI-driven decision-making techniques to show how they can improve mean time to repair and overall service quality in cloud environments. This approach represents a shift towards cloud systems that are more robust, autonomous, and able to manage anomalies and recover from failures effectively.
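The if-this-then-that pattern described above, with an engine choosing one action from a catalogue, could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the event kinds, action names, repair-time and QoS figures, the `qos_weight` parameter, and the simple weighted-sum score are all assumptions made for the example, standing in for the AI-based decision models the study evaluates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    kind: str      # hypothetical event type, e.g. "service_down"
    resource: str  # hypothetical identifier of the affected resource


@dataclass
class RecoveryAction:
    name: str
    applies_to: str           # event kind this action can remediate ("if this")
    expected_repair_s: float  # assumed repair-time estimate, in seconds
    qos_penalty: float        # assumed QoS impact in [0, 1]; lower is better


# Illustrative catalogue; every entry and number here is made up.
CATALOGUE: List[RecoveryAction] = [
    RecoveryAction("restart_service", "service_down", 20.0, 0.30),
    RecoveryAction("rebuild_instance", "service_down", 120.0, 0.80),
    RecoveryAction("migrate_instance", "host_degraded", 90.0, 0.20),
]


def score(action: RecoveryAction, qos_weight: float = 60.0) -> float:
    """Cost of an action (lower is better): repair time plus weighted QoS penalty.

    qos_weight is a hypothetical tuning knob, the kind of parameter a human
    operator might set when parameterising the decision model.
    """
    return action.expected_repair_s + qos_weight * action.qos_penalty


def choose_action(
    event: Event,
    catalogue: List[RecoveryAction],
    qos_weight: float = 60.0,
) -> Optional[RecoveryAction]:
    """Return the lowest-cost action whose trigger matches the event, if any."""
    candidates = [a for a in catalogue if a.applies_to == event.kind]
    if not candidates:
        return None
    return min(candidates, key=lambda a: score(a, qos_weight))


def handle(event: Event) -> Optional[str]:
    """'Then that': name of the selected action, or None if no rule matches."""
    action = choose_action(event, CATALOGUE)
    return action.name if action is not None else None


if __name__ == "__main__":
    print(handle(Event("service_down", "vod-frontend")))  # restart_service
    print(handle(Event("host_degraded", "compute-01")))   # migrate_instance
```

With these illustrative numbers, a `service_down` event selects `restart_service` because its combined cost (20 + 60 × 0.30 = 38) is lower than that of `rebuild_instance` (120 + 60 × 0.80 = 168). In a system like the one described, the estimates would come from the fault-injection experiments rather than from fixed constants.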
Keywords
Artificial Intelligence, Cloud System, Engine, OpenStack.
Cite as
Rajeev Kumar Arora, Anoop Kumar, Arpita Soni and Aniruddh Tiwari, "AI-Driven Self-Healing Cloud Systems: Enhancing Reliability and Reducing Downtime through Event-Driven Automation", In: Mukesh Saraswat and Rajani Kumari (eds), Applied Intelligence and Computing, SCRS, India, 2025, pp. 293-301. https://doi.org/10.56155/978-81-955020-9-7-28