Toward ML-Centric Cloud Platforms

Lego construction - Credit: Marcel Clemens

Communications of the ACM,
Contributed Articles: “Toward ML-Centric Cloud Platforms”
By Ricardo Bianchini, Marcus Fontoura, et al.

“Cloud platforms are extremely expensive to build and operate, so providers have a strong incentive to optimize their use.”

 

 

Cloud platforms, such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform, are tremendously complex. For example, the Azure Compute fabric governs all the physical and virtualized resources running in Microsoft’s datacenters. Its main resource management systems include virtual machine (VM) and container scheduling, server and container health monitoring and repairs, power and energy management, and other management functions. (Hereafter, we refer to VMs and containers simply as “containers.”)

 

Cloud platforms are also extremely expensive to build and operate, so providers have a strong incentive to optimize their use. A nascent approach is to leverage machine learning (ML) in the platforms’ resource management using supervised learning techniques, such as gradient-boosted trees and neural networks, or reinforcement learning. We discuss why ML is often preferable to traditional non-ML techniques.

 

Public cloud providers are starting to explore ML-based resource management in production. For example, Google uses neural networks to optimize fan speeds and other energy knobs. In academia, researchers have proposed using collaborative filtering (a common technique in recommender systems) to schedule containers for reduced in-server performance interference. Others have proposed using reinforcement learning to adjust the resources allocated to co-located VMs. Later, we discuss other opportunities for ML-based management.

 

Despite these prior efforts and opportunities, it is currently unclear how best to integrate ML into cloud resource management. In fact, prior approaches differ in multiple dimensions. For example, in some cases, the ML technique produces insights/predictions about the workload or infrastructure; in others, it produces actual resource management actions. In some cases, the ML is deeply integrated with the resource manager; in others, it is completely separate. In all cases, the ML addresses a single management problem; a different problem requires another approach. We discuss these dimensions, the possible integration designs, and their architectural, functional, and API implications.
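One of these dimensions, whether the ML produces a prediction or an action, can be made concrete with a small sketch. Everything here is a hypothetical illustration rather than any provider's API: the `InsightModel` and `ActionModel` interfaces, the placement functions, and the toy models are all assumed names and policies.

```python
from typing import Dict, Optional, Protocol


class InsightModel(Protocol):
    """Hypothetical ML component that only predicts; the manager decides."""
    def predict_peak_cpu(self, container_id: str) -> float: ...


class ActionModel(Protocol):
    """Hypothetical ML component that directly picks the management action."""
    def choose_server(self, container_id: str, free_cpu: Dict[str, float]) -> str: ...


def place_with_insight(model: InsightModel, container_id: str,
                       free_cpu: Dict[str, float]) -> Optional[str]:
    """The resource manager keeps the policy: first server (by name) that
    has enough free CPU for the predicted peak, or None if none fits."""
    need = model.predict_peak_cpu(container_id)
    for server in sorted(free_cpu):
        if free_cpu[server] >= need:
            return server
    return None


def place_with_action(model: ActionModel, container_id: str,
                      free_cpu: Dict[str, float]) -> str:
    """The resource manager simply applies whatever the model chose."""
    return model.choose_server(container_id, free_cpu)


class FixedPeak:
    """Toy insight model: always predicts a peak of 2 cores."""
    def predict_peak_cpu(self, container_id: str) -> float:
        return 2.0


class MostFree:
    """Toy action model: always picks the server with the most free CPU."""
    def choose_server(self, container_id: str, free_cpu: Dict[str, float]) -> str:
        return max(free_cpu, key=free_cpu.get)


free = {"s1": 1.0, "s2": 4.0, "s3": 8.0}
insight_choice = place_with_insight(FixedPeak(), "c1", free)  # "s2"
action_choice = place_with_action(MostFree(), "c1", free)     # "s3"
```

In the first style the placement policy stays in the resource manager and can be changed without retraining; in the second, the policy lives inside the model.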

 

As one point in this multi-dimensional space, we built Resource Central (RC) [9], a general ML and prediction-serving system for providing workload and infrastructure insights to resource managers in the Azure Compute fabric. RC collects telemetry from containers and servers, learns from their prior behaviors and, when requested, produces predictions of their future behaviors. We are currently using RC to accurately predict many characteristics of the Azure Compute workload. We present an overview of RC and its initial uses and results, and describe the lessons from building it.
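The collect, learn, and predict loop described above can be illustrated with a minimal sketch. This is not RC's implementation or API: the class `ToyPredictionService`, its methods, the lifetime buckets, and the majority-vote "model" are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Prediction:
    """A predicted label plus a confidence score in [0, 1]."""
    label: str
    confidence: float


class ToyPredictionService:
    """Hypothetical stand-in for a prediction-serving system.

    It stores past observations per subscription and predicts the most
    frequently observed label. A real system would train proper ML models;
    majority vote is used here only for illustration.
    """

    def __init__(self) -> None:
        self._history: Dict[str, Counter] = defaultdict(Counter)

    def collect(self, subscription_id: str, lifetime_bucket: str) -> None:
        """Record one telemetry observation (e.g., a finished container)."""
        self._history[subscription_id][lifetime_bucket] += 1

    def predict(self, subscription_id: str) -> Optional[Prediction]:
        """Return a prediction, or None when there is no history to learn from."""
        counts = self._history.get(subscription_id)
        if not counts:
            return None
        label, hits = counts.most_common(1)[0]
        return Prediction(label, hits / sum(counts.values()))


service = ToyPredictionService()
for bucket in ["short", "short", "long", "short"]:
    service.collect("sub-a", bucket)

result = service.predict("sub-a")        # Prediction(label='short', confidence=0.75)
unknown = service.predict("sub-unseen")  # None: the caller must handle "no prediction"
```

Returning `None` for unseen subscriptions is a design choice in this sketch: it forces the calling resource manager to have a fallback policy for cases where no prediction is available.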

 

Though RC has been successful so far, it has limitations. For example, it does not implement certain forms of interaction with resource managers. More broadly, the integration of ML into real cloud platforms in a general, maintainable, and at-scale manner is still in its infancy. We close the article with some open questions and challenges.


About the Authors:

Ricardo Bianchini is a Distinguished Engineer at Microsoft Research, Redmond, WA, USA.

Marcus Fontoura is a Technical Fellow at Microsoft Research, Redmond, WA, USA.

Eli Cortez is a Principal Engineer at Microsoft Research, Redmond, WA, USA.

Anand Bonde is a senior engineer at Microsoft Research, Redmond, WA, USA.

Alexandre Muzio is a software engineer at Microsoft Azure, Redmond, WA, USA.

Ana-Maria Constantin is a software engineer at Microsoft Azure, Redmond, WA, USA.

Thomas Moscibroda is a partner research scientist at Microsoft Azure, Redmond, WA, USA.

Gabriel Magalhaes is a Ph.D. student at the University of Washington, and was an intern at Microsoft Azure during this work.

Girish Bablani is corporate vice president of Microsoft Azure, Redmond, WA, USA.

Mark Russinovich is a Technical Fellow and CTO at Microsoft Azure, Redmond, WA, USA.