Why Do Teams Adopting Kubernetes Fight over Capacity Management?

May 16, 2019

Kubernetes capacity management is a core competency of teams shipping cloud-native applications. Proper capacity management enables an excellent customer experience, lets organizations innovate faster, and maximizes the ROI of your cloud infrastructure. Capacity management, however, is a challenge for many organizations for three main reasons:

Capacity management is affected by many moving parts, such as user workloads, application architecture, and the underlying cloud infrastructure.
You need the engagement of multiple team members to balance performance, resources, and the cost of running cloud-native applications.
It is hard to build a common picture of what effective Kubernetes and application capacity management looks like.

Conflicting Requirements

Developers, DevOps engineers, and engineering managers are the three leading roles that directly affect how well capacity is managed. Getting them to agree on what effective capacity management means is a challenge. Each role has its own motivations for getting the job done, and their requirements may conflict. For example, developers are motivated to ship features quickly. They often have little time to analyze the resources they need or to improve the efficiency of their code. Let’s dig deeper into the motivations of each role.

Developers

Developers ship features and fix bugs. In my experience watching teams adopt Kubernetes, developers are motivated by these factors:

I want our containers’ CI/CD pipeline to be fast and reliable. For example:
My code is deployed to our Kubernetes cluster within a few minutes.
If my deployment fails, the system keeps running on the previous version.
I can get a detailed and meaningful report about why my deployment failed.
Our Kubernetes cluster is resilient enough to recover from any transient failures or resource issues. For example:
I don’t have to tweak resource requests and limits too frequently.
I don’t need to go through CPU and memory budgeting exercises.
VM failures are expected. I want our cluster to recover quickly, before alarms go off.
Our observability pipeline (Prometheus + Grafana) gives me all the metrics I need to diagnose issues. For example:
I can relate user experience to the performance of my containers or microservices.
I can see what’s taking place at the infrastructure level.
I’m always getting better at shipping cloud-native applications or services. For example:
I can see the improvements in performance and resource utilization from one release to the next.

DevOps

DevOps or infrastructure engineers are at the core of making sure that products meet their SLAs. They are in the middle of an ongoing storm of evolving infrastructure, application architecture, and business requirements. These requirements usually motivate DevOps engineers:

I want my Kubernetes cluster to be stable and secure. Moving too fast may break our infrastructure or open security gaps. Having control of these is critical to the stability of our infrastructure.
Some of the worker nodes or even master nodes will fail at a certain point. I should have all the redundancy I need to avoid significant interruptions.
I want developers to deploy their pods and containers independently.
I want to make sure that our infrastructure is adequately utilized without jeopardizing the users’ experience.
I want predictable performance, so that I get the best out of our infrastructure and know that it will keep up with changes in user workloads.
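
One way to put a number on "adequately utilized" is to compare the CPU that pods request on a node with the CPU the node can allocate. The sketch below is only an illustration: the function names are hypothetical, the quantities are made up, and it assumes CPU values in the usual Kubernetes notation (for example `500m` for half a core, or `2` for two cores).

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' into cores."""
    if quantity.endswith("m"):
        # The 'm' suffix means millicores: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def request_ratio(pod_cpu_requests: list[str], node_allocatable: str) -> float:
    """Return the fraction of a node's allocatable CPU claimed by pod requests."""
    allocatable = parse_cpu(node_allocatable)
    if allocatable <= 0:
        raise ValueError("node_allocatable must be greater than zero")
    requested = sum(parse_cpu(q) for q in pod_cpu_requests)
    return requested / allocatable


# Made-up example: three pods scheduled on a node with 4 allocatable cores.
ratio = request_ratio(["500m", "250m", "1"], "4")
print(f"Requested CPU: {ratio:.0%} of allocatable")
```

A low ratio suggests the node is over-provisioned, while a ratio close to 100% leaves little room for workload spikes. Where the acceptable range lies is a decision for the team, not something this sketch can determine.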

Engineering Manager

Engineering managers enable teams to innovate fast to meet business goals as efficiently as possible. These requirements usually motivate them:

I want the adoption of Kubernetes to enable my team to be nimble and agile. Shipping features and handling any issues is critical to our products or services.
I want to measure and improve the team’s effectiveness and get the highest ROI out of our applications and infrastructure.
I want my team to spend most of its time innovating in core business requirements/areas.
I want to eliminate any friction between infrastructure maintenance/growth and application development.

Other Stakeholders?

Business owners and product managers also affect capacity management and planning in many cases, especially around significant business events. A marketing campaign, for example, may drive unusual traffic. The corresponding business owner should warn developers and DevOps engineers about the abnormal traffic. The challenge is that there is often only a rough estimate of the number of users or the amount of traffic, and it is hard to map that estimate to specific system requirements. Many teams end up over-provisioning to be on the safe side.
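
To make the mapping from a rough traffic estimate to system requirements concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the 30% headroom, the per-pod throughput, and the CPU request are hypothetical values that a team would replace with figures from its own load tests.

```python
import math


def replicas_needed(peak_rps: float, rps_per_pod: float, headroom: float = 0.3) -> int:
    """Return the number of pods needed to serve peak_rps plus a safety margin."""
    if peak_rps < 0 or rps_per_pod <= 0 or headroom < 0:
        raise ValueError("peak_rps and headroom must be >= 0, rps_per_pod must be > 0")
    # Inflate the estimate by the headroom, then round up to whole pods.
    return max(1, math.ceil(peak_rps * (1 + headroom) / rps_per_pod))


def cpu_cores_needed(replicas: int, cpu_request_millicores: int) -> float:
    """Return the total CPU cores requested by the given number of pods."""
    return replicas * cpu_request_millicores / 1000


# Hypothetical campaign: 1,200 requests/s at peak, one pod handles about 150 requests/s.
replicas = replicas_needed(peak_rps=1200, rps_per_pod=150)
cores = cpu_cores_needed(replicas, cpu_request_millicores=250)
print(f"{replicas} replicas, {cores} CPU cores requested")
```

Writing the headroom down as an explicit parameter gives the business owner, developers, and DevOps engineers a single number to discuss, instead of each role padding the estimate on its own.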

When Does Contention Build Up?

Getting into the vicious cycle of poor capacity management. Teams quickly fall into this cycle when they are reactive most of the time. Reacting to bad performance, Live Site Incidents (LSIs), or the monthly cloud bill keeps your team in constant firefighting mode. You have to break this cycle at some point. Make sure your team has the right KPIs and priorities to tackle each dimension of capacity management proactively.

Lack of a common view of capacity management. Each team member looks at the world from their own point of view. For example, developers look at microservices and ignore, or don’t fully understand, the limits of their infrastructure. Focusing on one set of metrics without considering the impact on the rest of the system is also a dangerous practice. We have seen DevOps engineers take applications down while trying to improve CPU utilization. They usually overlook the impact of these changes on the application’s performance and usage patterns.

Triggers and Indicators of Poor Capacity Management

So, how do you know if you have room to improve how your team manages the capacity of your Kubernetes clusters? I broke it down in the table below into the three areas that any organization should keep an eye on. To accurately assess your team’s effectiveness, answer these questions:

How frequently does your team get these triggers?
How much of your team’s time is spent reacting to these triggers?
Do the same few team members always react to these triggers, or is the work distributed across the whole team?
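
If your team keeps even a simple log of these triggers, the three questions above can be answered with a few lines of code. The sketch below is a hypothetical illustration: the `Trigger` record, its fields, and the sample data are all assumptions, and the trigger kinds simply mirror the ones mentioned earlier (bad performance, incidents, and the cloud bill).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trigger:
    kind: str           # e.g. "performance", "incident", "cloud_bill"
    responder: str      # team member who handled it
    hours_spent: float  # time spent reacting


def summarize(triggers: list[Trigger], team_hours: float) -> dict:
    """Summarize trigger frequency, reactive time share, and responder concentration."""
    hours_by_responder: Counter = Counter()
    for t in triggers:
        hours_by_responder[t.responder] += t.hours_spent
    reactive_hours = sum(t.hours_spent for t in triggers)
    # Share of reactive work carried by the single busiest responder.
    top_share = max(hours_by_responder.values()) / reactive_hours if reactive_hours else 0.0
    return {
        "count_by_kind": dict(Counter(t.kind for t in triggers)),
        "reactive_time_share": reactive_hours / team_hours if team_hours else 0.0,
        "top_responder_share": top_share,
    }


# Made-up month: four triggers, 160 hours of total team time.
log = [
    Trigger("incident", "ana", 6),
    Trigger("performance", "ana", 4),
    Trigger("cloud_bill", "raj", 2),
    Trigger("incident", "ana", 8),
]
summary = summarize(log, team_hours=160)
print(summary)
```

A high `top_responder_share` indicates that the same few people absorb most of the reactive work, which is the pattern the third question is probing for.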

Can It Be More Collaborative?

We learned that capacity management inside Kubernetes is a collaborative effort. Kubernetes provides a useful abstraction of the infrastructure. Your team, however, still has a lot of interaction points. The team still needs to collaborate on capacity allocation, application performance tuning, and, of course, saving on the cost of cloud infrastructure. You can read more about this topic here. In a nutshell, you need to focus more

If you are a Developer, you need to:

Declare the ownership of the proper points inside the Kubernetes cluster
Own the observability of your application and microservices
Be sure about the resources you need for your application or microservices

If you are a DevOps engineer, you need to:

Establish a clear interaction workflow between you and the rest of the team.
Own the observability of the infrastructure and identify how pods are utilizing available capacity.
Understand different billing options in your public cloud provider to reduce the cost of infrastructure as much as possible.

If you are an engineering manager, you need to:

Make sure that your software engineers and DevOps engineers perform the steps mentioned above :)
Keep an eye on the contention that may build up when you feel your capacity management is going sideways.
Build clear KPIs to track the health of your capacity. You will find some tips in this article.

At Magalix, we can help you with your Kubernetes adoption journey. In one dashboard, you can see the performance of your containers, your Kubernetes cluster utilization, and a detailed cost analysis. Connect your Kubernetes cluster for free today and get an in-depth analysis of it. You can also run your cluster in Autopilot mode to keep adjusting your capacity proactively based on anticipated workloads.