Which One Should You Prioritize? Kubernetes Performance, Cluster Utilization, or Cost Optimization?

May 8, 2019

Whether you are just getting started with Kubernetes or it is fully adopted in your organization, getting the most out of it and providing the best user experience will always be a challenge. Targeting the best performance or highest efficiency of your Kubernetes cluster too early will slow you down significantly, and you may end up with lots of security holes in your cluster or with your team giving up on the technology. Deferring these optimizations will force you into reactive mode, which creates unnecessary stress for you and your team. It is crucial to have a proper framework to prioritize the factors impacting your infrastructure and your users’ experience at the right time.

Teams with successful Kubernetes adoption stories were conscious of the right priorities at each stage of their journey. Whether you are a developer, a DevOps engineer, or an engineering manager, you should plan each step and decide what’s in scope and what’s out of scope for your team until you are production ready.

In a nutshell, teams that are jumping on the Kubernetes train go through the three main stages outlined below. I’ll talk more about them in detail in a later article, stay tuned!

Day-0: Getting started with Kubernetes. Your team provisions Kubernetes for the first time, decides between a provider-managed and a self-managed Kubernetes, gets the CI/CD pipeline ready, sets up some initial monitoring, etc.
Day-1: Fully committed but not in production yet. The team is still figuring out some important aspects of infrastructure, such as network fabric, securing your infrastructure, scaling your application and capacity, etc.
Day-2: Production-grade Kubernetes. Your business or product’s availability and core business operations depend on your Kubernetes cluster(s). The focus at this stage is more on security, monitoring applications, workload switchover, etc.

You cannot prioritize performance, utilization, and cost efficiency all at once. You will need to stage these throughout your journey. Your goal is to have the proper balance of these three factors before reaching day-2 operations. Read this capacity management article to understand how each team member can contribute. But let’s focus here on how you can manage these critical factors during the Kubernetes adoption journey.

Day-0 — Understand Performance Implications

Your goals in day-0 are:

Characterize your microservices and containers. Are they CPU, memory, disk, or network hungry?
Understand the Kubernetes resource model and how to allocate compute resources to containers. Assigning requests and limits to your pods has implications for the performance of your cluster and microservices.
Understand the tricks and limits of the cloud provider’s infrastructure.
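
To make the point about requests and limits more concrete, here is a minimal Python sketch that classifies a pod into one of the Kubernetes QoS classes (Guaranteed, Burstable, or BestEffort) from its containers' resource settings. The dict layout mirrors the resources stanza of a pod spec. The function name qos_class and the sample values are illustrative assumptions, and the rules are simplified: quantities are compared as literal strings, so "500m" and "0.5" would not be treated as equal.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Return the QoS class for a list of container `resources` dicts."""
    any_set = False
    guaranteed = True
    for res in containers:
        requests = res.get("requests", {})
        limits = res.get("limits", {})
        for name in RESOURCES:
            limit = limits.get(name)
            # When only a limit is given, Kubernetes uses it as the request too.
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            # Simplification: quantities are compared as literal strings.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical pod with one container that sets a CPU request but no limits.
print(qos_class([{"requests": {"cpu": "250m"}}]))
```

The class matters for performance because, under node memory pressure, BestEffort pods are the first candidates for eviction and Guaranteed pods the last.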

Which Tools Should You Use in Day-0?

Prometheus is your friend at this stage :) You can install the Prometheus Operator to get the necessary metrics out of your Kubernetes cluster. However, I highly recommend using Kube-Prometheus, which installs for you:

The Prometheus Operator, which creates/configures/manages Prometheus clusters atop Kubernetes
Highly available Prometheus, which is a Cloud Native Computing Foundation project, and a systems and service monitoring system.
Highly available Alertmanager, which handles alerts sent by client applications such as the Prometheus server.
Prometheus node-exporter, which collects hardware and OS metrics exposed by *NIX kernels.
kube-state-metrics, which listens to the Kubernetes API server and generates metrics about the state of the objects
Grafana, which is an open source, feature-rich metrics dashboard and graph editor.

You should consider at this stage organizing your dashboards by downloading and customizing pre-canned Grafana dashboards. They will give you a great starting point to monitor different aspects of your Kubernetes cluster. I also highly recommend installing various Prometheus exporters to expose custom metrics of pre-built containers, such as Redis, Postgres DB, MongoDB, etc.

Which Metrics Should You Use in Day-0?

Monitor your cluster’s overall resource allocation:

Visualize how much CPU/memory capacity you have versus how much you allocated, and finally how much you are using. Check out this nice Grafana Kubernetes capacity dashboard. It summarizes how many resources you allocated vs. how much your containers are utilizing.
Visualize how containers are distributed over your VMs. I couldn’t find a Grafana dashboard that shows how each namespace and container are contributing to the overall resources utilization. You can build it yourself. Or you can try Magalix cluster dashboard that shows the breakdown of CPU and memory usage of each namespace in a single chart.
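
As a rough illustration of the capacity vs. allocated vs. used comparison, the sketch below sums CPU quantities in Kubernetes notation ("500m" means half a core) and derives the three ratios a capacity dashboard typically shows. The helper names parse_cpu and cpu_summary, the node sizes, and the usage figure are hypothetical. In practice these numbers would come from your metrics stack rather than hard-coded lists.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_summary(allocatable, requests, used_cores):
    """Compare node capacity, requested CPU, and measured CPU usage."""
    capacity = sum(parse_cpu(q) for q in allocatable)
    requested = sum(parse_cpu(q) for q in requests)
    return {
        "capacity_cores": capacity,
        "requested_cores": requested,
        "used_cores": used_cores,
        # Share of the cluster that has been promised to containers.
        "allocation_pct": round(100 * requested / capacity, 1),
        # Share of the cluster that is actually busy.
        "utilization_pct": round(100 * used_cores / capacity, 1),
        # How much of what was requested is really being used.
        "request_usage_pct": round(100 * used_cores / requested, 1),
    }


summary = cpu_summary(
    allocatable=["4", "4"],                    # two hypothetical 4-core nodes
    requests=["500m", "1500m", "2", "250m"],   # hypothetical container requests
    used_cores=2.0,                            # hypothetical measured usage
)
print(summary)
```

A large gap between allocation and utilization suggests that requests are set too generously; the reverse usually points at containers running without requests.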

Monitor your microservices KPIs and their impact on resources usage.

Identify a KPI for each pod or container; this KPI should reflect the core operation that this pod/container is performing.
Track CPU usage, requests, limits, and CPU throttling to understand whether the requested resources are reasonable for the workloads you generate, and to see whether your microservice or pod takes a performance hit at a certain point. This simple per-pod CPU tracking Grafana dashboard gives some excellent CPU usage visualization. Also, Magalix dashboards track the overall CPU performance and throttling for your cluster, and you can drill down to container-level metrics.
Track network activity. Since Kubernetes does not yet provide any network resource management, you need to carefully monitor the bandwidth that each pod is consuming and whether that bandwidth is internal or external. Use this node-level dashboard to know what’s happening at the node level.
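
One way to quantify CPU throttling, sketched here as an assumption rather than the method used by the dashboards above, is the fraction of CFS scheduling periods in which a container was throttled. cAdvisor exposes two counters for this, container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total. The function below takes two samples of each counter; the function name and the sample values are hypothetical.

```python
def throttle_ratio(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in which the container was throttled."""
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        return 0.0  # no scheduling periods elapsed (or the counter reset)
    return throttled / periods


# Hypothetical counter samples taken a few minutes apart.
ratio = throttle_ratio(
    periods_start=1000, periods_end=1600,
    throttled_start=40, throttled_end=190,
)
print(f"throttled in {ratio:.0%} of periods")
```

A ratio that stays high while CPU usage sits well below the limit is often a hint that the limit is too tight for bursty workloads.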

Day-1 — Map Performance to Resources Needs

Your goals in day-1 should focus on making sure that your application and cluster are ready to scale:

Relate your application’s KPIs to resources and how they should go together. For example, in case of a web application that exposes HTTP endpoints to your users, you should use the API call rate and latency to identify the amount of CPU and memory needed to scale with your operations.
Connect the scalability of your application and microservices to the scalability of your infrastructure. Understand how pod autoscaling (HPA and VPA) works with the Cluster Autoscaler (CA) to have pretty much a dynamic infrastructure. Please read how Autoscaling works inside Kubernetes and prepare a scalability plan accordingly.
Your containers and pods are probably not of the same nature. Some would need more CPU optimized instances, and others may run more effectively on I/O optimized instances. Identify the types of instances required in your cluster and create different scalability groups for the CA to scale them accordingly. The process has a lot of observation and adjustment at this point, but this will help you a lot in your day-2 cost and performance optimizations.
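
To get a feel for how horizontal pod autoscaling reacts to a metric, the sketch below applies the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the current metric value to the target value, rounded up, with no change when the ratio is within a small tolerance of 1. The function name, the default bounds, and the sample numbers are illustrative assumptions, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale replicas by the metric-to-target ratio."""
    ratio = current_value / target_value
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical deployment: 4 replicas at 90% average CPU, target 60%.
print(desired_replicas(4, current_value=90, target_value=60))
```

When the extra replicas no longer fit on the existing nodes, the pending pods are what prompt the Cluster Autoscaler to add capacity.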

Which Tools Should You Use in Day-1?

You need to form a new friendship with tools that manage the scalability of Kubernetes pods and cluster nodes. Below are a couple of options:

Kubernetes Autoscalers will help you a lot scaling your pods (vertically or horizontally) and your cluster nodes whenever additional capacity is needed. Please read this Kubernetes Autoscaling 101 article about HPA, VPA, and CA to understand how these open-source components work together. My main advice is to take it easy and gradually incorporate them in your cluster. They have many moving parts to align together. Also, make sure you review and update your scalability rules whenever relevant.
Magalix agent is an open source agent that depends on Magalix backend to analyze your pods and cluster metrics to give you scalability recommendations. It offers the autopilot feature to scale your pods and containers based on anticipated workloads proactively. This documentation page provides a high-level overview to decide if it is a good fit.

Which Metrics Should You Use in Day-1?

Application Specific KPIs, which can be any of these:

If you are building the typical web application with HTTP endpoints, you should track API call rate, average payload size, and average latency. You should start with customer-facing APIs and gradually add your internal APIs. It is highly recommended to show in the same dashboard how much compute each of these pods is consuming, so you can correlate usage with resources.
Break your KPIs into container-specific metrics. For example, if you have InfluxDB, you want to track the DB engine metrics, such as average query time, number of queries per second, etc. This will help you characterize a container's resource consumption, correlate it with that container's specific KPIs, and eventually understand how it impacts your application or user-level KPIs. This sample Grafana dashboard for InfluxDB-specific metrics will give you an idea of what I mean by container-specific KPIs.
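
A simple way to correlate a KPI with resource usage is to fit a straight line through paired samples, for example API call rate against CPU cores used, and then project the CPU needed at a higher rate. The sketch below does this with an ordinary least-squares fit. The sample data is made up for illustration, and a linear relationship is itself an assumption that you should check against your own charts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical samples: requests per second and CPU cores used by the pods.
rps = [100, 200, 300, 400]
cpu_cores = [0.6, 1.1, 1.6, 2.1]

slope, intercept = fit_line(rps, cpu_cores)
# Projected cores needed at 1,000 requests per second.
estimate = slope * 1000 + intercept
print(f"{slope:.4f} cores per request/s, about {estimate:.1f} cores at 1000 rps")
```

The slope gives a per-request CPU cost, and the intercept approximates the idle overhead of the service.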

Day-2 — Scale Responsibly and Efficiently

Your goals in day-2 are to scale in a way to maximize the value of your infrastructure without frustrating your users and other stakeholders in the organization:

Build more resiliency to temporary failures in your application and microservices. If some of your microservices run in stateless containers with enough redundancy, this can save you up to 90% of your compute cost. Stateless containers with enough replicas may allow you to use spot (AWS) or preemptible (GCP) instances. Take a look at the Processes section of the 12-factor microservices methodology for the formal definition and some tips on achieving that.
Set your capacity KPIs and mobilize the rest of the organization to optimize for those KPIs. I have written here some guidelines to plan your capacity and effectively manage it.
Understand the different billing models and options you have to save some money. Yes, the cloud is built for on-demand scalability. But the cloud providers now have complex billing models that you need to play with nicely to run lean and efficient. Even if your

Which Tools Should You Use in Day-2?

It is hard to find a single tool that can satisfy day-2 goals. But at this stage, it becomes more of a game of high-level monitoring and decision making. You can import some billing metrics in Prometheus and chart your cost over time. But no open-source tool out there can analyze your billing options and possible optimizations. You can depend on some commercial tools such as Magalix Node Advisor, or CloudHealth reports to give you some insights about your billing optimizations.

Which Metrics Should You Use in Day-2?

You still need to keep an eye on day-0 and day-1 metrics. But now you want to build higher level KPIs that track your application’s performance (or user experience), compute resources utilization, and the cost per relevant business transaction. Below are some examples:

Performance KPIs: API latency (90th, 95th, and 100th percentiles of users).
Resources KPIs: cost per CPU, effective CPU cost (utilization included), cost per memory GB.
Cost KPIs: cost per user or cost per operation (direct and indirect), or cost per microservice.
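
The sketch below shows one possible way to compute a few of these KPIs: latency percentiles using the nearest-rank method, an effective CPU cost that divides the price of a CPU-hour by its utilization, and a cost per user. All function names, prices, latencies, and user counts are hypothetical, and your own definitions of these KPIs may reasonably differ.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def effective_cpu_cost(cost_per_cpu_hour, utilization):
    """Cost of each CPU-hour actually used, given utilization in (0, 1]."""
    return cost_per_cpu_hour / utilization


# Hypothetical API latencies in milliseconds.
latencies_ms = [12, 15, 11, 40, 18, 14, 13, 95, 16, 17]
p90 = percentile(latencies_ms, 90)
p95 = percentile(latencies_ms, 95)
p100 = percentile(latencies_ms, 100)

# Hypothetical pricing: $0.04 per CPU-hour at 40% utilization.
effective = effective_cpu_cost(0.04, 0.40)

# Hypothetical monthly bill divided by monthly active users.
cost_per_user = 1200 / 300

print(p90, p95, p100, round(effective, 3), cost_per_user)
```

The effective CPU cost makes the price of idle capacity visible: the lower the utilization, the more each used CPU-hour really costs.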

At Magalix we can help you in your Kubernetes adoption journey. You can see in one dashboard the performance of your containers, Kubernetes cluster utilization, and detailed cost analysis. Connect your Kubernetes cluster for free today and get an in-depth analysis of your cluster’s capacity and reliability. You can also run your cluster on Autopilot to keep adjusting your capacity proactively based on anticipated workloads.