How Does the Kubernetes Community Manage Kubernetes Scale? — Survey Results
This article wouldn’t be possible without the generous contributions of many members of the Kubernetes community. I’d like to personally thank everyone who took the survey, and especially those who shared further details via email for the sake of giving knowledge back to the community.
I have a personal interest in the performance engineering of distributed systems. I was fascinated by how Kubernetes and containers made apps and infrastructure more fluid and faster to scale. Unfortunately, scaling is not becoming easier. Making it easier and smarter is now part of my mission at Magalix, and we are excited to share our findings with the community.
We did this survey to share the results back with you, and I hope it provides insight into how different community members scale their applications and clusters. The results are anecdotal, but they will hopefully encourage more rigorous qualitative research.
Update: We open sourced the survey. You can find the survey’s raw data here
About the Survey
- The survey consisted of only 7 questions covering the basics of autoscaling inside Kubernetes. They were not meant to dig into the specifics of Kubernetes autoscaling for two reasons: (1) I wanted to be mindful of respondents’ time, and (2) I didn’t want any respondent to get into details that they wouldn’t feel comfortable sharing.
- We received 520 responses in total; 509 valid responses are considered in the charts below.
- Respondents came from the United States, China, India, Europe, and Australia — see the marked world map below.
- The survey was conducted from July 9th to July 18th.
- Some questions allowed multiple answers, since DevOps engineers use more than one way to scale their clusters.
In many cases, we had a long tail of custom answers. I’ve aggregated them under “Others” in some charts, but I list a sample of them under some questions.
Each section below is titled after the question asked of participants and includes the key findings and visuals.
Where do you run your Kubernetes clusters?
This question allowed multiple answers with the ability to mention unlisted options.
Some interesting points:
- Even though it is commonly assumed that Kubernetes is best suited for the public cloud, it is worth mentioning that 36% of respondents run at least one production-grade Kubernetes cluster on-premise.
- ~62% of respondents run Kubernetes on more than one public cloud provider.
- ~40% of respondents who run Kubernetes on-premise have at least one cluster running on a public cloud.
- Here are some of the cloud providers survey participants use to host Kubernetes clusters: DigitalOcean, IBM Kubernetes Service, QingCloud, Alibaba Cloud, Hetzner Cloud, and Mail.ru Cloud Services.
How many nodes on average do you run per cluster?
This is a single selection question.
A few takeaway points:
- The majority of participants run relatively small clusters: 68% run clusters of 10 nodes or fewer.
- 37% of on-premise clusters are 10 nodes or fewer in size; the rest of the on-premise clusters are between 11 and 50 nodes.
- 65% of clusters running in the public cloud are 10 nodes or fewer, and ~15% of public cloud clusters are between 11 and 50 nodes.
How many containers do you have in a Kubernetes cluster?
This was a single selection question as well. Below is a visualization of the different answers.
Some insights:
- The density of containers per node is roughly constant at about 10 containers per node, regardless of cluster size or the total number of containers.
- ~3.5% of the surveyed sample was able to run roughly 20 containers per instance. Kudos to them :)
How do you currently manage the scale of your pods and cluster?
The goal of this question is to measure how popular the different scalability components are within Kubernetes clusters, and to understand how frequently engineers use external scalability tools. The question allowed more than one selection to capture how often the community uses more than one component at the same time. The visualization below includes stats about tools that are frequently used together.
A few key points:
- It was surprising that around 29.5% of respondents are not using any autoscaling tools. This could be because they run applications with static resource needs, or because they have provisioned enough resources to keep up with changes in workloads.
- Around 30% of those who don’t use autoscaling are running on-premise clusters. The remaining 70% are running Kubernetes on one or more public cloud infrastructures.
- When it comes to which autoscaling component to use, the Cluster Autoscaler (CA) is the most popular, followed by the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). A minimal HPA example is sketched after this list.
- CA and HPA are mostly used together, and respondents using this combination run their clusters on a public cloud.
- The 3.5% who use all the three auto-scalability components are running their clusters on public cloud infrastructure.
- HPA is the most popular autoscaling component inside on-premise Kubernetes clusters, used almost 30% of the time.
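For readers less familiar with these components: the HPA adds or removes pod replicas, while the CA adds or removes nodes. As a minimal sketch (not taken from any survey response), here is what an HPA manifest scaling a hypothetical `web` Deployment on CPU utilization can look like with the autoscaling/v2 API; the names and thresholds are illustrative:

```yaml
# Minimal HPA sketch: keep average CPU utilization near 70% by scaling
# a hypothetical "web" Deployment between 2 and 10 replicas.
# The target name and thresholds are illustrative, not survey data.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```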
Which of these metrics do you use to scale your pods and nodes?
This is the most interesting question, in my opinion. It says a lot about what different engineers see as the right triggers to scale their pods, containers, and eventually their nodes. Multiple selections were allowed to understand which metrics are used together.
Some Key points:
- As expected, CPU and memory are the most common metrics used to scale pods and clusters, at around 33%.
- CPU is also the most commonly used trigger when engineers scale their infrastructure on a single metric, around 15% of the time.
- Custom metrics were used 30% of the time alongside other metrics. Around 40% of participants who use custom metrics use them on on-premise Kubernetes. A sketch of a custom-metric HPA follows this list.
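As an illustration of what scaling on a custom metric can look like, here is a rough sketch of an HPA targeting a per-pod `http_requests_per_second` metric. This assumes a custom metrics adapter (such as the Prometheus adapter) is installed in the cluster; the metric name and target value are invented for the example:

```yaml
# Sketch: scaling on a custom per-pod metric instead of CPU/memory.
# Assumes an adapter exposes "http_requests_per_second" through the
# custom metrics API; the metric name and target are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```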
Which of these challenges do you face managing the scale of your cluster?
One of the key challenges is budgeting resources inside Kubernetes clusters, and it becomes trickier in a multi-tenant Kubernetes cluster. This question gauges how frequently teams face this issue; the next question covers how coordination is done.
A few key points:
- 43% of engineers running Kubernetes clusters on-premise believe that they don’t have challenges managing the scale of their clusters. Around 34% believe they over-provision, and 23% believe they hit performance issues due to under-provisioning their clusters.
- 34.5% of engineers running Kubernetes on top of public cloud infrastructure believe they don’t have challenges managing the scale of their clusters. 47% believe that they over-provision, and around 18.5% believe that they under-provision their clusters.
- Around 2% of respondents weren’t sure where they fall, either because other teams manage their clusters or because they are still in the early stages of trying out Kubernetes.
How do you coordinate the allocation of resources across services managed by different teams?
This question aims to understand whether there are any offline dynamics for managing resource allocation between different teams and services.
Some key points:
- In both on-premise and public cloud clusters, responses were split almost equally between close coordination between teams and each team running its own cluster.
- Some of the shared insights: “We use namespace isolation and set limits,” “We give each team their own isolated cluster,” and “It is part of our change control processes.” A sketch of the namespace-isolation approach follows below.
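To make the namespace-isolation answer concrete, here is a minimal sketch of a ResourceQuota that caps what a hypothetical `team-a` namespace can consume; the namespace name and the quota values are illustrative, not taken from any respondent:

```yaml
# Sketch: per-team resource budgeting via namespace isolation.
# The namespace ("team-a") and quota values are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```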
TL;DR
- The Kubernetes community needs more insight into autoscaling best practices inside Kubernetes. Understanding when teams need autoscaling and how they utilize the different components is a reasonable approach moving forward.
- There are interesting differences between how Kubernetes autoscaling is managed on-premise vs. in the public cloud, such as the use of HPA and CA, the allocation of resources, and the size of clusters.
Please let me know your thoughts and questions. I’d like to do this survey and share the results regularly. What questions do you have in mind? How frequently should we do this survey? Any other thoughts?