What We Learned from Running Fully Containerized Services on Kubernetes: Part I

High-Level Goals and Architecture

Our service provides resource management and recommendations using sophisticated AI. Our AI pipeline consists of time-series prediction, scalability decision analysis, optimization, and a feedback loop that learns from these decisions. These are mostly offline systems, but we also have real-time systems that interact with our customers' clusters via installed pods. Our systems must be highly available with very low latency so they can respond quickly to scalability needs. We also provide a one-stop service where users can manage many distributed clusters across different geographies and cloud providers.

  • A resilient Magalix agent. The Magalix agent is a pod that listens to important events and metrics, so it must always stay connected and tolerate network failures.
  • Geo-replicated global entry points. Our entry points should be replicated across different providers to guarantee the highest possible availability to our users and internal dependent systems.
  • Super efficient AI pipeline. Our AI services have different capacity and availability requirements and need compute in bursts. We must be super efficient to make AI-based decision making affordable to our customers.
  • No cloud-vendor-specific service dependencies. We should be able to extend our infra easily to any other cloud provider with no major architectural or dependency changes. That's one of the advantages that Kubernetes provides anyway :)
High-level backend Kubernetes cluster architecture

Part 1 — Resources Management

We wanted to be super efficient from the get-go, so we allocated nodes through a budgeting process to avoid idle or underutilized instances. That decision came as a big shock to most team members. After we applied restricted resource allocation, we started to see OOM-terminated containers, inaccessible (now smaller) nodes whose CPU had been hijacked by containers, unexpected pod evictions, failed deployments, and, of course, frustrated customers and team members.
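To make the budgeting idea concrete, here is a minimal sketch of the kind of check it implies: sum the requests of the containers placed on a node, compare them with the node's budget, and list containers that have no limit set (the ones able to hijack a node's CPU or memory). The `Container` class, the `check_node_budget` function, and all numbers are hypothetical illustrations, not our actual tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    name: str
    cpu_request_m: int                  # CPU request in millicores
    mem_request_mi: int                 # memory request in MiB
    cpu_limit_m: Optional[int] = None   # None means no CPU limit is set
    mem_limit_mi: Optional[int] = None  # None means no memory limit is set


def check_node_budget(containers, cpu_budget_m, mem_budget_mi):
    """Compare summed requests with a node's budget and list unbounded containers."""
    cpu_requested = sum(c.cpu_request_m for c in containers)
    mem_requested = sum(c.mem_request_mi for c in containers)
    return {
        # Negative headroom means the requests exceed the budget.
        "cpu_headroom_m": cpu_budget_m - cpu_requested,
        "mem_headroom_mi": mem_budget_mi - mem_requested,
        "no_cpu_limit": [c.name for c in containers if c.cpu_limit_m is None],
        "no_mem_limit": [c.name for c in containers if c.mem_limit_mi is None],
    }


# Hypothetical workload on a node budgeted at 2000m CPU and 4096 MiB of memory.
containers = [
    Container("api", 500, 1024, cpu_limit_m=1000, mem_limit_mi=2048),
    Container("worker", 1000, 2048, mem_limit_mi=2048),
    Container("agent", 250, 512),
]
report = check_node_budget(containers, cpu_budget_m=2000, mem_budget_mi=4096)
print(report)
```

In this made-up example the node still has headroom on paper, yet two containers have no CPU limit, which is the situation where a small node can become unresponsive.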

A sample of how we were budgeting our memory and CPU inside our Kubernetes clusters
A sample of how we're budgeting and scaling services between different Kubernetes environments
CPU and memory predictions. The blue line is the actual consumption. The orange line is the AI prediction, made 4 hours ahead.
A sample of generated recommendations/decisions to scale CPU and memory
Sample decision analysis to scale down one of our containers. Notice that auto-pilot is off; the decision will not execute until auto-pilot is enabled.
  • Kubernetes is all about resource scheduling and management. However, teams still need to closely coordinate the budgeting of their cluster resources.
  • Have a concrete plan to budget your resources. Start by understanding how much CPU and memory is requested, how much is actually used, and what limits are set. This will tell you whether your team needs to improve how it manages capacity and resources.
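
As a rough illustration of that last takeaway, the sketch below compares requested, used, and limit values per container and labels each one. The function names, the 0.5 and 0.9 thresholds, and the sample figures are all assumptions made for the example; real numbers would come from your own metrics source.

```python
def utilization(requested, used, limit):
    """Return usage as a fraction of the request and of the limit."""
    return {
        "used_vs_request": used / requested,
        "used_vs_limit": used / limit,
    }


def classify(requested, used, limit, low=0.5, high=0.9):
    """Label a container; the low/high thresholds are illustrative assumptions."""
    ratios = utilization(requested, used, limit)
    if ratios["used_vs_limit"] >= high:
        return "at risk"           # usage is close to the limit
    if ratios["used_vs_request"] < low:
        return "over-provisioned"  # reserves far more than it uses
    return "ok"


# Hypothetical memory figures in MiB: (requested, used, limit).
samples = {
    "api": (1024, 300, 2048),
    "worker": (1024, 1900, 2048),
    "agent": (256, 200, 512),
}
labels = {name: classify(*values) for name, values in samples.items()}
print(labels)
```

Containers labelled "over-provisioned" are candidates for smaller requests, while "at risk" ones are likely to be OOM terminated (memory) or throttled (CPU) unless their limits are raised.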

Mohamed Ahmed

Magalix Co-Founder, dad, and learner @MohamedFAhmed