Kubernetes: bare metal etcd quorum

By geraint

At Imagicloud we currently manage seven kubernetes clusters: two run on Amazon Web Services EC2, one uses Amazon’s Kubernetes Service, two are bare metal and two use Google’s Kubernetes Engine. We haven’t dipped our toes into Azure quite yet, but we will certainly be doing so at some point in the near future.

For the bare metal clusters we manage, one of which is our own, running three masters seems a little overkill, so we’re left with a decision about etcd quorum when we only want to run two masters. We could deploy a third master, or we could find a different solution, such as a raspberry pi or another device with low power consumption.

The primary motivation for writing this article is that I’ve just googled and found there isn’t much information on how to do this available, and so I thought I’d just start writing the article that I was looking for and hope that I can figure out how to do it whilst writing… let’s see how this goes…

Give our Facebook Page or LinkedIn a follow if you’d like to see more articles from us, and if you’d like a hand with your own kubernetes requirements feel free to give us a call, we’re here to help.

The Garden

We’ll be working on a cluster we call the garden. The garden is a bare metal cluster gifted to us by one of our clients, whose workloads had moved to the cloud after a little help from us to bring the platform up to well-architected standards. We’ll have an article on this in future, as it certainly wasn’t your average cloud deployment!

The garden has two masters, the potato and beetroot, and five nodes – onion, cabbage, carrot, cucumber and sprout. There’s also Bill and Ben, the Intel NUCs, a couple of pfSense network appliance servers and an unnamed HP server which we’ve never powered up.

We’re running kubernetes version 1.23 with containerd for our CRI, metallb for layer2 networking services, nginx for ingress and pfSense appliance servers to provide firewalling, apiserver failover and so forth.

There are five nodes available for the cluster, however only two are powered up under normal circumstances. The other nodes can be turned on remotely should additional capacity be required, and we also use these spare nodes to conduct application performance testing in a predictable and consistent environment.

The Problem

Currently, if either of our masters (the potato or the beetroot) goes down, the kube-apiserver becomes unreachable and the cluster becomes a little (a lot) useless. This is because etcd needs a majority of its members to be available to establish quorum. With two members the majority is two, so losing either one loses quorum, and without quorum the apiserver cannot function. An odd number of members is recommended, and three is the smallest cluster that can survive the loss of one member.
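
The arithmetic behind this is worth a quick look. etcd needs a majority of its members to agree, so the quorum for N members is floor(N/2) + 1. Here’s a minimal shell sketch that prints the quorum size and the number of failures each cluster size can tolerate; the function names are just ours for illustration.

```shell
# Votes needed for an etcd cluster of N members to keep working: floor(N/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Members that can be lost while the remaining ones still form a quorum.
tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print one line per cluster size from one to five members.
for members in 1 2 3 4 5; do
    echo "members=$members quorum=$(quorum "$members") tolerated_failures=$(tolerance "$members")"
done
```

Two members tolerate no failures at all, which is exactly the situation we’re in, and four members tolerate no more failures than three do.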

To resolve this, we need to add a third etcd member to the cluster. It doesn’t make sense to run this additional process on one of the existing masters: that master would then hold two of the three members, so losing it would still break quorum, and the cluster would only survive the loss of the other master. We need a third member of the etcd cluster on a separate device, either on site or remote.

Our considerations are primarily related to cost. It doesn’t really matter if this box dies: our monitoring will tell us if this has happened and we can simply replace it. Should one of the masters also fail during this period, we have automation in place which will move our entire workload to AWS within 3-4 minutes.

Choosing a device

We have a few options for what we do next: we could opt for a physical device in our rack, or we could use a virtual machine or container running in the cloud.

Given we already have a mountain of spare equipment available to us, we’re going to opt for a physical device. The cost of running this in the cloud will never be less than on bare metal in this circumstance, as it doesn’t need to scale and it is always on. Additionally, our cluster runs workloads which continue even when there is no Internet connection available, so a remote process wouldn’t be ideal.

That said, here are some of our options:

  • Spare HP Server
  • Old PCs
  • Raspberry Pis
  • Intel NUCs

The spare server seems a little overkill and a waste of electricity, so we’ll rule that one out immediately.

Next up we have an old PC. This would take up quite a bit of space in our rack, so realistically we’re best off going for the raspberry pi or the NUC.

Whilst a raspberry pi would do the job just fine, we’re going to use an Intel NUC so that we have a little more power available to us and perhaps we can also use it for something else in future, too.

Deployment Method

Now then, how we deploy this is up for grabs – the end result is that we need three etcd processes to be running on independent devices, but there are many ways in which we could achieve this.

Our options include making this device into a kubernetes node running etcd and other essential components only, deploying the container manually onto this device and configuring it, or using ansible or similar to install etcd directly onto the base operating system.

Many options, and at the time of writing I have absolutely no clue what the best thing to do is. It feels like making it a stripped back kubernetes node would be the most sensible, as we can benefit from the various DaemonSets we have running for monitoring, including the node exporter and loki scraper, thus reducing our efforts in other places.

Deploying a Stripped back Master

After very little consideration, I’ve decided I’m going to deploy a stripped back master node, only running whatever is essential plus etcd and monitoring components… I must admit I don’t actually know what the essential components are, but I do know I’ll be needing containerd and kubelet…

For this, I’m going to be using Rocky Linux 8. There are plenty of guides out there for installing the essentials so I won’t cover this here, but we might in future so that we have complete end-to-end guides you can follow.

I’m going to be installing:

  • containerd
  • kubelet-1.23

Once the essentials are installed and the necessary sysctl and firewalling options are set, we’re going to need to join this node to the cluster.
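
For reference, here’s a rough sketch of what those options might look like. The sysctl values are the ones the kubeadm container runtime prerequisites call for (they also need the overlay and br_netfilter kernel modules loaded). The port list is an assumption for a node that runs only the kubelet and etcd, so your CNI and metallb may need more. The function names are made up for this example, and the functions only print text so nothing is changed until you choose to apply it.

```shell
# Print the sysctl settings kubernetes networking expects on a node.
k8s_sysctl_conf() {
    cat <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
}

# Print the firewalld commands for the ports this node is assumed to need:
# 10250/tcp for the kubelet API, 2379-2380/tcp for etcd client and peer traffic.
k8s_firewall_commands() {
    for port in 10250/tcp 2379-2380/tcp; do
        echo "firewall-cmd --permanent --add-port=$port"
    done
    echo "firewall-cmd --reload"
}

k8s_sysctl_conf
k8s_firewall_commands
```

To apply the sysctl part you could run k8s_sysctl_conf | sudo tee /etc/sysctl.d/k8s.conf followed by sudo sysctl --system, and run the printed firewall-cmd lines as root.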

Join the new node

First up we’ll be needing our join command for the cluster, which we can get from kubeadm on one of our existing masters.

				
					kubeadm token create --print-join-command
				
			

Now we’ll execute the output of the above command on our newly prepared Rocky Linux 8 box, which is to be our third etcd node (kubeadm needs to be installed there for this to work).

				
					kubeadm join 10.0.0.1:6443 --token xxx --discovery-token-ca-cert-hash sha256:yyy 
				
			

Let’s see what kubectl shows us once the node has joined.

				
					kubectl get nodes
				
			

The node is showing as ready, and will be running any daemonsets and essential kubernetes services. Let’s have a look at what these are.

				
					kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ben
				
			

As we can see, there are some components running that we probably don’t need.

We’re also going to need to taint the node and then add some tolerations for the services we do want to run on this node.

Tainting the node

				
					kubectl get nodes
				
			

Great, now our node is tainted we can be sure no workloads are going to start running here.
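
As a sketch of what the taint and a matching toleration could look like: the taint key and value below are invented for this example, as are the function names, so substitute your own. The NoSchedule effect stops new pods without a matching toleration from landing on the node, but it does not evict pods that are already running there.

```shell
# Hypothetical taint; the key and value are made up for this sketch.
QUORUM_TAINT_KEY="example.com/quorum-only"
QUORUM_TAINT_VALUE="true"

# Stop ordinary pods from being scheduled on the quorum node.
taint_quorum_node() {
    kubectl taint nodes "$1" "${QUORUM_TAINT_KEY}=${QUORUM_TAINT_VALUE}:NoSchedule"
}

# Let one DaemonSet tolerate the taint. A JSON merge patch replaces the
# tolerations list as a whole, so any existing tolerations must be repeated.
tolerate_quorum_taint() {
    namespace=$1
    daemonset=$2
    kubectl -n "$namespace" patch daemonset "$daemonset" --type=merge -p \
        "{\"spec\":{\"template\":{\"spec\":{\"tolerations\":[{\"key\":\"${QUORUM_TAINT_KEY}\",\"operator\":\"Equal\",\"value\":\"${QUORUM_TAINT_VALUE}\",\"effect\":\"NoSchedule\"}]}}}}"
}
```

With the node named ben, this would be used as taint_quorum_node ben, then tolerate_quorum_taint monitoring node-exporter for each DaemonSet we want to keep (the namespace and DaemonSet name here are placeholders).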

Allow etcd to run on the new node

Next we’ll need to get etcd to launch on this node and become part of the etcd cluster. We only want this member to help with establishing quorum, although every voting etcd member still keeps a full copy of the data, so there is no lighter quorum-only mode.

				
					kubectl get nodes
				
			

Let’s check and see if etcd has come up…

				
					kubectl get nodes
				
			

Testing

Looking good… we don’t have any critical workloads running at the moment and losing the apiserver for a few minutes while we test is no issue, but hopefully that won’t happen now we have three members of our etcd cluster!
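
One way to confirm that all three members are really there is to ask etcd itself. This is a sketch that assumes a kubeadm-style setup, where etcd runs as a static pod named etcd-<node name> in the kube-system namespace on each master and the certificates live in kubeadm’s default location. The function names are made up for this example.

```shell
# Run etcdctl inside the etcd static pod of one master, using the
# certificates kubeadm places under /etc/kubernetes/pki/etcd.
etcdctl_on() {
    master=$1
    shift
    kubectl -n kube-system exec "etcd-${master}" -- etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        "$@"
}

# List the members, then check that each member's endpoint is healthy.
etcd_status() {
    etcdctl_on "$1" member list -w table
    etcdctl_on "$1" endpoint health --cluster
}
```

Running etcd_status potato should list three members if the new node has joined the etcd cluster, assuming the master’s node name really is potato.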

Let’s shut down the beetroot, one of our master nodes which runs a copy of the apiserver… what we’re hoping to see is that the potato continues to service apiserver requests (which are routed from pfsense via haproxy with health checks) and the cluster can continue to operate normally despite the absence of one of the master nodes.

				
					kubectl get nodes
				
			

Also looking good, the apiserver is still servicing requests even though there is only one master node online.
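
While running tests like this, it can help to leave a probe running in another terminal so that any gap in service is visible. Here’s a rough sketch, assuming the apiserver’s /readyz endpoint can be reached without credentials (the kubeadm default) and skipping certificate verification with -k for brevity. The function names are ours.

```shell
# Print the HTTP status code of the apiserver readiness endpoint,
# or 000 if no connection could be made within two seconds.
probe_apiserver() {
    curl -k -s -o /dev/null --max-time 2 -w '%{http_code}' "$1/readyz"
}

# Print one timestamped status line per second until interrupted.
watch_apiserver() {
    while true; do
        echo "$(date '+%H:%M:%S') $(probe_apiserver "$1")"
        sleep 1
    done
}
```

Using the address from our join command, this would be started with watch_apiserver https://10.0.0.1:6443 and stopped with Ctrl-C. A run of 200s means the failover went unnoticed, anything else marks the gap.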

If at this point we lose either the remaining master or our new quorum node, the apiserver will fail.

Next, for good measure let’s bring the beetroot back in and remove the potato just to double check the beetroot can successfully continue to service apiserver requests and keep the cluster running in the absence of the potato.

				
					kubectl get nodes
				
			

This is all going far too smoothly.

I suppose we’d best check that when the etcd node is absent but both masters are present, the cluster continues to function as normal and either the potato or beetroot can continue to service apiserver requests.

I won’t paste the output, but it worked. The last check I’d like to do is to ensure the cluster can come back up if all three members go offline, which would be the case when we lose electricity and our workloads automatically move to AWS. For this check I’m going to be fairly brutal and issue a shutdown -h now to the beetroot, potato and ben, our new etcd node.

Once they had all shut down I checked that they were indeed off and that the apiserver was not responding. Next, turn them all back on and wait and see what happens… hopefully within a few minutes our apiserver will return and the cluster will come back to a healthy state without any unexpected problems.
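
Rather than staring at a terminal, the waiting can be scripted. This is a sketch only: it polls the apiserver until it answers, then waits for the nodes we name to report Ready. The function name, the attempt count and the five minute timeout are arbitrary choices for this example.

```shell
# Poll the apiserver until it reports ready, then wait for the named nodes.
# $1: maximum number of attempts, $2: seconds between attempts,
# remaining arguments: names of the nodes that must become Ready.
wait_for_cluster() {
    attempts=$1
    interval=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if kubectl get --raw=/readyz >/dev/null 2>&1; then
            for node in "$@"; do
                kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
            done
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    echo "apiserver did not become ready in time" >&2
    return 1
}
```

For our three machines this would be wait_for_cluster 60 10 potato beetroot ben, which gives the apiserver up to ten minutes to come back. We name the nodes explicitly because the spare nodes that are normally powered off would never report Ready.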

It’s always important to check, even if you’re absolutely sure you know what the outcome is going to be. I was very confident that everything would work, as we’ve been running a 2 node etcd cluster for a few months now and have suffered power failures during this time with everything returning to normal upon rebooting. That said, I’ve also been supporting infrastructures for a long time, often being woken up in the middle of the night to help solve a problem that’s been ongoing for hours. Simple checks like these being missed during the initial installation are often to blame, and by that point the problem is far more difficult to resolve.

That's it!

Our problem is now solved: we can be sure our cluster will continue to operate normally even when we’re working on our master nodes. This work has taken us from a basic level of redundancy to a good level of redundancy.

If you’d like to see more articles from us, give our LinkedIn or Facebook Page a follow and if you need any consultancy or hands on work, get in touch – we’re here to help. For more information on how we work with our clients, check out this article.