k8s with HA kube-apiserver and external etcd cluster¶
This page is about building a kubernetes cluster with a loadbalancer for the kube-apiserver and an external etcd cluster.
References to Official Documents¶
The HA setup (highly available kube-apiserver behind a loadbalancer) is described in the official kubernetes documentation here.
The external etcd topology is explained in the official kubernetes documentation here.
ToC¶
- preparing machines
- setting up kube-ready nodes
- building etcd cluster
- preparing loadbalancer for kube-apiserver on control plane nodes
- initializing kubernetes cluster
- demo
Preparing Machines¶
I prepared physical and virtual machines, all on AMD64 and with mixed OSes, just to practice composing an ansible playbook capable of handling different kinds of machines.
Baremetal Servers¶
As for the baremetal servers, it is just a matter of getting a minimal OS installer image onto a USB stick, plugging it into the machines, and following the installation wizard to install the OS with minimal system utilities and an ssh server. I won't go into details on this.
Virtual Servers¶
As for the virtual servers, I decided to build the main virtual machines on Proxmox, and I also did some testing on Hyper-V using my gaming desktop running Windows 11 Pro.
Building a virtual machine on the Proxmox web GUI is as straightforward as installing an OS on a baremetal machine. Here I leave notes on building virtual machines using templated cloud-init images on Proxmox.
Rocky 9 on Proxmox Using Cloud-init Image¶
Here is what's done below:
- download cloud-init image and verify it
- prepare the ssh public key to import into ~/.ssh/authorized_keys on the new machine
- create a virtual machine using the image and set basic parameters such as CPU, memory, architecture, and NIC
- convert the machine into a template
- clone the template to build the actual machine to run
- set cloud-init parameters such as IP address, username, and ssh public key to allow access
- start the cloned virtual machine (shown at the end of the command listing below)
# on proxmox hypervisor
# prepare the directory to place downloaded cloud-init image files
mkdir -p /opt/pve/dnld
# download and verify the file
cd /opt/pve/dnld
wget https://dl.rockylinux.org/pub/rocky/9/images/x86_64/Rocky-9-GenericCloud-LVM-9.5-20241118.0.x86_64.qcow2
wget https://dl.rockylinux.org/pub/rocky/9/images/x86_64/Rocky-9-GenericCloud-LVM-9.5-20241118.0.x86_64.qcow2.CHECKSUM
sha256sum -c Rocky-9-GenericCloud-LVM-9.5-20241118.0.x86_64.qcow2.CHECKSUM
qemu-img info Rocky-9-GenericCloud-LVM-9.5-20241118.0.x86_64.qcow2
# prepare ansible user ssh public key
mkdir /opt/pve/ssh_pub
cd /opt/pve/ssh_pub
# get the ssh public key from wherever and place it in the created directory
# or generate one
# environment variables
path_ciimg=/opt/pve/dnld/Rocky-9-GenericCloud-LVM-9.5-20241118.0.x86_64.qcow2
path_ssh_public_key=/opt/pve/ssh_pub/id_ed25519.pub
ciusername=username_here
# create a VM using the downloaded cloud-init image
qm create 9000 --memory 2048 --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single --cpu cputype=x86-64-v3 --ostype l26
qm set 9000 --scsi0 local-zfs:0,import-from=$path_ciimg
qm set 9000 --ide2 local-zfs:cloudinit
qm set 9000 --boot order=scsi0
# template it
qm template 9000
# check what's created
qm config 9000
qm showcmd 9000
# create a new VM using the template
qm clone 9000 1308 --name name_of_this_rocky9_vm
qm set 1308 --sshkey $path_ssh_public_key
qm set 1308 --ipconfig0 ip=192.168.111.30/24,gw=192.168.111.1
qm set 1308 --ciuser $ciusername
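Finally, start the cloned virtual machine and check that it is running (1308 is the example VMID used above):
# start the clone and confirm its status
qm start 1308
qm status 1308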
While many of the steps are identical to what's described in the official document, here are some additional notes on minor deviations:
- storage is local-zfs instead of local, just because I prepared a different storage named local-zfs
- cputype is x86-64-v3, and I referred to these links to choose one for my machine running Proxmox
- resizing disk
- for most of the OSes I used, I resized the disk of the downloaded qcow2 image itself, and the final cloned virtual machine had all the disk space automatically recognized and mounted
# qemu-img resize debian-12-generic-amd64.qcow2 64G
- for some, I resized the disk using the qm command like the following, in which case I had to do the partitioning, pvcreate, vgextend, lvextend, xfs_growfs, and the like
qm disk resize 1312 scsi0 128G
- obviously, changing the disk size on the original cloud-init image is easier
Setting Up Kube-Ready Nodes Using Ansible¶
Here is the list of machines I prepared.
One of the goals of rebuilding my homelab is to be able to compose whatever ansible automation I want, and here I am with a number of machines running mixed OSes. I ran ansible-playbook to make whatever changes were required to make these nodes kubernetes-ready, along with other machines not listed here whose roles are to run services using docker or to serve as jump hosts.
| hostname | role | os | cpu | memory | disk (+additional disk) | machine |
| --- | --- | --- | --- | --- | --- | --- |
| cp1 | k8s control plane | debian | 4 | 8GB | 128GB ssd | baremetal |
| cp2 | k8s control plane | rocky | 4 | 6GB | 128GB | vm on proxmox |
| worker1 | k8s worker node | debian | 4 | 16GB | 128GB ssd +6TB | baremetal |
| worker2 | k8s worker node | rhel | 4 | 6GB | 128GB +200GB | vm on proxmox |
| worker3 | k8s worker node | debian | 2 | 8GB | 128GB ssd +500GB | baremetal |
| worker4 | k8s worker node | debian | 16 | 16GB | 512GB ssd | baremetal |
| worker5 | k8s worker node | ubuntu | 4 | 6GB | 128GB +200GB | vm on proxmox |
| etcd1 | etcd node | debian | 4 | 8GB | 64GB ssd | baremetal |
| etcd2 | etcd node | debian | 2 | 4GB | 64GB | vm on proxmox |
| etcd3 | etcd node | oracle | 2 | 4GB | 64GB | vm on proxmox |
And here is the list of tasks executed using Ansible on all of the nodes listed above. #TODO Links to the repository will be added for each task once I make them available on a public git repository.
- bootstrap
- to create the user account for the ansible master and enable password-less sudo
- skipped for machines created using cloud-init images, as the ssh key and username I set during the build were already those of the ansible master
- posture check
- sshd hardening
- ordinary package upgrades
- install certain packages, such as nfs, which I wanted in common across the different machine roles in my homelab
- configuration changes and package installation (a shell sketch of these steps follows this list)
- disable swap
- enable ipv4 forwarding
- install and setup containerd, runc, and cni
- disable selinux
- disable firewalld
- install kubernetes packages such as kubeadm, kubelet, and kubectl
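For illustration, here is roughly what those node-preparation tasks boil down to when done by hand on a single node; the sysctl file name is my own choice, not something taken from the playbook.
# disable swap now and keep it disabled across reboots
swapoff -a
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab
# enable ipv4 forwarding required by kubernetes networking
cat <<EOF > /etc/sysctl.d/99-kubernetes.conf
net.ipv4.ip_forward = 1
EOF
sysctl --system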
Building Etcd Cluster¶
I have followed what's described in the official document on setting up the etcd cluster.
Here is what's done in brief:
- configure kubelet through systemd to have kubelet look for static pod manifest files and run them
- prepare a kubeadm config file with the etcd node details for each etcd node that will form the etcd cluster
- in my case, three different kubeadm config files for etcd1, etcd2, and etcd3
- generate the CA certificate and key for etcd using the kubeadm command
- generate certificates and keys for the different communications among etcd and kubernetes components on each etcd node
- generate the static pod manifest that runs etcd on each etcd node
- kubelet now picks up the generated manifest and spins up the etcd service on each node
- install the etcdctl command
- run etcdctl to verify that etcd is working fine on each etcd node
TODO: Here is the link (to be added) to the repo containing ansible tasks
One difference between the steps described in the official document (linked at the beginning of this section) and the ansible tasks is that in the former a single etcd node prepares all the certificates, keys, and configuration files and then distributes them to the other etcd nodes, while in the latter the files are generated on each etcd node that actually uses them. The exception, of course, is the etcd CA, which is generated at the beginning on one etcd node and then shared with the other etcd nodes so that all of them work under the common CA.
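For reference, here is a rough command-level sketch of the per-node steps from the official guide; /root/kubeadmcfg.yaml stands in for the per-node kubeadm config file mentioned above, and my ansible tasks run the equivalents.
# on one etcd node only: generate the etcd CA, then share ca.crt and ca.key with the other etcd nodes
kubeadm init phase certs etcd-ca
# on each etcd node, against its own kubeadm config file
kubeadm init phase certs etcd-server --config=/root/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/root/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/root/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/root/kubeadmcfg.yaml
# writes /etc/kubernetes/manifests/etcd.yaml, which kubelet picks up as a static pod
kubeadm init phase etcd local --config=/root/kubeadmcfg.yaml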
Verification Output of Etcd¶
PLACEHOLDER
TODO: place example output of ss -tlpn, crictl ps, and etcdctl endpoint health
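Until the output is added, here are the verification commands as I would run them on an etcd node, with the etcdctl flags assuming the default kubeadm certificate paths (outputs omitted).
# the etcd client port should be listening
ss -tlpn | grep 2379
# the etcd container should show up as Running
crictl ps
# health check against the local member
ETCDCTL_API=3 etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --endpoints https://127.0.0.1:2379 \
  endpoint health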
Preparing Loadbalancer for Kube-apiserver on Control Plane Nodes¶
I have two control plane nodes prepared in my homelab environment, and this section covers the preparation steps to set up a highly available control plane (kube-apiserver) using haproxy and keepalived running as static pods on the two control plane nodes.
The official kubernetes documentation on setting up a cluster with an external etcd cluster is available here. The documentation on a highly available control plane is available here, and the actual details on how to set up the loadbalancer are available here.
Here are the steps in brief:
- copy the etcd CA certificate to the control plane nodes
- copy the client certificate and key that kube-apiserver uses to communicate with the etcd cluster to the control plane nodes
- prepare a kubeadm config file on one control plane node
- configured to use the external etcd cluster using the certificates and key files prepared in the previous steps
- the control plane endpoint configured to point to the VIP and port to be served by the loadbalancer prepared in the following steps
- prepare a keepalived configuration file, the check script keepalived uses to monitor the health of the kube-apiserver (control plane service; a minimal sketch follows this list), and an haproxy configuration file
- prepare static pod manifest files for keepalived and haproxy
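For reference, a minimal sketch of that check script, loosely following the kubeadm HA guide; port 8443 is an assumption matching the loadbalancer port referenced later in the Cilium values.
#!/bin/sh
# /etc/keepalived/check_apiserver.sh (sketch): exit non-zero when the local endpoint is unhealthy
errorExit() {
  echo "*** $*" 1>&2
  exit 1
}
# 8443 is the assumed haproxy frontend port for the kube-apiserver
curl -sfk --max-time 2 https://localhost:8443/healthz -o /dev/null \
  || errorExit "Error GET https://localhost:8443/healthz"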
Initializing Kubernetes Cluster¶
Run kubeadm init --config {kubeadm config file} on a control plane node to set up the kubernetes cluster.
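For reference, a minimal sketch of such a kubeadm config, assuming the v1beta3 kubeadm API, the loadbalancer VIP on port 8443, and the default kubeadm certificate paths; the addresses are illustrative, not my actual ones.
# kubeadm config pointing at the VIP and the external etcd cluster (sketch)
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "192.168.111.100:8443"
etcd:
  external:
    endpoints:
      - https://192.168.111.41:2379
      - https://192.168.111.42:2379
      - https://192.168.111.43:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF
kubeadm init --config kubeadm-config.yaml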
The kubeadm init run generates the certificate and key files used by the control plane node. These files are copied to the second control plane node as part of the ansible playbook tasks, and then kubeadm join is executed on the second control plane node to join it to the newly created kubernetes cluster.
A similar kubeadm join command is executed on the rest of the nodes, the worker nodes, to join them to the cluster.
Now I have a new kubernetes cluster composed of two control plane nodes with a loadbalancer running for kube-apiserver access (for example, whatever you run with kubectl goes through it), worker nodes, and an external etcd cluster of three etcd nodes running independently from the kubernetes cluster members.
Demo¶
I'll continue a little further with what I do after initializing the kubernetes cluster.
Managing Cluster from Other Machines¶
I could log on to the control plane nodes to manage the cluster, but I would rather do so without logging onto the control plane.
I copy the /etc/kubernetes/admin.conf file to ~/.kube/config on the same machine where I run ansible playbooks. This way I can execute bulk, scripted ansible changes and also manage the kubernetes cluster from the same machine.
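A minimal sketch of that copy, assuming the password-less sudo set up by the bootstrap tasks; cp1 is one of the control plane nodes from the table above.
# pull the admin kubeconfig from a control plane node to the management machine
mkdir -p ~/.kube
ssh cp1 'sudo cat /etc/kubernetes/admin.conf' > ~/.kube/config
chmod 600 ~/.kube/config
# confirm access from this machine
kubectl get nodes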
By the way, if you have multiple clusters with respective /etc/kubernetes/admin.conf files to use, I would prepare aliases like the ones below in ~/.bashrc, ~/.bash_aliases, or any file that works, to switch between the clusters to work on.
# kubectl unset (default, ~/.kube/config)
alias kcu="export KUBECONFIG="
# set to manage different k8s clusters
alias kchv="export KUBECONFIG=~/.kube/hv-config"
alias kchlv3="export KUBECONFIG=~/.kube/hlv3-config"
alias kclab="export KUBECONFIG=~/.kube/lab-config"
Installing Network Addon¶
The first component to install is the network addon. There are a number of options, as listed here.
This time I am trying out Cilium, and I'm installing it using helm.
- add helm repo
- run helm search and identify the version
- download the values file of the chart locally
- analyze and edit the values file as necessary
- install the helm chart
- specify the modified values file to use, if any changes were made, with --values {modified_values_file.yaml}
# helm installation: https://helm.sh/docs/intro/install/
# add cilium repository
helm repo add cilium https://helm.cilium.io/
# confirm the latest version
helm search repo cilium
# helm search repo cilium -l # to see the list of older versions available
# download the values file
helm show values cilium/cilium --version 1.16.6 > cilium-values.yaml
# edit the values file as necessary
# install
helm install cilium cilium/cilium --version 1.16.6 --namespace kube-system --values cilium-values.yaml
# monitor cilium pods and wait until all are up
kubectl get pods -n kube-system
# confirmation using helm commands
helm status cilium -n kube-system
helm get values cilium -n kube-system
# upgrade the existing release
# with revised values file when there's any change
helm upgrade cilium cilium/cilium -f cilium-values.yaml -n kube-system
# rollout restart for cilium daemonsets and deployments
# when necessary after helm release upgrade
kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout restart ds/cilium
# uninstall the helm release "cilium"
helm uninstall cilium -n kube-system
Cilium Values File¶
Here is the list of changes made to the cilium chart 1.16.6 values (an equivalent --set form is sketched after the list):
- k8sServiceHost: {kube_endpoint_ip_or_fqdn}
- k8sServicePort: "8443"
- kubeProxyReplacement: "true"
- kubeProxyReplacementHealthzBindAddr: "0.0.0.0:10256"
- l2announcements.enabled: true
- externalIPs.enabled: true
- hubble.relay.enabled: true
- hubble.ui.enabled: true
- gatewayAPI.enabled: true
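The same settings could also be passed inline with --set-string/--set instead of editing the values file; a sketch, reusing the placeholder endpoint host from the list above.
# equivalent inline form of the values-file changes listed above
helm upgrade --install cilium cilium/cilium --version 1.16.6 -n kube-system \
  --set-string k8sServiceHost={kube_endpoint_ip_or_fqdn} \
  --set-string k8sServicePort=8443 \
  --set-string kubeProxyReplacement=true \
  --set-string kubeProxyReplacementHealthzBindAddr=0.0.0.0:10256 \
  --set l2announcements.enabled=true \
  --set externalIPs.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set gatewayAPI.enabled=true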
Update CoreDNS Configuration¶
As the next step, I'd like to make changes to the CoreDNS configuration. The two changes I want to make are:
- to point the root forwarder to the two DNS servers running in my homelab outside the kubernetes cluster
- to add the cluster DNS domain name to the search suffix list
The changes can be made by preparing a modified configmap manifest for coredns and applying it.
# retrieve the configmap
kubectl get configmap coredns -n kube-system -o yaml > configmap-coredns-original.yaml
cp configmap-coredns-original.yaml configmap-coredns-custom-domain.yaml
# edit the configmap and then apply it
kubectl replace -f configmap-coredns-custom-domain.yaml
# restart coredns deployment to recreate coredns pods with the updated configmap
kubectl -n kube-system rollout restart deployment coredns
Here is the jinja2 template of the coredns configmap, with the kube_cluster_domain variable adding the domain suffix and the nameservers[0] and nameservers[1] variables setting the two forwarders. I actually ran an ansible playbook to make the above changes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes {{ kube_cluster_domain }} cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . {{ nameservers[0] }} {{ nameservers[1] }} {
           health_check 5s
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
Pod Security Administration¶
Applying pod security standards to the cluster is one of the items listed in the best-practices section of the documentation that follows the kubeadm setup.
https://kubernetes.io/docs/setup/best-practices/enforcing-pod-security-standards/
Let me just apply the non-enforcing baseline policy as a starter.
kubectl label --overwrite ns --all \
pod-security.kubernetes.io/audit=baseline \
pod-security.kubernetes.io/warn=baseline
Test Namespace and Pods¶
The kubernetes cluster becomes functional after installing the network addon, and with the necessary CoreDNS changes made, I am ready to test running something in the cluster.
- create test namespace
- create test deployment
- get inside the pods spun up and see how they look
# create test namespace
kubectl create namespace test
# add dnsutils pod in the test namespace
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: test
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    command:
    - sleep
    - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
EOF
# view /etc/resolv.conf file
kubectl exec -t pod/dnsutils -n test -- cat /etc/resolv.conf
# internal DNS records using suffix list
kubectl exec -t pod/dnsutils -n test -- dig +search cp1
kubectl exec -t pod/dnsutils -n test -- dig +search worker1
# external records
kubectl exec -t pod/dnsutils -n test -- dig google.com.
kubectl exec -t pod/dnsutils -n test -- dig cloudflare.com.
# clean up
kubectl delete namespace test
- more tests using cilium (a manifest sketch of the L2 announcement pieces follows these notes), https://sreake.com/blog/learn-about-cilium-l2-announcement/
- [x] done in cilium_demo, 2025-01-24
- [x] update qps and burst value to 8 and 16 respectively and upgrade cilium hr
- [x] deploy web server
- [x] create svc with loadbalancer
- [x] add ipam
- [x] add l2advertisement
- gateway examples, https://docs.cilium.io/en/stable/network/servicemesh/gateway-api/http/
- [x] done in cilium_demo, 2025-01-24
- TODO: cilium gateway and cert-manager, https://docs.cilium.io/en/stable/network/servicemesh/gateway-api/https/
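For reference, a sketch of the L2 announcement pieces exercised in those tests, roughly following the cilium documentation; the resource names, pool CIDR, and interface regex are placeholders, not my actual homelab values.
# ip pool for LoadBalancer services plus an L2 announcement policy (sketch)
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: homelab-pool
spec:
  blocks:
  - cidr: 192.168.111.224/27
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: homelab-l2
spec:
  interfaces:
  - "^eth[0-9]+"
  externalIPs: true
  loadBalancerIPs: true
EOF
# a Service of type LoadBalancer should now get an IP from the pool, announced over L2 by cilium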