Mastering Automated Node Restarts in AKS: Harnessing the Power of Kured Daemonset
How to Automate AKS Node Restarts When Necessary?
Restarts are needed in particular after Azure automatically installs OS security patches or kernel updates. As a reminder, the nodes check daily for security patches via unattended-upgrade.
Thankfully, the Kured daemonset exists to assist with these operations.
Website: https://kured.dev/
GitHub Project: https://github.com/kubereboot/kured
CNCF sandbox: https://www.cncf.io/projects/kured/
What does Kured do? Quote from https://kured.dev/docs/
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
Watches for the presence of a reboot sentinel file e.g. /var/run/reboot-required or the successful run of a sentinel command.
Utilises a lock in the API server to ensure only one node reboots at a time
Optionally defers reboots in the presence of active Prometheus alerts or selected pods
Cordons & drains worker nodes before reboot, uncordoning them after
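To make the sentinel idea concrete, here is a minimal shell sketch of the file check. The helper name `reboot_required` is hypothetical, and this only illustrates the principle; it is not Kured's own implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the reboot sentinel file exists.
# The default path is the one quoted above; pass another path to override it.
reboot_required() {
  local sentinel="${1:-/var/run/reboot-required}"
  [ -f "$sentinel" ]
}

if reboot_required; then
  echo "reboot required"
else
  echo "no reboot required"
fi
```

Run on a node, this prints whether the package manager has flagged a pending reboot.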
How does it work?
Kured does not rely on a configmap or secret for configuration. Scheduling parameters are passed to the daemonset as command-line arguments at deployment time, here via Helm.
A schedule always applies: if nothing is specified, the default "any" config is used --> "Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
Note that Kured does not support advanced, cron-like scheduling. For instance, it is not possible to specify a maintenance window such as the xth Monday or Tuesday of months x-y-z.
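The schedule semantics described above (a list of days plus a start and an end time) can be illustrated with a small shell sketch. The function name `in_reboot_window` and its argument order are hypothetical, the boundaries are assumed to be inclusive, and windows that wrap past midnight are not handled. It is an illustration, not Kured's implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: in_reboot_window DAY NOW DAYS START END
#   DAY   lowercase three-letter weekday (mon..sun)
#   NOW   current time as zero-padded 24h "HH:MM"
#   DAYS  comma-separated list of allowed weekdays
#   START / END  window boundaries as zero-padded 24h "HH:MM" (assumed inclusive)
in_reboot_window() {
  local day="$1" now="$2" days="$3" start="$4" end="$5"
  # The day must appear in the comma-separated list
  case ",$days," in
    *",$day,"*) ;;
    *) return 1 ;;
  esac
  # Drop the colon so that "14:00" becomes 1400 and can be compared numerically
  local n="${now%:*}${now#*:}" s="${start%:*}${start#*:}" e="${end%:*}${end#*:}"
  [ "$n" -ge "$s" ] && [ "$n" -le "$e" ]
}

# Example: evaluate the current moment in the Europe/Zurich time zone
day="$(LC_ALL=C TZ=Europe/Zurich date +%a | tr '[:upper:]' '[:lower:]')"
now="$(TZ=Europe/Zurich date +%H:%M)"
if in_reboot_window "$day" "$now" "mon,tue,wed,thu,fri" "14:00" "17:00"; then
  echo "inside the reboot window"
else
  echo "outside the reboot window"
fi
```

A fixed list of weekdays and a single daily time range is all such a check can express, which is why calendar rules like "the second Monday of the month" need an external scheduler.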
Below is an example of deployment with a configuration for a maintenance window between 14:00 and 17:00 (Europe/Zurich time), Monday to Friday.
helm repo add kubereboot https://kubereboot.github.io/charts/
helm repo update
kubectl create namespace kured
helm install kured kubereboot/kured --namespace kured --version 5.3.2 \
--set extraArgs.start-time=2pm \
--set extraArgs.end-time=5pm \
--set extraArgs.time-zone=Europe/Zurich \
--set extraArgs.reboot-days="mon\,tue\,wed\,thu\,fri" \
--set configuration.lockTtl=30m \
--set configuration.period=1m
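The backslashes in the reboot-days value are needed because `helm --set` treats an unescaped comma as a separator between values. When the list of days comes from a variable, the escaping can be done with a small helper. The function name `escape_helm_set_value` is hypothetical and the snippet is only a sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape commas so that `helm --set` keeps the value whole.
escape_helm_set_value() {
  printf '%s\n' "$1" | sed 's/,/\\,/g'
}

days="$(escape_helm_set_value "mon,tue,wed,thu,fri")"
# Print the resulting flag instead of calling helm, so the sketch has no side effects
echo "--set extraArgs.reboot-days=$days"
```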
You can follow the logs of the Kured pods with:
kubectl -n kured logs -f -l app.kubernetes.io/name=kured
For one of our use cases, this scheduling is neither specific nor controlled enough, as we cannot afford to have nodes potentially restarting on any day of the week.
To overcome this issue and have full control over maintenance windows and restarts, we have set up the following automation:
AWX executes a Template for deploying the Kured daemonset on the day of maintenance
AWX executes a Template for removing the Kured daemonset at the end of the maintenance window
By using AWX, we can finely schedule the automatic execution of these templates according to the annual planning of maintenance days.
Here is an example Ansible playbook for the deployment of the Kured daemonset. In our use case, we created a new type of AWX credential that passes the kubeconfig in base64 format to the playbook, which then recreates the kubeconfig on the awx-ee execution containers.
The different helm values are also configured at the AWX Template level.
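The base64 value consumed by the playbook (the `kubeconfig_file_base64` variable) can be produced and checked with standard tools. The sketch below uses hypothetical helper names and a throwaway sample file rather than a real kubeconfig. How the value is then stored in the AWX credential is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the base64 round trip used by the playbook.

# Encode a file as a single-line base64 string (newlines stripped for portability)
encode_kubeconfig() {
  base64 < "$1" | tr -d '\n'
}

# Decode a base64 string back to the original content
decode_kubeconfig() {
  printf '%s' "$1" | base64 --decode
}

# Demo on a throwaway sample file, not a real kubeconfig
sample="$(mktemp)"
printf 'apiVersion: v1\nkind: Config\n' > "$sample"
encoded="$(encode_kubeconfig "$sample")"
decode_kubeconfig "$encoded"
rm -f "$sample"
```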
---
- name: Deploy KURED on the cluster and follow the pod logs
  hosts: all
  gather_facts: yes
  tasks:
    - name: Create Kubeconfig on awx-ee
      delegate_to: localhost
      shell: |
        # -p avoids a failure when ~/.kube already exists
        mkdir -p ~/.kube/
        echo "{{ kubeconfig_file_base64 }}" | base64 --decode > ~/.kube/config
    - name: Get nodes list
      delegate_to: localhost
      shell: |
        kubectl get nodes -o wide
      register: nodes_list
    - name: Add kured Helm chart repo on awx-ee
      delegate_to: localhost
      kubernetes.core.helm_repository:
        validate_certs: false
        name: kubereboot
        repo_url: https://kubereboot.github.io/charts/
    - name: Deploy KURED to the cluster
      delegate_to: localhost
      ignore_errors: true
      kubernetes.core.helm:
        validate_certs: false
        name: kured
        chart_ref: kubereboot/kured
        release_namespace: kured
        chart_version: 5.3.2
        create_namespace: true
        set_values:
          - value: "extraArgs.start-time={{ start_time }}"
            value_type: string
          - value: "extraArgs.end-time={{ end_time }}"
            value_type: string
          - value: "extraArgs.time-zone={{ time_zone }}"
            value_type: string
          - value: "extraArgs.reboot-days={{ reboot_days }}"
            value_type: string
          - value: "configuration.lockTtl={{ lockttl }}"
            value_type: string
          - value: "configuration.period={{ period }}"
            value_type: string
          # Passed through extraArgs like the other flags; a bare "notify-url" key is not a chart value
          - value: "extraArgs.notify-url={{ notify_url }}"
            value_type: string
    - name: Wait for 60 seconds
      delegate_to: localhost
      ansible.builtin.wait_for:
        timeout: 60
    - name: Log KURED pods
      delegate_to: localhost
      ignore_errors: true
      kubernetes.core.k8s_log:
        validate_certs: false
        namespace: kured
        label_selectors:
          - app.kubernetes.io/name=kured
      register: kured_log
And below is an example playbook for uninstalling the Kured daemonset. It is always recommended to use existing Ansible modules, such as kubernetes.core, but in our specific case, we used the "shell" module to execute the operations.
---
- name: Remove KURED from the cluster
  hosts: all
  gather_facts: yes
  tasks:
    - name: Create Kubeconfig on awx-ee
      delegate_to: localhost
      shell: |
        # -p avoids a failure when ~/.kube already exists
        mkdir -p ~/.kube/
        echo "{{ kubeconfig_file_base64 }}" | base64 --decode > ~/.kube/config
    - name: Get nodes list
      delegate_to: localhost
      shell: |
        kubectl --insecure-skip-tls-verify get nodes -o wide
      register: nodes_list
    - name: Uninstall KURED from the cluster
      delegate_to: localhost
      ignore_errors: true
      shell: |
        helm --kube-insecure-skip-tls-verify=true -n kured uninstall kured
    - name: Remove the KURED namespace
      delegate_to: localhost
      ignore_errors: true
      shell: |
        kubectl --insecure-skip-tls-verify delete ns kured
To be able to execute these playbooks, the AWX execution environment must contain the Ansible modules and binaries (such as kubectl and helm) that the playbooks require.
For the creation of a custom awx-ee, you can refer to awx-19-create-a-custom-awx-ee-docker-image.
And there you have it! With this setup, we ensure that the latest security patches are applied, with node reboots if necessary, and above all, we avoid any untimely restarts of nodes and workloads!
Note: the cover image for this article was generated by AI.