Mastering Automated Node Restarts in AKS: Harnessing the Power of Kured Daemonset

How to Automate AKS Node Restarts When Necessary?

The question comes up especially after Azure automatically installs OS security patches or kernel updates. As a reminder, the nodes check daily for security patches via unattended-upgrade.

Thankfully, the Kured daemonset exists to assist us with these operations.


Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.

  • Watches for the presence of a reboot sentinel file e.g. /var/run/reboot-required or the successful run of a sentinel command.

  • Utilises a lock in the API server to ensure only one node reboots at a time

  • Optionally defers reboots in the presence of active Prometheus alerts or selected pods

  • Cordons & drains worker nodes before reboot, uncordoning them after
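To make the first point concrete, here is a minimal sketch of the kind of sentinel-file check described above. It is only an illustration, not Kured's actual implementation; the function name needs_reboot and the SENTINEL variable are hypothetical.

```shell
# Sketch only: succeed when the reboot sentinel file exists.
# The default path is the one mentioned in the list above.
needs_reboot() {
  sentinel="${1:-/var/run/reboot-required}"
  [ -f "$sentinel" ]
}

# SENTINEL is a hypothetical override, handy for trying the check on a test file.
if needs_reboot "${SENTINEL:-/var/run/reboot-required}"; then
  echo "reboot required"
else
  echo "no reboot required"
fi
```

Run on a node (or in a privileged debug pod that can see the host path), this prints whether the OS is currently requesting a reboot.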

How does it work?

Kured does not rely on a ConfigMap or Secret for its configuration: scheduling parameters can only be passed to the daemonset at deployment time via Helm.

The schedule cannot be left undefined either: if nothing is specified, the default "any" configuration applies, that is, "Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC".

Note that Kured does not support advanced, cron-like scheduling. For instance, it is not possible to specify a maintenance window such as the xth Monday or Tuesday of months x, y and z.
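To illustrate the kind of rule Kured cannot express, here is a small sketch of an "nth weekday of the month" test that an external scheduler or wrapper script could evaluate before deploying Kured. The function name is_nth_weekday is hypothetical, and the check relies on a simple observation: days 1 to 7 contain the first occurrence of every weekday, days 8 to 14 the second, and so on.

```shell
# is_nth_weekday N WEEKDAY DAY_OF_MONTH DAY_NAME
# Succeeds when DAY_NAME equals WEEKDAY and DAY_OF_MONTH lies in the Nth 7-day slot.
is_nth_weekday() {
  n="$1"; want="$2"; dom="$3"; dow="$4"
  dom="${dom#0}"                    # strip a leading zero ("08" -> "8")
  [ "$dow" = "$want" ] || return 1
  first=$(( (n - 1) * 7 + 1 ))
  last=$(( n * 7 ))
  [ "$dom" -ge "$first" ] && [ "$dom" -le "$last" ]
}

# Example: is today the second Tuesday of the month?
if is_nth_weekday 2 Tue "$(date +%d)" "$(LC_ALL=C date +%a)"; then
  echo "today is the 2nd Tuesday: maintenance day"
else
  echo "not a maintenance day"
fi
```

Restricting the result to specific months would only need one more comparison on the output of date +%m.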

Below is an example deployment configured with a maintenance window between 14:00 and 17:00 (Europe/Zurich time), Monday to Friday.

helm repo add kubereboot https://kubereboot.github.io/charts
helm repo update

kubectl create namespace kured

helm install kured kubereboot/kured --namespace kured  --version 5.3.2 \
 --set extraArgs.start-time=2pm \
 --set extraArgs.end-time=5pm \
 --set extraArgs.time-zone=Europe/Zurich \
 --set extraArgs.reboot-days="mon\,tue\,wed\,thu\,fri" \
 --set configuration.lockTtl=30m \
 --set configuration.period=1m
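As a rough illustration of what a start/end window means, the sketch below tests whether a given time falls inside a same-day window. It is not Kured's code: the helper names to_minutes and in_window are hypothetical, the end time is treated as exclusive by assumption, and windows that wrap past midnight are not handled.

```shell
# to_minutes HH:MM -> minutes since midnight
# (leading zeros are stripped so "08" is not parsed as octal)
to_minutes() {
  h="${1%%:*}"; m="${1#*:}"
  h="${h#0}"; m="${m#0}"
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# in_window START END NOW: succeeds when START <= NOW < END (same-day window only)
in_window() {
  start="$(to_minutes "$1")"; end="$(to_minutes "$2")"; now="$(to_minutes "$3")"
  [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}

# Example: evaluate the 14:00-17:00 window in the time zone used above.
if in_window 14:00 17:00 "$(TZ=Europe/Zurich date +%H:%M)"; then
  echo "inside the maintenance window"
else
  echo "outside the maintenance window"
fi
```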

You can follow the logs of the Kured containers by passing the pods' label selector to -l:

kubectl -n kured logs -f -l

For one of our use cases, this scheduling is neither specific nor controlled enough, as we cannot afford to have nodes potentially restarting on any day of the week.

To overcome this issue and keep full control over maintenance windows and restarts, we set up the following automation:

  • AWX executes a Template for deploying the Kured daemonset on the day of maintenance

  • AWX executes a Template for removing the Kured daemonset at the end of the maintenance window

By using AWX, we can finely schedule the automatic execution of these templates according to the annual planning of maintenance days.
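For readers without AWX, the same idea can be approximated with a guard script driven by a planning file. The sketch below is an assumption-laden illustration rather than what we run: the function is_maintenance_day and the file maintenance-days.txt (one ISO date per line) are hypothetical.

```shell
# is_maintenance_day PLAN_FILE DATE
# Succeeds when DATE (YYYY-MM-DD) appears as a full line in PLAN_FILE.
is_maintenance_day() {
  plan_file="$1"; today="$2"
  grep -qxF "$today" "$plan_file"
}

# Typical use (the file name is hypothetical):
#   if is_maintenance_day maintenance-days.txt "$(date +%Y-%m-%d)"; then
#     echo "maintenance day: deploy Kured"
#   fi
```

The exact-line match (-x) avoids accidental hits on partial dates, and -F treats the date as a literal string rather than a regular expression.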

Here is an example Ansible playbook for deploying the Kured daemonset. In our use case, we created a new type of AWX credential that passes the kubeconfig to the playbook in base64 format, so that the kubeconfig can be created on the awx-ee execution containers.

The different helm values are also configured at the AWX Template level.

- name: Deploy KURED on the cluster and follow the pod logs
  hosts: all
  gather_facts: yes

  tasks:

    - name: Create Kubeconfig on awx-ee
      delegate_to: localhost
      shell: |
        mkdir -p ~/.kube/
        echo "{{ kubeconfig_file_base64 }}" | base64 --decode > ~/.kube/config

    - name: Get nodes list
      delegate_to: localhost
      shell: |
        kubectl get nodes -o wide
      register: nodes_list

    - name: Add kured Helm chart repo on awx-ee
      delegate_to: localhost
      kubernetes.core.helm_repository:
        validate_certs: false
        name: kubereboot
        repo_url: https://kubereboot.github.io/charts

    - name: Deploy KURED to the cluster
      delegate_to: localhost
      ignore_errors: true
      kubernetes.core.helm:
        validate_certs: false
        name: kured
        chart_ref: kubereboot/kured
        release_namespace: kured
        chart_version: 5.3.2
        create_namespace: true
        set_values:
          - value: "extraArgs.start-time={{ start_time }}"
            value_type: string
          - value: "extraArgs.end-time={{ end_time }}"
            value_type: string
          - value: "extraArgs.time-zone={{ time_zone }}"
            value_type: string
          - value: "extraArgs.reboot-days={{ reboot_days }}"
            value_type: string
          - value: "configuration.lockTtl={{ lockttl }}"
            value_type: string
          - value: "configuration.period={{ period }}"
            value_type: string
          - value: "notify-url={{ notify_url }}"
            value_type: string

    - name: Wait for 60 seconds
      delegate_to: localhost
      ansible.builtin.wait_for:
        timeout: 60

    - name: Log KURED pods
      delegate_to: localhost
      ignore_errors: true
      kubernetes.core.k8s_log:
        validate_certs: false
        namespace: kured
      register: kured_log

Below is an example playbook for uninstalling the Kured daemonset. Using existing Ansible modules, such as those in the kubernetes.core collection, is always recommended, but in our specific case we used the "shell" module to run the operations.

- name: Remove KURED from the cluster
  hosts: all
  gather_facts: yes

  tasks:

    - name: Create Kubeconfig on awx-ee
      delegate_to: localhost
      shell: |
        mkdir -p ~/.kube/
        echo "{{ kubeconfig_file_base64 }}" | base64 --decode > ~/.kube/config

    - name: Get nodes list
      delegate_to: localhost
      shell: |
        kubectl --insecure-skip-tls-verify get nodes -o wide
      register: nodes_list

    - name: Uninstall KURED from the cluster
      delegate_to: localhost
      ignore_errors: true
      shell: |
        helm --kube-insecure-skip-tls-verify=true -n kured uninstall kured

    - name: Remove the KURED namespace
      delegate_to: localhost
      ignore_errors: true
      shell: |
        kubectl --insecure-skip-tls-verify delete ns kured
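Since Kured cordons a node before rebooting it, it can be worth checking that no node is still cordoned before the daemonset is removed. The following sketch assumes the usual kubectl get nodes output, where a cordoned node shows SchedulingDisabled in its STATUS column; the function name cordoned_nodes is hypothetical.

```shell
# cordoned_nodes: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS column contains SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Typical use before uninstalling:
#   kubectl get nodes --no-headers | cordoned_nodes
# An empty result means no node is currently cordoned.
```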

To execute these playbooks, the AWX execution environment must contain the modules and binaries that the playbooks require.

For the creation of a custom awx-ee, you can refer to awx-19-create-a-custom-awx-ee-docker-image.

And there you have it! With this setup, we ensure that the latest security patches are applied, with node reboots if necessary, and above all, we avoid any untimely restarts of nodes and workloads!

