Enable Auto-Repair for GKE Cluster Nodes

Risk level: Medium (should be achieved)

Ensure that the Auto-Repair feature is enabled for all your Google Kubernetes Engine (GKE) cluster nodes to help keep them healthy. GKE uses each node's health status to determine whether the node needs to be repaired, and triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold. A node is considered unhealthy when:

A cluster node reports a "NotReady" status on consecutive checks over the given time threshold.

A cluster node does not report any status at all over the given time threshold.

A cluster node's boot disk is out of disk space for an extended period of time.
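
The first of these conditions can be spot-checked from the Kubernetes side. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The helper name not_ready_nodes is hypothetical, and it only filters the default columns printed by kubectl get nodes --no-headers. It does not reproduce the time threshold that GKE applies before repairing a node.

```shell
# not_ready_nodes: hypothetical helper. Reads `kubectl get nodes --no-headers`
# output on stdin (columns: NAME STATUS ROLES AGE VERSION) and prints the
# name of every node whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

A node that is cordoned but healthy shows "Ready,SchedulingDisabled" and is therefore not listed.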

Reliability

GKE Auto-Repair helps you keep the nodes in your cluster in a healthy, running state. When the feature is enabled, GKE periodically checks the health state of each node in your cluster. If a node fails consecutive health checks over a given time threshold, GKE initiates a repair process for that node.
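
As an illustration of the rule described above (not GKE's actual implementation), the sketch below decides whether a node has been continuously unhealthy for at least a given threshold. The helper name needs_repair, the input format (one "epoch-seconds status" pair per line, oldest first) and the 600-second value in the usage line are all assumptions made for this example. The case where a node sends no reports at all is not modeled.

```shell
# needs_repair: hypothetical helper. Reads "epoch_seconds status" lines
# (oldest first) on stdin. Exits 0 when the most recent unbroken run of
# non-"Ready" reports spans at least THRESHOLD seconds, otherwise exits 1.
needs_repair() {
  awk -v t="$1" '
    $2 == "Ready" { start = ""; next }   # a healthy report resets the run
    start == ""   { start = $1 }         # first unhealthy report of a run
                  { last = $1 }          # latest unhealthy report of the run
    END { exit !(start != "" && last - start >= t) }
  '
}

# Usage with an assumed threshold of 600 seconds:
#   needs_repair 600 < node-status.log && echo "repair would be triggered"
```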


Audit

To determine if your Google Kubernetes Engine (GKE) clusters are using auto-repairing nodes, perform the following actions:

Using GCP Console

01 Sign in to the Google Cloud Management Console.

02 Select the Google Cloud Platform (GCP) project that you want to access from the console top navigation bar.

03 Navigate to the Google Kubernetes Engine (GKE) console at https://console.cloud.google.com/kubernetes.

04 In the navigation panel, select Clusters to access the list of the GKE clusters deployed within the selected project.

05 Click on the name of the GKE cluster that you want to examine and select the Details tab to access the cluster configuration information.

06 Under Node pools, click on the name of the cluster node pool that you want to examine.

07 In the Management section, check the Auto-repair configuration attribute status. If the attribute status is set to Disabled, the Auto-Repair feature is not enabled for the nodes running within the selected Google Kubernetes Engine (GKE) cluster node pool.

08 Repeat steps no. 6 and 7 for each node pool provisioned for the selected GKE cluster.

09 Repeat steps no. 5 – 8 for each GKE cluster created for the selected GCP project.

10 Repeat steps no. 2 – 9 for each project deployed within your Google Cloud account.

Using GCP CLI

01 Run the projects list command (Windows/macOS/Linux) with a custom output format to list the IDs of all the Google Cloud Platform (GCP) projects available in your cloud account:

gcloud projects list \
    --format="table(projectId)"

02 The command output should return the requested GCP project identifiers:

PROJECT_ID
cc-bigdata-project-123123
cc-analytics-project-112233

03 Run the container clusters list command (Windows/macOS/Linux) with a custom output format to list the name and the location of each GKE cluster provisioned for the selected Google Cloud project:

gcloud container clusters list \
    --project cc-bigdata-project-123123 \
    --format="(NAME,LOCATION)"

04 The command output should return the requested GKE cluster names and their locations:

NAME                     LOCATION
cc-gke-frontend-cluster  us-central1-c
cc-gke-backend-cluster   us-central1-c

05 Run the container node-pools list command (Windows/macOS/Linux) using the name of the GKE cluster that you want to examine as the identifier parameter, with a custom output format to list the name of each node pool provisioned for the selected cluster:

gcloud container node-pools list \
    --cluster=cc-gke-frontend-cluster \
    --zone=us-central1-c \
    --format="(NAME)"

06 The command output should return the requested cluster node pool name(s):

NAME
cc-gke-frontend-pool-001
cc-gke-frontend-pool-002
cc-gke-frontend-pool-003

07 Run the container node-pools describe command (Windows/macOS/Linux) using the name of the cluster node pool that you want to examine as the identifier parameter, with custom output filtering to return the Auto-Repair feature configuration status:

gcloud container node-pools describe cc-gke-frontend-pool-001 \
    --cluster=cc-gke-frontend-cluster \
    --zone=us-central1-c \
    --format="yaml(management.autoRepair)"

08 The command output should return the requested feature configuration status:

management: {}

If the container node-pools describe command output returns null or an empty object for the management configuration attribute, as shown in the example above, the Auto-Repair feature is not enabled for the nodes running within the selected Google Kubernetes Engine (GKE) cluster node pool.

09 Repeat steps no. 7 and 8 for each node pool provisioned for the selected GKE cluster.

10 Repeat steps no. 5 – 9 for each GKE cluster created for the selected GCP project.

11 Repeat steps no. 3 – 10 for each GCP project deployed in your Google Cloud account.
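
The repetition in the audit steps above can be scripted. The following is a minimal sketch, not an official tool: the function names autorepair_state and audit_project are hypothetical, and it assumes an authenticated gcloud release that accepts the --location flag on container commands (substitute --zone or --region otherwise). It relies on the value() output format, which prints True when management.autoRepair is set and an empty string when it is not.

```shell
# autorepair_state: maps the value printed by
#   gcloud container node-pools describe ... --format="value(management.autoRepair)"
# to ENABLED or DISABLED. An empty value means the attribute is not set.
autorepair_state() {
  case "$1" in
    True|true|TRUE) echo "ENABLED" ;;
    *)              echo "DISABLED" ;;
  esac
}

# audit_project: hypothetical helper. Walks every cluster and node pool in
# one project and prints one "cluster/pool STATE" line per node pool.
audit_project() {
  local project cluster location pool value
  project="$1"
  gcloud container clusters list --project "$project" \
      --format="value(name,location)" |
  while read -r cluster location; do
    gcloud container node-pools list --project "$project" \
        --cluster "$cluster" --location "$location" \
        --format="value(name)" </dev/null |
    while read -r pool; do
      value="$(gcloud container node-pools describe "$pool" \
          --project "$project" --cluster "$cluster" \
          --location "$location" \
          --format="value(management.autoRepair)" </dev/null)"
      echo "$cluster/$pool $(autorepair_state "$value")"
    done
  done
}

# Usage:
#   audit_project cc-bigdata-project-123123
```

Any node pool reported as DISABLED is a candidate for the remediation steps that follow.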

Remediation / Resolution

To enable the Auto-Repair feature for all the Google Kubernetes Engine (GKE) cluster nodes, perform the following actions:

Note: GKE cluster node auto-repair can be enabled on a per-node pool basis only.

Using GCP Console

01 Sign in to the Google Cloud Management Console.

02 Select the GCP project that you want to access from the console top navigation bar.

03 Navigate to the Google Kubernetes Engine (GKE) console at https://console.cloud.google.com/kubernetes.

04 In the navigation panel, select Clusters to access the list of the GKE clusters available within the selected project.

05 Click on the name of the GKE cluster that you want to access (see Audit section part I to identify the right resource), and select the Details tab to access the cluster configuration information.

06 Under Node pools, click on the name of the node pool that you want to reconfigure and choose the EDIT button from the console top menu to access the resource editing page.

07 In the Management section, select the Enable auto-repair checkbox to enable the Auto-Repair feature for the selected Google Kubernetes Engine (GKE) cluster node pool.

08 Click SAVE to apply the configuration changes.

09 Repeat steps no. 6 – 8 to enable auto-repair for other node pools provisioned for the selected GKE cluster.

10 Repeat steps no. 5 – 9 to reconfigure other GKE clusters created for the selected GCP project.

11 Repeat steps no. 2 – 10 for each GCP project available in your Google Cloud account.

Using GCP CLI

01 Run the container node-pools update command (Windows/macOS/Linux) using the name of the GKE cluster node pool that you want to reconfigure as the identifier parameter to enable the Auto-Repair feature for the selected node pool:

gcloud container node-pools update cc-gke-frontend-pool-001 \
    --cluster=cc-gke-frontend-cluster \
    --zone=us-central1-c \
    --enable-autorepair

02 The command output should return the URL of the reconfigured GKE cluster node pool:

Updating node pool cc-gke-frontend-pool-001...done.
Updated [https://container.googleapis.com/v1/projects/cc-bigdata-project-123123/zones/us-central1-c/clusters/cc-gke-frontend-cluster/nodePools/cc-gke-frontend-pool-001].

03 Repeat steps no. 1 and 2 to enable auto-repair for other node pools created for the selected GKE cluster.

04 Repeat steps no. 1 – 3 to reconfigure other GKE clusters created for the selected GCP project.

05 Repeat steps no. 1 – 4 for each GCP project deployed in your Google Cloud account.
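
When many node pools need the change, the update command above can be applied in a loop. The following is a minimal sketch under stated assumptions: the function name enable_autorepair, the DRY_RUN variable and the "cluster location pool" input format are all hypothetical, and the --location flag assumes a gcloud release that supports it (substitute --zone or --region otherwise). By default it only prints the commands so they can be reviewed before anything is changed.

```shell
# enable_autorepair: hypothetical helper. Reads "cluster location pool" lines
# on stdin and enables auto-repair on each listed node pool of PROJECT.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
enable_autorepair() {
  local project cluster location pool
  project="$1"
  while read -r cluster location pool; do
    # Build the command as positional parameters so it can be printed or run.
    set -- gcloud container node-pools update "$pool" \
        --project "$project" --cluster "$cluster" \
        --location "$location" --enable-autorepair
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$*"
    else
      "$@" </dev/null
    fi
  done
}

# Usage (prints the command; set DRY_RUN=0 to apply it):
#   echo "cc-gke-frontend-cluster us-central1-c cc-gke-frontend-pool-001" |
#     enable_autorepair cc-bigdata-project-123123
```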

References

Publication date May 10, 2021
