The cluster auditing and event storage features of TKE come with visual dashboards that display audit logs and cluster events across multiple dimensions. The dashboards are simple to use and cover most common cluster Ops use cases, making it easier to locate problems, improve Ops efficiency, and get more value from audit and event data.
This document describes how to use audit and event dashboards to quickly locate cluster problems for several use cases.
You have logged in to the TKE console and enabled cluster audit and event storage.
The 10001****7138 account deleted the nginx application at 2020-11-30T03:37:13. For more information on the account, select CAM > User List.
The 10001****7138 account cordoned the node 172.16.18.13 at 2020-11-30T06:22:18.
The tke-kube-state-metrics user has far more access requests than other users. The operation type distribution trend shows that most of the operations are LIST operations, and the status code distribution trend shows that most of the status codes are 403. The business logs show that the tke-kube-state-metrics add-on kept retrying its requests to the API server because of an RBAC authorization issue, resulting in a sharp increase in API server access requests. Below is a sample log:
E1130 06:19:37.368981 1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:tke-kube-state-metrics" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
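The same per-user, per-operation, and per-status-code breakdown can be reproduced offline on exported audit records. The following is a minimal sketch, assuming the records follow the Kubernetes audit event schema (user.username, verb, responseStatus.code). The summarize_audit helper and the sample records are hypothetical and are not TKE output.

```python
from collections import Counter

def summarize_audit(records):
    """Return the busiest user and a (verb, status code) count for that user.

    `records` is assumed to be a list of dicts shaped like Kubernetes
    audit events; this helper is illustrative, not a TKE API.
    """
    per_user = Counter(r["user"]["username"] for r in records)
    if not per_user:
        return None, Counter()
    top_user, _ = per_user.most_common(1)[0]
    breakdown = Counter(
        (r["verb"], r.get("responseStatus", {}).get("code"))
        for r in records
        if r["user"]["username"] == top_user
    )
    return top_user, breakdown

# Hypothetical sample data: five forbidden LIST calls from the add-on's
# service account and one successful GET from another user.
KSM = "system:serviceaccount:kube-system:tke-kube-state-metrics"
sample = (
    [{"user": {"username": KSM}, "verb": "list", "responseStatus": {"code": 403}}] * 5
    + [{"user": {"username": "admin"}, "verb": "get", "responseStatus": {"code": 200}}]
)

top_user, breakdown = summarize_audit(sample)
print(top_user)
print(breakdown.most_common(1))
```

A user whose requests are dominated by a single verb and a 403 status code, as in this sample, usually points to a missing RBAC permission rather than a genuine increase in load.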
The node 172.16.18.13 became abnormal due to insufficient disk space, and kubelet then began evicting pods on the node to reclaim disk space.
When node pool elastic scaling is enabled, the cluster auto-scaler (CA) add-on automatically increases or decreases the number of nodes in the cluster according to the load. If a node in the cluster is automatically scaled, you can trace the whole scaling process through event search.
Log in to the TKE console.
Select Log Management > Event Logs in the left sidebar to go to the Event search page.
Select the Global Search tab and enter the following search command in the search box:
event.source.component : "cluster-autoscaler"
Select event.reason, event.message, and event.involvedObject.name from the Hidden Fields on the left for display. Click Search and Analysis and view the results.
Sort the search results by Log Time in reverse order as shown below:
According to the event flow in the figure above, the node scale-out occurred around 2020-11-25 20:35:45 and was triggered by three nginx pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, and nginx-5dbf784b68-v9jv5). After three nodes were scaled out, no further scaling was triggered because the number of nodes in the node pool had reached the upper limit.
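The search and sort steps above can also be sketched against exported event records. This is a minimal illustration, assuming each record is a dict shaped like a Kubernetes Event object (source.component, reason, message, involvedObject.name, lastTimestamp). The autoscaler_events helper, the pod names, and the sample messages are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed timestamp format of the export

def autoscaler_events(events):
    """Keep events emitted by cluster-autoscaler, newest first."""
    selected = [
        e for e in events
        if e.get("source", {}).get("component") == "cluster-autoscaler"
    ]
    return sorted(
        selected,
        key=lambda e: datetime.strptime(e["lastTimestamp"], TIME_FORMAT),
        reverse=True,
    )

# Hypothetical sample records, not taken from a real cluster.
sample_events = [
    {"source": {"component": "cluster-autoscaler"}, "reason": "TriggeredScaleUp",
     "message": "hypothetical scale-up message",
     "involvedObject": {"name": "example-pod-1"},
     "lastTimestamp": "2020-11-25T20:35:45Z"},
    {"source": {"component": "kubelet"}, "reason": "Pulled",
     "message": "hypothetical kubelet message",
     "involvedObject": {"name": "example-pod-2"},
     "lastTimestamp": "2020-11-25T20:36:10Z"},
    {"source": {"component": "cluster-autoscaler"}, "reason": "NotTriggerScaleUp",
     "message": "hypothetical message: node pool at upper limit",
     "involvedObject": {"name": "example-pod-3"},
     "lastTimestamp": "2020-11-25T20:40:00Z"},
]

# Show only the fields selected in the console steps above.
rows = [
    (e["lastTimestamp"], e["reason"], e["involvedObject"]["name"], e["message"])
    for e in autoscaler_events(sample_events)
]
for row in rows:
    print(*row, sep=" | ")
```

Reading the filtered rows from oldest to newest gives the same scaling timeline that the console view provides.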