Node Problem Detector is a daemon for monitoring and reporting about a node's health. You can run Node Problem Detector as a DaemonSet or as a standalone daemon. Node Problem Detector collects information about node problems from various daemons and reports these conditions to the API server as Node Conditions or as Events.
To learn how to install and use Node Problem Detector, see the Node Problem Detector project documentation.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use a Kubernetes playground.
Some cloud providers enable Node Problem Detector as an Addon. You can also enable Node Problem Detector with kubectl or by creating an Addon DaemonSet.
kubectl provides the most flexible management of Node Problem Detector. You can overwrite the default configuration to fit it into your environment or to detect customized node problems. For example:
Create a Node Problem Detector configuration similar to node-problem-detector.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
Start Node Problem Detector with kubectl:
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
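After applying the manifest, it can help to confirm that the DaemonSet has placed a Pod on each node. The following is a minimal sketch, assuming the DaemonSet name, namespace, and label from the example manifest. The helper function names are hypothetical and exist only for illustration.

```shell
# Assumed values, taken from the example manifest; adjust them for your cluster.
NPD_NAMESPACE="kube-system"
NPD_DAEMONSET="node-problem-detector-v0.1"
NPD_LABEL="k8s-app=node-problem-detector"

# npd_rollout_status (hypothetical helper) waits until the DaemonSet rollout completes.
npd_rollout_status() {
  kubectl rollout status "daemonset/${NPD_DAEMONSET}" --namespace "${NPD_NAMESPACE}"
}

# npd_pods (hypothetical helper) lists the Node Problem Detector Pods
# together with the node each one runs on.
npd_pods() {
  kubectl get pods --namespace "${NPD_NAMESPACE}" --selector "${NPD_LABEL}" --output wide
}
```

Run npd_rollout_status first, then npd_pods. You would expect to see one Pod per schedulable node.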
If you are using a custom cluster bootstrap solution and don't need to overwrite the default configuration, you can leverage the Addon pod to further automate the deployment.
Create node-problem-detector.yaml, and save the configuration in the Addon pod's directory /etc/kubernetes/addons/node-problem-detector on a control plane node.
The default configuration is embedded when building the Docker image of Node Problem Detector.
However, you can use a ConfigMap to overwrite the configuration:
Change the configuration files in config/
Create the ConfigMap node-problem-detector-config:
kubectl create configmap node-problem-detector-config --from-file=config/ --namespace=kube-system
Change the node-problem-detector.yaml to use the ConfigMap:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
Recreate the Node Problem Detector with the new configuration file:
# If you have a node-problem-detector running, delete before recreating
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
Note: This approach only applies to a Node Problem Detector started with kubectl. Overwriting a configuration is not supported if a Node Problem Detector runs as a cluster Addon. The Addon manager does not support ConfigMap.
A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node problems and reports them to the Node Problem Detector. There are several types of supported problem daemons.
A SystemLogMonitor type of daemon monitors the system logs and reports problems and metrics according to predefined rules. You can customize the configurations for different log sources such as filelog, kmsg, kernel, abrt, and systemd.
A SystemStatsMonitor type of daemon collects various health-related system stats as metrics. You can customize its behavior by updating its configuration file.
A CustomPluginMonitor type of daemon invokes and checks various node problems by running user-defined scripts. You can use different custom plugin monitors to monitor different problems and customize the daemon behavior by updating the configuration file.
A HealthChecker type of daemon checks the health of the kubelet and container runtime on a node.
The system log monitor currently supports file-based logs, journald, and kmsg. Additional sources can be added by implementing a new log watcher.
You can extend the Node Problem Detector to execute any monitor scripts written in any language by developing a custom plugin. The monitor scripts must conform to the plugin protocol in exit code and standard output. For more information, please refer to the plugin interface proposal.
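To illustrate what "conform to the plugin protocol in exit code and standard output" can look like, here is a minimal sketch of a monitor script. It assumes the convention that exit code 0 means OK, 1 means a problem was detected, and 2 means the state is unknown, and that standard output carries a short one-line message. Check the plugin interface proposal for the authoritative values. The check_dir function and the path in the trailing comment are hypothetical.

```shell
#!/bin/sh
# Sketch of a custom plugin check (hypothetical; not shipped with Node Problem Detector).
# Assumed exit-code convention: 0 = OK, 1 = problem detected, 2 = unknown.
OK=0
NONOK=1
UNKNOWN=2

# check_dir prints a one-line status message and returns one of the codes above.
# It reports a problem when the given directory does not exist.
check_dir() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "no directory given"
    return "$UNKNOWN"
  fi
  if [ -d "$dir" ]; then
    echo "directory $dir is present"
    return "$OK"
  fi
  echo "directory $dir is missing"
  return "$NONOK"
}

# A real plugin script would end by running the check and exiting with its code:
#   check_dir /var/lib/example; exit $?
```

Keeping the check in a function that returns a status code makes it easy to test the script locally before wiring it into a custom plugin monitor configuration.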
An exporter reports the node problems and/or metrics to certain backends. The following exporters are supported:
Kubernetes exporter: this exporter reports node problems to the Kubernetes API server. Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
Prometheus exporter: this exporter reports node problems and metrics locally as Prometheus (or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command line arguments.
Stackdriver exporter: this exporter reports node problems and metrics to the Stackdriver Monitoring API. The exporting behavior can be customized using a configuration file.
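The exporters above can be inspected from the command line. The sketch below shows one way to look at what the Kubernetes exporter and the Prometheus exporter report. The helper function names are hypothetical. The Prometheus address 127.0.0.1 and port 20257 are assumptions about the defaults; replace them with whatever you pass on the command line, and run that helper on the node itself if the exporter listens on a loopback address.

```shell
# node_conditions (hypothetical helper) prints "Type=Status" for every
# condition on the given node, including any set by the Kubernetes exporter.
node_conditions() {
  kubectl get node "$1" \
    --output jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# node_events (hypothetical helper) lists Events whose involved object is the given node.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}

# npd_metrics (hypothetical helper) fetches metrics from the Prometheus exporter.
# Address and port are ASSUMED defaults; pass your own as the first and second argument.
npd_metrics() {
  curl --silent --fail "http://${1:-127.0.0.1}:${2:-20257}/metrics"
}
```

For example, node_conditions my-node prints one line per condition, and node_events my-node shows temporary problems that were reported as Events. Here my-node is a placeholder for one of your node names.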
It is recommended to run the Node Problem Detector in your cluster to monitor node health. When running the Node Problem Detector, you can expect extra resource overhead on each node. Usually this is fine, because: