Use case: The kubernetes.io/container/restart_count system SLI metric reports the number of times a container has restarted. This chart can help you identify whether a container is crashing or restarting frequently. To monitor the container of a specific service, filter the metric by its labels.
The following example uses the kubernetes.io/container/restart_count metric for the Cassandra container. You can use this metric for any of the containers in the table above.
Resource types
k8s_container
Metric
kubernetes.io/container/restart_count
Filter By
namespace_name = apigee and container_name =~ .*cassandra.*
Group By
cluster_name, namespace_name, pod_name, container_name, and all k8s_container resource type labels
Aggregator
sum
Alert consideration
If a container is restarting frequently, further investigation is needed for the root cause. There are multiple reasons a container can restart, such as OOMKilled, data disk full, and configuration issues, to name a few.
Alert threshold
Depends on the SLO for the installation. For example, in production, trigger an event notification if a container restarts more than 5 times within 30 minutes.
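The example threshold can be sketched as a sliding-window check over restart_count samples. This is a minimal illustration, not part of Apigee or Cloud Monitoring: the function name `exceeds_restart_threshold` and the `(timestamp, count)` sample format are hypothetical, and the sketch assumes the samples are sorted by time and that the counter does not reset within the data (for example, because the pod was recreated).

```python
from datetime import datetime, timedelta


def exceeds_restart_threshold(samples, max_restarts=5, window=timedelta(minutes=30)):
    """Return True if the container restarted more than max_restarts times
    within any span of length `window`.

    samples: list of (timestamp, cumulative_restart_count) tuples, sorted by
    timestamp. Hypothetical format; assumes the counter never resets.
    """
    start = 0
    for end in range(len(samples)):
        end_time, end_count = samples[end]
        # Advance the left edge until the window spans at most `window`.
        while end_time - samples[start][0] > window:
            start += 1
        # Restarts inside the window are the difference of the cumulative counts.
        if end_count - samples[start][1] > max_restarts:
            return True
    return False


t0 = datetime(2025, 1, 1, 12, 0)
# Six restarts in 20 minutes: exceeds the example threshold.
crash_loop = [
    (t0, 0),
    (t0 + timedelta(minutes=10), 3),
    (t0 + timedelta(minutes=20), 6),
]
# Six restarts spread over an hour: no 30-minute window exceeds 5.
occasional = [
    (t0, 0),
    (t0 + timedelta(minutes=20), 3),
    (t0 + timedelta(minutes=60), 6),
]
print(exceeds_restart_threshold(crash_loop))
print(exceeds_restart_threshold(occasional))
```

In practice the same condition would normally be expressed as an alerting policy in the monitoring system itself; the sketch only shows the arithmetic behind "more than 5 restarts within 30 minutes" so that the two numbers can be tuned against the SLO.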
Last updated 2025-04-24 UTC.