Use case: The kubernetes.io/container/restart_count system SLI metric reports the number of times a container has restarted. A chart of this metric helps identify containers that crash or restart frequently. To monitor a specific service's container, filter by the metric labels.
The following example uses the kubernetes.io/container/restart_count metric for the Cassandra container. You can use this metric for any of the containers in the table above.
Resource types: k8s_container
Metric: kubernetes.io/container/restart_count
Filter By: namespace_name = apigee and container_name =~ .*cassandra.*
Group By: cluster_name, namespace_name, pod_name, container_name, and all k8s_container resource type labels
Aggregator: sum
Alert consideration: If a container restarts frequently, investigate the root cause. A container can restart for many reasons, such as OOMKilled, a full data disk, or configuration issues.
Alert threshold: Depends on the SLO for the installation. For example, in production, trigger an event notification if a container restarts more than 5 times within 30 minutes.
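The filter, grouping, and example threshold above can be sketched in plain Python. This is only an illustration, not a Cloud Monitoring API call: the function names, the sample tuple layout (labels, timestamp, count), and the default limits are hypothetical. It assumes restart_count is a cumulative counter that does not reset inside the window, and that each group holds a single time series, so no summing across series is needed.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Labels used to group samples, taken from the "Group By" row.
GROUP_BY = ("cluster_name", "namespace_name", "pod_name", "container_name")


def matches_filter(labels):
    """Apply the filter: namespace_name = apigee and container_name =~ .*cassandra.*"""
    return (
        labels.get("namespace_name") == "apigee"
        and re.fullmatch(r".*cassandra.*", labels.get("container_name", "")) is not None
    )


def restarts_in_window(points, window):
    """Return how much a cumulative restart counter grew over the trailing window."""
    ordered = sorted(points)  # (timestamp, cumulative_count), oldest first
    latest_time, latest_count = ordered[-1]
    # Baseline is the first sample that falls inside the window.
    baseline = next(count for ts, count in ordered if ts >= latest_time - window)
    return latest_count - baseline


def containers_to_alert(samples, max_restarts=5, window=timedelta(minutes=30)):
    """Return the group keys whose restart count grew by more than max_restarts."""
    series = defaultdict(list)
    for labels, timestamp, count in samples:
        if matches_filter(labels):
            key = tuple(labels.get(name, "") for name in GROUP_BY)
            series[key].append((timestamp, count))
    return [
        key
        for key, points in series.items()
        if restarts_in_window(points, window) > max_restarts
    ]
```

The comparison is strict, so a container that restarts exactly 5 times within 30 minutes does not trigger a notification. Adjust max_restarts and window to match the SLO for the installation.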
Last updated 2025-04-24 UTC.