Use case: The kubernetes.io/container/restart_count system SLI metric reports the number of times a container has restarted. Chart this metric to identify containers that are crashing or restarting frequently. To monitor a specific service's container, filter the metric by its labels.
The following example uses the kubernetes.io/container/restart_count metric for the Cassandra container. You can use this metric for any of the containers in the table above.
Resource types
k8s_container
Metric
kubernetes.io/container/restart_count
Filter By
namespace_name = apigee and container_name =~ .*cassandra.*
Group By
cluster_name, namespace_name, pod_name, container_name, and all k8s_container resource type labels
Aggregator
sum
Alert consideration
If a container restarts frequently, investigate the root cause. A container can restart for many reasons, including OOMKilled, a full data disk, and configuration issues.
Alert threshold
Depends on the SLO for the installation. For example: for production, trigger an event notification if a container restarts more than 5 times within 30 minutes.
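The example threshold above can be sketched as a sliding-window check. This is a minimal illustration, not how Cloud Monitoring evaluates alert policies. It assumes you have already exported samples for one container as (timestamp, cumulative restart count) pairs, sorted by time, and that the count only increases over the period examined. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def max_restarts_in_window(samples, window):
    """Return the largest increase in restart count seen within any time window.

    samples: list of (timestamp, cumulative_restart_count), sorted by timestamp.
    Assumption: the count is non-decreasing across the samples.
    """
    best = 0
    start = 0
    for end in range(len(samples)):
        # Advance the window start until it is within `window` of the current sample.
        while samples[end][0] - samples[start][0] > window:
            start += 1
        best = max(best, samples[end][1] - samples[start][1])
    return best


def exceeds_restart_threshold(samples, threshold=5, window=timedelta(minutes=30)):
    """True if the container restarted more than `threshold` times within `window`."""
    return max_restarts_in_window(samples, window) > threshold


# Hypothetical samples for a single Cassandra container.
t0 = datetime(2025, 1, 1, 12, 0)
noisy = [
    (t0, 2),
    (t0 + timedelta(minutes=10), 4),
    (t0 + timedelta(minutes=20), 8),
    (t0 + timedelta(minutes=40), 9),
]
quiet = [(t0 + timedelta(minutes=20 * i), i) for i in range(4)]

print(exceeds_restart_threshold(noisy))
print(exceeds_restart_threshold(quiet))
```

The defaults of 5 restarts and 30 minutes mirror the production example only. Tune both values to the SLO for your installation.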
Last updated 2025-04-24 UTC.