Guru: How to get your metrics into Prometheus
Don’t use metrics
Scraping labels
# TYPE http_requests_in_flight gauge
http_requests_in_flight 13
# TYPE http_request_seconds summary
http_request_seconds_sum{method="GET"} 9036.32
http_request_seconds_count{method="GET"} 807283.0
http_request_seconds_created{method="GET"} 1605281325.0
http_request_seconds_sum{method="POST"} 479.3
http_request_seconds_count{method="POST"} 34.0
http_request_seconds_created{method="POST"} 1605281325.0
# TYPE process_cpu_seconds counter
# UNIT process_cpu_seconds seconds
process_cpu_seconds_total 4.20072246e+06
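To make the exposition format above concrete, here is a minimal Python sketch that reads those exact lines and derives the mean request duration per method from the summary's _sum and _count samples. The parser is a simplification written for this example (it does not handle escaped quotes or braces inside label values, timestamps, or exemplars), and the names parse_samples and mean_duration_by_method are hypothetical, not part of any Prometheus client library.

```python
import re

EXPOSITION = """\
# TYPE http_requests_in_flight gauge
http_requests_in_flight 13
# TYPE http_request_seconds summary
http_request_seconds_sum{method="GET"} 9036.32
http_request_seconds_count{method="GET"} 807283.0
http_request_seconds_created{method="GET"} 1605281325.0
http_request_seconds_sum{method="POST"} 479.3
http_request_seconds_count{method="POST"} 34.0
http_request_seconds_created{method="POST"} 1605281325.0
# TYPE process_cpu_seconds counter
# UNIT process_cpu_seconds seconds
process_cpu_seconds_total 4.20072246e+06
"""

# Simplified grammar: name, optional {labels}, whitespace, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping # metadata lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue  # ignore anything this simplified grammar cannot read
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def mean_duration_by_method(samples):
    """Divide http_request_seconds_sum by http_request_seconds_count per method."""
    sums, counts = {}, {}
    for name, labels, value in samples:
        method = labels.get("method")
        if name == "http_request_seconds_sum":
            sums[method] = value
        elif name == "http_request_seconds_count":
            counts[method] = value
    return {m: sums[m] / counts[m] for m in sums if counts.get(m)}


if __name__ == "__main__":
    for method, mean in mean_duration_by_method(parse_samples(EXPOSITION)).items():
        print(f"{method}: {mean:.4f} s per request")
```

The same division is what a PromQL expression over the _sum and _count series would compute, except that PromQL would normally apply rate() first so the mean covers a recent window rather than the whole process lifetime.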
9 hour interval :: 0 -> ~55000
500 * 12 * 9 = 54000
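One possible reading of the arithmetic above, which the slide does not spell out, is a sanity check: a counter that grows by roughly 500 per 5-minute window should grow by about 54000 over 9 hours, which is close to the observed ~55000. The sketch below only restates that multiplication; the interpretation of 500 as "increase per 5 minutes" and 12 as "windows per hour" is an assumption.

```python
# Assumed meaning of the factors (not stated in the source):
increase_per_window = 500        # counter increase per 5-minute window
windows_per_hour = 60 // 5       # 12 five-minute windows in an hour
hours = 9

expected_total = increase_per_window * windows_per_hour * hours
observed_total = 55000           # "~55000" from the slide, an approximate value

# Relative gap between the back-of-the-envelope estimate and the observation.
relative_gap = (observed_total - expected_total) / observed_total

if __name__ == "__main__":
    print(expected_total, f"{relative_gap:.1%}")
```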
$name_{sum,bucket,count}
_sum (sum of all the values) and _count (a superfluous way to reference the count in the less-than-infinity bucket)
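The relationship between the three histogram series can be sketched in a few lines of Python. Buckets are cumulative, so every observation lands in the le="+Inf" bucket, which is why _count carries the same number. The function observe_all and the bucket bounds are hypothetical and chosen only for this illustration.

```python
import math


def observe_all(values, upper_bounds):
    """Build cumulative histogram state: bucket counts, sum, and count."""
    bounds = sorted(upper_bounds) + [math.inf]   # the +Inf bucket always exists
    buckets = {bound: 0 for bound in bounds}
    total = 0.0
    for value in values:
        total += value
        for bound in bounds:
            if value <= bound:
                buckets[bound] += 1              # cumulative: counts every bucket >= value
    return {"bucket": buckets, "sum": total, "count": len(values)}


if __name__ == "__main__":
    h = observe_all([0.05, 0.2, 0.7, 3.0], upper_bounds=[0.1, 0.5, 1.0])
    for bound, n in h["bucket"].items():
        print(f'le="{bound}" {n}')
    print("sum", h["sum"], "count", h["count"])
```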
WTF questions
→ prometheus dev → kube_
beats: kubectl get X -oyaml
# TYPE alert_routing_outcome_total counter
alert_routing_outcome_total{kind="",outcome="failure"} 2
alert_routing_outcome_total{kind="app",outcome="success"} 35
alert_routing_outcome_total{kind="infra",outcome="success"} 1
alert_routing_outcome_total{kind="owned",outcome="success"} 9
alert_routing_outcome_total{kind="pod",outcome="success"} 32
sum(increase(alert_routing_outcome_total{}[5m]))
by (outcome)
sum(rate(alert_routing_outcome_total{outcome="success"}[1h]))
/
sum(rate(alert_routing_outcome_total[1h]))
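As a rough illustration of what the two queries above aggregate, the Python sketch below groups the sample values by outcome and divides successes by the total. It assumes, purely for the example, that every counter started at 0 inside the query window, so each increase equals the current value shown in the exposition; a real rate() or increase() works on the difference between samples over time and handles counter resets.

```python
from collections import defaultdict

# Values copied from the alert_routing_outcome_total samples above.
SAMPLES = [
    ({"kind": "", "outcome": "failure"}, 2),
    ({"kind": "app", "outcome": "success"}, 35),
    ({"kind": "infra", "outcome": "success"}, 1),
    ({"kind": "owned", "outcome": "success"}, 9),
    ({"kind": "pod", "outcome": "success"}, 32),
]


def sum_by(samples, label):
    """Mimic `sum(...) by (label)`: add up values that share a label value."""
    grouped = defaultdict(float)
    for labels, value in samples:
        grouped[labels[label]] += value
    return dict(grouped)


def success_ratio(samples):
    """Mimic the success / total division, with both sides summed over all series."""
    by_outcome = sum_by(samples, "outcome")
    total = sum(by_outcome.values())
    return by_outcome.get("success", 0.0) / total


if __name__ == "__main__":
    print(sum_by(SAMPLES, "outcome"))
    print(f"{success_ratio(SAMPLES):.4f}")
```

Because both sides of the division are summed down to a single series with no labels, no explicit vector matching is needed in the PromQL version.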
<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>
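The matching rules behind on() and ignoring() can be sketched in Python as a one-to-one join on a reduced label set. This is a simplified model, not the Prometheus implementation: it ignores the metric name, and it raises on every duplicate match group instead of supporting group_left or group_right. The series and values are hypothetical.

```python
import operator


def match_key(labels, on=None, ignoring=None):
    """Reduce a label set to the labels that take part in matching."""
    if on is not None:
        kept = {k: v for k, v in labels.items() if k in on}
    else:
        ignored = set(ignoring or ())
        kept = {k: v for k, v in labels.items() if k not in ignored}
    return tuple(sorted(kept.items()))


def binary_op(lhs, rhs, op, on=None, ignoring=None):
    """One-to-one vector matching: pair series whose match keys are equal."""
    rhs_index = {}
    for labels, value in rhs:
        key = match_key(labels, on, ignoring)
        if key in rhs_index:
            raise ValueError("duplicate series on the right-hand side")
        rhs_index[key] = value
    result = {}
    for labels, value in lhs:
        key = match_key(labels, on, ignoring)
        if key not in rhs_index:
            continue                      # unmatched series are dropped
        if key in result:
            raise ValueError("many-to-one match needs group_left/group_right")
        result[key] = op(value, rhs_index[key])
    return result


# Hypothetical input vectors: errors carry an extra `code` label.
ERRORS = [({"method": "get", "code": "500"}, 24.0),
          ({"method": "post", "code": "500"}, 6.0)]
REQUESTS = [({"method": "get"}, 600.0),
            ({"method": "post"}, 120.0),
            ({"method": "del"}, 34.0)]

if __name__ == "__main__":
    print(binary_op(ERRORS, REQUESTS, operator.truediv, ignoring=["code"]))
    print(binary_op(ERRORS, REQUESTS, operator.truediv, on=["method"]))
```

In this example both forms give the same pairs, because ignoring code and matching only on method leave the same label behind; the del series has no partner on the left and does not appear in the result.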
http_request_sum{path="/api/v1",method="GET"} 9036.32
http_request_sum{path="/api/v2",method="GET"} 3036.1
http_request_sum{path="/api/v2",method="DELETE"} 1.3
http_request_sum{path="/api/v1",method="POST"} 4479.3
http_request_sum{path="/api/v2",method="POST"} 479.3
cardinality(http_request_sum)
= len({"GET", "POST", "DELETE"}) *
len({"/api/v1", "/api/v2"})
= 3*2 = 6
count(http_request_sum) = 5
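The difference between the two numbers above can be reproduced with a short Python sketch: the product of distinct values per label gives the worst-case number of series (6), while the number of label combinations actually exposed is what count() reports (5, since DELETE on /api/v1 never occurred). Note that cardinality(...) is read here as slide notation, not as a PromQL function; the helper names are hypothetical.

```python
# Label sets copied from the http_request_sum samples above.
SERIES = [
    {"path": "/api/v1", "method": "GET"},
    {"path": "/api/v2", "method": "GET"},
    {"path": "/api/v2", "method": "DELETE"},
    {"path": "/api/v1", "method": "POST"},
    {"path": "/api/v2", "method": "POST"},
]


def cardinality_upper_bound(series):
    """Multiply the number of distinct values seen for each label."""
    values = {}
    for labels in series:
        for name, value in labels.items():
            values.setdefault(name, set()).add(value)
    bound = 1
    for distinct in values.values():
        bound *= len(distinct)
    return bound


def active_series(series):
    """Count the distinct label combinations that actually exist."""
    return len({tuple(sorted(labels.items())) for labels in series})


if __name__ == "__main__":
    print(cardinality_upper_bound(SERIES), active_series(SERIES))
```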
instance/pod labels enriched
instance, pod or both in our scrapers
http_request_duration_seconds_bucket
resolution rule of thumb:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-stack-kube-prom-k8s.rules
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |-
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{
            job="kubelet",
            metrics_path="/metrics/cadvisor",
            image!=""}[5m]
          )
        ) * on (cluster, namespace, pod)
        group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node)
          (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
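The on (...) group_left(node) part of the rule above is a many-to-one join: several container series share one pod, and the single kube_pod_info series for that pod contributes its node label. A simplified Python model of that join is sketched below. The function join_group_left, the sample label values, and the CPU figures are hypothetical, and the sketch leaves out the topk and max steps that the rule uses to make the right-hand side unique per pod.

```python
def join_group_left(lhs, rhs, on, extra_labels):
    """Many-to-one multiply: each lhs series picks up `extra_labels` from its rhs match."""
    index = {}
    for labels, value in rhs:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:
            raise ValueError("right-hand side must be unique per match group")
        index[key] = (labels, value)
    joined = []
    for labels, value in lhs:
        key = tuple(labels.get(name, "") for name in on)
        if key not in index:
            continue                                  # no pod info: series is dropped
        rhs_labels, rhs_value = index[key]
        merged = dict(labels)
        for name in extra_labels:
            merged[name] = rhs_labels.get(name, "")   # copy e.g. `node` onto the result
        joined.append((merged, value * rhs_value))
    return joined


# Hypothetical per-container CPU rates and the matching pod info series (value 1).
CPU = [
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "container": "app"}, 0.25),
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "container": "sidecar"}, 0.05),
]
POD_INFO = [
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "node": "node-a"}, 1.0),
]

if __name__ == "__main__":
    for labels, value in join_group_left(
        CPU, POD_INFO, on=["cluster", "namespace", "pod"], extra_labels=["node"]
    ):
        print(labels, value)
```

Multiplying by a series whose value is 1 leaves the CPU figure unchanged, so the join serves only to attach the node label.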
see prometheus/rules
you do not need to:
Things you can and should use or configure:
or steal