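In the previous post we covered using logs and traces together, which implemented most of what OpenTelemetry offers; only monitoring was left. This post picks up there and adds OpenTelemetry-based monitoring to round out our cloud-native observability platform.
Monitoring can mean many things: host monitoring, middleware monitoring, container monitoring, application monitoring, black-box monitoring, traffic monitoring, and so on. OpenTelemetry mainly handles monitoring of the applications we instrument, so I'll describe the setup I actually rolled out.
Our backend is all Java. We inject opentelemetry-javaagent into the business services to collect Trace and Log data. So how do we get Metrics? In the previous post, we configured the javaagent as follows: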
```yaml
- name: Options
  value: -javaagent:/tmp/opentelemetry-javaagent.jar -Dotel.resource.attributes=service.name=APPNAME
- name: OTEL_EXPORTER_OTLP_COMPRESSION
  value: gzip
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: http://otel-collector-headless.opentelemetry:4317
- name: OTEL_LOGS_EXPORTER
  value: otlp
```
We only need to add one more entry to that configuration:
```yaml
- name: OTEL_METRICS_EXPORTER
  value: prometheus
```
This gives the business application a Metrics endpoint, on port 9464 by default. A sample of the Metrics it returns:
```text
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.apache-httpclient-4.0",otel_scope_version="1.23.0-alpha"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.hikaricp-3.0",otel_scope_version="1.23.0-alpha"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.tomcat-7.0",otel_scope_version="1.23.0-alpha"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.exporters.otlp-grpc-okhttp"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.sdk.logs"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.runtime-metrics",otel_scope_version="1.23.0-alpha"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.http-url-connection",otel_scope_version="1.23.0-alpha"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.sdk.trace"} 1
```
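If 9464 clashes with something else in the container, the listen address of the Prometheus exporter can be overridden through the SDK autoconfiguration variables. The sketch below is not part of my actual setup; it just spells out the defaults as extra entries in the same env list, so you can see which knobs exist.

```yaml
# Sketch only: extra env entries alongside the ones shown earlier.
# The values are the documented defaults; change them only if needed.
- name: OTEL_METRICS_EXPORTER
  value: prometheus
- name: OTEL_EXPORTER_PROMETHEUS_PORT
  value: "9464"      # port the agent's built-in HTTP server listens on
- name: OTEL_EXPORTER_PROMETHEUS_HOST
  value: "0.0.0.0"   # bind address; must be reachable from Prometheus
```

If you do change the port, remember that the Service and ServiceMonitor ports have to change with it.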
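Once the Metrics are available, they need to be stored in a time-series database. Since we only have one cluster, I did not forward them through OpenTelemetry; instead, the Prometheus we deployed earlier scrapes them directly via a ServiceMonitor. Only after setting it up did I realize that sending them through OpenTelemetry would have been more convenient, because with direct scraping you have to handle service discovery yourself. It is not hard to implement, though. Here is how I did it:
First, expose the port on the business service's Service. For example, one of our services has the following Service: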
```yaml
apiVersion: v1
kind: Service
metadata:
  name: APPNAME
spec:
  internalTrafficPolicy: Cluster
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: http
  selector:
    app: healthcheck-mix-biz
  type: ClusterIP
```
After the change:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    opentelemetry: "on"
  name: APPNAME
spec:
  internalTrafficPolicy: Cluster
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: http
    - name: prometheus
      port: 9464
      protocol: TCP
      targetPort: 9464
  selector:
    app: healthcheck-mix-biz
  type: ClusterIP
```
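The opentelemetry: "on" label is used for auto-discovery, and the port named prometheus below it exposes the Metrics endpoint, which a ServiceMonitor can then hook into Prometheus. Here is the ServiceMonitor manifest: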
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jvm
  labels:
    app: jvm
spec:
  selector:
    matchLabels:
      opentelemetry: "on"
  endpoints:
    - port: prometheus
      path: /metrics
      interval: 30s
```
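One thing to watch: without a namespaceSelector, a ServiceMonitor only picks up Services in its own namespace. If the business services live in other namespaces, a variant along these lines may be needed. This is a sketch, not what I deployed; the monitoring namespace is an assumption, and the Prometheus resource's own serviceMonitorSelector still has to match this object's labels.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jvm
  namespace: monitoring      # assumption: wherever your Prometheus Operator looks
  labels:
    app: jvm
spec:
  namespaceSelector:
    any: true                # match labeled Services in every namespace
  selector:
    matchLabels:
      opentelemetry: "on"    # same discovery label as on the Service
  endpoints:
    - port: prometheus       # refers to the Service port *name*, not the number
      path: /metrics
      interval: 30s
```

To limit discovery to specific namespaces, namespaceSelector also accepts a matchNames list instead of any: true.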
Once that is applied, you can check the targets in the Prometheus web UI.
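For comparison, the route I did not take (pushing Metrics through the collector) avoids the per-Service discovery work. Roughly: set OTEL_METRICS_EXPORTER to otlp on the application so Metrics follow the same OTLP endpoint as Trace and Log, then let the collector expose a single Prometheus endpoint. The collector config below is a minimal sketch under those assumptions; port 8889 is an arbitrary choice, and your existing trace and log pipelines would sit alongside the metrics one.

```yaml
# Sketch of an OpenTelemetry Collector config fragment (not my running config).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # same port the javaagent already sends to

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # assumption: scrape port exposed by the collector

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

With this layout Prometheus only needs one scrape target, the collector itself, no matter how many business services are sending Metrics.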
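With the Metrics collected, the next step is a dashboard to display them. I asked a lot of experts and community groups, and unfortunately nobody has built a ready-made one yet, so I made a simple dashboard myself. It mainly monitors the individual JVM memory areas, plus GC activity. I'll keep digging into these Metrics and improve the dashboard over time.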
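Pitfalls
Alibaba Cloud OSS does not work as a storage backend for Tempo; Minio does.
Error: error copying block from local to remote backend: error writing object to s3 backend, object single-tenant/42b480df-2b1b-4a5e-a876-e2b531a26f24/bloom-0: Aws MultiChunkedEncoding is not supported.
I did not use Minio, though. I went with local storage plus a StorageClass, so the data sits on a cloud disk. It costs a bit more than OSS, but it's not my money. Hehe~~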
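For reference, the storage section of a Tempo config using the local backend looks roughly like this. It is a sketch rather than my exact file: the paths are placeholders, and it assumes the volume provisioned through the StorageClass is mounted at /var/tempo.

```yaml
# Sketch of the Tempo storage section with the local backend.
storage:
  trace:
    backend: local             # instead of s3, which is what failed against OSS
    wal:
      path: /var/tempo/wal     # assumption: on the mounted cloud disk
    local:
      path: /var/tempo/blocks  # assumption: on the mounted cloud disk
```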
Tip: ChatGPT was a huge help throughout this whole process of climbing out of pitfalls and saved me a lot of time.