Troubleshoot an out-of-memory EDOT Collector
If your EDOT Collector pods terminate with an `OOMKilled` status, this usually indicates sustained memory pressure or a memory leak introduced by a regression or a bug. You can use the Performance Profiler (`pprof`) extension to collect and analyze memory profiles, helping you identify the root cause of the issue.
These symptoms typically indicate that the EDOT Collector is experiencing a memory-related failure:
- EDOT Collector pod restarts with an `OOMKilled` status in Kubernetes.
- Memory usage steadily increases before the crash.
- The Collector's logs don't show clear errors before termination.
Turn on runtime profiling using the `pprof` extension, then gather memory heap profiles from the affected pod:
- **Enable `pprof` in the Collector**

  Edit the EDOT Collector DaemonSet configuration and include the `pprof` extension:

  ```yaml
  exporters:
    ...
  processors:
    ...
  receivers:
    ...
  extensions:
    pprof:
  service:
    extensions:
      - pprof
      - ...
    pipelines:
      metrics:
        receivers: [ ... ]
        processors: [ ... ]
        exporters: [ ... ]
  ```

  Restart the Collector after applying these changes. When the DaemonSet is redeployed, identify the pod that is being restarted.
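  To confirm the extension is active before a crash occurs, you can check that its HTTP endpoint responds. This is a minimal sketch, assuming the extension listens on its default address `localhost:1777` and using `<collector-pod-name>` as a placeholder for the affected pod:

  ```bash
  # Forward the pprof port from the Collector pod to your workstation.
  kubectl port-forward <collector-pod-name> 1777:1777

  # In a second terminal, request the pprof index; a page listing the
  # available profiles indicates that the extension is enabled.
  curl http://localhost:1777/debug/pprof/
  ```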
- **Access the affected pod and collect a heap dump**

  When a pod starts exhibiting high memory usage or restarts due to OOM, run the following to enter a debug shell:

  ```bash
  kubectl debug -it <collector-pod-name> --image=ubuntu:latest
  ```

  In the debug container:

  ```bash
  apt update
  apt install -y curl
  curl http://localhost:1777/debug/pprof/heap > heap.out
  ```
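  If a single heap snapshot isn't conclusive, the same endpoint can typically serve other profile types as well, since the extension generally exposes the standard Go `pprof` handlers. A short sketch, still from inside the debug container and assuming the default port:

  ```bash
  # Goroutine dump: useful for spotting leaked goroutines that hold memory.
  curl http://localhost:1777/debug/pprof/goroutine > goroutine.out

  # Allocation profile: cumulative allocations since the process started.
  curl http://localhost:1777/debug/pprof/allocs > allocs.out
  ```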
- **Copy the heap file from the pod**

  From your local machine, copy the heap file using:

  ```bash
  kubectl cp <collector-pod-name>:heap.out ./heap.out -c <debug-container-name>
  ```

  **Note:** Replace `<debug-container-name>` with the name assigned to the debug container. Without the `-c` flag, Kubernetes will show the list of available containers.
- **Convert the heap profile for analysis**

  You can now generate a visual representation, for example a PNG:

  ```bash
  go tool pprof -png heap.out > heap.png
  ```
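  If you prefer a text summary or an interactive view over a static image, `go tool pprof` supports other output modes. Two sketches using the same `heap.out` file:

  ```bash
  # Print the functions with the largest in-use memory directly in the terminal.
  go tool pprof -top heap.out

  # Launch the interactive web UI (graph, flame graph, source view) on port 8080.
  go tool pprof -http=:8080 heap.out
  ```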
To improve the effectiveness of memory diagnostics and reduce investigation time, consider the following:
- Collect multiple heap profiles over time (for example, every few minutes) to observe memory trends before the crash.
- Automate heap profile collection at intervals to observe trends over time, as shown in the sketch below.
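One way to automate the collection is a small loop that writes a timestamped heap profile every few minutes from inside the debug container. This is a minimal sketch; the endpoint, interval, and output directory are assumptions to adapt to your environment:

```bash
#!/usr/bin/env bash
# Collect a heap profile every 5 minutes until interrupted.
# Assumes the pprof extension listens on localhost:1777 (its default address).
set -euo pipefail

OUTPUT_DIR="./heap-profiles"
INTERVAL_SECONDS=300

mkdir -p "${OUTPUT_DIR}"

while true; do
  TIMESTAMP="$(date +%Y%m%dT%H%M%S)"
  curl -s "http://localhost:1777/debug/pprof/heap" > "${OUTPUT_DIR}/heap-${TIMESTAMP}.out"
  echo "Collected ${OUTPUT_DIR}/heap-${TIMESTAMP}.out"
  sleep "${INTERVAL_SECONDS}"
done
```

You can then copy the directory out with `kubectl cp` and compare consecutive snapshots with `go tool pprof -diff_base=<older>.out <newer>.out` to see which allocations grow between profiles.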