Troubleshoot an out-of-memory EDOT Collector
If your EDOT Collector pods terminate with an `OOMKilled` status, this usually indicates sustained memory pressure or a memory leak caused by a regression or bug. You can use the Performance Profiler (`pprof`) extension to collect and analyze memory profiles and identify the root cause.
These symptoms typically indicate that the EDOT Collector is experiencing a memory-related failure:
- EDOT Collector pod restarts with an `OOMKilled` status in Kubernetes.
- Memory usage steadily increases before the crash.
- The Collector's logs don't show clear errors before termination.
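
To confirm the first symptom, one option is to read the last termination reason from the pod status. The following is a minimal sketch: the helper name `last_termination_reason` and the `default` namespace fallback are illustrative assumptions, not part of EDOT.

```shell
#!/usr/bin/env bash
# Print the reason each container in a pod was last terminated.
# last_termination_reason is a hypothetical helper name used for illustration.
# A value of "OOMKilled" means the container exceeded its memory limit.
last_termination_reason() {
  local pod="$1"
  local namespace="${2:-default}"
  kubectl get pod "$pod" --namespace "$namespace" \
    --output jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}
```

For example, `last_termination_reason <collector-pod-name> <namespace>` prints nothing for a pod that has never been terminated.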
Turn on runtime profiling using the `pprof` extension and then gather heap memory profiles from the affected pod:
-
Enable `pprof` in the Collector
Edit the EDOT Collector DaemonSet configuration and include the `pprof` extension:

exporters:
  ...
processors:
  ...
receivers:
  ...
extensions:
  pprof:
service:
  extensions:
    - pprof
    - ...
  pipelines:
    metrics:
      receivers: [ ... ]
      processors: [ ... ]
      exporters: [ ... ]

Restart the Collector after applying these changes. Once the DaemonSet is redeployed, identify the pod that keeps restarting.
-
Access the affected pod and collect a heap dump
When a pod starts exhibiting high memory usage or restarts due to OOM, run the following to enter a debug shell:
kubectl debug -it <collector-pod-name> --image=ubuntu:latest
In the debug container:
apt update
apt install -y curl
curl http://localhost:1777/debug/pprof/heap > heap.out
-
Copy the heap file from the pod
From your local machine, copy the heap file using:
kubectl cp <collector-pod-name>:heap.out ./heap.out -c <debug-container-name>
Note: Replace `<debug-container-name>` with the name assigned to the debug container. Without the `-c` flag, Kubernetes will show the list of available containers.
-
Convert the heap profile for analysis
You can now generate a visual representation of the profile, for example as a PNG image:
go tool pprof -png heap.out > heap.png
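
Rendering a PNG requires Graphviz on the machine that runs `go tool pprof`. When two profiles from different points in time are available, a text report of the difference between them can be quicker to read. The sketch below assumes two heap profiles copied from the same pod. The helper name `heap_growth_report` is hypothetical, while `-top`, `-base`, and `-sample_index` are standard `pprof` options.

```shell
#!/usr/bin/env bash
# Print a text report of where in-use heap memory grew between two profiles.
# heap_growth_report is a hypothetical helper name; both arguments are heap
# profile files collected from the same Collector pod at different times.
heap_growth_report() {
  local early="$1"
  local late="$2"
  local file
  for file in "$early" "$late"; do
    # Refuse to run on a missing or empty file, which usually means a failed download.
    if [ ! -s "$file" ]; then
      echo "missing or empty profile: $file" >&2
      return 1
    fi
  done
  # -base subtracts the earlier profile from the later one, so the report shows
  # only the difference; -sample_index=inuse_space ranks entries by live bytes.
  go tool pprof -top -sample_index=inuse_space -base "$early" "$late"
}
```

Functions near the top of this report are the ones whose live allocations grew the most between the two captures, which makes them reasonable starting points when looking for a leak.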
To improve the effectiveness of memory diagnostics and reduce investigation time, consider the following:
- Collect multiple heap profiles over time (for example, every few minutes) to observe memory trends before the crash.
- Automate this collection so that profiles are captured at consistent intervals.
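
Periodic collection can be scripted with a small loop. This is a minimal sketch, assuming it runs inside the debug container where `curl` is installed and the `pprof` endpoint is reachable at `localhost:1777`. The helper name `collect_heap_profiles`, its argument order, and its defaults (10 profiles, 300 seconds apart) are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Collect a fixed number of heap profiles at a regular interval.
# collect_heap_profiles is a hypothetical helper; the defaults are illustrative
# and should be tuned to how quickly memory grows before the crash.
collect_heap_profiles() {
  local url="${1:-http://localhost:1777/debug/pprof/heap}"
  local count="${2:-10}"
  local interval_seconds="${3:-300}"
  local out_dir="${4:-.}"
  local i=1
  local name

  mkdir -p "$out_dir"
  while [ "$i" -le "$count" ]; do
    # Zero-padded index plus a UTC timestamp keeps files sorted in capture order.
    name="$(printf 'heap-%03d-%s.out' "$i" "$(date -u +%Y%m%dT%H%M%SZ)")"
    # --fail makes curl exit non-zero on HTTP errors instead of reporting success.
    if ! curl --silent --show-error --fail --output "$out_dir/$name" "$url"; then
      echo "failed to fetch profile $i from $url" >&2
    fi
    # Skip the final sleep so the function returns right after the last capture.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval_seconds"
    fi
    i=$((i + 1))
  done
}
```

For example, `collect_heap_profiles http://localhost:1777/debug/pprof/heap 6 120 ./profiles` attempts six captures two minutes apart. Because the pod may be killed at any point, copy the profiles off the pod as they accumulate rather than waiting for the loop to finish.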