Troubleshoot an out-of-memory EDOT Collector

Elastic Stack Serverless Observability

If your EDOT Collector pods terminate with an OOMKilled status, the cause is usually sustained memory pressure or a memory leak introduced by a regression or bug. You can use the Performance Profiler (pprof) extension to collect and analyze memory profiles and identify the root cause.

Symptoms

These symptoms typically indicate that the EDOT Collector is experiencing a memory-related failure:

EDOT Collector pod restarts with an OOMKilled status in Kubernetes.
Memory usage steadily increases before the crash.
The Collector's logs don't show clear errors before termination.

Resolution

Turn on runtime profiling using the pprof extension, then gather heap profiles from the affected pod:

Enable `pprof` in the Collector

Edit the EDOT Collector DaemonSet configuration and include the pprof extension:
```
exporters:
  ...
processors:
  ...
receivers:
  ...
extensions:
  pprof:

service:
  extensions:
    - pprof
    - ...
  pipelines:
    metrics:
      receivers: [ ... ]
      processors: [ ... ]
      exporters: [ ... ]
```
Restart the Collector after applying these changes. After the DaemonSet is redeployed, identify the pod that keeps restarting.
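
One way to spot the affected pod is to list the pods whose containers were last terminated with the reason OOMKilled. The following is a minimal sketch rather than an official procedure: the function name `find_oomkilled_pods` and its namespace argument are hypothetical, and it assumes `kubectl` is already configured for the cluster that runs the Collector.

```shell
# Print the names of pods in a namespace whose containers were last
# terminated with reason OOMKilled.
# find_oomkilled_pods is a hypothetical helper name, not part of EDOT.
find_oomkilled_pods() {
  local namespace="$1"
  # Emit one "<pod-name><TAB><last termination reasons>" line per pod,
  # then keep only the pods whose reasons include OOMKilled.
  kubectl get pods -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Call it as `find_oomkilled_pods <namespace>`, using the namespace where the Collector DaemonSet is deployed.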

Access the affected pod and collect a heap dump

When a pod starts exhibiting high memory usage or restarts due to OOM, run the following to enter a debug shell:
```
kubectl debug -it <collector-pod-name> --image=ubuntu:latest
```
In the debug container:
```
apt update && apt install -y curl
curl http://localhost:1777/debug/pprof/heap > heap.out
```

Copy the heap file from the pod

From your local machine, copy the heap file using:
```
kubectl cp <collector-pod-name>:heap.out ./heap.out -c <debug-container-name>
```
Note

Replace <debug-container-name> with the name assigned to the debug container. If you omit the -c flag, kubectl shows the list of available containers.

Convert the heap profile for analysis

You can now generate a visual representation of the profile, for example as a PNG image. This requires a local Go toolchain and Graphviz:
```
go tool pprof -png heap.out > heap.png
```
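
When an image is hard to read, or when more than one heap profile is available, a text summary or a comparison between two profiles can be easier to interpret. The sketch below wraps two `go tool pprof` invocations. The function names `heap_top` and `heap_growth` are hypothetical, and it assumes a local Go toolchain.

```shell
# Print the functions that hold the most in-use heap memory in one profile.
# heap_top and heap_growth are hypothetical helper names.
heap_top() {
  go tool pprof -top -sample_index=inuse_space "$1"
}

# Print what changed between an earlier profile ($1) and a later one ($2).
# Entries that grow between the two profiles are candidates for a leak.
heap_growth() {
  go tool pprof -top -sample_index=inuse_space -base "$1" "$2"
}
```

For example, `heap_growth heap-early.out heap-late.out` compares two profiles taken from the same pod, where both file names are placeholders.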

Best practices

To improve the effectiveness of memory diagnostics and reduce investigation time, consider the following:

Collect multiple heap profiles over time (for example, every few minutes) to observe memory trends before the crash.
Automate the collection, for example with a script or scheduled job, so that profiles are captured at consistent intervals without manual steps.
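
Periodic collection could be sketched as a small shell function run inside the debug container. This is an illustration under stated assumptions, not a supported tool: the function name `collect_heap_profiles`, the file naming scheme, and the count and interval arguments are hypothetical, and the default URL is the pprof endpoint used earlier on this page.

```shell
# Collect COUNT heap profiles, waiting INTERVAL seconds between them.
# Usage: collect_heap_profiles COUNT INTERVAL [URL]
# collect_heap_profiles is a hypothetical helper name.
collect_heap_profiles() {
  local count="$1"
  local interval="$2"
  local url="${3:-http://localhost:1777/debug/pprof/heap}"
  local i=1
  local out

  while [ "$i" -le "$count" ]; do
    # Timestamp plus sequence number keeps file names unique and sortable.
    out="heap-$(date +%Y%m%dT%H%M%S)-$i.out"
    if ! curl -sS --fail -o "$out" "$url"; then
      echo "heap profile $i could not be collected" >&2
    fi
    i=$((i + 1))
    # Skip the wait after the final profile.
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `collect_heap_profiles 10 180` writes ten profiles three minutes apart to the current directory, which you can then copy out of the pod with `kubectl cp`.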