Rebuilding a broken Zookeeper quorum
This article covers an advanced recovery method that involves directly modifying Zookeeper. This process can potentially corrupt your data. Elastic strongly recommends following these steps only after receiving confirmation from Elastic Support.
This article describes how to recover a broken Zookeeper leader or follower within Elastic Cloud Enterprise.
If an ECE director host’s Zookeeper status cannot be confirmed as healthy using the Verify Zookeeper sync status command or from Elastic Cloud Enterprise > Platform > Settings, you might need to recover Zookeeper.
This situation might surface when recovering the Elastic Cloud Enterprise director host from a full disk issue.
A healthy Zookeeper quorum returns a sync status similar to the following. Any other responses require further investigation.
$ # Zookeeper leader with id:10
$ echo mntr | nc 127.0.0.1 2191
zk_server_state leader
# ...
zk_followers 2
zk_synced_followers 2
$ # Zookeeper follower with id:11
$ echo mntr | nc 127.0.0.1 2192
zk_server_state follower
$ # Zookeeper follower with id:12
$ echo mntr | nc 127.0.0.1 2193
zk_server_state follower
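To check all three server states at once, a minimal loop sketch, assuming the same local client ports (2191-2193) as in the example above; adjust the ports to match your own hosts:
# query each local Zookeeper client port and print the fields relevant to quorum health
for port in 2191 2192 2193; do
  echo "--- port $port ---"
  echo mntr | nc 127.0.0.1 "$port" | grep -E 'zk_server_state|zk_followers|zk_synced_followers'
done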
Before recovering the Zookeeper leader or follower, back up the Zookeeper data directories of all Elastic Cloud Enterprise hosts. Normally this applies only to director hosts, but it may apply to other hosts during migrations.
Perform the following steps on each host to back up the Zookeeper data directory:
Extract the Zookeeper /data directory path:
docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
Make a copy or backup of the emitted directory. For example, if the data directory is /mnt/data/elastic/172.16.0.30/services/zookeeper/data, then run the following command:
cp -R /mnt/data/elastic/172.16.0.30/services/zookeeper/data /mnt/data/elastic/ZK_data_backup
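If you prefer a compressed, timestamped archive instead of a plain copy, a minimal sketch using the same example path:
# archive the data directory relative to its parent, with a timestamp in the file name
tar czf "/mnt/data/elastic/ZK_data_backup_$(date +%Y%m%d-%H%M%S).tar.gz" -C /mnt/data/elastic/172.16.0.30/services/zookeeper data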
If a Zookeeper quorum is broken, you need to identify the best Zookeeper leader candidate before you start the recovery process.
Collect the following information from all ECE director hosts that have Zookeeper containers running, including any recently created or decommissioned hosts. After you have gathered the information, contact Elastic Support to identify the best Zookeeper leader candidate.
# collect disk usage
find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec du -hs {} \;
# collect file status
find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec stat {} \;
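To make these outputs easier to share with Elastic Support, you can redirect them to files; a sketch with illustrative file names:
find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec du -hs {} \; > zk-disk-usage.txt 2>&1
find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec stat {} \; > zk-file-status.txt 2>&1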
Follow Run ECE diagnostics tool to collect ECE diagnostics.
Make sure to run the tool with the --disableApiCalls flag. Without this flag, ECE diagnostics might fail to run.
Command
./ece-diagnostics run --disableApiCalls
Sample response
elastic@my-ece-director-host1:~$ ./ece-diagnostics run --disableApiCalls
- Configuring ECE home folder
✓ found /mnt/data/elastic for runner 172.16.15.204
- Log file: /tmp/ecediag-172.16.15.204-20250404-080202.log
++ Created tar output: /tmp/ecediag-172.16.15.204-20250404-080202.tar.gz
⚠ skipping collection of ECE metricbeat data (took: 0s)
⚠ skipping collection of API information for ECE and Elasticsearch (took: 0s)
✓ collected information on certificates (took: 221ms)
✓ collected information on client-forwarder connectivity (took: 368ms)
✓ collected ZooKeeper stats (took: 8.391s)
✓ collected system information (took: 14.263s)
✓ collected Docker info and logs (took: 18.976s)
In the following recovery steps, the steps for the determined leader are marked with [leader], and the steps for all other Zookeepers are marked with [followers]. The [leader] should be recovered as needed before its [followers]. Steps marked [followers] should be performed on each follower director host, and steps marked [director] should be performed only on problematic director hosts.
To recover the Zookeeper leader, you should first try to restart the Docker Zookeeper container. Restarting the container is often enough to trigger the leader to resync its connection to its followers.
Within an SSH session on each Zookeeper host, run the following command:
docker restart frc-zookeeper-servers-zookeeper
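To confirm that the container came back up before checking the sync status, a quick sketch:
docker ps --filter name=frc-zookeeper-servers-zookeeper --format '{{.Names}} {{.Status}}'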
Wait a few minutes for the state to sync across the leader and followers, then verify the Zookeeper sync status to see whether the quorum has recovered.
If the Zookeeper leader is still not recovered, proceed to the next section.
If restarting the Zookeeper container does not recover the leader, you can manually set the leader and rebuild the quorum.
[followers]
Shut down the Docker Runner and Zookeeper containers:
docker stop frc-runners-runner
docker stop frc-zookeeper-servers-zookeeper
[leader]
Stop the Zookeeper service within the Docker container. Note that this stops the service inside the Docker container, not the Zookeeper Docker container itself:
docker exec -it frc-zookeeper-servers-zookeeper sv stop zookeeper
[leader]
Enter the Docker Zookeeper container and determine its Zookeeper ID:
$ docker exec -it frc-zookeeper-servers-zookeeper bash
root@XXXXX:/# cat /app/data/myid
10
[leader]
In the directory /app/managed/, modify the Zookeeper file replicated.cfg.dynamic:
- Remove the lines referencing other Zookeeper hosts.
- If multiple lines reference localhost, then remove all but the one containing the Zookeeper ID from the previous step.
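For illustration, a hypothetical replicated.cfg.dynamic before and after this edit, assuming the Zookeeper ID 10 from the previous step. The server lines, addresses, ports, and roles below are placeholders and will differ on your hosts:
# Before (hypothetical):
server.10=localhost:2898:3898:participant;2191
server.11=172.16.0.31:2898:3898:participant;2191
server.12=172.16.0.32:2898:3898:participant;2191
# After (only the line containing this host's Zookeeper ID remains):
server.10=localhost:2898:3898:participant;2191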
[leader]
Restart the Docker Zookeeper and Director containers:
docker restart frc-zookeeper-servers-zookeeper
docker restart frc-directors-director
[leader]
Check the Zookeeper sync status. The response should now show this director host as the Zookeeper leader.
Confirm that Elastic Cloud Enterprise is now also able to check the Zookeeper status and make changes.
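For example, a minimal check, assuming the leader's local client port from the earlier sample; the response should include zk_server_state leader:
echo mntr | nc 127.0.0.1 2191 | grep zk_server_state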
[followers]
Restart the Docker Zookeeper, Director, and Runner containers:
docker restart frc-zookeeper-servers-zookeeper
docker restart frc-directors-director
docker restart frc-runners-runner
Verify that the Zookeeper sync status reports an odd number for zk_quorum_size and that no Zookeeper hosts are marked as lost.
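For example, a minimal check on any Zookeeper host, assuming the local client port from the earlier sample:
echo mntr | nc 127.0.0.1 2191 | grep -E 'zk_quorum_size|zk_synced_followers'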
Zookeeper followers can sometimes refuse a [leader] election or end up with corrupted state. The following steps can be used to recover a broken or corrupted Zookeeper [follower]. Only consider these steps after confirming a Zookeeper leader, because the [follower] will be reset to copy its state from the [leader].
On the [follower], do the following:
Get the director host’s Zookeeper /data directory path:
docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
Stop the Docker Runner and Zookeeper containers:
docker stop frc-runners-runner
docker stop frc-zookeeper-servers-zookeeper
Under the determined /data directory, remove the sub-directory version-NUMBER, replacing the NUMBER placeholder:
/mnt/data/elastic/MY_IP/services/zookeeper/data$ rm -R ./version-NUMBER/
Make sure that the myid file exists and is retained.
Start the Runner container, which auto-starts the Docker Zookeeper container:
docker start frc-runners-runner
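As a quick sanity check for the myid step above, you can read the file directly; a sketch assuming the example placeholder path used earlier:
/mnt/data/elastic/MY_IP/services/zookeeper/data$ cat ./myid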
Wait a few minutes for Zookeeper states to sync. Then check the Zookeeper sync status to confirm the following:
zk_server_state follower
zk_outstanding_requests 0
Confirm that the [leader] recognizes the added [follower] by checking the Zookeeper sync status for an incremented zk_synced_followers count.
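For example, on the [leader] host, assuming the local client port from the earlier sample:
echo mntr | nc 127.0.0.1 2191 | grep zk_synced_followers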