Bug #22472

Updated by Joao Eduardo Luis over 3 years ago

After configuring the restful module (enabling it and creating a self-signed certificate), I tested how the MGR instances behave when the active one is marked as failed, and whether the restful module moves to the newly active node as it is supposed to. Marking the MGR instance as failed works fine, and the restful module also moved correctly the first time. After a few failovers (three for three MGR instances), however, the restful API became unresponsive. Restarting the MGR instance on the node where the MGR was active but the API did not respond resolved the issue for that node, but only for that one occurrence.

Investigating further, I found that port 8003, used by the restful API, stays open on the node whose MGR instance has been marked as failed. When that node becomes active again, the restful API is unresponsive. If I restart the MGR on a node that was active before but is currently not active, it stops listening on port 8003, and the next time that node becomes active the restful API responds properly.
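The stale-listener condition described above can be checked per node with a small filter over @netstat -tln@ style output. This is only a sketch: the function names @listeners_on_port@ and @classify_node@ are made up for illustration, the name of the active MGR has to be supplied by hand (for example from @ceph -s@), and the column layout assumed is that of net-tools @netstat@.

```shell
# Print the LISTEN lines for a TCP port from `netstat -tln`-style input on stdin.
# Assumes net-tools column layout: field 4 = local address, field 6 = state.
listeners_on_port() {
    awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$")'
}

# Classify this node. Arguments: local MGR name, active MGR name, port.
# Reads netstat output on stdin. All names here are hypothetical helpers.
classify_node() {
    local me="$1" active="$2" port="$3" hits
    hits=$(listeners_on_port "$port")
    if [ -n "$hits" ] && [ "$me" != "$active" ]; then
        echo "stale"                 # standby MGR still holds the port
    elif [ -n "$hits" ]; then
        echo "active-listening"      # active MGR holds the port
    elif [ "$me" = "$active" ]; then
        echo "active-not-listening"  # active MGR, but nothing on the port
    else
        echo "standby-clean"         # standby MGR, port released
    fi
}
```

On a node, this could be run as @netstat -tln | classify_node <local-name> <active-name> 8003@; a result of @stale@ would match the behaviour reported here.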

*Steps to reproduce*

# Enable the restful module and make sure it works properly with *two* MGR instances. The problem also reproduces with more instances, but it needs more attempts (one extra attempt for every additional MGR instance).
# Mark the active (first) instance as failed.
<pre>ceph mgr fail <who></pre>
# Check if the other MGR instance has become active.
<pre>ceph -s</pre>
# Check that the newly active MGR (second instance) is listening on port 8003.
<pre>netstat -tlpn | grep 8003</pre>
# Test if the restful API responds (should work).
<pre>curl -k https://<name>:8003</pre>
# Mark the second MGR node as failed.
<pre>ceph mgr fail <who></pre>
# Check if the API responds. This is where it stops working.
<pre>curl -k https://<name>:8003</pre>
<pre>
user@home ~ » time curl -k https://pn-ceph-1:8003
curl: (28) Operation timed out after 0 milliseconds with 0 out of 0 bytes received
curl -k https://pn-ceph-1:8003 0,12s user 0,02s system 0% cpu 5:00,53 total
</pre>
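
The fail-and-probe cycle from the steps above can be scripted roughly as follows. This is a sketch under assumptions, not a tested reproducer: the function names, the 15-second settle time and the 10-second curl timeout are arbitrary choices, and it assumes each MGR name is also a resolvable host name. The @CEPH@ and @CURL@ variables exist only so the commands can be swapped out.

```shell
# Commands are overridable so the functions can be exercised without a cluster.
CEPH="${CEPH:-ceph}"
CURL="${CURL:-curl}"

# Return 0 if the restful endpoint on the host answers within the timeout.
# The 10-second default is an arbitrary choice for this sketch.
api_responds() {
    local host="$1" timeout="${2:-10}"
    "$CURL" -k -s -o /dev/null --max-time "$timeout" "https://${host}:8003"
}

# Mark one MGR as failed, wait for the failover to settle, then probe the
# host expected to take over. Arguments: failed MGR, next host, wait seconds.
fail_and_probe() {
    local failed="$1" next="$2" wait="${3:-15}"
    "$CEPH" mgr fail "$failed" || return 2
    sleep "$wait"
    if api_responds "$next"; then
        echo "OK ${next}"
    else
        echo "UNRESPONSIVE ${next}"
        return 1
    fi
}
```

With two instances, alternating calls such as @fail_and_probe <first> <second>@ and then @fail_and_probe <second> <first>@ correspond to the manual steps; per the report, the second call is where @UNRESPONSIVE@ would be expected.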


Restarting the active MGR instance restores a working restful API. The restart helps whether it is done before or after the MGR role switches to the node being restarted.

<pre>systemctl restart ceph-mgr@<name>.service</pre>

*Software and versions*

(edited by Joao) downstream SUSE Linux Enterprise Server 12 SP3
ceph version 12.2.2-356-gb87ca3c12e (b87ca3c12e27c5950edb7970044c318365dd91a8) luminous (stable) - this should roughly be 12.2.2 plus a few patches (I will attempt reproducing this on an upstream version to pinpoint the failure)