
Bug #22472

restful module: Unresponsive restful API after marking MGR as failed

Added by Patrick Seidensal almost 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After configuring the restful module (enabling the module and creating a self-signed certificate), I tried to see how the MGR instances behave when marked as failed and what happens to the restful module, which is supposed to move to another node. Marking an MGR instance as failed works fine, and the restful module also moved correctly the first time. But after a few tries (three tries for three MGR instances), it stopped working and the restful API became unresponsive. Restarting the MGR instance on the node where the MGR was active but the API didn't respond resolved the issue for that node, for that occasion.

Investigating further, I found that port 8003 for the restful API stays open on the node where the MGR instance has been marked as failed. When that node becomes active again, the restful API becomes unresponsive. If I restart the MGR on a node which was active but currently isn't, it stops listening on port 8003 and, the next time that node becomes active, the restful API responds properly.

Steps to reproduce

  1. Configure the restful module so it works properly with two MGR instances. This also works with more nodes, but reproducing takes more attempts (one extra attempt for every additional MGR instance).
  2. Mark the active (first) instance as failed.
    ceph mgr fail <who>
  3. Check if the other MGR instance has become active.
    ceph -s
  4. Test if the moved MGR is listening on the port 8003 (second instance).
    netstat -tlpn | grep 8003
  5. Test if the restful API responds (should work).
    curl -k https://<name>:8003
  6. Mark the second MGR node as failed.
    ceph mgr fail <who>
  7. Check if the API responds. This is the place where it stops working.
    curl -k https://<name>:8003

    user@home ~ » time curl -k https://pn-ceph-1:8003                         130 ↵
    curl: (28) Operation timed out after 0 milliseconds with 0 out of 0 bytes received
    curl -k https://pn-ceph-1:8003  0,12s user 0,02s system 0% cpu 5:00,53 total
    user@home ~ » 
    
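The port-status checks in steps 4–7 can be sketched as a small remote probe. This is a minimal sketch, not part of the original report; the hostnames and port 8003 are taken from the report, and the probe simply tests whether a TCP connection succeeds:

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This mirrors the `netstat -tlpn | grep 8003` check from the remote
    side: if the restful module's socket is still open, the connect will
    succeed even when the HTTPS API itself no longer answers requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (hostnames from this report, run from a client machine):
# for host in ("pn-ceph-1", "pn-ceph-2"):
#     print(host, port_is_listening(host, 8003))
```

Note that a successful connect only shows the socket is open; to reproduce the hang in step 7 you still need the `curl -k https://<name>:8003` request, which times out even though the port accepts connections.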

Workaround

Restarting the active MGR instance restores a working restful API. This works whether the restart happens before or after the MGR fails over to the node being restarted.

systemctl restart ceph-mgr@<name>.service
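The workaround above can be wrapped in a small helper. This is a hypothetical sketch, assuming the script runs with sufficient privileges on the MGR node; the command runner is injectable so the logic can be exercised without systemd:

```python
import subprocess

def restart_mgr(name: str, run=subprocess.run) -> int:
    """Restart the ceph-mgr systemd unit for the given instance name.

    Builds the same command as the workaround in this report
    (`systemctl restart ceph-mgr@<name>.service`) and returns the
    command's exit code. `run` defaults to subprocess.run but can be
    replaced with a stub for testing.
    """
    result = run(["systemctl", "restart", f"ceph-mgr@{name}.service"])
    return result.returncode
```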

Software and versions

(edited by Joao) downstream ceph version 12.2.2-356-gb87ca3c12e (b87ca3c12e27c5950edb7970044c318365dd91a8) luminous (stable) - this should roughly be 12.2.2 plus a few patches (I will attempt reproducing this on an upstream version to pinpoint the failure)

History

#1 Updated by Joao Eduardo Luis almost 4 years ago

  • Description updated (diff)

#2 Updated by John Spray almost 4 years ago

When the system is in the bugged state, are there then two ceph-mgr processes running on the problematic node? It sounds like perhaps the original one (that was `fail`ed) didn't get torn down?

#3 Updated by Patrick Seidensal almost 4 years ago

John Spray wrote:

When the system is in the bugged state, are there then two ceph-mgr processes running on the problematic node? It sounds like perhaps the original one (that was `fail`ed) didn't get torn down?

No, it's just a single process. With `ps aux | grep mgr` I can clearly see that there is only one process running. That process may listen on multiple ports (as `netstat -tlpn | grep mgr` reveals), but it's the same PID listening on all of them.

But port 8003 stays open, although the MGR has been marked as failed. Using the listening port as an indicator, I've been able to predict on which node the API will fail to respond. If the MGR instance is restarted, it stops listening on port 8003 and won't fail when it is activated again. Port 8003 is never opened on an MGR instance which hasn't been active before, and such a node won't fail when it becomes active for the first time.
