
Bug #22472

restful module: Unresponsive restful API after marking MGR as failed

Added by Patrick Seidensal almost 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After configuring the restful module (enabling the module and creating a self-signed certificate), I tried to see how the MGR instances behave when marked as failed and what happens to the restful module, which is supposed to move to another node. Marking an MGR instance as failed works fine, and the restful module also moved correctly the first time. But after a few tries (three tries for three MGR instances), it stopped working and the restful API became unresponsive. Restarting the MGR instance on the node where the MGR was active but the API didn't respond resolved the issue for that node, for that occasion.

Investigating further, I found that port 8003 for the restful API stays open on the node where the MGR instance has been marked as failed. When that node becomes active again, the restful API becomes unresponsive. If I restart the MGR on a node which was active but currently isn't, it stops listening on port 8003 and, the next time that node becomes active, the restful API responds properly.

Steps to reproduce

  1. Configure the restful module so it works properly with two MGR instances. This also works with more nodes, but reproducing takes more attempts (one extra attempt for every additional MGR instance).
  2. Mark the active (first) instance as failed.
    ceph mgr fail <who>
  3. Check if the other MGR instance has become active.
    ceph -s
  4. Test if the moved MGR is listening on the port 8003 (second instance).
    netstat -tlpn | grep 8003
  5. Test if the restful API responds (should work).
    curl -k https://<name>:8003
  6. Mark the second MGR node as failed.
    ceph mgr fail <who>
  7. Check if the API responds. This is the place where it stops working.
    curl -k https://<name>:8003

    user@home ~ » time curl -k https://pn-ceph-1:8003                         130 ↵
    curl: (28) Operation timed out after 0 milliseconds with 0 out of 0 bytes received
    curl -k https://pn-ceph-1:8003  0,12s user 0,02s system 0% cpu 5:00,53 total
    user@home ~ » 
    
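The port-status checks in steps 4–7 can be sketched as a small remote probe. This is a minimal sketch, not part of the original report; the hostnames and port 8003 are taken from the report, and the probe simply tests whether a TCP connection succeeds:

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This mirrors the `netstat -tlpn | grep 8003` check from the remote
    side: if the restful module's socket is still open, the connect will
    succeed even when the HTTPS API itself no longer answers requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (hostnames from this report, run from a client machine):
# for host in ("pn-ceph-1", "pn-ceph-2"):
#     print(host, port_is_listening(host, 8003))
```

Note that a successful connect only shows the socket is open; to reproduce the hang in step 7 you still need the `curl -k https://<name>:8003` request, which times out even though the port accepts connections.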

Workaround

Restarting the active MGR instance restores a working restful API. This works whether the restart happens before or after the MGR fails over to the node being restarted.

systemctl restart ceph-mgr@<name>.service
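The workaround above can be wrapped in a small helper. This is a hypothetical sketch, assuming the script runs with sufficient privileges on the MGR node; the command runner is injectable so the logic can be exercised without systemd:

```python
import subprocess

def restart_mgr(name: str, run=subprocess.run) -> int:
    """Restart the ceph-mgr systemd unit for the given instance name.

    Builds the same command as the workaround in this report
    (`systemctl restart ceph-mgr@<name>.service`) and returns the
    command's exit code. `run` defaults to subprocess.run but can be
    replaced with a stub for testing.
    """
    result = run(["systemctl", "restart", f"ceph-mgr@{name}.service"])
    return result.returncode
```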

Software and versions

(edited by Joao) downstream ceph version 12.2.2-356-gb87ca3c12e (b87ca3c12e27c5950edb7970044c318365dd91a8) luminous (stable) - this should roughly be 12.2.2 plus a few patches (I will attempt reproducing this on an upstream version to pinpoint the failure)

History

#1 Updated by Joao Eduardo Luis almost 4 years ago

  • Description updated (diff)

#2 Updated by John Spray almost 4 years ago

When the system is in the bugged state, are there then two ceph-mgr processes running on the problematic node? It sounds like perhaps the original one (that was `fail`ed) didn't get torn down?

#3 Updated by Patrick Seidensal almost 4 years ago

John Spray wrote:

When the system is in the bugged state, are there then two ceph-mgr processes running on the problematic node? It sounds like perhaps the original one (that was `fail`ed) didn't get torn down?

No, it's just a single process. With `ps aux | grep mgr` I can clearly see that there is only one process running. That process may listen on multiple ports (as `netstat -tlpn | grep mgr` reveals), but it's the same PID listening on all of them.

But port 8003 stays open, although the MGR has been marked as failed. Using the listening port as an indicator, I've been able to predict on which node the API will fail to respond. If the MGR instance is restarted, it stops listening on port 8003 and won't fail when it is activated again. Port 8003 is never opened on an MGR instance which hasn't been active before, and such a node won't fail when it becomes active for the first time.
