Bug #51027

monmap drops rebooted mon if deployed via label

Added by Harry Coin almost 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In Ceph Pacific 16.2.4, I assigned 5 mons in the normal way via 'ceph orch apply mon label:mon', using docker images on the hosts carrying the 'mon' tag.

When I reboot any host, the monmap drops the mon deployed on that host, though it remains in systemd and the dashboard lists it as 'running'. Further 'ceph orch apply..' commands do not re-launch the mon on the host. The 'mon' tag is still listed on that host.

If I remove the tag, wait for the dashboard to remove the mon (listed as 'running', though not in the monmap), then re-add the tag to that host, the mon redeploys, operates normally, and is listed in the monmap.

I've repeated this about 5 times now; it happens regardless of which host is rebooted.
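
Roughly, the deploy-and-check sequence looks like this (host noc4 is just an example; 'ceph orch host label add' is the standard way to attach the label):

Tag the hosts and deploy mons by label:
ceph orch host label add noc4 mon
ceph orch apply mon label:mon

Check both views, then reboot the host and check again:
ceph mon dump
ceph orch ps

After the reboot, mon.noc4 is gone from the 'ceph mon dump' output even though the daemon is still present on the host.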

sphil_before - prior to reboot, mon has label, and in monmap (70.5 KB) Harry Coin, 06/08/2021 03:49 PM

sphil_after - after noc4 reboot, mon has label but not in monmap (69.9 KB) Harry Coin, 06/08/2021 03:49 PM


Related issues

Related to Orchestrator - Bug #50272: cephadm: after downsizing mon service from 5 to 3 daemons, cephadm reports "stray" daemons New
Related to Orchestrator - Bug #53033: cephadm removes MONs during upgrade 15.2.14 > 16.2.6 which leads to failed quorum and broken cluster New

History

#1 Updated by Neha Ojha almost 3 years ago

  • Project changed from Ceph to Orchestrator

#2 Updated by Sebastian Wagner almost 3 years ago

  • Status changed from New to Need More Info

can you run https://gist.github.com/sebastian-philipp/8e18f4815e90dc0f51fe3fbff8c8aae5 and attach the result? Also having the monmap before and after would be helpful.

#3 Updated by Harry Coin almost 3 years ago

Yes, and the results are attached. This is a little sandbox system in a workshop, 4 of 5 hosts running osds, 5 of 5 hosts running mons.

This is 100% repeatable and very easy to reproduce on your own: just assign the mon hosts the tag 'mon', then do ceph orch apply mon label:mon, wait for it all to sync up, reboot one of them, notice the monmap has dropped the rebooted system, and notice that on the rebooted system the dashboard lists the mon as having 'stopped'.

To recover, delete the 'mon' tag from the host, notice that the mon listed as 'stopped' is then removed from the host (the reduced monmap hasn't changed), add the 'mon' tag back to the host, and notice the restoration of operations (both monmap and running container) as they were prior to the mon host's reboot.
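
In orchestrator commands, that recovery corresponds roughly to the following (again using noc4 as the example host):

Remove the label and wait for the stopped mon to be removed from the host:
ceph orch host label rm noc4 mon
ceph orch ps

Re-add the label and confirm the mon redeploys and rejoins the monmap:
ceph orch host label add noc4 mon
ceph mon dump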

In the case I ran for you just now, the relevant syslog entries after rebooting a host (it does not matter which mon-running host gets rebooted) are:
...
Jun 8 10:28:55 noc4 bash10918: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Jun 8 10:28:55 noc4 bash10918: * File Read Latency Histogram By Level [default] *
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.048+0000 7f7f9ff4e700 0 mon.noc4 does not exist in monmap, will attempt to join an existing cluster
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.048+0000 7f7f9ff4e700 0 using public_addr v2:[fc00:1002:c7::44]:0/0 -> [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0]
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.048+0000 7f7f9ff4e700 0 starting mon.noc4 rank -1 at public addrs [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0] at bind addrs [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0] mon_data /var/lib/ceph/mon/ceph-noc4 fsid 4067126d-01cb-40af-824a-881c130140f8
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.052+0000 7f7f9ff4e700 1 mon.noc4@-1(?) e64 preinit fsid 4067126d-01cb-40af-824a-881c130140f8
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.052+0000 7f7f9ff4e700 -1 mon.noc4@-1(???) e64 not in monmap and have been in a quorum before; must have been removed
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.052+0000 7f7f9ff4e700 -1 mon.noc4@-1(???) e64 commit suicide!
Jun 8 10:28:55 noc4 bash10918: debug 2021-06-08T15:28:55.052+0000 7f7f9ff4e700 -1 failed to initialize
Jun 8 10:28:55 noc4 dockerd1457: time="2021-06-08T10:28:55.127175846-05:00" level=info msg="ignoring event" container=b1b05c4f42153526d5a924e6870cc8a0a79c1bbfc3eb2d220395de2f38f6ba45 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...

But of course, the mon label is still there; it was never removed.

To capture the output, I wrapped your script like this:

{
your script
} 2>&1 | tee sphil_$1

To address the (misleading) fix suggested by the log entry complaining that cephadm lacked root access on the hosts, I ran:
chown cephadm /etc/ceph/ceph.client.admin.keyring
The proper keys were in the authorized_keys file for cephadm, /root/.ssh/authorized_keys, all along.

Uploaded the results from before and after rebooting host noc4, which had a running mon docker container before the reboot but not after.

#4 Updated by Harry Coin almost 3 years ago

P.S. It might be a good idea to think of a better debug log message phrase than 'commit suicide'.

#5 Updated by Loïc Dachary over 2 years ago

  • Target version deleted (v16.2.5)

#6 Updated by Harry Coin over 2 years ago

I think it's a mistake to put this in the 'orchestrator' problem list, because I think the logic that decides whether a mon should commit suicide lives in the mon -- and it doesn't consider whether the mon exists in the monmap because of a label. So it removes itself improperly when it finds it's not in 'the monmap' -- except it ought to be in the monmap.

The failure description:
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.066+0000 7f516385d700 0 mon.noc1 does not exist in monmap, will attempt to join an existing cluster
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.066+0000 7f516385d700 0 using public_addr v2:[fc00:1002:c7::41]:0/0 -> [v2:[fc00:1002:c7::41]:3300/0,v1:[fc00:1002:c7::41]:6789/0]
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.070+0000 7f516385d700 0 starting mon.noc1 rank -1 at public addrs [v2:[fc00:1002:c7::41]:3300/0,v1:[fc00:1002:c7::41]:6789/0] at bind addr>
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.070+0000 7f516385d700 1 mon.noc1@-1(?) e88 preinit fsid 4067126d-01cb-40af-824a-881c130140f8
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.074+0000 7f516385d700 -1 mon.noc1@-1(???) e88 not in monmap and have been in a quorum before; must have been removed
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.074+0000 7f516385d700 -1 mon.noc1@-1(???) e88 commit suicide!
Jul 15 08:40:25 noc1.1.quietfountain.com bash193661: debug 2021-07-15T13:40:25.074+0000 7f516385d700 -1 failed to initialize
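
A quick way to cross-check that it really is the cluster-side monmap that dropped the daemon (rather than only the dashboard view) is to dump the current map from a surviving mon, or to save and print a copy:

ceph mon dump
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap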

#7 Updated by Harry Coin over 2 years ago

Still a problem in Pacific 16.2.5. This pretty much makes 'assignment of mons by label' useless, since the mon is lost upon host reboot.

#8 Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #50272: cephadm: after downsizing mon service from 5 to 3 daemons, cephadm reports "stray" daemons added

#9 Updated by Sebastian Wagner over 2 years ago

  • Assignee set to Adam King
  • Priority changed from Normal to High

#10 Updated by Adam King over 2 years ago

  • Status changed from Need More Info to In Progress
  • Pull request ID set to 42690

#11 Updated by David Orman over 2 years ago

We can confirm this impacts 16.2.5 clusters. On host failures/reboots, we have to undeploy/redeploy monitors, which is quite dangerous when considering some of the potential failure scenarios.

#12 Updated by Stefan Fleischmann over 2 years ago

Is there any workaround for this other than redeploying? As David said, this is dangerous. We had quite some trouble recovering after a hardware failure and some unexpected reboots.

#13 Updated by Harry Coin over 2 years ago

If you want to use the label deployment feature: not that I was able to find. It's a real problem, and it's been allowed to sit out there a long time. This is one of the reasons folks avoid the whole 'container and orchestrator drama'. How was it even possible that the testing routines didn't notice 'hey, you lose a monitor on reboot' before release?

#14 Updated by Cory Snyder over 2 years ago

  • Backport set to pacific

#15 Updated by Cory Snyder over 2 years ago

  • Status changed from In Progress to Pending Backport

#16 Updated by Sebastian Wagner over 2 years ago

  • Status changed from Pending Backport to Resolved

#17 Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #53033: cephadm removes MONs during upgrade 15.2.14 > 16.2.6 which leads to failed quorum and broken cluster added

#18 Updated by Thomas Roth almost 2 years ago

Interesting that this was changed to Resolved 8 months ago - we have a test cluster installed with 16.2.7 from the start, and this behaviour is still there!

lxmon1:~# cephadm bootstrap --mon-ip 10.20.2.161

Use user cephadm and distribute key:
lxmon1:~# cp /etc/ceph/ceph.pub /var/lib/cephadm/.ssh/authorized_keys
lxmon1:~# scp /etc/ceph/ceph.pub lxmon2:/var/lib/cephadm/.ssh/authorized_keys
lxmon1:~# ceph cephadm set-user cephadm

Add mon

lxmon1:~# ceph orch host add lxmon2 10.20.2.162

Add OSDs, pools, radosgw, cephfs...

Reboot lxmon1 - all hosts Offline!

lxmon1:~# ceph orch host ls
HOST      ADDR         LABELS  STATUS   
lxmon1  10.20.2.161  _admin  Offline  
lxmon2  10.20.2.162  _admin  Offline  
...

This is ridiculous.
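
For what it's worth, a couple of basic checks (assuming an admin keyring is present on the host, e.g. via the _admin label) would show whether the cluster itself still has quorum after the reboot:

lxmon2:~# ceph -s
lxmon2:~# ceph mon dump
lxmon2:~# ceph orch ps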
