Bug #57462: cephadm removes config & keyring files in mid flight
Status: Closed
Description
During testing, cephadm spontaneously removes the ceph.conf and keyring files from the admin/mon node and creates new ones later on. :-O
There's no better way to describe it. Mid-test we see cephadm remove the files and replace them at a later time; comparing the file timestamps against those on the other nodes proves it. As a result, the stat poller we have running suddenly can't execute ceph commands ...
2022/09/03-23:32:47: Completed warp hybrid phase 4 ...
2022/09/03-23:32:47: Stopping warp clients
f28-h29-000-r630 | CHANGED | rc=0 >>
f28-h26-000-r630 | CHANGED | rc=0 >>
f28-h27-000-r630 | CHANGED | rc=0 >>
f28-h25-000-r630 | CHANGED | rc=0 >>
f28-h28-000-r630 | CHANGED | rc=0 >>
f28-h30-000-r630 | CHANGED | rc=0 >>
did not load config file, using default settings.
2022-09-03T23:32:59.040+0000 1509640dd500 -1 Errors while parsing config file!
2022-09-03T23:32:59.040+0000 1509640dd500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.050+0000 1509640dd500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.050+0000 1509640dd500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 Errors while parsing config file!
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
2022/09/03-23:32:59: RESHARD Queue Length 0
2022/09/03-23:32:59: RESHARD List
./runtest.sh: line 175: 830707 Terminated ./Utils/warp.sh hybrid 4 $LOGFILE
2022/09/03-23:33:05: FINISH: warp phase 4 (w33)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:05:
2022/09/03-23:33:05: Stopped POLL bkgrd process
./Utils/functions.shinc: line 3: 578241 Terminated Utils/poll.sh 300 ${LOGFILE}
2022/09/03-23:33:05: ** Waiting for cleanPGs == totalPGs
did not load config file, using default settings.
2022-09-03T23:33:05.431+0000 150e1726c500 -1 Errors while parsing config file!
2022-09-03T23:33:05.431+0000 150e1726c500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.432+0000 150e1726c500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.432+0000 150e1726c500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 Errors while parsing config file!
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
var.shinc: No access and/or secret key found for user johndoe
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03:23:33:05: ** POLLCEPH started
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Utils/pollceph.sh: line 40: [: : integer expression expected
2022/09/03:23:33:05: ** Recovery completed: POLLCEPH ending
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:06:
2022/09/03-23:33:06: ** Cleanup END: Recovery complete
LOGFINAL: ./RESULTS/w33_runtest.sh_20220903-004937.log
22/09/03 23:33 stop vmstat.sh ...
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
cp: cannot stat '/var/log/ceph//ceph.log*': No such file or directory
22/09/03 23:33
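The failure mode above, where every ceph invocation errors out because /etc/ceph/ceph.conf has vanished, can at least be detected up front in a polling harness. A minimal sketch of such a guard, assuming a hypothetical helper rather than the actual poll.sh from this test suite:

```shell
#!/bin/sh
# Hypothetical guard for a stats poller: skip the poll cycle with one clear
# message when the client config or admin keyring is missing, instead of
# letting every ceph command fail with "can't open ceph.conf".

CEPH_CONF="${CEPH_CONF:-/etc/ceph/ceph.conf}"

ceph_conf_present() {
    # True (exit 0) only when both the conf file and the admin keyring
    # sitting next to it are readable.
    [ -r "$CEPH_CONF" ] && [ -r "${CEPH_CONF%/*}/ceph.client.admin.keyring" ]
}

if ceph_conf_present; then
    : # safe to run ceph/rados commands for this poll cycle
else
    echo "$(date '+%Y/%m/%d-%H:%M:%S'): config/keyring missing, skipping poll" >&2
fi
```

This would have turned the wall of `conf_read_file` errors above into a single timestamped line per poll cycle, making the window in which the files were missing easy to read off the harness log.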
But after some time the files return and the cluster status is OK.
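The timestamp comparison that shows the files were recreated can be done with stat. A sketch, assuming GNU coreutils; the per-node loop (with the hostnames from this report) is shown as a comment because it needs ssh access to the cluster:

```shell
# Print a file's mtime in epoch seconds, or -1 if the file is absent.
# Comparing this value for /etc/ceph/ceph.conf across nodes shows whether
# one copy was recreated (much newer mtime) while the others kept theirs.
mtime_of() {
    if [ -e "$1" ]; then
        stat -c '%Y' "$1"   # GNU stat; BSD stat uses `stat -f %m` instead
    else
        echo "-1"
    fi
}

# Example per-node comparison (requires ssh access; hostnames from this report):
#   for h in f28-h28-000-r630 f28-h29-000-r630 f28-h30-000-r630; do
#       echo "$h: $(ssh "$h" stat -c '%Y' /etc/ceph/ceph.conf)"
#   done
```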
From ceph.log ...
2022-09-03T23:31:56.668089+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42313 : cluster [DBG] pgmap v42368: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:31:58.673808+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42314 : cluster [DBG] pgmap v42369: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:26:33.437471+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65480 : cluster [INF] osd.0 marked down after no beacon for 40141.200950 seconds
2022-09-03T23:26:33.437557+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65481 : cluster [INF] osd.1 marked down after no beacon for 40282.610400 seconds
2022-09-03T23:26:33.437601+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65482 : cluster [INF] osd.2 marked down after no beacon for 40291.889797 seconds
2022-09-03T23:26:33.437624+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65483 : cluster [INF] osd.3 marked down after no beacon for 40201.475581 seconds
2022-09-03T23:26:33.437647+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65484 : cluster [INF] osd.4 marked down after no beacon for 40272.289077 seconds
2022-09-03T23:26:33.437687+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65485 : cluster [INF] osd.5 marked down after no beacon for 40251.696393 seconds
2022-09-03T23:26:33.437712+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65486 : cluster [INF] osd.6 marked down after no beacon for 40245.079861 seconds
2022-09-03T23:26:33.437736+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65487 : cluster [INF] osd.7 marked down after no beacon for 40227.439177 seconds
2022-09-03T23:26:33.437760+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65488 : cluster [INF] osd.8 marked down after no beacon for 40140.826010 seconds
2022-09-03T23:26:33.437780+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65489 : cluster [INF] osd.9 marked down after no beacon for 40276.810130 seconds
2022-09-03T23:26:33.437799+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65490 : cluster [INF] osd.10 marked down after no beacon for 40078.375459 seconds
2022-09-03T23:26:33.437819+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65491 : cluster [INF] osd.11 marked down after no beacon for 40099.579699 seconds
2022-09-03T23:26:33.518161+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65492 : cluster [INF] osd.12 marked down after no beacon for 40069.888989 seconds
2022-09-03T23:26:33.518191+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65493 : cluster [INF] osd.13 marked down after no beacon for 40049.501219 seconds
...
2022-09-03T23:26:35.794124+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65612 : cluster [INF] osd.132 marked down after no beacon for 40280.963709 seconds
2022-09-03T23:26:35.794140+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65613 : cluster [INF] osd.133 marked down after no beacon for 40226.728449 seconds
2022-09-03T23:26:35.794158+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65614 : cluster [INF] osd.134 marked down after no beacon for 40057.138960 seconds
2022-09-03T23:26:38.884332+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65615 : cluster [WRN] Health check failed: 135 osds down (OSD_DOWN)
2022-09-03T23:26:43.082515+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65616 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:44.358075+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65617 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:26:47.542828+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65618 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:26:50.652453+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65619 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:50.652576+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65620 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:26:50.652584+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65621 : cluster [WRN] host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:26:50.652604+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65622 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:31:49.643433+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65623 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.168119+0000 mon.f28-h30-000-r630 (mon.2) 7992 : cluster [INF] mon.f28-h30-000-r630 calling monitor election
2022-09-03T23:32:00.179658+0000 mon.f28-h29-000-r630 (mon.1) 14376 : cluster [INF] mon.f28-h29-000-r630 calling monitor election
2022-09-03T23:32:00.216508+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65624 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.280824+0000 mon.f28-h30-000-r630 (mon.2) 7993 : cluster [WRN] message from mon.0 was stamped 0.050454s in the future, clocks not synchronized
2022-09-03T23:32:00.282018+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65625 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.311858+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65626 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:32:00.333933+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65627 : cluster [DBG] monmap e3: 3 mons at {f28-h28-000-r630.rdu2.scalelab.redhat.com=[v2:172.16.45.51:3300/0,v1:172.16.45.51:6789/0],f28-h29-000-r630=[v2:172.16.44.230:3300/0,v1:172.16.44.230:6789/0],f28-h30-000-r630=[v2:172.16.45.102:3300/0,v1:172.16.45.102:6789/0]}
2022-09-03T23:32:00.334022+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65628 : cluster [DBG] fsmap
2022-09-03T23:32:00.334580+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65629 : cluster [DBG] osdmap e1996: 192 total, 192 up, 192 in
2022-09-03T23:32:00.365333+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65630 : cluster [DBG] mgrmap e39: f28-h30-000-r630.upsejc(active, since 22h), standbys: f28-h28-000-r630.rdu2.scalelab.redhat.com.hvzphf, f28-h29-000-r630.oojtqu
2022-09-03T23:32:00.374767+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65631 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum f28-h29-000-r630,f28-h30-000-r630)
2022-09-03T23:32:00.427578+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65632 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.427613+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65633 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:32:00.427633+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65634 : cluster [WRN] host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:32:00.427640+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65635 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.706332+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42315 : cluster [DBG] pgmap v42370: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:02.712912+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42316 : cluster [DBG] pgmap v42371: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:04.742078+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42317 : cluster [DBG] pgmap v42372: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:05.806855+0000 mon.f28-h30-000-r630 (mon.2) 7994 : cluster [WRN] message from mon.0 was stamped 0.050398s in the future, clocks not synchronized
2022-09-03T23:32:06.751954+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42318 : cluster [DBG] pgmap v42373: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:08.559491+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42324 : cluster [DBG] pgmap v42374: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:09.610791+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65641 : cluster [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
2022-09-03T23:32:10.587091+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42325 : cluster [DBG] pgmap v42375: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:12.596987+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42326 : cluster [DBG] pgmap v42376: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, ...
The cephadm.log showed exactly what we described, but I inadvertently deleted it. :-/
There was a previous bug where we couldn't use the hostname for the admin/MON1 node in the host_spec file; the workaround was to use the IP of the admin/mon in the addr: field instead. I had recently gone back to using the hostname after seeing that bug resolved, but that became suspect once the above started occurring. After reverting to the bootstrap IP in the host spec file instead of the hostname, this situation is no longer observed.
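For concreteness, a sketch of what the worked-around host_spec entry would look like for the admin/mon node, using the bootstrap IP seen for mon.0 in the logs above. The entry is hypothetical (the actual spec file from this test setup is not in the report), and the label is illustrative:

```yaml
# Hypothetical host_spec entry; the key point is that addr: holds the
# bootstrap IP (172.16.45.51 for mon.0 in the logs above), not a hostname.
service_type: host
hostname: f28-h28-000-r630.rdu2.scalelab.redhat.com
addr: 172.16.45.51
labels:
  - _admin   # illustrative; cephadm keeps ceph.conf + admin keyring on _admin hosts
```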
Updated by Tim Wilkinson over 1 year ago
Bug 2024301 was the issue referenced above ... https://bugzilla.redhat.com/show_bug.cgi?id=2024301
Updated by Adam King over 1 year ago
- Related to Bug #56696: admin keyring disappears during qa run added
Updated by Adam King over 1 year ago
- Backport set to quincy, pacific
- Pull request ID set to 48074
Updated by Adam King over 1 year ago
- Status changed from New to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #58207: quincy: cephadm removes config & keyring files in mid flight added
Updated by Backport Bot over 1 year ago
- Copied to Backport #58208: pacific: cephadm removes config & keyring files in mid flight added
Updated by Adam King about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Voja Molani 9 months ago
It is definitely better in 17.2.6, I think, compared to earlier 17.2.x releases. Before this PR I had at least six occurrences of the keyring (and sometimes ceph.conf) going missing across 2 clusters.
But just now I had a 17.2.6 cluster MGR node lose its keyring, somewhere between `ceph orch host maintenance enter`, a reboot, and `ceph orch host maintenance exit`.
As used to happen so often on 17.2.5, the `ceph orch host maintenance exit` command took forever and seemed to hang. Since it was taking so long, the first thing I did was check /etc/ceph/, and sure enough the keyring file was missing. Eventually the `ceph orch host maintenance exit` command errored out due to the missing keyring; I put the keyring file back, re-ran `ceph orch host maintenance exit`, and now all services on the node are back again.
But the linked PR 48074 definitely does not completely fix this issue, or there are further issues present with similar symptoms.
Updated by Voja Molani 9 months ago
The second and third MGR nodes in the cluster were rebooted with the exact same process; after the reboot, `ceph orch host maintenance exit` didn't work because of a missing keyring, exactly like on the first MGR node. This seems to be a very repeatable problem so far: 3 out of 3 attempts failed with a lost keyring.