Bug #57462: cephadm removes config & keyring files in mid flight
Status: Closed
Description
During testing, cephadm spontaneously removes the ceph.conf and keyring files from the admin/mon node and creates new ones later on. :-O
There's no better way to describe it. Mid-test we see cephadm remove the files and replace them at a later time; comparing the file timestamps against those on the other nodes proves it. As a result, the stat poller we have running suddenly can't execute ceph commands ...
2022/09/03-23:32:47: Completed warp hybrid phase 4 ...
2022/09/03-23:32:47: Stopping warp clients
f28-h29-000-r630 | CHANGED | rc=0 >>
f28-h26-000-r630 | CHANGED | rc=0 >>
f28-h27-000-r630 | CHANGED | rc=0 >>
f28-h25-000-r630 | CHANGED | rc=0 >>
f28-h28-000-r630 | CHANGED | rc=0 >>
f28-h30-000-r630 | CHANGED | rc=0 >>
did not load config file, using default settings.
2022-09-03T23:32:59.040+0000 1509640dd500 -1 Errors while parsing config file!
2022-09-03T23:32:59.040+0000 1509640dd500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.050+0000 1509640dd500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.050+0000 1509640dd500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 Errors while parsing config file!
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
2022/09/03-23:32:59: RESHARD Queue Length 0
2022/09/03-23:32:59: RESHARD List
./runtest.sh: line 175: 830707 Terminated ./Utils/warp.sh hybrid 4 $LOGFILE
2022/09/03-23:33:05: FINISH: warp phase 4 (w33)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:05:
2022/09/03-23:33:05: Stopped POLL bkgrd process
./Utils/functions.shinc: line 3: 578241 Terminated Utils/poll.sh 300 ${LOGFILE}
2022/09/03-23:33:05: ** Waiting for cleanPGs == totalPGs
did not load config file, using default settings.
2022-09-03T23:33:05.431+0000 150e1726c500 -1 Errors while parsing config file!
2022-09-03T23:33:05.431+0000 150e1726c500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.432+0000 150e1726c500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.432+0000 150e1726c500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 Errors while parsing config file!
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
var.shinc: No access and/or secret key found for user johndoe
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03:23:33:05: ** POLLCEPH started
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Utils/pollceph.sh: line 40: [: : integer expression expected
2022/09/03:23:33:05: ** Recovery completed: POLLCEPH ending
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:06:
2022/09/03-23:33:06: ** Cleanup END: Recovery complete
LOGFINAL: ./RESULTS/w33_runtest.sh_20220903-004937.log
22/09/03 23:33 stop vmstat.sh ...
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
cp: cannot stat '/var/log/ceph//ceph.log*': No such file or directory
22/09/03 23:33
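The failure mode above, where every ceph invocation errors out because /etc/ceph/ceph.conf has vanished, can at least be detected up front in a polling harness. A minimal sketch of such a guard, assuming a hypothetical helper rather than the actual poll.sh from this test suite:

```shell
#!/bin/sh
# Hypothetical guard for a stats poller: skip the poll cycle with one clear
# message when the client config or admin keyring is missing, instead of
# letting every ceph command fail with "can't open ceph.conf".

CEPH_CONF="${CEPH_CONF:-/etc/ceph/ceph.conf}"

ceph_conf_present() {
    # True (exit 0) only when both the conf file and the admin keyring
    # sitting next to it are readable.
    [ -r "$CEPH_CONF" ] && [ -r "${CEPH_CONF%/*}/ceph.client.admin.keyring" ]
}

if ceph_conf_present; then
    : # safe to run ceph/rados commands for this poll cycle
else
    echo "$(date '+%Y/%m/%d-%H:%M:%S'): config/keyring missing, skipping poll" >&2
fi
```

This would have turned the wall of `conf_read_file` errors above into a single timestamped line per poll cycle, making the window in which the files were missing easy to read off the harness log.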
But after some time the files return and the cluster status is OK.
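The timestamp comparison that shows the files were recreated can be done with stat. A sketch, assuming GNU coreutils; the per-node loop (with the hostnames from this report) is shown as a comment because it needs ssh access to the cluster:

```shell
# Print a file's mtime in epoch seconds, or -1 if the file is absent.
# Comparing this value for /etc/ceph/ceph.conf across nodes shows whether
# one copy was recreated (much newer mtime) while the others kept theirs.
mtime_of() {
    if [ -e "$1" ]; then
        stat -c '%Y' "$1"   # GNU stat; BSD stat uses `stat -f %m` instead
    else
        echo "-1"
    fi
}

# Example per-node comparison (requires ssh access; hostnames from this report):
#   for h in f28-h28-000-r630 f28-h29-000-r630 f28-h30-000-r630; do
#       echo "$h: $(ssh "$h" stat -c '%Y' /etc/ceph/ceph.conf)"
#   done
```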
From ceph.log ...
2022-09-03T23:31:56.668089+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42313 : cluster [DBG] pgmap v42368: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:31:58.673808+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42314 : cluster [DBG] pgmap v42369: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:26:33.437471+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65480 : cluster [INF] osd.0 marked down after no beacon for 40141.200950 seconds
2022-09-03T23:26:33.437557+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65481 : cluster [INF] osd.1 marked down after no beacon for 40282.610400 seconds
2022-09-03T23:26:33.437601+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65482 : cluster [INF] osd.2 marked down after no beacon for 40291.889797 seconds
2022-09-03T23:26:33.437624+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65483 : cluster [INF] osd.3 marked down after no beacon for 40201.475581 seconds
2022-09-03T23:26:33.437647+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65484 : cluster [INF] osd.4 marked down after no beacon for 40272.289077 seconds
2022-09-03T23:26:33.437687+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65485 : cluster [INF] osd.5 marked down after no beacon for 40251.696393 seconds
2022-09-03T23:26:33.437712+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65486 : cluster [INF] osd.6 marked down after no beacon for 40245.079861 seconds
2022-09-03T23:26:33.437736+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65487 : cluster [INF] osd.7 marked down after no beacon for 40227.439177 seconds
2022-09-03T23:26:33.437760+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65488 : cluster [INF] osd.8 marked down after no beacon for 40140.826010 seconds
2022-09-03T23:26:33.437780+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65489 : cluster [INF] osd.9 marked down after no beacon for 40276.810130 seconds
2022-09-03T23:26:33.437799+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65490 : cluster [INF] osd.10 marked down after no beacon for 40078.375459 seconds
2022-09-03T23:26:33.437819+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65491 : cluster [INF] osd.11 marked down after no beacon for 40099.579699 seconds
2022-09-03T23:26:33.518161+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65492 : cluster [INF] osd.12 marked down after no beacon for 40069.888989 seconds
2022-09-03T23:26:33.518191+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65493 : cluster [INF] osd.13 marked down after no beacon for 40049.501219 seconds
...
2022-09-03T23:26:35.794124+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65612 : cluster [INF] osd.132 marked down after no beacon for 40280.963709 seconds
2022-09-03T23:26:35.794140+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65613 : cluster [INF] osd.133 marked down after no beacon for 40226.728449 seconds
2022-09-03T23:26:35.794158+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65614 : cluster [INF] osd.134 marked down after no beacon for 40057.138960 seconds
2022-09-03T23:26:38.884332+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65615 : cluster [WRN] Health check failed: 135 osds down (OSD_DOWN)
2022-09-03T23:26:43.082515+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65616 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:44.358075+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65617 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:26:47.542828+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65618 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:26:50.652453+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65619 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:50.652576+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65620 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:26:50.652584+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65621 : cluster [WRN] host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:26:50.652604+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65622 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:31:49.643433+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65623 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.168119+0000 mon.f28-h30-000-r630 (mon.2) 7992 : cluster [INF] mon.f28-h30-000-r630 calling monitor election
2022-09-03T23:32:00.179658+0000 mon.f28-h29-000-r630 (mon.1) 14376 : cluster [INF] mon.f28-h29-000-r630 calling monitor election
2022-09-03T23:32:00.216508+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65624 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.280824+0000 mon.f28-h30-000-r630 (mon.2) 7993 : cluster [WRN] message from mon.0 was stamped 0.050454s in the future, clocks not synchronized
2022-09-03T23:32:00.282018+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65625 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.311858+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65626 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:32:00.333933+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65627 : cluster [DBG] monmap e3: 3 mons at {f28-h28-000-r630.rdu2.scalelab.redhat.com=[v2:172.16.45.51:3300/0,v1:172.16.45.51:6789/0],f28-h29-000-r630=[v2:172.16.44.230:3300/0,v1:172.16.44.230:6789/0],f28-h30-000-r630=[v2:172.16.45.102:3300/0,v1:172.16.45.102:6789/0]}
2022-09-03T23:32:00.334022+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65628 : cluster [DBG] fsmap
2022-09-03T23:32:00.334580+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65629 : cluster [DBG] osdmap e1996: 192 total, 192 up, 192 in
2022-09-03T23:32:00.365333+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65630 : cluster [DBG] mgrmap e39: f28-h30-000-r630.upsejc(active, since 22h), standbys: f28-h28-000-r630.rdu2.scalelab.redhat.com.hvzphf, f28-h29-000-r630.oojtqu
2022-09-03T23:32:00.374767+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65631 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum f28-h29-000-r630,f28-h30-000-r630)
2022-09-03T23:32:00.427578+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65632 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.427613+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65633 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:32:00.427633+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65634 : cluster [WRN] host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:32:00.427640+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65635 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.706332+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42315 : cluster [DBG] pgmap v42370: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:02.712912+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42316 : cluster [DBG] pgmap v42371: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:04.742078+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42317 : cluster [DBG] pgmap v42372: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:05.806855+0000 mon.f28-h30-000-r630 (mon.2) 7994 : cluster [WRN] message from mon.0 was stamped 0.050398s in the future, clocks not synchronized
2022-09-03T23:32:06.751954+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42318 : cluster [DBG] pgmap v42373: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:08.559491+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42324 : cluster [DBG] pgmap v42374: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:09.610791+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65641 : cluster [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
2022-09-03T23:32:10.587091+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42325 : cluster [DBG] pgmap v42375: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:12.596987+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42326 : cluster [DBG] pgmap v42376: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, ...
The cephadm.log showed exactly what we described, but I inadvertently deleted it. :-/
There was a previous bug where we couldn't use the hostname for the admin/MON1 node in the host_spec file; the workaround was to use the IP of the admin/mon in the addr: field instead. I had recently gone back to using the hostname after seeing that bug resolved, but that became suspect once the above started occurring. After reverting to the bootstrap IP in the host spec file instead of the hostname, this situation is no longer observed.
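For concreteness, a sketch of what the worked-around host_spec entry would look like for the admin/mon node, using the bootstrap IP seen for mon.0 in the logs above. The entry is hypothetical (the actual spec file from this test setup is not in the report), and the label is illustrative:

```yaml
# Hypothetical host_spec entry; the key point is that addr: holds the
# bootstrap IP (172.16.45.51 for mon.0 in the logs above), not a hostname.
service_type: host
hostname: f28-h28-000-r630.rdu2.scalelab.redhat.com
addr: 172.16.45.51
labels:
  - _admin   # illustrative; cephadm keeps ceph.conf + admin keyring on _admin hosts
```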
Updated by Tim Wilkinson over 1 year ago
Bug 2024301 was the issue referenced above ... https://bugzilla.redhat.com/show_bug.cgi?id=2024301
Updated by Adam King over 1 year ago
- Related to Bug #56696: admin keyring disappears during qa run added
Updated by Adam King over 1 year ago
- Backport set to quincy, pacific
- Pull request ID set to 48074
Updated by Adam King over 1 year ago
- Status changed from New to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #58207: quincy: cephadm removes config & keyring files in mid flight added
Updated by Backport Bot over 1 year ago
- Copied to Backport #58208: pacific: cephadm removes config & keyring files in mid flight added
Updated by Adam King about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Voja Molani 9 months ago
It is definitely better in 17.2.6, I think, compared to earlier 17.2.x releases. Before this PR I had at least six occurrences of the keyring (and sometimes ceph.conf) going missing across 2 clusters.
But just now I had a 17.2.6 cluster MGR node lose its keyring, somewhere between `ceph orch host maintenance enter`, a reboot, and `ceph orch host maintenance exit`.
As used to happen so often on 17.2.5, the `ceph orch host maintenance exit` command took forever and seemed to hang. Since it was taking so long, the first thing I did was check /etc/ceph/, and sure enough the keyring file was missing. Eventually the `ceph orch host maintenance exit` command errored out due to the missing keyring; I put the keyring file back, re-ran `ceph orch host maintenance exit`, and now all services on the node are back again.
But the linked PR 48074 definitely does not completely fix this issue, or there are further issues present with similar symptoms.
Updated by Voja Molani 9 months ago
The second and third MGR nodes in the cluster were rebooted with the exact same process; after the reboot, `ceph orch host maintenance exit` didn't work because of a missing keyring, exactly like on the first MGR node. This seems to be a very repeatable problem so far: 3 out of 3 attempts failed with a lost keyring.