Bug #57462

cephadm removes config & keyring files in mid flight

Added by Tim Wilkinson 5 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During testing, cephadm spontaneously removes the ceph.conf and keyring files from the admin/mon node and creates new ones later on. :-O

There's no better way to describe it: mid-test, we see cephadm removing the files and replacing them at a later time. The timestamps on the files prove this when compared against the other nodes. As a result, the stat poller we have running suddenly can't execute ceph commands ...

2022/09/03-23:32:47: Completed warp hybrid phase 4 ...
2022/09/03-23:32:47: Stopping warp clients
f28-h29-000-r630 | CHANGED | rc=0 >>

f28-h26-000-r630 | CHANGED | rc=0 >>

f28-h27-000-r630 | CHANGED | rc=0 >>

f28-h25-000-r630 | CHANGED | rc=0 >>

f28-h28-000-r630 | CHANGED | rc=0 >>

f28-h30-000-r630 | CHANGED | rc=0 >>

did not load config file, using default settings.
2022-09-03T23:32:59.040+0000 1509640dd500 -1 Errors while parsing config file!
2022-09-03T23:32:59.040+0000 1509640dd500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.050+0000 1509640dd500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.050+0000 1509640dd500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 Errors while parsing config file!
2022-09-03T23:32:59.076+0000 154db2bd3500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:32:59.077+0000 154db2bd3500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
2022/09/03-23:32:59: RESHARD Queue Length 0
2022/09/03-23:32:59: RESHARD List 
./runtest.sh: line 175: 830707 Terminated              ./Utils/warp.sh hybrid 4 $LOGFILE
2022/09/03-23:33:05: FINISH: warp phase 4 (w33)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:05: 
2022/09/03-23:33:05: Stopped POLL bkgrd process
./Utils/functions.shinc: line 3: 578241 Terminated              Utils/poll.sh 300 ${LOGFILE}
2022/09/03-23:33:05: ** Waiting for cleanPGs == totalPGs
did not load config file, using default settings.
2022-09-03T23:33:05.431+0000 150e1726c500 -1 Errors while parsing config file!
2022-09-03T23:33:05.431+0000 150e1726c500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.432+0000 150e1726c500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.432+0000 150e1726c500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
did not load config file, using default settings.
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 Errors while parsing config file!
2022-09-03T23:33:05.495+0000 14e9551cb500 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 failed for service _ceph-mon._tcp
2022-09-03T23:33:05.496+0000 14e9551cb500 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
var.shinc: No access and/or secret key found for user johndoe
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03:23:33:05: ** POLLCEPH started
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Utils/pollceph.sh: line 40: [: : integer expression expected
2022/09/03:23:33:05: ** Recovery completed: POLLCEPH ending
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
2022/09/03-23:33:06: 
2022/09/03-23:33:06: ** Cleanup END: Recovery complete
LOGFINAL: ./RESULTS/w33_runtest.sh_20220903-004937.log
22/09/03 23:33  stop vmstat.sh ...
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
cp: cannot stat '/var/log/ceph//ceph.log*': No such file or directory
22/09/03 23:33

But after some time the files return and the cluster status is OK.

From ceph.log ...

2022-09-03T23:31:56.668089+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42313 : cluster [DBG] pgmap v42368: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:31:58.673808+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42314 : cluster [DBG] pgmap v42369: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:26:33.437471+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65480 : cluster [INF] osd.0 marked down after no beacon for 40141.200950 seconds
2022-09-03T23:26:33.437557+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65481 : cluster [INF] osd.1 marked down after no beacon for 40282.610400 seconds
2022-09-03T23:26:33.437601+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65482 : cluster [INF] osd.2 marked down after no beacon for 40291.889797 seconds
2022-09-03T23:26:33.437624+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65483 : cluster [INF] osd.3 marked down after no beacon for 40201.475581 seconds
2022-09-03T23:26:33.437647+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65484 : cluster [INF] osd.4 marked down after no beacon for 40272.289077 seconds
2022-09-03T23:26:33.437687+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65485 : cluster [INF] osd.5 marked down after no beacon for 40251.696393 seconds
2022-09-03T23:26:33.437712+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65486 : cluster [INF] osd.6 marked down after no beacon for 40245.079861 seconds
2022-09-03T23:26:33.437736+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65487 : cluster [INF] osd.7 marked down after no beacon for 40227.439177 seconds
2022-09-03T23:26:33.437760+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65488 : cluster [INF] osd.8 marked down after no beacon for 40140.826010 seconds
2022-09-03T23:26:33.437780+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65489 : cluster [INF] osd.9 marked down after no beacon for 40276.810130 seconds
2022-09-03T23:26:33.437799+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65490 : cluster [INF] osd.10 marked down after no beacon for 40078.375459 seconds
2022-09-03T23:26:33.437819+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65491 : cluster [INF] osd.11 marked down after no beacon for 40099.579699 seconds
2022-09-03T23:26:33.518161+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65492 : cluster [INF] osd.12 marked down after no beacon for 40069.888989 seconds
2022-09-03T23:26:33.518191+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65493 : cluster [INF] osd.13 marked down after no beacon for 40049.501219 seconds
 ...
2022-09-03T23:26:35.794124+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65612 : cluster [INF] osd.132 marked down after no beacon for 40280.963709 seconds
2022-09-03T23:26:35.794140+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65613 : cluster [INF] osd.133 marked down after no beacon for 40226.728449 seconds
2022-09-03T23:26:35.794158+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65614 : cluster [INF] osd.134 marked down after no beacon for 40057.138960 seconds
2022-09-03T23:26:38.884332+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65615 : cluster [WRN] Health check failed: 135 osds down (OSD_DOWN)
2022-09-03T23:26:43.082515+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65616 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:44.358075+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65617 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:26:47.542828+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65618 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:26:50.652453+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65619 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:26:50.652576+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65620 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:26:50.652584+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65621 : cluster [WRN]     host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:26:50.652604+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65622 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:31:49.643433+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65623 : cluster [WRN] overall HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.168119+0000 mon.f28-h30-000-r630 (mon.2) 7992 : cluster [INF] mon.f28-h30-000-r630 calling monitor election
2022-09-03T23:32:00.179658+0000 mon.f28-h29-000-r630 (mon.1) 14376 : cluster [INF] mon.f28-h29-000-r630 calling monitor election
2022-09-03T23:32:00.216508+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65624 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.280824+0000 mon.f28-h30-000-r630 (mon.2) 7993 : cluster [WRN] message from mon.0 was stamped 0.050454s in the future, clocks not synchronized
2022-09-03T23:32:00.282018+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65625 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com calling monitor election
2022-09-03T23:32:00.311858+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65626 : cluster [INF] mon.f28-h28-000-r630.rdu2.scalelab.redhat.com is new leader, mons f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630 in quorum (ranks 0,1,2)
2022-09-03T23:32:00.333933+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65627 : cluster [DBG] monmap e3: 3 mons at {f28-h28-000-r630.rdu2.scalelab.redhat.com=[v2:172.16.45.51:3300/0,v1:172.16.45.51:6789/0],f28-h29-000-r630=[v2:172.16.44.230:3300/0,v1:172.16.44.230:6789/0],f28-h30-000-r630=[v2:172.16.45.102:3300/0,v1:172.16.45.102:6789/0]}
2022-09-03T23:32:00.334022+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65628 : cluster [DBG] fsmap
2022-09-03T23:32:00.334580+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65629 : cluster [DBG] osdmap e1996: 192 total, 192 up, 192 in
2022-09-03T23:32:00.365333+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65630 : cluster [DBG] mgrmap e39: f28-h30-000-r630.upsejc(active, since 22h), standbys: f28-h28-000-r630.rdu2.scalelab.redhat.com.hvzphf, f28-h29-000-r630.oojtqu
2022-09-03T23:32:00.374767+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65631 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum f28-h29-000-r630,f28-h30-000-r630)
2022-09-03T23:32:00.427578+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65632 : cluster [WRN] Health detail: HEALTH_WARN failed to probe daemons or devices; noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.427613+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65633 : cluster [WRN] [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
2022-09-03T23:32:00.427633+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65634 : cluster [WRN]     host f28-h28-000-r630.rdu2.scalelab.redhat.com `cephadm gather-facts` failed: Unable to reach remote host f28-h28-000-r630.rdu2.scalelab.redhat.com.
2022-09-03T23:32:00.427640+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65635 : cluster [WRN] [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
2022-09-03T23:32:00.706332+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42315 : cluster [DBG] pgmap v42370: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:02.712912+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42316 : cluster [DBG] pgmap v42371: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:04.742078+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42317 : cluster [DBG] pgmap v42372: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:05.806855+0000 mon.f28-h30-000-r630 (mon.2) 7994 : cluster [WRN] message from mon.0 was stamped 0.050398s in the future, clocks not synchronized
2022-09-03T23:32:06.751954+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42318 : cluster [DBG] pgmap v42373: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:08.559491+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42324 : cluster [DBG] pgmap v42374: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:09.610791+0000 mon.f28-h28-000-r630.rdu2.scalelab.redhat.com (mon.0) 65641 : cluster [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
2022-09-03T23:32:10.587091+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42325 : cluster [DBG] pgmap v42375: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used, 334 TiB / 355 TiB avail
2022-09-03T23:32:12.596987+0000 mgr.f28-h30-000-r630.upsejc (mgr.14282) 42326 : cluster [DBG] pgmap v42376: 4769 pgs: 4769 active+clean; 8.6 TiB data, 21 TiB used,
  ...

The cephadm.log showed exactly what we described, but I inadvertently deleted it. :-/

There was a previous bug where we couldn't use the hostname for the admin/MON1 node in the host_spec file. The workaround was to use the IP of the admin/mon node in the addr: field instead. I had recently gone back to using the hostname after seeing that bug resolved, but that change became suspect once the behavior above started occurring. After reverting to the bootstrap IP in the host spec file instead of the hostname, this situation is no longer observed.
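For reference, the workaround amounts to pinning the admin/mon host to its IP in the cephadm host spec. A minimal sketch of such an entry (the IP matches the mon.0 address from the monmap in the log above; the label is illustrative of how cephadm typically marks the host that receives the client config and keyring):

```yaml
service_type: host
hostname: f28-h28-000-r630.rdu2.scalelab.redhat.com
addr: 172.16.45.51        # bootstrap IP instead of the hostname (workaround)
labels:
  - _admin                # host where cephadm maintains ceph.conf + admin keyring
```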


Related issues

Related to Orchestrator - Bug #56696: admin keyring disappears during qa run Resolved
Copied to Orchestrator - Backport #58207: quincy: cephadm removes config & keyring files in mid flight In Progress
Copied to Orchestrator - Backport #58208: pacific: cephadm removes config & keyring files in mid flight In Progress

History

#1 Updated by Tim Wilkinson 5 months ago

Bug 2024301 was the issue referenced above ... https://bugzilla.redhat.com/show_bug.cgi?id=2024301

#2 Updated by Adam King 5 months ago

  • Assignee set to Adam King

#3 Updated by Adam King 5 months ago

  • Related to Bug #56696: admin keyring disappears during qa run added

#4 Updated by Adam King about 2 months ago

  • Backport set to quincy, pacific
  • Pull request ID set to 48074

#5 Updated by Adam King about 2 months ago

  • Status changed from New to Pending Backport

#6 Updated by Backport Bot about 2 months ago

  • Copied to Backport #58207: quincy: cephadm removes config & keyring files in mid flight added

#7 Updated by Backport Bot about 2 months ago

  • Copied to Backport #58208: pacific: cephadm removes config & keyring files in mid flight added

#8 Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed
