Bug #46764
Status: Closed
cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs
Description
This started happening about two weeks ago; before that, deployments were reliable. I've only seen it in single-node deployments.
It manifests like this:
1. bootstrap (succeeds)
2. populate service_spec.yaml for OSD deployment (placement -> hosts -> HOSTNAME_OF_THE_NODE)
3. OSD deployment gets stuck
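For reference, the service spec from step 2 looks like this (reconstructed from the log below; "master" is the single node's hostname in this test environment):

```
---
service_type: osd
service_id: sesdev_osd_deployment
placement:
  hosts:
    - 'master'
data_devices:
  all: true
```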
This happens in sesdev's automated testing. Most of the time the deployment succeeds, but roughly 20% of the time (?) it fails due to this bug. When it happens, the deployment script produces the following output:
master: ++ rm -f /root/service_spec_core.yml
master: ++ touch /root/service_spec_core.yml
master: ++ cat
master: ++ ceph orch device ls --refresh
master: ++ cat /root/service_spec_core.yml
master: ---
master: service_type: osd
master: service_id: sesdev_osd_deployment
master: placement:
master:   hosts:
master:     - 'master'
master: data_devices:
master:   all: true
master: ++ ceph orch apply -i /root/service_spec_core.yml
master: Scheduled osd.sesdev_osd_deployment update...
master: ++ EXPECTED_NUMBER_OF_OSDS=4
master: ++ set +x
master: OSDs in cluster (actual/expected): 0/4 (900 seconds to timeout)
master: OSDs in cluster (actual/expected): 0/4 (890 seconds to timeout)
master: OSDs in cluster (actual/expected): 0/4 (880 seconds to timeout)
[... identical line repeated every 10 seconds, always 0/4, down to the timeout ...]
master: OSDs in cluster (actual/expected): 0/4 (20 seconds to timeout)
master: OSDs in cluster (actual/expected): 0/4 (10 seconds to timeout)
master: ++ ceph status
master:   cluster:
master:     id:     8e162d92-d1b5-11ea-9ebc-525400a23639
master:     health: HEALTH_WARN
master:             Reduced data availability: 1 pg inactive
master:             OSD count 0 < osd_pool_default_size 3
master:
master:   services:
master:     mon: 1 daemons, quorum master (age 17m)
master:     mgr: master.yexoxq(active, since 16m)
master:     osd: 0 osds: 0 up, 0 in
master:
master:   data:
master:     pools:   1 pools, 1 pgs
master:     objects: 0 objects, 0 B
master:     usage:   0 B used, 0 B / 0 B avail
master:     pgs:     100.000% pgs unknown
master:              1 unknown
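The countdown lines above come from a polling loop in the deployment script. A minimal sketch of that pattern (names are illustrative; `get_osd_count` stands in for however the real script queries the cluster, e.g. by parsing `ceph osd ls`):

```shell
# Poll until the cluster reports the expected number of OSDs, or give up
# after the timeout. Prints one status line per iteration, matching the
# log output above.
wait_for_osds() {
    expected=$1
    timeout=$2      # total seconds before giving up
    interval=10     # seconds between polls
    while [ "$timeout" -gt 0 ]; do
        actual=$(get_osd_count)
        echo "OSDs in cluster (actual/expected): ${actual}/${expected} (${timeout} seconds to timeout)"
        [ "$actual" -ge "$expected" ] && return 0
        sleep "$interval"
        timeout=$((timeout - interval))
    done
    return 1
}
```

In the failing runs, the count stays at 0 for the full 900 seconds, so the loop exhausts its timeout and the script dumps `ceph status`.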
Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0 added
Updated by Sebastian Wagner almost 4 years ago
- Status changed from New to Need More Info
Thanks for the report. I'll need more info to resolve this; the MGR log might be helpful.
Updated by Nathan Cutler almost 4 years ago
- Status changed from Need More Info to Duplicate
Updated by Nathan Cutler almost 4 years ago
- Related to deleted (Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0)
Updated by Nathan Cutler almost 4 years ago
- Is duplicate of Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0 added
Updated by Nathan Cutler almost 4 years ago
Though I cannot access the MGR log, I think it's safe to assume it would look like the one in #46990.
Updated by Joshua Schmid over 3 years ago
Could you include the mgr log the next time you see this issue?
Updated by Nathan Cutler over 3 years ago
- Status changed from Duplicate to New
This issue was "fixed" by inserting a one-minute grace period between
(a) the completion of "cephadm bootstrap"
and
(b) the issuance of the "ceph orch apply" command to deploy the OSDs.
I don't think this one-minute grace period should be necessary, but it is (or at least it was a couple of months ago). I will try to reproduce the issue with the latest code.
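The workaround amounts to the following sequence (a sketch only; `MON_IP` and the spec path are illustrative, and the real deployment script wraps these calls in more logic):

```shell
# Workaround sketch: insert a grace period between (a) bootstrap and
# (b) the "ceph orch apply" that deploys the OSDs.
deploy_osds_with_grace_period() {
    cephadm bootstrap --mon-ip "$MON_IP"
    sleep 60   # one-minute grace period before issuing orchestration commands
    ceph orch apply -i /root/service_spec_core.yml
}
```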
Updated by Nathan Cutler over 3 years ago
- Status changed from New to Can't reproduce
No longer reproducible.