Bug #46764

cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs

Added by Nathan Cutler 6 months ago. Updated 3 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

This started happening about two weeks ago; I had not seen it before then. So far I have seen it only in single-node deployments.

It manifests like this:

1. bootstrap (succeeds)
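2. populate the service spec file for OSD deployment (placement -> hosts -> HOSTNAME_OF_THE_NODE) and apply it with "ceph orch apply -i"
3. OSD deployment gets stuck: no OSDs are created before the deployment script times out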

This happens in the automated testing of sesdev. Most of the time the deployment succeeds, but in roughly 20% of runs (my estimate) it fails due to this bug. When it does, the deployment script produces the following output, with the OSD count staying at 0/4 for the full 900-second timeout:

    master: ++ rm -f /root/service_spec_core.yml
    master: ++ touch /root/service_spec_core.yml
    master: ++ cat
    master: ++ ceph orch device ls --refresh
    master: ++ cat /root/service_spec_core.yml
    master: ---
    master: service_type: osd
    master: service_id: sesdev_osd_deployment
    master: placement:
    master:     hosts:
    master:         - 'master'
    master: data_devices:
    master:     all: true
    master: ++ ceph orch apply -i /root/service_spec_core.yml
    master: Scheduled osd.sesdev_osd_deployment update...
    master: ++ EXPECTED_NUMBER_OF_OSDS=4
    master: ++ set +x
    master: OSDs in cluster (actual/expected): 0/4 (900 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (890 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (880 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (870 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (860 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (850 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (840 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (830 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (820 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (810 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (800 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (790 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (780 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (770 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (760 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (750 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (740 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (730 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (720 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (710 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (700 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (690 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (680 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (670 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (660 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (650 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (640 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (630 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (620 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (610 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (600 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (590 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (580 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (570 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (560 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (550 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (540 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (530 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (520 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (510 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (500 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (490 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (480 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (470 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (460 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (450 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (440 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (430 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (420 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (410 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (400 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (390 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (380 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (370 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (360 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (350 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (340 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (330 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (320 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (310 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (300 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (290 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (280 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (270 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (260 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (250 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (240 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (230 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (220 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (210 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (200 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (190 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (180 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (170 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (160 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (150 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (140 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (130 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (120 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (110 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (100 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (90 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (80 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (70 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (60 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (50 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (40 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (30 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (20 seconds to timeout)
    master: OSDs in cluster (actual/expected): 0/4 (10 seconds to timeout)
    master: ++ ceph status
    master:   cluster:
    master:     id:     8e162d92-d1b5-11ea-9ebc-525400a23639
    master:     health: HEALTH_WARN
    master:             Reduced data availability: 1 pg inactive
    master:             OSD count 0 < osd_pool_default_size 3
    master:  
    master:   services:
    master:     mon: 1 daemons, quorum master (age 17m)
    master:     mgr: master.yexoxq(active, since 16m)
    master:     osd: 0 osds: 0 up, 0 in
    master:  
    master:   data:
    master:     pools:   1 pools, 1 pgs
    master:     objects: 0 objects, 0 B
    master:     usage:   0 B used, 0 B / 0 B avail
    master:     pgs:     100.000% pgs unknown
    master:              1 unknown
    master:  
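
For readers who want to reproduce the wait phase shown in the log, here is a minimal sketch of a polling loop that prints the same kind of progress line. It is not the actual sesdev code: the function names `wait_for_osds` and `count_osds` are hypothetical, and counting OSDs via `ceph osd ls` is an assumption about how the count could be obtained.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the sesdev implementation): poll a counting
# command until it reports the expected number of OSDs or the timeout expires.
# Usage: wait_for_osds EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_osds() {
    local expected="$1" timeout="$2" interval="$3"
    shift 3
    # The remaining arguments form the command that prints the current OSD count.
    local remaining="$timeout" actual
    while [ "$remaining" -gt 0 ]; do
        actual="$("$@")"
        actual="${actual:-0}"   # treat empty output as zero OSDs
        echo "OSDs in cluster (actual/expected): ${actual}/${expected} (${remaining} seconds to timeout)"
        if [ "$actual" -ge "$expected" ]; then
            return 0
        fi
        sleep "$interval"
        remaining=$((remaining - interval))
    done
    return 1
}

# Assumed way of counting OSDs: `ceph osd ls` prints one OSD id per line.
count_osds() {
    ceph osd ls | wc -l | tr -d ' '
}

# Example (needs a reachable cluster), mirroring the 4 OSDs / 900 seconds above:
#   wait_for_osds 4 900 10 count_osds || ceph status
```

Passing the counting command as an argument keeps the loop independent of the cluster, so it can be exercised with a stub before being pointed at a real deployment.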

Related issues

Duplicates Orchestrator - Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0 New

History

#1 Updated by Sebastian Wagner 5 months ago

  • Related to Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0 added

#2 Updated by Sebastian Wagner 5 months ago

  • Status changed from New to Need More Info

Thanks for the report. I'll need more info to resolve this. The MGR log might be helpful.

#3 Updated by Nathan Cutler 5 months ago

  • Status changed from Need More Info to Duplicate

#4 Updated by Nathan Cutler 5 months ago

  • Related to deleted (Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0)

#5 Updated by Nathan Cutler 5 months ago

  • Duplicates Bug #46990: execnet: EOFError: couldnt load message header, expected 9 bytes, got 0 added

#6 Updated by Nathan Cutler 5 months ago

Though I cannot access the MGR log, I think it's safe to assume it would look like the one in #46990.

#7 Updated by Joshua Schmid 4 months ago

Could you include the mgr log the next time you see this issue?
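
One possible way to collect that log on a cephadm-deployed node is sketched below. It assumes the usual cephadm systemd unit naming, `ceph-<fsid>@<type>.<id>.service`; the helper name `cephadm_unit_name` is hypothetical, and the fsid and mgr id are simply the ones from the `ceph status` output in the description.

```shell
# Hypothetical helper: build the systemd unit name for a cephadm-managed
# daemon, assuming the ceph-<fsid>@<type>.<id>.service naming convention.
cephadm_unit_name() {
    local fsid="$1" daemon_type="$2" daemon_id="$3"
    echo "ceph-${fsid}@${daemon_type}.${daemon_id}.service"
}

# Values taken from the `ceph status` output in the description.
FSID="8e162d92-d1b5-11ea-9ebc-525400a23639"
MGR_ID="master.yexoxq"

# Example (run on the node hosting the mgr, before tearing the cluster down):
#   journalctl --no-pager -u "$(cephadm_unit_name "$FSID" mgr "$MGR_ID")" > mgr.log
#   ceph log last cephadm > cephadm.log
```

Capturing both files right after the timeout, before the test environment is destroyed, would make the log available for the next occurrence.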

#8 Updated by Nathan Cutler 4 months ago

  • Status changed from Duplicate to New

This issue was "fixed" by inserting a one-minute grace period between

(a) the completion of "cephadm bootstrap"

and

(b) the issuance of the "ceph orch apply" command to deploy the OSDs.

I don't think this one-minute grace period should be necessary, but it is (or, at least, it was a couple of months ago). I will try to reproduce the issue using the latest code.
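
For illustration, the workaround amounts to something like the following sketch. The wrapper name `run_after_grace` is hypothetical and this is not the actual sesdev change; the 60-second value and the spec path come from this report.

```shell
# Hypothetical wrapper: wait a fixed grace period, then run the given command.
# Usage: run_after_grace SECONDS COMMAND [ARGS...]
run_after_grace() {
    local grace_seconds="$1"
    shift
    echo "Waiting ${grace_seconds} seconds before running: $*"
    sleep "$grace_seconds"
    "$@"
}

# Example, placed right after `cephadm bootstrap` has returned:
#   run_after_grace 60 ceph orch apply -i /root/service_spec_core.yml
```

A fixed sleep only hides the timing problem; it does not explain why an apply issued immediately after bootstrap gets stuck.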

#9 Updated by Nathan Cutler 3 months ago

  • Status changed from New to Can't reproduce

No longer reproducible.
