Bug #24423

failed to load OSD map for epoch X, got 0 bytes

Added by Sergey Malinin almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
mimic
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading to Mimic I deleted a non-lvm OSD and recreated it with 'ceph-volume lvm prepare --bluestore --data /dev/sdX'. Now the OSD fails to start with the following error:

2018-06-05 13:27:49.189 7f79e4def240 0 osd.3 0 done with init, starting boot process
2018-06-05 13:27:49.189 7f79e4def240 1 osd.3 0 start_boot
2018-06-05 13:27:49.189 7f79e4def240 10 osd.3 0 start_boot - have maps 0..0
2018-06-05 13:27:49.189 7f79d5d4a700 10 osd.3 0 OSD::ms_get_authorizer type=mgr
2018-06-05 13:27:49.193 7f79c971e700 10 osd.3 0 ms_handle_connect con 0x55e0a0f62300
2018-06-05 13:27:49.193 7f79bfae4700 10 osd.3 0 _preboot _preboot mon has osdmaps 17056..17676
2018-06-05 13:27:49.193 7f79bfae4700 20 osd.3 0 update_osd_stat osd_stat(1.0 GiB used, 99 GiB avail, 100 GiB total, peers [] op hist [])
2018-06-05 13:27:49.193 7f79bfae4700 5 osd.3 0 heartbeat: osd_stat(1.0 GiB used, 99 GiB avail, 100 GiB total, peers [] op hist [])
2018-06-05 13:27:49.193 7f79bfae4700 1 osd.3 0 waiting for initial osdmap
2018-06-05 13:27:49.193 7f79c971e700 20 osd.3 0 OSD::ms_dispatch: osd_map(17056..17056 src has 17056..17676 +gap_removed_snaps) v4
2018-06-05 13:27:49.193 7f79c971e700 10 osd.3 0 do_waiters -- start
2018-06-05 13:27:49.193 7f79c971e700 10 osd.3 0 do_waiters -- finish
2018-06-05 13:27:49.193 7f79c971e700 20 osd.3 0 _dispatch 0x55e0a0b2fe40 osd_map(17056..17056 src has 17056..17676 +gap_removed_snaps) v4
2018-06-05 13:27:49.193 7f79c971e700 3 osd.3 0 handle_osd_map epochs [17056,17056], i have 0, src has [17056,17676]
2018-06-05 13:27:49.193 7f79c971e700 10 osd.3 0 handle_osd_map message skips epochs 1..17055
2018-06-05 13:27:49.193 7f79c971e700 10 osd.3 0 handle_osd_map got full map for epoch 17056
2018-06-05 13:27:49.193 7f79c971e700 20 osd.3 0 got_full_map 17056, nothing requested
2018-06-05 13:27:49.193 7f79c971e700 20 osd.3 0 get_map 17055 - loading and decoding 0x55e0a0fa4480
2018-06-05 13:27:49.193 7f79c971e700 -1 osd.3 0 failed to load OSD map for epoch 17055, got 0 bytes
2018-06-05 13:27:49.197 7f79c971e700 -1 /build/ceph-13.2.0/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f79c971e700 time 2018-06-05 13:27:49.200273
/build/ceph-13.2.0/src/osd/OSD.h: 828: FAILED assert(ret)

ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f79dc2175e2]
2: (()+0x26b7a7) [0x7f79dc2177a7]
3: (OSDService::get_map(unsigned int)+0x4a) [0x55e09e756e9a]
4: (OSD::handle_osd_map(MOSDMap*)+0xfb1) [0x55e09e6fddc1]
5: (OSD::_dispatch(Message*)+0xa1) [0x55e09e706a21]
6: (OSD::ms_dispatch(Message*)+0x56) [0x55e09e706d76]
7: (DispatchQueue::entry()+0xb92) [0x7f79dc290452]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f79dc32e6cd]
9: (()+0x76db) [0x7f79da9126db]
10: (clone()+0x3f) [0x7f79d98d688f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
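
For context, a hedged sketch of a typical delete-and-recreate sequence of this kind, using a hypothetical osd.3 (only the prepare call above is taken from the report; the removal commands are assumptions):

# remove the old non-LVM OSD (assumed steps)
systemctl stop ceph-osd@3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm 3
# recreate it as a BlueStore OSD with ceph-volume, then activate it
ceph-volume lvm prepare --bluestore --data /dev/sdX
ceph-volume lvm activate --all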

Related issues

Duplicated by Ceph - Bug #24524: Newly added OSDs do not start in Mimic Duplicate 06/14/2018
Duplicated by RADOS - Bug #24450: OSD Caught signal (Aborted) Duplicate 06/07/2018
Copied to RADOS - Backport #24599: mimic: failed to load OSD map for epoch X, got 0 bytes Resolved

History

#1 Updated by Paul Emmerich almost 6 years ago

We are also seeing this when creating OSDs with IDs that existed previously.

I verified that the old OSD was deleted completely: no CRUSH map entry, no auth list entry, no OSD map entry.
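
A hedged sketch of one way to check those three places, using a hypothetical osd.3 (the actual ID is not given in the comment):

ceph osd crush tree | grep osd.3   # expect no CRUSH entry
ceph auth get osd.3                # expect an ENOENT error
ceph osd dump | grep osd.3         # expect no osdmap entry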

#2 Updated by Michael Sudnick almost 6 years ago

I'm trying to make new OSDs with ceph-volume osd create --dmcrypt --bluestore --data /dev/sdg and am getting the same error.

-8> 2018-06-08 17:02:05.468 7f62b8166700  3 osd.47 0 handle_osd_map epochs [219408,219408], i have 0, src has [219408,223195]
-7> 2018-06-08 17:02:05.468 7f62b8166700 -1 osd.47 0 failed to load OSD map for epoch 219407, got 0 bytes
-6> 2018-06-08 17:02:05.468 7f62c3905700 1 -- 10.0.150.0:6815/24239 >> 10.0.152.0:6804/294348 conn(0x55d38dc32400 :-1 s=STATE_OPEN pgs=270569 cs=1 l=1).read_bulk peer close file descriptor 36
-5> 2018-06-08 17:02:05.468 7f62c3905700 1 -- 10.0.150.0:6815/24239 >> 10.0.152.0:6804/294348 conn(0x55d38dc32400 :-1 s=STATE_OPEN pgs=270569 cs=1 l=1).read_until read failed
-4> 2018-06-08 17:02:05.469 7f62c3905700 1 -- 10.0.150.0:6815/24239 >> 10.0.152.0:6804/294348 conn(0x55d38dc32400 :-1 s=STATE_OPEN pgs=270569 cs=1 l=1).process read tag failed
-3> 2018-06-08 17:02:05.469 7f62c3905700 1 -- 10.0.150.0:6815/24239 >> 10.0.152.0:6804/294348 conn(0x55d38dc32400 :-1 s=STATE_OPEN pgs=270569 cs=1 l=1).fault on lossy channel, failing
-2> 2018-06-08 17:02:05.469 7f62c3905700 2 -- 10.0.150.0:6815/24239 >> 10.0.152.0:6804/294348 conn(0x55d38dc32400 :-1 s=STATE_OPEN pgs=270569 cs=1 l=1)._stop
-1> 2018-06-08 17:02:05.470 7f62c4106700 5 -- 10.0.150.0:6815/24239 >> 10.0.153.0:6789/0 conn(0x55d38dc30000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4242230 cs=1 l=1). rx mon.3 seq 13 0x55d38d181900 osd_map(219409..219448 src has 219408..223195) v4
0> 2018-06-08 17:02:05.470 7f62b8166700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f62b8166700 time 2018-06-08 17:02:05.469737
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/osd/OSD.h: 828: FAILED assert(ret)
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f62cbbfb53f]
2: (()+0x286727) [0x7f62cbbfb727]
3: (OSDService::get_map(unsigned int)+0x4a) [0x55d389ef7dda]
4: (OSD::handle_osd_map(MOSDMap*)+0x1020) [0x55d389ea51f0]
5: (OSD::_dispatch(Message*)+0xa1) [0x55d389ea7d21]
6: (OSD::ms_dispatch(Message*)+0x56) [0x55d389ea8066]
7: (DispatchQueue::entry()+0xb5a) [0x7f62cbc7174a]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f62cbd11f2d]
9: (()+0x7e25) [0x7f62c8753e25]
10: (clone()+0x6d) [0x7f62c7844bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I've tried new OSD numbers and I still seem to get the error.
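
For reference, a hedged example of a ceph-volume invocation of this kind (the lvm subcommand and exact form are assumptions; the flags are taken from the comment above):

ceph-volume lvm create --dmcrypt --bluestore --data /dev/sdg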

#3 Updated by Paul Emmerich almost 6 years ago

I can't reproduce this on any new Mimic cluster; it only happens on clusters upgraded from Luminous (which is why we hadn't caught this earlier).

Also, it's not specific to reusing the ID, as I initially assumed. Deleting any OSD seems to trigger this for all new OSDs.

#4 Updated by Gavin Baker almost 6 years ago

I have also hit this issue while continuing the BlueStore conversion of OSDs on my Ceph cluster after upgrading to Mimic (from Luminous). The same "failed to load OSD map for epoch X, got 0 bytes" error appears in the logs, and the service repeatedly fails to start.

#5 Updated by Vasu Kulkarni almost 6 years ago

  • Priority changed from Normal to High

I am going to add this as an upgrade test as well. Steps to recreate:


On Fri, Jun 8, 2018 at 4:09 PM, Paul Emmerich <paul.emmerich@croit.io> wrote:
> Hi,
>
> we are also seeing this (I've also posted to the issue tracker). It only
> affects clusters upgraded from Luminous, not new ones.
> Also, it's not about re-using OSDs. Deleting any OSD seems to trigger this
> bug for all new OSDs on upgraded clusters.
>
> We are still using the pre-Luminous way to remove OSDs, i.e.:
>
> * ceph osd down/stop service
> * ceph osd crush remove
> * ceph osd auth del
> * ceph osd rm
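
For reference, a hedged sketch of that pre-Luminous removal sequence for a hypothetical osd.3 (note the auth step is 'ceph auth del', not 'ceph osd auth del'):

systemctl stop ceph-osd@3       # stop the daemon
ceph osd down 3                 # mark it down if it isn't already
ceph osd crush remove osd.3     # remove it from the CRUSH map
ceph auth del osd.3             # delete its auth key
ceph osd rm 3                   # remove it from the OSD map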

#6 Updated by Grant Slater almost 6 years ago

I am getting the same issue.

I also upgraded from Luminous to Mimic.
I used: ceph osd purge
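
For reference, a hedged example of the single-command purge (it combines crush remove, auth del, and osd rm), using a hypothetical osd.3:

ceph osd purge 3 --yes-i-really-mean-it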

#7 Updated by Wido den Hollander almost 6 years ago

Seeing the same here with a new Mimic cluster.

I purged a few OSDs (deployment went wrong) and now they can't start anymore.

v13.2.0

#8 Updated by Michel Nicol almost 6 years ago

Not sure if this is related, but for a few days I have not been able to modify the crushmap (e.g. adding or removing an OSD) on a Luminous 12.2.5 cluster, except by using ceph osd getcrushmap / ceph osd setcrushmap.
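
A hedged sketch of that getcrushmap/setcrushmap round trip (the crushtool editing step is an assumption):

ceph osd getcrushmap -o crush.bin     # export the current CRUSH map
crushtool -d crush.bin -o crush.txt   # decompile it to text
# edit crush.txt by hand (add or remove the OSD entry)
crushtool -c crush.txt -o crush.new   # recompile
ceph osd setcrushmap -i crush.new     # inject the modified map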

When I try to add an OSD, the monitors die one after another unless I stop the creation process.

    -5> 2018-06-15 11:03:49.841658 7fb34c18e700 10 log_client  log_queue is 1 last_log 403 sent 402 num 1 unsent 1 sending 1
    -4> 2018-06-15 11:03:49.841662 7fb34c18e700 10 log_client  will send 2018-06-15 11:03:49.841654 mon.ceph01 mon.0 10.2.100.6:6789/0 403 : audit [INF] from='client.1431377037 -' entity='client.admin' cmd=[
    -3> 2018-06-15 11:03:49.841673 7fb34c18e700  1 -- 10.2.100.6:6789/0 --> 10.2.100.6:6789/0 -- log(1 entries from seq 403 at 2018-06-15 11:03:49.841654) v1 -- 0x55639ed74300 con 0
    -2> 2018-06-15 11:03:49.841691 7fb34c18e700  5 mon.ceph01@0(leader).paxos(paxos active c 156957330..156958064) is_readable = 1 - now=2018-06-15 11:03:49.841692 lease_expire=2018-06-15 11:03:54.570130 has
    -1> 2018-06-15 11:03:49.842117 7fb34c18e700  5 mon.ceph01@0(leader).osd e8861309 creating crush item id 45 name 'osd.45' device_class 'ssd'
     0> 2018-06-15 11:03:49.851126 7fb34c18e700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb34c18e700 thread_name:ms_dispatch

2018-06-15 11:07:13.663097 7f46d9729700  0 mon.ceph01@0(leader) e32 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["45"]} v 0) v1
2018-06-15 11:07:13.663158 7f46d9729700  0 log_channel(audit) log [INF] : from='osd.45 -' entity='osd.45' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["45"]}]: dispatch
2018-06-15 11:07:13.665762 7f46d9729700  0 mon.ceph01@0(leader) e32 handle_command mon_command({"prefix": "osd crush create-or-move", "id": 45, "weight":0.2728, "args": ["host=ceph14", "root=default"]} v 0)
2018-06-15 11:07:13.665820 7f46d9729700  0 log_channel(audit) log [INF] : from='osd.45 -' entity='osd.45' cmd=[{"prefix": "osd crush create-or-move", "id": 45, "weight":0.2728, "args": ["host=ceph14", "root=
2018-06-15 11:07:13.665940 7f46d9729700  0 mon.ceph01@0(leader).osd e8861310 create-or-move crush item name 'osd.45' initial_weight 0.2728 at location {host=ceph14,root=default}
2018-06-15 11:07:13.675745 7f46d9729700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f46d9729700 thread_name:ms_dispatch

2018-06-15 11:16:45.123595 7f21b5354700  0 mon.ceph01@0(leader).data_health(4034) update_stats avail 46% total 128 GB, used 64118 MB, avail 60962 MB
2018-06-15 11:16:45.202578 7f21b5354700  0 log_channel(cluster) log [WRN] : Health check update: 54840/9257042 objects misplaced (0.592%) (OBJECT_MISPLACED)
2018-06-15 11:16:46.147449 7f21b2b4f700  0 mon.ceph01@0(leader) e32 handle_command mon_command({"prefix": "osd dump"} v 0) v1
2018-06-15 11:16:46.147489 7f21b2b4f700  0 log_channel(audit) log [DBG] : from='client.? 10.2.100.6:0/3091481263' entity='client.admin' cmd=[{"prefix": "osd dump"}]: dispatch
2018-06-15 11:16:46.471926 7f21b2b4f700  0 mon.ceph01@0(leader) e32 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["46"]} v 0) v1
2018-06-15 11:16:46.472010 7f21b2b4f700  0 log_channel(audit) log [INF] : from='osd.46 -' entity='osd.46' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["46"]}]: dispatch
2018-06-15 11:16:46.481162 7f21b2b4f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f21b2b4f700 thread_name:ms_dispatch


    -8> 2018-06-07 21:24:50.333431 7ffa35a82700  0 mon.ceph01@0(leader) e32 handle_command mon_command({"prefix": "osd crush remove", "name": "osd.30"} v 0) v1
    -7> 2018-06-07 21:24:50.333479 7ffa35a82700  0 log_channel(audit) log [INF] : from='client.1426519607 -' entity='client.admin' cmd=[{"prefix": "osd crush remove", "name": "osd.30"}]: dispatch
    -6> 2018-06-07 21:24:50.333484 7ffa35a82700 10 log_client _send_to_monlog to self
    -5> 2018-06-07 21:24:50.333485 7ffa35a82700 10 log_client  log_queue is 1 last_log 3369586 sent 3369585 num 1 unsent 1 sending 1
    -4> 2018-06-07 21:24:50.333489 7ffa35a82700 10 log_client  will send 2018-06-07 21:24:50.333482 mon.ceph01 mon.0 10.2.100.6:6789/0 3369586 : audit [INF] from='client.1426519607 -' entity='client.admin' c
    -3> 2018-06-07 21:24:50.333499 7ffa35a82700  1 -- 10.2.100.6:6789/0 --> 10.2.100.6:6789/0 -- log(1 entries from seq 3369586 at 2018-06-07 21:24:50.333482) v1 -- 0x55704ef55e00 con 0
    -2> 2018-06-07 21:24:50.333514 7ffa35a82700  5 mon.ceph01@0(leader).paxos(paxos active c 156003781..156004298) is_readable = 1 - now=2018-06-07 21:24:50.333515 lease_expire=2018-06-07 21:24:54.354429 has
    -1> 2018-06-07 21:24:50.341906 7ffa3227b700  1 -- 10.2.100.6:6789/0 >> - conn(0x557065b03000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=50 -
     0> 2018-06-07 21:24:50.341978 7ffa35a82700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ffa35a82700 thread_name:ms_dispatch

#10 Updated by Lazuardi Nasution almost 6 years ago

How do I apply this fix to an already installed Mimic cluster (upgraded from Luminous)? Is there any way to keep a starting OSD from requesting the unavailable map at such an epoch?

#11 Updated by Greg Farnum over 5 years ago

  • Duplicated by Bug #24524: Newly added OSDs do not start in Mimic added

#12 Updated by Sage Weil over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to 12
  • Priority changed from High to Urgent

#13 Updated by Igor Fedotov over 5 years ago

  • Duplicated by Bug #24450: OSD Caught signal (Aborted) added

#14 Updated by Sage Weil over 5 years ago

  • Status changed from 12 to Pending Backport
  • Backport set to mimic

#16 Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #24599: mimic: failed to load OSD map for epoch X, got 0 bytes added

#17 Updated by Sergey Burdakov over 5 years ago

I also have this issue on a newly installed Mimic cluster.
Don't know if this is important, but the problem appeared after I added some SSD disks to a degraded cluster (2 of 3 OSD hosts are down; the replica count is 3). The new OSDs won't start, and the stack trace is the same.

#18 Updated by Sergey Ponomarev over 5 years ago

I found a workaround for adding new OSDs to a Mimic cluster (see the sketch after the steps):

1. Purge the OSD that is shown as "down" from the cluster
2. Purge the Mimic packages and data on the node
3. Change the repo to Luminous
4. Install Luminous
5. Add all new OSDs on this node to the Mimic cluster and wait for them to come up
6. Change the repo back to Mimic and upgrade the packages
7. Reboot
8. Wait for the rebuild to finish
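
A rough sketch of the repo/package part of these steps on a Debian/Ubuntu node, assuming the download.ceph.com repositories and a /etc/apt/sources.list.d/ceph.list file (both assumptions):

apt-get purge ceph-osd ceph-common        # step 2: remove the Mimic packages
sed -i 's/debian-mimic/debian-luminous/' /etc/apt/sources.list.d/ceph.list   # step 3
apt-get update
apt-get install ceph-osd ceph-common      # step 4: install Luminous
# add the new OSDs (step 5), then reverse the sed and run
# apt-get update && apt-get dist-upgrade to return to Mimic (step 6)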

#19 Updated by Lazuardi Nasution over 5 years ago

Hi Sergey,

Have you tried that after "ceph osd require-osd-release mimic"?

My workaround is below (a rough sketch follows the steps).

1. Build the patched version somewhere else
2. Shut down all OSDs on the same host
3. Save the original /usr/bin/ceph-osd and replace it with the patched one, adjusting library paths and/or symlinks if necessary
4. Add the new OSD
5. Shut down the newly added OSD
6. Put back the original /usr/bin/ceph-osd
7. Start all OSDs normally

You should only continue to the next step after the Ceph status has stabilized.
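
A rough sketch of steps 2 through 7, assuming systemd-managed OSDs and a patched binary already copied to the host as /root/ceph-osd.patched (both assumptions):

systemctl stop ceph-osd.target                # step 2: stop all OSDs on this host
cp /usr/bin/ceph-osd /root/ceph-osd.orig      # step 3: save the original binary
cp /root/ceph-osd.patched /usr/bin/ceph-osd   #         and install the patched one
# step 4: add the new OSD here (e.g. with ceph-volume lvm create)
systemctl stop ceph-osd@<new-id>              # step 5: stop the newly added OSD
cp /root/ceph-osd.orig /usr/bin/ceph-osd      # step 6: restore the original binary
systemctl start ceph-osd.target               # step 7: start all OSDs normally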

Best regards,

#20 Updated by Martin Overgaard Hansen over 5 years ago

Hi everyone,

What’s the workaround for this issue? Not being able to add new OSDs is getting more and more urgent.

Thanks,
Martin

#21 Updated by Lazuardi Nasution over 5 years ago

Hi Martin,

Have you tried my workaround above?

Best regards,

#22 Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
