Bug #13594
closed
osd/PG.cc: 2856: FAILED assert(values.size() == 1)
Description
After a host reboot, one of our OSDs fails to restart. It hits the following assertion:
0> 2015-10-26 08:15:59.923059 7f67f0cb2900 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7f67f0cb2900 time 2015-10-26 08:15:59.922041
osd/PG.cc: 2856: FAILED assert(values.size() == 1)
ceph version 0.94.2-1-ga11cca9 (a11cca90395f7da516668bbba20dff6cf1d8a538)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc19276]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xa04) [0x7dfe34]
3: (OSD::load_pgs()+0xa0f) [0x6b2e9f]
4: (OSD::init()+0xc4c) [0x6b5d9c]
5: (main()+0x2821) [0x63dd11]
6: (__libc_start_main()+0xf5) [0x7f67edd7db45]
7: /usr/bin/ceph-osd() [0x6579d7]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Full log attached.
On the surface, this looks similar to: http://tracker.ceph.com/issues/4855
Updated by Laurent GUERBY over 8 years ago
A second OSD in our cluster now has the exact same error. This is happening during a recovery; if we keep losing OSDs, we'll lose data.
Is it a journal error? If so, is it safe to blank the journal with ceph-osd --mkjournal -i OSDNUM?
Updated by Yuri Weinstein over 8 years ago
- Release set to infernalis
- ceph-qa-suite rados added
Also in:
Run: http://pulpito.ceph.com/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/
Job: 1131488
Assertion:
osd/PG.cc: 2850: FAILED assert(values.size() == 1)
ceph version 9.1.0-98-g3af21a1 (3af21a1432aabb86f82f9063aef7a6ab9865345e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9aa717f94b]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x8a0) [0x7f9aa6d14270]
3: (OSD::load_pgs()+0x580) [0x7f9aa6bf6040]
4: (OSD::init()+0xe74) [0x7f9aa6c061e4]
5: (main()+0x2954) [0x7f9aa6b8b0c4]
6: (__libc_start_main()+0xf5) [0x7f9aa3a0dec5]
7: (()+0x2f7f07) [0x7f9aa6bbaf07]
Updated by Sage Weil over 8 years ago
- Status changed from New to Resolved
Updated by Yuri Weinstein over 8 years ago
Still seeing this in http://pulpito.ceph.com/teuthology-2015-10-31_21:00:10-rados-infernalis-distro-basic-multi/
Job: 1134160
Updated by Jens Harbott over 7 years ago
We are seeing a similar backtrace on a Hammer OSD. Is this the same issue? It looks like the fix wasn't backported.
0> 2016-12-07 07:46:21.105387 7f97149158c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f97149158c0 time 2016-12-07 07:46:21.104173
osd/PG.cc: 2889: FAILED assert(values.size() == 2)
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1fab]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x885) [0x7c19b5]
3: (OSD::load_pgs()+0x9b7) [0x6b9cd7]
4: (OSD::init()+0x17c7) [0x6bd777]
5: (main()+0x2a31) [0x6480e1]
6: (__libc_start_main()+0xf5) [0x7f9711c9cf45]
7: /usr/bin/ceph-osd() [0x661147]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Jens Harbott over 7 years ago
I tried an adapted version of https://github.com/ceph/ceph/pull/6444, which gets the OSD started, but the OSD then hits a lot of other weird errors. It would be great if someone could take a look at the Hammer code and say whether a different fix is needed there.
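For readers comparing the two peek_map_epoch signatures in the backtraces above (one returning epoch_t, one returning int with an epoch_t* out-parameter), the following is a minimal sketch of the general difference between asserting on a lookup and reporting a missing key to the caller. It is not the Ceph code and not necessarily what the pull request does: FakeOmap, get_values, the "epoch" key name, and both peek_epoch_* functions are hypothetical stand-ins. The only detail taken from the logs is that epoch_t appears there as unsigned int.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Assumption: epoch_t is an unsigned 32-bit value (the backtraces show
// epoch_t* demangled as "unsigned int*").
using epoch_t = uint32_t;

// Hypothetical stand-in for the key/value metadata stored for one PG.
using FakeOmap = std::map<std::string, std::string>;

// Hypothetical helper: return only those requested keys that actually exist.
std::map<std::string, std::string>
get_values(const FakeOmap& omap, const std::set<std::string>& keys) {
  std::map<std::string, std::string> out;
  for (const auto& k : keys) {
    auto it = omap.find(k);
    if (it != omap.end())
      out.insert(*it);
  }
  return out;
}

// Assert-style lookup: a missing key aborts the whole process, which is
// the kind of failure shown in the reports above.
epoch_t peek_epoch_strict(const FakeOmap& omap) {
  std::set<std::string> keys;
  keys.insert("epoch");  // hypothetical key name
  auto values = get_values(omap, keys);
  assert(values.size() == 1);
  return static_cast<epoch_t>(std::stoul(values.begin()->second));
}

// Error-returning lookup: a missing key becomes a negative errno value,
// so the caller can decide to skip this PG or stop cleanly.
int peek_epoch_checked(const FakeOmap& omap, epoch_t* pepoch) {
  std::set<std::string> keys;
  keys.insert("epoch");  // hypothetical key name
  auto values = get_values(omap, keys);
  if (values.size() != 1)
    return -ENOENT;
  *pepoch = static_cast<epoch_t>(std::stoul(values.begin()->second));
  return 0;
}
```

With the checked variant, a caller iterating over PGs at startup could log and skip an entry whose metadata is missing instead of aborting. Whether skipping is actually safe for a given PG is a separate question that this sketch does not answer.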