Bug #13594
closed
osd/PG.cc: 2856: FAILED assert(values.size() == 1)
Added by Laurent GUERBY over 8 years ago.
Updated over 7 years ago.
Description
After a host reboot, one of our OSDs doesn't restart; it fails on an assert:
0> 2015-10-26 08:15:59.923059 7f67f0cb2900 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7f67f0cb2900 time 2015-10-26 08:15:59.922041
osd/PG.cc: 2856: FAILED assert(values.size() == 1)
ceph version 0.94.2-1-ga11cca9 (a11cca90395f7da516668bbba20dff6cf1d8a538)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc19276]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xa04) [0x7dfe34]
3: (OSD::load_pgs()+0xa0f) [0x6b2e9f]
4: (OSD::init()+0xc4c) [0x6b5d9c]
5: (main()+0x2821) [0x63dd11]
6: (__libc_start_main()+0xf5) [0x7f67edd7db45]
7: /usr/bin/ceph-osd() [0x6579d7]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Full log attached.
On the surface, this looks similar to: http://tracker.ceph.com/issues/4855
A second OSD in our cluster now has the exact same error. This happens during a recovery; if we keep losing OSDs, we'll lose data.
Is it a journal error? If so, is it safe to blank the journal with ceph-osd --mkjournal -i OSDNUM?
- Release set to infernalis
- ceph-qa-suite rados added
Also in:
Run: http://pulpito.ceph.com/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/
Job: ['1131488']
Assertion: osd/PG.cc: 2850: FAILED assert(values.size() == 1)
ceph version 9.1.0-98-g3af21a1 (3af21a1432aabb86f82f9063aef7a6ab9865345e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9aa717f94b]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x8a0) [0x7f9aa6d14270]
3: (OSD::load_pgs()+0x580) [0x7f9aa6bf6040]
4: (OSD::init()+0xe74) [0x7f9aa6c061e4]
5: (main()+0x2954) [0x7f9aa6b8b0c4]
6: (__libc_start_main()+0xf5) [0x7f9aa3a0dec5]
7: (()+0x2f7f07) [0x7f9aa6bbaf07]
- Status changed from New to Resolved
We are seeing a similar backtrace on a Hammer OSD. Is this the same issue? It looks like the fix wasn't backported.
0> 2016-12-07 07:46:21.105387 7f97149158c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f97149158c0 time 2016-12-07 07:46:21.104173
osd/PG.cc: 2889: FAILED assert(values.size() == 2)
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1fab]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x885) [0x7c19b5]
3: (OSD::load_pgs()+0x9b7) [0x6b9cd7]
4: (OSD::init()+0x17c7) [0x6bd777]
5: (main()+0x2a31) [0x6480e1]
6: (__libc_start_main()+0xf5) [0x7f9711c9cf45]
7: /usr/bin/ceph-osd() [0x661147]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I tried an adapted version of https://github.com/ceph/ceph/pull/6444, which gets the OSD started, but the OSD hits a lot of other weird errors after that. It would be great if someone could take a look at the Hammer code and tell whether a different fix might be needed there.
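For anyone comparing the assertion failures quoted in this ticket (line 2856 with size 1, line 2850 with size 1, line 2889 with size 2), a small script along these lines may help pull the source file, line number and asserted expression out of an OSD log. This is only a sketch: the helper names (parse_assert, find_asserts) are made up for illustration, and the regular expression assumes the "FAILED assert(...)" line format exactly as it appears in the logs pasted above.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class AssertFailure(NamedTuple):
    """One 'FAILED assert' line, split into its parts (hypothetical helper type)."""
    source_file: str
    line: int
    expression: str


# Assumed format, taken from the log lines quoted in this ticket:
#   osd/PG.cc: 2856: FAILED assert(values.size() == 1)
_ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def parse_assert(log_line: str) -> Optional[AssertFailure]:
    """Return the parsed assertion failure, or None if the line has none."""
    match = _ASSERT_RE.search(log_line)
    if match is None:
        return None
    return AssertFailure(
        match.group("file"), int(match.group("line")), match.group("expr")
    )


def find_asserts(lines: Iterable[str]) -> List[AssertFailure]:
    """Collect every assertion failure found in an iterable of log lines."""
    found = []
    for log_line in lines:
        parsed = parse_assert(log_line)
        if parsed is not None:
            found.append(parsed)
    return found


if __name__ == "__main__":
    sample = [
        "osd/PG.cc: 2856: FAILED assert(values.size() == 1)",
        "Assertion: osd/PG.cc: 2850: FAILED assert(values.size() == 1)",
        "osd/PG.cc: 2889: FAILED assert(values.size() == 2)",
    ]
    for failure in find_asserts(sample):
        print(failure.source_file, failure.line, failure.expression)
```

To run it against a real log, pass an open file object to find_asserts, since it accepts any iterable of lines. Differences in the line number or in the expected size between releases then show up directly in the output.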