https://tracker.ceph.com/
2017-09-16T11:00:14Z
Ceph
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99238
2017-09-16T11:00:14Z
Patrick Fruh
<ul></ul><p>Apparently it's only happening to the first 3 of my 41 OSDs (0, 1, 2) on "node1" since the luminous upgrade.</p>
<p>I checked smartctl and it's clean, without any errors, so a hardware defect seems unlikely.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99240
2017-09-16T11:15:23Z
Patrick Fruh
<ul></ul><p>Those 3 OSDs share a journaling / DB SSD, so it might be that SSD that's failing; I'm currently running a SMART self-test on it.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99242
2017-09-16T13:00:40Z
Patrick Fruh
<ul></ul><p>The self-tests of all 3 OSDs and the journal / DB SSD came out OK.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99618
2017-09-21T20:27:47Z
Sage Weil
sage@newdream.net
<ul><li><strong>Subject</strong> changed from <i>OSD segfaults after luminous upgrade - bluestore and filestore</i> to <i>osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>12</i></li><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99619
2017-09-21T20:30:39Z
Sage Weil
sage@newdream.net
<ul></ul><p>When you say "upgraded this osd from filestore to bluestore", do you mean that you wiped the device, recreated a bluestore osd in its place, let it recover, and then started seeing the new errors? (The error messages seem totally unrelated... but the fact that a <strong>fresh</strong> replacement OSD would see new, different problems makes me suspicious of the hardware.)</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99620
2017-09-21T20:34:08Z
Patrick Fruh
<ul></ul><p>Yes, the errors seemingly started to show up when the OSDs were under high load (recovering themselves or recovering other OSDs). However, after wiping them again and changing them back to filestore, everything is now working fine again, even under high load.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99621
2017-09-21T20:43:08Z
Patrick Fruh
<ul></ul><p>Looking at the most recent logs, some OSDs still crashed early yesterday morning (even after going back to filestore and finishing recovery); I don't know whether the error is related:</p>
<pre>
Sep 20 02:47:57 node1.ceph ceph-osd[7051]: 2017-09-20 02:47:57.075709 7f4591be9700 -1 osd.0 pg_epoch: 23702 pg[7.1f5( v 23702'6584626 (23697'6581556,23702'6584626] local-lis/les=23341/23342 n=3146 ec=104/104 lis/c 23341/23341 les/c/f 23342/23342/0 23312/23341/23341) [0,15,11] r=0 lpr=23341 crt=23702'6584626 lcod 23702'6584625 mlcod 23702'6584625 active+clean+scrubbing] _scan_snaps no head for 7:afaba73b:::rbd_data.68c683238e1f29.00000000000022a1:3d (have MIN)
Sep 20 02:52:10 node1.ceph ceph-osd[7051]: 2017-09-20 02:52:10.185743 7f45903e6700 -1 osd.0 pg_epoch: 23702 pg[7.360( v 23702'8555567 (23697'8552485,23702'8555567] local-lis/les=23431/23432 n=3094 ec=2543/2543 lis/c 23431/23431 les/c/f 23432/23432/0 23312/23431/23431) [0,31,8] r=0 lpr=23431 crt=23702'8555567 lcod 23702'8555566 mlcod 23702'8555566 active+clean+scrubbing+deep] _scan_snaps no head for 7:06d39870:::rbd_data.bd1c92ae8944a.0000000000005648:d2 (have MIN)
Sep 20 02:52:20 node1.ceph ceph-osd[7051]: 2017-09-20 02:52:20.808277 7f45903e6700 -1 osd.0 pg_epoch: 23702 pg[7.360( v 23702'8555581 (23697'8552485,23702'8555581] local-lis/les=23431/23432 n=3094 ec=2543/2543 lis/c 23431/23431 les/c/f 23432/23432/0 23312/23431/23431) [0,31,8] r=0 lpr=23431 crt=23702'8555581 lcod 23702'8555580 mlcod 23702'8555580 active+clean+scrubbing+deep] _scan_snaps no head for 7:06d5926c:::rbd_data.a74f60238e1f29.000000000000112a:cf (have MIN)
Sep 20 02:52:20 node1.ceph ceph-osd[7051]: 2017-09-20 02:52:20.808312 7f45903e6700 -1 osd.0 pg_epoch: 23702 pg[7.360( v 23702'8555581 (23697'8552485,23702'8555581] local-lis/les=23431/23432 n=3094 ec=2543/2543 lis/c 23431/23431 les/c/f 23432/23432/0 23312/23431/23431) [0,31,8] r=0 lpr=23431 crt=23702'8555581 lcod 23702'8555580 mlcod 23702'8555580 active+clean+scrubbing+deep] _scan_snaps no head for 7:06d5926c:::rbd_data.a74f60238e1f29.000000000000112a:9c (have MIN)
Sep 20 02:53:23 node1.ceph ceph-osd[7051]: 2017-09-20 02:53:23.572447 7f45903e6700 -1 osd.0 pg_epoch: 23702 pg[7.360( v 23702'8555677 (23697'8552585,23702'8555677] local-lis/les=23431/23432 n=3094 ec=2543/2543 lis/c 23431/23431 les/c/f 23432/23432/0 23312/23431/23431) [0,31,8] r=0 lpr=23431 crt=23702'8555677 lcod 23702'8555676 mlcod 23702'8555676 active+clean+scrubbing+deep] _scan_snaps no head for 7:06e4dd89:::rbd_data.a74f60238e1f29.0000000000000832:9c (have MIN)
Sep 20 02:53:29 node1.ceph ceph-osd[7051]: 2017-09-20 02:53:29.239197 7f45903e6700 -1 osd.0 pg_epoch: 23702 pg[7.360( v 23702'8555687 (23697'8552685,23702'8555687] local-lis/les=23431/23432 n=3094 ec=2543/2543 lis/c 23431/23431 les/c/f 23432/23432/0 23312/23431/23431) [0,31,8] r=0 lpr=23431 crt=23702'8555687 lcod 23702'8555686 mlcod 23702'8555686 active+clean+scrubbing+deep] _scan_snaps no head for 7:06e69ad6:::rbd_data.68c683238e1f29.0000000000001eb3:3d (have MIN)
Sep 20 02:56:14 node1.ceph ceph-osd[7051]: 2017-09-20 02:56:14.208765 7f4591be9700 -1 osd.0 pg_epoch: 23702 pg[7.3e9( v 23702'5074005 (23702'5070965,23702'5074005] local-lis/les=23393/23394 n=3090 ec=2543/2543 lis/c 23393/23393 les/c/f 23394/23394/0 23312/23393/23393) [0,15,8] r=0 lpr=23393 crt=23702'5074005 lcod 23702'5074004 mlcod 23702'5074004 active+clean+scrubbing] _scan_snaps no head for 7:97c5a729:::rbd_data.68c683238e1f29.0000000000000580:3d (have MIN)
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: *** Caught signal (Segmentation fault) **
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: in thread 7f4591be9700 thread_name:tp_osd_tp
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 1: (()+0xa23b21) [0x55c161f97b21]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 2: (()+0xf5e0) [0x7f45af75e5e0]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 3: (()+0x1cdff) [0x7f45b1f2bdff]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 4: (rocksdb::BlockBasedTable::PutDataBlockToCache(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::ImmutableCFOptions const&, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, rocksdb::Block*, unsigned int, rocksdb::Slice const&, unsigned long, bool, rocksdb::Cache::Priority)+0xd6) [0x55c1623065e6]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 5: (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool)+0x3dc) [0x55c16230767c]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 6: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockIter*, bool, rocksdb::Status)+0x127) [0x55c1623078d7]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 7: (rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(rocksdb::Slice const&)+0x89) [0x55c16230fef9]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 8: (()+0xdbd596) [0x55c162331596]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 9: (()+0xdbdb5d) [0x55c162331b5d]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 10: (rocksdb::MergingIterator::Seek(rocksdb::Slice const&)+0xce) [0x55c162319c8e]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 11: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x174) [0x55c1623990e4]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 12: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0xa2) [0x55c161eee962]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 13: (DBObjectMap::DBObjectMapIteratorImpl::lower_bound(std::string const&)+0x47) [0x55c161f11cf7]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 14: (DBObjectMap::scan(std::shared_ptr<DBObjectMap::_Header>, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x110) [0x55c161f16530]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 15: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0xaf) [0x55c161f18f6f]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 16: (FileStore::omap_get_values(coll_t const&, ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x18d) [0x55c161d9355d]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 17: (MapCacher::MapCacher<std::string, ceph::buffer::list>::get_keys(std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x268) [0x55c161c6f1b8]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 18: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0x100) [0x55c161c6bd10]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 19: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> >*)+0xd8) [0x55c161c6c198]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 20: (PG::_scan_snaps(ScrubMap&)+0x70e) [0x55c161b0441e]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 21: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x22e) [0x55c161b0532e]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 22: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x3dd) [0x55c161b3157d]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 23: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0x462) [0x55c161b32f72]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 24: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfec) [0x55c161a7ecac]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 25: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55c161fdc029]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 26: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c161fddfc0]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 27: (()+0x7e25) [0x7f45af756e25]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: 28: (clone()+0x6d) [0x7f45ae84a34d]
Sep 20 02:56:31 node1.ceph ceph-osd[7051]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 20 02:56:31 node1.ceph systemd[1]: ceph-osd@0.service: main process exited, code=killed, status=11/SEGV
Sep 20 02:56:31 node1.ceph systemd[1]: Unit ceph-osd@0.service entered failed state.
Sep 20 02:56:31 node1.ceph systemd[1]: ceph-osd@0.service failed.
Sep 20 02:56:51 node1.ceph systemd[1]: ceph-osd@0.service holdoff time over, scheduling restart.
Sep 20 02:56:51 node1.ceph systemd[1]: Starting Ceph object storage daemon osd.0...
Sep 20 02:56:51 node1.ceph systemd[1]: Started Ceph object storage daemon osd.0.
Sep 20 02:56:52 node1.ceph ceph-osd[12827]: 2017-09-20 02:56:52.058834 7fa52d465d00 -1 Public network was set, but cluster network was not set
Sep 20 02:56:52 node1.ceph ceph-osd[12827]: 2017-09-20 02:56:52.058846 7fa52d465d00 -1 Using public network also for cluster network
Sep 20 02:56:52 node1.ceph ceph-osd[12827]: starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
Sep 20 02:57:14 node1.ceph ceph-osd[12827]: 2017-09-20 02:57:14.864876 7fa52d465d00 -1 osd.0 23702 log_to_monitors {default=true}
Sep 20 03:11:13 node1.ceph ceph-osd[12827]: 2017-09-20 03:11:13.898451 7fa50bbe5700 -1 osd.0 pg_epoch: 23712 pg[8.c7( v 23712'5244794 (23697'5241702,23712'5244794] local-lis/les=23708/23712 n=3379 ec=107/107 lis/c 23708/23708 les/c/f 23712/23712/0 23708/23708/16900) [6,0,34] r=1 lpr=23708 luod=0'0 crt=23712'5244794 lcod 23712'5244793 active] _scan_snaps no head for 8:e306ea48:::rbd_data.9aca26238e1f29.0000000000003619:65 (have MIN)
Sep 20 03:12:00 node1.ceph ceph-osd[12827]: 2017-09-20 03:12:00.823561 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[8.3ea( v 23712'1976155 (23697'1973071,23712'1976155] local-lis/les=23708/23710 n=3413 ec=2548/2548 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16926) [17,32,0] r=2 lpr=23708 luod=0'0 crt=23712'1976155 lcod 23712'1976154 active] _scan_snaps no head for 8:57c8258f:::rbd_data.9aca26238e1f29.000000000000409d:65 (have MIN)
Sep 20 03:14:51 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:51.229050 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44c420f4:::rbd_data.a6351b238e1f29.000000000002e2c6:9e (have MIN)
Sep 20 03:14:52 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:52.758835 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44d30056:::rbd_data.bd1c92ae8944a.00000000000027e9:d4 (have MIN)
Sep 20 03:14:52 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:52.758875 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44d30056:::rbd_data.bd1c92ae8944a.00000000000027e9:d2 (have MIN)
Sep 20 03:14:53 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:53.126656 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44d64919:::rbd_data.a6351b238e1f29.000000000002ecf5:9e (have MIN)
Sep 20 03:14:53 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:53.246801 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44d6d538:::rbd_data.bd1c92ae8944a.0000000000000c66:d4 (have MIN)
Sep 20 03:14:53 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:53.246836 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44d6d538:::rbd_data.bd1c92ae8944a.0000000000000c66:d2 (have MIN)
Sep 20 03:14:54 node1.ceph ceph-osd[12827]: 2017-09-20 03:14:54.765036 7fa50cbe7700 -1 osd.0 pg_epoch: 23712 pg[7.322( v 23712'8893405 (23697'8890317,23712'8893405] local-lis/les=23708/23710 n=3025 ec=2543/2543 lis/c 23708/23708 les/c/f 23710/23712/0 23708/23708/16920) [12,0,27] r=1 lpr=23708 luod=0'0 crt=23712'8893405 lcod 23712'8893404 active] _scan_snaps no head for 7:44e5b763:::rbd_data.b0d052238e1f29.000000000000030a:b1 (have MIN)
</pre>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99622
2017-09-21T20:46:42Z
Patrick Fruh
<ul></ul><p>Looking deeper into it, it's only my osd.0 that crashed, and only some OSDs have this "pg_epoch" spam; the other logs are perfectly clean apart from the startup messages.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=99623
2017-09-21T21:10:54Z
Patrick Fruh
<ul></ul><p>So, to sum everything up:</p>
<ol>
<li>I upgraded my 6 hosts with 41 OSDs total from Ceph 10.2.9 to 12.2.0 and from CentOS 7.3 to 7.4</li>
<li>I set the first 3 OSDs on my first host out and let them empty (those 3 OSDs already started showing issues there, but I didn't notice at first; all others were fine; see the <strong>first log I posted</strong>)</li>
<li>I wiped the 3 OSDs and their shared journaling SSD and set them up again with bluestore and the SSD as the DB device</li>
<li>As the recovery started, and while other OSDs (apart from the first 3) were recovering, the 3 bluestore OSDs started acting up and kept segfaulting regularly (<strong>second log I posted</strong>; possibly due to the high load caused by the recoveries)</li>
<li>I wiped those 3 OSDs again and reverted them to filestore, the SSD again as their shared journaling device.</li>
<li>The OSDs are now working without any issues (unlike right after the upgrade, so the "fresh" setup with filestore seems to have fixed part of it), except for osd.0, which crashed once after moving it back to filestore (<strong>third log I posted</strong>)</li>
</ol>
<p>(Sorry for all the posts; I can't seem to edit old ones.)</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=117382
2018-07-20T22:07:50Z
Patrick Donnelly
pdonnell@redhat.com
<ul><li><strong>Affected Versions</strong> <i>v14.0.0</i> added</li><li><strong>ceph-qa-suite</strong> <i>fs</i> added</li></ul><p>Instance in fs testing: /ceph/teuthology-archive/pdonnell-2018-07-20_06:53:09-fs-wip-pdonnell-testing-20180720.044448-testing-basic-smithi/2800613/teuthology.log</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=118210
2018-08-01T02:12:14Z
Patrick Donnelly
pdonnell@redhat.com
<ul><li><strong>Target version</strong> set to <i>v14.0.0</i></li><li><strong>Backport</strong> set to <i>mimic,luminous</i></li></ul><p>Another: /ceph/teuthology-archive/pdonnell-2018-07-31_23:55:57-fs-wip-pdonnell-testing-20180731.215430-testing-basic-smithi/2845133/teuthology.log</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=118401
2018-08-02T19:59:50Z
Patrick Donnelly
pdonnell@redhat.com
<ul></ul><p>/ceph/teuthology-archive/pdonnell-2018-08-02_13:06:29-multimds-wip-pdonnell-testing-20180802.044402-testing-basic-smithi/2852575</p>
<p>with logs/cores</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119086
2018-08-18T18:03:30Z
Sage Weil
sage@newdream.net
<ul></ul><p>The fix is clearly to not trim past can_rollback_to. The other question, though, is why can_rollback_to is so far behind!</p>
<p>for this run, osd.0 is the culprit:<br />/a/nojha-2018-08-08_23:48:33-multimds-wip-test-crt-distro-basic-smithi/2883705/remote</p>
<p>writes are coming in at 22'6698...<br /><pre>
2018-08-09 09:57:38.338 7f702a907700 1 -- 172.21.15.15:6802/32650 <== osd.4 172.21.15.140:6801/32831 18743 ==== MOSDECSubOpWrite(3.3s2 22/20 ECSubWrite(tid=12345, reqid=client.4408.0:48075, at_version=22'6698, trim_to=22'3267, roll_forward_to=22'3239)) v2 ==== 6456+0+0 (387819236 0 0) 0x55e812ba5000 con 0x55e7a69ad500
</pre><br />replies are going back at 22'3329<br /><pre>
2018-08-09 09:57:38.509 7f701a890700 1 -- 172.21.15.15:6802/32650 --> 172.21.15.140:6801/32831 -- MOSDECSubOpWriteReply(3.3s0 22/20 ECSubWriteReply(tid=5615, last_complete=22'3329, committed=1, applied=1)) v2 -- 0x55e7e3b1e6c0 con 0
</pre></p>
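<p>The interaction described above can be sketched as a minimal model (Python, with illustrative names only; the real logic is C++ in osd/PGLog.cc, and the clamp shown here is the shape of the fix, not Ceph's exact code). The point is that trimming to the primary's trim_to while a lagging replica's can_rollback_to is still behind it leaves entries that a later assert(s <= can_rollback_to) will trip over, whereas clamping the trim point keeps the invariant:</p>
<pre>
# Minimal model of PG log trimming guarded by can_rollback_to.
# Versions are (epoch, version) pairs, e.g. 22'3267 -> (22, 3267);
# tuples compare lexicographically, like Ceph's eversion_t.
# All names here are illustrative, not Ceph's actual API.

def trim(log, trim_to, can_rollback_to):
    """Trim log entries up to trim_to, but never past can_rollback_to.

    Trimming straight to trim_to (the pre-fix behavior) can discard
    entries the replica might still need to roll back; clamping the
    trim point preserves the s <= can_rollback_to invariant.
    """
    safe_trim_to = min(trim_to, can_rollback_to)  # the clamp is the fix
    return [e for e in log if e > safe_trim_to]

# Values from the log above: the primary asks to trim to 22'3267
# while the lagging shard has only rolled forward to 22'3239.
log = [(22, v) for v in range(3200, 3400)]
kept = trim(log, trim_to=(22, 3267), can_rollback_to=(22, 3239))
assert min(kept) == (22, 3240)  # nothing past 22'3239 was trimmed
</pre>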
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119088
2018-08-18T19:05:41Z
Sage Weil
sage@newdream.net
<ul></ul><p>...and the reason is just that osd.0 is way behind, and the primary is not throttling its work accordingly. And apparently the client workload isn't throttling its submitted work either. This is a long-standing problem we've never come up with a good solution for.</p>
<p>It is, unsurprisingly, the ffsb workload, which is really good at submitting lots of I/Os in parallel without any throttling.</p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119211
2018-08-21T18:03:46Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>7</i></li><li><strong>Assignee</strong> set to <i>Neha Ojha</i></li></ul>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119212
2018-08-21T18:04:26Z
Neha Ojha
nojha@redhat.com
<ul></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/23633">https://github.com/ceph/ceph/pull/23633</a></p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119649
2018-08-30T13:53:15Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>Pending Backport</i></li></ul>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119704
2018-08-31T11:08:50Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/35072">Backport #35072</a>: luminous: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous</i> added</li></ul>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119706
2018-08-31T11:08:52Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/35073">Backport #35073</a>: mimic: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous</i> added</li></ul>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=119794
2018-09-02T18:14:43Z
Nathan Cutler
ncutler@suse.cz
<ul></ul><p>Luminous backport is non-trivial; see <a class="external" href="http://tracker.ceph.com/issues/35072#note-2">http://tracker.ceph.com/issues/35072#note-2</a></p>
Ceph - Bug #21416: osd/PGLog.cc: 60: FAILED assert(s <= can_rollback_to) after upgrade to luminous
https://tracker.ceph.com/issues/21416?journal_id=120546
2018-09-12T16:12:48Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>