Project

General

Profile

Bug #21021

Failed assert starting OSD after mark_unfound_lost caused process crash

Added by Barry Stokes about 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
Start date:
08/17/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I had an issue with my Ceph cluster where I got it into a state where for two of my PGs I only had one copy and both had unfound objects in them (1 unfound on each PG).

I tried running "mark_unfound_lost revert" on both PGs and it told me it responded with "pg has 1 objects unfound and apparently lost marking".

This didn't seem to solve anything so I did a quick search and discovered this bug: http://tracker.ceph.com/issues/19377 which suggests that unless I have found objects it won't recover from here.

Given that I can't use the workaround suggested on that bug (as I don't have a second copy of the PG) I decided to tell it to delete the data, so issued "mark_unfound_lost delete" on the PGs.

The first one seemed to work, recovery kicked in and the PG copied over to other OSDs.

The second one resulted in the OSD process crashing (or it crashed due to some unrelated reason).

The issue I now face is that the OSD won't start up as it's hitting an assertion and crashing out.

This is a bit of a shame as that OSD has the only copy of my PG on it and now it's looking like I'm going to be missing the whole thing rather than just the two unfound objects.

Log when trying to start the OSD as follows:

2017-08-17 08:38:47.913101 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-08-17 08:38:47.913223 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
starting osd.6 at - osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal
2017-08-17 08:38:47.914890 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-08-17 08:38:48.555102 7fa7f1f01a40 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-08-17 08:38:48.605621 7fa7f1f01a40 -1 osd.6 8844 log_to_monitors {default=true}
/build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::populate_obc_watchers(ObjectContextRef)' thread 7fa7d7b44700 time 2017-08-17 08:38:52.408154
/build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: 8960: FAILED assert((recovering.count(obc->obs.oi.soid) || !is_missing_object(obc->obs.oi.soid)) || (pg_log.get_log().objects.count(obc->obs.oi.soid) && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->op == pg_log_entry_t::LOST_REVERT && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->reverting_to == obc->obs.oi.version))
 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x79) [0x55d750bc60f9]
 2: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 3: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 4: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 5: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 10: (()+0x76da) [0x7fa7f0f996da]
 11: (clone()+0x5f) [0x7fa7ef662d7f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-08-17 08:38:52.415176 7fa7d7b44700 -1 /build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::populate_obc_watchers(ObjectContextRef)' thread 7fa7d7b44700 time 2017-08-17 08:38:52.408154
/build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: 8960: FAILED assert((recovering.count(obc->obs.oi.soid) || !is_missing_object(obc->obs.oi.soid)) || (pg_log.get_log().objects.count(obc->obs.oi.soid) && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->op == pg_log_entry_t::LOST_REVERT && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->reverting_to == obc->obs.oi.version))

 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x79) [0x55d750bc60f9]
 2: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 3: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 4: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 5: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 10: (()+0x76da) [0x7fa7f0f996da]
 11: (clone()+0x5f) [0x7fa7ef662d7f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 -1976> 2017-08-17 08:38:47.913101 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
 -1975> 2017-08-17 08:38:47.913223 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
 -1937> 2017-08-17 08:38:47.914890 7fa7f1f01a40 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
 -1897> 2017-08-17 08:38:48.555102 7fa7f1f01a40 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
 -1005> 2017-08-17 08:38:48.605621 7fa7f1f01a40 -1 osd.6 8844 log_to_monitors {default=true}
     0> 2017-08-17 08:38:52.415176 7fa7d7b44700 -1 /build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::populate_obc_watchers(ObjectContextRef)' thread 7fa7d7b44700 time 2017-08-17 08:38:52.408154
/build/ceph-11.2.1/src/osd/PrimaryLogPG.cc: 8960: FAILED assert((recovering.count(obc->obs.oi.soid) || !is_missing_object(obc->obs.oi.soid)) || (pg_log.get_log().objects.count(obc->obs.oi.soid) && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->op == pg_log_entry_t::LOST_REVERT && pg_log.get_log().objects.find(obc->obs.oi.soid)->second->reverting_to == obc->obs.oi.version))

 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x79) [0x55d750bc60f9]
 2: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 3: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 4: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 5: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 10: (()+0x76da) [0x7fa7f0f996da]
 11: (clone()+0x5f) [0x7fa7ef662d7f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7fa7d7b44700 thread_name:tp_osd_tp
 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (()+0x955dee) [0x55d7509f4dee]
 2: (()+0x11670) [0x7fa7f0fa3670]
 3: (gsignal()+0x9f) [0x7fa7ef58f77f]
 4: (abort()+0x16a) [0x7fa7ef59137a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x264) [0x55d750bc62e4]
 6: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 7: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 8: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 14: (()+0x76da) [0x7fa7f0f996da]
 15: (clone()+0x5f) [0x7fa7ef662d7f]
2017-08-17 08:38:52.430465 7fa7d7b44700 -1 *** Caught signal (Aborted) **
 in thread 7fa7d7b44700 thread_name:tp_osd_tp

 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (()+0x955dee) [0x55d7509f4dee]
 2: (()+0x11670) [0x7fa7f0fa3670]
 3: (gsignal()+0x9f) [0x7fa7ef58f77f]
 4: (abort()+0x16a) [0x7fa7ef59137a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x264) [0x55d750bc62e4]
 6: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 7: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 8: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 14: (()+0x76da) [0x7fa7f0f996da]
 15: (clone()+0x5f) [0x7fa7ef662d7f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2017-08-17 08:38:52.430465 7fa7d7b44700 -1 *** Caught signal (Aborted) **
 in thread 7fa7d7b44700 thread_name:tp_osd_tp

 ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
 1: (()+0x955dee) [0x55d7509f4dee]
 2: (()+0x11670) [0x7fa7f0fa3670]
 3: (gsignal()+0x9f) [0x7fa7ef58f77f]
 4: (abort()+0x16a) [0x7fa7ef59137a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x264) [0x55d750bc62e4]
 6: (PrimaryLogPG::populate_obc_watchers(std::shared_ptr<ObjectContext>)+0x811) [0x55d7506201c1]
 7: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0xaec) [0x55d75062131c]
 8: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x841) [0x55d750668ad1]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xbce) [0x55d75067255e]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d7504dc8c6]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x69e) [0x55d7504dd57e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d750bcae99]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d750bce250]
 14: (()+0x76da) [0x7fa7f0f996da]
 15: (clone()+0x5f) [0x7fa7ef662d7f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Also available in: Atom PDF