Bug #8008
osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair
Status: Resolved
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Here is the log from the crashed OSD:
-2> 2014-04-07 22:01:14.289703 7fc4a1488700 5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289703, event: waiting_for_osdmap, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
-1> 2014-04-07 22:01:14.289754 7fc49b47c700 5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289753, event: reached_pg, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
0> 2014-04-07 22:01:14.294263 7fc49b47c700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fc49b47c700 time 2014-04-07 22:01:14.289856
osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid))
ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
1: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
10: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
11: (()+0x8062) [0x7fc4b5a7b062]
12: (clone()+0x6d) [0x7fc4b41a5a3d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) ** in thread 7fc49b47c700
ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
2: (()+0xf880) [0x7fc4b5a82880]
3: (gsignal()+0x39) [0x7fc4b40f53a9]
4: (abort()+0x148) [0x7fc4b40f84c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
6: (()+0x63996) [0x7fc4b49e5996]
7: (()+0x639c3) [0x7fc4b49e59c3]
8: (()+0x63bee) [0x7fc4b49e5bee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
20: (()+0x8062) [0x7fc4b5a7b062]
21: (clone()+0x6d) [0x7fc4b41a5a3d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-4> 2014-04-07 22:01:14.341932 7fc4a248a700 1 -- 192.168.0.250:6805/14656 <== mon.4 192.168.0.250:6789/0 548 ==== log(last 52) v1 ==== 24+0+0 (2322174735 0 0) 0x7fc4c85e08c0 con 0x7fc4dff14160
-3> 2014-04-07 22:01:14.341955 7fc4a248a700 10 handle_log_ack log(last 52) v1
-2> 2014-04-07 22:01:14.341957 7fc4a248a700 10 logged 2014-04-07 22:01:14.119753 osd.5 192.168.0.250:6805/14656 51 : [ERR] 2.1a repair 0 missing, 1 inconsistent objects
-1> 2014-04-07 22:01:14.341963 7fc4a248a700 10 logged 2014-04-07 22:01:14.119771 osd.5 192.168.0.250:6805/14656 52 : [ERR] 2.1a repair 3 errors, 3 fixed
0> 2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) ** in thread 7fc49b47c700
ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
2: (()+0xf880) [0x7fc4b5a82880]
3: (gsignal()+0x39) [0x7fc4b40f53a9]
4: (abort()+0x148) [0x7fc4b40f84c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
6: (()+0x63996) [0x7fc4b49e5996]
7: (()+0x639c3) [0x7fc4b49e59c3]
8: (()+0x63bee) [0x7fc4b49e5bee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
20: (()+0x8062) [0x7fc4b5a7b062]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
The crash appears to be related to the repair of an inconsistent PG.
I issued the commands 'ceph pg repair 2.1c' and 'ceph osd repair 5' shortly before osd.5 crashed.