Bug #8008
closedosd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair
0%
Description
Here is the log from crashed OSD:
-2> 2014-04-07 22:01:14.289703 7fc4a1488700 5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289703, event: waiting_for_osdmap, request: MOSDPGPush(2.1a 14605 [PushOp(82 21b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recov ery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_re covered_to:, omap_complete:false))]) v2 -1> 2014-04-07 22:01:14.289754 7fc49b47c700 5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289753, event: reached_pg, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/r b.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info : ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 0> 2014-04-07 22:01:14.294263 7fc49b47c700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fc49b47c700 time 2014-04-07 22:01:14.289856 osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) 1: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01] 2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439] 3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea] 4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e] 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b] 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141] 7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5] 8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc] 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1] 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0] 11: (()+0x8062) [0x7fc4b5a7b062] 12: (clone()+0x6d) [0x7fc4b41a5a3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.5.log --- end dump of recent events --- 2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) ** in thread 7fc49b47c700 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) 1: (()+0x59ba2f) [0x7fc4b6ac5a2f] 2: (()+0xf880) [0x7fc4b5a82880] 3: (gsignal()+0x39) [0x7fc4b40f53a9] 4: (abort()+0x148) [0x7fc4b40f84c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d] 6: (()+0x63996) [0x7fc4b49e5996] 7: (()+0x639c3) [0x7fc4b49e59c3] 8: (()+0x63bee) [0x7fc4b49e5bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692] 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01] 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439] 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea] 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e] 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b] 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141] 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5] 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc] 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1] 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0] 20: (()+0x8062) [0x7fc4b5a7b062] 21: (clone()+0x6d) [0x7fc4b41a5a3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- -4> 2014-04-07 22:01:14.341932 7fc4a248a700 1 -- 192.168.0.250:6805/14656 <== mon.4 192.168.0.250:6789/0 548 ==== log(last 52) v1 ==== 24+0+0 (2322174735 0 0) 0x7fc4c85e08c0 con 0x7fc4dff14160 -3> 2014-04-07 22:01:14.341955 7fc4a248a700 10 handle_log_ack log(last 52) v1 -2> 2014-04-07 22:01:14.341957 7fc4a248a700 10 logged 2014-04-07 22:01:14.119753 osd.5 192.168.0.250:6805/14656 51 : [ERR] 2.1a repair 0 missing, 1 inconsistent objects -1> 2014-04-07 22:01:14.341963 7fc4a248a700 10 logged 2014-04-07 22:01:14.119771 osd.5 192.168.0.250:6805/14656 52 : [ERR] 2.1a repair 3 errors, 3 fixed 0> 2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) ** in thread 7fc49b47c700 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) 1: (()+0x59ba2f) [0x7fc4b6ac5a2f] 2: (()+0xf880) [0x7fc4b5a82880] 3: (gsignal()+0x39) [0x7fc4b40f53a9] 4: (abort()+0x148) [0x7fc4b40f84c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d] 6: (()+0x63996) [0x7fc4b49e5996] 7: (()+0x639c3) [0x7fc4b49e59c3] 8: (()+0x63bee) [0x7fc4b49e5bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692] 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01] 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439] 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea] 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e] 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b] 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141] 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5] 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc] 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1] 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0] 20: (()+0x8062) [0x7fc4b5a7b062] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.5.log --- end dump of recent events ---
Crash appears to be related to repair of inconsistent PG.
I gave commands 'ceph pg repair 2.1c' and 'ceph osd repair 5' shortly before crash of osd.5.
Updated by Samuel Just about 10 years ago
- Priority changed from Normal to Urgent
Updated by Sage Weil about 10 years ago
- Subject changed from 0.78: OSD crash to osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair
- Source changed from other to Community (user)
Updated by Sage Weil about 10 years ago
- Status changed from New to In Progress
- Assignee set to Sage Weil
Updated by Sage Weil about 10 years ago
- Status changed from In Progress to Fix Under Review
Updated by Samuel Just about 10 years ago
- Status changed from Fix Under Review to Resolved
Updated by Dmitry Smirnov about 10 years ago
Unfortunately revision 6ff645f5 applied on top of 0.79 did not fixed the issue.
I can't repair inconsistent PG.
# ceph pg repair 2.1f
2014-04-20 06:19:18.605576 osd.5 [ERR] 2.1f shard (2,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 1852679265 != known digest 2841785730, size 4194304 != known size 757760 2014-04-20 06:19:18.605582 osd.5 [ERR] 2.1f shard (5,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 1852679265 != known digest 2841785730, size 4194304 != known size 757760 2014-04-20 06:23:24.952515 osd.3 [ERR] 2.1f push 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 v 17671'231000 failed because local copy is 17671'231034 2014-04-20 06:23:24.928470 osd.5 [ERR] 2.1f repair 0 missing, 1 inconsistent objects 2014-04-20 06:23:24.928505 osd.5 [ERR] 2.1f repair 2 errors, 2 fixed
At this point OSD.5 stuck without response for over 1000 sec.; 1 "unfound" object appeared which I had to fix by
ceph pg 2.1f mark_unfound_lost revert
because cluster stopped responding as soon as unfound object appeared. After that OSD.5 crashed:
0> 2014-04-20 06:39:21.465666 7fc92585b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::sub_op_modify(OpRequestRef)' thread 7fc92585b700 time 2014-04-20 06:39:21.452230 osd/ReplicatedPG.cc: 7175: FAILED assert(!parent->get_log().get_missing().is_missing(soid)) ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e] 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6] 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb] 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151] 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74] 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c] 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61] 8: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50] 9: (()+0x8062) [0x7fc93fd69062] 10: (clone()+0x6d) [0x7fc93e28fa3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.5.log --- end dump of recent events --- 2014-04-20 06:39:21.538171 7fc92585b700 -1 *** Caught signal (Aborted) ** in thread 7fc92585b700 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (()+0x59d23f) [0x7fc940bb123f] 2: (()+0xf880) [0x7fc93fd70880] 3: (gsignal()+0x39) [0x7fc93e1df3a9] 4: (abort()+0x148) [0x7fc93e1e24c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc93ead189d] 6: (()+0x63996) [0x7fc93eacf996] 7: (()+0x639c3) [0x7fc93eacf9c3] 8: (()+0x63bee) [0x7fc93eacfbee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc940c91ca2] 10: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e] 11: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6] 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb] 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151] 14: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74] 15: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c] 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61] 17: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50] 18: (()+0x8062) [0x7fc93fd69062] 19: (clone()+0x6d) [0x7fc93e28fa3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- -40> 2014-04-20 06:39:21.516088 7fc92b066700 1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 26 ==== pg_notify(2.66(121),1.67(121) epoch 17675) v5 ==== 15514+0+0 (2774128543 0 0) 0x7fc95dd82700 con 0x7fc95bb75a20 -39> 2014-04-20 06:39:21.516116 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.515420, event: header_read, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -38> 2014-04-20 06:39:21.516123 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.515423, event: throttled, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -37> 2014-04-20 06:39:21.516128 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516066, event: all_read, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -36> 2014-04-20 06:39:21.516132 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516112, event: dispatched, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -35> 2014-04-20 06:39:21.516138 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516138, event: waiting_for_osdmap, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -34> 2014-04-20 06:39:21.516146 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516145, event: started, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -33> 2014-04-20 06:39:21.516179 7fc92b066700 5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516179, event: done, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5 -32> 2014-04-20 06:39:21.516513 7fc92b066700 1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 27 ==== pg_info(1 pgs e17675:0.42) v4 ==== 704+0+0 (382868193 0 0) 0x7fc95e8a4c40 con 0x7fc95bb75a20 -31> 2014-04-20 06:39:21.516528 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516177, event: header_read, request: pg_info(1 pgs e17675:0.42) v4 -30> 2014-04-20 06:39:21.516533 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516179, event: throttled, request: pg_info(1 pgs e17675:0.42) v4 -29> 2014-04-20 06:39:21.516538 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516230, event: all_read, request: pg_info(1 pgs e17675:0.42) v4 -28> 2014-04-20 06:39:21.516542 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516526, event: dispatched, request: pg_info(1 pgs e17675:0.42) v4 -27> 2014-04-20 06:39:21.516547 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516547, event: waiting_for_osdmap, request: pg_info(1 pgs e17675:0.42) v4 -26> 2014-04-20 06:39:21.516554 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516553, event: started, request: pg_info(1 pgs e17675:0.42) v4 -25> 2014-04-20 06:39:21.516574 7fc92b066700 5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516574, event: done, request: pg_info(1 pgs e17675:0.42) v4 -24> 2014-04-20 06:39:21.519532 7fc92b066700 1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 28 ==== pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 ==== 39 298+0+0 (1151709690 0 0) 0x7fc95dbe0c40 con 0x7fc95bb75a20 -23> 2014-04-20 06:39:21.519551 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.518480, event: header_read, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -22> 2014-04-20 06:39:21.519559 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.518482, event: throttled, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -21> 2014-04-20 06:39:21.519565 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519516, event: all_read, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -20> 2014-04-20 06:39:21.519570 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519549, event: dispatched, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -19> 2014-04-20 06:39:21.519576 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519576, event: waiting_for_osdmap, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -18> 2014-04-20 06:39:21.519584 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519584, event: started, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -17> 2014-04-20 06:39:21.519629 7fc92b066700 5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519628, event: done, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 -16> 2014-04-20 06:39:21.519698 7fc92b066700 1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 29 ==== MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 ==== 26+0+0 (288802618 0 0) 0x7fc96486a380 con 0x7fc95bb75a20 -15> 2014-04-20 06:39:21.519714 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519531, event: header_read, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -14> 2014-04-20 06:39:21.519720 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519532, event: throttled, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -13> 2014-04-20 06:39:21.519724 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519563, event: all_read, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -12> 2014-04-20 06:39:21.519728 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519712, event: dispatched, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -11> 2014-04-20 06:39:21.519733 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519733, event: waiting_for_osdmap, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -10> 2014-04-20 06:39:21.519749 7fc92b066700 5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519749, event: done, request: MRecoveryReserve REQUEST pgid: 0.42, query_epoch: 17675 v2 -9> 2014-04-20 06:39:21.519758 7fc92505a700 5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] exit Started/ReplicaActive/RepNotRecovering 0.283411 3 0.000282 -8> 2014-04-20 06:39:21.519778 7fc92505a700 5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] enter Started/ReplicaActive/RepWaitRecoveryReserved -7> 2014-04-20 06:39:21.519821 7fc92505a700 1 -- 192.168.0.250:6815/14731 --> osd.10 192.168.0.2:6802/16401 -- MRecoveryReserve GRANT pgid: 0.42, query_epoch: 17675 v2 -- ?+0 0x7fc968b3aa80 -6> 2014-04-20 06:39:21.519834 7fc92505a700 5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000056 1 0.000035 -5> 2014-04-20 06:39:21.519848 7fc92505a700 5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] enter Started/ReplicaActive/RepRecovering -4> 2014-04-20 06:39:21.537485 7fc928861700 1 -- 192.168.0.250:6824/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95d961dc0 con 0x7fc95ddd2420 -3> 2014-04-20 06:39:21.537495 7fc929863700 1 -- 192.168.0.250:6825/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95e318a80 con 0x7fc95ddd3080 -2> 2014-04-20 06:39:21.537514 7fc928861700 1 -- 192.168.0.250:6824/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95d1fb6c0 con 0x7fc95ddd2420 -3> 2014-04-20 06:39:21.537495 7fc929863700 1 -- 192.168.0.250:6825/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95e318a80 con 0x7fc95ddd3080 -2> 2014-04-20 06:39:21.537514 7fc928861700 1 -- 192.168.0.250:6824/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95d1fb6c0 con 0x7fc95ddd2420 -1> 2014-04-20 06:39:21.537549 7fc929863700 1 -- 192.168.0.250:6825/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95e202700 con 0x7fc95ddd3080 0> 2014-04-20 06:39:21.538171 7fc92585b700 -1 *** Caught signal (Aborted) ** in thread 7fc92585b700 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (()+0x59d23f) [0x7fc940bb123f] 2: (()+0xf880) [0x7fc93fd70880] 3: (gsignal()+0x39) [0x7fc93e1df3a9] 4: (abort()+0x148) [0x7fc93e1e24c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc93ead189d] 6: (()+0x63996) [0x7fc93eacf996] 7: (()+0x639c3) [0x7fc93eacf9c3] 8: (()+0x63bee) [0x7fc93eacfbee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc940c91ca2] 10: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e] 11: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6] 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb] 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151] 14: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74] 15: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c] 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61] 17: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50] 18: (()+0x8062) [0x7fc93fd69062] 19: (clone()+0x6d) [0x7fc93e28fa3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.5.log --- end dump of recent events ---
Updated by Dmitry Smirnov about 10 years ago
This may be similar to #8048 so I applied corresponding fix 3d0e80ac as well as number of other post-0.79 PG-related fixes on top of 0.79.
However another attempt to repair inconsistent PG ended up with yet another OSD crash:
2014-04-20 11:13:32.323721 osd.10 [ERR] 2.1f shard (10,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 size 4194304 != known size 757760 2014-04-20 11:14:01.787642 osd.10 [ERR] 2.1f scrub 0 missing, 1 inconsistent objects 2014-04-20 11:14:01.787648 osd.10 [ERR] 2.1f scrub 1 errors 2014-04-20 11:18:55.299895 osd.10 [ERR] 2.1f shard (10,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 2833429890 != known digest 3985900942, size 4194304 != known size 757760 2014-04-20 11:24:44.234940 osd.10 [ERR] 2.1f repair 0 missing, 1 inconsistent objects 2014-04-20 11:24:44.234957 osd.10 [ERR] 2.1f repair 1 errors, 1 fixed 2014-04-20 11:24:44.243471 osd.5 [ERR] 2.1f push 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 v 17772'235539 failed because local copy is 17775'235656
From OSD.10 log:
0> 2014-04-20 11:28:31.613466 7f88428f4700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f88428f4700 time 2014-04-20 11:28:31.566273 osd/ReplicatedPG.cc: 8669: FAILED assert(is_backfill_targets(peer)) ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f] 2: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3] 3: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost:: detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c] 4: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94] 5: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777] 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1] 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90] 8: (()+0x8062) [0x7f885c997062] 9: (clone()+0x6d) [0x7f885aebea3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.10.log --- end dump of recent events --- 2014-04-20 11:28:32.060148 7f88428f4700 -1 *** Caught signal (Aborted) ** in thread 7f88428f4700 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (()+0x59e47f) [0x7f885d7e047f] 2: (()+0xf880) [0x7f885c99e880] 3: (gsignal()+0x39) [0x7f885ae0e3a9] 4: (abort()+0x148) [0x7f885ae114c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f885b6fb5e5] 6: (()+0x5e746) [0x7f885b6f9746] 7: (()+0x5e773) [0x7f885b6f9773] 8: (()+0x5e9b2) [0x7f885b6f99b2] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f885d8c0ee2] 10: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f] 11: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3] 12: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost:: detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c] 13: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94] 14: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777] 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1] 16: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90] 17: (()+0x8062) [0x7f885c997062] 18: (clone()+0x6d) [0x7f885aebea3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- -41> 2014-04-20 11:28:31.714621 7f88478fe700 1 -- 192.168.0.2:6822/29335 <== osd.3 192.168.0.2:0/29928 644 ==== osd_ping(ping e17775 stamp 2014-04-20 11:28:31.714032) v2 ==== 47+0+0 (2170213909 0 0) 0x7f8864a2f6c0 con 0x7f88638b4840 -40> 2014-04-20 11:28:31.714648 7f88478fe700 1 -- 192.168.0.2:6822/29335 --> 192.168.0.2:0/29928 -- osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.714032) v2 -- ?+0 0x7f88645c0fc0 con 0x7f88638b4840 -39> 2014-04-20 11:28:31.714677 7f8848900700 1 -- 192.168.0.2:6823/29335 <== osd.3 192.168.0.2:0/29928 644 ==== osd_ping(ping e17775 stamp 2014-04-20 11:28:31.714032) v2 ==== 47+0+0 (2170213909 0 0) 0x7f8863e73180 con 0x7f88638b4dc0 -38> 2014-04-20 11:28:31.714699 7f8848900700 1 -- 192.168.0.2:6823/29335 --> 192.168.0.2:0/29928 -- osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.714032) v2 -- ?+0 0x7f8863f11500 con 0x7f88638b4dc0 -37> 2014-04-20 11:28:31.827090 7f88420f3700 5 osd.10 17775 heartbeat: osd_stat(66425 MB used, 166 GB avail, 232 GB total, peers [0,1,2,3,4,5,6,7,8]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5]) -36> 2014-04-20 11:28:31.827114 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6823/25087 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88646f2700 con 0x7f8862d17600 -35> 2014-04-20 11:28:31.827136 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6824/25087 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864ac0fc0 con 0x7f8864a6fe40 -34> 2014-04-20 11:28:31.827149 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6813/26721 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864de8fc0 con 0x7f8863bc2b00 -33> 2014-04-20 11:28:31.827161 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6816/26721 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88651d1180 con 0x7f88641da2c0 -32> 2014-04-20 11:28:31.827171 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6804/13883 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864c89500 con 0x7f88636a4f20 -31> 2014-04-20 11:28:31.827182 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6806/13883 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864753500 con 0x7f88636a51e0 -30> 2014-04-20 11:28:31.827194 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6825/29928 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f886421e000 con 0x7f88638ae160 -29> 2014-04-20 11:28:31.827206 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6826/29928 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864847500 con 0x7f8863afb4a0 -28> 2014-04-20 11:28:31.827219 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6804/25696 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88647d2a80 con 0x7f8864a6fce0 -27> 2014-04-20 11:28:31.827232 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6805/25696 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e10700 con 0x7f8864a6fb80 -26> 2014-04-20 11:28:31.827245 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6808/28015 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88647d5880 con 0x7f886385db80 -25> 2014-04-20 11:28:31.827257 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6810/28015 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e0afc0 con 0x7f8862687ce0 -24> 2014-04-20 11:28:31.827302 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6813/14254 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863aa0c40 con 0x7f886370e840 -23> 2014-04-20 11:28:31.827314 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6816/14254 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88646f3c00 con 0x7f886370e6e0 -22> 2014-04-20 11:28:31.827325 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6814/30829 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864178000 con 0x7f8862ff5ce0 -21> 2014-04-20 11:28:31.827337 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6815/30829 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e94000 con 0x7f88637b4000 -20> 2014-04-20 11:28:31.827348 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6819/31811 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864512540 con 0x7f8863803b80 -19> 2014-04-20 11:28:31.827359 7f88420f3700 1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6820/31811 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8865126a80 con 0x7f88641ae2c0 -18> 2014-04-20 11:28:31.827392 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.3 192.168.0.2:6825/29928 632 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864ce9dc0 con 0x7f88638ae160 -17> 2014-04-20 11:28:31.827433 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.3 192.168.0.2:6826/29928 632 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864512540 con 0x7f8863afb4a0 -16> 2014-04-20 11:28:31.827517 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.1 192.168.0.250:6813/26721 399 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88645c1180 con 0x7f8863bc2b00 -15> 2014-04-20 11:28:31.827540 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.8 192.168.0.2:6819/31811 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864be56c0 con 0x7f8863803b80 -14> 2014-04-20 11:28:31.827564 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.8 192.168.0.2:6820/31811 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864a5c8c0 con 0x7f88641ae2c0 -13> 2014-04-20 11:28:31.827593 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.0 192.168.0.250:6824/25087 556 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863e0b6c0 con 0x7f8864a6fe40 -12> 2014-04-20 11:28:31.827614 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.0 192.168.0.250:6823/25087 556 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864728e00 con 0x7f8862d17600 -11> 2014-04-20 11:28:31.827730 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.1 192.168.0.250:6816/26721 399 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864787180 con 0x7f88641da2c0 -10> 2014-04-20 11:28:31.827750 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.7 192.168.0.2:6814/30829 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864f9f500 con 0x7f8862ff5ce0 -9> 2014-04-20 11:28:31.827773 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.7 192.168.0.2:6815/30829 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863e0e1c0 con 0x7f88637b4000 -8> 2014-04-20 11:28:31.827861 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.5 192.168.0.250:6808/28015 389 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864ac0540 con 0x7f886385db80 -7> 2014-04-20 11:28:31.827885 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.5 192.168.0.250:6810/28015 389 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863aa0000 con 0x7f8862687ce0 -6> 2014-04-20 11:28:31.827900 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.6 192.168.0.204:6813/14254 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864598e00 con 0x7f886370e840 -5> 2014-04-20 11:28:31.827914 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.4 192.168.0.250:6804/25696 440 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f886380ddc0 con 0x7f8864a6fce0 -4> 2014-04-20 11:28:31.827929 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.4 192.168.0.250:6805/25696 440 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864728540 con 0x7f8864a6fb80 -3> 2014-04-20 11:28:31.827952 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.6 192.168.0.204:6816/14254 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88645ecfc0 con 0x7f886370e6e0 -2> 2014-04-20 11:28:31.828104 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.2 192.168.0.204:6806/13883 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88646f2540 con 0x7f88636a51e0 -1> 2014-04-20 11:28:31.828128 7f8849101700 1 -- 192.168.0.2:0/29335 <== osd.2 192.168.0.204:6804/13883 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88651d0000 con 0x7f88636a4f20 0> 2014-04-20 11:28:32.060148 7f88428f4700 -1 *** Caught signal (Aborted) ** in thread 7f88428f4700 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e) 1: (()+0x59e47f) [0x7f885d7e047f] 2: (()+0xf880) [0x7f885c99e880] 3: (gsignal()+0x39) [0x7f885ae0e3a9] 4: (abort()+0x148) [0x7f885ae114c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f885b6fb5e5] 6: (()+0x5e746) [0x7f885b6f9746] 7: (()+0x5e773) [0x7f885b6f9773] 8: (()+0x5e9b2) [0x7f885b6f99b2] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f885d8c0ee2] 10: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f] 11: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3] 12: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost:: detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c] 13: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94] 14: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777] 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1] 16: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90] 17: (()+0x8062) [0x7f885c997062] 18: (clone()+0x6d) [0x7f885aebea3d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.10.log --- end dump of recent events ---
At this point there were 1 unfound object. I issued
ceph pg 2.1f mark_unfound_lost revert
and apparently this fixed inconsistent PG (as I verified with "ceph pg scrub 2.1f" and "ceph pg deep-scrub 2.1f" which returned "2.1f scrub ok" and "2.1f deep-scrub ok" accordingly).
So there is some improvement but perhaps one crash is still due to be fixed...