Project

General

Profile

Actions

Bug #8008

closed

osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair

Added by Dmitry Smirnov about 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Here is the log from crashed OSD:

    -2> 2014-04-07 22:01:14.289703 7fc4a1488700  5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289703, event: waiting_for_osdmap, request: MOSDPGPush(2.1a 14605 [PushOp(82
21b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recov
ery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, 
data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_re
covered_to:, omap_complete:false))]) v2
    -1> 2014-04-07 22:01:14.289754 7fc49b47c700  5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289753, event: reached_pg, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/r
b.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info
: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
     0> 2014-04-07 22:01:14.294263 7fc49b47c700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fc49b47c700 time 2014-04-07 22:01:14.289856
osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid))

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 11: (()+0x8062) [0x7fc4b5a7b062]
 12: (clone()+0x6d) [0x7fc4b41a5a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) **
 in thread 7fc49b47c700

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
 2: (()+0xf880) [0x7fc4b5a82880]
 3: (gsignal()+0x39) [0x7fc4b40f53a9]
 4: (abort()+0x148) [0x7fc4b40f84c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
 6: (()+0x63996) [0x7fc4b49e5996]
 7: (()+0x639c3) [0x7fc4b49e59c3]
 8: (()+0x63bee) [0x7fc4b49e5bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 20: (()+0x8062) [0x7fc4b5a7b062]
 21: (clone()+0x6d) [0x7fc4b41a5a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -4> 2014-04-07 22:01:14.341932 7fc4a248a700  1 -- 192.168.0.250:6805/14656 <== mon.4 192.168.0.250:6789/0 548 ==== log(last 52) v1 ==== 24+0+0 (2322174735 0 0) 0x7fc4c85e08c0 con 0x7fc4dff14160
    -3> 2014-04-07 22:01:14.341955 7fc4a248a700 10 handle_log_ack log(last 52) v1
    -2> 2014-04-07 22:01:14.341957 7fc4a248a700 10  logged 2014-04-07 22:01:14.119753 osd.5 192.168.0.250:6805/14656 51 : [ERR] 2.1a repair 0 missing, 1 inconsistent objects
    -1> 2014-04-07 22:01:14.341963 7fc4a248a700 10  logged 2014-04-07 22:01:14.119771 osd.5 192.168.0.250:6805/14656 52 : [ERR] 2.1a repair 3 errors, 3 fixed
     0> 2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) **
 in thread 7fc49b47c700

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
 2: (()+0xf880) [0x7fc4b5a82880]
 3: (gsignal()+0x39) [0x7fc4b40f53a9]
 4: (abort()+0x148) [0x7fc4b40f84c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
 6: (()+0x63996) [0x7fc4b49e5996]
 7: (()+0x639c3) [0x7fc4b49e59c3]
 8: (()+0x63bee) [0x7fc4b49e5bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 20: (()+0x8062) [0x7fc4b5a7b062]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---

Crash appears to be related to repair of inconsistent PG.
I gave commands 'ceph pg repair 2.1c' and 'ceph osd repair 5' shortly before crash of osd.5.

Actions #1

Updated by Samuel Just about 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil about 10 years ago

  • Subject changed from 0.78: OSD crash to osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair
  • Source changed from other to Community (user)
Actions #3

Updated by Sage Weil about 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
Actions #4

Updated by Sage Weil about 10 years ago

  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by Sage Weil about 10 years ago

  • Assignee deleted (Sage Weil)
Actions #6

Updated by Samuel Just about 10 years ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Dmitry Smirnov almost 10 years ago

Unfortunately revision 6ff645f5 applied on top of 0.79 did not fixed the issue.
I can't repair inconsistent PG.

# ceph pg repair 2.1f
2014-04-20 06:19:18.605576 osd.5 [ERR] 2.1f shard (2,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 1852679265 != known digest 2841785730, size 4194304 != known size 757760                                                                                
2014-04-20 06:19:18.605582 osd.5 [ERR] 2.1f shard (5,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 1852679265 != known digest 2841785730, size 4194304 != known size 757760                                                                                
2014-04-20 06:23:24.952515 osd.3 [ERR] 2.1f push 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 v 17671'231000 failed because local copy is 17671'231034
2014-04-20 06:23:24.928470 osd.5 [ERR] 2.1f repair 0 missing, 1 inconsistent objects
2014-04-20 06:23:24.928505 osd.5 [ERR] 2.1f repair 2 errors, 2 fixed

At this point OSD.5 stuck without response for over 1000 sec.; 1 "unfound" object appeared which I had to fix by

ceph pg 2.1f mark_unfound_lost revert

because cluster stopped responding as soon as unfound object appeared. After that OSD.5 crashed:

     0> 2014-04-20 06:39:21.465666 7fc92585b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedBackend::sub_op_modify(OpRequestRef)' thread 7fc92585b700 time 2014-04-20 06:39:21.452230
osd/ReplicatedPG.cc: 7175: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e]
 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50]
 9: (()+0x8062) [0x7fc93fd69062]
 10: (clone()+0x6d) [0x7fc93e28fa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2014-04-20 06:39:21.538171 7fc92585b700 -1 *** Caught signal (Aborted) **
 in thread 7fc92585b700

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (()+0x59d23f) [0x7fc940bb123f]
 2: (()+0xf880) [0x7fc93fd70880]
 3: (gsignal()+0x39) [0x7fc93e1df3a9]
 4: (abort()+0x148) [0x7fc93e1e24c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc93ead189d]
 6: (()+0x63996) [0x7fc93eacf996]
 7: (()+0x639c3) [0x7fc93eacf9c3]
 8: (()+0x63bee) [0x7fc93eacfbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc940c91ca2]
 10: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e]
 11: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6]
 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151]
 14: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74]
 15: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50]
 18: (()+0x8062) [0x7fc93fd69062]
 19: (clone()+0x6d) [0x7fc93e28fa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -40> 2014-04-20 06:39:21.516088 7fc92b066700  1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 26 ==== pg_notify(2.66(121),1.67(121) epoch 17675) v5 ==== 15514+0+0 (2774128543 0 0) 0x7fc95dd82700 con 0x7fc95bb75a20
   -39> 2014-04-20 06:39:21.516116 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.515420, event: header_read, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -38> 2014-04-20 06:39:21.516123 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.515423, event: throttled, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -37> 2014-04-20 06:39:21.516128 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516066, event: all_read, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -36> 2014-04-20 06:39:21.516132 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516112, event: dispatched, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -35> 2014-04-20 06:39:21.516138 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516138, event: waiting_for_osdmap, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -34> 2014-04-20 06:39:21.516146 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516145, event: started, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -33> 2014-04-20 06:39:21.516179 7fc92b066700  5 -- op tracker -- , seq: 723174, time: 2014-04-20 06:39:21.516179, event: done, request: pg_notify(2.66(121),1.67(121) epoch 17675) v5
   -32> 2014-04-20 06:39:21.516513 7fc92b066700  1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 27 ==== pg_info(1 pgs e17675:0.42) v4 ==== 704+0+0 (382868193 0 0) 0x7fc95e8a4c40 con 0x7fc95bb75a20
   -31> 2014-04-20 06:39:21.516528 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516177, event: header_read, request: pg_info(1 pgs e17675:0.42) v4
   -30> 2014-04-20 06:39:21.516533 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516179, event: throttled, request: pg_info(1 pgs e17675:0.42) v4
   -29> 2014-04-20 06:39:21.516538 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516230, event: all_read, request: pg_info(1 pgs e17675:0.42) v4
   -28> 2014-04-20 06:39:21.516542 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516526, event: dispatched, request: pg_info(1 pgs e17675:0.42) v4
   -27> 2014-04-20 06:39:21.516547 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516547, event: waiting_for_osdmap, request: pg_info(1 pgs e17675:0.42) v4
   -26> 2014-04-20 06:39:21.516554 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516553, event: started, request: pg_info(1 pgs e17675:0.42) v4
   -25> 2014-04-20 06:39:21.516574 7fc92b066700  5 -- op tracker -- , seq: 723175, time: 2014-04-20 06:39:21.516574, event: done, request: pg_info(1 pgs e17675:0.42) v4
   -24> 2014-04-20 06:39:21.519532 7fc92b066700  1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 28 ==== pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5 ==== 39
298+0+0 (1151709690 0 0) 0x7fc95dbe0c40 con 0x7fc95bb75a20
   -23> 2014-04-20 06:39:21.519551 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.518480, event: header_read, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -22> 2014-04-20 06:39:21.519559 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.518482, event: throttled, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -21> 2014-04-20 06:39:21.519565 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519516, event: all_read, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -20> 2014-04-20 06:39:21.519570 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519549, event: dispatched, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -19> 2014-04-20 06:39:21.519576 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519576, event: waiting_for_osdmap, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -18> 2014-04-20 06:39:21.519584 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519584, event: started, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -17> 2014-04-20 06:39:21.519629 7fc92b066700  5 -- op tracker -- , seq: 723176, time: 2014-04-20 06:39:21.519628, event: done, request: pg_notify(0.68(121),0.75(268),1.74(253) epoch 17675) v5
   -16> 2014-04-20 06:39:21.519698 7fc92b066700  1 -- 192.168.0.250:6815/14731 <== osd.10 192.168.0.2:6802/16401 29 ==== MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2 ==== 26+0+0 (288802618 0 0) 0x7fc96486a380 con 0x7fc95bb75a20
   -15> 2014-04-20 06:39:21.519714 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519531, event: header_read, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
   -14> 2014-04-20 06:39:21.519720 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519532, event: throttled, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
   -13> 2014-04-20 06:39:21.519724 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519563, event: all_read, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
   -12> 2014-04-20 06:39:21.519728 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519712, event: dispatched, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
   -11> 2014-04-20 06:39:21.519733 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519733, event: waiting_for_osdmap, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
   -10> 2014-04-20 06:39:21.519749 7fc92b066700  5 -- op tracker -- , seq: 723177, time: 2014-04-20 06:39:21.519749, event: done, request: MRecoveryReserve REQUEST  pgid: 0.42, query_epoch: 17675 v2
    -9> 2014-04-20 06:39:21.519758 7fc92505a700  5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] exit Started/ReplicaActive/RepNotRecovering 0.283411 3 0.000282
    -8> 2014-04-20 06:39:21.519778 7fc92505a700  5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] enter Started/ReplicaActive/RepWaitRecoveryReserved
    -7> 2014-04-20 06:39:21.519821 7fc92505a700  1 -- 192.168.0.250:6815/14731 --> osd.10 192.168.0.2:6802/16401 -- MRecoveryReserve GRANT  pgid: 0.42, query_epoch: 17675 v2 -- ?+0 0x7fc968b3aa80
    -6> 2014-04-20 06:39:21.519834 7fc92505a700  5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000056 1 0.000035
    -5> 2014-04-20 06:39:21.519848 7fc92505a700  5 osd.5 pg_epoch: 17675 pg[0.42( v 17665'277358 (17279'274357,17665'277358] local-les=17675 n=11044 ec=1 les/c 17675/17639 17673/17674/17673) [10,4,2,5]/[10,2,5,0] r=2 lpr=17674 pi=7870-17673/559 luod=0'0 crt=16509'270369 lcod 17646'277357 active+remapped] enter Started/ReplicaActive/RepRecovering
    -4> 2014-04-20 06:39:21.537485 7fc928861700  1 -- 192.168.0.250:6824/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95d961dc0 con 0x7fc95ddd2420
    -3> 2014-04-20 06:39:21.537495 7fc929863700  1 -- 192.168.0.250:6825/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95e318a80 con 0x7fc95ddd3080
    -2> 2014-04-20 06:39:21.537514 7fc928861700  1 -- 192.168.0.250:6824/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95d1fb6c0 con 0x7fc95ddd2420
    -3> 2014-04-20 06:39:21.537495 7fc929863700  1 -- 192.168.0.250:6825/14731 <== osd.1 192.168.0.250:0/14378 15801 ==== osd_ping(ping e17675 stamp 2014-04-20 06:39:21.537249) v2 ==== 47+0+0 (3692397561 0 0) 0x7fc95e318a80 con 0x7fc95ddd3080
    -2> 2014-04-20 06:39:21.537514 7fc928861700  1 -- 192.168.0.250:6824/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95d1fb6c0 con 0x7fc95ddd2420
    -1> 2014-04-20 06:39:21.537549 7fc929863700  1 -- 192.168.0.250:6825/14731 --> 192.168.0.250:0/14378 -- osd_ping(ping_reply e17675 stamp 2014-04-20 06:39:21.537249) v2 -- ?+0 0x7fc95e202700 con 0x7fc95ddd3080
     0> 2014-04-20 06:39:21.538171 7fc92585b700 -1 *** Caught signal (Aborted) **
 in thread 7fc92585b700

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (()+0x59d23f) [0x7fc940bb123f]
 2: (()+0xf880) [0x7fc93fd70880]
 3: (gsignal()+0x39) [0x7fc93e1df3a9]
 4: (abort()+0x148) [0x7fc93e1e24c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc93ead189d]
 6: (()+0x63996) [0x7fc93eacf996]
 7: (()+0x639c3) [0x7fc93eacf9c3]
 8: (()+0x63bee) [0x7fc93eacfbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc940c91ca2]
 10: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x170e) [0x7fc940a4b25e]
 11: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6) [0x7fc940b3aba6]
 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc9409e53fb]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc940845151]
 14: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1f4) [0x7fc94085ef74]
 15: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc9408a1c3c]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc940c82b61]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x7fc940c83a50]
 18: (()+0x8062) [0x7fc93fd69062]
 19: (clone()+0x6d) [0x7fc93e28fa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---

Actions #8

Updated by Dmitry Smirnov almost 10 years ago

This may be similar to #8048 so I applied corresponding fix 3d0e80ac as well as number of other post-0.79 PG-related fixes on top of 0.79.
However another attempt to repair inconsistent PG ended up with yet another OSD crash:

2014-04-20 11:13:32.323721 osd.10 [ERR] 2.1f shard (10,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 size 4194304 != known size 757760
2014-04-20 11:14:01.787642 osd.10 [ERR] 2.1f scrub 0 missing, 1 inconsistent objects
2014-04-20 11:14:01.787648 osd.10 [ERR] 2.1f scrub 1 errors
2014-04-20 11:18:55.299895 osd.10 [ERR] 2.1f shard (10,255): soid 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 digest 2833429890 != known digest 3985900942, size 4194304 != known size 757760
2014-04-20 11:24:44.234940 osd.10 [ERR] 2.1f repair 0 missing, 1 inconsistent objects
2014-04-20 11:24:44.234957 osd.10 [ERR] 2.1f repair 1 errors, 1 fixed
2014-04-20 11:24:44.243471 osd.5 [ERR] 2.1f push 5508c91f/rb.0.2ba94.2ae8944a.000000006000/head//2 v 17772'235539 failed because local copy is 17775'235656

From OSD.10 log:

0> 2014-04-20 11:28:31.613466 7f88428f4700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f88428f4700 time 2014-04-20 11:28:31.566273
osd/ReplicatedPG.cc: 8669: FAILED assert(is_backfill_targets(peer))

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f]
 2: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3]
 3: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::
detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c]
 4: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94]
 5: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90]
 8: (()+0x8062) [0x7f885c997062]
 9: (clone()+0x6d) [0x7f885aebea3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---
2014-04-20 11:28:32.060148 7f88428f4700 -1 *** Caught signal (Aborted) **
 in thread 7f88428f4700

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (()+0x59e47f) [0x7f885d7e047f]
 2: (()+0xf880) [0x7f885c99e880]
 3: (gsignal()+0x39) [0x7f885ae0e3a9]
 4: (abort()+0x148) [0x7f885ae114c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f885b6fb5e5]
 6: (()+0x5e746) [0x7f885b6f9746]
 7: (()+0x5e773) [0x7f885b6f9773]
 8: (()+0x5e9b2) [0x7f885b6f99b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f885d8c0ee2]
 10: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f]
 11: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3]
 12: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::
detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c]
 13: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94]
 14: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90]
 17: (()+0x8062) [0x7f885c997062]
 18: (clone()+0x6d) [0x7f885aebea3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -41> 2014-04-20 11:28:31.714621 7f88478fe700  1 -- 192.168.0.2:6822/29335 <== osd.3 192.168.0.2:0/29928 644 ==== osd_ping(ping e17775 stamp 2014-04-20 11:28:31.714032) v2 ==== 47+0+0 (2170213909 0 0) 0x7f8864a2f6c0 con 0x7f88638b4840
   -40> 2014-04-20 11:28:31.714648 7f88478fe700  1 -- 192.168.0.2:6822/29335 --> 192.168.0.2:0/29928 -- osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.714032) v2 -- ?+0 0x7f88645c0fc0 con 0x7f88638b4840
   -39> 2014-04-20 11:28:31.714677 7f8848900700  1 -- 192.168.0.2:6823/29335 <== osd.3 192.168.0.2:0/29928 644 ==== osd_ping(ping e17775 stamp 2014-04-20 11:28:31.714032) v2 ==== 47+0+0 (2170213909 0 0) 0x7f8863e73180 con 0x7f88638b4dc0
   -38> 2014-04-20 11:28:31.714699 7f8848900700  1 -- 192.168.0.2:6823/29335 --> 192.168.0.2:0/29928 -- osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.714032) v2 -- ?+0 0x7f8863f11500 con 0x7f88638b4dc0
   -37> 2014-04-20 11:28:31.827090 7f88420f3700  5 osd.10 17775 heartbeat: osd_stat(66425 MB used, 166 GB avail, 232 GB total, peers [0,1,2,3,4,5,6,7,8]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5])
   -36> 2014-04-20 11:28:31.827114 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6823/25087 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88646f2700 con 0x7f8862d17600
   -35> 2014-04-20 11:28:31.827136 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6824/25087 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864ac0fc0 con 0x7f8864a6fe40
   -34> 2014-04-20 11:28:31.827149 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6813/26721 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864de8fc0 con 0x7f8863bc2b00
   -33> 2014-04-20 11:28:31.827161 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6816/26721 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88651d1180 con 0x7f88641da2c0
   -32> 2014-04-20 11:28:31.827171 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6804/13883 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864c89500 con 0x7f88636a4f20
   -31> 2014-04-20 11:28:31.827182 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6806/13883 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864753500 con 0x7f88636a51e0
   -30> 2014-04-20 11:28:31.827194 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6825/29928 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f886421e000 con 0x7f88638ae160
   -29> 2014-04-20 11:28:31.827206 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6826/29928 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864847500 con 0x7f8863afb4a0
   -28> 2014-04-20 11:28:31.827219 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6804/25696 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88647d2a80 con 0x7f8864a6fce0
   -27> 2014-04-20 11:28:31.827232 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6805/25696 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e10700 con 0x7f8864a6fb80
   -26> 2014-04-20 11:28:31.827245 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6808/28015 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88647d5880 con 0x7f886385db80
   -25> 2014-04-20 11:28:31.827257 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.250:6810/28015 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e0afc0 con 0x7f8862687ce0
   -24> 2014-04-20 11:28:31.827302 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6813/14254 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863aa0c40 con 0x7f886370e840
   -23> 2014-04-20 11:28:31.827314 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.204:6816/14254 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f88646f3c00 con 0x7f886370e6e0
   -22> 2014-04-20 11:28:31.827325 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6814/30829 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864178000 con 0x7f8862ff5ce0
   -21> 2014-04-20 11:28:31.827337 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6815/30829 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8863e94000 con 0x7f88637b4000
   -20> 2014-04-20 11:28:31.827348 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6819/31811 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8864512540 con 0x7f8863803b80
   -19> 2014-04-20 11:28:31.827359 7f88420f3700  1 -- 192.168.0.2:0/29335 --> 192.168.0.2:6820/31811 -- osd_ping(ping e17775 stamp 2014-04-20 11:28:31.827111) v2 -- ?+0 0x7f8865126a80 con 0x7f88641ae2c0
   -18> 2014-04-20 11:28:31.827392 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.3 192.168.0.2:6825/29928 632 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864ce9dc0 con 0x7f88638ae160
   -17> 2014-04-20 11:28:31.827433 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.3 192.168.0.2:6826/29928 632 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864512540 con 0x7f8863afb4a0
   -16> 2014-04-20 11:28:31.827517 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.1 192.168.0.250:6813/26721 399 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88645c1180 con 0x7f8863bc2b00
   -15> 2014-04-20 11:28:31.827540 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.8 192.168.0.2:6819/31811 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864be56c0 con 0x7f8863803b80
   -14> 2014-04-20 11:28:31.827564 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.8 192.168.0.2:6820/31811 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864a5c8c0 con 0x7f88641ae2c0
   -13> 2014-04-20 11:28:31.827593 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.0 192.168.0.250:6824/25087 556 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863e0b6c0 con 0x7f8864a6fe40
   -12> 2014-04-20 11:28:31.827614 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.0 192.168.0.250:6823/25087 556 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864728e00 con 0x7f8862d17600
   -11> 2014-04-20 11:28:31.827730 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.1 192.168.0.250:6816/26721 399 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864787180 con 0x7f88641da2c0
   -10> 2014-04-20 11:28:31.827750 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.7 192.168.0.2:6814/30829 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864f9f500 con 0x7f8862ff5ce0
    -9> 2014-04-20 11:28:31.827773 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.7 192.168.0.2:6815/30829 656 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863e0e1c0 con 0x7f88637b4000
    -8> 2014-04-20 11:28:31.827861 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.5 192.168.0.250:6808/28015 389 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864ac0540 con 0x7f886385db80
    -7> 2014-04-20 11:28:31.827885 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.5 192.168.0.250:6810/28015 389 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8863aa0000 con 0x7f8862687ce0
    -6> 2014-04-20 11:28:31.827900 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.6 192.168.0.204:6813/14254 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864598e00 con 0x7f886370e840
    -5> 2014-04-20 11:28:31.827914 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.4 192.168.0.250:6804/25696 440 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f886380ddc0 con 0x7f8864a6fce0
    -4> 2014-04-20 11:28:31.827929 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.4 192.168.0.250:6805/25696 440 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f8864728540 con 0x7f8864a6fb80
    -3> 2014-04-20 11:28:31.827952 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.6 192.168.0.204:6816/14254 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88645ecfc0 con 0x7f886370e6e0
    -2> 2014-04-20 11:28:31.828104 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.2 192.168.0.204:6806/13883 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88646f2540 con 0x7f88636a51e0
    -1> 2014-04-20 11:28:31.828128 7f8849101700  1 -- 192.168.0.2:0/29335 <== osd.2 192.168.0.204:6804/13883 700 ==== osd_ping(ping_reply e17775 stamp 2014-04-20 11:28:31.827111) v2 ==== 47+0+0 (1299242366 0 0) 0x7f88651d0000 con 0x7f88636a4f20
     0> 2014-04-20 11:28:32.060148 7f88428f4700 -1 *** Caught signal (Aborted) **
 in thread 7f88428f4700

 ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
 1: (()+0x59e47f) [0x7f885d7e047f]
 2: (()+0xf880) [0x7f885c99e880]
 3: (gsignal()+0x39) [0x7f885ae0e3a9]
 4: (abort()+0x148) [0x7f885ae114c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f885b6fb5e5]
 6: (()+0x5e746) [0x7f885b6f9746]
 7: (()+0x5e773) [0x7f885b6f9773]
 8: (()+0x5e9b2) [0x7f885b6f99b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f885d8c0ee2]
 10: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x7f885d629c1f]
 11: (ReplicatedPG::mark_all_unfound_lost(int)+0xac3) [0x7f885d650ff3]
 12: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::
detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xb5c) [0x7f885d63c12c]
 13: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x25a4) [0x7f885d4a6b94]
 14: (OSD::CommandWQ::_process(OSD::Command*)+0x47) [0x7f885d4ff777]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f885d8b1da1]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x7f885d8b2c90]
 17: (()+0x8062) [0x7f885c997062]
 18: (clone()+0x6d) [0x7f885aebea3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.10.log
--- end dump of recent events ---

At this point there were 1 unfound object. I issued

ceph pg 2.1f mark_unfound_lost revert

and apparently this fixed inconsistent PG (as I verified with "ceph pg scrub 2.1f" and "ceph pg deep-scrub 2.1f" which returned "2.1f scrub ok" and "2.1f deep-scrub ok" accordingly).
So there is some improvement but perhaps one crash is still due to be fixed...

Actions

Also available in: Atom PDF