Bug #8008

osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid)) during pg repair

Added by Dmitry Smirnov about 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Here is the log from the crashed OSD:

    -2> 2014-04-07 22:01:14.289703 7fc4a1488700  5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289703, event: waiting_for_osdmap, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -1> 2014-04-07 22:01:14.289754 7fc49b47c700  5 -- op tracker -- , seq: 80287, time: 2014-04-07 22:01:14.289753, event: reached_pg, request: MOSDPGPush(2.1a 14605 [PushOp(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2, version: 13702'40924, data_included: [0~3825664], data_size: 3825664, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(8221b31a/rb.0.6761c.238e1f29.00000000a05a/head//2@13702'40924, copy_subset: [0~3825664], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:3825664, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
     0> 2014-04-07 22:01:14.294263 7fc49b47c700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fc49b47c700 time 2014-04-07 22:01:14.289856
osd/ReplicatedPG.cc: 258: FAILED assert(missing_loc.needs_recovery(hoid))

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 11: (()+0x8062) [0x7fc4b5a7b062]
 12: (clone()+0x6d) [0x7fc4b41a5a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) **
 in thread 7fc49b47c700

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
 2: (()+0xf880) [0x7fc4b5a82880]
 3: (gsignal()+0x39) [0x7fc4b40f53a9]
 4: (abort()+0x148) [0x7fc4b40f84c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
 6: (()+0x63996) [0x7fc4b49e5996]
 7: (()+0x639c3) [0x7fc4b49e59c3]
 8: (()+0x63bee) [0x7fc4b49e5bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 20: (()+0x8062) [0x7fc4b5a7b062]
 21: (clone()+0x6d) [0x7fc4b41a5a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -4> 2014-04-07 22:01:14.341932 7fc4a248a700  1 -- 192.168.0.250:6805/14656 <== mon.4 192.168.0.250:6789/0 548 ==== log(last 52) v1 ==== 24+0+0 (2322174735 0 0) 0x7fc4c85e08c0 con 0x7fc4dff14160
    -3> 2014-04-07 22:01:14.341955 7fc4a248a700 10 handle_log_ack log(last 52) v1
    -2> 2014-04-07 22:01:14.341957 7fc4a248a700 10  logged 2014-04-07 22:01:14.119753 osd.5 192.168.0.250:6805/14656 51 : [ERR] 2.1a repair 0 missing, 1 inconsistent objects
    -1> 2014-04-07 22:01:14.341963 7fc4a248a700 10  logged 2014-04-07 22:01:14.119771 osd.5 192.168.0.250:6805/14656 52 : [ERR] 2.1a repair 3 errors, 3 fixed
     0> 2014-04-07 22:01:14.365731 7fc49b47c700 -1 *** Caught signal (Aborted) **
 in thread 7fc49b47c700

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (()+0x59ba2f) [0x7fc4b6ac5a2f]
 2: (()+0xf880) [0x7fc4b5a82880]
 3: (gsignal()+0x39) [0x7fc4b40f53a9]
 4: (abort()+0x148) [0x7fc4b40f84c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc4b49e789d]
 6: (()+0x63996) [0x7fc4b49e5996]
 7: (()+0x639c3) [0x7fc4b49e59c3]
 8: (()+0x63bee) [0x7fc4b49e5bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fc4b6ba3692]
 10: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xbd1) [0x7fc4b69b7f01]
 11: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x579) [0x7fc4b69fc439]
 12: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x2da) [0x7fc4b69fcbea]
 13: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x3be) [0x7fc4b6a90b8e]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fc4b699fb3b]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x371) [0x7fc4b675a141]
 16: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1a5) [0x7fc4b6773fb5]
 17: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fc4b67b69bc]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fc4b6b944d1]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x7fc4b6b953c0]
 20: (()+0x8062) [0x7fc4b5a7b062]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---

The crash appears to be related to the repair of an inconsistent PG.
I issued the commands 'ceph pg repair 2.1c' and 'ceph osd repair 5' shortly before the crash of osd.5.
