Project

General

Profile

Bug #13111

replicatedPG:the assert occurs in the fuction ReplicatedPG::on_local_recover.

Added by lu shi about 5 years ago. Updated 12 months ago.

Status:
Fix Under Review
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

 -7> 2015-08-28 17:56:08.179961 7fa382c26700  5 -- op tracker -- seq: 387, time: 2015-08-28 17:56:08.179961, event: started, op: pg_notify(7.10d(5) epoch 3032)
    -6> 2015-08-28 17:56:08.194586 7fa36f6ed700  1 -- 192.168.17.92:6800/33615 <== osd.8 192.168.17.91:6801/27312 33 ==== MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 8390517+0+0 (2584546233 0 0) 0x1147b800 con 0x4615e40
    -5> 2015-08-28 17:56:08.194648 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.166005, event: header_read, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -4> 2015-08-28 17:56:08.194670 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.166008, event: throttled, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -3> 2015-08-28 17:56:08.194688 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.194575, event: all_read, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -2> 2015-08-28 17:56:08.194705 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 0.000000, event: dispatched, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -1> 2015-08-28 17:56:08.194765 7fa379c14700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.194765, event: reached_pg, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
     0> 2015-08-28 17:56:08.200992 7fa379c14700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fa379c14700 time 2015-08-28 17:56:08.194843
osd/ReplicatedPG.cc: 211: FAILED assert(is_primary())

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
2: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0x4b8) [0x854ba8]
3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp&, PushReplyOp*, ObjectStore::Transaction*)+0x370) [0xa0c470]
4: (ReplicatedBackend::_do_push(std::tr1::shared_ptr<OpRequest>)+0x117) [0xa0c7f7]
5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x40f) [0xa1767f]
6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x840f57]
7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x6a0f85]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a14d1]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xbb2c85]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb4da0]
11: (()+0x8182) [0x7fa39654e182]
12: (clone()+0x6d) [0x7fa394ab900d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by Samuel Just about 5 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Sage Weil about 5 years ago

  • Description updated (diff)
  • Status changed from New to Need More Info

Is this the only OSD that crashed? Have you seen it more than once?

#3 Updated by lu shi about 5 years ago

yes,only one osd crashed.i see it just once.But there is someone who joined ceph-devel encountered this problem

#4 Updated by Jianhui Yuan about 5 years ago

In my env, Many OSD crash because of it.
And I use aync msg. It seem like wrong send push_op to replicated.

-1842> 2015-09-25 12:06:20.467166 7f1463a03700  1 -- 172.16.50.7:6829/30641 >> 172.16.50.10:6821/32703 conn(0x1eb93000 sd=263 :-1 s=STATE_OPEN pgs=455 cs=1 l=0). == tx == 0x1efbda00 MOSDPGPushReply(5.55d 94505 [PushReplyOp(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5)]) v2

 -1290> 2015-09-25 12:06:20.474333 7f146ae48700  1 -- 172.16.50.7:6829/30641 >> 172.16.50.10:6821/32703 conn(0x1eb93000 sd=263 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=455 cs=1 l=0). == rx == 0x1f665800 MOSDPGPush(5.55d 94505 [PushOp(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5, version: 90491'484667, data_included: [8388608~3014656], data_size: 3014656, omap_header_size: 0, omap_entries_size: 0, attrset_size: 0, recovery_info: ObjectRecoveryInfo(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5@90491'484667, copy_subset: [0~11403264], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:11403264, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(!first, data_recovered_to:8388608, data_complete:false, omap_recovered_to:, omap_complete:true))]) v2

  -1204> 2015-09-25 12:06:20.474705 7f1459db6700 10 osd.10 pg_epoch: 94505 pg[5.55d( v 94488'484671 lc 94488'484670 (75232'481654,94488'484671] local-les=94505 n=130 ec=288 les/c 94505/94464 94504/94504/93603) [38,10] r=1 lpr=94504 pi=93491-94503/25 luod=0'0 crt=94488'484671 lcod 0'0 active m=1] on_local_recover: 85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5

#5 Updated by Sage Weil about 5 years ago

  • Priority changed from Urgent to High

this sounds like a side-effect of a async msgr bug, but without a full log it is hard to say.

#6 Updated by lu shi about 5 years ago

https://github.com/s09816/kk/blob/master/ceph-osd.17.log.5
I put the full log info to this address.

#7 Updated by Sage Weil about 5 years ago

lu shi wrote:

https://github.com/s09816/kk/blob/master/ceph-osd.17.log.5
I put the full log info to this address.

need to reproduce with debug osd = 20, debug ms = 1

#8 Updated by shawn chen about 5 years ago

actually, it's not too hard to reproduce this problem ,the reason is object info save on the file system is inconsistent with pglog, so the primary may send low version object to replicas.

see https://github.com/ceph/ceph/pull/6345

#9 Updated by Loic Dachary about 5 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Zhiqiang Wang

#10 Updated by Kefu Chai about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee deleted (Zhiqiang Wang)

#11 Updated by Sage Weil about 5 years ago

  • Priority changed from High to Urgent

#12 Updated by Sage Weil almost 5 years ago

  • Status changed from Fix Under Review to 17

#13 Updated by Samuel Just about 4 years ago

  • Priority changed from Urgent to Low

#14 Updated by Sage Weil over 3 years ago

Is this still a problem?

#15 Updated by Greg Farnum over 3 years ago

  • Project changed from Ceph to RADOS

I don't really get how the AsyncMessenger could have caused this issue...?

#16 Updated by Patrick Donnelly 12 months ago

  • Status changed from 17 to Fix Under Review

Also available in: Atom PDF