Project

General

Profile

Actions

Bug #13111

open

replicatedPG:the assert occurs in the fuction ReplicatedPG::on_local_recover.

Added by lu shi over 8 years ago. Updated over 4 years ago.

Status:
Fix Under Review
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

 -7> 2015-08-28 17:56:08.179961 7fa382c26700  5 -- op tracker -- seq: 387, time: 2015-08-28 17:56:08.179961, event: started, op: pg_notify(7.10d(5) epoch 3032)
    -6> 2015-08-28 17:56:08.194586 7fa36f6ed700  1 -- 192.168.17.92:6800/33615 <== osd.8 192.168.17.91:6801/27312 33 ==== MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 8390517+0+0 (2584546233 0 0) 0x1147b800 con 0x4615e40
    -5> 2015-08-28 17:56:08.194648 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.166005, event: header_read, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -4> 2015-08-28 17:56:08.194670 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.166008, event: throttled, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -3> 2015-08-28 17:56:08.194688 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.194575, event: all_read, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -2> 2015-08-28 17:56:08.194705 7fa36f6ed700  5 -- op tracker -- seq: 388, time: 0.000000, event: dispatched, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
    -1> 2015-08-28 17:56:08.194765 7fa379c14700  5 -- op tracker -- seq: 388, time: 2015-08-28 17:56:08.194765, event: reached_pg, op: MOSDPGPush(7.1f0 3032 [PushOp(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7, version: 1769'22828, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(eced69f0/rbd_data.85642ae8944a.000000000000085c/head//7@1769'22828, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7, version: 2725'22829, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(c84ea5f0/rbd_data.85642ae8944a.0000000000001600/head//7@2725'22829, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
     0> 2015-08-28 17:56:08.200992 7fa379c14700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fa379c14700 time 2015-08-28 17:56:08.194843
osd/ReplicatedPG.cc: 211: FAILED assert(is_primary())

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
2: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0x4b8) [0x854ba8]
3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp&, PushReplyOp*, ObjectStore::Transaction*)+0x370) [0xa0c470]
4: (ReplicatedBackend::_do_push(std::tr1::shared_ptr<OpRequest>)+0x117) [0xa0c7f7]
5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x40f) [0xa1767f]
6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x840f57]
7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x6a0f85]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a14d1]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xbb2c85]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb4da0]
11: (()+0x8182) [0x7fa39654e182]
12: (clone()+0x6d) [0x7fa394ab900d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #1

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil over 8 years ago

  • Description updated (diff)
  • Status changed from New to Need More Info

Is this the only OSD that crashed? Have you seen it more than once?

Actions #3

Updated by lu shi over 8 years ago

yes,only one osd crashed.i see it just once.But there is someone who joined ceph-devel encountered this problem

Actions #4

Updated by Jianhui Yuan over 8 years ago

In my env, Many OSD crash because of it.
And I use aync msg. It seem like wrong send push_op to replicated.

-1842> 2015-09-25 12:06:20.467166 7f1463a03700  1 -- 172.16.50.7:6829/30641 >> 172.16.50.10:6821/32703 conn(0x1eb93000 sd=263 :-1 s=STATE_OPEN pgs=455 cs=1 l=0). == tx == 0x1efbda00 MOSDPGPushReply(5.55d 94505 [PushReplyOp(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5)]) v2

 -1290> 2015-09-25 12:06:20.474333 7f146ae48700  1 -- 172.16.50.7:6829/30641 >> 172.16.50.10:6821/32703 conn(0x1eb93000 sd=263 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=455 cs=1 l=0). == rx == 0x1f665800 MOSDPGPush(5.55d 94505 [PushOp(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5, version: 90491'484667, data_included: [8388608~3014656], data_size: 3014656, omap_header_size: 0, omap_entries_size: 0, attrset_size: 0, recovery_info: ObjectRecoveryInfo(85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5@90491'484667, copy_subset: [0~11403264], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:11403264, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(!first, data_recovered_to:8388608, data_complete:false, omap_recovered_to:, omap_complete:true))]) v2

  -1204> 2015-09-25 12:06:20.474705 7f1459db6700 10 osd.10 pg_epoch: 94505 pg[5.55d( v 94488'484671 lc 94488'484670 (75232'481654,94488'484671] local-les=94505 n=130 ec=288 les/c 94505/94464 94504/94504/93603) [38,10] r=1 lpr=94504 pi=93491-94503/25 luod=0'0 crt=94488'484671 lcod 0'0 active m=1] on_local_recover: 85ea9d5d/rbd_data.8f19f2ae8944a.000000000002e950/head//5
Actions #5

Updated by Sage Weil over 8 years ago

  • Priority changed from Urgent to High

this sounds like a side-effect of a async msgr bug, but without a full log it is hard to say.

Actions #6

Updated by lu shi over 8 years ago

https://github.com/s09816/kk/blob/master/ceph-osd.17.log.5
I put the full log info to this address.

Actions #7

Updated by Sage Weil over 8 years ago

lu shi wrote:

https://github.com/s09816/kk/blob/master/ceph-osd.17.log.5
I put the full log info to this address.

need to reproduce with debug osd = 20, debug ms = 1

Actions #8

Updated by shawn chen over 8 years ago

actually, it's not too hard to reproduce this problem ,the reason is object info save on the file system is inconsistent with pglog, so the primary may send low version object to replicas.

see https://github.com/ceph/ceph/pull/6345

Actions #9

Updated by Loïc Dachary over 8 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Zhiqiang Wang
Actions #10

Updated by Kefu Chai over 8 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee deleted (Zhiqiang Wang)
Actions #11

Updated by Sage Weil over 8 years ago

  • Priority changed from High to Urgent
Actions #12

Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to 17
Actions #13

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to Low
Actions #14

Updated by Sage Weil about 7 years ago

Is this still a problem?

Actions #15

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS

I don't really get how the AsyncMessenger could have caused this issue...?

Actions #16

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 17 to Fix Under Review
Actions

Also available in: Atom PDF