Bug #38483

closed

FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*)

Added by Sage Weil about 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-02-26 03:04:15.748 7fefccfc0700 10 osd.4:7._attach_pg 1.7 0x55aec8689000
2019-02-26 03:04:15.748 7fefccfc0700 20 osd.4:7._wake_pg_slot _wake_pg_slot 1.7 to_process <> waiting <> waiting_peering {}
...
2019-02-26 03:04:15.749 7fefbf7a5700 20 osd.4 572 advance_pg 1.7 is merge target, sources are 1.f
2019-02-26 03:04:15.749 7fefbf7a5700  1 osd.4 572 advance_pg 1.f is merge source, target is 1.7
2019-02-26 03:04:15.750 7fefbf7a5700 10 osd.4 572 add_merge_waiter added merge_waiter 1.f for 1.7, have 1/1
...
2019-02-26 03:04:15.750 7fefbf7a5700 10 osd.4 pg_epoch: 547 pg[1.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [6,3,7] r=-1 lpr=547 crt=0'0 unknown mbc={}] merge_from from {1.f=0x55aec9916000} split_bits 3
2019-02-26 03:04:15.750 7fefbf7a5700 10 osd.4 pg_epoch: 547 pg[1.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [6,3,7] r=-1 lpr=547 crt=0'0 unknown mbc={}] merge_from target incomplete
2019-02-26 03:04:15.750 7fefbf7a5700 10 osd.4 pg_epoch: 547 pg[1.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [6,3,7] r=-1 lpr=547 crt=0'0 unknown mbc={}] merge_from taking source's past_intervals
...
2019-02-26 03:04:15.751 7fefbf7a5700 10 osd.4 572 split_pgs splitting pg[1.7( empty lb MIN (NIBBLEWISE) local-lis/les=0/547 n=0 ec=33/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] into 1.f
2019-02-26 03:04:15.751 7fefbf7a5700 10 osd.4 pg_epoch: 560 pg[1.7( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=33/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] release_backoffs [MIN,MAX)
2019-02-26 03:04:15.751 7fefbf7a5700 10 osd.4 572 split_pgs splitting pg[1.7( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=33/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] into 1.17
...
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4 572 _finish_splits pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=0 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}]
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=0 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] handle_initialize
2019-02-26 03:04:15.752 7fefbf7a5700  5 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=0 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] exit Initial 0.001060 0 0.000000
2019-02-26 03:04:15.752 7fefbf7a5700  5 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=0 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] enter Reset
2019-02-26 03:04:15.752 7fefbf7a5700 20 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=0 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] set_last_peering_reset 560
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] Clearing blocked outgoing recovery messages
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] Not blocking outgoing recovery messages
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4 pg_epoch: 560 pg[1.17( empty lb MIN (bitwise) local-lis/les=0/547 n=0 ec=560/14 lis/c 374/374 les/c/f 547/547/0 536/560/374) [6,3,7] r=-1 lpr=560 pi=[374,560)/2 crt=0'0 unknown NOTIFY mbc={}] null
2019-02-26 03:04:15.752 7fefbf7a5700 15 osd.4 572 enqueue_peering_evt 1.17 epoch_sent: 560 epoch_requested: 560 NullEvt
2019-02-26 03:04:15.752 7fefbf7a5700 20 osd.4 op_wq(7) _enqueue OpQueueItem(1.17 PGPeeringEvent(epoch_sent: 560 epoch_requested: 560 NullEvt) prio 255 cost 10 e560)
2019-02-26 03:04:15.752 7fefbf7a5700 10 osd.4:7.register_and_wake_split_child 1.17 0x55aec8234000
     0> 2019-02-26 03:04:15.756 7fefbf7a5700 -1 *** Caught signal (Aborted) **
 in thread 7fefbf7a5700 thread_name:tp_osd_tp

 ceph version 14.1.0-125-g8b98d22 (8b98d22533def4c768359c2efe9496780b036d22) nautilus (dev)
 1: (()+0xf5d0) [0x7fefe558e5d0]
 2: (gsignal()+0x37) [0x7fefe4385207]
 3: (abort()+0x148) [0x7fefe43868f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55aebaf52a1b]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55aebaf52b9a]
 6: (OSDShard::register_and_wake_split_child(PG*)+0x7e3) [0x55aebb0bf503]
 7: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x121) [0x55aebb0bf671]
 8: (Context::complete(int)+0x9) [0x55aebb0c6349]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x67c) [0x55aebb0abd7c]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55aebb69f003]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55aebb6a20a0]

/a/sage-2019-02-26_00:43:29-rados-wip-sage-testing-2019-02-25-1642-distro-basic-smithi/3638207

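It looks like the merge -> split sequence doesn't prime the merge target: in the log above, 1.7 merges in 1.f at epoch 547 and is then split into 1.f and 1.17 at epoch 560, but no pg_slot exists for 1.17 when register_and_wake_split_child runs.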
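The suspected failure mode can be sketched with a toy model. This is not the Ceph code: ToyShard, Slot, and replay_merge_then_split are hypothetical names, and the only behavior assumed is what the assertion text implies, namely that register_and_wake_split_child expects a slot for the split child to already exist in pg_slots. Where the real code asserts, the sketch returns false.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for one OSD shard (hypothetical; not the real OSDShard).
struct ToyShard {
  struct Slot {
    bool waiting_for_split = false;  // slot was primed for a future split child
    bool has_pg = false;             // a PG is attached to this slot
  };
  std::map<std::string, Slot> pg_slots;

  // Create ("prime") slots for PGs that are expected to appear via a split.
  void prime_splits(const std::set<std::string>& children) {
    for (const auto& c : children) {
      pg_slots[c].waiting_for_split = true;
    }
  }

  // Returns false in the situation where the real code would fail
  // ceph_assert(p != pg_slots.end()).
  bool register_and_wake_split_child(const std::string& pgid) {
    auto p = pg_slots.find(pgid);
    if (p == pg_slots.end()) {
      return false;
    }
    p->second.waiting_for_split = false;
    p->second.has_pg = true;
    return true;
  }
};

// Replays the merge -> split order seen in the log. The prime_on_merge
// switch is an assumption made for illustration, not the actual fix.
inline bool replay_merge_then_split(bool prime_on_merge) {
  ToyShard shard;
  shard.pg_slots["1.7"].has_pg = true;   // _attach_pg 1.7
  if (prime_on_merge) {
    shard.prime_splits({"1.f", "1.17"}); // children of the later split
  }
  // split_pgs ... into 1.17, followed by _finish_splits
  return shard.register_and_wake_split_child("1.17");
}
```

In this model, replay_merge_then_split(false) returns false, which corresponds to the assertion failure in the backtrace, while replay_merge_then_split(true) returns true because the child slot already exists when the split finishes.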


Related issues 2 (1 open, 1 closed)

Is duplicate of RADOS - Bug #36304: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) (Need More Info)

Copied to RADOS - Backport #41712: nautilus: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) (Resolved, Prashant D)
Actions #1

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee set to Sage Weil
Actions #3

Updated by Neha Ojha about 5 years ago

/a/nojha-2019-04-25_05:43:35-rados-wip-39441-distro-basic-smithi/3892156/

Actions #4

Updated by Greg Farnum over 4 years ago

  • Is duplicate of Bug #36304: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) added
Actions #5

Updated by Josh Durgin over 4 years ago

  • Priority changed from Urgent to Normal

Sage says the PR is buggy, and this case is very hard to hit, so moving to normal priority.

Actions #6

Updated by xie xingguo over 4 years ago

  • Status changed from In Progress to Pending Backport
  • Backport set to nautilus
Actions #7

Updated by xie xingguo over 4 years ago

  • Pull request ID set to 30018
Actions #8

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #41712: nautilus: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) added
Actions #9

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
