Bug #36304

FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*)

Added by Neha Ojha over 2 years ago. Updated about 2 months ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-10-03T10:16:57.845 INFO:tasks.ceph.osd.7.smithi149.stderr:/build/ceph-14.0.0-3811-gb36adc9/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f9e1e06e700 time 2018-10-03 10:16:57.843713
2018-10-03T10:16:57.845 INFO:tasks.ceph.osd.7.smithi149.stderr:/build/ceph-14.0.0-3811-gb36adc9/src/osd/OSD.cc: 10109: FAILED ceph_assert(p != pg_slots.end())
2018-10-03T10:16:57.903 INFO:teuthology.orchestra.run.smithi149:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_blocked_ops'
2018-10-03T10:16:57.936 INFO:tasks.ceph.osd.3.smithi060.stderr:2018-10-03 10:16:57.931 7f4d82e81700 -1 received  signal: Hangup from /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  (PID: 11231) UID: 0
2018-10-03T10:16:58.036 INFO:tasks.ceph.osd.6.smithi149.stderr:2018-10-03 10:16:58.032 7f186a51a700 -1 received  signal: Hangup from /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 6  (PID: 11089) UID: 0
2018-10-03T10:16:58.044 INFO:tasks.ceph.osd.7.smithi149.stderr: ceph version 14.0.0-3811-gb36adc9 (b36adc93ab29e9108ef784e767962f948a1e1b9d) nautilus (dev)
2018-10-03T10:16:58.044 INFO:tasks.ceph.osd.7.smithi149.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x561c0314b875]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x561c0314ba52]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 3: (OSDShard::register_and_wake_split_child(PG*)+0x805) [0x561c03297605]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x13a) [0x561c0329779a]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 5: (Context::complete(int)+0x9) [0x561c0329fcb9]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x76c) [0x561c03283a1c]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x496) [0x561c038c6f66]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561c038ce720]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 9: (()+0x76db) [0x7f9e461b96db]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 10: (clone()+0x3f) [0x7f9e44f5488f]

/a/nojha-2018-10-02_20:12:26-rados-master-distro-basic-smithi/3094858/


Related issues

Duplicated by RADOS - Bug #38483: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) Resolved 02/26/2019

History

#1 Updated by Sage Weil over 2 years ago

  • Status changed from New to Can't reproduce

I'm guessing this was fixed by 450f337d6fd048c8c95a0ec0dec0d97f5474922e

#2 Updated by Sage Weil about 2 years ago

  • Status changed from Can't reproduce to 12
  • Priority changed from Normal to Urgent

Reproduced, but without logs:

2019-02-16T22:22:14.214 INFO:tasks.ceph.osd.2.smithi107.stderr:/build/ceph-14.0.1-3819-g2bd9523/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f7e8557f700 time 2019-02-16 22:22:14.218197
2019-02-16T22:22:14.214 INFO:tasks.ceph.osd.2.smithi107.stderr:/build/ceph-14.0.1-3819-g2bd9523/src/osd/OSD.cc: 10561: FAILED ceph_assert(p != pg_slots.end())
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: ceph version 14.0.1-3819-g2bd9523 (2bd9523756002a98bc3cea0c687fc99c4b8b988a) nautilus (dev)
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x84d12c]
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x84d307]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 3: (OSDShard::register_and_wake_split_child(PG*)+0x7f3) [0x9af683]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x124) [0x9af7f4]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 5: (Context::complete(int)+0x9) [0x9b6889]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6a4) [0x99c064]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfb775c]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfba910]
2019-02-16T22:22:14.219 INFO:tasks.ceph.osd.2.smithi107.stderr: 9: (()+0x76ba) [0x7f7ea4c426ba]
2019-02-16T22:22:14.219 INFO:tasks.ceph.osd.2.smithi107.stderr: 10: (clone()+0x6d) [0x7f7ea424941d]

/a/sage-2019-02-16_18:46:49-rados-wip-sage-testing-2019-02-16-0946-distro-basic-smithi/3601996

#3 Updated by Neha Ojha about 2 years ago

We have logs here: /a/nojha-2019-02-11_18:58:45-rados:thrash-erasure-code-wip-test-revert-distro-basic-smithi/3575122

#4 Updated by Sage Weil about 2 years ago

Aha: during startup we load pg 2.fs1, but fail to prime it from init():

2019-02-11 19:51:05.982 7f4735a22c00  0 osd.1 230 load_pgs opened 53 pgs
...
2019-02-11 19:51:05.983 7f4735a22c00 20 osd.1 230 identify_splits_and_merges 1.0 e199 to e230 pg_nums {14=8,28=18,33=28,58=38,160=48,190=58,198=68,207=78}
...
2019-02-11 19:51:05.983 7f4735a22c00 20 osd.1 230 identify_splits_and_merges 3.1fs1 e229 to e230 pg_nums {151=16,167=26,216=36,223=46,230=56}

2.fs1 isn't included there: only 51 pgs are mentioned, even though we loaded 53 of them. We did prime splits for 2 other PGs, though, and those insertions modified pg_slots and invalidated the iterator we were walking, so 2.fs1 was never primed and the later register_and_wake_split_child lookup asserts.

see /a/nojha-2019-02-11_18:58:45-rados:thrash-erasure-code-wip-test-revert-distro-basic-smithi/3575122 osd.1
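
For illustration, a minimal sketch of the hazard described above (hypothetical names and container contents; this is not the Ceph source): inserting into a std::unordered_map while iterating it can trigger a rehash, which invalidates every live iterator, so later entries may be skipped or the loop runs into undefined behaviour.

  // Illustrative sketch only -- hypothetical names, not Ceph code.
  #include <memory>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct Slot {};                                   // stand-in for a pg slot

  using SlotMap = std::unordered_map<std::string, std::unique_ptr<Slot>>;

  // BROKEN: primes split children by inserting into the map being iterated.
  // An emplace that rehashes invalidates `it`, so remaining parents may never
  // be visited and their children never get primed (the 51-of-53 symptom).
  void prime_splits_unsafe(SlotMap& pg_slots) {
    for (auto it = pg_slots.begin(); it != pg_slots.end(); ++it) {
      if (it->first.rfind("child.", 0) == 0)
        continue;                                   // don't re-split children
      pg_slots.emplace("child." + it->first, std::make_unique<Slot>());
    }
  }

  // SAFER: snapshot the keys first, then mutate the map afterwards.
  void prime_splits_safe(SlotMap& pg_slots) {
    std::vector<std::string> parents;
    parents.reserve(pg_slots.size());
    for (const auto& kv : pg_slots)
      parents.push_back(kv.first);
    for (const auto& p : parents)
      pg_slots.emplace("child." + p, std::make_unique<Slot>());
  }

  int main() {
    SlotMap pg_slots;
    for (int i = 0; i < 53; ++i)
      pg_slots.emplace("2." + std::to_string(i), std::make_unique<Slot>());
    prime_splits_safe(pg_slots);                    // all 53 parents get a child
  }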

#5 Updated by Sage Weil about 2 years ago

  • Status changed from 12 to Fix Under Review

#6 Updated by Sage Weil about 2 years ago

  • Status changed from Fix Under Review to Resolved

#7 Updated by Kefu Chai almost 2 years ago

  • Status changed from Resolved to New

/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-3600-ge2c05a7/rpm/el7/BUILD/ceph-15.0.0-3600-ge2c05a7/src/osd/OSD.cc: 10451: FAILED ceph_assert(p != pg_slots.end())

 ceph version 15.0.0-3600-ge2c05a7 (e2c05a7110b8a24787480b43d6293549ed2a42f1) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x560085e6afb1]
 2: (()+0x4ee179) [0x560085e6b179]
 3: (OSDShard::register_and_wake_split_child(PG*)+0x7ad) [0x560085f8190d]
 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x1f3) [0x560085f81b63]
 5: (Context::complete(int)+0x9) [0x560085f8c549]
 6: (void finish_contexts<std::list<Context*, std::allocator<Context*> > >(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x7d) [0x5600863032bd]
 7: (C_ContextsBase<Context, Context, std::list<Context*, std::allocator<Context*> > >::complete(int)+0x29) [0x560086303529]
 8: (Finisher::finisher_thread_entry()+0x19d) [0x5600864ed20d]
 9: (()+0x7dd5) [0x7fbe4154edd5]
 10: (clone()+0x6d) [0x7fbe4041502d]

/a/kchai-2019-08-08_08:04:33-rados-wip-29537-kefu-distro-basic-smithi/4198166

#8 Updated by Greg Farnum over 1 year ago

  • Priority changed from Urgent to Normal

We can bump the priority back up if this reappears.

#9 Updated by Greg Farnum over 1 year ago

  • Duplicated by Bug #38483: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) added

#10 Updated by Brad Hubbard about 1 year ago

  • Priority changed from Normal to Urgent

/a/teuthology-2020-04-26_07:01:02-rados-master-distro-basic-smithi/4986119

#11 Updated by Neha Ojha 8 months ago

  • Priority changed from Urgent to Normal

Haven't seen this in a while.

#12 Updated by Neha Ojha 6 months ago

/a/ksirivad-2020-11-16_07:16:50-rados-wip-mgr-progress-turn-off-option-distro-basic-smithi/5630402 - no logs

#13 Updated by Neha Ojha 4 months ago

/a/teuthology-2021-01-23_07:01:02-rados-master-distro-basic-gibba/5819503

#14 Updated by Neha Ojha 2 months ago

  • Priority changed from Normal to High
  • Backport set to pacific, octopus, nautilus

/a/yuriw-2021-03-08_21:03:18-rados-wip-yuri5-testing-2021-03-08-1049-pacific-distro-basic-smithi/5947439

#15 Updated by Neha Ojha about 2 months ago

/a/yuriw-2021-03-19_00:00:55-rados-wip-yuri8-testing-2021-03-18-1502-pacific-distro-basic-smithi/5978982

#16 Updated by Neha Ojha about 2 months ago

Relevant osd.3 logs from yuriw-2021-03-19_00:00:55-rados-wip-yuri8-testing-2021-03-18-1502-pacific-distro-basic-smithi/5978982:

2021-03-19T16:15:42.602+0000 7f1e7d4ea700 10 osd.3 452 split_pgs splitting pg[1.18( empty local-lis/les=379/380 n=0 ec=134/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [4] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] into 1.58
2021-03-19T16:15:42.602+0000 7f1e7d4ea700 10 osd.3 452 _make_pg 1.58
2021-03-19T16:15:42.602+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58(unlocked)] enter Initial
...
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 453 _finish_splits pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}]
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] handle_initialize
2021-03-19T16:15:42.609+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Initial 0.006560 0 0.000000
2021-03-19T16:15:42.609+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Reset
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 20 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] set_last_peering_reset 452
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] Clearing blocked outgoing recovery messages
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] Not blocking outgoing recovery messages
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] null
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 15 osd.3 453 enqueue_peering_evt 1.58 epoch_sent: 452 epoch_requested: 452 NullEvt
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 20 osd.3 op_wq(0) _enqueue OpSchedulerItem(1.58 PGPeeringEvent(epoch_sent: 452 epoch_requested: 452 NullEvt) prio 255 cost 10 e452)
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3:0.register_and_wake_split_child 1.58 0x561e34601000
...
2021-03-19T16:15:42.611+0000 7f1e7d4ea700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-945-g14d80dbf/rpm/el8/BUILD/ceph-16.1.0-945-g14d80dbf/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f1e7d4ea700 time 2021-03-19T16:15:42.610408+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-945-g14d80dbf/rpm/el8/BUILD/ceph-16.1.0-945-g14d80dbf/src/osd/OSD.cc: 10551: FAILED ceph_assert(p != pg_slots.end())

 ceph version 16.1.0-945-g14d80dbf (14d80dbf937b38fba622cec4b41998d0bd128816) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x561e1b13475a]
 2: ceph-osd(+0x568974) [0x561e1b134974]
 3: (OSDShard::register_and_wake_split_child(PG*)+0x810) [0x561e1b26f670]
 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x2ad) [0x561e1b26f97d]
 5: (Context::complete(int)+0xd) [0x561e1b273e0d]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xadc) [0x561e1b258d7c]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x561e1b8bf764]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x561e1b8c2404]
 9: /lib64/libpthread.so.0(+0x82de) [0x7f1ea1b2c2de]
 10: clone()
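
For context, a minimal sketch of the invariant behind the assertion (hypothetical names and simplified types; not the actual Ceph source): the shard is expected to have primed a slot for the split child before _finish_splits registers the new PG, and register_and_wake_split_child asserts that the slot is present.

  // Hypothetical sketch of the invariant the assert enforces -- not Ceph code.
  #include <cassert>
  #include <memory>
  #include <string>
  #include <unordered_map>

  struct Slot { bool waiting_for_split = true; };

  struct Shard {
    std::unordered_map<std::string, std::unique_ptr<Slot>> pg_slots;

    // Expected to run first, while consuming the map epoch that splits the PG.
    void prime_split_child(const std::string& child) {
      pg_slots.emplace(child, std::make_unique<Slot>());
    }

    // Runs later, once the child PG object exists (cf. _finish_splits above).
    void register_and_wake_split_child(const std::string& child) {
      auto p = pg_slots.find(child);
      assert(p != pg_slots.end());            // the check that fires in this bug
      p->second->waiting_for_split = false;   // let queued work for the child run
    }
  };

  int main() {
    Shard shard;
    shard.prime_split_child("1.58");               // correct order: prime first,
    shard.register_and_wake_split_child("1.58");   // then register the child
    // If priming was skipped, as in the osd.3 log above, the assert fires.
  }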

#17 Updated by Neha Ojha about 2 months ago

/a/yuriw-2021-03-25_20:03:40-rados-wip-yuri8-testing-2021-03-25-1042-pacific-distro-basic-smithi/5999016
