Bug #36304


FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*)

Added by Neha Ojha over 5 years ago. Updated about 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

05e5f599a62a5dc72d411b075e9b702c2e45e990278d2386470dd7ef6f79f1c5
125b31f57af72bea60fa2db382925da1730a8d7118e9d827778f9a49d61ca878
233900602a61533bdb486cd3f9eae546d0db4fb5715a82a0d15c0244f27dcec2
27c46f755bca8a1d873d66a30f93bf7041f94edab812f839eb1c667eba2cd751
31308372950603c851d9add002704ded02409b08d017515602f351996217339a
32c03188e76c2a719a5b26baf49b8b560a9b078c24accea9347c2a9f89737d84
340daa8e343eeb8355f547a77a7d822edf68fea54af5d0d11f2e793e5028c8be
3c4ff391ab4df33f8c1cfc09464c69c22b40b8e713e914712508b8baaf001aa3
46c568174021ce61296fa153a30e09ae3c09518f07dbd0145a72e74e853edcb0
4f18594f29259021881f2dd23a80a7ca25b7c1e311e2ab13be4ab53c1fb00041
86176dad44ae51d3e7de7eac892f695acedb065563b1a18490481db3635c017a
8a5bb745fdc27734479cb08cf5ae61cd543d5c876a3ad18c959d648deb34cfa9
b34e9d76dd14e64b9a2fe50f844fd40ec4279a5d92b937e331cd3cfa8a4f7d41
df9d085480e577805f39e644c28785152d63e0a6fd2d436b54f23afa573ea163
e00dda9216c3ff419287a5314dbc59ab4a17d21b016edd94cd573e92d0fb325e
e6055c44dd5f22a47efe21bbe1dbf9c958be2277b3bd42d1f0a5e2627d647147
0356017d322bf641144359877b9d26eb4c641406196cebc62d4d34ab17b57243
2e183f042cfb37d392227418fd02958a2703ad0ab709c55af411f42c3f324c2b
455fd66f34bd00635469032ca7c9bea2599a0a465dd41e926f90d2497f51f56d
68e3577850d4112598abe487ace1ce015c4a2b76597ad13bcce22d2cd883ab84
80e38d027e52f6cc81fee7724ab2a547e009cfe89bbffdcfa0b4ccac341e1405
af1769893ea37eae8427bf518fc3997f29d47c4c12d1875ed1aaa03c82babdbd
b937dbef3913a3de2ee021c170f61f8d79796472fae56d7a0934416939c6d4aa
c2078b2c605ed12c74ef0bffcdde97b77347aff06fd259faa0e13cf528ad5cc7
cff6797480af5200e7be067db8371b3f68050b012212023f3f4b75cd5fd316a2
d47488913c46e26f5a3289d18389dc86838508e38cd14290f26ec1544f7321eb
e5c6cec6e9786bb04799923ddc0a88ffc7ded022391860d70c300a883c201dae
f05fba6c03f9aed26b526ce54bc19512507b59ef4d077c30395c21831e7ead37
16f463b2dfc02cc963f4499ff13aafe17d7e4947af19f23fbca0fcad6520c14d
37d3fdf78283d105d6a11a520d61e7f5a5f0883503d4087d370ae5b301ec84ef
67520958119a147d51de928d55e9a0ced531b51efe8671da77e7728dda2e9c6d
7f023b3071908bbbbd9e6c0aa2c8f75154b633cc1776819e00bb509a757f8a42
85d45ca5ca28524177a5811ba3d182ce8b9f566eeded8b71f2a5f42e930fd3f3
9ec8956104635f128dd55d1dd763e9a2c129fbe35361e6488edf9520b574ae76
a660aa9f5e07846bfac45137b2efebbdff14e5724d9a34f930c6171bb878602a
a8a32006bb8f0bc7e0e407081fed56e5609ff47656e319a82c81cc2201059f6a
c59a3023d8b06439767f45fd49e498c03a241e645df477d3852b9c9011d30263


Description

2018-10-03T10:16:57.845 INFO:tasks.ceph.osd.7.smithi149.stderr:/build/ceph-14.0.0-3811-gb36adc9/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f9e1e06e700 time 2018-10-03 10:16:57.843713
2018-10-03T10:16:57.845 INFO:tasks.ceph.osd.7.smithi149.stderr:/build/ceph-14.0.0-3811-gb36adc9/src/osd/OSD.cc: 10109: FAILED ceph_assert(p != pg_slots.end())
2018-10-03T10:16:57.903 INFO:teuthology.orchestra.run.smithi149:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_blocked_ops'
2018-10-03T10:16:57.936 INFO:tasks.ceph.osd.3.smithi060.stderr:2018-10-03 10:16:57.931 7f4d82e81700 -1 received  signal: Hangup from /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  (PID: 11231) UID: 0
2018-10-03T10:16:58.036 INFO:tasks.ceph.osd.6.smithi149.stderr:2018-10-03 10:16:58.032 7f186a51a700 -1 received  signal: Hangup from /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 6  (PID: 11089) UID: 0
2018-10-03T10:16:58.044 INFO:tasks.ceph.osd.7.smithi149.stderr: ceph version 14.0.0-3811-gb36adc9 (b36adc93ab29e9108ef784e767962f948a1e1b9d) nautilus (dev)
2018-10-03T10:16:58.044 INFO:tasks.ceph.osd.7.smithi149.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x561c0314b875]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x561c0314ba52]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 3: (OSDShard::register_and_wake_split_child(PG*)+0x805) [0x561c03297605]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x13a) [0x561c0329779a]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 5: (Context::complete(int)+0x9) [0x561c0329fcb9]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x76c) [0x561c03283a1c]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x496) [0x561c038c6f66]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561c038ce720]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 9: (()+0x76db) [0x7f9e461b96db]
2018-10-03T10:16:58.045 INFO:tasks.ceph.osd.7.smithi149.stderr: 10: (clone()+0x3f) [0x7f9e44f5488f]

/a/nojha-2018-10-02_20:12:26-rados-master-distro-basic-smithi/3094858/
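
For context: the backtrace shows OSD::_finish_splits handing a freshly created split-child PG to OSDShard::register_and_wake_split_child, which expects a slot for that PG to already exist in the shard's pg_slots table. Below is a minimal, self-contained toy model of that two-step protocol, not the real implementation; the names (pg_slots, prime, register) mirror the Ceph source, everything else is illustrative:

    // Toy model of the invariant behind the assert: a split child's slot
    // must be "primed" before the child PG is registered into it.
    #include <cassert>
    #include <map>
    #include <string>

    struct Slot { bool pg_installed = false; };

    std::map<std::string, Slot> pg_slots;    // per-shard slot table

    // Step 1: when a split is identified, prime a slot for each expected
    // child so work can queue against it before the PG object exists.
    void prime_split(const std::string& child_pgid) {
      pg_slots[child_pgid];                  // default-construct the slot
    }

    // Step 2: once the child PG object exists, install it in its slot.
    // The slot must already be there -- this is the failing assumption.
    void register_and_wake_split_child(const std::string& child_pgid) {
      auto p = pg_slots.find(child_pgid);
      assert(p != pg_slots.end());           // the ceph_assert that fires
      p->second.pg_installed = true;
    }

    int main() {
      prime_split("1.58");
      register_and_wake_split_child("1.58");  // ok: slot was primed
      register_and_wake_split_child("2.fs1"); // aborts: never primed
    }

In other words, the assert means "a split finished for a PG whose slot was never primed"; the comments below track down how priming can be skipped.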


Related issues: 2 (0 open, 2 closed)

Has duplicate: RADOS - Bug #38483: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) (Resolved, Sage Weil, 02/26/2019)

Has duplicate: RADOS - Bug #52149: crash: void OSDShard::register_and_wake_split_child(PG*): assert(p != pg_slots.end()) (Duplicate)

Actions #1

Updated by Sage Weil over 5 years ago

  • Status changed from New to Can't reproduce

I'm guessing this was fixed by 450f337d6fd048c8c95a0ec0dec0d97f5474922e

Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from Can't reproduce to 12
  • Priority changed from Normal to Urgent

Reproduced, but without logs:

2019-02-16T22:22:14.214 INFO:tasks.ceph.osd.2.smithi107.stderr:/build/ceph-14.0.1-3819-g2bd9523/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f7e8557f700 time 2019-02-16 22:22:14.218197
2019-02-16T22:22:14.214 INFO:tasks.ceph.osd.2.smithi107.stderr:/build/ceph-14.0.1-3819-g2bd9523/src/osd/OSD.cc: 10561: FAILED ceph_assert(p != pg_slots.end())
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: ceph version 14.0.1-3819-g2bd9523 (2bd9523756002a98bc3cea0c687fc99c4b8b988a) nautilus (dev)
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x84d12c]
2019-02-16T22:22:14.217 INFO:tasks.ceph.osd.2.smithi107.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x84d307]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 3: (OSDShard::register_and_wake_split_child(PG*)+0x7f3) [0x9af683]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x124) [0x9af7f4]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 5: (Context::complete(int)+0x9) [0x9b6889]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6a4) [0x99c064]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfb775c]
2019-02-16T22:22:14.218 INFO:tasks.ceph.osd.2.smithi107.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfba910]
2019-02-16T22:22:14.219 INFO:tasks.ceph.osd.2.smithi107.stderr: 9: (()+0x76ba) [0x7f7ea4c426ba]
2019-02-16T22:22:14.219 INFO:tasks.ceph.osd.2.smithi107.stderr: 10: (clone()+0x6d) [0x7f7ea424941d]

/a/sage-2019-02-16_18:46:49-rados-wip-sage-testing-2019-02-16-0946-distro-basic-smithi/3601996
Actions #3

Updated by Neha Ojha about 5 years ago

We have logs here: /a/nojha-2019-02-11_18:58:45-rados:thrash-erasure-code-wip-test-revert-distro-basic-smithi/3575122

Actions #4

Updated by Sage Weil about 5 years ago

aha:
during startup, we load pg 2.fs1, but fail to prime it from init():

2019-02-11 19:51:05.982 7f4735a22c00  0 osd.1 230 load_pgs opened 53 pgs
...
2019-02-11 19:51:05.983 7f4735a22c00 20 osd.1 230 identify_splits_and_merges 1.0 e199 to e230 pg_nums {14=8,28=18,33=28,58=38,160=48,190=58,198=68,207=78}
...
2019-02-11 19:51:05.983 7f4735a22c00 20 osd.1 230 identify_splits_and_merges 3.1fs1 e229 to e230 pg_nums {151=16,167=26,216=36,223=46,230=56}

2.fs1 isn't included there. Only 51 pgs are mentioned, but we loaded 53 of them. We did prime splits for 2 different PGs, though, which modified pg_slots, invalidating our iterator.

see /a/nojha-2019-02-11_18:58:45-rados:thrash-erasure-code-wip-test-revert-distro-basic-smithi/3575122 osd.1
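
The mechanism described above is ordinary container iterator invalidation: the startup walk held an iterator into pg_slots while priming splits inserted new slots into the same table. A self-contained sketch of why that loses entries, assuming pg_slots is a std::unordered_map as in the Ceph source of this era (insertion can trigger a rehash, which invalidates every live iterator):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
      std::unordered_map<std::string, int> pg_slots = {{"1.0", 0}, {"3.1fs1", 0}};

      auto it = pg_slots.begin();          // walk the slots, as the load path does
      size_t buckets = pg_slots.bucket_count();

      // "Priming" split children mid-walk inserts into the same map;
      // keep inserting until a rehash actually happens.
      for (int i = 0; pg_slots.bucket_count() == buckets; ++i)
        pg_slots["2." + std::to_string(i)] = 0;

      // The rehash invalidated *all* outstanding iterators, so `it` is now
      // dangling; continuing the walk through it is undefined behavior and
      // can silently skip entries -- which is how 2 of the 53 loaded PGs
      // could drop out of the identify_splits_and_merges pass.
      std::cout << "rehash occurred; the pre-insert iterator is dangling\n";
      (void)it;
    }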

Actions #5

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #6

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Kefu Chai over 4 years ago

  • Status changed from Resolved to New
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-3600-ge2c05a7/rpm/el7/BUILD/ceph-15.0.0-3600-ge2c05a7/src/osd/OSD.cc: 10451: FAILED ceph_assert(p != pg_slots.end())

 ceph version 15.0.0-3600-ge2c05a7 (e2c05a7110b8a24787480b43d6293549ed2a42f1) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x560085e6afb1]
 2: (()+0x4ee179) [0x560085e6b179]
 3: (OSDShard::register_and_wake_split_child(PG*)+0x7ad) [0x560085f8190d]
 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x1f3) [0x560085f81b63]
 5: (Context::complete(int)+0x9) [0x560085f8c549]
 6: (void finish_contexts<std::list<Context*, std::allocator<Context*> > >(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x7d) [0x5600863032bd]
 7: (C_ContextsBase<Context, Context, std::list<Context*, std::allocator<Context*> > >::complete(int)+0x29) [0x560086303529]
 8: (Finisher::finisher_thread_entry()+0x19d) [0x5600864ed20d]
 9: (()+0x7dd5) [0x7fbe4154edd5]
 10: (clone()+0x6d) [0x7fbe4041502d]

/a/kchai-2019-08-08_08:04:33-rados-wip-29537-kefu-distro-basic-smithi/4198166

Actions #8

Updated by Greg Farnum over 4 years ago

  • Priority changed from Urgent to Normal

We can bump this priority back up if it reappears.

Actions #9

Updated by Greg Farnum over 4 years ago

  • Has duplicate Bug #38483: FAILED ceph_assert(p != pg_slots.end()) in OSDShard::register_and_wake_split_child(PG*) added
Actions #10

Updated by Brad Hubbard almost 4 years ago

  • Priority changed from Normal to Urgent

/a/teuthology-2020-04-26_07:01:02-rados-master-distro-basic-smithi/4986119

Actions #11

Updated by Neha Ojha over 3 years ago

  • Priority changed from Urgent to Normal

Haven't seen this in a while.

Actions #12

Updated by Neha Ojha over 3 years ago

/a/ksirivad-2020-11-16_07:16:50-rados-wip-mgr-progress-turn-off-option-distro-basic-smithi/5630402 - no logs

Actions #13

Updated by Neha Ojha about 3 years ago

/a/teuthology-2021-01-23_07:01:02-rados-master-distro-basic-gibba/5819503

Actions #14

Updated by Neha Ojha about 3 years ago

  • Priority changed from Normal to High
  • Backport set to pacific, octopus, nautilus

/a/yuriw-2021-03-08_21:03:18-rados-wip-yuri5-testing-2021-03-08-1049-pacific-distro-basic-smithi/5947439

Actions #15

Updated by Neha Ojha about 3 years ago

/a/yuriw-2021-03-19_00:00:55-rados-wip-yuri8-testing-2021-03-18-1502-pacific-distro-basic-smithi/5978982

Actions #16

Updated by Neha Ojha about 3 years ago

relevant osd.3 logs from yuriw-2021-03-19_00:00:55-rados-wip-yuri8-testing-2021-03-18-1502-pacific-distro-basic-smithi/5978982

2021-03-19T16:15:42.602+0000 7f1e7d4ea700 10 osd.3 452 split_pgs splitting pg[1.18( empty local-lis/les=379/380 n=0 ec=134/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [4] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] into 1.58
2021-03-19T16:15:42.602+0000 7f1e7d4ea700 10 osd.3 452 _make_pg 1.58
2021-03-19T16:15:42.602+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58(unlocked)] enter Initial
...
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 453 _finish_splits pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}]
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] handle_initialize
2021-03-19T16:15:42.609+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Initial 0.006560 0 0.000000
2021-03-19T16:15:42.609+0000 7f1e7d4ea700  5 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Reset
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 20 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] set_last_peering_reset 452
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] Clearing blocked outgoing recovery messages
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] Not blocking outgoing recovery messages
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3 pg_epoch: 452 pg[1.58( empty local-lis/les=379/380 n=0 ec=452/14 lis/c=379/379 les/c/f=380/380/0 sis=452) [6,5] r=-1 lpr=452 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] null
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 15 osd.3 453 enqueue_peering_evt 1.58 epoch_sent: 452 epoch_requested: 452 NullEvt
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 20 osd.3 op_wq(0) _enqueue OpSchedulerItem(1.58 PGPeeringEvent(epoch_sent: 452 epoch_requested: 452 NullEvt) prio 255 cost 10 e452)
2021-03-19T16:15:42.609+0000 7f1e7d4ea700 10 osd.3:0.register_and_wake_split_child 1.58 0x561e34601000
...
2021-03-19T16:15:42.611+0000 7f1e7d4ea700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-945-g14d80dbf/rpm/el8/BUILD/ceph-16.1.0-945-g14d80dbf/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f1e7d4ea700 time 2021-03-19T16:15:42.610408+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-945-g14d80dbf/rpm/el8/BUILD/ceph-16.1.0-945-g14d80dbf/src/osd/OSD.cc: 10551: FAILED ceph_assert(p != pg_slots.end())

 ceph version 16.1.0-945-g14d80dbf (14d80dbf937b38fba622cec4b41998d0bd128816) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x561e1b13475a]
 2: ceph-osd(+0x568974) [0x561e1b134974]
 3: (OSDShard::register_and_wake_split_child(PG*)+0x810) [0x561e1b26f670]
 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x2ad) [0x561e1b26f97d]
 5: (Context::complete(int)+0xd) [0x561e1b273e0d]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xadc) [0x561e1b258d7c]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x561e1b8bf764]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x561e1b8c2404]
 9: /lib64/libpthread.so.0(+0x82de) [0x7f1ea1b2c2de]
 10: clone()
Actions #17

Updated by Neha Ojha about 3 years ago

/a/yuriw-2021-03-25_20:03:40-rados-wip-yuri8-testing-2021-03-25-1042-pacific-distro-basic-smithi/5999016

Actions #19

Updated by Kefu Chai over 2 years ago

2021-07-28T09:04:05.163+0000 7fa0aef14700 20 bluestore(/var/lib/ceph/osd/ceph-1).collection(meta 0x5595b8d4f680)  r -2 v.len 0
2021-07-28T09:04:05.163+0000 7fa0aef14700 10 bluestore(/var/lib/ceph/osd/ceph-1) read meta #-1:afd56935:::osdmap.368:0# 0x0~0 = -2
2021-07-28T09:04:05.163+0000 7fa0aef14700 -1 osd.1 403 failed to load OSD map for epoch 368, got 0 bytes
2021-07-28T09:04:05.163+0000 7fa0aef14700 20 osd.1 403 advance_pg missing map 368
2021-07-28T09:04:05.163+0000 7fa0aef14700 20 osd.1 403 get_map 369 - loading and decoding 0x5595b9761700
2021-07-28T09:04:05.163+0000 7fa0ae713700 -1 /build/ceph-17.0.0-6461-g3cfb73ec/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7fa0ae713700 time 2021-07-28T09:04:05.157796+0000
/build/ceph-17.0.0-6461-g3cfb73ec/src/osd/OSD.cc: 10748: FAILED ceph_assert(p != pg_slots.end())

 ceph version 17.0.0-6461-g3cfb73ec (3cfb73ec2f3bd8e1d6621a767f6ab72f21d849ae) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5595b37992c0]
 2: ceph-osd(+0xc184d2) [0x5595b37994d2]
 3: (OSDShard::register_and_wake_split_child(PG*)+0x7d0) [0x5595b388c110]
 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x3af) [0x5595b38ad36f]
 5: (Context::complete(int)+0xd) [0x5595b38b20ed]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x953) [0x5595b3891323]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x5595b3fe0d73]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5595b3fe3b94]
 9: (Thread::_entry_func(void*)+0xd) [0x5595b3fd351d]
 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fa0cb93b609]
 11: clone()

/a/kchai-2021-07-28_08:37:00-rados-wip-kefu-testing-2021-07-28-1257-distro-basic-smithi/6297932/

Actions #20

Updated by Neha Ojha over 2 years ago

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324576

Actions #21

Updated by Ronen Friedman over 2 years ago

Some possibly helpful hints:
1. In "my" specific instance, the pg address handed over to register_and_wake_split_child() was null.
2. The locking in the calling function (_finish_splits()) took a long time.

Actions #22

Updated by Neha Ojha over 2 years ago

More useful debug logging is being added in https://github.com/ceph/ceph/pull/42965

Actions #23

Updated by Neha Ojha over 2 years ago

Ronen Friedman wrote:

Some possibly helpful hints:
1. In "my" specific instance, the pg address handed over to register_and_wake_split_child() was null.
2. The locking in the calling function (_finish_splits()) took a long time.

Can you provide a link to the failed run?

Actions #24

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52149: crash: void OSDShard::register_and_wake_split_child(PG*): assert(p != pg_slots.end()) added
Actions #25

Updated by Ronen Friedman over 2 years ago

Neha Ojha wrote:

Can you provide a link to the failed run?

Trying to reproduce.

Actions #26

Updated by Neha Ojha about 2 years ago

  • Priority changed from High to Normal
Actions #27

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v14.2.11, v14.2.7, v15.2.1, v15.2.11, v15.2.13, v15.2.5, v15.2.6, v15.2.8, v15.2.9 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=86176dad44ae51d3e7de7eac892f695acedb065563b1a18490481db3635c017a

Assert condition: p != pg_slots.end()
Assert function: void OSDShard::register_and_wake_split_child(PG*)

Sanitized backtrace:

    OSDShard::register_and_wake_split_child(PG*)
    OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)
    Context::complete(int)
    OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
    ShardedThreadPool::shardedthreadpool_worker(unsigned int)
    ShardedThreadPool::WorkThreadSharded::entry()
    clone()

Crash dump sample:
{
    "archived": "2021-07-17 11:49:45.087615",
    "assert_condition": "p != pg_slots.end()",
    "assert_file": "osd/OSD.cc",
    "assert_func": "void OSDShard::register_and_wake_split_child(PG*)",
    "assert_line": 10321,
    "assert_msg": "osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7ff20c836700 time 2021-07-17T04:26:15.494347-0700\nosd/OSD.cc: 10321: FAILED ceph_assert(p != pg_slots.end())",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "(()+0xf630) [0x7ff22dabd630]",
        "(gsignal()+0x37) [0x7ff22c8ab387]",
        "(abort()+0x148) [0x7ff22c8aca78]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55f9736589f2]",
        "(()+0x4deb6b) [0x55f973658b6b]",
        "(OSDShard::register_and_wake_split_child(PG*)+0x781) [0x55f97376c7b1]",
        "(OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x296) [0x55f97376caa6]",
        "(Context::complete(int)+0x9) [0x55f973771c29]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x14ec) [0x55f97375864c]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55f973d44c46]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f973d47790]",
        "(()+0x7ea5) [0x7ff22dab5ea5]",
        "(clone()+0x6d) [0x7ff22c9739fd]" 
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-07-17T11:26:15.505225Z_3117824e-c38e-48a5-b593-adaebe8106a8",
    "entity_name": "osd.79ecd90a2b778d5c739dd4b7af798dda206d7c70",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "7 (Core)",
    "os_version_id": "7",
    "process_name": "ceph-osd",
    "stack_sig": "b34e9d76dd14e64b9a2fe50f844fd40ec4279a5d92b937e331cd3cfa8a4f7d41",
    "timestamp": "2021-07-17T11:26:15.505225Z",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.31.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Jun 10 13:32:12 UTC 2021" 
}

Actions #28

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #29

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v14.2.15, v15.2.12, v16.2.1, v16.2.4, v16.2.5, v16.2.7 added
Actions #30

Updated by Radoslaw Zarzynski about 2 years ago

  • Status changed from New to Need More Info
  • Crash signature (v1) updated (diff)

Waiting for the issue to be reproduced. See comment #25.

Actions #31

Updated by Telemetry Bot over 1 year ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v15.2.15, v15.2.16, v15.2.7, v17.2.0 added
Actions #32

Updated by Radoslaw Zarzynski over 1 year ago

  • Backport changed from pacific, octopus, nautilus to pacific,quincy
  • Crash signature (v1) updated (diff)
Actions #33

Updated by Radoslaw Zarzynski over 1 year ago

Although there was a report from Telemetry, we still need more logs (read: a recurrence at Sepia), which will hopefully shed some extra light, as Ronen introduced more logging into register_and_wake_split_child().

Actions #34

Updated by Neha Ojha over 1 year ago

/a/yuriw-2022-09-15_17:53:16-rados-quincy-release-distro-default-smithi/7034166

Actions #35

Updated by Nathan Gardiner about 1 year ago

I'm hitting this assertion on 16.2.11 on one of my OSDs; here are the logs:

2023-03-28T20:42:28.660+1100 7f5c81aa4700 -1 ./src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f5c81aa4700 time 2023-03-28T20:42:28.653968+1100
./src/osd/OSD.cc: 10862: FAILED ceph_assert(slot->waiting_for_split.count(epoch))

ceph version 16.2.11 (578f8e68e41b0a98523d0045ef6db90ce6f2e5ab) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x558c594172ea]
2: /usr/bin/ceph-osd(+0xac3475) [0x558c59417475]
3: (OSDShard::register_and_wake_split_child(PG*)+0x354) [0x558c594de204]
4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0xe6) [0x558c594e1ac6]
5: (Context::complete(int)+0x9) [0x558c5951bf19]
6: (OSD::ShardedOpWQ::handle_oncommits(std::__cxx11::list<Context*, std::allocator<Context*> >&)+0x24) [0x558c5952d4f4]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x256b) [0x558c59501ceb]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x558c59bafaba]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558c59bb2090]
10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5c9e430ea7]
11: clone()

2023-03-28T20:42:28.668+1100 7f5c81aa4700 -1 *** Caught signal (Aborted) **
in thread 7f5c81aa4700 thread_name:tp_osd_tp

ceph version 16.2.11 (578f8e68e41b0a98523d0045ef6db90ce6f2e5ab) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f5c9e43c140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x558c59417334]
5: /usr/bin/ceph-osd(+0xac3475) [0x558c59417475]
6: (OSDShard::register_and_wake_split_child(PG*)+0x354) [0x558c594de204]
7: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0xe6) [0x558c594e1ac6]
8: (Context::complete(int)+0x9) [0x558c5951bf19]
9: (OSD::ShardedOpWQ::handle_oncommits(std::__cxx11::list<Context*, std::allocator<Context*> >&)+0x24) [0x558c5952d4f4]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x256b) [0x558c59501ceb]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x558c59bafaba]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558c59bb2090]
13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5c9e430ea7]
14: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Actions #36

Updated by Radoslaw Zarzynski about 1 year ago

Hello Nathan!

Do you have a log or a coredump by any chance?
