Project

General

Profile

Actions

Bug #15426

closed

is_split crash in handle_pg_create

Added by Samuel Just about 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer, infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-04-06T20:53:28.907 INFO:tasks.ceph.osd.5.smithi044.stderr:osd/osd_types.cc: 476: FAILED assert(m_seed < old_pg_num)
2016-04-06T20:53:28.907 INFO:tasks.ceph.osd.5.smithi044.stderr:
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: ceph version 10.1.0-496-g4c2c6fd (4c2c6fd6563be8c0a4e08a162e9ccb33e48fa597)
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f3048f0f96b]
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 2: (()+0x60c181) [0x7f3048b3d181]
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 3: (spg_t::is_split(unsigned int, unsigned int, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >) const+0x5b) [0x7f3048904c9b]
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 4: (OSD::project_pg_history(spg_t, pg_history_t&, unsigned int, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int)+0x379)
[0x7f30488c3729]
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 5: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0xac5) [0x7f30488daa15]
2016-04-06T20:53:28.908 INFO:tasks.ceph.osd.5.smithi044.stderr: 6: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x1f8) [0x7f30488dc6f8]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 7: (OSD::_dispatch(Message
)+0x21d) [0x7f30488dd13d]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 8: (OSD::ms_dispatch(Message*)+0x20f) [0x7f30488dd7bf]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 9: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x7f304902c827]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 10: (C_handle_dispatch::do_request(int)+0x11) [0x7f304902cf81]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 11: (EventCenter::process_events(int)+0xa2d) [0x7f3048fce81d]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 12: (Worker::entry()+0x1e0) [0x7f3048faf460]
2016-04-06T20:53:28.909 INFO:tasks.ceph.osd.5.smithi044.stderr: 13: (()+0x8182) [0x7f30471d2182]
2016-04-06T20:53:28.910 INFO:tasks.ceph.osd.5.smithi044.stderr: 14: (clone()+0x6d) [0x7f304530047d]
2016-04-06T20:53:28.910 INFO:tasks.ceph.osd.5.smithi044.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues 2 (0 open2 closed)

Copied to Ceph - Backport #15481: infernalis: is_split crash in handle_pg_createRejectedActions
Copied to Ceph - Backport #15482: hammer: is_split crash in handle_pg_createResolvedNathan CutlerActions
Actions #1

Updated by Samuel Just about 8 years ago

sjust@teuthology:/a/samuelj-2016-04-06_16:46:34-rados-wip-sam-testing-distro-basic-smithi/112630$

wip-sam-testing run, but should not matter.

Actions #2

Updated by Samuel Just about 8 years ago

wip-sam-testing-07-04-2016

Actions #3

Updated by Samuel Just about 8 years ago

sjust@teuthology:/a/samuelj-2016-04-06_16:46:34-rados-wip-sam-testing-distro-basic-smithi/112457$

Actions #4

Updated by Kefu Chai about 8 years ago

the cluster was thrashing by updating the pgnum and pgpnum. so the monitor sends a pg-create message to the primary. as the logs of monitor and osd are missing, can hardly tell why the osd find that the pg.seed greater than old_pg_num.

Actions #5

Updated by Kefu Chai about 8 years ago

  • Status changed from New to Need More Info
Actions #6

Updated by Yuri Weinstein about 8 years ago

  • Source changed from other to Q/A
  • Release set to jewel
  • ceph-qa-suite rados added

Run: http://pulpito.ceph.com/teuthology-2016-04-10_22:00:01-rados-jewel-distro-basic-smithi/
Job: ['120522']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-04-10_22:00:01-rados-jewel-distro-basic-smithi/120522/teuthology.log

2016-04-11T08:18:08.486 INFO:tasks.ceph.osd.3.smithi009.stderr:osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned int, std::set<pg_t>*) const' thread 7f80766c3700 time 2016-04-11 15:18:08.479045
2016-04-11T08:18:08.486 INFO:tasks.ceph.osd.3.smithi009.stderr:osd/osd_types.cc: 476: FAILED assert(m_seed < old_pg_num)
2016-04-11T08:18:08.486 INFO:tasks.ceph.osd.3.smithi009.stderr: ceph version 10.1.1 (ce50389b773fe7f72fca40a3dd69cfe6613eaeb1)
2016-04-11T08:18:08.486 INFO:tasks.ceph.osd.3.smithi009.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f808ab13885]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 2: (()+0x62023c) [0x7f808a72423c]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 3: (spg_t::is_split(unsigned int, unsigned int, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*) const+0x6e) [0x7f808a4db24e]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 4: (OSD::project_pg_history(spg_t, pg_history_t&, unsigned int, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int)+0x389) [0x7f808a497dd9]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 5: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0xae5) [0x7f808a4af615]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 6: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x220) [0x7f808a4b1400]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 7: (OSD::_dispatch(Message*)+0x22e) [0x7f808a4b1ede]
2016-04-11T08:18:08.487 INFO:tasks.ceph.osd.3.smithi009.stderr: 8: (OSD::ms_dispatch(Message*)+0x21f) [0x7f808a4b259f]
2016-04-11T08:18:08.488 INFO:tasks.ceph.osd.3.smithi009.stderr: 9: (DispatchQueue::entry()+0x749) [0x7f808abd3ac9]
2016-04-11T08:18:08.488 INFO:tasks.ceph.osd.3.smithi009.stderr: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f808aaef0bd]
2016-04-11T08:18:08.488 INFO:tasks.ceph.osd.3.smithi009.stderr: 11: (()+0x7dc5) [0x7f8088837dc5]
2016-04-11T08:18:08.488 INFO:tasks.ceph.osd.3.smithi009.stderr: 12: (clone()+0x6d) [0x7f8086ec321d]
2016-04-11T08:18:08.488 INFO:tasks.ceph.osd.3.smithi009.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #7

Updated by Kefu Chai about 8 years ago

  • Status changed from Need More Info to 12
  • Assignee set to Kefu Chai
Actions #8

Updated by Kefu Chai about 8 years ago

the first osd_pg_create msg for pg2.17 received by osd.3 was

2016-04-11 15:18:08.478720 7f80766c3700 1 -- 172.21.15.9:6800/10707 <== mon.2 172.21.15.67:6790/0 9 ==== osd_pg_create(e87 2.7:61 2.17:61) v3 ==== 162+0+0 (1934195199 0 0) 0x7f8096f2eb40 con 0x7f80965f8400

in smithi067/log/ceph-mon.c.log.gz

2016-04-11 15:18:08.544168 7fae35323700 20 mon.c@2(peon).pg v176 refreshing pg 2.17 86:14 creating+peering
...
2016-04-11 15:18:08.544276 7fae35323700 20 mon.c@2(peon).pg v177 check_subs .. osd.3 172.21.15.9:6800/10707
2016-04-11 15:18:08.544278 7fae35323700 20 mon.c@2(peon).pg v177 send_pg_creates osd.3 from 0 : epoch 71 2 pgs
2016-04-11 15:18:08.544281 7fae35323700 1 -- 172.21.15.67:6790/0 >> 172.21.15.9:6800/10707 conn(0x7fae4959b000 sd=32 :6790 s=STATE_OPEN pgs=9 cs=1 l=1). tx 0x7fae49b4c900 osd_pg_create(e87 2.7:61 2.17:61) v3

osd_pg_create(e87 2.7:61 2.17:61) was sent to osd.3. but the "created" epoch of pg2.17 should have been e71. see

2016-04-11 15:17:25.807456 7fb1d2133700 10 mon.b@0(leader).pg v147 register_new_pgs checking pg pools for osdmap epoch 71, last_pg_scan 70
...
2016-04-11 15:17:25.807590 7fb1d2133700 10 mon.b@0(leader).pg v147 parent is 2.7
2016-04-11 15:17:25.807596 7fb1d2133700 10 mon.b@0(leader).pg v147 register_pg will create 2.17 primary 3 acting [3] parent 2.7 by 1 bits

so, the pg2.17 should not be processed at all:

  1. msg.mkpg[pg2.17].split_bits should be 1 but it was 0.
  2. msg.mkpg[pg2.17].created should be 71 but it was 61.
Actions #9

Updated by Kefu Chai about 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #10

Updated by Kefu Chai about 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer, infernalis
Actions #11

Updated by Kefu Chai about 8 years ago

  • Assignee deleted (Kefu Chai)
Actions #12

Updated by Nathan Cutler about 8 years ago

  • Copied to Backport #15481: infernalis: is_split crash in handle_pg_create added
Actions #13

Updated by Nathan Cutler about 8 years ago

  • Copied to Backport #15482: hammer: is_split crash in handle_pg_create added
Actions #14

Updated by Sage Weil about 8 years ago

  • Priority changed from Immediate to Urgent
Actions #15

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF