Bug #18929

"osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))" in rados/upgrade

Added by Yuri Weinstein about 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: kraken,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(part of PRs testing batch https://trello.com/c/wqtuiCLb)

Run: http://pulpito.ceph.com/yuriw-2017-02-14_00:18:31-rados-wip-yuri-testing2_2017_2_14---basic-smithi/
Job: http://pulpito.ceph.com/yuriw-2017-02-14_00:18:31-rados-wip-yuri-testing2_2017_2_14---basic-smithi/812148
Logs: /a/yuriw-2017-02-14_00:18:31-rados-wip-yuri-testing2_2017_2_14---basic-smithi/812148/teuthology.log

 yuriw@teuthology ~ [16:38:27]> grep -i caught /a/yuriw-2017-02-14_00:18:31-rados-wip-yuri-testing2_2017_2_14---basic-smithi/812148/teuthology.log -b10 -a20
444787790-2017-02-14T01:31:35.396 INFO:tasks.ceph.osd.1.smithi171.stderr: -3056> 2017-02-14 01:31:35.353391 7f4ec3904700 -1 /build/ceph-12.0.0-279-gd8a7d0d/src/osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::AdvMap&)' thread 7f4ec3904700 time 2017-02-14 01:31:35.350665
444788095-2017-02-14T01:31:35.396 INFO:tasks.ceph.osd.1.smithi171.stderr:/build/ceph-12.0.0-279-gd8a7d0d/src/osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))
444788285-2017-02-14T01:31:35.397 INFO:tasks.ceph.osd.1.smithi171.stderr:
444788349-2017-02-14T01:31:35.397 INFO:tasks.ceph.osd.1.smithi171.stderr: ceph version 12.0.0-279-gd8a7d0d (d8a7d0d1a7113686b4b1f2554406d36ef584c290)
444788489-2017-02-14T01:31:35.398 INFO:tasks.ceph.osd.1.smithi171.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7f4edec8833e]
444788650-2017-02-14T01:31:35.398 INFO:tasks.ceph.osd.1.smithi171.stderr: 2: (PG::RecoveryState::Active::react(PG::AdvMap const&)+0x167) [0x7f4ede82b877]
444788794-2017-02-14T01:31:35.398 INFO:tasks.ceph.osd.1.smithi171.stderr: 3: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x266) [0x7f4ede881e86]
444789103-2017-02-14T01:31:35.398 INFO:tasks.ceph.osd.1.smithi171.stderr: 4: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x5a) [0x7f4ede87ec4a]
444789596-2017-02-14T01:31:35.398 INFO:tasks.ceph.osd.1.smithi171.stderr: 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f4ede8632bb]
444789902-2017-02-14T01:31:35.399 INFO:tasks.ceph.osd.1.smithi171.stderr: 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x48f) [0x7f4ede83303f]
444790189-2017-02-14T01:31:35.399 INFO:tasks.ceph.osd.1.smithi171.stderr: 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x299) [0x7f4ede787a79]
444790476-2017-02-14T01:31:35.400 INFO:tasks.ceph.osd.1.smithi171.stderr: 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1a5) [0x7f4ede798665]
444790664-2017-02-14T01:31:35.400 INFO:tasks.ceph.osd.1.smithi171.stderr: 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f4ede7e67c7]
444790848-2017-02-14T01:31:35.400 INFO:tasks.ceph.osd.1.smithi171.stderr: 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7f4edec8ee15]
444790985-2017-02-14T01:31:35.400 INFO:tasks.ceph.osd.1.smithi171.stderr: 11: (ThreadPool::WorkThread::entry()+0x10) [0x7f4edec8fde0]
444791109-2017-02-14T01:31:35.401 INFO:tasks.ceph.osd.1.smithi171.stderr: 12: (()+0x8184) [0x7f4edca25184]
444791206-2017-02-14T01:31:35.402 INFO:tasks.ceph.osd.1.smithi171.stderr: 13: (clone()+0x6d) [0x7f4edbb1537d]
444791306-2017-02-14T01:31:35.402 INFO:tasks.ceph.osd.1.smithi171.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Copied to Ceph - Backport #18999: kraken: "osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))" in rados/upgrade Resolved
Copied to Ceph - Backport #19000: jewel: "osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))" in rados/upgrade Resolved

History

#1 Updated by Sage Weil about 7 years ago

2017-02-14 01:31:34.313088 7f4ec3904700 10 osd.1 pg_epoch: 2605 pg[2.5d( empty local-les=2605 n=0 ec=1620 les/c/f 2605/2600/0 2604/2604/2604) []/[1] r=0 lpr=2604 pi=2596-2603/3 crt=0'0 mlcod 0'0 active+undersized+degraded+remapped] choose_acting want [1,5] != acting [1], requesting pg_temp change
...
2017-02-14 01:31:35.350650 7f4ec3904700 10 osd.1 pg_epoch: 2605 pg[2.5d( empty local-les=2605 n=0 ec=1620 les/c/f 2605/2605/0 2604/2604/2604) []/[1] r=0 lpr=2604 crt=0'0 mlcod 0'0 active+undersized+degraded+remapped] handle_advance_map [3,4]/[1] -- 3/1
2017-02-14 01:31:35.350661 7f4ec3904700 10 osd.1 pg_epoch: 2606 pg[2.5d( empty local-les=2605 n=0 ec=1620 les/c/f 2605/2605/0 2604/2604/2604) []/[1] r=0 lpr=2604 crt=0'0 mlcod 0'0 active+undersized+degraded+remapped] state<Started/Primary/Active>: Active advmap
2017-02-14 01:31:35.353391 7f4ec3904700 -1 /build/ceph-12.0.0-279-gd8a7d0d/src/osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::AdvMap&)' thread 7f4ec3904700 time 2017-02-14 01:31:35.350665
/build/ceph-12.0.0-279-gd8a7d0d/src/osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))

The code is:
  for (size_t i = 0; i < pg->want_acting.size(); i++) {
    int osd = pg->want_acting[i];
    if (!advmap.osdmap->is_up(osd)) {
      pg_shard_t osd_with_shard(osd, shard_id_t(i));
      assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard));
    }
  }

I can't tell what that loop and assert are intended to check. The acting set can change arbitrarily, with no regard to what want_acting was in the prior interval!
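
To make the failure mode concrete, here is a minimal standalone sketch of the condition the loop checks, using the values from the log above (want_acting [1,5], acting [1], new up [3,4]). ToyPG, asserting_osds, and pg_from_log are hypothetical names for illustration only, not Ceph types. The membership test is a plain search that ignores shard ids, and the sketch assumes osd.5 is the OSD marked down in the new map, which the log excerpt does not show directly.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for the PG fields involved (not Ceph's PG).
struct ToyPG {
  std::vector<int> up;           // up set from the new map
  std::vector<int> acting;       // acting set from the new map
  std::vector<int> want_acting;  // left over from the prior interval
};

static bool contains(const std::vector<int>& v, int osd) {
  return std::find(v.begin(), v.end(), osd) != v.end();
}

// Returns every OSD in want_acting that would trip the assert: it is not up
// in the new map, and it is in neither the acting set nor the up set.
std::vector<int> asserting_osds(const ToyPG& pg,
                                const std::set<int>& osds_up_in_map) {
  std::vector<int> offenders;
  for (int osd : pg.want_acting) {
    if (osds_up_in_map.count(osd) == 0 &&
        !contains(pg.acting, osd) && !contains(pg.up, osd)) {
      offenders.push_back(osd);
    }
  }
  return offenders;
}

// State taken from the log lines above: want [1,5], then [3,4]/[1].
ToyPG pg_from_log() {
  return ToyPG{{3, 4}, {1}, {1, 5}};
}

// Usage: with osd.5 assumed down, it is the only offender; if osd.5 were
// still up, the loop would skip it and nothing would assert.
void demo() {
  ToyPG pg = pg_from_log();
  assert(asserting_osds(pg, {1, 3, 4}).size() == 1);
  assert(asserting_osds(pg, {1, 3, 4, 5}).empty());
}
```

In this model nothing ties want_acting from the prior interval to the new up or acting sets, so any OSD that was wanted earlier and then goes down can land outside both.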

#2 Updated by Sage Weil about 7 years ago

  • Status changed from New to 7

#3 Updated by Sage Weil about 7 years ago

  • Project changed from ceph-qa-suite to Ceph

#4 Updated by Samuel Just about 7 years ago

samuelj@teuthology:/a/samuelj-2017-02-15_01:03:44-rados-wip-sam-testing---basic-smithi/816292 also

#5 Updated by Samuel Just about 7 years ago

I don't understand why this is not popping up. Sage's patch is correct, but something else is going on. Why is the up set empty here?

#6 Updated by Sage Weil about 7 years ago

  • Priority changed from Normal to Immediate

#7 Updated by Samuel Just about 7 years ago

  • Assignee set to Greg Farnum

#8 Updated by Sage Weil about 7 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to kraken,jewel

#9 Updated by Loïc Dachary about 7 years ago

  • Copied to Backport #18999: kraken: "osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))" in rados/upgrade added

#10 Updated by Loïc Dachary about 7 years ago

  • Copied to Backport #19000: jewel: "osd/PG.cc: 6896: FAILED assert(pg->is_acting(osd_with_shard) || pg->is_up(osd_with_shard))" in rados/upgrade added

#11 Updated by Sage Weil about 7 years ago

  • Priority changed from Immediate to Urgent

#12 Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
