Bug #20785

osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))

Added by Sage Weil 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 07/26/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No
Component(RADOS): Monitor

Description

2017-07-26 15:53:25.247581 7fcbabaa1700  1 -- 172.21.15.184:6808/156553 <== mon.0 172.21.15.148:6789/0 5 ==== osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3 ==== 381+0+0 (2842883642 0 0) 0x5568d4476800 con 0x5568d938d000
2017-07-26 15:53:25.247589 7fcbabaa1700 20 osd.2 364 OSD::ms_dispatch: osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3
2017-07-26 15:53:25.247585 7fcb9b402700 20 osd.2 pg_epoch: 364 pg[9.7( v 364'14 (0'0,364'14] local-lis/les=362/364 n=14 ec=362/362 lis/c 362/362 les/c/f 364/364/0 362/362/362) [2,1] r=0 lpr=362 luod=364'13 lua=364'13 crt=364'14 lcod 364'12 mlcod 364'12 active+clean] PrimaryLogPG::check_blacklisted_obc_watchers for obc 9:e5b2df3f:::benchmark_data_smithi184_210998_object209:head
2017-07-26 15:53:25.247592 7fcbabaa1700 10 osd.2 364 do_waiters -- start
2017-07-26 15:53:25.247593 7fcbabaa1700 10 osd.2 364 do_waiters -- finish
2017-07-26 15:53:25.247595 7fcbabaa1700 20 osd.2 364 _dispatch 0x5568d4476800 osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3
2017-07-26 15:53:25.247600 7fcbabaa1700 10 osd.2 364 handle_pg_create osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3
2017-07-26 15:53:25.247603 7fcbabaa1700 15 osd.2 364 require_same_or_newer_map 363 (i am 364) 0x5568d4476800
2017-07-26 15:53:25.247593 7fcb9b402700 10 osd.2 pg_epoch: 364 pg[9.7( v 364'14 (0'0,364'14] local-lis/les=362/364 n=14 ec=362/362 lis/c 362/362 les/c/f 364/364/0 362/362/362) [2,1] r=0 lpr=362 luod=364'13 lua=364'13 crt=364'14 lcod 364'12 mlcod 364'12 active+clean] get_object_context: 0x5568d6b8ea80 9:e5b2df3f:::benchmark_data_smithi184_210998_object209:head rwstate(none n=0 w=0) oi: 9:e5b2df3f:::benchmark_data_smithi184_210998_object209:head(0'0 unknown.0.0:0 s 0 uv 0 alloc_hint [0 0 0]) ssc: 0x5568ba51f760 snapset: 0=[]:[]
2017-07-26 15:53:25.247606 7fcbabaa1700 20 osd.2 364 mkpg 9.6 e0@0.000000
...
2017-07-26 15:53:25.565112 7fcbabaa1700 -1 /build/ceph-12.1.1-609-g392d888/src/osd/osd_types.cc: In function 'static bool PastIntervals::check_new_interval(int, int, const std::vector<int>&, const std::vector<int>&, int, int, const std::vector<int>&, const std::vector<int>&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)' thread 7fcbabaa1700 time 2017-07-26 15:53:25.265711
/build/ceph-12.1.1-609-g392d888/src/osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))

 ceph version 12.1.1-609-g392d888 (392d888dffbe31593a9d9f3d4dbf4f83284fdc58) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5568aaf58e22]
 2: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x293) [0x5568aabc4493]
 3: (OSD::build_initial_pg_history(spg_t, unsigned int, utime_t, pg_history_t*, PastIntervals*)+0x4c2) [0x5568aa980932]
 4: (OSD::handle_pg_create(boost::intrusive_ptr<OpRequest>)+0xa8e) [0x5568aa9a143e]
 5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x1b1) [0x5568aa9a34a1]
 6: (OSD::_dispatch(Message*)+0x389) [0x5568aa9a3f19]
 7: (OSD::ms_dispatch(Message*)+0x87) [0x5568aa9a4267]
 8: (DispatchQueue::entry()+0xf4a) [0x5568ab1c1dea]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x5568aafeae8d]
 10: (()+0x76ba) [0x7fcbbe7546ba]
 11: (clone()+0x6d) [0x7fcbbd7cb82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2017-07-26_14:40:34-rados-wip-sage-testing-distro-basic-smithi/1447282

Related issues

Duplicated by Ceph - Bug #20474: osd/osd_types.cc: 3534: FAILED assert(lastmap->get_pools().count(pgid.pool())) Duplicate 06/30/2017
Duplicated by Ceph - Bug #20987: mon/OSDMonitor.cc: 3284: FAILED assert(create != creating_pgs.pgs.end()) Duplicate 08/11/2017
Copied to RADOS - Backport #21076: luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool())) Resolved
Copied to RADOS - Backport #21090: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool())) Resolved

History

#1 Updated by Sage Weil 3 months ago

The problem appears to be the message the mon sent,

osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) 

with epoch 0 for the create:
2017-07-26 15:53:25.247620 7fcbabaa1700 10 osd.2 364 build_initial_pg_history 9.6 created 0
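
When triaging similar logs, the create epoch is the number after the colon in each entry of the message. The following is a minimal sketch that extracts those epochs from a log fragment and flags entries whose epoch is 0. The regular expressions and the helper name `parse_pg_create` are illustrative assumptions based only on the textual format visible in the log line above, not on Ceph's own message encoding.

```python
import re

# Assumed textual format, taken from the log line above:
#   osd_pg_create(e<map epoch> <pool>.<seed hex>:<create epoch> ...)
MSG_RE = re.compile(r"osd_pg_create\(e(\d+)((?:\s+[0-9]+\.[0-9a-f]+:\d+)*)\)")
ENTRY_RE = re.compile(r"([0-9]+\.[0-9a-f]+):(\d+)")


def parse_pg_create(line):
    """Return (map_epoch, {pgid: create_epoch}), or None if no message is found."""
    m = MSG_RE.search(line)
    if m is None:
        return None
    creates = {pgid: int(epoch) for pgid, epoch in ENTRY_RE.findall(m.group(2))}
    return int(m.group(1)), creates


line = "osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3"
map_epoch, creates = parse_pg_create(line)

# PGs whose create epoch is 0 are the suspicious ones in this report.
suspicious = sorted(pgid for pgid, epoch in creates.items() if epoch == 0)
print(map_epoch, suspicious)
```

For the message in this report, every PG in the list is flagged, which matches the `created 0` seen in `build_initial_pg_history`.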

indeed,
2017-07-26 15:53:25.241490 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates osd.2 from 0 : epoch 363 5 pgs
2017-07-26 15:53:25.241501 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates will create 9.6 at 0
2017-07-26 15:53:25.241506 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates will create 9.7 at 0
2017-07-26 15:53:25.241509 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates will create 9.c at 0
2017-07-26 15:53:25.241510 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates will create 9.e at 0
2017-07-26 15:53:25.241512 7efec178f700 20 mon.b@0(leader).osd e364 send_pg_creates will create 9.f at 0
2017-07-26 15:53:25.241513 7efec178f700  1 -- 172.21.15.148:6789/0 --> 172.21.15.184:6808/156553 -- osd_pg_create(e363 9.6:0 9.7:0 9.c:0 9.e:0 9.f:0) v3 -- ?+0 0x7efed7cc9680 con 0x7efed868c300
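
To illustrate why a create epoch of 0 can trip this assertion, here is a toy model. It is not Ceph's code. It assumes that the initial-history walk starts at the PG's create epoch and checks the previous map for the PG's pool at each step. The map contents, the pool-creation epoch (362, assumed from `ec=362/362` in the OSD log in the description), and all names below are hypothetical simplifications.

```python
def build_history(pool, created, current, pools_by_epoch):
    """Toy history walk from `created` up to `current`.

    pools_by_epoch maps an epoch to the set of pool ids present in that map.
    Raises AssertionError when the previous map lacks the pool, loosely
    mirroring assert(lastmap->get_pools().count(pgid.pool())).
    Returns the number of map transitions that were checked.
    """
    checked = 0
    lastmap = pools_by_epoch[created]
    for epoch in range(created + 1, current + 1):
        if pool not in lastmap:
            raise AssertionError(f"pool {pool} missing from map e{epoch - 1}")
        lastmap = pools_by_epoch[epoch]
        checked += 1
    return checked


# Hypothetical cluster: pool 9 first appears at epoch 362; earlier maps lack it.
maps = {e: ({9} if e >= 362 else set()) for e in range(0, 365)}

# With a plausible create epoch the walk only visits maps that contain the pool.
print(build_history(9, 362, 364, maps))

# With create epoch 0, as in the message above, the first map has no pool 9.
try:
    build_history(9, 0, 364, maps)
except AssertionError as err:
    print("assert:", err)
```

Under these assumptions, the walk succeeds when it starts at the pool's creation epoch and fails immediately when it starts at epoch 0.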

#2 Updated by Sage Weil 3 months ago

  • Assignee deleted (Sage Weil)

#3 Updated by Kefu Chai 3 months ago

  • Assignee set to Kefu Chai

#4 Updated by Kefu Chai 3 months ago

https://github.com/ceph/ceph/pull/16677 is posted to help debug this issue.

#5 Updated by Sage Weil 3 months ago

  • Status changed from Verified to Need More Info

#6 Updated by Greg Farnum 2 months ago

  • Priority changed from Urgent to High

#7 Updated by Sage Weil 2 months ago

  • Duplicated by Bug #20474: osd/osd_types.cc: 3534: FAILED assert(lastmap->get_pools().count(pgid.pool())) added

#8 Updated by Kefu Chai 2 months ago

  • Duplicated by Bug #20987: mon/OSDMonitor.cc: 3284: FAILED assert(create != creating_pgs.pgs.end()) added

#9 Updated by Kefu Chai 2 months ago

  • Category set to Correctness/Safety
  • Status changed from Need More Info to Need Review
  • Backport set to luminous
  • Component(RADOS) Monitor added

#10 Updated by Kefu Chai about 2 months ago

  • Status changed from Need Review to Pending Backport

#11 Updated by Kefu Chai about 2 months ago

  • Copied to Backport #21076: luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool())) added

#12 Updated by Kefu Chai about 2 months ago

  • Copied to Backport #21090: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool())) added

#13 Updated by Kefu Chai about 2 months ago

  • Status changed from Pending Backport to Resolved

#14 Updated by Kefu Chai about 2 months ago

  • Status changed from Resolved to Need Review

/a//joshd-2017-08-25_00:03:46-rados-wip-dup-perf-distro-basic-smithi/1560728/ mon.c

#15 Updated by Sage Weil about 2 months ago

  • Status changed from Need Review to Resolved

#16 Updated by Joao Luis about 1 month ago

I may be wrong, but it looks like the commit fixing this is only present in current master. I was under the impression this affected luminous; does it not? If so, we need to backport it.

#17 Updated by Nathan Cutler about 1 month ago

  • Status changed from Resolved to Pending Backport

#18 Updated by Nathan Cutler about 1 month ago

Joao, I changed the status to "Pending Backport", but the PR also has the "needs-backport" label, which is perhaps enough to ensure that the cherry-pick gets done in time for v12.2.1.

#19 Updated by Joao Luis about 1 month ago

doh. I missed the needs-backport tag on the pr :(

#20 Updated by Kefu Chai about 1 month ago

#21 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to Resolved
