Project

General

Profile

Bug #21833

Multiple asserts caused by DNE pgs left behind after lots of OSD restarts

Added by Greg Farnum about 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
PG Split
Target version:
-
Start date:
10/18/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD

Description

   -17> 2017-10-17 00:33:47.342829 7fde3f784e80 10 read_log_and_missing done
   -16> 2017-10-17 00:33:47.343839 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown m=1] handle_loaded
   -15> 2017-10-17 00:33:47.343892 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] exit Initial 0.079676 0 0.000000
   -14> 2017-10-17 00:33:47.343911 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] enter Reset
   -13> 2017-10-17 00:33:47.343924 7fde3f784e80 20 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] set_last_peering_reset 440601
   -12> 2017-10-17 00:33:47.343934 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=44
0170'847999 lcod 0'0 unknown NOTIFY m=1] Clearing blocked outgoing recovery messages
   -11> 2017-10-17 00:33:47.343944 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=44
0170'847999 lcod 0'0 unknown NOTIFY m=1] Not blocking outgoing recovery messages
   -10> 2017-10-17 00:33:47.343969 7fde3f784e80 10 osd.137 440635 load_pgs loaded pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 
crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] log((438151'846486,440170'847999], crt=440170'847999)
    -9> 2017-10-17 00:33:47.343983 7fde3f784e80  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -8> 2017-10-17 00:33:47.358300 7fde3f784e80  0 osd.137 440635 load_pgs opened 220 pgs
    -7> 2017-10-17 00:33:47.358730 7fde3f784e80 10 osd.137 440635 10.522s2 needs 439159-0
    -6> 2017-10-17 00:33:47.358912 7fde3f784e80 10 osd.137 440635 10.5d3s1 needs 439159-0
    -5> 2017-10-17 00:33:47.359763 7fde3f784e80 10 osd.137 440635 10.5a5s3 needs 439159-0
    -4> 2017-10-17 00:33:47.360321 7fde3f784e80  1 osd.137 440635 build_past_intervals_parallel over 439159-439159
    -3> 2017-10-17 00:33:47.360329 7fde3f784e80 10 osd.137 440635 build_past_intervals_parallel epoch 439159
    -2> 2017-10-17 00:33:47.360343 7fde3f784e80 20 osd.137 0 get_map 439159 - loading and decoding 0x561954274700
    -1> 2017-10-17 00:33:47.392569 7fde3f784e80 10 osd.137 0 add_map_bl 439159 81522 bytes
     0> 2017-10-17 00:33:47.395993 7fde3f784e80 -1 /var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7fde3f784e80 time 2017-10-17 00:33:47.394122
/var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: 4180: FAILED assert(p.same_interval_since)

This assert is supposed to cope with PGs that have been imported (via the ceph-objectstore-tool — and the whole function was nuked after luminous). But the user says that didn't happen — instead, it follows them setting pg_num to 8192 from 2048. OSD log is available via ceph-post-file signature 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba.

crash-osd-46.txt.gz (87.2 KB) Paul Emmerich, 02/15/2018 04:06 PM


Related issues

Copied to RADOS - Backport #23160: luminous: Multiple asserts caused by DNE pgs left behind after lots of OSD restarts Resolved

History

#1 Updated by David Zafman about 1 year ago

In the context of the newly created PGs:

pg[10.5a5s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)
pg[10.5d3s1( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)
pg[10.522s2( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)

Building interval 439159-439159 is probably wrong.
-7> 2017-10-17 00:33:47.358730 7fde3f784e80 10 osd.137 440635 10.522s2 needs 439159-0
-6> 2017-10-17 00:33:47.358912 7fde3f784e80 10 osd.137 440635 10.5d3s1 needs 439159-0
-5> 2017-10-17 00:33:47.359763 7fde3f784e80 10 osd.137 440635 10.5a5s3 needs 439159-0
-4> 2017-10-17 00:33:47.360321 7fde3f784e80 1 osd.137 440635 build_past_intervals_parallel over 439159-439159
-3> 2017-10-17 00:33:47.360329 7fde3f784e80 10 osd.137 440635 build_past_intervals_parallel epoch 439159
-2> 2017-10-17 00:33:47.360343 7fde3f784e80 20 osd.137 0 get_map 439159 - loading and decoding 0x561954274700
-1> 2017-10-17 00:33:47.392569 7fde3f784e80 10 osd.137 0 add_map_bl 439159 81522 bytes

#2 Updated by Michael Schmid 12 months ago

I just wanted to report that I managed to produce what certainly seems to be the same error on Ubuntu's 12.2.0 release. Happened without using ceph-objectstore-tool to import PG or such. It was a result of running crash tests (such as just forcibly shutting down, unplugging drives during operation, restarting ceph components in random order... ~two cycles and I had this and another OSD with a rockdb that can't be recovered with ceph fsck).

/usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
/build/ceph-pKGC1D/ceph-12.2.0/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7f49e66fce40 time 2017-10-29 17:21:49.634188
/build/ceph-pKGC1D/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x563180071b55]
2: (OSD::build_past_intervals_parallel()+0x1a26) [0x56317fab5156]
3: (OSD::load_pgs()+0x53e) [0x56317fab569e]
4: (OSD::init()+0x2029) [0x56317facd3a9]
5: (main()+0x38a4) [0x56317f9e7514]
6: (__libc_start_main()+0xf1) [0x7f49e31c21c1]
7: (_start()+0x2a) [0x56317fa6f27a]

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3 is coming up clean (it's not the one with the broken rockdb), in case that matters.

#3 Updated by Jonathan Light 12 months ago

Hello,

I wanted to report that I seem to be running into this bug as well.
I had a very overloaded node that OOM'd many times and upon adding memory and rebooting the node, three OSDs crash with the same assert error.
This is happening in a cluster created with Luminous 12.2.0 on Ubuntu 16.04.

I am more than happy to provide debug logs or troubleshooting if needed.

#4 Updated by David Zafman 12 months ago

  • Status changed from New to In Progress
  • Assignee set to David Zafman

Tentative fix which still needs testing. Anyone seeing this can try applying this to their OSDs and report back.

https://github.com/ceph/ceph/pull/18673

#5 Updated by Michael Schmid 12 months ago

David Zafman wrote:

Tentative fix which still needs testing. Anyone seeing this can try applying this to their OSDs and report back.

I believe I managed to apply the patch to ubuntu's 12.2.1 bionic configuration & presumably because of this the error changed to "FAILED assert(info.history.same_interval_since != 0)":

starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
2017-11-04 11:09:31.828016 7fdb030a3e40 -1 osd.3 952 log_to_monitors {default=true}
2017-11-04 11:09:31.843163 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843175 7fdae6978700 -1 osd.3 pg_epoch: 949 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is
not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843227 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843242 7fdae6978700 -1 osd.3 pg_epoch: 950 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is
not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843310 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843316 7fdae6978700 -1 osd.3 pg_epoch: 951 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is
not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
/build/ceph-12.2.1/src/osd/PG.cc: In function 'void PG::start_peering_interval(OSDMapRef, const std::vector<int>&, int, const std::vector<int>&, int, ObjectStore::Transaction*)' thread 7fdae6978700 time 2017-11-04 11:09:31.843351
/build/ceph-12.2.1/src/osd/PG.cc: 5381: FAILED assert(info.history.same_interval_since != 0)

#6 Updated by Sage Weil 12 months ago

  • Status changed from In Progress to Resolved

#7 Updated by David Zafman 12 months ago

  • Status changed from Resolved to In Progress

The fix wasn't sufficient.

#8 Updated by Jonathan Light 11 months ago

I got the patch applied to my OSDs and I see the same thing that Michael Schmid reported. The original p.same_interval_since assert has been taken care of, but I'm now seeing info.history.same_interval_since != 0.

#9 Updated by David Zafman 9 months ago

  • Status changed from In Progress to Verified

#10 Updated by Nathan Cutler 9 months ago

@David I can only guess that this is not reproducible in master and that's why it requires a luminous-only fix. Could you explain that in your commit messages?

#11 Updated by Paul Emmerich 8 months ago

I'm also seeing this on 12.2.2. The crashing OSD has some bad PG which crashes it on startup. I first assumed the disk was simply dead and removed the first OSD that had this issue. However, the issue quickly migrated to another OSD and then another as soon as recovery started.

Only way to get the OSD to start again was to remove the offending PG via objectstore-tool.

2018-02-14 03:05:02.473228 7f92379c7700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PG.cc: In function 'void PG::start_peering_interval(OSDMapRef, const std::vector<int>&, int, const std::vector<int>&, int, ObjectStore::Transaction*)' thread 7f92379c7700 time 2018-02-14 03:05:02.468494
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PG.cc: 5381: FAILED assert(info.history.same_interval_since != 0)

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5625ba59a550]
 2: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x1517) [0x5625ba0e9877]
 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f1) [0x5625ba0e9f01]
 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x21c) [0x5625ba130f0c]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x5625ba10c3fb]
 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x48b) [0x5625ba0d63fb]
 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2cf) [0x5625ba01a7bf]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5625ba01b253]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x5625ba083827]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x5625ba5a113e]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x5625ba5a2020]
 12: (()+0x7e25) [0x7f924eca0e25]
 13: (clone()+0x6d) [0x7f924dd9434d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

debug 20/20 log attached, i've only skipped 350MB of "20 read_log_and_missing" lines that didn't seem relevant.

#12 Updated by Rafal Wadolowski 8 months ago

I can confirm this on 12.2.2. It makes data unavailable.

My output:

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5606007838e2]
 2: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x151e) [0x560600
266d0e]
 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x5606002672d9]
 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x166)
 [0x5606002b1296]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_b
ase const&)+0x69) [0x560600286ed9]
 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x496) [0x5
6060024cb26]
 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2e4) [0x5
60600192144]
 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1e4) [0x560600193074]
 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x560600205a17]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x56060078a7e9]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x56060078b8f0]
 12: (()+0x76ba) [0x7f0ea78346ba]
 13: (clone()+0x6d) [0x7f0ea68ab41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#13 Updated by David Zafman 8 months ago

I was working on this last week, but got distracted by other issues. I'm going to force this scenario and see about fixing this possibly by deleting the bad pg on osd start-up.

#14 Updated by David Zafman 8 months ago

  • Status changed from Verified to In Progress

#15 Updated by David Zafman 8 months ago

  • Subject changed from luminous: "FAILED assert(p.same_interval_since)", without importing PG? to "FAILED assert(p.same_interval_since)", without importing PG?
  • Status changed from In Progress to Need Test
  • Backport set to luminous

Not sure if these needs a Jewel backport

#17 Updated by David Zafman 8 months ago

  • Subject changed from "FAILED assert(p.same_interval_since)", without importing PG? to Multiple asserts caused by DNE pgs left behind after lots of OSD restarts

#18 Updated by David Zafman 8 months ago

  • Status changed from Need Test to Pending Backport

#19 Updated by Nathan Cutler 8 months ago

  • Copied to Backport #23160: luminous: Multiple asserts caused by DNE pgs left behind after lots of OSD restarts added

#20 Updated by Nathan Cutler 7 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF