Bug #21833

luminous: "FAILED assert(p.same_interval_since)", without importing PG?

Added by Greg Farnum about 2 months ago. Updated about 1 month ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
PG Split
Target version:
-
Start date:
10/18/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No
Component(RADOS):
OSD

Description

   -17> 2017-10-17 00:33:47.342829 7fde3f784e80 10 read_log_and_missing done
   -16> 2017-10-17 00:33:47.343839 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown m=1] handle_loaded
   -15> 2017-10-17 00:33:47.343892 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] exit Initial 0.079676 0 0.000000
   -14> 2017-10-17 00:33:47.343911 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] enter Reset
   -13> 2017-10-17 00:33:47.343924 7fde3f784e80 20 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] set_last_peering_reset 440601
   -12> 2017-10-17 00:33:47.343934 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] Clearing blocked outgoing recovery messages
   -11> 2017-10-17 00:33:47.343944 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] Not blocking outgoing recovery messages
   -10> 2017-10-17 00:33:47.343969 7fde3f784e80 10 osd.137 440635 load_pgs loaded pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] log((438151'846486,440170'847999], crt=440170'847999)
    -9> 2017-10-17 00:33:47.343983 7fde3f784e80  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -8> 2017-10-17 00:33:47.358300 7fde3f784e80  0 osd.137 440635 load_pgs opened 220 pgs
    -7> 2017-10-17 00:33:47.358730 7fde3f784e80 10 osd.137 440635 10.522s2 needs 439159-0
    -6> 2017-10-17 00:33:47.358912 7fde3f784e80 10 osd.137 440635 10.5d3s1 needs 439159-0
    -5> 2017-10-17 00:33:47.359763 7fde3f784e80 10 osd.137 440635 10.5a5s3 needs 439159-0
    -4> 2017-10-17 00:33:47.360321 7fde3f784e80  1 osd.137 440635 build_past_intervals_parallel over 439159-439159
    -3> 2017-10-17 00:33:47.360329 7fde3f784e80 10 osd.137 440635 build_past_intervals_parallel epoch 439159
    -2> 2017-10-17 00:33:47.360343 7fde3f784e80 20 osd.137 0 get_map 439159 - loading and decoding 0x561954274700
    -1> 2017-10-17 00:33:47.392569 7fde3f784e80 10 osd.137 0 add_map_bl 439159 81522 bytes
     0> 2017-10-17 00:33:47.395993 7fde3f784e80 -1 /var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7fde3f784e80 time 2017-10-17 00:33:47.394122
/var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: 4180: FAILED assert(p.same_interval_since)

This assert is supposed to cope with PGs that were imported via ceph-objectstore-tool (the whole function was removed after luminous). But the user says that didn't happen; instead, the crash followed their raising pg_num from 2048 to 8192. The OSD log is available via ceph-post-file, signature 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba.
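
For readers without the source handy: the assert sits at the end of OSD::build_past_intervals_parallel(), after the OSDMap walk that is supposed to fill in each PG's interval start. The snippet below is a deliberately simplified, standalone model of the failure mode, not the verbatim OSD.cc code; the pistate struct and the final loop are illustrative only. A PG whose same_interval_since was never set during the walk (still 0, as for a freshly created PG) trips the check.

    // Simplified model of the failing check (assumed shape, not upstream code).
    // Build with: g++ -std=c++17 -o repro repro.cc
    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    struct pistate {
      uint32_t start = 0;                // oldest epoch this PG still needs
      uint32_t same_interval_since = 0;  // filled during the map walk; 0 = never set
    };

    int main() {
      std::map<std::string, pistate> pis;
      pis["10.522s2"] = {439159, 0};     // mirrors "10.522s2 needs 439159-0" in the log above

      // ...the OSDMap replay would normally set same_interval_since here...

      for (auto& [pgid, p] : pis) {
        std::cout << pgid << " same_interval_since=" << p.same_interval_since << "\n";
        assert(p.same_interval_since);   // aborts, matching the crash above
      }
    }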

History

#1 Updated by David Zafman about 2 months ago

In the context of the newly created PGs:

pg[10.5a5s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)
pg[10.5d3s1( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)
pg[10.522s2( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)

Building interval 439159-439159 is probably wrong.
-7> 2017-10-17 00:33:47.358730 7fde3f784e80 10 osd.137 440635 10.522s2 needs 439159-0
-6> 2017-10-17 00:33:47.358912 7fde3f784e80 10 osd.137 440635 10.5d3s1 needs 439159-0
-5> 2017-10-17 00:33:47.359763 7fde3f784e80 10 osd.137 440635 10.5a5s3 needs 439159-0
-4> 2017-10-17 00:33:47.360321 7fde3f784e80 1 osd.137 440635 build_past_intervals_parallel over 439159-439159
-3> 2017-10-17 00:33:47.360329 7fde3f784e80 10 osd.137 440635 build_past_intervals_parallel epoch 439159
-2> 2017-10-17 00:33:47.360343 7fde3f784e80 20 osd.137 0 get_map 439159 - loading and decoding 0x561954274700
-1> 2017-10-17 00:33:47.392569 7fde3f784e80 10 osd.137 0 add_map_bl 439159 81522 bytes
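
A standalone sketch of the suspected derivation (an assumption based on the lines above, not the verbatim code): a DNE/empty PG has never completed a peering interval, so history.same_interval_since is still 0, which is exactly the 0 in "needs 439159-0".

    // Hypothetical model of why a brand-new PG reports "needs 439159-0".
    #include <cstdint>
    #include <iostream>

    struct pg_history {
      uint32_t epoch_created = 0;        // "ec=0/0" in the DNE lines above
      uint32_t same_interval_since = 0;  // never set for a PG that never peered
    };

    int main() {
      pg_history h;                          // models pg 10.522s2
      uint32_t start = 439159;               // oldest map this OSD must still cover
      uint32_t end = h.same_interval_since;  // 0, hence the bogus range end
      std::cout << "10.522s2 needs " << start << "-" << end << "\n";
    }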

#2 Updated by Michael Schmid about 2 months ago

I just wanted to report that I managed to reproduce what certainly seems to be the same error on Ubuntu's 12.2.0 release. It happened without using ceph-objectstore-tool to import a PG or anything similar; it was the result of running crash tests (forcibly shutting down, unplugging drives during operation, restarting Ceph components in random order). After roughly two cycles I had this, plus another OSD with a RocksDB that can't be recovered with fsck.

/usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
/build/ceph-pKGC1D/ceph-12.2.0/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7f49e66fce40 time 2017-10-29 17:21:49.634188
/build/ceph-pKGC1D/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x563180071b55]
2: (OSD::build_past_intervals_parallel()+0x1a26) [0x56317fab5156]
3: (OSD::load_pgs()+0x53e) [0x56317fab569e]
4: (OSD::init()+0x2029) [0x56317facd3a9]
5: (main()+0x38a4) [0x56317f9e7514]
6: (__libc_start_main()+0xf1) [0x7f49e31c21c1]
7: (_start()+0x2a) [0x56317fa6f27a]

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3 comes up clean (this is not the OSD with the broken RocksDB), in case that matters.

#3 Updated by Jonathan Light about 2 months ago

Hello,

I wanted to report that I seem to be running into this bug as well.
I had a very overloaded node that OOM'd many times and upon adding memory and rebooting the node, three OSDs crash with the same assert error.
This is happening in a cluster created with Luminous 12.2.0 on Ubuntu 16.04.

I am more than happy to provide debug logs or troubleshooting if needed.

#4 Updated by David Zafman about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to David Zafman

Tentative fix which still needs testing. Anyone seeing this can try applying this to their OSDs and report back.

https://github.com/ceph/ceph/pull/18673
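
The PR's diff isn't quoted in this thread, so purely as a hypothetical illustration of the kind of guard such a fix might involve (this is not the actual change in the PR): skip interval reconstruction for PGs that have never peered, rather than asserting that they have.

    // Hypothetical guard, for illustration only.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    struct pistate { uint32_t same_interval_since = 0; };

    int main() {
      std::map<std::string, pistate> pis = {
        {"10.522s2", {0}},       // newly created, never peered
        {"3.1ca7",   {440557}},  // normal PG with a real interval start
      };
      for (auto& [pgid, p] : pis) {
        if (!p.same_interval_since) {  // guard instead of assert(p.same_interval_since)
          std::cout << pgid << ": skipping, PG has never peered\n";
          continue;
        }
        std::cout << pgid << ": rebuilding past intervals\n";
      }
    }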

#5 Updated by Michael Schmid about 1 month ago

David Zafman wrote:

Tentative fix which still needs testing. Anyone seeing this can try applying this to their OSDs and report back.

I believe I managed to apply the patch to Ubuntu's 12.2.1 bionic configuration, and presumably because of this the error changed to "FAILED assert(info.history.same_interval_since != 0)":

starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
2017-11-04 11:09:31.828016 7fdb030a3e40 -1 osd.3 952 log_to_monitors {default=true}
2017-11-04 11:09:31.843163 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843175 7fdae6978700 -1 osd.3 pg_epoch: 949 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843227 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843242 7fdae6978700 -1 osd.3 pg_epoch: 950 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843310 7fdae6978700 -1 log_channel(cluster) log [ERR] : 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
2017-11-04 11:09:31.843316 7fdae6978700 -1 osd.3 pg_epoch: 951 pg[3.25( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [5] r=-1 lpr=948 pi=[843,943)/2 crt=0'0 unknown NOTIFY] 3.25 required past_interval bounds are empty [253,0) but past_intervals is not: ([843,942] intervals=([843,855] acting 1,2),([929,936] acting 1,5))
/build/ceph-12.2.1/src/osd/PG.cc: In function 'void PG::start_peering_interval(OSDMapRef, const std::vector<int>&, int, const std::vector<int>&, int, ObjectStore::Transaction*)' thread 7fdae6978700 time 2017-11-04 11:09:31.843351
/build/ceph-12.2.1/src/osd/PG.cc: 5381: FAILED assert(info.history.same_interval_since != 0)
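
For what it's worth, the new complaint is internally consistent: a half-open epoch range [253,0) is empty because its end (same_interval_since, still 0 for the DNE pg 3.25) is below its start. A minimal standalone model of that emptiness check (assumed semantics, not the verbatim PG code):

    // Assumed semantics of "required past_interval bounds are empty [253,0)".
    #include <cstdint>
    #include <iostream>

    static bool bounds_empty(uint32_t start, uint32_t end) {
      return end <= start;  // half-open [start, end) holds nothing if end <= start
    }

    int main() {
      uint32_t start = 253;  // from the log: "[253,0)"
      uint32_t end = 0;      // same_interval_since never set for 3.25
      std::cout << std::boolalpha << bounds_empty(start, end) << "\n";  // true
    }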

#6 Updated by Sage Weil about 1 month ago

  • Status changed from In Progress to Resolved

#7 Updated by David Zafman about 1 month ago

  • Status changed from Resolved to In Progress

The fix wasn't sufficient.

#8 Updated by Jonathan Light about 1 month ago

I got the patch applied to my OSDs and I see the same thing that Michael Schmid reported. The original p.same_interval_since assert has been taken care of, but now I'm hitting FAILED assert(info.history.same_interval_since != 0).
