Bug #21833

closed

Multiple asserts caused by DNE pgs left behind after lots of OSD restarts

Added by Greg Farnum over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: High
Assignee: David Zafman
Category: PG Split
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -17> 2017-10-17 00:33:47.342829 7fde3f784e80 10 read_log_and_missing done
   -16> 2017-10-17 00:33:47.343839 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown m=1] handle_loaded
   -15> 2017-10-17 00:33:47.343892 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] exit Initial 0.079676 0 0.000000
   -14> 2017-10-17 00:33:47.343911 7fde3f784e80  5 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] enter Reset
   -13> 2017-10-17 00:33:47.343924 7fde3f784e80 20 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=0 pi=[438436,440557)/1 crt=440170'
847999 lcod 0'0 unknown NOTIFY m=1] set_last_peering_reset 440601
   -12> 2017-10-17 00:33:47.343934 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=44
0170'847999 lcod 0'0 unknown NOTIFY m=1] Clearing blocked outgoing recovery messages
   -11> 2017-10-17 00:33:47.343944 7fde3f784e80 10 osd.137 pg_epoch: 440601 pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 crt=44
0170'847999 lcod 0'0 unknown NOTIFY m=1] Not blocking outgoing recovery messages
   -10> 2017-10-17 00:33:47.343969 7fde3f784e80 10 osd.137 440635 load_pgs loaded pg[3.1ca7( v 440170'847999 lc 440078'847998 (438151'846486,440170'847999] local-lis/les=440557/440558 n=1653 ec=97501/97501 lis/c 440557/438436 les/c/f 440558/438437/0 440557/440557/431789) [216,137] r=1 lpr=440601 pi=[438436,440557)/1 
crt=440170'847999 lcod 0'0 unknown NOTIFY m=1] log((438151'846486,440170'847999], crt=440170'847999)
    -9> 2017-10-17 00:33:47.343983 7fde3f784e80  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -8> 2017-10-17 00:33:47.358300 7fde3f784e80  0 osd.137 440635 load_pgs opened 220 pgs
    -7> 2017-10-17 00:33:47.358730 7fde3f784e80 10 osd.137 440635 10.522s2 needs 439159-0
    -6> 2017-10-17 00:33:47.358912 7fde3f784e80 10 osd.137 440635 10.5d3s1 needs 439159-0
    -5> 2017-10-17 00:33:47.359763 7fde3f784e80 10 osd.137 440635 10.5a5s3 needs 439159-0
    -4> 2017-10-17 00:33:47.360321 7fde3f784e80  1 osd.137 440635 build_past_intervals_parallel over 439159-439159
    -3> 2017-10-17 00:33:47.360329 7fde3f784e80 10 osd.137 440635 build_past_intervals_parallel epoch 439159
    -2> 2017-10-17 00:33:47.360343 7fde3f784e80 20 osd.137 0 get_map 439159 - loading and decoding 0x561954274700
    -1> 2017-10-17 00:33:47.392569 7fde3f784e80 10 osd.137 0 add_map_bl 439159 81522 bytes
     0> 2017-10-17 00:33:47.395993 7fde3f784e80 -1 /var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7fde3f784e80 time 2017-10-17 00:33:47.394122
/var/tmp/portage/sys-cluster/ceph-12.2.1-r1/work/ceph-12.2.1/src/osd/OSD.cc: 4180: FAILED assert(p.same_interval_since)
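
For triage, the "needs" lines and the failed assert can be pulled out of a dump like the one above with a short script. This is only a sketch that assumes the log layout shown in this report; `summarize_osd_log`, the regexes, and the field names are illustrative and not part of Ceph, and the meaning of the two numbers after "needs" is not stated in the report, so they are kept as-is.

```python
import re

# Matches e.g. "osd.137 440635 10.522s2 needs 439159-0" (layout assumed from this report).
NEEDS_RE = re.compile(r"osd\.(\d+) (\d+) (\S+) needs (\d+)-(\d+)")
# Matches e.g. ".../src/osd/OSD.cc: 4180: FAILED assert(p.same_interval_since)".
ASSERT_RE = re.compile(r"(\S+): (\d+): FAILED assert\((.*)\)")


def summarize_osd_log(text):
    """Return (needs, failed): the PGs named on 'needs' lines and the last failed assert."""
    needs = []
    failed = None
    for line in text.splitlines():
        m = NEEDS_RE.search(line)
        if m:
            needs.append({
                "osd": int(m.group(1)),
                "osd_epoch": int(m.group(2)),
                "pgid": m.group(3),
                # The two numbers are reported verbatim; their meaning is not interpreted here.
                "range_first": int(m.group(4)),
                "range_second": int(m.group(5)),
            })
            continue
        m = ASSERT_RE.search(line)
        if m:
            failed = {"file": m.group(1), "line": int(m.group(2)), "expr": m.group(3)}
    return needs, failed
```

Run over the trace in this report, this would list 10.522s2, 10.5d3s1 and 10.5a5s3 as the PGs named just before the assert in OSD.cc.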

This assert is supposed to cope with PGs that have been imported via ceph-objectstore-tool (and the whole function was nuked after luminous). But the user says that didn't happen; instead, the crash followed them raising pg_num from 2048 to 8192. The OSD log is available via ceph-post-file signature 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba.
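
To give a sense of how many PGs a pg_num change from 2048 to 8192 creates, here is a rough sketch of the split arithmetic. It assumes both pg_num values are powers of two, in which case a PG seed is the object hash masked with pg_num - 1, so each old PG maps to 8192 / 2048 = 4 PGs afterwards. The helper names are hypothetical, and pool id 3 below is used only for illustration; the report does not say which pool was resized.

```python
def children_after_split(parent_seed, old_pg_num, new_pg_num):
    """List the PG seeds that cover parent_seed's objects after a split.

    Assumes power-of-two pg_num values; the parent's own seed is included.
    """
    for n in (old_pg_num, new_pg_num):
        if n <= 0 or n & (n - 1):
            raise ValueError("this sketch only handles power-of-two pg_num")
    if new_pg_num < old_pg_num or not 0 <= parent_seed < old_pg_num:
        raise ValueError("expected 0 <= parent_seed < old_pg_num <= new_pg_num")
    # Seeds t below new_pg_num with t & (old_pg_num - 1) == parent_seed.
    return list(range(parent_seed, new_pg_num, old_pg_num))


def format_pgid(pool, seed):
    """Render a PG id as 'pool.hexseed', the form used in the log above."""
    return f"{pool}.{seed:x}"


if __name__ == "__main__":
    for seed in children_after_split(0x4A7, 2048, 8192):
        print(format_pgid(3, seed))
```

Under these assumptions, a seed such as 0x1ca7 (7335) can only exist once pg_num exceeds 2048.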


Files

crash-osd-46.txt.gz (87.2 KB), Paul Emmerich, 02/15/2018 04:06 PM

Related issues 1 (0 open, 1 closed)

Copied to RADOS - Backport #23160: luminous: Multiple asserts caused by DNE pgs left behind after lots of OSD restarts (Resolved, Prashant D)