Project

General

Profile

Bug #43048

nautilus: upgrade/mimic-x/stress-split: failed to recover before timeout expired

Added by Neha Ojha 4 months ago. Updated about 1 month ago.

Status:
New
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/mimic-x
Component(RADOS):
OSD
Pull request ID:
Crash signature:

Description

2019-11-27T08:51:03.751 INFO:tasks.ceph.osd.1.smithi175.stderr:2019-11-27 08:51:03.642 7fe6cd392700 -1 osd.1 4195 get_health_metrics reporting 10 slow ops, oldest is osd_pg_create(e3986 36.8:2090 36.e:2090 36.f:2090 36.19:2090 36.1b:2090 36.21:2090 36.25:2090 36.29:2090 36.30:2090 36.32:2090 36.3b:2090 37.9:2092 37.10:2092 37.21:2092 37.28:2092 37.29:2092 37.2f:2092 37.38:2092 37.39:2092 38.3:2094 40.1:2099 43.0:2139 43.5:2139 43.f:2139)

/a/yuriw-2019-11-27_03:40:31-upgrade:mimic-x-nautilus-distro-basic-smithi/4546464/


Related issues

Related to RADOS - Bug #41065: new osd added to cluster upgraded from 13 to 14 will down after some days Closed 08/05/2019

History

#1 Updated by Neha Ojha 4 months ago

  • Related to Bug #41065: new osd added to cluster upgraded from 13 to 14 will down after some days added

#3 Updated by Patrick Donnelly 4 months ago

  • Status changed from 12 to New

#4 Updated by Neha Ojha 4 months ago

/a/yuriw-2019-12-06_21:30:44-upgrade:mimic-x-nautilus-distro-basic-smithi/4576681

#5 Updated by Neha Ojha 4 months ago

  • Assignee set to Neha Ojha

#6 Updated by Sage Weil 2 months ago

  • Target version set to v15.0.0

#7 Updated by David Zafman 2 months ago

The master branch passed, but my nautilus run hit the same issue:

http://pulpito.ceph.com/dzafman-2020-01-27_21:51:30-upgrade:mimic-x-master-distro-basic-smithi/. no slow ops master
http://pulpito.ceph.com/dzafman-2020-01-27_22:00:09-upgrade:mimic-x-master-distro-basic-smithi/ job 4712686 slow ops NAUTILUS

The second run identifier should have been mimic-x-naultius as the specified sha1 2ab5460062b6732d368caed445f975e4b7ef2331 is on the nautilus branch

#8 Updated by David Zafman 2 months ago

I'm seeing a lot of this in a sample of log segments from osd.6 which is reporting the slow ops. The log for osd.6 is massive, so I'm looking at split sections of the log.

There are many osdmap reads sequentially through the map epochs.

2020-01-28T01:40:34.808 INFO:teuthology.orchestra.run.smithi073.stdout:                    "message": "34 slow ops, oldest one blocked for 88 sec, osd.6 has slow ops" 
2020-01-28 01:36:21.077 7f91bb29b700 10 bluestore(/var/lib/ceph/osd/ceph-6) read meta #-1:18412850:::osdmap.2536:0# 0x0~1c5b = 7259
2020-01-28 01:36:21.077 7f91bb29b700 10 osd.6 3378 add_map_bl 2536 7259 bytes
2020-01-28 01:36:21.077 7f91bb29b700 20 osd.6 3378 get_map 2537 - loading and decoding 0x555f2efd0800
2020-01-28 01:36:21.077 7f91bb29b700 15 bluestore(/var/lib/ceph/osd/ceph-6) read meta #-1:15c12850:::osdmap.2537:0# 0x0~0
2020-01-28 01:36:21.077 7f91bb29b700 20 bluestore(/var/lib/ceph/osd/ceph-6) _do_read 0x0~1c87 size 0x1c87 (7303)
2020-01-28 01:36:21.077 7f91bb29b700 20 bluestore(/var/lib/ceph/osd/ceph-6) _do_read will do buffered read
2020-01-28 01:36:21.077 7f91bb29b700 20 bluestore(/var/lib/ceph/osd/ceph-6) _do_read  blob Blob(0x555f3dacb880 blob([0x50fc000~4000] csum+has_unused crc32c/0x1000 unused=0xff00) use_tracker(0x4000 0x1c87) SharedBlob(0x555f3c5ffe20 sbid 0x0)) need 0x0~1c87 cache has 0x[0~1c87]

#9 Updated by Neha Ojha 2 months ago

  • Subject changed from upgrade:mimic-x-nautilus: slow ops due to stuck osd_pg_create to nautilus: upgrade:mimic-x-nautilus: failed to recover before timeout expired
  • Priority changed from Urgent to High

Only happens when upgrading from mimic to nautilus, see https://tracker.ceph.com/issues/43048#note-7.

#10 Updated by Neha Ojha 2 months ago

Logs: /a/dzafman-2020-01-27_22:00:09-upgrade:mimic-x-master-distro-basic-smithi/4712686

There is more than one bug attributing to this failure, detailed explanation below:

Let's pick PGs 43.29 and 43.2b - according to the pg dump at the end of the test, they aren't active+clean, we'll see what's going on in osd.3's(and osd.4) log

43.29 9 0 0 9 0 17418224 0 0 62 62 activating+remapped 2020-01-28 02:00:23.471065 3819'873 4023:4906 [3,1] 3 [3,7] 3 2345'133 2020-01-28 01:13:39.557480 0'0 2020-01-28 01:11:23.265583 9

43.2b 0 0 0 0 0 0 0 0 54 54 active+remapped+backfilling 2020-01-28 02:02:50.963735 2337'154 4098:2696 [3,4] 3 [4,7] 4 2337'154 2020-01-28 01:15:14.374660 0'0 2020-01-28 01:11:23.265583 10

43.29 was in activating+remapped

2020-01-28 02:00:49.053 7fd676084700 10 osd.3 pg_epoch: 4023 pg[43.29( v 3819'873 (3566'811,3819'873] local-lis/les=3977/4014 n=9 ec=2339/2212 lis/c 3834/3086 les/c/f 3847/3091/0 3889/3977/3922) [3,1]/[3,7] backfill=[1] r=0 lpr=3977 pi=[3086,3977)/3 crt=3819'873 lcod 3819'871 mlcod 0'0 activating+remapped mbc={} trimq=[344~1,354~1,35e~1,391~1,395~2,399~1,3a4~2] ps=[373~1,37c~1,37e~1,38a~1,394~1]] state<Started/Primary/Active>: Active advmap

goes to active+remapped

2020-01-28 02:01:39.212 7fd676084700 10 osd.3 pg_epoch: 4060 pg[43.29( v 3819'873 (3566'811,3819'873] local-lis/les=3977/4014 n=9 ec=2339/2212 lis/c 3977/3086 les/c/f 4014/3091/0 3889/3977/3922) [3,1]/[3,7] backfill=[1] r=0 lpr=3977 pi=[3086,3977)/3 crt=3819'873 lcod 3819'871 mlcod 0'0 active+remapped mbc={} trimq=[344~1,354~1,35e~1,391~1,395~2,399~1,3a4~2] ps=[373~1,37c~1,37e~1,38a~1,394~1]] share_pg_info

and then to active+clean

2020-01-28 02:11:21.883 7fd67207c700 10 osd.3 pg_epoch: 4272 pg[43.29( v 4267'888 (3566'811,4267'888] local-lis/les=4268/4272 n=9 ec=2339/2212 lis/c 4268/4268 les/c/f 4272/4272/0 3889/4268/3922) [3,1] r=0 lpr=4268 crt=4267'888 lcod 4267'884 mlcod 0'0 active+clean trimq=[391~1,395~2,399~1,3a4~2,3a8~1,3ac~1] ps=[344~1,354~1,35e~1,373~1,37c~1,37e~1,380~1,38a~1,394~1]] kick_snap_trim: clean and snaps to trim, kicking

Same goes for 43.2b

2020-01-28 02:03:02.205 7f00fde27700 10 osd.4 pg_epoch: 4098 pg[43.2b( v 2337'154 (2290'100,2337'154] local-lis/les=3950/4029 n=0 ec=2339/2212 lis/c 3950/3306 les/c/f 4029/3307/0 3901/3950/3950) [3,4]/[4,7] backfill=[3] r=0 lpr=3950 pi=[3306,3950)/2 rops=1 crt=2337'154 lcod 0'0 mlcod 0'0 active+remapped+backfilling mbc={} trimq=[344~1,354~1,35e~1,391~1,395~1,399~1,3a4~2,3a8~1,3ac~1] ps=[373~1,37c~1,37e~1,38a~1,394~1,396~1]] state<Started/Primary/Active>: Active advmap

also becomes active+clean

2020-01-28 02:10:24.596 7fd67107a700 10 osd.3 pg_epoch: 4251 pg[43.2b( v 2337'154 (2290'100,2337'154] local-lis/les=4101/4245 n=0 ec=2339/2212 lis/c 4101/4101 les/c/f 4245/4251/0 3901/4101/4101) [3,4] r=0 lpr=4101 crt=2337'154 lcod 0'0 mlcod 0'0 active+clean trimq=[344~1,354~1,35e~1,380~1,391~1,395~1,399~1,3a4~2,3a8~1,3ac~1] ps=[373~1,37c~1,37e~1,38a~1,394~1,396~1]] kick_snap_trim: clean and snaps to trim, kicking

The mgr log indicates that the last time it received pg_stats from osd.3 was around the same time the test noted no recovery progress for these PGs.

2020-01-28 02:01:08.185 7fdb6eb82700  1 -- [v2:172.21.15.73:6832/14625,v1:172.21.15.73:6833/14625] <== osd.3 v2:172.21.15.73:6816/230476 361 ==== pg_stats(77 pgs tid 0 v 0) v2 ==== 53999+0+0 (crc 0 0 0) 0x55a7c5e06300 con 0x55a7c56ae000

This is what makes me think that the PGs were making recovery progress but the mgr just stopped getting updates.

On the other hand if we look at PG 1.83, which is also stuck in the following state

1.83 0 0 1 0 0 0 0 0 17 17 active+recovery_wait+degraded 2020-01-28 02:04:00.020505 1723'17 4950:137 [6,11] 6 [6,11] 6 1723'17 2020-01-28 01:02:43.802971 0'0 2020-01-28 00:32:33.316935 0

osd.6's logs tell us that it has been stuck in the same state just waiting for local reservation?

2020-01-28 02:04:00.016 7f91ad27f700  5 osd.6 pg_epoch: 4030 pg[1.83( v 1723'17 (0'0,1723'17] local-lis/les=3898/3950 n=0 ec=1780/18 lis/c 3898/3306 les/c/f 3950/3330/0 3898/3898/3856) [6,11] r=0 lpr=3898 pi=[3306,3898)/3 crt=1723'17 mlcod 0'0 active+degraded mbc={}] enter Started/Primary/Active/WaitLocalRecoveryReserved
2020-01-28 05:47:20.750 7f91ad27f700 10 osd.6 5940 do_recovery started 0/1 on pg[1.83( v 1723'17 (0'0,1723'17] local-lis/les=3898/3950 n=0 ec=1780/18 lis/c 3898/3306 les/c/f 3950/3330/0 3898/3898/3856) [6,11] r=0 lpr=3898 pi=[3306,3898)/3 crt=1723'17 mlcod 0'0 active+recovery_wait+degraded mbc={}]

#11 Updated by Neha Ojha 2 months ago

From mgr.x's log after the last time pg_stats are received we see

2020-01-28 02:01:12.401 7fdb6eb82700 10 mgr.server ms_handle_reset unregistering osd.3  session 0x55a7c85b5f80 con 0x55a7c56ae000

and then osd.3 sends mgropen

2020-01-28 02:01:14.297 7fdb6eb82700  1 -- [v2:172.21.15.73:6832/14625,v1:172.21.15.73:6833/14625] <== osd.3 v2:172.21.15.73:6816/230476 1 ==== mgropen(osd.3) v3 ==== 62022+0+0 (crc 0 0 0) 0x55a7c827b300 con 0x55a7c5d46880

mgr sends mgrconfigure

2020-01-28 02:01:14.297 7fdb6eb82700  1 -- [v2:172.21.15.73:6832/14625,v1:172.21.15.73:6833/14625] --> [v2:172.21.15.73:6816/230476,v1:172.21.15.73:6817/230476] -- mgrconfigure(period=5, threshold=5) v3 -- 0x55a7c3af9a20 con 0x55a7c5d46880 

this mgrconfigure is not received on osd.3, the last one it received was

2020-01-28 01:46:12.398 7fd683ec2700  1 -- [v2:172.21.15.73:6816/230476,v1:172.21.15.73:6817/230476] <== mgr.64133 v2:172.21.15.73:6832/14625 1 ==== mgrconfigure(period=5, threshold=5) v3 ==== 12+0+0 (crc 0 0 0) 0x55ad948b9a20 con 0x55ad973f6880

Will try to reproduce this with better mgrc and ms logging to see what caused osd.3 to not receive/process the mgrconfigure.

#12 Updated by David Zafman 2 months ago

For all log entries of all OSDs at 2020-01-28 03:18 with pg[ information and osd primary these are the log lines that aren't active+clean. Not every pg necessarily logged something during that 60 second period coinciding with the timeout.

2020-01-28 03:18:04.865 7f937aa90700 10 osd.1 pg_epoch: 5475 pg[43.38( v 5382'1059 (3505'961,5382'1059] local-lis/les=3898/3943 n=12 ec=2369/2212 lis/c 3898/3166 les/c/f 3943/3169/0 3898/3898/3837) [5,11]/[1,7] backfill=[5,11] r=0 lpr=3898 pi=[3166,3898)/2 crt=5382'1059 lcod 5382'1058 mlcod 5382'1058 active+remapped+backfill_wait mbc={} trimq=31 ps=[373~1,37c~1,37e~1,38a~1,394~1,396~1]] handle_advance_map [5,11]/[1,7] -- 5/1

2020-01-28 03:18:04.869 7f937a28f700 10 osd.1 pg_epoch: 5475 pg[43.41( v 2368'111 (2334'99,2368'111] local-lis/les=3922/3977 n=0 ec=2369/2212 lis/c 3922/3131 les/c/f 3977/3204/0 3898/3922/2732) [1,11]/[1,6] backfill=[11] r=0 lpr=3922 pi=[3131,3922)/3 crt=2368'111 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={} trimq=31 ps=[373~1,37c~1,37e~1,38a~1,394~1,396~1]] handle_advance_map [1,11]/[1,6] -- 1/1

2020-01-28 03:18:02.673 7f91ada80700 10 osd.6 pg_epoch: 4990 pg[36.2a( v 2206'2 lc 2206'1 (0'0,2206'2] local-lis/les=3889/3950 n=1 ec=2159/2159 lis/c 3889/3086 les/c/f 3950/3093/0 3889/3889/3889) [6,2] r=0 lpr=3900 pi=[3086,3889)/1 crt=2206'2 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=1}}] do_peering_event: epoch_sent: 4990 epoch_requested: 4990 NullEvt +create_info

2020-01-28 03:18:04.549 7f91b1a88700 10 osd.6 pg_epoch: 4990 pg[37.3a( v 2206'3 (0'0,2206'3] local-lis/les=3898/4030 n=1 ec=2161/2161 lis/c 3898/3131 les/c/f 4030/3204/0 3898/3898/3889) [6,11] r=0 lpr=3898 pi=[3131,3898)/3 crt=2206'3 mlcod 0'0 active+recovery_wait+degraded mbc={255={(1+1)=1}}] do_peering_event: epoch_sent: 4990 epoch_requested: 4990 NullEvt +create_info

2020-01-28 03:18:05.657 7f91ad27f700 10 osd.6 pg_epoch: 4990 pg[40.3( v 2207'13 (0'0,2207'13] local-lis/les=4004/4110 n=2 ec=2169/2169 lis/c 4004/3458 les/c/f 4110/3568/0 3898/4004/3856) [6,11] r=0 lpr=4004 pi=[3458,4004)/3 crt=2207'13 mlcod 0'0 active+recovery_wait+degraded mbc={255={(1+1)=2}}] do_peering_event: epoch_sent: 4990 epoch_requested: 4990 NullEvt +create_info

2020-01-28 03:18:01.025 7f91aea82700 10 osd.6 pg_epoch: 4990 pg[43.8( v 2337'124 (2321'100,2337'124] local-lis/les=4030/4270 n=0 ec=2212/2212 lis/c 4030/3086 les/c/f 4270/3124/0 3889/4030/3889) [6,10]/[6,7] backfill=[10] r=0 lpr=4030 pi=[3086,4030)/3 crt=2337'124 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={} trimq=23] do_peering_event: epoch_sent: 4990 epoch_requested: 4990 NullEvt +create_info

2020-01-28 03:18:01.221 7f91ad27f700 10 osd.6 pg_epoch: 4990 pg[43.b( v 4390'479 (2916'408,4390'479] local-lis/les=3906/3977 n=4 ec=2212/2212 lis/c 3906/3273 les/c/f 3977/3286/0 3856/3906/3856) [6,9]/[6,10] backfill=[9] r=0 lpr=3906 pi=[3273,3906)/3 crt=4390'479 lcod 4390'478 mlcod 4390'478 active+remapped+backfill_wait mbc={} trimq=23] do_peering_event: epoch_sent: 4990 epoch_requested: 4990 NullEvt +create_info

#13 Updated by Neha Ojha 2 months ago

David Zafman wrote:

For all log entries of all OSDs at 2020-01-28 03:18 with pg[ information and osd primary these are the log lines that aren't active+clean. Not every pg necessarily logged something during that 60 second period coinciding with the timeout.

[...]

/a/nojha-2020-02-06_18:32:51-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4738800/ seems to have reproduced this issue at least, still no luck reproducing the mgr behavior.

#14 Updated by Sage Weil about 2 months ago

/a/sage-2020-02-09_21:18:03-rados-wip-sage2-testing-2020-02-09-1152-distro-basic-smithi/4749175

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE                         STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN 
4.0           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.837211     0'0  468:426  [10,7]         10  [10,7]             10        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.7           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.556598     0'0  468:468   [8,3]          8   [8,3]              8        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.5           0                  0        0         0       0     0           0          0   0        0                       peering 2020-02-09 22:58:21.756860     0'0  468:442   [8,6]          8   [8,6]              8        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.6           2                  0        2         0       0     0           0          0   4        4 active+recovery_wait+degraded 2020-02-09 22:58:28.782242    23'4  469:445  [0,10]          0  [0,10]              0        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.1           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.838926     0'0  469:427   [5,6]          5   [5,6]              5        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.6           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.554204     0'0  468:469  [11,4]         11  [11,4]             11        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.4           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.550142     0'0  468:466   [3,0]          3   [3,0]              3        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.7           4                  0        0         0       0   566           0          0   7        7                  active+clean 2020-02-09 22:58:24.551363    23'7  468:479   [6,0]          6   [6,0]              6        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.6           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.842770     0'0  469:427   [0,8]          0   [0,8]              0        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.1           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.552067     0'0  468:469  [11,4]         11  [11,4]             11        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.3           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:12.578741     0'0  468:472  [11,6]         11  [11,6]             11        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.0           4                  0        0         0       0     0           0          0   8        8                  active+clean 2020-02-09 22:58:25.549023    23'8  468:481   [7,1]          7   [7,1]              7        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.7           0                  0        0         0       0     0           0          0   0        0              creating+peering 2020-02-09 22:58:37.718874     0'0  468:419  [10,0]         10  [10,0]             10        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.0           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:22.754332     0'0  469:466   [1,6]          1   [1,6]              1        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.2           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:28.778668     0'0  468:457   [7,4]          7   [7,4]              7        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.1           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:28.784252     0'0  468:435  [11,8]         11  [11,8]             11        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.5           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.839510     0'0  468:426   [6,1]          6   [6,1]              6        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.2           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:22.754903     0'0  468:477  [10,6]         10  [10,6]             10        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.0           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:25.550528     0'0  468:473   [7,0]          7   [7,0]              7        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.3           3                  0        0         0       0    34           0          0   5        5                  active+clean 2020-02-09 22:58:11.575755    23'5  469:455  [5,11]          5  [5,11]              5        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.4           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.839133     0'0  468:425  [8,11]          8  [8,11]              8        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.3           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.554256     0'0  469:454  [4,11]          4  [4,11]              4        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.1           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.555860     0'0  468:453   [3,8]          3   [3,8]              3        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.2           0                  0        0         0       0     0           0          0   1        1                  active+clean 2020-02-09 22:58:11.577282    23'1  469:455  [5,10]          5  [5,10]              5        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.3           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:37.842175     0'0  469:427   [0,7]          0   [0,7]              0        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.4           0                  0        0         0       0     0           0          0   0        0                       peering 2020-02-09 22:58:21.753906     0'0  468:470  [10,5]         10  [10,5]             10        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.6           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:09.550682     0'0  468:466   [3,0]          3   [3,0]              3        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.5           4                  0        0         0       0   160           0          0   6        6                  active+clean 2020-02-09 22:58:09.552964    23'6  468:457   [3,8]          3   [3,8]              3        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 
4.2           1                  0        0         0       0     0           0          0   1        1                  active+clean 2020-02-09 22:58:37.836575    51'1  469:429   [1,5]          1   [1,5]              1        0'0 2020-02-09 22:58:36.819239             0'0 2020-02-09 22:58:36.819239             0 
3.5           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:11.574779     0'0  469:452   [5,8]          5   [5,8]              5        0'0 2020-02-09 22:54:19.015669             0'0 2020-02-09 22:54:19.015669             0 
1.7           0                  0        0         0       0     0           0          0   0        0                  active+clean 2020-02-09 22:58:12.610192     0'0  468:466   [6,3]          6   [6,3]              6        0'0 2020-02-09 22:54:15.998342             0'0 2020-02-09 22:54:15.998342             0 
2.4           5                  0        0         0       0  1526           0          0   8        8                  active+clean 2020-02-09 22:58:28.799813    23'8  468:484 [11,10]         11 [11,10]             11        0'0 2020-02-09 22:54:18.010183             0'0 2020-02-09 22:54:18.010183             0 

#15 Updated by Neha Ojha about 2 months ago

In the failure that sage observed on master, I looked at pg4.7, which is stuck in creating+peering.

osd.10(mimic) sends a pg_query to osd.0(master)

2020-02-09 22:58:37.713 7fb115552700 10 osd.10 pg_epoch: 50 pg[4.7( empty local-lis/les=0/0 n=0 ec=50/50 lis/c 0/0 les/c/f 0/0/0 50/50/50) [10,0] r=0 lpr=50 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
2020-02-09 22:58:37.713 7fb115552700  7 osd.10 50 do_queries querying osd.0 on 1 PGs
2020-02-09 22:58:37.713 7fb115552700  1 -- 172.21.15.14:6814/10983 --> 172.21.15.104:6801/16080 -- pg_query(4.7 epoch 50) v4 -- 0x55672c0e4640 con 0

osd.0 responds with the following

2020-02-09T22:58:37.714+0000 7f0306c71700 10 osd.0 50 handle_pg_query_nopg 4.7
2020-02-09T22:58:37.714+0000 7f0306c71700 10 osd.0 50  pg 4.7 dne
2020-02-09T22:58:37.714+0000 7f0306c71700  1 -- v1:172.21.15.104:6801/16080 --> v1:172.21.15.14:6814/10983 -- pg_notify((query:50 sent:50 4.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0
) ([0,0] all_participants= intervals=)) epoch 50) v7 -- 0x557ca77d15a0 con 0x557ca99eb000

osd.10 never received this(at least I can't see any evidence in the existing logs). Due to this, this pg never exits Peering. Still need to look at what's wrong with the other PGS, but the cause for this pg to be stuck, does not bear any resemblance with the other causes we discovered in mimic->nautilus upgrades.

#16 Updated by Neha Ojha about 2 months ago

pg3.4, which is stuck in "peering" shows similar behavior as https://tracker.ceph.com/issues/43048#note-15

osd.10 sends pg_query to osd.2 and osd.5

2020-02-09 22:58:21.749 7fb112d4d700 10 osd.10 pg_epoch: 41 pg[3.4( empty local-lis/les=30/31 n=0 ec=21/21 lis/c 30/30 les/c/f 31/31/0 41/41/21) [10,5] r=0 lpr=41 pi=[30,41)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>:  querying info from osd.2
2020-02-09 22:58:21.749 7fb112d4d700 10 osd.10 pg_epoch: 41 pg[3.4( empty local-lis/les=30/31 n=0 ec=21/21 lis/c 30/30 les/c/f 31/31/0 41/41/21) [10,5] r=0 lpr=41 pi=[30,41)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>:  querying info from osd.5

2020-02-09 22:58:21.749 7fb112d4d700  7 osd.10 41 do_queries querying osd.2 on 1 PGs
2020-02-09 22:58:21.749 7fb112d4d700  1 -- 172.21.15.14:6814/10983 --> 172.21.15.104:6805/16288 -- pg_query(3.4 epoch 41) v4 -- 0x55672c2dd180 con 0

2020-02-09 22:58:21.749 7fb112d4d700  7 osd.10 41 do_queries querying osd.5 on 1 PGs
2020-02-09 22:58:21.749 7fb112d4d700  1 -- 172.21.15.14:6814/10983 --> 172.21.15.88:6805/14634 -- pg_query(3.4 epoch 41) v4 -- 0x55672c2dc640 con 0

osd.2

2020-02-09T22:58:21.746+0000 7f5635787700  1 -- v1:172.21.15.104:6805/16288 <== osd.10 v1:172.21.15.14:6814/10983 11 ==== pg_query(3.4 epoch 41) v4 ==== 162+0+0 (unknown 1654604287 0 0) 0x564352e91900 con 0x564370949c00
2020-02-09T22:58:21.746+0000 7f5613ce2700  1 -- v1:172.21.15.104:6805/16288 --> v1:172.21.15.14:6814/10983 -- pg_notify((query:41 sent:41 3.4( empty local-lis/les=30/31 n=0 ec=21/21 lis/c=30/30 les/c/f=31/31/0 sis=41) ([30,40] all_participants=2,10 intervals=([30,40] acting 2,10))) epoch 41) v7 -- 0x5643706282c0 con 0x564370949c00

osd.5

2020-02-09T22:58:21.746+0000 7f5635787700  1 -- v1:172.21.15.104:6805/16288 <== osd.10 v1:172.21.15.14:6814/10983 11 ==== pg_query(3.4 epoch 41) v4 ==== 162+0+0 (unknown 1654604287 0 0) 0x564352e91900 con 0x5643
70949c00
2020-02-09T22:58:21.754+0000 7f114e891700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_notify((query:41 sent:41 3.4( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) ([0,0] all_participants= intervals=)) epoch 41) v7 -- 0x55b5c714b700 con 0x55b5e488e400

only osd.2's response is received by osd.10

2020-02-09 22:58:21.749 7fb1327ec700  1 -- 172.21.15.14:6814/10983 <== osd.2 172.21.15.104:6805/16288 17 ==== pg_notify((query:41 sent:41 3.4( empty local-lis/les=30/31 n=0 ec=21/21 lis/c 30/30 les/c/f 31/31/0 41/41/21))=([30,40] intervals=([30,40] acting 2,10)) epoch 41) v6 ==== 1154+0+0 (3882779904 0 0) 0x55672be7a600 con 0x55672c288a00

however, looks like the problem started since osd.5 was upgraded

2020-02-09T22:57:58.198+0000 7f563aab5d80  0 ceph version 15.1.0-441-g39dbd5e (39dbd5e9e63508f2c4b59180b41f5b8bd27d749f) octopus (rc), process ceph-osd, pid 16288

2020-02-09 22:58:11.573 7fb1327ec700  0 can't decode unknown message type 133 MSG_AUTH=17
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672c286e00 :6814 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0).process decode message failed

2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> - conn(0x55672c286700 :6814 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=66 -
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  new session 0x556729a0ac80 con=0x55672c286700 addr=172.21.15.88:6805/14634
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  session 0x556729a0ac80 osd.5 has caps osdcap[grant(*)] 'allow *'
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672c286700 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_s
eq 2 vs existing csq=1 existing_state=STATE_STANDBY
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672c286700 :6814 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2020-02-09 22:58:11.573 7fb125195700  2 osd.10 38 ms_handle_reset con 0x55672c286700 session 0x556729a0ac80
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  session 0x556729a0b180 osd.5 has caps osdcap[grant(*)] 'allow *'
2020-02-09 22:58:11.573 7fb1327ec700  0 can't decode unknown message type 133 MSG_AUTH=17
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672c286e00 :6814 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=15 cs=3 l=0).process decode message faile
d 
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> - conn(0x55672bebc300 :6814 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=66 -
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  new session 0x55672c256580 con=0x55672bebc300 addr=172.21.15.88:6805/14634
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  session 0x55672c256580 osd.5 has caps osdcap[grant(*)] 'allow *'
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672bebc300 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_s
eq 4 vs existing csq=3 existing_state=STATE_STANDBY
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672bebc300 :6814 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2020-02-09 22:58:11.573 7fb125195700  2 osd.10 38 ms_handle_reset con 0x55672bebc300 session 0x55672c256580
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  session 0x556729a0b180 osd.5 has caps osdcap[grant(*)] 'allow *'
2020-02-09 22:58:11.573 7fb1327ec700  0 can't decode unknown message type 133 MSG_AUTH=17
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672c286e00 :6814 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=20 cs=5 l=0).process decode message faile
d 
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> - conn(0x55672bebd800 :6814 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=66 -
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  new session 0x55672c254a00 con=0x55672bebd800 addr=172.21.15.88:6805/14634
2020-02-09 22:58:11.573 7fb1327ec700 10 osd.10 38  session 0x55672c254a00 osd.5 has caps osdcap[grant(*)] 'allow *'
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672bebd800 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_s
eq 6 vs existing csq=5 existing_state=STATE_STANDBY
2020-02-09 22:58:11.573 7fb1327ec700  1 -- 172.21.15.14:6814/10983 >> 172.21.15.88:6805/14634 conn(0x55672bebd800 :6814 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2020-02-09 22:58:11.573 7fb132fed700  1 -- 172.21.15.14:6814/10983 reap_dead start

message type 133 corresponds to MSG_OSD_PG_LEASE in master/octopus
note the addr of osd.10 is 172.21.15.14:6814/10983

Searching for pg_lease messages sent by osd.5(master/octopus) to osd.10(mimic), we see a whole bunch of them starting with

2020-02-09T22:58:11.574+0000 7f115389b700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_lease(2.2 pg_lease(ru 0s ub 17.964713671s int 16s) e38/38) v1 -- 0x55b5c7147480 con 0x55b5e488e400
2020-02-09T22:58:19.574+0000 7f114f893700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_lease(2.2 pg_lease(ru 0s ub 25.964872218s int 16s) e40/40) v1 -- 0x55b5c7146580 con 0x55b5e488e400
2020-02-09T22:58:27.574+0000 7f114f893700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_lease(2.2 pg_lease(ru 0s ub 33.965058546s int 16s) e43/43) v1 -- 0x55b5e4974400 con 0x55b5e488e400
2020-02-09T22:58:35.574+0000 7f114f893700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_lease(2.2 pg_lease(ru 0s ub 41.965256942s int 16s) e48/48) v1 -- 0x55b5e4974a00 con 0x55b5e488e400
2020-02-09T22:58:43.574+0000 7f114f893700  1 -- v1:172.21.15.88:6805/14634 --> v1:172.21.15.14:6814/10983 -- pg_lease(2.2 pg_lease(ru 0s ub 49.965437898s int 16s) e56/56) v1 -- 0x55b5e4975d80 con 0x55b5e488e400
.
.

I think this is the source of the problem.

#17 Updated by Neha Ojha about 2 months ago

Neha Ojha wrote:

pg3.4, which is stuck in "peering" shows similar behavior as https://tracker.ceph.com/issues/43048#note-15

osd.10 sends pg_query to osd.2 and osd.5
[...]

osd.2

[...]

osd.5

[...]

only osd.2's response is received by osd.10

[...]

however, looks like the problem started since osd.5 was upgraded
[...]

[...]

message type 133 corresponds to MSG_OSD_PG_LEASE in master/octopus
note the addr of osd.10 is 172.21.15.14:6814/10983

Searching for pg_lease messages sent by osd.5(master/octopus) to osd.10(mimic), we see a whole bunch of them starting with

[...]

I think this is the source of the problem.

Created https://tracker.ceph.com/issues/44156 for this, since this only affects mimic(or nautilus)->octopus upgrades.

#18 Updated by Neha Ojha about 2 months ago

  • Subject changed from nautilus: upgrade:mimic-x-nautilus: failed to recover before timeout expired to nautilus: upgrade/mimic-x/stress-split: failed to recover before timeout expired
  • ceph-qa-suite upgrade/mimic-x added

#19 Updated by Neha Ojha about 1 month ago

https://github.com/ceph/ceph/pull/33470 - fixing the order of msgr2 vs nautilus install is the first step here.

Also available in: Atom PDF