Bug #44299

nautilus: mgr osd communication stalled in mixed cluster

Added by Neha Ojha about 4 years ago. Updated about 4 years ago.

Status: Need More Info
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looks like a dup of https://tracker.ceph.com/issues/43048#note-11, but opening a separate bug for clarity.

This appears in mimic-x/nautilus with a symptom of "failed to recover before timeout".
The PGs in question are (both have osd.2, running nautilus, as primary):

38.18         2                  0        2         0       0     1080           0          0    2        2 active+recovery_wait+degraded 2020-02-25 21:47:56.716779    2091'2  3063:534   [2,7]          2   [2,7]              2        0'0 2020-02-25 21:26:19.904666             0'0 2020-02-25 21:26:19.904666             0

36.2d         2                  0        2         0       0  2097152           0          0    2        2 active+recovery_wait+degraded 2020-02-25 21:47:39.918751    2120'2  3063:221   [2,5]          2   [2,5]              2        0'0 2020-02-25 21:26:15.881393             0'0 2020-02-25 21:26:15.881393             0

From osd.2's log, both became active+clean starting here:

2020-02-25 21:48:27.403 7f6d6383b700 10 osd.2 pg_epoch: 3071 pg[36.2d( v 2120'2 (0'0,2120'2] local-lis/les=3009/3026 n=2 ec=2085/2085 lis/c 3009/3009 les/c/f 3026/3071/0 3009/3009/3009) [2,5] r=0 lpr=3009 crt=2120'2 lcod 0'0 mlcod 0'0 active+clean] share_pg_info

2020-02-25 21:48:36.527 7f6d66040700 10 osd.2 pg_epoch: 3075 pg[38.18( v 2091'2 (0'0,2091'2] local-lis/les=3024/3034 n=2 ec=2089/2089 lis/c 3024/3024 les/c/f 3034/3075/0 3024/3024/3024) [2,7] r=0 lpr=3024 crt=2091'2 lcod 0'0 mlcod 0'0 active+clean] share_pg_info

From the mgr log, the above never got reflected; the last pg dump reports:

2020-02-25 22:50:53.812 7f68f1fac700  0 log_channel(audit) log [DBG] : from='client.41800 -' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json"}]: dispatch
2020-02-25 22:50:53.880 7f68f0faa700 10 mgr.server operator() 566 pgs: 2 active+recovery_wait+degraded, 564 active+clean; 110 MiB data, 5.6 GiB used, 1.0 TiB / 1.1 TiB avail; 4/628 objects degraded (0.637%)
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 4/628 objects degraded (0.637%), 2 pgs degraded" 
        },
        "detail": [
            {
                "message": "pg 36.2d is active+recovery_wait+degraded, acting [2,5]" 
            },
            {
                "message": "pg 38.18 is active+recovery_wait+degraded, acting [2,7]" 
            }
        ]
    }
}

Looking at communication between the mgr and osd.2, the last pg_stats were sent by osd.2 before the PGs recovered:

2020-02-25 21:48:18.340 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 361 ==== pg_stats(84 pgs tid 0 v 0) v2 ==== 55972+0+0 (crc 0 0 0) 0x8216300 con 0x7068400
2020-02-25 21:48:23.260 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da000 con 0x6713f80
2020-02-25 22:03:24.687 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da600 con 0x6f7b180
2020-02-25 22:18:28.231 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x91e7600 con 0x7503f80
2020-02-25 22:33:33.195 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x6873800 con 0x6f79200
2020-02-25 22:48:35.951 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x6872900 con 0x6a55a80

On osd.2, the last mgrconfigure was received here; it was sent in response to the mgropen at 2020-02-25 21:48:23.260:

2020-02-25 21:33:23.228 7f6d71056700  1 -- [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] <== mgr.14104 v2:172.21.15.136:6832/16446 1 ==== mgrconfigure(period=5, threshold=5) v3 ==== 12+0+0 (crc 0 0 0) 0xfa8c000 con 0xf97e900
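
For context, my understanding of the reporting handshake is: the daemon sends mgropen on every new session, the mgr replies with mgrconfigure carrying the report period, and only then does the daemon start sending mgrreport/pg_stats on that session. A minimal C++ sketch of the daemon side under that assumption (illustrative names, not the actual MgrClient code):

// Hypothetical sketch of the daemon side of the mgr handshake (not the
// real MgrClient implementation): reports only start once mgrconfigure
// has delivered a period, so a lost reply stalls pg_stats indefinitely.
#include <chrono>
#include <iostream>
#include <optional>
#include <thread>

struct MgrSession {
    std::optional<std::chrono::seconds> period;  // set by mgrconfigure

    void on_connect() { std::cout << "--> mgropen\n"; }

    void on_mgrconfigure(int period_s) {  // e.g. period=5 as in the log
        period = std::chrono::seconds(period_s);
    }

    void report_loop(int ticks) {
        for (int i = 0; i < ticks; ++i) {
            if (!period) {
                // No mgrconfigure received yet on this session: keep
                // waiting, send nothing. This matches the stalled state
                // seen here, where only mgropen ever reaches the mgr.
                std::this_thread::sleep_for(std::chrono::seconds(1));
                continue;
            }
            std::cout << "--> mgrreport + pg_stats\n";
            std::this_thread::sleep_for(*period);
        }
    }
};

int main() {
    MgrSession s;
    s.on_connect();          // mgropen sent on the fresh session
    // s.on_mgrconfigure(5); // never happens if the reply is lost
    s.report_loop(3);        // so no pg_stats ever go out
}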

Messages like these seem to indicate that there have been connection issues, which are preventing any progress:

2020-02-25 22:03:23.263 7f69041c5700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 53
2020-02-25 22:03:23.263 7f69041c5700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-02-25 22:03:23.263 7f69041c5700  1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 0x9d1b600 crc :-1 s=READY pgs=16 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-02-25 22:03:23.263 7f69041c5700  1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 0x9d1b600 crc :-1 s=READY pgs=16 cs=0 l=1 rx=0 tx=0).stop
2020-02-25 22:03:23.263 7f69041c5700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.
2020-02-25 22:03:23.763 7f68fe9ba700  4 mgr operator() OSD[2] addr change [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] != v1:172.21.15.136:6817/220806
2020-02-25 22:03:24.687 7f69041c5700  1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6f7b180 0x69ccb00 crc :-1 s=READY pgs=17 cs=0 l=1 rx=0 tx=0).ready entity=osd.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-02-25 22:03:24.687 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da600 con 0x6f7b180
2020-02-25 22:03:24.687 7f68f1fac700  1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] --> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] -- mgrconfigure(period=5, threshold=5) v3 -- 0x94da000 con 0x6f7b180
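
The "no progress in more than 10000000 us during connecting" tick above is the msgr2 watchdog; 10 s is, as far as I can tell, the default ms_connection_ready_timeout. A simplified sketch of that check (not the actual AsyncConnection::tick code):

// Sketch of the msgr2 "ready" watchdog, simplified: a connection that
// has not completed its handshake within ms_connection_ready_timeout
// (10 s by default) is faulted, producing the "see no progress ...
// during connecting, fault." line in the mgr log above.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct ConnTick {
    static constexpr auto ready_timeout = std::chrono::seconds(10);
    bool ready = false;               // handshake completed?
    Clock::time_point last_progress;  // last time any byte moved

    // Called periodically by the messenger's tick timer.
    bool tick(Clock::time_point now) {
        if (!ready && now - last_progress > ready_timeout) {
            std::puts("see no progress during connecting, fault.");
            return true;  // caller tears the connection down
        }
        return false;
    }
};

int main() {
    ConnTick c{false, Clock::now() - std::chrono::seconds(11)};
    c.tick(Clock::now());  // fires: 11 s of no progress exceeds 10 s
}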

/a/yuriw-2020-02-25_19:22:25-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4802142

#1 - Updated by Sage Weil about 4 years ago

  • Status changed from New to Need More Info

I can't make heads or tails of this. I think we need to reproduce with debug_ms=20.

#2 - Updated by Neha Ojha about 4 years ago

  • Priority changed from Normal to High

#3 - Updated by Neha Ojha about 4 years ago

Logs with debug_ms=20: /a/nojha-2020-02-28_01:16:11-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4807275/

This time the connection is stalled for ~40 minutes; see the timestamps of the pg_stats sent by osd.7. During this time the OSD makes three attempts to re-establish the connection, roughly every ~15 minutes.

2020-02-28 05:06:55.217 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 361 ==== pg_stats(87 pgs tid 0 v 0) v2 ==== 60450+0+0 (crc 0 0 0) 0x283a000 con 0x6409a80
2020-02-28 05:07:02.373 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x66ad300 con 0x6657680
2020-02-28 05:22:04.891 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x8a0ec00 con 0x889df80
2020-02-28 05:37:05.417 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x9085e00 con 0x6e64900
2020-02-28 05:47:42.683 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 2 ==== mgrreport(osd.7 +112-0 packed 1222 daemon_metrics=2) v7 ==== 10146+0+0 (crc 0 0 0) 0x60db0c0 con 0x6e64900
2020-02-28 05:47:42.683 7f59ac16c700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 3 ==== pg_stats(74 pgs tid 0 v 0) v2 ==== 48942+0+0 (crc 0 0 0) 0x786e000 con 0x6e64900

Looking at osd.7's log, the problem starts here:

2020-02-28 05:07:00.050 7fe3b9ccf700 20 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 msgr2=0x318aac00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=765 last_active=5153.645694s
2020-02-28 05:07:00.050 7fe3b9ccf700  1 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 msgr2=0x318aac00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick idle (900002410) for more than 900000000 us, fault.
2020-02-28 05:07:00.050 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:07:00.050 7fe3b9ccf700  2 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:07:00.050 7fe3b9ccf700  1 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:07:00.050 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 unknown :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0).discard_out_queue started
2020-02-28 05:22:02.372 7fe3b9ccf700 20 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 msgr2=0x1065fb80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=792 last_active=6055.968069s
2020-02-28 05:22:02.372 7fe3b9ccf700  1 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 msgr2=0x1065fb80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick idle (900002411) for more than 900000000 us, fault.
2020-02-28 05:22:02.372 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:22:02.372 7fe3b9ccf700  2 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:22:02.372 7fe3b9ccf700  1 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:22:02.372 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 unknown :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0).discard_out_queue started

and on the mgr here:

2020-02-28 05:07:00.045 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=1620 last_active=6043.534510s
2020-02-28 05:07:00.049 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2020-02-28 05:07:00.049 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2020-02-28 05:07:00.049 7f59bd383700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 48
2020-02-28 05:07:00.049 7f59bd383700  1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-02-28 05:07:00.049 7f59bd383700 20 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2020-02-28 05:07:00.049 7f59bd383700  1 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-02-28 05:07:00.049 7f59bd383700 10 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:07:00.049 7f59bd383700  2 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:07:00.049 7f59bd383700  1 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:07:00.049 7f59bd383700 10 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 unknown :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).discard_out_queue started
2020-02-28 05:07:00.049 7f59bd383700 20 EpollDriver.del_event del event fd=48 cur_mask=1 delmask=3 to 10
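
For the record, the 900000000 us in osd.7's idle fault is 900 s, which matches the default ms_connection_idle_timeout: the OSD only declares the silent session dead and reopens it every 15 minutes, which is exactly the cadence of the mgropen messages above (the mgr-side "peer close file descriptor" is the other end of that fault). The idle side of the same tick, sketched with the same caveats as before:

// Sketch of the msgr2 idle watchdog, simplified: an established but
// silent connection is faulted after ms_connection_idle_timeout
// (900 s by default), giving the ~15 min reconnect cadence seen here.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct IdleTick {
    static constexpr auto idle_timeout = std::chrono::seconds(900);
    Clock::time_point last_active;  // last send/receive on the conn

    bool tick(Clock::time_point now) {
        if (now - last_active > idle_timeout) {
            std::puts("idle, fault.");
            return true;  // fault -> stop -> fresh connect + mgropen
        }
        return false;
    }
};

// 900 s is exactly the 15 min gap between the repeated mgropens.
static_assert(std::chrono::duration_cast<std::chrono::minutes>(
                  IdleTick::idle_timeout).count() == 15,
              "900 s == 15 min");

int main() {
    IdleTick t{Clock::now() - std::chrono::seconds(901)};
    t.tick(Clock::now());  // fires: idle longer than 900 s
}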