Bug #44299
nautilus: mgr-osd communication stalled in mixed cluster
Description
Looks like a dup of https://tracker.ceph.com/issues/43048#note-11, but opening a separate bug for clarity.
This appears in mimic-x/nautilus with a symptom of "failed to recover before timeout".
The PGs in question are (both have osd.2, running nautilus, as primary):
38.18 2 0 2 0 0 1080 0 0 2 2 active+recovery_wait+degraded 2020-02-25 21:47:56.716779 2091'2 3063:534 [2,7] 2 [2,7] 2 0'0 2020-02-25 21:26:19.904666 0'0 2020-02-25 21:26:19.904666 0
36.2d 2 0 2 0 0 2097152 0 0 2 2 active+recovery_wait+degraded 2020-02-25 21:47:39.918751 2120'2 3063:221 [2,5] 2 [2,5] 2 0'0 2020-02-25 21:26:15.881393 0'0 2020-02-25 21:26:15.881393 0
From osd.2's log, both became active+clean starting here:
2020-02-25 21:48:27.403 7f6d6383b700 10 osd.2 pg_epoch: 3071 pg[36.2d( v 2120'2 (0'0,2120'2] local-lis/les=3009/3026 n=2 ec=2085/2085 lis/c 3009/3009 les/c/f 3026/3071/0 3009/3009/3009) [2,5] r=0 lpr=3009 crt=2120'2 lcod 0'0 mlcod 0'0 active+clean] share_pg_info
2020-02-25 21:48:36.527 7f6d66040700 10 osd.2 pg_epoch: 3075 pg[38.18( v 2091'2 (0'0,2091'2] local-lis/les=3024/3034 n=2 ec=2089/2089 lis/c 3024/3024 les/c/f 3034/3075/0 3024/3024/3024) [2,7] r=0 lpr=3024 crt=2091'2 lcod 0'0 mlcod 0'0 active+clean] share_pg_info
The mgr log shows that the above was never reflected; the last pg dump reports:
2020-02-25 22:50:53.812 7f68f1fac700 0 log_channel(audit) log [DBG] : from='client.41800 -' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json"}]: dispatch
2020-02-25 22:50:53.880 7f68f0faa700 10 mgr.server operator() 566 pgs: 2 active+recovery_wait+degraded, 564 active+clean; 110 MiB data, 5.6 GiB used, 1.0 TiB / 1.1 TiB avail; 4/628 objects degraded (0.637%)
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 4/628 objects degraded (0.637%), 2 pgs degraded"
        },
        "detail": [
            { "message": "pg 36.2d is active+recovery_wait+degraded, acting [2,5]" },
            { "message": "pg 38.18 is active+recovery_wait+degraded, acting [2,7]" }
        ]
    }
}
Looking at the communication between the mgr and osd.2, the last pg_stats were sent by osd.2 before the PGs recovered:
2020-02-25 21:48:18.340 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 361 ==== pg_stats(84 pgs tid 0 v 0) v2 ==== 55972+0+0 (crc 0 0 0) 0x8216300 con 0x7068400
2020-02-25 21:48:23.260 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da000 con 0x6713f80
2020-02-25 22:03:24.687 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da600 con 0x6f7b180
2020-02-25 22:18:28.231 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x91e7600 con 0x7503f80
2020-02-25 22:33:33.195 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x6873800 con 0x6f79200
2020-02-25 22:48:35.951 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x6872900 con 0x6a55a80
On osd.2, the last mgrconfigure is received here, sent in response to the mgropen at 2020-02-25 21:48:23.260:
2020-02-25 21:33:23.228 7f6d71056700 1 -- [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] <== mgr.14104 v2:172.21.15.136:6832/16446 1 ==== mgrconfigure(period=5, threshold=5) v3 ==== 12+0+0 (crc 0 0 0) 0xfa8c000 con 0xf97e900
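For context, the mgr reporting handshake works roughly like this: a daemon opens a session with mgropen, the mgr answers with mgrconfigure(period, threshold), and only then does the daemon start sending periodic mgrreport/pg_stats. A minimal Python sketch of that flow (the class and method names are illustrative, not Ceph's actual C++ types; the real logic lives in MgrClient and the mgr's DaemonServer):

```python
# Illustrative sketch of the mgr <-> daemon reporting handshake.
# All names here are hypothetical; only the message flow mirrors Ceph.

class MgrServerSketch:
    def handle_mgropen(self, daemon_id):
        # On a new session, tell the daemon how often to report,
        # matching the mgrconfigure(period=5, threshold=5) in the log.
        return {"type": "mgrconfigure", "period": 5, "threshold": 5}

class DaemonSketch:
    def __init__(self):
        self.period = None  # unknown until mgrconfigure arrives

    def handle_mgrconfigure(self, msg):
        self.period = msg["period"]

    def can_report(self):
        # If mgrconfigure is never received (e.g. lost on a stalled
        # connection, as in this bug), reporting never starts.
        return self.period is not None

mgr, osd = MgrServerSketch(), DaemonSketch()
cfg = mgr.handle_mgropen("osd.2")
osd.handle_mgrconfigure(cfg)
print(osd.can_report())  # -> True once the handshake completes
```

This matches the symptom above: osd.2 keeps resending mgropen, but after 21:33:23 no mgrconfigure reaches it, so pg_stats never resume.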
Messages like these seem to indicate that there have been connection issues, which are preventing any progress.
2020-02-25 22:03:23.263 7f69041c5700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 53
2020-02-25 22:03:23.263 7f69041c5700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-02-25 22:03:23.263 7f69041c5700 1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 0x9d1b600 crc :-1 s=READY pgs=16 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-02-25 22:03:23.263 7f69041c5700 1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 0x9d1b600 crc :-1 s=READY pgs=16 cs=0 l=1 rx=0 tx=0).stop
2020-02-25 22:03:23.263 7f69041c5700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6713f80 msgr2=0x9d1b600 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.
2020-02-25 22:03:23.763 7f68fe9ba700 4 mgr operator() OSD[2] addr change [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] != v1:172.21.15.136:6817/220806
2020-02-25 22:03:24.687 7f69041c5700 1 --2- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] >> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] conn(0x6f7b180 0x69ccb00 crc :-1 s=READY pgs=17 cs=0 l=1 rx=0 tx=0).ready entity=osd.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-02-25 22:03:24.687 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] <== osd.2 v2:172.21.15.136:6816/220806 1 ==== mgropen(osd.2) v3 ==== 62127+0+0 (crc 0 0 0) 0x83da600 con 0x6f7b180
2020-02-25 22:03:24.687 7f68f1fac700 1 -- [v2:172.21.15.136:6832/16446,v1:172.21.15.136:6833/16446] --> [v2:172.21.15.136:6816/220806,v1:172.21.15.136:6817/220806] -- mgrconfigure(period=5, threshold=5) v3 -- 0x94da000 con 0x6f7b180
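The "addr change" line above is the mgr noticing that the address it has on record for osd.2 no longer matches what the daemon advertises: a nautilus OSD presents a v2+v1 address vector, while the mgr still holds only the legacy v1 address from the mimic-era map, so a straight equality check flags a change. A simplified, hedged sketch of that comparison (Ceph actually compares entity_addrvec_t values in C++; this just models addresses as strings):

```python
# Hedged model of the check behind:
#   "OSD[2] addr change [v2:...,v1:...] != v1:..."
# In a mixed mimic/nautilus cluster, the advertised addrvec (v2 + v1)
# cannot compare equal to a recorded legacy single-v1 address.

def addrs_differ(advertised, recorded):
    # Simplified: order-insensitive comparison of address strings.
    return sorted(advertised) != sorted(recorded)

advertised = ["v2:172.21.15.136:6816/220806", "v1:172.21.15.136:6817/220806"]
recorded = ["v1:172.21.15.136:6817/220806"]
print(addrs_differ(advertised, recorded))  # -> True: logged as "addr change"
```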
/a/yuriw-2020-02-25_19:22:25-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4802142
Updated by Sage Weil about 4 years ago
- Status changed from New to Need More Info
I can't make heads or tails of this. I think we need to reproduce with debug_ms=20.
Updated by Neha Ojha about 4 years ago
Logs with debug_ms=20: /a/nojha-2020-02-28_01:16:11-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4807275/
This time the connection is stalled for ~40 minutes; see the timestamps of the pg_stats sent by osd.7. During this time the OSD makes three attempts to re-establish the connection, roughly every 15 minutes.
2020-02-28 05:06:55.217 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 361 ==== pg_stats(87 pgs tid 0 v 0) v2 ==== 60450+0+0 (crc 0 0 0) 0x283a000 con 0x6409a80
2020-02-28 05:07:02.373 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x66ad300 con 0x6657680
2020-02-28 05:22:04.891 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x8a0ec00 con 0x889df80
2020-02-28 05:37:05.417 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 1 ==== mgropen(osd.7) v3 ==== 62230+0+0 (crc 0 0 0) 0x9085e00 con 0x6e64900
2020-02-28 05:47:42.683 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 2 ==== mgrreport(osd.7 +112-0 packed 1222 daemon_metrics=2) v7 ==== 10146+0+0 (crc 0 0 0) 0x60db0c0 con 0x6e64900
2020-02-28 05:47:42.683 7f59ac16c700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] <== osd.7 v2:172.21.15.26:6824/15251 3 ==== pg_stats(74 pgs tid 0 v 0) v2 ==== 48942+0+0 (crc 0 0 0) 0x786e000 con 0x6e64900
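The spacing of the mgropen attempts above is not arbitrary: the gaps line up with the 900000000 us idle timeout that appears in the tick messages further down (900 s = 15 min, the default messenger connection idle timeout). A quick arithmetic check on the timestamps:

```python
# Sanity-check the reconnect cadence against the idle timeout
# reported by the tick handler ("idle ... for more than 900000000 us").
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
attempts = [
    "2020-02-28 05:07:02.373",  # first mgropen after the stall begins
    "2020-02-28 05:22:04.891",  # second attempt
    "2020-02-28 05:37:05.417",  # third attempt
]
times = [datetime.strptime(t, FMT) for t in attempts]
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)                    # ~900 s (15 min) between attempts
print(900000000 / 1e6 / 60)    # the timeout, in minutes: 15.0
```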
Looking at osd.7's log, the problem starts here:
2020-02-28 05:07:00.050 7fe3b9ccf700 20 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 msgr2=0x318aac00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=765 last_active=5153.645694s
2020-02-28 05:07:00.050 7fe3b9ccf700 1 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 msgr2=0x318aac00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick idle (900002410) for more than 900000000 us, fault.
2020-02-28 05:07:00.050 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:07:00.050 7fe3b9ccf700 2 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:07:00.050 7fe3b9ccf700 1 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 crc :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:07:00.050 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x2130a900 0x318aac00 unknown :-1 s=READY pgs=4955 cs=0 l=1 rx=0 tx=0).discard_out_queue started
2020-02-28 05:22:02.372 7fe3b9ccf700 20 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 msgr2=0x1065fb80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=792 last_active=6055.968069s
2020-02-28 05:22:02.372 7fe3b9ccf700 1 -- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 msgr2=0x1065fb80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick idle (900002411) for more than 900000000 us, fault.
2020-02-28 05:22:02.372 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:22:02.372 7fe3b9ccf700 2 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:22:02.372 7fe3b9ccf700 1 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 crc :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:22:02.372 7fe3b9ccf700 10 --2- [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] >> [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] conn(0x31f31b00 0x1065fb80 unknown :-1 s=READY pgs=5829 cs=0 l=1 rx=0 tx=0).discard_out_queue started
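The tick logic these faults come from is, in outline: on each timer tick the connection compares time-since-last-activity against the idle timeout, and past the limit it faults; on a lossy (l=1) channel there is no session to replay, so the fault simply stops the connection and discards its output queue. A hedged sketch of that decision (the real code is AsyncConnection::tick() in C++; the names below are illustrative):

```python
# Hedged sketch of the messenger idle-tick decision behind
# "tick idle (900002410) for more than 900000000 us, fault."

IDLE_TIMEOUT_US = 900_000_000  # 15 minutes, per the log

def on_tick(now_us, last_active_us, lossy):
    idle_us = now_us - last_active_us
    if idle_us <= IDLE_TIMEOUT_US:
        return "keep"
    # Past the idle timeout: fault. On a lossy (l=1) channel the
    # connection is stopped and its out queue discarded, matching
    # the "_fault on lossy channel, failing" / "stop" lines above.
    return "stop" if lossy else "reconnect"

print(on_tick(now_us=900_002_411, last_active_us=0, lossy=True))  # -> stop
```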
On the mgr, it starts here:
2020-02-28 05:07:00.045 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).tick last_id=1620 last_active=6043.534510s
2020-02-28 05:07:00.049 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2020-02-28 05:07:00.049 7f59bd383700 20 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2020-02-28 05:07:00.049 7f59bd383700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 48
2020-02-28 05:07:00.049 7f59bd383700 1 -- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 msgr2=0x648c000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-02-28 05:07:00.049 7f59bd383700 20 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2020-02-28 05:07:00.049 7f59bd383700 1 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-02-28 05:07:00.049 7f59bd383700 10 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0)._fault
2020-02-28 05:07:00.049 7f59bd383700 2 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
2020-02-28 05:07:00.049 7f59bd383700 1 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 crc :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).stop
2020-02-28 05:07:00.049 7f59bd383700 10 --2- [v2:172.21.15.92:6832/16779,v1:172.21.15.92:6833/16779] >> [v2:172.21.15.26:6824/15251,v1:172.21.15.26:6825/15251] conn(0x6409a80 0x648c000 unknown :-1 s=READY pgs=1535 cs=0 l=1 rx=0 tx=0).discard_out_queue started
2020-02-28 05:07:00.049 7f59bd383700 20 EpollDriver.del_event del event fd=48 cur_mask=1 delmask=3 to 10