Actions
Bug #53436
closedmds, mon: mds beacon messages get dropped? (mds never reaches up:active state)
Status:
Duplicate
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, MDSMonitor
Labels (FS):
task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Seen in this run - https://pulpito.ceph.com/vshankar-2021-11-24_07:14:27-fs-wip-vshankar-testing-20211124-094330-testing-default-smithi/6524013/
Didn't analyze in depth, but looks like the mds (mds.d) sent a beacon message wanting "up:active" state to the monitor (mon.a), however, the monitor never received this beacon thereby leaving the mds in "up:creating" state.
AFAICS, there was no monitor crash or failover.
Updated by Venky Shankar over 2 years ago
- Status changed from New to Triaged
- Assignee set to Xiubo Li
Updated by Xiubo Li over 2 years ago
Seems the same issue with https://tracker.ceph.com/issues/51705.
Updated by Xiubo Li over 2 years ago
From remote/smithi154/log/ceph-mds.d.log.gz, we can see that the mon.1 connection was broken:
2021-11-24T12:57:17.993+0000 7fb7faada700 5 mds.beacon.d set_want_state: up:creating -> up:active 2021-11-24T12:57:17.993+0000 7fb7faada700 5 mds.beacon.d Sending beacon up:active seq 4 2021-11-24T12:57:17.993+0000 7fb7faada700 1 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] --> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] -- mdsbeacon(4312/d up:active seq=4 v5) v8 -- 0x557d74d298c0 con 0x557d74d2b400 2021-11-24T12:57:17.993+0000 7fb7faada700 10 mds.0.snapclient sync 2021-11-24T12:57:17.993+0000 7fb7faada700 10 mds.0.snapclient refresh want 1 2021-11-24T12:57:17.993+0000 7fb803aec700 0 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 msgr2=0x557d74d6e000 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send injecting socket failure 2021-11-24T12:57:17.993+0000 7fb803aec700 1 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 msgr2=0x557d74d6e000 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send send error: (32) Broken pipe 2021-11-24T12:57:17.993+0000 7fb803aec700 1 --2- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 0x557d74d6e000 secure :-1 s=READY pgs=41 cs=0 l=1 rev1=1 crypto rx=0x557d74028990 tx=0x557d74d32e40 comp rx=0 tx=0).write_message error sending 0x557d74d298c0, (32) Broken pipe
And from remote/smithi154/log/ceph-mon.b.log.gz, it says the mds.b closed the connection, but the mds.d daemon didn't do that:
2021-11-24T12:57:17.987+0000 7f5cefda7700 10 mon.b@1(peon).osd e27 check_osdmap_sub 0x56409282cc00 next 28 (onetime) 2021-11-24T12:57:17.993+0000 7f5cee5a4700 1 -- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 msgr2=0x56409268ec00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42 2021-11-24T12:57:17.993+0000 7f5cee5a4700 1 -- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 msgr2=0x56409268ec00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 2021-11-24T12:57:17.993+0000 7f5cee5a4700 1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 0x56409268ec00 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5640926a74d0 tx=0x564092719980 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted) 2021-11-24T12:57:17.993+0000 7f5cee5a4700 1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 0x56409268ec00 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5640926a74d0 tx=0x564092719980 comp rx=0 tx=0).stop 2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 ms_handle_reset 0x56409239d000 v2:172.21.15.154:6834/2120133015 2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 reset/close on session mds.? [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] 2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 remove_session 0x5640923fab40 mds.? [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] features 0x3f01cfbb7ffdffff 2021-11-24T12:57:17.993+0000 7f5cedda3700 1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> conn(0x564091084800 0x56409268d600 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept 2021-11-24T12:57:17.993+0000 7f5cee5a4700 1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> conn(0x564091084800 0x56409268d600 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 mon.b@1(peon) e1 handle_auth_request con 0x564091084800 (start) method 2 payload 18 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 start_session entity_name=mds.d global_id=4312 is_new_global_id=0 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d: start_session server_challenge 28be28b095cac19c 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 mon.b@1(peon) e1 handle_auth_request con 0x564091084800 (more) method 2 payload 132 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d: handle_request get_auth_session_key for mds.d 2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d: checking key: req.key=68a540dad2b05242 expected_key=68a540dad2b05242 2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d: checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0 2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d: decoded old_ticket: global_id=4312 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d: allowing reclaim of global_id 4312 (valid ticket presented, will encrypt new ticket) 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret AQBwNp5h5FObABAAd8zJ5u1Pmk0NcubOt7rI3g== 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx: build_service_ticket service auth secret_id 2 ticket_info.ticket.name=mds.d ticket.global_id 4312 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx keyserverdata: get_caps: name=mds.d 2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx keyserverdata: get_secret: num of caps=4
Updated by Xiubo Li over 2 years ago
- Status changed from Triaged to Duplicate
- Parent task set to #39039
This is a known bug long time ago.
Updated by Xiubo Li over 2 years ago
- Related to Bug #45647: "ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml added
Updated by Xiubo Li over 2 years ago
- Related to Bug #39039: mon connection reset, command not resent added
Actions