Bug #53436

mds, mon: mds beacon messages get dropped? (mds never reaches up:active state)

Added by Venky Shankar about 1 year ago. Updated about 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, MDSMonitor
Labels (FS):
task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in this run - https://pulpito.ceph.com/vshankar-2021-11-24_07:14:27-fs-wip-vshankar-testing-20211124-094330-testing-default-smithi/6524013/

Didn't analyze in depth, but it looks like the mds (mds.d) sent a beacon message requesting the "up:active" state to the monitor (mon.a); however, the monitor never received this beacon, leaving the mds stuck in the "up:creating" state.

AFAICS, there was no monitor crash or failover.
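The failure mode described above can be illustrated with a toy sketch in Python. This is not Ceph code — the class and function names are invented for illustration. It contrasts a fire-and-forget send, where a message lost to a connection reset simply vanishes (the dropped beacon), with a sender that retries until the message is actually delivered:

```python
# Toy illustration (not Ceph code): a beacon sent fire-and-forget over a
# connection that resets mid-send is silently lost; a sender that replays
# the pending message after reconnecting eventually delivers it.

class FlakyConnection:
    """Drops the first send attempt, mimicking the injected socket failure."""
    def __init__(self):
        self.delivered = []
        self._failures_left = 1

    def send(self, msg):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionResetError("(32) Broken pipe")
        self.delivered.append(msg)

def fire_and_forget(conn, msg):
    try:
        conn.send(msg)
    except ConnectionResetError:
        pass  # message silently lost -- the observed failure mode

def resend_until_delivered(conn, msg, max_attempts=3):
    for _ in range(max_attempts):
        try:
            conn.send(msg)
            return True  # monitor would now see the beacon
        except ConnectionResetError:
            continue  # simulate reconnect and replay the pending message
    return False

lossy = FlakyConnection()
fire_and_forget(lossy, "mdsbeacon up:active seq=4")
print(lossy.delivered)     # [] -- beacon dropped, mds stays up:creating

reliable = FlakyConnection()
resend_until_delivered(reliable, "mdsbeacon up:active seq=4")
print(reliable.delivered)  # ['mdsbeacon up:active seq=4']
```

In the actual run, the injected socket failure (see the mds log below) killed the connection exactly as the up:active beacon was being written, which is why the monitor never saw it.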


Related issues

Related to RADOS - Bug #45647: "ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml New
Related to RADOS - Bug #39039: mon connection reset, command not resent Duplicate

History

#1 Updated by Venky Shankar about 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

#2 Updated by Xiubo Li about 1 year ago

Seems to be the same issue as https://tracker.ceph.com/issues/51705.

#3 Updated by Xiubo Li about 1 year ago

From remote/smithi154/log/ceph-mds.d.log.gz, we can see that the mon.1 connection was broken:

2021-11-24T12:57:17.993+0000 7fb7faada700  5 mds.beacon.d set_want_state: up:creating -> up:active
2021-11-24T12:57:17.993+0000 7fb7faada700  5 mds.beacon.d Sending beacon up:active seq 4
2021-11-24T12:57:17.993+0000 7fb7faada700  1 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] --> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] -- mdsbeacon(4312/d up:active seq=4 v5) v8 -- 0x557d74d298c0 con 0x557d74d2b400
2021-11-24T12:57:17.993+0000 7fb7faada700 10 mds.0.snapclient sync
2021-11-24T12:57:17.993+0000 7fb7faada700 10 mds.0.snapclient refresh want 1
2021-11-24T12:57:17.993+0000 7fb803aec700  0 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 msgr2=0x557d74d6e000 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send injecting socket failure
2021-11-24T12:57:17.993+0000 7fb803aec700  1 -- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 msgr2=0x557d74d6e000 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send send error: (32) Broken pipe
2021-11-24T12:57:17.993+0000 7fb803aec700  1 --2- [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] >> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] conn(0x557d74d2b400 0x557d74d6e000 secure :-1 s=READY pgs=41 cs=0 l=1 rev1=1 crypto rx=0x557d74028990 tx=0x557d74d32e40 comp rx=0 tx=0).write_message error sending 0x557d74d298c0, (32) Broken pipe
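The "_try_send injecting socket failure" line above comes from the messenger's failure-injection machinery, which the teuthology msgr-failures fragments enable. A hedged sketch of roughly what such a fragment sets (the exact value is illustrative — check the msgr-failures/many.yaml fragment in the qa suite):

```yaml
# Approximate shape of a msgr-failures teuthology fragment; the
# denominator controls how often a socket failure is injected.
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
```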

And from remote/smithi154/log/ceph-mon.b.log.gz, the monitor reports that the peer (mds.d) closed the connection, even though the mds.d daemon never actually did so:

2021-11-24T12:57:17.987+0000 7f5cefda7700 10 mon.b@1(peon).osd e27 check_osdmap_sub 0x56409282cc00 next 28 (onetime)
2021-11-24T12:57:17.993+0000 7f5cee5a4700  1 -- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 msgr2=0x56409268ec00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2021-11-24T12:57:17.993+0000 7f5cee5a4700  1 -- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 msgr2=0x56409268ec00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-11-24T12:57:17.993+0000 7f5cee5a4700  1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 0x56409268ec00 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5640926a74d0 tx=0x564092719980 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-11-24T12:57:17.993+0000 7f5cee5a4700  1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >> [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] conn(0x56409239d000 0x56409268ec00 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5640926a74d0 tx=0x564092719980 comp rx=0 tx=0).stop
2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 ms_handle_reset 0x56409239d000 v2:172.21.15.154:6834/2120133015
2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 reset/close on session mds.? [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015]
2021-11-24T12:57:17.993+0000 7f5cefda7700 10 mon.b@1(peon) e1 remove_session 0x5640923fab40 mds.? [v2:172.21.15.154:6834/2120133015,v1:172.21.15.154:6835/2120133015] features 0x3f01cfbb7ffdffff
2021-11-24T12:57:17.993+0000 7f5cedda3700  1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >>  conn(0x564091084800 0x56409268d600 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2021-11-24T12:57:17.993+0000 7f5cee5a4700  1 --2- [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] >>  conn(0x564091084800 0x56409268d600 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 mon.b@1(peon) e1 handle_auth_request con 0x564091084800 (start) method 2 payload 18
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 start_session entity_name=mds.d global_id=4312 is_new_global_id=0
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d: start_session server_challenge 28be28b095cac19c
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 mon.b@1(peon) e1 handle_auth_request con 0x564091084800 (more) method 2 payload 132
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d: handle_request get_auth_session_key for mds.d
2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d:  checking key: req.key=68a540dad2b05242 expected_key=68a540dad2b05242
2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d:  checking old_ticket: secret_id=2 len=96, old_ticket_may_be_omitted=0
2021-11-24T12:57:17.993+0000 7f5cee5a4700 20 cephx server mds.d:  decoded old_ticket: global_id=4312
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx server mds.d:  allowing reclaim of global_id 4312 (valid ticket presented, will encrypt new ticket)
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret AQBwNp5h5FObABAAd8zJ5u1Pmk0NcubOt7rI3g==
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx: build_service_ticket service auth secret_id 2 ticket_info.ticket.name=mds.d ticket.global_id 4312
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx keyserverdata: get_caps: name=mds.d
2021-11-24T12:57:17.993+0000 7f5cee5a4700 10 cephx keyserverdata: get_secret: num of caps=4

#4 Updated by Xiubo Li about 1 year ago

  • Status changed from Triaged to Duplicate
  • Parent task set to #39039

This is a long-standing known bug.

#5 Updated by Xiubo Li about 1 year ago

  • Parent task deleted (#39039)

#6 Updated by Xiubo Li about 1 year ago

  • Related to Bug #45647: "ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml added

#7 Updated by Xiubo Li about 1 year ago

  • Related to Bug #39039: mon connection reset, command not resent added
