Bug #36471


connection resetting tcp errors between mgr daemons

Added by Tomasz Sętkowski over 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading from Luminous to Mimic I have had problems with connections in ceph-mgr in our production cluster.
In the active daemon I can see a lot of messages like these:

-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

As I understand it, this means the daemon is restarting the connection because something went wrong with it (which tcpdump confirms).
Initially this caused memory consumption to grow linearly, so I opened https://tracker.ceph.com/issues/35998 and searched for a leak in the mgr. I did find a leak, but it was not the root cause of the connection resets.
After deploying daemons with the leak fixed, I can still see those messages on the active daemon. The weirdest thing is that sometimes stopping one of the mgrs "fixes" it. I have 3 mgrs in total; sometimes I can run 2, or even just 1, and the problem does not appear. A series of mgr restarts in no apparent order can make it go away or reappear.
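
For anyone wanting to quantify the reset rate, a rough script along these lines can count the lossy-channel replacements and ms_handle_reset events per minute from the mgr log. This is only a sketch, not part of Ceph; the default log path is an assumption, and the timestamp format follows the log excerpts in this ticket.

```python
#!/usr/bin/env python3
"""Count messenger resets per minute in a ceph-mgr log (diagnostic sketch)."""
import re
import sys
from collections import Counter

# Assumed default location; pass the real path as the first argument.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/ceph/ceph-mgr.log"

# Bucket by "YYYY-MM-DD HH:MM" as seen in the log excerpts in this ticket.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")

replacements = Counter()  # "accept replacing existing (lossy) channel"
resets = Counter()        # "ms_handle_reset"

with open(LOG_PATH) as f:
    for line in f:
        m = TS_RE.search(line)
        if not m:
            continue
        minute = m.group(1)
        if "replacing existing (lossy) channel" in line:
            replacements[minute] += 1
        if "ms_handle_reset" in line:
            resets[minute] += 1

for minute in sorted(set(replacements) | set(resets)):
    print(f"{minute}  lossy replacements={replacements[minute]:3d}  "
          f"ms_handle_reset={resets[minute]:3d}")
```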

I made a tcpdump of the lossy connections and found that Wireshark reports several problems:
TCP:

893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    [TCP Window Full] 50102 → 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 TSval=2447622448 TSecr=577135410[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]


The Ceph dissector shows errors too:

Ceph UNKNOWN x22
    Filter Data
        [Source Node Name: Unknown]
        [Source Node Type: unknown (0x00)]
        [Destination Node Name: Unknown]
        [Destination Node Type: unknown (0x00)]
    Tag: Unknown (0x60)
        [Expert Info (Error/Undecoded): Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Severity level: Error]
            [Group: Undecoded]

I have tried several things to narrow this down, and none of them changed anything:
- swapped all NICs
- switched from jemalloc to tcmalloc

I have a pcap file with the dump, which I can share; just email me for it.
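
For reference, here is a rough sketch of how such a capture can be scanned for suspected retransmissions and TCP resets per flow. It assumes scapy is installed, the file name is only illustrative, and the duplicate-sequence heuristic merely approximates Wireshark's retransmission/overlap analysis.

```python
#!/usr/bin/env python3
"""Scan a pcap of mgr traffic for suspected retransmissions and resets (sketch)."""
from collections import Counter, defaultdict

from scapy.all import IP, TCP, rdpcap

PCAP = "mgr-lossy.pcap"  # illustrative capture file name

seen = defaultdict(set)   # flow -> sequence numbers already seen with payload
suspected = Counter()     # flow -> suspected retransmissions
rst_count = Counter()     # flow -> RST segments

for pkt in rdpcap(PCAP):
    if IP not in pkt or TCP not in pkt:
        continue
    tcp = pkt[TCP]
    flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
    if tcp.flags & 0x04:  # RST bit set
        rst_count[flow] += 1
    if bytes(tcp.payload):
        # Payload with a sequence number we have already seen on this flow
        # is counted as a suspected retransmission/overlap.
        if tcp.seq in seen[flow]:
            suspected[flow] += 1
        seen[flow].add(tcp.seq)

for flow in sorted(set(suspected) | set(rst_count)):
    src, sport, dst, dport = flow
    print(f"{src}:{sport} -> {dst}:{dport}  "
          f"suspected retransmissions={suspected[flow]}  RST={rst_count[flow]}")
```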


Related issues 1 (0 open, 1 closed)

Related to mgr - Bug #35998: ceph-mgr active daemon memory leak since mimic (Resolved, 09/15/2018)

Actions #1

Updated by Corin Langosch over 5 years ago

I'm running a small test cluster with 1 mon and 1 mgr, both on the same host. My mgr log gets filled pretty fast with the following entries and I see constant memory growth of the mgr daemon. The cluster state is HEALTHY.

mgr1_1 | 2018-10-17 11:49:20.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:20.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:20.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:20.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:20.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:20.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:21.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:21.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:21.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:21.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:21.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:22.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:22.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:22.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:22.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:22.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:22.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:23.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:23.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:23.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:23.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:23.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:24.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:24.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:24.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:24.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:24.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:24.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:25.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:25.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:25.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:25.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb145100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:25.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb145100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

Actions #2

Updated by Corin Langosch about 5 years ago

The bug is still present in the latest Ceph version, 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable).

mgr1_1 | 2019-03-10 23:09:47.517 7f53a7a0a700 1 mgr send_beacon active
mgr1_1 | 2019-03-10 23:09:47.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c77800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:47.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:47.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:48.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:48.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:48.529 7f53aea18700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:49.517 7f53a7a0a700 1 mgr send_beacon active
mgr1_1 | 2019-03-10 23:09:49.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c99500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:49.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:49.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c99c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:50.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c9a300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:50.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:50.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c9aa00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

I have a single mgr running. It seems to be trying to talk to itself (10.0.1.2:6800/13 >> 10.0.1.2:0/13) ;)

Actions #3

Updated by Corin Langosch about 5 years ago

I added another two mgrs, but nothing changed. The first mgr still produces the errors; the other two don't.

mgr1 is also eating up all memory over time (5 GB RAM in about 1 day):
root 3104 1.2 2.0 7490484 5450612 ? Sl Mar10 13:22 ceph-mgr -d -i mgr1
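
To put a number on the growth rate, a minimal sketch like the following samples the daemon's RSS from /proc (Linux only; pass the PID from the ps output above on the command line). This is just a diagnostic aid, not part of Ceph.

```python
#!/usr/bin/env python3
"""Sample the RSS of a ceph-mgr process to quantify its growth rate (sketch)."""
import sys
import time

PID = sys.argv[1]   # e.g. the ceph-mgr PID shown by ps
INTERVAL = 60       # seconds between samples


def rss_kb(pid: str) -> int:
    """Return VmRSS (in kB, as reported by the kernel) from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")


prev = rss_kb(PID)
print(f"start RSS: {prev} kB")
while True:
    time.sleep(INTERVAL)
    cur = rss_kb(PID)
    print(f"RSS: {cur} kB  (delta {cur - prev:+d} kB over {INTERVAL}s)")
    prev = cur
```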

Actions #4

Updated by Corin Langosch about 5 years ago

Could the severity please be changed to major? This bug has the potential to OOM the whole system.

Actions #5

Updated by Nathan Cutler about 5 years ago

  • Related to Bug #35998: ceph-mgr active daemon memory leak since mimic added
Actions #6

Updated by Jan Smets almost 5 years ago

I'm having the same problem.
Running 13.2.5 with Rook 0.9.

I had the problem before 13.2.5 and was waiting for #35998, but unfortunately the issue still persists.

A single mgr causes an OOM after 2-3 days.

Actions #7

Updated by yite gu almost 5 years ago

I'm having the same problem.
My Ceph version is 12.2.12.
