Bug #36471

open

connection resetting tcp errors between mgr daemons

Added by Tomasz Sętkowski over 5 years ago. Updated almost 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading from luminous to mimic, I have had problems with ceph-mgr connections in a production cluster.
In the log of the active daemon I see many messages like this one:

-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

As I understand it, this means the daemon is re-establishing the connection because something went wrong with it, which tcpdump confirms.
Initially this caused memory consumption to grow linearly, so I opened https://tracker.ceph.com/issues/35998 and searched for a leak in the mgr. I did find a leak, but it was not the root cause of the connection resets.
After deploying daemons with the leak fixed, I still see those messages on the active daemon. The weirdest thing is that stopping one of the mgrs sometimes "fixes it". I have 3 mgrs in total. Sometimes I can run 2 or 1 and the problem does not appear. A series of mgr restarts in no apparent order can make it go away or reappear.
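To see whether the replaced connections cluster around particular peers (for example, one of the other mgrs), the log lines can be tallied per peer address. The following is only a minimal sketch: it assumes the mgr log is available as plain text lines in the format quoted above, and the names `LINE_RE` and `count_replacements` are hypothetical helpers, not part of Ceph.

```python
import re
from collections import Counter

# Matches messenger lines of the form quoted above:
#   -- <local addr> >> <peer addr> conn(...) ... accept replacing existing (lossy) channel
# The exact log layout is an assumption based on that single sample line.
LINE_RE = re.compile(
    r"-- (?P<local>\S+) >> (?P<peer>\S+) conn\(.*"
    r"accept replacing existing \(lossy\) channel"
)


def count_replacements(lines):
    """Count 'accept replacing existing (lossy) channel' messages per peer IP."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            # The peer looks like "192.168.200.41:0/7"; keep only the IP part.
            peer_ip = match.group("peer").split(":", 1)[0]
            counts[peer_ip] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 "
        "conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH "
        "pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing "
        "(lossy) channel (new one lossy=1)",
    ]
    for peer, n in count_replacements(sample).most_common():
        print(peer, n)
```

Running this over the active mgr's log before and after a restart would show whether the messages come from all peers or only from specific hosts.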

I captured the lossy connections with tcpdump, and Wireshark reports several problems in the capture.
TCP:

893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    [TCP Window Full] 50102 → 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 TSval=2447622448 TSecr=577135410

[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

The Ceph dissector shows errors too:

Ceph UNKNOWN x22
    Filter Data
        [Source Node Name: Unknown]
        [Source Node Type: unknown (0x00)]
        [Destination Node Name: Unknown]
        [Destination Node Type: unknown (0x00)]
    Tag: Unknown (0x60)
        [Expert Info (Error/Undecoded): Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Severity level: Error]
            [Group: Undecoded]
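To check how widespread the TCP-level problems are, the packet list can be exported from Wireshark as text and the bracketed TCP analysis notes tallied per source and destination. This is a rough sketch under stated assumptions: rows are whitespace-separated in the column order of the row quoted above (No., Time, Source, Destination, Protocol, Length, Info), and `FLAG_RE` and `tally_tcp_flags` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Wireshark's TCP analysis notes appear in the Info column as "[TCP ...]",
# e.g. "[TCP Window Full]". Plain flags such as "[ACK]" are not matched.
FLAG_RE = re.compile(r"\[(TCP [^\]]+)\]")


def tally_tcp_flags(rows):
    """Count (source, destination, note) triples in exported packet-list rows."""
    counts = Counter()
    for row in rows:
        # Assumed columns: No., Time, Source, Destination, Protocol, Length, Info
        fields = row.split(None, 6)
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        src, dst, info = fields[2], fields[3], fields[6]
        for note in FLAG_RE.findall(info):
            counts[(src, dst, note)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    "
        "[TCP Window Full] 50102 \u2192 6803 [ACK] Seq=20328 Ack=410 Win=25984 "
        "Len=8232 TSval=2447622448 TSecr=577135410",
    ]
    for (src, dst, note), n in tally_tcp_flags(sample).items():
        print(src, dst, note, n)
```

If the notes concentrate on a single host pair, that points at one path or host rather than at the mgr daemons in general.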

I have tried several things to narrow this down, but none of them changed anything:
- swapped all NICs
- switched from jemalloc to tcmalloc

I have a pcap file of the capture that I can share; email me for it.


Related issues: 1 (0 open, 1 closed)

Related to mgr - Bug #35998: ceph-mgr active daemon memory leak since mimic (Resolved, 09/15/2018)
