Bug #36471: connection resetting tcp errors between mgr daemons - mgr - Ceph

Bug #36471

open

connection resetting tcp errors between mgr daemons

Added by Tomasz Sętkowski over 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: ceph-mgr
Target version:
% Done:
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: Ceph - v13.2.2
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading from luminous to mimic, I have had connection problems in ceph-mgr on a production cluster.
In the active daemon I see a lot of messages like this one:

-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

As I understand it, this is the daemon restarting the connection because something was wrong with it, which tcpdump confirms.
Initially this caused memory consumption to grow linearly, so I opened https://tracker.ceph.com/issues/35998 and searched for a leak in mgr. I did find a leak, but it was not the root cause of the connection resets.
After deploying daemons with the leak fixed, I still see those messages on the active daemon. The weirdest thing is that stopping one of the mgrs sometimes "fixes" it. I have 3 mgrs in total. Sometimes I can run 2 or 1 and the problem does not appear. A series of mgr restarts in no apparent order can make it go away or reappear.
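
To see which peers trigger these replacements most often, the messages can be tallied per peer address. This is only a rough sketch: the regular expression is written against the single log line quoted above, and the names `REPLACE_RE`, `count_replacements` and `SAMPLE` are made up for illustration, not part of Ceph.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this report (an assumption:
# other Ceph versions may format the message differently).
REPLACE_RE = re.compile(
    r"--\s+(?P<local>\S+)\s+>>\s+(?P<peer>\S+)\s+conn\(.*?\bs=(?P<state>\w+).*?\)"
    r"\.handle_connect_msg accept replacing existing \(lossy\) channel"
)

def count_replacements(lines):
    """Count 'accept replacing existing (lossy) channel' events per peer IP."""
    counts = Counter()
    for line in lines:
        m = REPLACE_RE.search(line)
        if m:
            # The peer looks like "192.168.200.41:0/7"; keep only the IP part.
            counts[m.group("peer").split(":", 1)[0]] += 1
    return counts

SAMPLE = (
    "-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 "
    "s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg "
    "accept replacing existing (lossy) channel (new one lossy=1)"
)

if __name__ == "__main__":
    for peer, n in count_replacements([SAMPLE]).most_common():
        print(peer, n)
```

In practice the lines would be read from the active mgr's log file instead of `SAMPLE`; comparing the counts before and after stopping one mgr might show whether a single peer is responsible.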

I took a tcpdump of the lossy connections, and Wireshark reports several problems.
TCP:

893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    [TCP Window Full] 50102 → 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 TSval=2447622448 TSecr=577135410

[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

The Ceph dissector shows errors too:

Ceph UNKNOWN x22
    Filter Data
        [Source Node Name: Unknown]
        [Source Node Type: unknown (0x00)]
        [Destination Node Name: Unknown]
        [Destination Node Type: unknown (0x00)]
    Tag: Unknown (0x60)
        [Expert Info (Error/Undecoded): Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Severity level: Error]
            [Group: Undecoded]
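
To get a feel for how frequent the TCP-level problems are across the whole capture, the bracketed Wireshark annotations can be counted. The sketch below assumes the packet list has been exported from Wireshark as plain text, one packet per line, in the same layout as the TCP line quoted above; `FLAG_RE` and `tally_flags` are hypothetical names.

```python
import re
from collections import Counter

# Wireshark analysis notes appear in square brackets, e.g. "[TCP Window Full]".
# Plain TCP flag lists such as "[ACK]" are bracketed as well, so only notes
# starting with "TCP " or "Reassembly error" are counted.
FLAG_RE = re.compile(r"\[(TCP [^\]]+|Reassembly error[^\]]*)\]")

def tally_flags(lines):
    """Count Wireshark analysis annotations in exported packet-list text."""
    counts = Counter()
    for line in lines:
        for flag in FLAG_RE.findall(line):
            # Drop the detail after the first comma:
            # "Reassembly error, protocol TCP: ..." becomes "Reassembly error".
            counts[flag.split(",", 1)[0]] += 1
    return counts

SAMPLE_PACKET = (
    "893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    "
    "[TCP Window Full] 50102 \u2192 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 "
    "TSval=2447622448 TSecr=577135410[Reassembly error, protocol TCP: "
    "New fragment overlaps old data (retransmission?)]"
)

if __name__ == "__main__":
    for flag, n in tally_flags([SAMPLE_PACKET]).most_common():
        print(flag, n)
```

A high share of window-full or reassembly notes on one mgr-to-mgr flow would point at that pair of hosts rather than at the cluster as a whole.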

I have tried several things to narrow this down, but none of them changed anything:
- swapped all NICs
- switched from jemalloc to tcmalloc

I have a pcap file with the dump that I can share; just email me for it.

Related issues 1 (0 open — 1 closed)

Updated by Corin Langosch over 5 years ago

Updated by Corin Langosch about 5 years ago

Updated by Corin Langosch about 5 years ago

Updated by Corin Langosch about 5 years ago

Updated by Nathan Cutler about 5 years ago

Updated by Jan Smets about 5 years ago

Updated by yite gu almost 5 years ago