Bug #36471
openconnection resetting tcp errors between mgr daemons
Description
Since upgrading from luminous to mimic, I have had problems with ceph-mgr connections in a production cluster.
In the active daemon I can see a lot of messages like these:
-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
As I understand it, this means the daemon is re-establishing the connection because something went wrong with it, which tcpdump confirms.
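To see which peers trigger these replacements most often, the messages can be tallied per peer address. The following is only a minimal sketch: the regular expression is derived from the single log line quoted above, and the function name `count_replacements` is made up for illustration, so other messenger versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern based only on the one log line quoted in this report (assumption:
# other "accept replacing existing" lines follow the same layout).
PATTERN = re.compile(
    r"-- (?P<local>\S+) >> (?P<peer>\S+) conn\(.*?\)\.handle_connect_msg "
    r"accept replacing existing \(lossy\) channel"
)


def count_replacements(lines):
    """Count 'accept replacing existing (lossy) channel' messages per peer IP."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Peer looks like "192.168.200.41:0/7"; keep only the IP part.
            peer_ip = match.group("peer").split(":", 1)[0]
            counts[peer_ip] += 1
    return counts


sample = [
    "-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 "
    "s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg "
    "accept replacing existing (lossy) channel (new one lossy=1)",
]

for peer, n in count_replacements(sample).most_common():
    print(peer, n)
```

The same function accepts an open file object, so it can be run over the active mgr's log file to check whether the resets come from one peer or from all of them.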
Initially this caused memory consumption to grow linearly, so I opened https://tracker.ceph.com/issues/35998 and searched for a leak in the mgr. I did find a leak, but it was not the root cause of the connection resets.
After deploying daemons with the leak fixed, I still see those messages on the active daemon. The weirdest thing is that stopping one of the mgrs sometimes "fixes it". I have 3 mgrs in total. Sometimes I can run 2 or 1 and the problem does not appear. A series of mgr restarts in no apparent order can make it go away or reappear.
I took a tcpdump of the lossy connections and found that Wireshark reports several problems:
TCP:
893 9.499666 192.168.200.41 192.168.200.33 TCP 8298 [TCP Window Full] 50102 → 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 TSval=2447622448 TSecr=577135410
[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]
The Ceph dissector shows errors too:
Ceph UNKNOWN x22
Filter Data
[Source Node Name: Unknown]
[Source Node Type: unknown (0x00)]
[Destination Node Name: Unknown]
[Destination Node Type: unknown (0x00)]
Tag: Unknown (0x60)
[Expert Info (Error/Undecoded): Unknown tag. This is either an error by the sender or an indication that the dissector is out of date.]
[Unknown tag. This is either an error by the sender or an indication that the dissector is out of date.]
[Severity level: Error]
[Group: Undecoded]
I have tried several things to narrow this down, but none of them changed anything:
- swapped all NICs
- switched from jemalloc to tcmalloc
I have a pcap file with the dump, which I can share; just email me for it.