Bug #36471


connection resetting tcp errors between mgr daemons

Added by Tomasz Sętkowski over 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading from Luminous to Mimic I have had problems with connections in ceph-mgr in our production cluster.
In the active daemon I can see a lot of messages like these:

-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77fac07400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

As I understand it, this means the daemon is restarting the connection because something went wrong with it (which tcpdump confirms).
Initially this caused memory consumption to grow linearly, so I opened https://tracker.ceph.com/issues/35998 and searched for a leak in the mgr. I did find a leak, but it was not the root cause of the connection resets.
After deploying daemons with the leak fixed, I can still see those messages on the active daemon. The weirdest thing is that sometimes stopping one of the mgrs "fixes" it. I have 3 mgrs in total; sometimes I can run 2, or even just 1, and the problem does not appear. A series of mgr restarts in no apparent order can make it go away or reappear.
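
For anyone wanting to quantify the reset rate, a rough script along these lines can count the lossy-channel replacements and ms_handle_reset events per minute from the mgr log. This is only a sketch, not part of Ceph; the default log path is an assumption, and the timestamp format follows the log excerpts in this ticket.

```python
#!/usr/bin/env python3
"""Count messenger resets per minute in a ceph-mgr log (diagnostic sketch)."""
import re
import sys
from collections import Counter

# Assumed default location; pass the real path as the first argument.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/ceph/ceph-mgr.log"

# Bucket by "YYYY-MM-DD HH:MM" as seen in the log excerpts in this ticket.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")

replacements = Counter()  # "accept replacing existing (lossy) channel"
resets = Counter()        # "ms_handle_reset"

with open(LOG_PATH) as f:
    for line in f:
        m = TS_RE.search(line)
        if not m:
            continue
        minute = m.group(1)
        if "replacing existing (lossy) channel" in line:
            replacements[minute] += 1
        if "ms_handle_reset" in line:
            resets[minute] += 1

for minute in sorted(set(replacements) | set(resets)):
    print(f"{minute}  lossy replacements={replacements[minute]:3d}  "
          f"ms_handle_reset={resets[minute]:3d}")
```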

I made a tcpdump of the lossy connections and found that Wireshark reports several problems:
TCP:

893    9.499666    192.168.200.41    192.168.200.33    TCP    8298    [TCP Window Full] 50102 → 6803 [ACK] Seq=20328 Ack=410 Win=25984 Len=8232 TSval=2447622448 TSecr=577135410[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]


The Ceph dissector shows errors too:

Ceph UNKNOWN x22
    Filter Data
        [Source Node Name: Unknown]
        [Source Node Type: unknown (0x00)]
        [Destination Node Name: Unknown]
        [Destination Node Type: unknown (0x00)]
    Tag: Unknown (0x60)
        [Expert Info (Error/Undecoded): Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Unknown tag.  This is either an error by the sender or an indication that the dissector is out of date.]
            [Severity level: Error]
            [Group: Undecoded]

I have tried several things to narrow this down, and none of them changed anything:
- swapped all NICs
- switched from jemalloc to tcmalloc

I have a pcap file with the dump, which I can share; just email me for it.
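
For reference, here is a rough sketch of how such a capture can be scanned for suspected retransmissions and TCP resets per flow. It assumes scapy is installed, the file name is only illustrative, and the duplicate-sequence heuristic merely approximates Wireshark's retransmission/overlap analysis.

```python
#!/usr/bin/env python3
"""Scan a pcap of mgr traffic for suspected retransmissions and resets (sketch)."""
from collections import Counter, defaultdict

from scapy.all import IP, TCP, rdpcap

PCAP = "mgr-lossy.pcap"  # illustrative capture file name

seen = defaultdict(set)   # flow -> sequence numbers already seen with payload
suspected = Counter()     # flow -> suspected retransmissions
rst_count = Counter()     # flow -> RST segments

for pkt in rdpcap(PCAP):
    if IP not in pkt or TCP not in pkt:
        continue
    tcp = pkt[TCP]
    flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
    if tcp.flags & 0x04:  # RST bit set
        rst_count[flow] += 1
    if bytes(tcp.payload):
        # Payload with a sequence number we have already seen on this flow
        # is counted as a suspected retransmission/overlap.
        if tcp.seq in seen[flow]:
            suspected[flow] += 1
        seen[flow].add(tcp.seq)

for flow in sorted(set(suspected) | set(rst_count)):
    src, sport, dst, dport = flow
    print(f"{src}:{sport} -> {dst}:{dport}  "
          f"suspected retransmissions={suspected[flow]}  RST={rst_count[flow]}")
```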


Related issues 1 (0 open, 1 closed)

Related to mgr - Bug #35998: ceph-mgr active daemon memory leak since mimic (Resolved, 09/15/2018)

Actions #1

Updated by Corin Langosch over 5 years ago

I'm running a small test cluster with 1 mon and 1 mgr, both on the same host. My mgr log gets filled pretty fast with the following entries and I see constant memory growth of the mgr daemon. The cluster state is HEALTHY.

mgr1_1 | 2018-10-17 11:49:20.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:20.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:20.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:20.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:20.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:20.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb100a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:21.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:21.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:21.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:21.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:21.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb101800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:22.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:22.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:22.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:22.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:22.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:22.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:23.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:23.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb142e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:23.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:23.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:23.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:24.736 7f873b3cc700 1 mgr send_beacon active
mgr1_1 | 2018-10-17 11:49:24.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:24.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb143c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:24.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:24.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:24.804 7f87423da700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:25.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:25.804 7f8742bdb700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb144a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2018-10-17 11:49:25.804 7f873ebd3700 0 client.0 ms_handle_reset on 10.0.0.1:6800/15
mgr1_1 | 2018-10-17 11:49:25.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb145100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
mgr1_1 | 2018-10-17 11:49:25.804 7f87433dc700 0 -- 10.0.0.1:6800/15 >> 10.0.0.1:0/15 conn(0x55ecbb145100 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

Actions #2

Updated by Corin Langosch about 5 years ago

The bug is still present in the latest Ceph version, 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable).

mgr1_1 | 2019-03-10 23:09:47.517 7f53a7a0a700 1 mgr send_beacon active
mgr1_1 | 2019-03-10 23:09:47.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c77800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:47.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:47.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:48.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98700 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:48.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:48.529 7f53aea18700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c98e00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:49.517 7f53a7a0a700 1 mgr send_beacon active
mgr1_1 | 2019-03-10 23:09:49.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c99500 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:49.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:49.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c99c00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:50.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c9a300 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
mgr1_1 | 2019-03-10 23:09:50.529 7f53ab211700 0 client.0 ms_handle_reset on 10.0.1.2:6800/13
mgr1_1 | 2019-03-10 23:09:50.529 7f53afa1a700 0 -- 10.0.1.2:6800/13 >> 10.0.1.2:0/13 conn(0x555987c9aa00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

I have a single mgr running. It seems to be trying to talk to itself (10.0.1.2:6800/13 >> 10.0.1.2:0/13) ;)

Actions #3

Updated by Corin Langosch about 5 years ago

I added another two mgrs, but nothing changed. The first mgr still produces the errors; the other two don't.

mgr1 is also eating up all memory over time (5 GB RAM in about 1 day):
root 3104 1.2 2.0 7490484 5450612 ? Sl Mar10 13:22 ceph-mgr -d -i mgr1
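
To put a number on the growth rate, a minimal sketch like the following samples the daemon's RSS from /proc (Linux only; pass the PID from the ps output above on the command line). This is just a diagnostic aid, not part of Ceph.

```python
#!/usr/bin/env python3
"""Sample the RSS of a ceph-mgr process to quantify its growth rate (sketch)."""
import sys
import time

PID = sys.argv[1]   # e.g. the ceph-mgr PID shown by ps
INTERVAL = 60       # seconds between samples


def rss_kb(pid: str) -> int:
    """Return VmRSS (in kB, as reported by the kernel) from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")


prev = rss_kb(PID)
print(f"start RSS: {prev} kB")
while True:
    time.sleep(INTERVAL)
    cur = rss_kb(PID)
    print(f"RSS: {cur} kB  (delta {cur - prev:+d} kB over {INTERVAL}s)")
    prev = cur
```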

Actions #4

Updated by Corin Langosch about 5 years ago

Could the severity please be changed to major? This bug has the potential to OOM the whole system.

Actions #5

Updated by Nathan Cutler about 5 years ago

  • Related to Bug #35998: ceph-mgr active daemon memory leak since mimic added
Actions #6

Updated by Jan Smets almost 5 years ago

I'm having the same problem.
Running 13.2.5 with Rook 0.9.

I had the problem before 13.2.5 and was waiting for #35998, but unfortunately the issue still persists.

A single mgr causes an OOM after 2-3 days.

Actions #7

Updated by yite gu almost 5 years ago

I'm having the same problem.
My Ceph version is 12.2.12.
