Bug #39039

mon connection reset, command not resent

Added by Sage Weil almost 2 years ago. Updated 7 months ago.

Status: Need More Info
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

2019-03-28 19:35:52.315 7f45995fe700  1 -- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] <== client.? 172.21.15.31:0/386512091 6 ==== mon_command({"prefix": "osd last-stat-seq", "id": 0} v 0) v1 ==== 82+0+0 (secure 0 0 0) 0x55da2b811400 con 0x55da2b56e400
...
2019-03-28 19:35:52.317 7f4596df9700 20 -- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 msgr2=0x55da2b844c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2019-03-28 19:35:52.317 7f4596df9700  1 -- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 msgr2=0x55da2b844c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 40
2019-03-28 19:35:52.317 7f4596df9700  1 -- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 msgr2=0x55da2b844c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-03-28 19:35:52.317 7f4596df9700 20 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55da2b374990 tx=0x55da2b570d00).handle_read_frame_preamble_main r=-1
2019-03-28 19:35:52.317 7f4596df9700  1 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55da2b374990 tx=0x55da2b570d00).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-03-28 19:35:52.317 7f4596df9700 10 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55da2b374990 tx=0x55da2b570d00)._fault
2019-03-28 19:35:52.317 7f4596df9700  2 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55da2b374990 tx=0x55da2b570d00)._fault on lossy channel, failing
2019-03-28 19:35:52.317 7f4596df9700  1 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55da2b374990 tx=0x55da2b570d00).stop
2019-03-28 19:35:52.317 7f4596df9700 10 --2- [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] >> 172.21.15.31:0/386512091 conn(0x55da2b56e400 0x55da2b844c00 unknown :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).discard_out_queue started

but the client does not reconnect, and the command times out:
2019-03-28T19:35:52.319 INFO:teuthology.orchestra.run.smithi031:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd last-stat-seq osd.1
2019-03-28T19:37:52.369 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-03-28T19:37:52.369 ERROR:teuthology.run_tasks:Saw exception from tasks.

/a/sage-2019-03-28_19:01:44-rados-wip-sage-testing-2019-03-28-1123-distro-basic-smithi/3780941
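
For context, the teuthology command above just drives a single mon command through the CLI. A minimal sketch using the python-rados bindings (hypothetical; it assumes a reachable test cluster with a local ceph.conf and client.admin keyring) issues the same request, and is where the hang would surface if the monitor resets the session and the command is not resent:

import json
import rados

# connect using a local configuration file and default client.admin credentials
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# same command the test sends: {"prefix": "osd last-stat-seq", "id": 0}
cmd = json.dumps({"prefix": "osd last-stat-seq", "id": 0})

# if the monitor drops the session mid-request, the client is expected to
# resend the command after reconnecting; with this bug the call instead
# blocks until the timeout fires
ret, outbuf, outs = cluster.mon_command(cmd, b'', timeout=120)
print(ret, outbuf, outs)

cluster.shutdown()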

ceph-user.pre-install (150 Bytes) Tony Davies, 04/20/2020 02:24 AM

APKBUILD (13.5 KB) Tony Davies, 04/20/2020 02:24 AM

git_search_fix.patch View (1.79 KB) Tony Davies, 04/20/2020 02:24 AM

musl-fixes.patch View (469 Bytes) Tony Davies, 04/20/2020 02:24 AM

nodeenv-armv8l.patch View (1.5 KB) Tony Davies, 04/20/2020 02:24 AM

PurgeQueue.cc.patch View (1.34 KB) Tony Davies, 04/20/2020 02:24 AM

mon.a.tar.gz (194 KB) Sunny Kumar, 06/25/2020 03:42 PM


Related issues

Related to RADOS - Bug #40521: cli timeout (e.g., ceph pg dump) New
Related to Ceph - Bug #44197: read_until returns Operation not permitted in a mixed arch client MON session New
Related to RADOS - Bug #45647: "ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml New

History

#1 Updated by Greg Farnum almost 2 years ago

So it's not the command specifically but that the client doesn't reconnect to a working monitor, right?

#2 Updated by Greg Farnum over 1 year ago

  • Related to Bug #40521: cli timeout (e.g., ceph pg dump) added

#3 Updated by Kefu Chai about 1 year ago

2019-10-28T04:17:44.739+0000 7f469f50c700  1 --2- [v2:172.21.15.159:3305/0,v1:172.21.15.159:6794/0] >> [v2:172.21.15.154:3301/0,v1:172.21.15.154:6790/0] conn(0x562a2e9d4480 0x562a2e85a680 secure :-1 s=READY pgs=40 cs=1 l=0 rx=0x562a2ea8a5a0 tx=0x562a2e9c2900).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

/a/kchai-2019-10-28_03:11:42-rados-wip-kefu-testing-2019-10-28-0018-distro-basic-smithi/4450603/remote/smithi159/log/ceph-mon.q.log.gz

#4 Updated by Patrick Donnelly about 1 year ago

  • Status changed from 12 to New

#5 Updated by Sage Weil 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

/a/sage-2020-03-07_14:00:00-rados-master-distro-basic-smithi/4834734

#7 Updated by Sage Weil 10 months ago

  • Status changed from In Progress to Need More Info

can't reproduce :(

#8 Updated by Brad Hubbard 10 months ago

  • Related to Bug #44197: read_until returns Operation not permitted in a mixed arch client MON session added

#9 Updated by Brad Hubbard 10 months ago

#44197 looks related?

#10 Updated by Tony Davies 9 months ago

This also continues to happen on Octopus; I just tested on 15.2.0.

I have attached the build instructions I used to compile on armv7. I can also provide the Alpine package or the Docker image I use for Rook.

#11 Updated by Tony Davies 9 months ago

I tested this on a lab cluster after disabling cephx per https://docs.ceph.com/docs/octopus/rados/configuration/auth-config-ref/#disabling-cephx

My armv7 instance was able to get a response from the test cluster using

ceph -s

Re-enabling cephx on the same lab cluster caused the hang to reappear, so I guess that narrows the issue down to cephx.
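
For reference, disabling cephx per that document amounts to setting the following in the [global] section of ceph.conf and restarting the daemons:

auth_cluster_required = none
auth_service_required = none
auth_client_required = none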

#12 Updated by Sunny Kumar 7 months ago

Hitting this issue on Octopus, on Fedora 32:


2020-06-25T10:28:34.853-0400 ffff8f4803c0  1 -- [v2:127.0.0.1:40880/0,v1:127.0.0.1:40881/0] >> 127.0.0.1:0/3944021820 conn(0xaaab166c6000 msgr2=0xaaab16604d80 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-06-25T10:28:34.853-0400 ffff8f4803c0  1 --2- [v2:127.0.0.1:40880/0,v1:127.0.0.1:40881/0] >> 127.0.0.1:0/3944021820 conn(0xaaab166c6000 0xaaab16604d80 secure :-1 s=READY pgs=1 cs=0 l=1 rx=0xaaab16631e30 tx=0xaaab165ea420).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

#13 Updated by Neha Ojha 7 months ago

  • Related to Bug #45647: "ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml added
