
Bug #42783

test failure: due to client closed connection

Added by Kefu Chai over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On the client side:

2019-11-09T09:15:02.645 INFO:teuthology.orchestra.run.smithi190:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'mon.*' injectargs --mon_osd_down_out_interval=0
2019-11-09T09:15:03.021 INFO:teuthology.orchestra.run.smithi190.stderr:2019-11-09 09:15:03.021 7f4ef149e700  0 --2- 172.21.15.190:0/663686713 >> [v2:172.21.15.40:3300/0,v1:172.21.15.40:6789/0] conn(0x7f4eec0d61e0 0x7f4eec0b16c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).send_auth_request get_initial_auth_request returned -2
2019-11-09T09:15:03.029 INFO:teuthology.orchestra.run.smithi190.stderr:Error ENXIO: problem getting command descriptions from mon.b
2019-11-09T09:15:03.029 INFO:teuthology.orchestra.run.smithi190.stderr:mon.b: problem getting command descriptions from mon.b
2019-11-09T09:15:03.305 INFO:teuthology.orchestra.run.smithi190.stderr:mon.a: injectargs:mon_osd_down_out_interval = '0' (not observed, change may require restart)
2019-11-09T09:15:03.509 INFO:teuthology.orchestra.run.smithi190.stderr:mon.c: injectargs:mon_osd_down_out_interval = '0' (not observed, change may require restart)
2019-11-09T09:15:03.527 DEBUG:teuthology.orchestra.run:got remote process result: 6

On the monitor side:

2019-11-09 09:15:02.793 7fc1d4427700 10 mon.a@1(peon).auth v6 _assign_global_id 4477 (max 54096)
2019-11-09 09:15:02.793 7fc1d4427700 10 cephx server client.admin: start_session server_challenge 699b1d50e6d8a0fc
2019-11-09 09:15:02.793 7fc1d4427700  2 mon.a@1(peon) e1 send_reply 0x407c960 0x41f41c0 auth_reply(proto 2 0 (0) Success) v1
2019-11-09 09:15:02.793 7fc1d4427700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] --> v1:172.21.15.190:0/663686713 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x41f41c0 con 0x40d0000
2019-11-09 09:15:02.793 7fc1d3425700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> v1:172.21.15.190:0/663686713 conn(0x40d0000 legacy=0x4071800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2019-11-09 09:15:02.793 7fc1d3425700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> v1:172.21.15.190:0/663686713 conn(0x40d0000 legacy=0x4071800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-11-09 09:15:02.793 7fc1d3425700  1 --1- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> v1:172.21.15.190:0/663686713 conn(0x40d0000 0x4071800 :6789 s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
2019-11-09 09:15:02.793 7fc1d3425700  1 --1- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> v1:172.21.15.190:0/663686713 conn(0x40d0000 0x4071800 :6789 s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing

2019-11-09 09:15:03.301 7fc1d4427700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] send_to--> mon [v2:172.21.15.40:3300/0,v1:172.21.15.40:6789/0] -- forward(log(1 entries from seq 63 at 2019-11-09 09:15:03.302997) v1 caps allow * tid 90 con_features 4611087854031667199) v4 -- ?+0 0x406c000
2019-11-09 09:15:03.301 7fc1d4427700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] --> [v2:172.21.15.40:3300/0,v1:172.21.15.40:6789/0] -- forward(log(1 entries from seq 63 at 2019-11-09 09:15:03.302997) v1 caps allow * tid 90 con_features 4611087854031667199) v4 -- 0x406c000 con 0x3213180
2019-11-09 09:15:03.301 7fc1d3425700  1 --2- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> [v2:172.21.15.40:3300/0,v1:172.21.15.40:6789/0] conn(0x3213180 0x3e75b80 secure :-1 s=READY pgs=13 cs=5 l=0 rx=0x3fb7660 tx=0x3fb9e00).handle_read_frame_epilogue_main read frame epilogue bytes=32
2019-11-09 09:15:03.305 7fc1d3425700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> 172.21.15.190:0/663686713 conn(0x4002880 msgr2=0x401e680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 37
2019-11-09 09:15:03.305 7fc1d3425700  1 -- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> 172.21.15.190:0/663686713 conn(0x4002880 msgr2=0x401e680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-11-09 09:15:03.305 7fc1d3425700  1 --2- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> 172.21.15.190:0/663686713 conn(0x4002880 0x401e680 secure :-1 s=READY pgs=8 cs=0 l=1 rx=0x4376310 tx=0x41bc380).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-11-09 09:15:03.305 7fc1d3425700  1 --2- [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] >> 172.21.15.190:0/663686713 conn(0x4002880 0x401e680 secure :-1 s=READY pgs=8 cs=0 l=1 rx=0x4376310 tx=0x41bc380).stop
2019-11-09 09:15:03.305 7fc1d4427700 10 mon.a@1(peon) e1 ms_handle_reset 0x4002880 172.21.15.190:0/663686713
2019-11-09 09:15:03.305 7fc1d4427700 10 mon.a@1(peon) e1 reset/close on session client.4474 172.21.15.190:0/663686713

rados/thrash-erasure-code-big/{ceph.yaml cluster/{12-osds.yaml openstack.yaml} msgr-failures/fastclose.yaml objectstore/bluestore-stupid.yaml rados.yaml recovery-overrides/{default.yaml} supported-random-distro$/{ubuntu_16.04.yaml} thrashers/morepggrow.yaml thrashosds-health.yaml workloads/ec-rados-plugin=lrc-k=4-m=2-l=3.yaml}

/a/yuriw-2019-11-08_20:11:33-rados-wip-yuri6-testing_2PRs-PR30949-nautilus-distro-basic-smithi/4484882

History

#1 Updated by Kefu Chai over 4 years ago

The connections were closed by the client repeatedly. I wonder if this is expected: we have "msgr-failures/fastclose.yaml", and in fastclose.yaml:

overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 2500
        ms tcp read timeout: 5
        mon client directed command retry: 5
    log-whitelist:
      - \(OSD_SLOW_PING_TIME

The client side might actually have closed the connection before getting the command descriptions from the monitor.
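If the injected socket failures are the cause, the interaction is a numbers game: each attempt to fetch command descriptions can land on an injected close, and the client only has a bounded retry budget ("mon client directed command retry: 5") before it gives up with ENXIO. A minimal sketch of that interaction, assuming a fixed per-attempt failure probability (this is a hypothetical model, not Ceph code; `fetch_command_descriptions`, `p_fail`, and `failure_rate` are illustrative names):

```python
import random

def fetch_command_descriptions(retries, p_fail, rng):
    """Return True if any of the 1 + retries attempts succeeds.

    Each attempt independently fails with probability p_fail, standing
    in for an injected socket close during the mon handshake.
    """
    for _ in range(retries + 1):
        if rng.random() >= p_fail:
            return True   # got the command descriptions
    return False          # every attempt failed -> the CLI reports ENXIO

def failure_rate(retries, p_fail, trials=100_000, seed=42):
    """Estimate, by simulation, how often the whole command fails."""
    rng = random.Random(seed)
    failures = sum(
        not fetch_command_descriptions(retries, p_fail, rng)
        for _ in range(trials)
    )
    return failures / trials
```

With, say, a 20% per-attempt failure chance, the end-to-end failure probability drops roughly as `p_fail ** (retries + 1)`, which is why a small retry budget still fails occasionally under aggressive failure injection.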

#2 Updated by Neha Ojha over 4 years ago

  • Priority changed from Normal to High

Marking this high, since this is showing up a lot on nautilus.

#3 Updated by Sage Weil over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31641

This is just the 'mon tell' implementation being flaky up through nautilus. It has been rewritten for octopus to avoid this problem.

For nautilus, we merged some band-aids that might be why we are seeing this more:
b560f03dcd01fd73c6870a3bbf9ff20bf01ec2fd
ad3c03f38131c8871c596d26239a8d02cde43026

I suggest for nautilus we just increase the number of tell retries. See https://github.com/ceph/ceph/pull/31641
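The suggested fix amounts to wrapping the directed command in a larger retry budget. A sketch of such a bounded retry loop (hypothetical names; the actual Ceph client implements this in C++, and `TransientConnectionReset` stands in for the injected socket closes seen in the logs above):

```python
class TransientConnectionReset(Exception):
    """Stands in for the peer-closed-connection errors in the logs."""

def tell_mon(send_once, retries=7):
    """Call send_once() up to 1 + retries times.

    The retry budget plays the role of the "mon client directed
    command retry" option; once it is exhausted, the last error is
    re-raised, which is roughly when the CLI surfaces ENXIO.
    """
    last_error = None
    for _ in range(retries + 1):
        try:
            return send_once()
        except TransientConnectionReset as e:
            last_error = e
    raise last_error
```

Raising the budget does not remove the race; it only shrinks the chance that every attempt happens to hit an injected failure, which is all a stopgap for nautilus needs to achieve.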

#4 Updated by Neha Ojha over 4 years ago

  • Status changed from Fix Under Review to Resolved
