Bug #42711

closed

"[Errno None] Unable to connect to port x.x.x.x" in smoke

Added by Yuri Weinstein over 4 years ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: smoke
Crash signature (v1):
Crash signature (v2):

Description

This seems to happen on all releases: luminous, mimic, and nautilus.

Runs:
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:42-smoke-nautilus-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:32-smoke-mimic-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:19-smoke-luminous-distro-basic-smithi/

Jobs: see the failed and dead jobs, e.g. `SSH connection to smithi081 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'`

Log: http://qa-proxy.ceph.com/teuthology/yuriw-2019-11-08_14:54:42-smoke-nautilus-distro-basic-smithi/4484163/teuthology.log

2019-11-08T15:50:17.214 INFO:teuthology.orchestra.run.smithi047:> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=dd7922d9362fe3f8587e38f50383b76b1fbbee77 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/direct_io/test_short_dio_read
2019-11-08T15:50:17.267 INFO:tasks.workunit.client.0.smithi047.stdout:writing first 3 bytes of 10k file
2019-11-08T15:50:17.268 INFO:tasks.workunit.client.0.smithi047.stdout:reading O_DIRECT
2019-11-08T15:50:17.268 INFO:tasks.workunit.client.0.smithi047.stdout:got 10000
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run:Running command with timeout 3600
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run.smithi047:Running:
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run.smithi047:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
2019-11-08T15:50:17.369 INFO:tasks.workunit:Running workunit direct_io/test_sync_io...
2019-11-08T15:50:17.369 INFO:teuthology.orchestra.run.smithi047:Running (workunit test direct_io/test_sync_io):
2019-11-08T15:50:17.369 INFO:teuthology.orchestra.run.smithi047:> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=dd7922d9362fe3f8587e38f50383b76b1fbbee77 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/direct_io/test_sync_io
2019-11-08T15:50:18.561 INFO:tasks.workunit.client.0.smithi047.stdout:writing pattern
2019-11-08T15:50:18.561 INFO:tasks.workunit.client.0.smithi047.stdout:read_direct buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_file buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_sync buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_file buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_direct buf_align 0 offset 4190208 len 2048
..........
2019-11-08T16:06:16.002 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi047.front.sepia.ceph.com', 'timeout': 60}
2019-11-08T16:06:16.004 DEBUG:tasks.ceph:Missed logrotate, host unreachable
2019-11-08T16:06:19.067 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.47
2019-11-08T16:06:19.067 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 136, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 286, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 420, in _run_tests
    args=args,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 420, in run
    raise ConnectionLostError(command=quote(args), node=name)
ConnectionLostError: SSH connection to smithi047 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
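
For triaging similar runs, here is a minimal sketch that scans a teuthology log for the last workunit started before the host became unreachable. The function name and the two regular expressions are hypothetical; they are modelled only on the log lines quoted above and may need adjusting for other log formats.

```python
import re

# Patterns modelled on the log lines quoted in this ticket (an assumption,
# not an official teuthology log format specification).
WORKUNIT_RE = re.compile(r"INFO:tasks\.workunit:Running workunit (\S+?)\.\.\.")
LOST_RE = re.compile(r"Unable to connect to port \d+ on (\S+)")


def last_workunit_before_loss(lines):
    """Return (workunit, address) for the first connection loss, else None.

    `workunit` is the most recent workunit announced before the loss,
    or None if no workunit had been announced yet.
    """
    current = None
    for line in lines:
        started = WORKUNIT_RE.search(line)
        if started:
            current = started.group(1)
            continue
        lost = LOST_RE.search(line)
        if lost:
            return current, lost.group(1)
    return None


if __name__ == "__main__":
    sample = [
        "2019-11-08T15:50:17.369 INFO:tasks.workunit:Running workunit direct_io/test_sync_io...",
        "2019-11-08T16:06:19.067 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.47",
    ]
    print(last_workunit_before_loss(sample))
```

Run against the excerpt above, this points at direct_io/test_sync_io as the workunit in flight when smithi047 stopped answering on port 22.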
Actions #1

Updated by David Galloway over 4 years ago

From smithi047's console log:

[  446.169889] BUG: unable to handle kernel paging request at 0000000000023420
[  446.176904] IP: native_queued_spin_lock_slowpath+0x174/0x1c0
[  446.182588] PGD 0 P4D 0 
[  446.185139] Oops: 0002 [#1] SMP PTI

Entering kdb (current=0xffff9f0f183cdb00, pid 9588) on processor 1 Oops: (null)
due to oops @ 0xffffffff860e2c14
CPU: 1 PID: 9588 Comm: test_sync_io Tainted: G        W        4.15.0-66-generic #75-Ubuntu
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
RIP: 0010:native_queued_spin_lock_slowpath+0x174/0x1c0
RSP: 0018:ffffb28e8459fc48 EFLAGS: 00010202
RAX: 0000000000023420 RBX: ffff9f0ed9d2f330 RCX: 0000000000003673
RDX: ffff9f0f3fc63400 RSI: 0000000000080000 RDI: ffff9f0ed9d2f388
RBP: ffffb28e8459fc48 R08: 000000010000c8fe R09: ffff9f0ed9d2d568
R10: ffffb28e8459fd90 R11: ffff9f0f3ffd5000 R12: ffff9f0ed9d2d728
R13: ffff9f0ed9d2d7b0 R14: ffff9f0ed9d2f388 R15: 0000000000000000
FS:  00007f281ce874c0(0000) GS:ffff9f0f3fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000023420 CR3: 00000008550d2006 CR4: 00000000001606e0
Call Trace:
 _raw_spin_lock+0x21/0x30
 locked_inode_to_wb_and_lock_list+0x5b/0x140
 __mark_inode_dirty+0x1f0/0x3b0
 ceph_write_iter+0x961/0xb90 [ceph]
 new_sync_write+0xe7/0x140
 ? ceph_direct_read_write+0xb60/0xb60 [ceph]
more> 
Actions #2

Updated by Jeff Layton over 4 years ago

Hmm, so we hit a bogus address, possibly a use-after-free or something along those lines. Unfortunately I can't tell much more without a core (and a way to analyse it).

v4.15 is pretty ancient at this point. Any chance you could move this testing to use newer kernels? It's not clear to me that the ubuntu kernel maintainers are aggressively backporting ceph patches.

Actions #3

Updated by Jeff Layton over 4 years ago

It looks like ubuntu pulled in 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15, but that patch should never have gone into kernels this old, as it does not account for changes to the internal inode handling API.

I sent a corrected, one-off backport to the stable maintainers. They backed out the bad backport and applied the good one to the v4.19 series, but it looks like the ubuntu maintainers have not yet followed suit. I'd contact the ubuntu folks and reference this discussion:

https://www.spinics.net/lists/ceph-users/msg55771.html

You may want to downrev the kernel to -66.74 or earlier until they get this resolved.

Actions #4

Updated by Jeff Layton over 4 years ago

I talked to the ubuntu maintainers on IRC and it sounds like 4.15.0-68.77 should get the fix.
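
A rough way to flag test nodes that may still carry the bad backport is to compare the kernel package version against the two versions mentioned in this thread. This is only a sketch: the helper names are hypothetical, and treating every build strictly between -66.74 and -68.77 as affected is an assumption, not something confirmed here. It also assumes the crashing kernel reported as `4.15.0-66-generic #75-Ubuntu` corresponds to package version 4.15.0-66.75.

```python
import re

# Bounds taken from the comments in this ticket. Treating everything strictly
# between them as affected is an assumption.
LAST_KNOWN_GOOD = (4, 15, 0, 66, 74)
EXPECTED_FIX = (4, 15, 0, 68, 77)


def parse_ubuntu_kernel(version):
    """Parse a package-style version such as '4.15.0-66.75' into a tuple of ints."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


def possibly_affected(version):
    """True if the version lies strictly between the last good and fixed builds."""
    return LAST_KNOWN_GOOD < parse_ubuntu_kernel(version) < EXPECTED_FIX


if __name__ == "__main__":
    for candidate in ("4.15.0-66.74", "4.15.0-66.75", "4.15.0-68.77"):
        print(candidate, possibly_affected(candidate))
```

Tuple comparison keeps the check simple, but it only makes sense within the 4.15.0 Ubuntu series discussed here.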

Actions #6

Updated by Laura Flores 11 months ago

  • Status changed from New to Resolved
  • Priority changed from Urgent to Normal

This seems resolved, but if it appears again in the smoke suite, feel free to reopen.
