Bug #42711 (closed)
"[Errno None] Unable to connect to port x.x.x.x" in smoke
Description
This appears on all releases: luminous, mimic, and nautilus.
Runs:
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:42-smoke-nautilus-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:32-smoke-mimic-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-11-08_14:54:19-smoke-luminous-distro-basic-smithi/
Jobs: see the failed and dead jobs, all with errors like `SSH connection to smithi081 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'`
```
2019-11-08T15:50:17.214 INFO:teuthology.orchestra.run.smithi047:> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=dd7922d9362fe3f8587e38f50383b76b1fbbee77 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/direct_io/test_short_dio_read
2019-11-08T15:50:17.267 INFO:tasks.workunit.client.0.smithi047.stdout:writing first 3 bytes of 10k file
2019-11-08T15:50:17.268 INFO:tasks.workunit.client.0.smithi047.stdout:reading O_DIRECT
2019-11-08T15:50:17.268 INFO:tasks.workunit.client.0.smithi047.stdout:got 10000
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run:Running command with timeout 3600
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run.smithi047:Running:
2019-11-08T15:50:17.269 INFO:teuthology.orchestra.run.smithi047:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
2019-11-08T15:50:17.369 INFO:tasks.workunit:Running workunit direct_io/test_sync_io...
```
```
2019-11-08T15:50:17.369 INFO:teuthology.orchestra.run.smithi047:Running (workunit test direct_io/test_sync_io):
2019-11-08T15:50:17.369 INFO:teuthology.orchestra.run.smithi047:> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=dd7922d9362fe3f8587e38f50383b76b1fbbee77 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/direct_io/test_sync_io
2019-11-08T15:50:18.561 INFO:tasks.workunit.client.0.smithi047.stdout:writing pattern
2019-11-08T15:50:18.561 INFO:tasks.workunit.client.0.smithi047.stdout:read_direct buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_file buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_sync buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_file buf_align 0 offset 4190208 len 1024
2019-11-08T15:50:18.562 INFO:tasks.workunit.client.0.smithi047.stdout:read_direct buf_align 0 offset 4190208 len 2048
..........
2019-11-08T16:06:16.002 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi047.front.sepia.ceph.com', 'timeout': 60}
2019-11-08T16:06:16.004 DEBUG:tasks.ceph:Missed logrotate, host unreachable
2019-11-08T16:06:19.067 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.47
2019-11-08T16:06:19.067 ERROR:teuthology.run_tasks:Saw exception from tasks.
```
```
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 136, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 286, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/workunit.py", line 420, in _run_tests
    args=args,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 420, in run
    raise ConnectionLostError(command=quote(args), node=name)
ConnectionLostError: SSH connection to smithi047 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
```
Updated by David Galloway over 4 years ago
From smithi047's console log:
```
[  446.169889] BUG: unable to handle kernel paging request at 0000000000023420
[  446.176904] IP: native_queued_spin_lock_slowpath+0x174/0x1c0
[  446.182588] PGD 0 P4D 0
[  446.185139] Oops: 0002 [#1] SMP PTI
Entering kdb (current=0xffff9f0f183cdb00, pid 9588) on processor 1 Oops: (null)
due to oops @ 0xffffffff860e2c14
CPU: 1 PID: 9588 Comm: test_sync_io Tainted: G        W  4.15.0-66-generic #75-Ubuntu
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
RIP: 0010:native_queued_spin_lock_slowpath+0x174/0x1c0
RSP: 0018:ffffb28e8459fc48 EFLAGS: 00010202
RAX: 0000000000023420 RBX: ffff9f0ed9d2f330 RCX: 0000000000003673
RDX: ffff9f0f3fc63400 RSI: 0000000000080000 RDI: ffff9f0ed9d2f388
RBP: ffffb28e8459fc48 R08: 000000010000c8fe R09: ffff9f0ed9d2d568
R10: ffffb28e8459fd90 R11: ffff9f0f3ffd5000 R12: ffff9f0ed9d2d728
R13: ffff9f0ed9d2d7b0 R14: ffff9f0ed9d2f388 R15: 0000000000000000
FS:  00007f281ce874c0(0000) GS:ffff9f0f3fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000023420 CR3: 00000008550d2006 CR4: 00000000001606e0
Call Trace:
 _raw_spin_lock+0x21/0x30
 locked_inode_to_wb_and_lock_list+0x5b/0x140
 __mark_inode_dirty+0x1f0/0x3b0
 ceph_write_iter+0x961/0xb90 [ceph]
 new_sync_write+0xe7/0x140
 ? ceph_direct_read_write+0xb60/0xb60 [ceph]
more>
```
Updated by Jeff Layton over 4 years ago
Hmm, so we hit a bogus address, possibly a use-after-free or something along those lines. Unfortunately I can't tell much more without a core (and a way to analyse it).
v4.15 is pretty ancient at this point. Any chance you could move this testing to use newer kernels? It's not clear to me that the ubuntu kernel maintainers are aggressively backporting ceph patches.
Updated by Jeff Layton over 4 years ago
It looks like ubuntu pulled in 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15, but that patch should never have gone into kernels this old, as it does not account for changes to the internal inode handling API.
I sent a corrected, one-off backport to the stable maintainers. They backed out the bad backport and applied the good one to the v4.19 series, but it looks like the ubuntu maintainers have not yet followed suit. I'd contact the ubuntu folks and reference this discussion:
https://www.spinics.net/lists/ceph-users/msg55771.html
You may want to downrev the kernel to -66.74 or earlier until they get this resolved.
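To make the affected range concrete, here is a small illustrative helper (purely hypothetical, not part of teuthology or any Ceph tooling) that checks whether an Ubuntu 4.15 kernel package version falls in the window this thread describes: newer than the last known-good 4.15.0-66.74, but older than 4.15.0-68.77, which is expected to carry the fix. It assumes the `ABI.upload` package-version format (e.g. `4.15.0-66.75`), not the `uname -r` form:

```python
def parse_abi(version: str) -> tuple:
    """Parse an Ubuntu kernel package version like '4.15.0-66.74'
    into its (ABI, upload) pair, e.g. (66, 74)."""
    abi_part = version.split("-", 1)[1]
    abi, upload = abi_part.split(".")
    return (int(abi), int(upload))

# Boundaries quoted in this thread:
LAST_GOOD = (66, 74)    # 4.15.0-66.74, before the bad backport landed
FIRST_FIXED = (68, 77)  # 4.15.0-68.77, expected to include the fix

def possibly_affected(version: str) -> bool:
    """True if the version lies strictly between the last good
    release and the first fixed one (tuple comparison orders
    (ABI, upload) pairs lexicographically)."""
    return LAST_GOOD < parse_abi(version) < FIRST_FIXED
```

For example, `possibly_affected("4.15.0-66.75")` is `True`, while both boundary versions return `False`.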
Updated by Jeff Layton over 4 years ago
I talked to the ubuntu maintainers on IRC and it sounds like 4.15.0-68.77 should get the fix.
Updated by Yuri Weinstein over 4 years ago
Confirmed this is not a problem with the testing kernel:
http://pulpito.front.sepia.ceph.com/yuriw-2019-11-08_23:16:07-smoke-mimic-testing-basic-smithi/#
http://pulpito.front.sepia.ceph.com/yuriw-2019-11-09_05:25:34-smoke-luminous-testing-basic-smithi/
http://pulpito.front.sepia.ceph.com/yuriw-2019-11-09_05:26:31-smoke-nautilus-testing-basic-smithi/
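The runs above were scheduled against the ceph-client "testing" kernel rather than the distro kernel (the `-testing-` token in the run names). As a rough sketch only, a teuthology override fragment for that might look like the following; the exact keys vary by teuthology version, so treat this as an assumption rather than a verified config:

```yaml
# Hypothetical teuthology fragment: install the ceph-client
# "testing" kernel branch on test nodes instead of the distro kernel.
kernel:
  branch: testing
  kdb: true
```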
Updated by Laura Flores 11 months ago
- Status changed from New to Resolved
- Priority changed from Urgent to Normal
This seems resolved, but if it appears again in the smoke suite, feel free to reopen.