Bug #11581

closed

"port 22: No route to host" errors in the labs

Added by Yuri Weinstein almost 9 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a placeholder to capture information on jobs that leave machines in an unusable state, e.g. ssh to a machine fails with the error "port 22: No route to host".

Actions #1

Updated by Yuri Weinstein almost 9 years ago

In Sepia, running teuthology-nuke --stale --owner scheduled_teuthology@teuthology --unlock failed:

ERROR:teuthology.nuke:Could not nuke the following targets:

ssh'ing to machines plana80 and plana87 failed:

ssh: connect to host plana80 port 22: No route to host

plana80 was locked by teuthology-2015-05-07_17:13:01-upgrade:firefly-x-next-distro-basic-multi/879321

plana87 was locked by teuthology-2015-05-07_17:05:02-upgrade:giant-x-next-distro-basic-multi/879209

ipmitool sol shows both plana80 and plana87 stuck in kdb debug mode

Both marked down for further analysis.

Actions #2

Updated by Ilya Dryomov almost 9 years ago

Two identical failures in ext4 xattr.

plana80:

<5>[  224.126924] XFS (sdd): Mounting Filesystem
<6>[  224.621228] XFS (sdd): Ending clean mount
<5>[  226.978492] XFS (sdb): Mounting Filesystem
<6>[  227.506227] XFS (sdb): Ending clean mount
<5>[  229.787749] XFS (sdc): Mounting Filesystem
<6>[  230.248503] XFS (sdc): Ending clean mount
<4>[ 1239.086319] perf samples too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
<6>[ 1239.095498] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.180 msecs
<3>[ 3126.099041] JBD2: sda1-8: jh->b_next_transaction (81273957,           (null), 0) != transaction (ffff8800bdad6500, 93718013)
<4>[ 3126.110117] ------------[ cut here ]------------
<4>[ 3126.114968] WARNING: CPU: 4 PID: 11297 at /build/buildd/linux-lts-trusty-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a9/0x1c0()

<4>[ 3126.173176] CPU: 4 PID: 11297 Comm: ceph-osd Tainted: G          I  3.13.0-27-generic #50~precise1-Ubuntu
<4>[ 3126.182792] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<4>[ 3126.190322]  0000000000000103 ffff880220f5f998 ffffffff817507d1 ffff88022728fff0
<4>[ 3126.197812]  0000000000000000 ffff880220f5f9d8 ffffffff8106af4c 00000000059605fd
<4>[ 3126.205303]  ffff88016937f4e0 00000000ffffffea ffff88021ec77400 ffff880036e0b510
<4>[ 3126.212798] Call Trace:
<4>[ 3126.215265]  [<ffffffff817507d1>] dump_stack+0x46/0x58
<4>[ 3126.220436]  [<ffffffff8106af4c>] warn_slowpath_common+0x8c/0xc0
<4>[ 3126.226475]  [<ffffffff8106af9a>] warn_slowpath_null+0x1a/0x20
<4>[ 3126.232339]  [<ffffffff8127acb9>] __ext4_handle_dirty_metadata+0x1a9/0x1c0
<4>[ 3126.239246]  [<ffffffff8128c93c>] ext4_xattr_release_block+0x10c/0x1d0
<4>[ 3126.245804]  [<ffffffff8128d0d5>] ext4_xattr_block_set+0x3a5/0x710
<4>[ 3126.252014]  [<ffffffff8128db80>] ext4_xattr_set_handle+0x370/0x490
<4>[ 3126.258314]  [<ffffffff8128dd39>] ? ext4_xattr_set+0x99/0x140
<4>[ 3126.264094]  [<ffffffff8128dd65>] ext4_xattr_set+0xc5/0x140
<4>[ 3126.269698]  [<ffffffff8128e784>] ext4_xattr_user_set+0x44/0x50
<4>[ 3126.275648]  [<ffffffff811edeeb>] generic_setxattr+0x6b/0x90
<4>[ 3126.281338]  [<ffffffff811ee81b>] __vfs_setxattr_noperm+0x7b/0x1c0
<4>[ 3126.287552]  [<ffffffff8133359e>] ? evm_inode_setxattr+0xe/0x10
<4>[ 3126.293507]  [<ffffffff811eea1c>] vfs_setxattr+0xbc/0xc0
<4>[ 3126.298852]  [<ffffffff811eeb5e>] setxattr+0x13e/0x1e0
<4>[ 3126.304023]  [<ffffffff8109cf47>] ? ttwu_queue+0xb7/0xd0
<4>[ 3126.309367]  [<ffffffff8109f800>] ? try_to_wake_up+0x190/0x210
<4>[ 3126.315245]  [<ffffffff811cb823>] ? __sb_start_write+0x53/0xf0
<4>[ 3126.321120]  [<ffffffff810a2589>] ? account_user_time+0x99/0xb0
<4>[ 3126.327115]  [<ffffffff811e99d8>] ? __mnt_want_write+0x58/0x70
<4>[ 3126.332981]  [<ffffffff811eeffe>] SyS_fsetxattr+0xbe/0x100
<4>[ 3126.338499]  [<ffffffff81765f7f>] tracesys+0xe1/0xe6
<4>[ 3126.343494] ---[ end trace 370e2dece562e736 ]---
<3>[ 3126.348147] EXT4-fs: ext4_handle_dirty_xattr_block:167: aborting transaction: error 22 in __ext4_handle_dirty_metadata

<3>[ 3126.378547] Aborting journal on device sda1-8.
<2>[ 3126.383237] EXT4-fs (sda1): Remounting filesystem read-only
<2>[ 3126.388851] EXT4-fs error (device sda1) in ext4_xattr_release_block:558: error 22
<1>[ 3126.396589] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
<1>[ 3126.404475] IP: [<ffffffff8126b771>] __ext4_error_inode+0x31/0x150
<4>[ 3126.410686] PGD 6b933067 PUD 6d191067 PMD 0
<4>[ 3126.415005] Oops: 0000 [#1] SMP

plana87:

<5>[  805.417218] XFS (sdd): Mounting Filesystem
<6>[  805.985840] XFS (sdd): Ending clean mount
<5>[  808.342428] XFS (sdb): Mounting Filesystem
<6>[  808.868967] XFS (sdb): Ending clean mount
<5>[  811.195855] XFS (sdc): Mounting Filesystem
<6>[  811.799708] XFS (sdc): Ending clean mount
<4>[ 1370.641635] perf samples too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
<6>[ 1370.650813] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.178 msecs
<3>[ 5840.074506] JBD2: sda1-8: jh->b_next_transaction (63971849,           (null), 0) != transaction (ffff8802160a2900, 24123492)
<4>[ 5840.085583] ------------[ cut here ]------------
<4>[ 5840.090437] WARNING: CPU: 7 PID: 25506 at /build/buildd/linux-lts-trusty-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a9/0x1c0()

<4>[ 5840.150389] CPU: 7 PID: 25506 Comm: ceph-osd Tainted: G          I  3.13.0-27-generic #50~precise1-Ubuntu
<4>[ 5840.160001] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<4>[ 5840.167527]  0000000000000103 ffff8801f04db998 ffffffff817507d1 ffff8802272efff0
<4>[ 5840.175008]  0000000000000000 ffff8801f04db9d8 ffffffff8106af4c 0000000001701864
<4>[ 5840.182499]  ffff8801aab46888 00000000ffffffea ffff8800bffa2af8 ffff8802160a9720
<4>[ 5840.189971] Call Trace:
<4>[ 5840.192442]  [<ffffffff817507d1>] dump_stack+0x46/0x58
<4>[ 5840.197611]  [<ffffffff8106af4c>] warn_slowpath_common+0x8c/0xc0
<4>[ 5840.203646]  [<ffffffff8106af9a>] warn_slowpath_null+0x1a/0x20
<4>[ 5840.209505]  [<ffffffff8127acb9>] __ext4_handle_dirty_metadata+0x1a9/0x1c0
<4>[ 5840.216403]  [<ffffffff8128c93c>] ext4_xattr_release_block+0x10c/0x1d0
<4>[ 5840.222954]  [<ffffffff8128d0d5>] ext4_xattr_block_set+0x3a5/0x710
<4>[ 5840.229212]  [<ffffffff8128db80>] ext4_xattr_set_handle+0x370/0x490
<4>[ 5840.235558]  [<ffffffff8128dd39>] ? ext4_xattr_set+0x99/0x140
<4>[ 5840.241381]  [<ffffffff8128dd65>] ext4_xattr_set+0xc5/0x140
<4>[ 5840.247031]  [<ffffffff8128e784>] ext4_xattr_user_set+0x44/0x50
<4>[ 5840.253026]  [<ffffffff811edeeb>] generic_setxattr+0x6b/0x90
<4>[ 5840.258764]  [<ffffffff811ee81b>] __vfs_setxattr_noperm+0x7b/0x1c0
<4>[ 5840.265022]  [<ffffffff8133359e>] ? evm_inode_setxattr+0xe/0x10
<4>[ 5840.271020]  [<ffffffff811eea1c>] vfs_setxattr+0xbc/0xc0
<4>[ 5840.276416]  [<ffffffff811eeb5e>] setxattr+0x13e/0x1e0
<4>[ 5840.281633]  [<ffffffff810affc7>] ? finish_wait+0x67/0x80
<4>[ 5840.287112]  [<ffffffff810dea16>] ? get_futex_key+0x216/0x310
<4>[ 5840.292938]  [<ffffffff811cb823>] ? __sb_start_write+0x53/0xf0
<4>[ 5840.298854]  [<ffffffff810a2589>] ? account_user_time+0x99/0xb0
<4>[ 5840.304849]  [<ffffffff811e99d8>] ? __mnt_want_write+0x58/0x70
<4>[ 5840.310761]  [<ffffffff811eeffe>] SyS_fsetxattr+0xbe/0x100
<4>[ 5840.316323]  [<ffffffff81765f7f>] tracesys+0xe1/0xe6
<4>[ 5840.321361] ---[ end trace 94ed2e0930f9eaa2 ]---
<3>[ 5840.326063] EXT4-fs: ext4_handle_dirty_xattr_block:167: aborting transaction: error 22 in __ext4_handle_dirty_metadata

<3>[ 5840.356646] Aborting journal on device sda1-8.
<2>[ 5840.361377] EXT4-fs (sda1): Remounting filesystem read-only
<2>[ 5840.366985] EXT4-fs error (device sda1) in ext4_xattr_release_block:558: error 22
<1>[ 5840.374788] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
<1>[ 5840.382673] IP: [<ffffffff8126b771>] __ext4_error_inode+0x31/0x150
<4>[ 5840.388881] PGD 21a85f067 PUD 221f7f067 PMD 0
<4>[ 5840.393374] Oops: 0000 [#1] SMP

I think this problem has been fixed by "ext4: Fix jbd2 warning under heavy xattr load"?
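To compare consoles from several machines quickly, the failure sequence in the two traces above can be reduced to a handful of timestamps. The following is a minimal sketch, assuming the console output is available as text lines in the `<level>[ timestamp] message` form shown above; the label names and the `first_hits` helper are hypothetical, and the substrings are copied from the plana80 and plana87 logs.

```python
import re

# Matches lines such as "<3>[ 3126.378547] Aborting journal on device sda1-8."
KMSG_LINE = re.compile(r"^<(\d)>\[\s*(\d+\.\d+)\]\s?(.*)$")

# Substrings copied from the traces above; the labels are made up for this sketch.
SIGNATURES = {
    "jbd2_mismatch": "jh->b_next_transaction",
    "journal_abort": "Aborting journal on device",
    "remount_ro": "Remounting filesystem read-only",
    "null_deref": "unable to handle kernel NULL pointer dereference",
}


def parse_line(line):
    """Return (level, timestamp, message), or None if the line has another shape."""
    m = KMSG_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), m.group(3)


def first_hits(lines):
    """Map each signature label to the timestamp of its first occurrence."""
    hits = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _level, ts, msg = parsed
        for label, needle in SIGNATURES.items():
            if needle in msg and label not in hits:
                hits[label] = ts
    return hits
```

If two machines show the same labels in the same order, that is only a hint that they hit the same problem; the call traces still need to be compared by hand.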

Actions #3

Updated by Ilya Dryomov almost 9 years ago

The fix (if it is indeed the issue I noted above, Sage can confirm) is in the ubuntu-trusty kernel starting with Ubuntu-3.13.0-31.55, from July 1, 2014. The question is: why are we running such an old kernel? If the idea is to not run the ceph-client.git testing branch for some reason (why?), then the distro kernels need to be upgraded regularly.
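A rough way to check whether a node is older than that kernel is to compare the release string it reports (3.13.0-27-generic in the traces above) against the ABI number of Ubuntu-3.13.0-31.55. This is a sketch under stated assumptions: it only looks at the 3.13.0 series, it ignores the upload number (the ".55" part), and `has_xattr_fix` is a hypothetical helper name.

```python
import re

# From "Ubuntu-3.13.0-31.55" in the comment above: base 3.13.0, ABI 31.
FIXED_BASE = (3, 13, 0)
FIXED_ABI = 31


def parse_release(release):
    """Split a release such as '3.13.0-27-generic' into ((3, 13, 0), 27)."""
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch, abi = (int(g) for g in m.groups())
    return (major, minor, patch), abi


def has_xattr_fix(release):
    """True/False for 3.13.0 kernels; None when the series is not covered here."""
    base, abi = parse_release(release)
    if base != FIXED_BASE:
        return None  # assumption: other series need their own lookup
    return abi >= FIXED_ABI
```

On a live machine the release string would come from `uname -r` (or `platform.release()` in Python); for the two plana nodes above it is the 3.13.0-27-generic value printed in the oops header.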

Actions #4

Updated by Yuri Weinstein almost 9 years ago

plana34 was locked by teuthology-2015-05-10_23:18:01-multimds-next-testing-basic-multi/884661

marked down

Actions #5

Updated by Yuri Weinstein almost 9 years ago

yuriw@typica002:~$ ssh typica143
ssh: connect to host typica143 port 22: Connection refused

was locked by teuthology-2015-05-08_23:08:01-kcephfs-giant-testing-basic-typica/12939, no sol access, marked down

Actions #6

Updated by Zack Cerza almost 9 years ago

Yuri Weinstein wrote:

[...]

was locked by teuthology-2015-05-08_23:08:01-kcephfs-giant-testing-basic-typica/12939, no sol access, marked down

What happened to power cycling first?

Actions #7

Updated by Yuri Weinstein almost 9 years ago

I was not able to gain any access to it via ssh or sol, so no power cycle was done.

Actions #8

Updated by Yuri Weinstein almost 9 years ago

yuriw@typica002:~$ ssh typica125
ssh: connect to host typica125 port 22: No route to host

was locked by teuthology-2015-05-12_21:00:01-rados-master-distro-basic-typica/15427

Actions #9

Updated by Yuri Weinstein almost 9 years ago

yuriw@typica002:~$ ssh typica125
ssh: connect to host typica125 port 22: No route to host

teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/17978

Note: typica125 again!

Actions #10

Updated by Yuri Weinstein almost 9 years ago

yuriw@typica002:~$ ssh typica091
ssh: connect to host typica091 port 22: No route to host

Was locked by teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/17979

Actions #11

Updated by Yuri Weinstein almost 9 years ago

yuriw@typica002:~$ ssh typica114
ssh: connect to host typica114 port 22: No route to host

was locked by teuthology-2015-05-12_23:20:01-upgrade:client-upgrade-master-distro-basic-typica/16000

Actions #12

Updated by Dan Mick almost 9 years ago

The lack of ssh does not imply any particular cause.
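For what it is worth, the reports on this ticket already contain two different connection errors ("No route to host" and "Connection refused"), and a report could at least record which one was seen. The sketch below is an illustration only: it uses a plain TCP connect to port 22 rather than ssh itself, the label strings and function names are hypothetical, and the result still says nothing about why the machine is in that state.

```python
import errno
import socket


def classify(exc):
    """Turn a connect-time OSError into a coarse label."""
    if isinstance(exc, socket.gaierror):
        return "name-resolution-failed"
    if isinstance(exc, socket.timeout):
        return "timed-out"
    if exc.errno == errno.EHOSTUNREACH:
        return "no-route-to-host"
    if exc.errno == errno.ECONNREFUSED:
        return "connection-refused"
    return "other"


def probe(host, port=22, timeout=5.0):
    """Attempt a TCP connection and report how it ended."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return classify(exc)
    sock.close()
    return "reachable"
```

A "connection-refused" result means something answered at that address but nothing was listening on the port, while "no-route-to-host" means the connection attempt got no usable path to the address at all; either still needs console or power-cycle follow-up to find the actual cause.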

Actions #13

Updated by Yuri Weinstein almost 9 years ago

ubuntu@teuthology:~$ ssh plana47
ssh: connect to host plana47 port 22: No route to host

was locked by teuthology-2015-05-17_23:18:01-multimds-next-testing-basic-multi/897193

Actions #14

Updated by Yuri Weinstein almost 9 years ago

ubuntu@teuthology:~$ ssh burnupi61
ssh: connect to host burnupi61 port 22: No route to host

stuck in kdb debug; it was locked by teuthology-2015-05-18_21:00:01-rados-next-distro-basic-multi/898466

Actions #15

Updated by Yuri Weinstein almost 9 years ago

ubuntu@teuthology:~$ ssh plana20
ssh: connect to host plana20 port 22: No route to host

was locked by teuthology-2015-05-17_17:13:01-upgrade:firefly-x-next-distro-basic-multi/896671

Added #11683

Actions #17

Updated by Sage Weil over 8 years ago

  • Status changed from New to Closed

Closing this... I don't see anything useful here!
