Bug #11581
closed
"port 22: No route to host" errors in the labs
Description
This is a placeholder to capture information on jobs that leave machines in an unusable state, e.g. ssh to a machine fails with the error "port 22: No route to host".
Updated by Yuri Weinstein about 9 years ago
In Sepia, running teuthology-nuke --stale --owner scheduled_teuthology@teuthology --unlock failed:
ERROR:teuthology.nuke:Could not nuke the following targets:
ssh'ing to machines plana80 and plana87 failed:
ssh: connect to host plana80 port 22: No route to host
plana80 was locked by teuthology-2015-05-07_17:13:01-upgrade:firefly-x-next-distro-basic-multi/879321
plana87 was locked by teuthology-2015-05-07_17:05:02-upgrade:giant-x-next-distro-basic-multi/879209
ipmitool sol shows both plana80 and plana87 stuck in kdb debug mode.
Both marked down for further analysis.
Updated by Ilya Dryomov about 9 years ago
Two identical failures in ext4 xattr.
plana80:
<5>[ 224.126924] XFS (sdd): Mounting Filesystem
<6>[ 224.621228] XFS (sdd): Ending clean mount
<5>[ 226.978492] XFS (sdb): Mounting Filesystem
<6>[ 227.506227] XFS (sdb): Ending clean mount
<5>[ 229.787749] XFS (sdc): Mounting Filesystem
<6>[ 230.248503] XFS (sdc): Ending clean mount
<4>[ 1239.086319] perf samples too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
<6>[ 1239.095498] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.180 msecs
<3>[ 3126.099041] JBD2: sda1-8: jh->b_next_transaction (81273957, (null), 0) != transaction (ffff8800bdad6500, 93718013)
<4>[ 3126.110117] ------------[ cut here ]------------
<4>[ 3126.114968] WARNING: CPU: 4 PID: 11297 at /build/buildd/linux-lts-trusty-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a9/0x1c0()
<4>[ 3126.173176] CPU: 4 PID: 11297 Comm: ceph-osd Tainted: G I 3.13.0-27-generic #50~precise1-Ubuntu
<4>[ 3126.182792] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<4>[ 3126.190322] 0000000000000103 ffff880220f5f998 ffffffff817507d1 ffff88022728fff0
<4>[ 3126.197812] 0000000000000000 ffff880220f5f9d8 ffffffff8106af4c 00000000059605fd
<4>[ 3126.205303] ffff88016937f4e0 00000000ffffffea ffff88021ec77400 ffff880036e0b510
<4>[ 3126.212798] Call Trace:
<4>[ 3126.215265] [<ffffffff817507d1>] dump_stack+0x46/0x58
<4>[ 3126.220436] [<ffffffff8106af4c>] warn_slowpath_common+0x8c/0xc0
<4>[ 3126.226475] [<ffffffff8106af9a>] warn_slowpath_null+0x1a/0x20
<4>[ 3126.232339] [<ffffffff8127acb9>] __ext4_handle_dirty_metadata+0x1a9/0x1c0
<4>[ 3126.239246] [<ffffffff8128c93c>] ext4_xattr_release_block+0x10c/0x1d0
<4>[ 3126.245804] [<ffffffff8128d0d5>] ext4_xattr_block_set+0x3a5/0x710
<4>[ 3126.252014] [<ffffffff8128db80>] ext4_xattr_set_handle+0x370/0x490
<4>[ 3126.258314] [<ffffffff8128dd39>] ? ext4_xattr_set+0x99/0x140
<4>[ 3126.264094] [<ffffffff8128dd65>] ext4_xattr_set+0xc5/0x140
<4>[ 3126.269698] [<ffffffff8128e784>] ext4_xattr_user_set+0x44/0x50
<4>[ 3126.275648] [<ffffffff811edeeb>] generic_setxattr+0x6b/0x90
<4>[ 3126.281338] [<ffffffff811ee81b>] __vfs_setxattr_noperm+0x7b/0x1c0
<4>[ 3126.287552] [<ffffffff8133359e>] ? evm_inode_setxattr+0xe/0x10
<4>[ 3126.293507] [<ffffffff811eea1c>] vfs_setxattr+0xbc/0xc0
<4>[ 3126.298852] [<ffffffff811eeb5e>] setxattr+0x13e/0x1e0
<4>[ 3126.304023] [<ffffffff8109cf47>] ? ttwu_queue+0xb7/0xd0
<4>[ 3126.309367] [<ffffffff8109f800>] ? try_to_wake_up+0x190/0x210
<4>[ 3126.315245] [<ffffffff811cb823>] ? __sb_start_write+0x53/0xf0
<4>[ 3126.321120] [<ffffffff810a2589>] ? account_user_time+0x99/0xb0
<4>[ 3126.327115] [<ffffffff811e99d8>] ? __mnt_want_write+0x58/0x70
<4>[ 3126.332981] [<ffffffff811eeffe>] SyS_fsetxattr+0xbe/0x100
<4>[ 3126.338499] [<ffffffff81765f7f>] tracesys+0xe1/0xe6
<4>[ 3126.343494] ---[ end trace 370e2dece562e736 ]---
<3>[ 3126.348147] EXT4-fs: ext4_handle_dirty_xattr_block:167: aborting transaction: error 22 in __ext4_handle_dirty_metadata
<3>[ 3126.378547] Aborting journal on device sda1-8.
<2>[ 3126.383237] EXT4-fs (sda1): Remounting filesystem read-only
<2>[ 3126.388851] EXT4-fs error (device sda1) in ext4_xattr_release_block:558: error 22
<1>[ 3126.396589] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
<1>[ 3126.404475] IP: [<ffffffff8126b771>] __ext4_error_inode+0x31/0x150
<4>[ 3126.410686] PGD 6b933067 PUD 6d191067 PMD 0
<4>[ 3126.415005] Oops: 0000 [#1] SMP
plana87:
<5>[ 805.417218] XFS (sdd): Mounting Filesystem
<6>[ 805.985840] XFS (sdd): Ending clean mount
<5>[ 808.342428] XFS (sdb): Mounting Filesystem
<6>[ 808.868967] XFS (sdb): Ending clean mount
<5>[ 811.195855] XFS (sdc): Mounting Filesystem
<6>[ 811.799708] XFS (sdc): Ending clean mount
<4>[ 1370.641635] perf samples too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
<6>[ 1370.650813] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.178 msecs
<3>[ 5840.074506] JBD2: sda1-8: jh->b_next_transaction (63971849, (null), 0) != transaction (ffff8802160a2900, 24123492)
<4>[ 5840.085583] ------------[ cut here ]------------
<4>[ 5840.090437] WARNING: CPU: 7 PID: 25506 at /build/buildd/linux-lts-trusty-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a9/0x1c0()
<4>[ 5840.150389] CPU: 7 PID: 25506 Comm: ceph-osd Tainted: G I 3.13.0-27-generic #50~precise1-Ubuntu
<4>[ 5840.160001] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<4>[ 5840.167527] 0000000000000103 ffff8801f04db998 ffffffff817507d1 ffff8802272efff0
<4>[ 5840.175008] 0000000000000000 ffff8801f04db9d8 ffffffff8106af4c 0000000001701864
<4>[ 5840.182499] ffff8801aab46888 00000000ffffffea ffff8800bffa2af8 ffff8802160a9720
<4>[ 5840.189971] Call Trace:
<4>[ 5840.192442] [<ffffffff817507d1>] dump_stack+0x46/0x58
<4>[ 5840.197611] [<ffffffff8106af4c>] warn_slowpath_common+0x8c/0xc0
<4>[ 5840.203646] [<ffffffff8106af9a>] warn_slowpath_null+0x1a/0x20
<4>[ 5840.209505] [<ffffffff8127acb9>] __ext4_handle_dirty_metadata+0x1a9/0x1c0
<4>[ 5840.216403] [<ffffffff8128c93c>] ext4_xattr_release_block+0x10c/0x1d0
<4>[ 5840.222954] [<ffffffff8128d0d5>] ext4_xattr_block_set+0x3a5/0x710
<4>[ 5840.229212] [<ffffffff8128db80>] ext4_xattr_set_handle+0x370/0x490
<4>[ 5840.235558] [<ffffffff8128dd39>] ? ext4_xattr_set+0x99/0x140
<4>[ 5840.241381] [<ffffffff8128dd65>] ext4_xattr_set+0xc5/0x140
<4>[ 5840.247031] [<ffffffff8128e784>] ext4_xattr_user_set+0x44/0x50
<4>[ 5840.253026] [<ffffffff811edeeb>] generic_setxattr+0x6b/0x90
<4>[ 5840.258764] [<ffffffff811ee81b>] __vfs_setxattr_noperm+0x7b/0x1c0
<4>[ 5840.265022] [<ffffffff8133359e>] ? evm_inode_setxattr+0xe/0x10
<4>[ 5840.271020] [<ffffffff811eea1c>] vfs_setxattr+0xbc/0xc0
<4>[ 5840.276416] [<ffffffff811eeb5e>] setxattr+0x13e/0x1e0
<4>[ 5840.281633] [<ffffffff810affc7>] ? finish_wait+0x67/0x80
<4>[ 5840.287112] [<ffffffff810dea16>] ? get_futex_key+0x216/0x310
<4>[ 5840.292938] [<ffffffff811cb823>] ? __sb_start_write+0x53/0xf0
<4>[ 5840.298854] [<ffffffff810a2589>] ? account_user_time+0x99/0xb0
<4>[ 5840.304849] [<ffffffff811e99d8>] ? __mnt_want_write+0x58/0x70
<4>[ 5840.310761] [<ffffffff811eeffe>] SyS_fsetxattr+0xbe/0x100
<4>[ 5840.316323] [<ffffffff81765f7f>] tracesys+0xe1/0xe6
<4>[ 5840.321361] ---[ end trace 94ed2e0930f9eaa2 ]---
<3>[ 5840.326063] EXT4-fs: ext4_handle_dirty_xattr_block:167: aborting transaction: error 22 in __ext4_handle_dirty_metadata
<3>[ 5840.356646] Aborting journal on device sda1-8.
<2>[ 5840.361377] EXT4-fs (sda1): Remounting filesystem read-only
<2>[ 5840.366985] EXT4-fs error (device sda1) in ext4_xattr_release_block:558: error 22
<1>[ 5840.374788] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
<1>[ 5840.382673] IP: [<ffffffff8126b771>] __ext4_error_inode+0x31/0x150
<4>[ 5840.388881] PGD 21a85f067 PUD 221f7f067 PMD 0
<4>[ 5840.393374] Oops: 0000 [#1] SMP
I think this problem has been fixed by "ext4: Fix jbd2 warning under heavy xattr load"?
Updated by Ilya Dryomov about 9 years ago
The fix (if it is indeed the issue I noted above, Sage can confirm) is in the ubuntu-trusty kernel starting with Ubuntu-3.13.0-31.55, from July 1, 2014. The question is why are we running such an old kernel? If the idea is to not run the ceph-client.git testing branch for some reason (why?), then the distro kernels need to be upgraded regularly.
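To make the "old kernel" point concrete, here is a minimal sketch that checks whether a node's kernel release predates the build named above. It assumes the ABI number after the first hyphen (27 in 3.13.0-27-generic, 31 in Ubuntu-3.13.0-31.55) is enough to order builds within the 3.13.0 series; the upload number (.55) does not appear in the release string, so it is ignored. The names parse_release and predates_fix are hypothetical, and whether that build really carries the fix is still subject to the confirmation asked for above.

```python
import re

# Assumption: the fix first shipped with ABI 31 of the 3.13.0 series,
# taken from the "Ubuntu-3.13.0-31.55" tag quoted in this ticket.
FIXED_BASE = (3, 13, 0)
FIXED_ABI = 31


def parse_release(release):
    """Split an Ubuntu kernel release such as '3.13.0-27-generic'.

    Returns ((major, minor, patch), abi).
    """
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch, abi = (int(g) for g in m.groups())
    return (major, minor, patch), abi


def predates_fix(release):
    """True if the release is a 3.13.0 build with an ABI older than 31."""
    base, abi = parse_release(release)
    if base != FIXED_BASE:
        # Other series are outside what this ticket says anything about.
        raise ValueError("cannot compare %r against the 3.13.0 series" % release)
    return abi < FIXED_ABI
```

Applied to the release string from both oops logs, predates_fix("3.13.0-27-generic") is true, which is consistent with the nodes still hitting the warning.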
Updated by Yuri Weinstein almost 9 years ago
plana34 was locked by teuthology-2015-05-10_23:18:01-multimds-next-testing-basic-multi/884661
marked down
Updated by Yuri Weinstein almost 9 years ago
yuriw@typica002:~$ ssh typica143
ssh: connect to host typica143 port 22: Connection refused
was locked by teuthology-2015-05-08_23:08:01-kcephfs-giant-testing-basic-typica/12939, no sol access, marked down
Updated by Zack Cerza almost 9 years ago
Yuri Weinstein wrote:
[...]
was locked by teuthology-2015-05-08_23:08:01-kcephfs-giant-testing-basic-typica/12939, no sol access, marked down
What happened to power cycling first?
Updated by Yuri Weinstein almost 9 years ago
I was not able to gain any access to it via ssh or sol, so no power cycle was done.
Updated by Yuri Weinstein almost 9 years ago
yuriw@typica002:~$ ssh typica125
ssh: connect to host typica125 port 22: No route to host
was locked by teuthology-2015-05-12_21:00:01-rados-master-distro-basic-typica/15427
Updated by Yuri Weinstein almost 9 years ago
yuriw@typica002:~$ ssh typica125
ssh: connect to host typica125 port 22: No route to host
teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/17978
Note: typica125 again!
Updated by Yuri Weinstein almost 9 years ago
yuriw@typica002:~$ ssh typica091
ssh: connect to host typica091 port 22: No route to host
Was locked by teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/17979
Updated by Yuri Weinstein almost 9 years ago
yuriw@typica002:~$ ssh typica114
ssh: connect to host typica114 port 22: No route to host
was locked by teuthology-2015-05-12_23:20:01-upgrade:client-upgrade-master-distro-basic-typica/16000
Updated by Dan Mick almost 9 years ago
the lack of ssh does not imply any particular cause.
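Since the ssh message alone does not identify a cause, a first triage step could at least separate the two errors reported in this ticket ("No route to host" and "Connection refused"). The following is a minimal sketch, not part of teuthology: classify_connect_error and probe_ssh are hypothetical names, and the hint strings are assumptions meant for sorting nodes, not a diagnosis.

```python
import errno
import socket

# Map the connect() errno values behind the ssh messages seen in this
# ticket to a coarse hint. The hints are assumptions for triage only.
HINTS = {
    errno.EHOSTUNREACH: "no route to host: node is off, hung, or off the network",
    errno.ECONNREFUSED: "connection refused: node answers, but nothing listens on the port",
    errno.ETIMEDOUT: "timed out: packets are dropped or the node is very slow",
}


def classify_connect_error(err_no):
    """Return a coarse hint for a connect() errno, or a generic fallback."""
    return HINTS.get(err_no, "unclassified error (errno %s)" % err_no)


def probe_ssh(host, port=22, timeout=5.0):
    """Try a plain TCP connect to the ssh port and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "port open"
    except socket.timeout:
        # A timeout raised by the socket module carries no errno.
        return classify_connect_error(errno.ETIMEDOUT)
    except OSError as exc:
        # Name-resolution failures also land here; their error codes are
        # not in HINTS, so they fall through to the generic fallback.
        return classify_connect_error(exc.errno)
```

Either result still calls for a console check (for example via ipmitool sol, as done earlier in this ticket) before deciding whether a node is sitting in kdb or merely unreachable.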
Updated by Yuri Weinstein almost 9 years ago
ubuntu@teuthology:~$ ssh plana47
ssh: connect to host plana47 port 22: No route to host
was locked by teuthology-2015-05-17_23:18:01-multimds-next-testing-basic-multi/897193
Updated by Yuri Weinstein almost 9 years ago
ubuntu@teuthology:~$ ssh burnupi61
ssh: connect to host burnupi61 port 22: No route to host
stuck in kdb debug, was locked by teuthology-2015-05-18_21:00:01-rados-next-distro-basic-multi/898466
Updated by Yuri Weinstein almost 9 years ago
ubuntu@teuthology:~$ ssh plana20
ssh: connect to host plana20 port 22: No route to host
was locked by teuthology-2015-05-17_17:13:01-upgrade:firefly-x-next-distro-basic-multi/896671
Added #11683
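The "was locked by" entries collected in this ticket could be grouped by suite, branch, or kernel to look for a pattern. Below is a minimal sketch for splitting a job name into fields. The field layout (user, timestamp, suite, branch, kernel, flavor, machine type, then the job id after the slash) is an assumption read off the names quoted here, not a documented format. Because suite names contain hyphens (upgrade:firefly-x), the trailing four fields are split from the right; a branch name containing a hyphen would be mis-split. parse_job and Job are hypothetical names.

```python
import re
from collections import namedtuple

Job = namedtuple(
    "Job", "user timestamp suite branch kernel flavor machine_type job_id"
)

# Leading user and timestamp, then everything up to the "/<job id>" suffix.
_JOB_RE = re.compile(
    r"^(?P<user>[^-]+)-"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})-"
    r"(?P<rest>.+)/(?P<job_id>\d+)$"
)


def parse_job(name):
    """Split a 'was locked by' job name into its assumed fields."""
    m = _JOB_RE.match(name)
    if m is None:
        raise ValueError("unrecognized job name: %r" % name)
    # Split from the right so hyphens inside the suite name survive.
    parts = m.group("rest").rsplit("-", 4)
    if len(parts) != 5:
        raise ValueError("too few fields in job name: %r" % name)
    suite, branch, kernel, flavor, machine_type = parts
    return Job(
        m.group("user"), m.group("timestamp"),
        suite, branch, kernel, flavor, machine_type,
        int(m.group("job_id")),
    )
```

For example, parse_job("teuthology-2015-05-07_17:13:01-upgrade:firefly-x-next-distro-basic-multi/879321") yields suite "upgrade:firefly-x", branch "next", and kernel "distro" under this assumed layout.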
Updated by Yuri Weinstein almost 9 years ago
Updated by Sage Weil over 8 years ago
- Status changed from New to Closed
closing this... i don't see anything useful here!