Bug #10600

closed

PATH issues on RHEL7 nodes?

Added by Greg Farnum over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.redhat.com/teuthology-2015-01-19_22:06:01-fs-dumpling-distro-basic-magna/29429/

2015-01-21T00:08:24.067 INFO:tasks.workunit.client.0.magna076.stderr:+ ./fsync-tester
2015-01-21T00:09:39.527 INFO:tasks.workunit.client.0.magna076.stdout:setting up random write file
2015-01-21T00:09:39.530 INFO:tasks.workunit.client.0.magna076.stdout:done setting up random write file
2015-01-21T00:09:39.532 INFO:tasks.workunit.client.0.magna076.stdout:starting fsync run
2015-01-21T00:09:39.572 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0024s fsync time: 0.1367s
2015-01-21T00:09:39.573 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0050s fsync time: 1.8904s
2015-01-21T00:09:39.574 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0055s fsync time: 2.5049s
2015-01-21T00:09:39.575 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0053s fsync time: 3.5852s
2015-01-21T00:09:39.576 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0058s fsync time: 4.1350s
2015-01-21T00:09:39.577 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0058s fsync time: 4.3896s
2015-01-21T00:09:39.578 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0045s fsync time: 7.2922s
2015-01-21T00:09:39.580 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0040s fsync time: 9.2037s
2015-01-21T00:09:39.620 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0045s fsync time: 8.1377s
2015-01-21T00:09:39.621 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0044s fsync time: 7.9669s
2015-01-21T00:09:39.622 INFO:tasks.workunit.client.0.magna076.stdout:write time: 0.0047s fsync time: 8.9068s
2015-01-21T00:09:39.625 INFO:tasks.workunit.client.0.magna076.stdout:run done 11 fsyncs total, killing random writer
2015-01-21T00:09:39.630 INFO:tasks.workunit.client.0.magna076.stderr:+ lsof
2015-01-21T00:09:39.668 INFO:tasks.workunit.client.0.magna076.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsync-tester.sh: line 10: lsof: command not found
2015-01-21T00:09:39.669 INFO:tasks.workunit:Stopping suites/fsync-tester.sh on client.0...
2015-01-21T00:09:39.671 INFO:teuthology.orchestra.run.magna076:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2015-01-21T00:09:39.723 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/ceph-qa-suite_dumpling/tasks/workunit.py", line 301, in _run_tests
    args=args,
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 128, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 368, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 106, in wait
    exitstatus=status, node=self.hostname)
CommandFailedError: Command failed on magna076 with status 127: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d73f0b86d3989d7b5924e984f949734a64ab04a9 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/home/ubuntu/cephtest/binary/usr/local/lib/python2.7/dist-packages:/home/ubuntu/cephtest/binary/usr/local/lib/python2.6/dist-packages" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /home/ubuntu/cephtest/workunit.client.0/suites/fsync-tester.sh'

#1

Updated by Greg Farnum over 9 years ago

  • Subject changed from "lost needs to be installed on all magna nodes" to "lsof needs to be installed on all magna nodes"

#2

Updated by Greg Farnum over 9 years ago

  • Project changed from teuthology to sepia

#3

Updated by Sandon Van Ness about 9 years ago

We install lsof on our magna nodes. It's even in /usr/sbin, which is very commonly on the $PATH. Is something stripping environment variables here, causing it not to be found?

[ubuntu@magna076 ~]$ sudo grep lsof /var/log/yum.log
Jan 09 17:20:48 Installed: lsof-4.87-4.el7.x86_64
[ubuntu@magna076 ~]$ lsof -v
lsof version information:
    revision: 4.87
    latest revision: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/
    latest FAQ: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/FAQ
    latest man page: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/lsof_man
    constructed: Sun Jan 26 09:55:57 EST 2014
    constructed by and on: mockbuild@x86-017.build.eng.bos.redhat.com
    compiler: cc
    compiler version: 4.8.2 20140120 (Red Hat 4.8.2-12) (GCC)
    compiler flags: -DLINUXV=310000 -DGLIBCV=217 -DHASIPv6 -DHASSELINUX -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -DHAS_STRFTIME -DLSOF_VSTR="3.10.0" -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic
    loader flags: -L./lib -llsof  -lselinux
    system info: Linux x86-017.build.eng.bos.redhat.com 2.6.32-431.4.1.el6.x86_64 #1 SMP Thu Dec 19 10:26:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
    Anyone can list all files.
    /dev warnings are disabled.
    Kernel ID check is disabled.
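Sandon's question can be tested in isolation. A POSIX shell exits with status 127 when a command is not found on $PATH, which is the status in the CommandFailedError above, and that happens even when the binary is installed in a directory the PATH omits. The Python sketch below (Python because teuthology is written in it) assumes a Unix host with /bin/sh and uses `ls` as a stand-in for lsof; the helper `exit_status_with_path` is hypothetical.

```python
import os
import shutil
import subprocess


def exit_status_with_path(command: str, path: str) -> int:
    """Run `command` through /bin/sh with PATH as the only variable; return its exit status."""
    # /bin/sh is given by absolute path, so it starts even when PATH is useless.
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        env={"PATH": path},
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# `ls` stands in for lsof: the binary is installed either way; only whether
# its directory appears on PATH changes.
ls_dir = os.path.dirname(shutil.which("ls"))

on_path = exit_status_with_path("ls /", ls_dir)
off_path = exit_status_with_path("ls /", "/nonexistent-dir")
print(on_path, off_path)  # 0 127 with a POSIX-conforming /bin/sh
```

A 127 here, with the binary still on disk, is the same symptom as "lsof: command not found" in the workunit log.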

#4

Updated by Greg Farnum about 9 years ago

I don't know; I'm just seeing intermittent failures. I saw a reference from you somewhere to paths getting stripped, so I suppose that could be happening just on the RHEL machines, or something like that?

We're not doing anything unusual with these commands, though, so if the PATH is going away, it's likely a teuthology issue of some kind that will need to be discussed with Zack and resolved globally. Any idea what might cause that?

#5

Updated by Zack Cerza about 9 years ago

I just checked, and I don't see a place where teuthology could be affecting $PATH...

#6

Updated by Greg Farnum about 9 years ago

  • Project changed from sepia to teuthology
  • Subject changed from "lsof needs to be installed on all magna nodes" to "PATH issues on RHEL7 nodes?"

See also #10302. There, Sandon suggests adding "PATH=$PATH:/usr/sbin" to the script, though that seems wonky to me.

Anyway, the command invoking the script is above, and here are the full script contents:

#!/bin/sh -x

set -e

wget http://ceph.com/qa/fsync-tester.c
gcc fsync-tester.c -o fsync-tester

./fsync-tester

lsof

This ran reliably until the RHEL7 machines were added. I logged into a RHEL7 box, and invoking lsof manually works just fine. :/
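Sandon's workaround amounts to appending /usr/sbin to whatever PATH the script inherits. A minimal Python sketch of the same idea, written so that repeated application never duplicates an entry; `append_to_path` and its default directory list are illustrative assumptions, not part of the workunit or teuthology. Applied to the PATH that was later logged on magna074, it adds only the missing sbin directories.

```python
def append_to_path(path: str, extra_dirs=("/usr/local/sbin", "/usr/sbin")) -> str:
    """Return `path` with each directory in `extra_dirs` appended if it is missing.

    Mirrors the shell idiom PATH=$PATH:/usr/sbin, but is idempotent:
    directories already present are not added a second time.
    """
    # Drop empty components (from a leading, trailing, or doubled ':').
    # In POSIX an empty entry means "current directory"; dropping it is deliberate.
    parts = [p for p in path.split(":") if p]
    for d in extra_dirs:
        if d not in parts:
            parts.append(d)
    return ":".join(parts)


logged = "/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin"
print(append_to_path(logged))
# /usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
```

Appending (rather than prepending) keeps any existing entries in front, so a binary already found on the inherited PATH still wins.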

#7

Updated by Zack Cerza about 9 years ago

I'd rather see something like "echo $PATH; which lsof" added, to help us figure out whether it is indeed a PATH issue.
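The "echo $PATH; which lsof" check reports two things: the PATH the script actually sees, and where (if anywhere) the command resolves on it. A hedged Python equivalent; `path_report` is a hypothetical helper, and `shutil.which` is used because it scans the PATH directories in order for an executable file, much as the shell does for an external command.

```python
import os
import shutil
from typing import Mapping, Optional, Tuple


def path_report(command: str, env: Optional[Mapping[str, str]] = None) -> Tuple[str, Optional[str]]:
    """Return (PATH, first match for `command` on that PATH, or None).

    Roughly what `echo $PATH; which <command>` prints from inside the script.
    """
    env = os.environ if env is None else env
    path = env.get("PATH", "")
    # shutil.which scans the PATH directories in order for an executable file;
    # an empty PATH yields None.
    return path, shutil.which(command, path=path)


print(path_report("ls"))
# The PATH logged on magna074; on those RHEL7 nodes lsof lives in /usr/sbin,
# so there this lookup comes back None.
print(path_report("lsof", {"PATH": "/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin"}))
```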

#8

Updated by Greg Farnum about 9 years ago

Ugh, I wasn't getting email notifications on this bug... I'm watching it now.

Anyway, that script is one of the ceph.git workunits. I assume that's where you want that info to come from? Or do you mean from within the workunit task?

#9

Updated by Zack Cerza about 9 years ago

I meant from the script.

#10

Updated by Greg Farnum about 9 years ago

  • Status changed from New to In Progress
  • Assignee changed from Sandon Van Ness to Greg Farnum

I've got a branch that collects PATH info and will run some tests once it's available for install.

#11

Updated by Greg Farnum about 9 years ago

Well, I tried reproducing this in a VPS setup and was unable to. So I've given in and pushed a patch to master and our LTS branches that makes the script echo $PATH and run "whereis lsof", which will hopefully help us narrow things down.

#12

Updated by Greg Farnum about 9 years ago

  • Status changed from In Progress to Need More Info

#13

Updated by Greg Farnum about 9 years ago

  • Status changed from Need More Info to In Progress
  • Assignee changed from Greg Farnum to Zack Cerza

gregf@magna002:/a/gregf-2015-02-17_14:18:29-fs-wip-firefly-flock---basic-magna/49584/teuthology.log:

2015-02-18T16:33:07.010 INFO:tasks.workunit.client.0.magna074.stdout:write time: 107.0348s fsync time: 18.4140s
2015-02-18T16:33:07.050 INFO:tasks.workunit.client.0.magna074.stdout:run done 4 fsyncs total, killing random writer
2015-02-18T16:33:07.051 INFO:tasks.workunit.client.0.magna074.stderr:+ echo /usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin
2015-02-18T16:33:07.091 INFO:tasks.workunit.client.0.magna074.stderr:+ whereis lsof
2015-02-18T16:33:07.092 INFO:tasks.workunit.client.0.magna074.stdout:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin
2015-02-18T16:33:07.497 INFO:tasks.workunit.client.0.magna074.stdout:lsof: /usr/sbin/lsof /usr/share/man/man8/lsof.8.gz
2015-02-18T16:33:07.498 INFO:tasks.workunit.client.0.magna074.stderr:+ lsof
2015-02-18T16:33:07.500 INFO:tasks.workunit.client.0.magna074.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsync-tester.sh: line 12: lsof: command not found
2015-02-18T16:33:07.501 INFO:tasks.workunit:Stopping suites/fsync-tester.sh on client.0...
2015-02-18T16:33:07.501 INFO:teuthology.orchestra.run.magna074:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2015-02-18T16:33:07.569 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/ceph-qa-suite_firefly/tasks/workunit.py", line 360, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test suites/fsync-tester.sh) on magna074 with status 127: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=702dbc0a247c149d53b52d1929f9880bc99d0522 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/suites/fsync-tester.sh'

Okay, so it looks like /usr/sbin is not in the PATH that the script sees. I don't see us using "sudo" anywhere when executing this one, which I think means teuthology must somehow be stripping it?
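The log above shows why `whereis` and the shell disagree. `whereis` looks in a built-in set of standard directories (which includes /usr/sbin) in addition to $PATH, whereas the shell resolves `lsof` using only the directories on $PATH. A rough Python model of that difference; the directory list and the simulated file listing are assumptions made so the sketch runs anywhere, and it ignores the man pages that `whereis` also reports.

```python
from typing import AbstractSet, Iterable, List, Optional

# Simplified stand-in for the standard binary directories that whereis searches
# in addition to $PATH (an assumption; the real util-linux list is longer).
STANDARD_DIRS = ("/bin", "/sbin", "/usr/bin", "/usr/sbin", "/usr/local/bin", "/usr/local/sbin")

# Hypothetical file listing of a RHEL7 node, so the sketch needs no real filesystem.
INSTALLED = frozenset({"/usr/sbin/lsof", "/usr/bin/gcc"})


def shell_lookup(command: str, path: str, installed: AbstractSet[str]) -> Optional[str]:
    """First match on PATH, which is all the shell consults for an external command."""
    for d in path.split(":"):
        candidate = f"{d}/{command}"
        if d and candidate in installed:
            return candidate
    return None


def whereis_like(command: str, path: str, installed: AbstractSet[str],
                 standard_dirs: Iterable[str] = STANDARD_DIRS) -> List[str]:
    """All matches in the standard directories plus PATH, without duplicates."""
    hits: List[str] = []
    for d in list(standard_dirs) + path.split(":"):
        candidate = f"{d}/{command}"
        if d and candidate in installed and candidate not in hits:
            hits.append(candidate)
    return hits


logged_path = "/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin"
print(shell_lookup("lsof", logged_path, INSTALLED))  # None
print(whereis_like("lsof", logged_path, INSTALLED))  # ['/usr/sbin/lsof']
```

So a successful `whereis lsof` does not show that the shell can run `lsof`; only a PATH-based lookup such as `which` answers that.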

#14

Updated by Greg Farnum about 9 years ago

Forgot to update this yesterday: ssh sets up the environment differently depending on whether it starts a login shell (or something like that), and RHEL uses different PATH values in those two cases!

But in our case, we really want the full login environment for everything we run.
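One way to request the login environment for a command run over ssh is to wrap it in a login shell, so the remote side reads the login profile files (such as /etc/profile) before running it. This is a hedged Python sketch, not necessarily what the eventual fix did: `as_login_shell` is a hypothetical helper, it assumes bash exists on the node, and it assumes the login profile is where /usr/sbin gets added to PATH.

```python
import shlex


def as_login_shell(command: str, shell: str = "bash") -> str:
    """Wrap `command` so the remote side runs it in a login shell.

    `bash -l` reads /etc/profile (and the user's login profile) first, so PATH
    reflects the full login setup rather than the non-login default.
    """
    # shlex.quote turns the whole command into a single argument for -c.
    return f"{shell} -l -c {shlex.quote(command)}"


cmd = "echo $PATH; lsof -v"
wrapped = as_login_shell(cmd)
print(wrapped)  # bash -l -c 'echo $PATH; lsof -v'
```

Quoting with `shlex.quote` keeps `$PATH` from being expanded by the outer shell that ssh starts, so it is expanded inside the login shell instead, after the profile has run.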

#15

Updated by Greg Farnum about 9 years ago

  • Assignee changed from Zack Cerza to Greg Farnum

And I guess I'm stuck shepherding this even if I can't solve it on my own right now.

#16

Updated by Greg Farnum about 9 years ago

  • Status changed from In Progress to Fix Under Review

#18

Updated by Greg Farnum about 9 years ago

  • Status changed from Resolved to Pending Backport

This is in master, but we'll want it in our other test branches as well. I'm just waiting to let it get through a few runs and make sure we haven't somehow broken something else.

#19

Updated by Greg Farnum about 9 years ago

  • Priority changed from High to Urgent

#20

Updated by Greg Farnum about 9 years ago

  • Status changed from Pending Backport to Resolved

This is now in hammer, firefly, and dumpling.
