Project

General

Profile

Actions

Bug #14749

closed

smithi049 and smithi041 rebooting constantly

Added by Yuri Weinstein about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure Hardware
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Majority of nodes that failed during nuke/stale operation fail with

2016-02-12 08:27:44,220.220 INFO:teuthology.orchestra.run.smithi049:Running: "dpkg -l | grep '^.\\(U\\|H\\)R' | awk '{print $2}' | sudo xargs --no-run-if-empty dpkg -P --force-remove-reinstreq" 
2016-02-12 08:27:44,293.293 INFO:teuthology.orchestra.run.smithi049:Running: 'sudo DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" autoremove'
2016-02-12 08:27:44,376.376 INFO:teuthology.orchestra.run.smithi049.stderr:E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Usually running 'sudo dpkg --configure -a' fix the issue, maybe we should re-try it?

Actions #1

Updated by Zack Cerza about 8 years ago

As of right now, mira049's job history, most recent first:
running http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7142
dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7136

{'smithi041.front.sepia.ceph.com': {'invocation': {'module_name': 'apt', 'module_args': ''}, 'failed': True, 'parsed': False, 'msg': 'BECOME-SUCCESS-hcuscfuixbfpipjxokrryrevriuxxmfk\nOpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014\r\ndebug1: Reading configuration data /var/lib/teuthworker/.ssh/config\r\ndebug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for *\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for smithi*\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: mux_client_request_session: master session id: 2\r\nTraceback (most recent call last):\n File "<stdin>", line 2246, in <module>\n File "<stdin>", line 549, in main\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__\n self.open(progress)\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 151, in open\n self._cache = apt_pkg.Cache(progress)\nSystemError: E:Problem renaming the file /var/cache/apt/pkgcache.bin.aHwvNE to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory), W:You may want to run apt-get update to correct these problems\n'}, 'smithi049.front.sepia.ceph.com': {'invocation': {'module_name': 'apt', 'module_args': ''}, 'failed': True, 'parsed': False, 'item': 'ceph,ceph-common,libcephfs1,radosgw,python-ceph,python-rados,librbd1,librados2,openmpi-common', 'msg': 'BECOME-SUCCESS-zavvekjiqlacfpzehdvxtjpnsiwthyoe\nOpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014\r\ndebug1: Reading configuration data /var/lib/teuthworker/.ssh/config\r\ndebug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for *\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for smithi*\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: mux_client_request_session: master session id: 4\r\nTraceback (most recent call last):\n File "<stdin>", line 2246, in <module>\n File "<stdin>", line 549, in main\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__\n self.open(progress)\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 153, in open\n self._records = apt_pkg.PackageRecords(self._cache)\nSystemError: E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_binary-amd64_Packages - open (2: No such file or directory)\n'}}

dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7128
{'smithi049.front.sepia.ceph.com': 'SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh'}

dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7119
SSH connection to smithi049 was lost: u'sudo DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install ceph=10.0.3-979-g35a4db4-1trusty ceph-mds=10.0.3-979-g35a4db4-1trusty ceph-common=10.0.3-979-g35a4db4-1trusty ceph-fuse=10.0.3-979-g35a4db4-1trusty ceph-test=10.0.3-979-g35a4db4-1trusty radosgw=10.0.3-979-g35a4db4-1trusty python-ceph=10.0.3-979-g35a4db4-1trusty libcephfs1=10.0.3-979-g35a4db4-1trusty libcephfs-java=10.0.3-979-g35a4db4-1trusty libcephfs-jni=10.0.3-979-g35a4db4-1trusty librados2=10.0.3-979-g35a4db4-1trusty librbd1=10.0.3-979-g35a4db4-1trusty rbd-fuse=10.0.3-979-g35a4db4-1trusty librados2=10.0.3-979-g35a4db4-1trusty librbd1=10.0.3-979-g35a4db4-1trusty'

dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7112
SSH connection to smithi049 was lost: 'rm -rf -- /home/ubuntu/cephtest/data/test.client.0'

Job 7119 lost the ssh connection while dpkg was running. I have not investigated why.

Actions #2

Updated by Zack Cerza about 8 years ago

teuthology.log:

2016-02-12T08:59:27.402 INFO:teuthology.orchestra.run.smithi049.stdout:Preparing to unpack .../libjs-jquery_1.7.2+dfsg-2ubuntu1_all.deb ...
2016-02-12T08:59:27.451 INFO:teuthology.orchestra.run.smithi049.stdout:Unpacking libjs-jquery (1.7.2+dfsg-2ubuntu1) ...
[...]
2016-02-12T09:01:16.836 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)

/var/log/syslog:

Feb 12 08:58:37 smithi049 dhclient: DHCPREQUEST of 172.21.15.49 on eth1 to 172.21.0.10 port 67 (xid=0x1a942231)
Feb 12 08:58:37 smithi049 dhclient: DHCPACK of 172.21.15.49 from 172.21.0.10
Feb 12 08:58:37 smithi049 dhclient: bound to 172.21.15.49 -- renewal in 3015 seconds.
Feb 12 09:01:41 smithi049 kernel: [21787.472823] init: tgt main process (6848) killed by KILL signal
Feb 12 09:02:22 smithi049 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="2820" x-info="http://www.rsyslog.com"] exiting on signal 15.
Feb 12 09:02:22 smithi049 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="15604" x-info="http://www.rsyslog.com"] start

Actions #3

Updated by Zack Cerza about 8 years ago

  • Project changed from teuthology to sepia
  • Subject changed from teuthology-nuke gets stuck with "dpkg was interrupted" error to connection lost on smithi049 while running dpkg
Actions #4

Updated by Zack Cerza about 8 years ago

From http://qa-proxy.ceph.com/teuthology/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7136/teuthology.log:

2016-02-12T09:14:09.406 INFO:teuthology.task.ansible.out:TASK: [testnode | Update apt cache.] ****************************************** 

2016-02-12T09:14:18.075 INFO:teuthology.task.ansible.out:[0;31mfailed: [smithi041.front.sepia.ceph.com] => {"failed": true, "parsed": false}[0m
[0;31mBECOME-SUCCESS-hcuscfuixbfpipjxokrryrevriuxxmfk
2016-02-12T09:14:18.078 INFO:teuthology.task.ansible.out:
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014

debug1: Reading configuration data /var/lib/teuthworker/.ssh/config

debug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for *

debug1: Reading configuration data /etc/ssh/ssh_config

debug1: /etc/ssh/ssh_config line 19: Applying options for *

debug1: /etc/ssh/ssh_config line 56: Applying options for smithi*

debug1: auto-mux: Trying existing master

debug1: mux_client_request_session: master session id: 2

Traceback (most recent call last):
  File "<stdin>", line 2246, in <module>
  File "<stdin>", line 549, in main
  File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__
    self.open(progress)
  File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 151, in open
    self._cache = apt_pkg.Cache(progress)
SystemError: E:Problem renaming the file /var/cache/apt/pkgcache.bin.aHwvNE to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory), W:You may want to run apt-get update to correct these problems
[0m
[0;32mok: [smithi049.front.sepia.ceph.com] => {"changed": false}[0m

Actions #5

Updated by David Galloway about 8 years ago

  • Category set to Infrastructure Hardware
Actions #6

Updated by Yuri Weinstein about 8 years ago

see also
smithi049 smithi041

Actions #7

Updated by Dan Mick about 8 years ago

smithi041 currently responds to apt-get update with

E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

and dpkg -i shows libc-bin's state as 'it', which I'm assuming is the problem there, so it looks like dpkg was in fact interrupted in the past...but apt-get update doesn't fix it

Actions #8

Updated by Zack Cerza about 8 years ago

  • Subject changed from connection lost on smithi049 while running dpkg to smithi049 and smithi041 rebooting constantly
Actions #9

Updated by Zack Cerza about 8 years ago

  • Assignee set to Zack Cerza
Actions #10

Updated by Zack Cerza about 8 years ago

  • Status changed from New to Resolved

This was a strange one to debug.

I stayed logged in to both machines tailing /var/log/{auth.log,syslog} and noticed output consistent with what teuthology-nuke would produce:

Feb 15 18:25:54 smithi041 systemd-logind[660]: New session 3 of user ubuntu.
Feb 15 18:25:54 smithi041 sudo:   ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/sh -c iptables-save | grep -v teuthology | iptables-restore
Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session closed for user root
Feb 15 18:25:54 smithi041 sudo:   ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/killall --quiet ceph-mon ceph-osd ceph-mds ceph-fuse ceph-disk radosgw ceph_test_rados rados rbd-fuse apache2
Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session closed for user root
Feb 15 18:25:59 smithi041 sudo:   ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/reboot -f -n
Feb 15 18:25:59 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)

I found the culprit. This job had been running for over six days:
http://pulpito.front.sepia.ceph.com/sage-blue-3/1583/
http://qa-proxy.ceph.com/teuthology/sage-blue-3/1583/teuthology.log

Its worker process was trying to kill it:
http://qa-proxy.ceph.com/teuthology/sage-blue-3/1583/worker.log

But the job process wasn't responding to SIGTERM, so the kill would re-attempt forever. Any time another job owned by scheduled_sage@teuthology had one or both locked, the nuke would proceed, rebooting one or both machines.

I had to manually send a SIGKILL to the job process.

I will resolve this but leave #14775 open (which I just filed) to track the teuthology-kill behavior.

Actions

Also available in: Atom PDF