Bug #14749
closedsmithi049 and smithi041 rebooting constantly
0%
Description
Majority of nodes that failed during nuke/stale
operation fail with
2016-02-12 08:27:44,220.220 INFO:teuthology.orchestra.run.smithi049:Running: "dpkg -l | grep '^.\\(U\\|H\\)R' | awk '{print $2}' | sudo xargs --no-run-if-empty dpkg -P --force-remove-reinstreq" 2016-02-12 08:27:44,293.293 INFO:teuthology.orchestra.run.smithi049:Running: 'sudo DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" autoremove' 2016-02-12 08:27:44,376.376 INFO:teuthology.orchestra.run.smithi049.stderr:E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.
Usually running 'sudo dpkg --configure -a' fix the issue, maybe we should re-try it?
Updated by Zack Cerza about 8 years ago
As of right now, mira049's job history, most recent first:
running http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7142
dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7136
{'smithi041.front.sepia.ceph.com': {'invocation': {'module_name': 'apt', 'module_args': ''}, 'failed': True, 'parsed': False, 'msg': 'BECOME-SUCCESS-hcuscfuixbfpipjxokrryrevriuxxmfk\nOpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014\r\ndebug1: Reading configuration data /var/lib/teuthworker/.ssh/config\r\ndebug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for *\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for smithi*\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: mux_client_request_session: master session id: 2\r\nTraceback (most recent call last):\n File "<stdin>", line 2246, in <module>\n File "<stdin>", line 549, in main\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__\n self.open(progress)\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 151, in open\n self._cache = apt_pkg.Cache(progress)\nSystemError: E:Problem renaming the file /var/cache/apt/pkgcache.bin.aHwvNE to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory), W:You may want to run apt-get update to correct these problems\n'}, 'smithi049.front.sepia.ceph.com': {'invocation': {'module_name': 'apt', 'module_args': ''}, 'failed': True, 'parsed': False, 'item': 'ceph,ceph-common,libcephfs1,radosgw,python-ceph,python-rados,librbd1,librados2,openmpi-common', 'msg': 'BECOME-SUCCESS-zavvekjiqlacfpzehdvxtjpnsiwthyoe\nOpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014\r\ndebug1: Reading configuration data /var/lib/teuthworker/.ssh/config\r\ndebug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for *\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for smithi*\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: mux_client_request_session: master session id: 4\r\nTraceback (most recent call last):\n File "<stdin>", line 2246, in <module>\n File "<stdin>", line 549, in main\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__\n self.open(progress)\n File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 153, in open\n self._records = apt_pkg.PackageRecords(self._cache)\nSystemError: E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_i18n_Translation-en - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_binary-i386_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_multiverse_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_universe_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_restricted_binary-amd64_Packages - open (2: No such file or directory), E:Could not open file /var/lib/apt/lists/apt-mirror.front.sepia.ceph.com_archive.ubuntu.com_ubuntu_dists_trusty_main_binary-amd64_Packages - open (2: No such file or directory)\n'}}
dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7128
{'smithi049.front.sepia.ceph.com': 'SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh'}
dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7119
SSH connection to smithi049 was lost: u'sudo DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install ceph=10.0.3-979-g35a4db4-1trusty ceph-mds=10.0.3-979-g35a4db4-1trusty ceph-common=10.0.3-979-g35a4db4-1trusty ceph-fuse=10.0.3-979-g35a4db4-1trusty ceph-test=10.0.3-979-g35a4db4-1trusty radosgw=10.0.3-979-g35a4db4-1trusty python-ceph=10.0.3-979-g35a4db4-1trusty libcephfs1=10.0.3-979-g35a4db4-1trusty libcephfs-java=10.0.3-979-g35a4db4-1trusty libcephfs-jni=10.0.3-979-g35a4db4-1trusty librados2=10.0.3-979-g35a4db4-1trusty librbd1=10.0.3-979-g35a4db4-1trusty rbd-fuse=10.0.3-979-g35a4db4-1trusty librados2=10.0.3-979-g35a4db4-1trusty librbd1=10.0.3-979-g35a4db4-1trusty'
dead http://pulpito.front.sepia.ceph.com/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7112
SSH connection to smithi049 was lost: 'rm -rf -- /home/ubuntu/cephtest/data/test.client.0'
Job 7119 lost the ssh connection while dpkg was running. I have not investigated why.
Updated by Zack Cerza about 8 years ago
teuthology.log
:
2016-02-12T08:59:27.402 INFO:teuthology.orchestra.run.smithi049.stdout:Preparing to unpack .../libjs-jquery_1.7.2+dfsg-2ubuntu1_all.deb ... 2016-02-12T08:59:27.451 INFO:teuthology.orchestra.run.smithi049.stdout:Unpacking libjs-jquery (1.7.2+dfsg-2ubuntu1) ... [...] 2016-02-12T09:01:16.836 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
/var/log/syslog
:
Feb 12 08:58:37 smithi049 dhclient: DHCPREQUEST of 172.21.15.49 on eth1 to 172.21.0.10 port 67 (xid=0x1a942231) Feb 12 08:58:37 smithi049 dhclient: DHCPACK of 172.21.15.49 from 172.21.0.10 Feb 12 08:58:37 smithi049 dhclient: bound to 172.21.15.49 -- renewal in 3015 seconds. Feb 12 09:01:41 smithi049 kernel: [21787.472823] init: tgt main process (6848) killed by KILL signal Feb 12 09:02:22 smithi049 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="2820" x-info="http://www.rsyslog.com"] exiting on signal 15. Feb 12 09:02:22 smithi049 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="15604" x-info="http://www.rsyslog.com"] start
Updated by Zack Cerza about 8 years ago
- Project changed from teuthology to sepia
- Subject changed from teuthology-nuke gets stuck with "dpkg was interrupted" error to connection lost on smithi049 while running dpkg
Updated by Zack Cerza about 8 years ago
2016-02-12T09:14:09.406 INFO:teuthology.task.ansible.out:TASK: [testnode | Update apt cache.] ****************************************** 2016-02-12T09:14:18.075 INFO:teuthology.task.ansible.out:[0;31mfailed: [smithi041.front.sepia.ceph.com] => {"failed": true, "parsed": false}[0m [0;31mBECOME-SUCCESS-hcuscfuixbfpipjxokrryrevriuxxmfk 2016-02-12T09:14:18.078 INFO:teuthology.task.ansible.out: OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014 debug1: Reading configuration data /var/lib/teuthworker/.ssh/config debug1: /var/lib/teuthworker/.ssh/config line 1: Applying options for * debug1: Reading configuration data /etc/ssh/ssh_config debug1: /etc/ssh/ssh_config line 19: Applying options for * debug1: /etc/ssh/ssh_config line 56: Applying options for smithi* debug1: auto-mux: Trying existing master debug1: mux_client_request_session: master session id: 2 Traceback (most recent call last): File "<stdin>", line 2246, in <module> File "<stdin>", line 549, in main File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 107, in __init__ self.open(progress) File "/usr/lib/python2.7/dist-packages/apt/cache.py", line 151, in open self._cache = apt_pkg.Cache(progress) SystemError: E:Problem renaming the file /var/cache/apt/pkgcache.bin.aHwvNE to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory), W:You may want to run apt-get update to correct these problems [0m [0;32mok: [smithi049.front.sepia.ceph.com] => {"changed": false}[0m
Updated by David Galloway about 8 years ago
- Category set to Infrastructure Hardware
Updated by Dan Mick about 8 years ago
smithi041 currently responds to apt-get update with
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.
and dpkg -i shows libc-bin's state as 'it', which I'm assuming is the problem there, so it looks like dpkg was in fact interrupted in the past...but apt-get update doesn't fix it
Updated by Zack Cerza about 8 years ago
- Subject changed from connection lost on smithi049 while running dpkg to smithi049 and smithi041 rebooting constantly
I marked smithi041 and smithi049 down. Yuri provided these logs just now:
http://qa-proxy.ceph.com/teuthology/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7144/teuthology.log
http://qa-proxy.ceph.com/teuthology/sage-2016-02-12_05:26:12-rados-wip-sage-testing---basic-smithi/7142/teuthology.log
Updated by Zack Cerza about 8 years ago
- Status changed from New to Resolved
This was a strange one to debug.
I stayed logged in to both machines tailing /var/log/{auth.log,syslog}
and noticed output consistent with what teuthology-nuke
would produce:
Feb 15 18:25:54 smithi041 systemd-logind[660]: New session 3 of user ubuntu. Feb 15 18:25:54 smithi041 sudo: ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/sh -c iptables-save | grep -v teuthology | iptables-restore Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session closed for user root Feb 15 18:25:54 smithi041 sudo: ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/killall --quiet ceph-mon ceph-osd ceph-mds ceph-fuse ceph-disk radosgw ceph_test_rados rados rbd-fuse apache2 Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Feb 15 18:25:54 smithi041 sudo: pam_unix(sudo:session): session closed for user root Feb 15 18:25:59 smithi041 sudo: ubuntu : TTY=unknown ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/reboot -f -n Feb 15 18:25:59 smithi041 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
I found the culprit. This job had been running for over six days:
http://pulpito.front.sepia.ceph.com/sage-blue-3/1583/
http://qa-proxy.ceph.com/teuthology/sage-blue-3/1583/teuthology.log
Its worker process was trying to kill it:
http://qa-proxy.ceph.com/teuthology/sage-blue-3/1583/worker.log
But the job process wasn't responding to SIGTERM
, so the kill would re-attempt forever. Any time another job owned by scheduled_sage@teuthology
had one or both locked, the nuke would proceed, rebooting one or both machines.
I had to manually send a SIGKILL
to the job process.
I will resolve this but leave #14775 open (which I just filed) to track the teuthology-kill
behavior.