Bug #3459
closed
osd crash in CephXAuthorizer::verify_reply
0%
Description
Log: ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108
2012-11-06 21:13:39.757184 1cd7e700 -1 *** Caught signal (Aborted) **
in thread 1cd7e700
ceph version 0.53-618-g15b3d98 (15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x7486ba]
2: (()+0xfcb0) [0x5043cb0]
3: (gsignal()+0x35) [0x69a4445]
4: (abort()+0x17b) [0x69a7bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x621569d]
6: (()+0xb5846) [0x6213846]
7: (()+0xb5873) [0x6213873]
8: (()+0xb596e) [0x621396e]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x7f3fd7]
10: (CephXAuthorizer::verify_reply(ceph::buffer::list::iterator&)+0xeb) [0x74d69b]
11: (Pipe::connect()+0x18a3) [0x896dc3]
12: (Pipe::writer()+0x4cd) [0x8a051d]
13: (Pipe::Writer::entry()+0xd) [0x8a1f6d]
14: (()+0x7e9a) [0x503be9a]
15: (clone()+0x6d) [0x6a604bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
config file:
ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 22cddde104d715600a4c218bf9224923208afe90
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
    valgrind:
      mds:
      - --tool=memcheck
      mon:
      - --tool=memcheck
      osd:
      - --tool=memcheck
  s3tests:
    branch: master
  workunit:
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana44.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYE0eu9E8TQwtUy89Wldp54VbNBEoO9XQf77eXXzzmNwYUFRrNX0mZV/I8GqyRJuMrPG8V4aZBthBHTtnEmQ6RAS7fVdthi/hEgwnM9cAqY3KX9mR5xJnHBc/fa5KLrnSr3Wrztf42PpQNEN5Tk55K6wWUlZOTHU3vE0j3kF+YQ5FeBhQbghztHPKFR8bOmZJp9TpbXgbvEM2RWr9bYtro1KuQOgrairyVVNWdAuwZuxSQT4soyHoSkY9JmeXKsNRAOamxH9w57mDC3PXui7r6Fp8OCWSK+GmlLTtPaZtulSCcucaZtpVae7F4s9JNxaRl5RxuUtwMRfgAHGlL2BZv
  ubuntu@plana55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdrzGTR0Fbl6sedYlwlX+FlmF6fuE3l/RTu2kzOkmG47rPEn5CI37Injb7Epc50RXCbUIfzmDqtEY6uZT3YssYrE4jvhQlynPndbn1KmiTbgxTyuumGXv7O4OOntezighA1W49phUNZys1DhdEEO8VSQAIdHrBgBLhY9DDgC4LAhrP4BSbDTN0rUXtYYHBj4aa3sJV0o3sKjpsyjjlieEQnto6JkjK6EGZCSuY+AyMZyLJjFTgMwJ9i4aC5eZoWZAWSDfDsxo8PtFR+kjUmz5uiheyn5lAzKBxmd4ZNojf7wOhSGia0ghbtUeQkdoRZXZhP2ourNn3uAguf1xt43kX
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOTCMIScDTmD9NkfsWU7xeyZ+WOXai5izYeliiXDSjJC3bT6r8Fp+rhPfcHCVHiw++VsbvKZtkhjCSnJTVPWCdpRDghzJ3nZUBImWRo3PmHo1etQpCeimaOrIJ2q0ChN5jmSOqy5B+Z4om2vXBtBY6nkdTxDOr2+MH3NrSPkQSFB0zO+VPuwKXsemeUC6urb2IZZpxY3cxNq4fafTF9PROpgOnIA+o3igyU4duKEjnCzTHZjw/PL7Eph/7p6+UQgrUwe7pgVzT+2MM0zcBtBSXNqs3dCGmpvUapOkBlDoIX02EkWRNpkM3vfeFt1EFC17B5vd61Kg40bYUG8qWGR0T
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- mon_recovery: null
ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat summary.yaml
ceph-sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
client.0-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
description: collection:rados-verify clusters:fixed-3.yaml fs:btrfs.yaml msgr-failures:few.yaml
  tasks:mon_recovery.yaml validater:valgrind.yaml
duration: 749.3446509838104
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
  /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/chdir-coredump
  valgrind --suppressions=/tmp/cephtest/valgrind.supp --xml=yes --xml-file=/tmp/cephtest/archive/log/valgrind/osd.1.log
  --tool=memcheck /tmp/cephtest/binary/usr/local/bin/ceph-osd -f -i 1 -c /tmp/cephtest/ceph.conf'''
flavor: notcmalloc
mon.a-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
mon.b-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
owner: scheduled_teuthology@teuthology
success: false
Updated by Sage Weil over 11 years ago
- Priority changed from Normal to Urgent
ubuntu@teuthology:/a/sage-2012-11-12_16:44:02-regression-master-wip-3.4-basic/13948
Updated by Sage Weil over 11 years ago
- Subject changed from osd crash in the nightly run to osd crash in CephXAuthorizer::verify_reply
Updated by Sage Weil over 11 years ago
- Status changed from New to Resolved
this should be fixed by the new guards around decrypt_decode().
Updated by Dan Mick over 11 years ago
A user reports this same crash today in IRC with 0.55:
Updated by Tamilarasi muthamizhan over 11 years ago
- Status changed from Resolved to In Progress
Updated by Sage Weil over 11 years ago
wth, i could have sworn i pushed something that added a try/catch block around the decode, but now i don't see it. pushed wip-3459 that does just that. which means there is probably a dup bug in the tracker somewhere with the same crash...
Updated by Sage Weil over 11 years ago
- Status changed from In Progress to Fix Under Review
the try/catch may be treating the symptom, but it's definitely correct, and the binary for the qa run is long gone so i can't get anything else useful out of the failure. i think we merge the patch and wait for this to strike again (or not!)
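For readers following along, here is a minimal sketch of the kind of guard discussed above. It is not the wip-3459 patch. `ByteReader`, `end_of_buffer`, and this free-standing `verify_reply` are hypothetical stand-ins, and the "nonce + 1" check is a simplified assumption about what the reply carries. The point is only that a short read during decode should turn into a failed verification rather than an uncaught exception that ends in abort(), as in frames 5 through 9 of the trace.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the exception a buffer iterator throws when
// it is asked to copy more bytes than remain.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("end of buffer") {}
};

// Hypothetical stand-in for a buffer iterator with a throwing copy().
class ByteReader {
 public:
  explicit ByteReader(std::vector<unsigned char> data)
      : data_(std::move(data)) {}

  // Copies len bytes into dest, or throws if the buffer is too short.
  void copy(std::size_t len, char* dest) {
    if (len > data_.size() - pos_) throw end_of_buffer();
    if (len == 0) return;
    std::memcpy(dest, data_.data() + pos_, len);
    pos_ += len;
  }

 private:
  std::vector<unsigned char> data_;
  std::size_t pos_ = 0;
};

// Simplified, assumed reply check: the peer echoes nonce + 1.
// A truncated or corrupt reply makes copy() throw; the guard converts
// that into "verification failed" so the caller can drop the connection.
bool verify_reply(ByteReader& reader, std::uint64_t nonce) {
  try {
    std::uint64_t nonce_plus_one = 0;
    reader.copy(sizeof(nonce_plus_one),
                reinterpret_cast<char*>(&nonce_plus_one));
    return nonce_plus_one == nonce + 1;
  } catch (const end_of_buffer&) {
    return false;
  }
}
```

With a guard like this, a reply cut short by an injected socket failure (the run above sets `ms inject socket failures: 5000`) would make the connect attempt fail and retry rather than take down the daemon. Whether the short reply itself points to a deeper bug is a separate question, as noted above.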
Updated by Greg Farnum over 11 years ago
The patch looks fine on its face but several tests in the suite failed. I need to track down if they're familiar errors to anybody and look a little more closely into a couple of them. If you're interested, here they are....
failed tests:
11922: rados test workunit failed
CommandFailedError: Command failed with status 1: 'mkdir -p -- /tmp/cephtest/mnt.0/client.0/tmp && cd -- /tmp/cephtest/mnt.0/client.0/tmp && CEPH_REF=f957cd57c513d7f45b0d0ab1c3db6c4ccbbc110b PATH="$PATH:/tmp/cephtest/binary/usr/local/bin" LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/tmp/cephtest/binary/usr/local/lib" CEPH_CONF="/tmp/cephtest/ceph.conf" CEPH_SECRET_FILE="/tmp/cephtest/data/client.0.secret" CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/tmp/cephtest/binary/usr/local/lib/python2.7/dist-packages:/tmp/cephtest/binary/usr/local/lib/python2.6/dist-packages" /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/workunit.client.0/rados/test.sh'
11963: still running...
11964: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"
11970: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"
11975: osd crashed on a startup (thrashing, I assume?)
osd/OSD.cc: 2434: FAILED assert(pg)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x11a4ac1]
2: (OSD::disconnect_session_watches(OSD::Session*)+0x2a7) [0xea6ef5]
3: (OSD::ms_handle_reset(Connection*)+0x155) [0xea761d]
4: (Messenger::ms_deliver_handle_reset(Connection*)+0x4b) [0x126b419]
5: (DispatchQueue::entry()+0x176) [0x126a4be]
6: (DispatchQueue::DispatchThread::entry()+0x1c) [0x118ac14]
7: (Thread::_entry_func(void*)+0x23) [0x11932ad]
8: (()+0x7e9a) [0x7fc73f45de9a]
9: (clone()+0x6d) [0x7fc73d5e84bd]
11976: both a ceph and a rados command failed. How did they both get the chance to do so?
2012-12-11T23:10:44.121 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd in 4'
2012-12-11T23:10:50.039 INFO:teuthology.task.radosbench.radosbench.2.err:error during benchmark: -2
2012-12-11T23:10:50.040 INFO:teuthology.task.radosbench.radosbench.2.err:error 2: (2) No such file or directory
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.1.keyring --name client.1 -p data bench 1200 write'"
Updated by Sage Weil over 11 years ago
these all appear to be unrelated. i had broken tests in my local teuthology repo, or they were other bugs.
except one new one, opening that now
Updated by Sage Weil over 11 years ago
- Status changed from Fix Under Review to Resolved
other bug is #3414, but it doesn't appear related.
going to merge this change in.