Project

General

Profile

Actions

Bug #540

closed

CephxClientHandler::handle_response

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
cephx
Target version:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Saw this crash today after upgrading to the latest unstable:

Core was generated by `/usr/bin/cosd -i 5 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f9bd0120a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f9bd0120a75 in raise () from /lib/libc.so.6
#1  0x00007f9bd0124545 in abort () from /lib/libc.so.6
#2  0x00007f9bd09d68e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f9bd09d4d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f9bd09d4d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f9bd09d4e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005c8a98 in ceph::__ceph_assert_fail (assertion=0x60b537 "0", file=<value optimized out>, line=162, func=<value optimized out>) at common/assert.cc:30
#7  0x00000000005e432c in CephxClientHandler::handle_response (this=<value optimized out>, ret=-792682944, indata=...) at auth/cephx/CephxClientHandler.cc:162
#8  0x00000000005ce27b in MonClient::handle_auth (this=0x7ffff01d6420, m=0x2255400) at mon/MonClient.cc:375
#9  0x00000000005d0733 in MonClient::ms_dispatch (this=0x7ffff01d6420, m=0x2255400) at mon/MonClient.cc:227
#10 0x0000000000463b39 in Messenger::ms_deliver_dispatch (this=0x11b7000) at msg/Messenger.h:97
#11 SimpleMessenger::dispatch_entry (this=0x11b7000) at msg/SimpleMessenger.cc:342
#12 0x000000000045abec in SimpleMessenger::DispatchThread::entry (this=0x11b7488) at msg/SimpleMessenger.h:570
#13 0x000000000046f1ba in Thread::_entry_func (arg=0x6ec2) at ./common/Thread.h:39
#14 0x00007f9bd121b9ca in start_thread () from /lib/libpthread.so.0
#15 0x00007f9bd01d370d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()

Last loglines:

2010-11-03 10:09:22.780031 7f9bc348f710 -- [2001:16f8:10:2::c3c3:3b6a]:6801/28353 >> [2001:16f8:10:2::c3c3:ab76]:6801/14744 pipe(0x26d7780 sd=12 pgs=167 cs=2 l=0).connect claims to be [2001:16f8:10:2::c3c3:ab76]:6801/17673 not [2001:16f8:10:2::c3c3:ab76]:6801/14744 - wrong node!
2010-11-03 10:09:23.553928 7f9bc3f9a710 -- [2001:16f8:10:2::c3c3:3b6a]:6801/28353 >> [2001:16f8:10:2::c3c3:2e56]:6801/21758 pipe(0x26a4a00 sd=12 pgs=201 cs=2 l=0).connect claims to be [2001:16f8:10:2::c3c3:2e56]:6801/3071 not [2001:16f8:10:2::c3c3:2e56]:6801/21758 - wrong node!
2010-11-03 10:09:26.016180 7f9bc94b0710 cephx client: handle_response ret = 0
2010-11-03 10:09:26.016245 7f9bc94b0710 cephx client:  got initial server challenge 15029606718942791501
2010-11-03 10:09:26.016260 7f9bc94b0710 cephx client: build_request
2010-11-03 10:09:26.016270 7f9bc94b0710 cephx client: validate_tickets: want=37 need=32 have=5
2010-11-03 10:09:26.016287 7f9bc94b0710 cephx: validate_tickets want 37 have 5 need 32
2010-11-03 10:09:26.016298 7f9bc94b0710 cephx client: want=37 need=32 have=5
2010-11-03 10:09:26.016373 7f9bc94b0710 cephx client: old ticket len=96
2010-11-03 10:09:26.016386 7f9bc94b0710 cephx client: get auth session key: client_challenge 5687951232298511269
2010-11-03 10:09:26.998599 7f9bc7cad710 -- [2001:16f8:10:2::c3c3:3b6a]:6800/28353 >> [2001:16f8:10:2::c3c3:3f9b]:6789/0 pipe(0x369e000 sd=-1 pgs=0 cs=0 l=0).fault first fault
2010-11-03 10:09:35.966178 7f9bc94b0710 cephx client: handle_response ret = 0
2010-11-03 10:09:35.966245 7f9bc94b0710 cephx client:  got initial server challenge 282574488338433
2010-11-03 10:09:35.966259 7f9bc94b0710 cephx client: build_request
2010-11-03 10:09:35.966268 7f9bc94b0710 cephx client: validate_tickets: want=37 need=32 have=5
2010-11-03 10:09:35.966286 7f9bc94b0710 cephx: validate_tickets want 37 have 5 need 32
2010-11-03 10:09:35.966297 7f9bc94b0710 cephx client: want=37 need=32 have=5
2010-11-03 10:09:35.966370 7f9bc94b0710 cephx client: old ticket len=96
2010-11-03 10:09:35.966382 7f9bc94b0710 cephx client: get auth session key: client_challenge 10140645574015473838
2010-11-03 10:09:35.966405 7f9bc94b0710 cephx client: handle_response ret = 0
2010-11-03 10:09:35.966416 7f9bc94b0710 cephx client:  unknown request_type 27649
auth/cephx/CephxClientHandler.cc: In function 'virtual int CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)':
auth/cephx/CephxClientHandler.cc:162: FAILED assert(0)
 ceph version 0.23~rc (commit:e304a2451a80f11117cb01031734e72077c88ce0)
 1: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x5ce27b]
 2: (MonClient::ms_dispatch(Message*)+0x143) [0x5d0733]
 3: (SimpleMessenger::dispatch_entry()+0x749) [0x463b39]
 4: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45abec]
 5: (Thread::_entry_func(void*)+0xa) [0x46f1ba]
 6: (()+0x69ca) [0x7f9bd121b9ca]
 7: (clone()+0x6d) [0x7f9bd01d370d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.23~rc (commit:e304a2451a80f11117cb01031734e72077c88ce0)
 1: (sigabrt_handler(int)+0x7d) [0x5db9fd]
 2: (()+0x33af0) [0x7f9bd0120af0]
 3: (gsignal()+0x35) [0x7f9bd0120a75]
 4: (abort()+0x180) [0x7f9bd01245c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f9bd09d68e5]
 6: (()+0xcad16) [0x7f9bd09d4d16]
 7: (()+0xcad43) [0x7f9bd09d4d43]
 8: (()+0xcae3e) [0x7f9bd09d4e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5c8a98]
 10: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0x13c) [0x5e432c]
 11: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x5ce27b]
 12: (MonClient::ms_dispatch(Message*)+0x143) [0x5d0733]
 13: (SimpleMessenger::dispatch_entry()+0x749) [0x463b39]
 14: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45abec]
 15: (Thread::_entry_func(void*)+0xa) [0x46f1ba]
 16: (()+0x69ca) [0x7f9bd121b9ca]
 17: (clone()+0x6d) [0x7f9bd01d370d]

I've used cdebugpack to gather all the relevant information, it has been uploaded to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_cephx_client_handler

Restarted the OSD, it didn't crash again afterwards.

Actions #1

Updated by Sage Weil over 13 years ago

  • Category changed from OSD to cephx
  • Target version set to v0.23.1

Wido just saw this:

2010-11-11 14:28:32.693427 7fb2adcfd710 cephx client:  unknown request_type 11521
auth/cephx/CephxClientHandler.cc: In function 'virtual int CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)':
auth/cephx/CephxClientHandler.cc:162: FAILED assert(0)
 ceph version 0.24~rc (commit:3d6e915583ac95263732982462b2e69bb2411db7)
 1: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x4e629b]
 2: (MonClient::ms_dispatch(Message*)+0x143) [0x4e8e03]
 3: (SimpleMessenger::dispatch_entry()+0x749) [0x46c599]
 4: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45779c]
 5: (Thread::_entry_func(void*)+0xa) [0x477bfa]
 6: (()+0x69ca) [0x7fb2b6ea79ca]
 7: (clone()+0x6d) [0x7fb2b21b870d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
auth/cephx/CephxClientHandler.cc: In function 'virtual int CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)':
auth/cephx/CephxClientHandler.cc:162: FAILED assert(0)
 ceph version 0.24~rc (commit:3d6e915583ac95263732982462b2e69bb2411db7)
 1: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x4e629b]
 2: (MonClient::ms_dispatch(Message*)+0x143) [0x4e8e03]
 3: (SimpleMessenger::dispatch_entry()+0x749) [0x46c599]
 4: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45779c]
 5: (Thread::_entry_func(void*)+0xa) [0x477bfa]
 6: (()+0x69ca) [0x7fb2b6ea79ca]
 7: (clone()+0x6d) [0x7fb2b21b870d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:3d6e915583ac95263732982462b2e69bb2411db7)
 1: (sigabrt_handler(int)+0x7d) [0x4f95ed]
 2: (()+0x33af0) [0x7fb2b2105af0]
 3: (gsignal()+0x35) [0x7fb2b2105a75]
 4: (abort()+0x180) [0x7fb2b21095c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb2b29bb8e5]
 6: (()+0xcad16) [0x7fb2b29b9d16]
 7: (()+0xcad43) [0x7fb2b29b9d43]
 8: (()+0xcae3e) [0x7fb2b29b9e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x4e0d28]
 10: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0x13c) [0x502d3c]
 11: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x4e629b]
 12: (MonClient::ms_dispatch(Message*)+0x143) [0x4e8e03]
 13: (SimpleMessenger::dispatch_entry()+0x749) [0x46c599]
 14: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45779c]
 15: (Thread::_entry_func(void*)+0xa) [0x477bfa]
 16: (()+0x69ca) [0x7fb2b6ea79ca]
 17: (clone()+0x6d) [0x7fb2b21b870d]
Aborted

Actions #2

Updated by Wido den Hollander over 13 years ago

I saw that on a test machine of mine. The 'ceph -w' command was hanging for about 10 seconds and then exited with this message.

The cluster had:
  • 1 Monitor
  • 1 MDS
  • 3 OSD

All on the same machine.

During this message, I was importing a 4GB RBD image via "rbd import"

Actions #3

Updated by Wido den Hollander over 13 years ago

Just saw it again on the same cluster, this time osd2 crashed when upgrading to this morning's unstable:

2010-11-12 12:18:22.715982 --- 17142 opened log /var/log/ceph/osd.2.log ---
ceph version 0.24~rc (commit:ce6d63947fc9dc479376184e50683bdd9dd8cf81)
2010-11-12 12:18:23.815544 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount FIEMAP ioctl is NOT supported
2010-11-12 12:18:23.815763 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount detected btrfs
2010-11-12 12:18:23.815781 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs CLONE_RANGE ioctl is supported
2010-11-12 12:18:24.397478 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs SNAP_CREATE is supported
2010-11-12 12:18:24.497010 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs SNAP_DESTROY is supported
2010-11-12 12:18:24.497488 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs START_SYNC got 0 Success
2010-11-12 12:18:24.497505 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs START_SYNC is supported (transid 6438)
2010-11-12 12:18:26.408953 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs WAIT_SYNC is supported
2010-11-12 12:18:26.441939 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs SNAP_CREATE_ASYNC got 0 Success
2010-11-12 12:18:26.442057 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount btrfs SNAP_CREATE_ASYNC is supported
2010-11-12 12:18:26.497175 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount found snaps <7384,56732,56737>
2010-11-12 12:18:28.320230 7f68288d8720 journal read_entry 77537280 : seq 56733 5222 bytes
2010-11-12 12:18:28.388666 7f68288d8720 journal read_entry 77549568 : seq 56734 5222 bytes
2010-11-12 12:18:28.389460 7f68288d8720 journal read_entry 77561856 : seq 56735 21606 bytes
2010-11-12 12:18:28.389626 7f68288d8720 journal read_entry 77590528 : seq 56736 5222 bytes
2010-11-12 12:18:28.389645 7f68288d8720 journal read_entry 77602816 : seq 56737 5222 bytes
2010-11-12 12:18:28.624392 7f68288d8720 filestore(/var/lib/ceph/osd.2) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_ASYNC detected and 'filestore btrfs snap' mode is enabled
2010-11-12 12:18:57.550419 7f681fa06710 osd2 62 handle_osd_map fsid 12d34577-3824-fa22-1cd5-887be3a7dbd4 != 00000000-0000-0000-0000-000000000000
2010-11-12 12:18:57.550890 7f681fa06710 cephx client:  unknown request_type 33537
auth/cephx/CephxClientHandler.cc: In function 'virtual int CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)':
auth/cephx/CephxClientHandler.cc:162: FAILED assert(0)
 ceph version 0.24~rc (commit:ce6d63947fc9dc479376184e50683bdd9dd8cf81)
 1: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0x13c) [0x5e604c]
 2: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x5d064b]
 3: (MonClient::ms_dispatch(Message*)+0x143) [0x5d31b3]
 4: (SimpleMessenger::dispatch_entry()+0x749) [0x462a19]
 5: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a68c]
 6: (Thread::_entry_func(void*)+0xa) [0x46e09a]
 7: (()+0x69ca) [0x7f68282c59ca]
 8: (clone()+0x6d) [0x7f682702b70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:ce6d63947fc9dc479376184e50683bdd9dd8cf81)
 1: (sigabrt_handler(int)+0x7d) [0x5dd71d]
 2: (()+0x33af0) [0x7f6826f78af0]
 3: (gsignal()+0x35) [0x7f6826f78a75]
 4: (abort()+0x180) [0x7f6826f7c5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f682782e8e5]
 6: (()+0xcad16) [0x7f682782cd16]
 7: (()+0xcad43) [0x7f682782cd43]
 8: (()+0xcae3e) [0x7f682782ce3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5ca5d8]
 10: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0x13c) [0x5e604c]
 11: (MonClient::handle_auth(MAuthReply*)+0x7b) [0x5d064b]
 12: (MonClient::ms_dispatch(Message*)+0x143) [0x5d31b3]
 13: (SimpleMessenger::dispatch_entry()+0x749) [0x462a19]
 14: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a68c]
 15: (Thread::_entry_func(void*)+0xa) [0x46e09a]
 16: (()+0x69ca) [0x7f68282c59ca]
 17: (clone()+0x6d) [0x7f682702b70d]
Actions #4

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

couldn't reproduce this, but fixed two smallish things that may have been responsible for this:
61dd4f03e6e15d59951ff69d95807a3026f5b9c4 and c327c6a2064f3a14f0221b53dc0c87d744d0e977

Please let us know if it pops up again!

Actions

Also available in: Atom PDF