Bug #19900: osd crash during shutdown
Status: Closed
% Done: 0%
Regression: No
Severity: 3 - minor
Description
<frame> <ip>0xAB00DE</ip> <obj>/usr/bin/ceph-osd</obj> <fn>ceph::buffer::create_aligned(unsigned int, unsigned int)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir> <file>buffer.cc</file> <line>755</line> </frame>
<frame> <ip>0xAB4C91</ip> <obj>/usr/bin/ceph-osd</obj> <fn>ceph::buffer::list::iterator_impl<false>::copy_deep(unsigned int, ceph::buffer::ptr&)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir> <file>buffer.cc</file> <line>1189</line> </frame>
<frame> <ip>0xC42015</ip> <obj>/usr/bin/ceph-osd</obj> <fn>CryptoKey::decode(ceph::buffer::list::iterator&)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Crypto.cc</file> <line>389</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Crypto.h</file> <line>120</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Auth.h</file> <line>160</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Auth.h</file> <line>164</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode<long unsigned int, ExpiringCryptoKey, std::less<long unsigned int>, std::allocator<std::pair<long unsigned int const, ExpiringCryptoKey> >, denc_traits<long unsigned int>, denc_traits<ExpiringCryptoKey, void> ></fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/include</dir> <file>encoding.h</file> <line>732</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Auth.h</file> <line>186</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>decode</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir> <file>Auth.h</file> <line>226</line> </frame>
<frame> <ip>0xC35956</ip> <obj>/usr/bin/ceph-osd</obj> <fn>void decode_decrypt_enc_bl<RotatingSecrets>(CephContext*, RotatingSecrets&, CryptoKey, ceph::buffer::list&, std::string&)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir> <file>CephxProtocol.h</file> <line>447</line> </frame>
<frame> <ip>0xC35D8E</ip> <obj>/usr/bin/ceph-osd</obj> <fn>int decode_decrypt<RotatingSecrets>(CephContext*, RotatingSecrets&, CryptoKey const&, ceph::buffer::list::iterator&, std::string&)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir> <file>CephxProtocol.h</file> <line>471</line> </frame>
<frame> <ip>0xC32EC4</ip> <obj>/usr/bin/ceph-osd</obj> <fn>CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir> <file>CephxClientHandler.cc</file> <line>184</line> </frame>
<frame> <ip>0xAF8348</ip> <obj>/usr/bin/ceph-osd</obj> <fn>MonConnection::authenticate(MAuthReply*)</fn> <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/mon</dir> <file>MonClient.cc</file> <line>1263</line> </frame>
/a/sage-2017-05-10_03:08:19-rados-wip-sage-testing2---basic-smithi/1119667
Updated by Josh Durgin almost 7 years ago
- Subject changed from osd leak of ExpiringCrpytoKey etc to osd crash during shutdown
This was actually a crash, from teuthology.log:
2017-05-10T04:50:27.383 DEBUG:tasks.ceph.osd.4:waiting for process to exit
2017-05-10T04:50:27.383 INFO:teuthology.orchestra.run:waiting for 300
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.387550 323e7700 -1 received signal: Terminated from PID: 31142 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.4.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 4  UID: 0
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.389154 323e7700 -1 osd.4 231 *** Got signal Terminated ***
2017-05-10T04:50:27.571 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
2017-05-10T04:50:30.668 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2198-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855
2017-05-10T04:50:30.733 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2198-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: ceph version 12.0.1-2198-g8760432 (87604323e6d350d533c826fdb91f2052a11d3de9)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xae6630]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 2: (MonClient::wait_auth_rotating(double)+0x63c) [0xaf746c]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x5771ec]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 4: (SimpleMessenger::get_authorizer(int, bool)+0x5b) [0xb5e54b]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 5: (Pipe::connect()+0x26f9) [0xcd92e9]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 6: (Pipe::writer()+0x531) [0xcdbee1]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 7: (Pipe::Writer::entry()+0xd) [0xcde9ad]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 8: (()+0x7dc5) [0xc729dc5]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 9: (clone()+0x6d) [0xd87573d]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Kefu Chai almost 7 years ago
-1927> 2017-05-10 04:50:27.436359 1ebc0700 20 osd.4 231 OSD::ms_dispatch: MOSDMarkMeDown(request_ack=0, target_osd=osd.4 172.21.15.130:6801/31147, fsid=db4a4e71-b01d-4cf1-bba3-b9b29f51cd70) v2
-1926> 2017-05-10 04:50:27.438216 1ebc0700  0 osd.4 231 got_stop_ack starting shutdown
-1925> 2017-05-10 04:50:27.443950 323e7700  0 osd.4 231 prepare_to_stop starting shutdown
-1924> 2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
..
-1440> 2017-05-10 04:50:27.835363 1ebc0700  1 -- 172.21.15.130:6801/31147 <== mon.2 172.21.15.19:6790/0 4 ==== mgrmap(e 4) v1 ==== 90+0+0 (1486229013 0 0) 0xf685aa0 con 0x3a461540    << got a mgrmap via con 0x3a461540
-1439> 2017-05-10 04:50:27.835574 1ebc0700  4 mgrc handle_mgr_map Got map version 4
-1438> 2017-05-10 04:50:27.835621 1ebc0700  4 mgrc handle_mgr_map Active mgr is now 172.21.15.130:6800/12188
-1437> 2017-05-10 04:50:27.836396 1ebc0700  4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
-1411> 2017-05-10 04:50:27.843961 970d700 10 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x1846d570 sd=34 :58180 s=2 pgs=266 cs=1 l=1 c=0x1846e890).reader got message 2 0xf68b3a0 mgrconfigure() v1
..
-1403> 2017-05-10 04:50:27.845439 1ebc0700  4 mgrc handle_mgr_configure stats_period=5
..
-112> 2017-05-10 04:50:30.199612 323e7700  1 bluefs umount
..
-95> 2017-05-10 04:50:30.494950 323e7700 10 monclient: shutdown
-94> 2017-05-10 04:50:30.496994 323e7700  1 -- 172.21.15.130:6801/31147 mark_down 0x3a461540 -- 0x192e7fc0    << we mark con 0x3a461540 now.
..
-54> 2017-05-10 04:50:30.625266 1ebc0700  4 mgrc ms_handle_reset ms_handle_reset con 0x1846e890
-53> 2017-05-10 04:50:30.627750 1ebc0700  4 mgrc reconnect Terminating session with 172.21.15.130:6800/12188
-52> 2017-05-10 04:50:30.629387 1ebc0700  1 -- 172.21.15.130:6801/31147 mark_down 0x1846e890 -- pipe dne
-51> 2017-05-10 04:50:30.631857 1ebc0700  4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
-15> 2017-05-10 04:50:30.665362 158af700  0 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x3a056d30 sd=21 :60649 s=1 pgs=0 cs=0 l=1 c=0xe2375d0).connect got BADAUTHORIZER
-14> 2017-05-10 04:50:30.667752 158af700 10 osd.4 0 OSD::ms_get_authorizer type=mgr
-13> 2017-05-10 04:50:30.752065 158af700 -1 /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2198-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855
So even with https://github.com/ceph/ceph/commit/611572a6103e20d3e3abf3530ea33e539b9db371, MgrClient can still try to (re)connect to the mgr after MgrClient::shutdown(), when it handles a mgrmap after its mgr session has been reset.
Updated by Brad Hubbard almost 7 years ago
- Status changed from New to In Progress
Updated by Sage Weil almost 7 years ago
/a/sage-2017-05-27_03:43:09-rados-wip-sage-testing2---basic-smithi/1235367
the branch contained eb5c02d from https://github.com/ceph/ceph/pull/15192, so not fixed yet!
Updated by Brad Hubbard almost 7 years ago
OMG.
How did we all (especially me) not realise there was a missing return statement in https://github.com/ceph/ceph/pull/15192/commits/eb5c02df634b34ca45ecbf1eaf4440ccf806845f ?
Updated by Brad Hubbard almost 7 years ago
- Status changed from In Progress to Resolved