Bug #19900
closed
osd crash during shutdown
Added by Sage Weil almost 7 years ago.
Updated almost 7 years ago.
Description
<frame>
<ip>0xAB00DE</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>ceph::buffer::create_aligned(unsigned int, unsigned int)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir>
<file>buffer.cc</file>
<line>755</line>
</frame>
<frame>
<ip>0xAB4C91</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>ceph::buffer::list::iterator_impl<false>::copy_deep(unsigned int, ceph::buffer::ptr&)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir>
<file>buffer.cc</file>
<line>1189</line>
</frame>
<frame>
<ip>0xC42015</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>CryptoKey::decode(ceph::buffer::list::iterator&)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Crypto.cc</file>
<line>389</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Crypto.h</file>
<line>120</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Auth.h</file>
<line>160</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Auth.h</file>
<line>164</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode<long unsigned int, ExpiringCryptoKey, std::less<long unsigned int>, std::allocator<std::pair<long unsigned int const, ExpiringCryptoKey> >, denc_traits<long unsigned int>, denc_traits<ExpiringCryptoKey, void> ><
/fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/include</dir>
<file>encoding.h</file>
<line>732</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Auth.h</file>
<line>186</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>decode</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
<file>Auth.h</file>
<line>226</line>
</frame>
<frame>
<ip>0xC35956</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>void decode_decrypt_enc_bl<RotatingSecrets>(CephContext*, RotatingSecrets&, CryptoKey, ceph::buffer::list&, std::string&)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
<file>CephxProtocol.h</file>
<line>447</line>
</frame>
<frame>
<ip>0xC35D8E</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>int decode_decrypt<RotatingSecrets>(CephContext*, RotatingSecrets&, CryptoKey const&, ceph::buffer::list::iterator&, std::string&)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
<file>CephxProtocol.h</file>
<line>471</line>
</frame>
<frame>
<ip>0xC32EC4</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
<file>CephxClientHandler.cc</file>
<line>184</line>
</frame>
<frame>
<ip>0xAF8348</ip>
<obj>/usr/bin/ceph-osd</obj>
<fn>MonConnection::authenticate(MAuthReply*)</fn>
<dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/mon</dir>
<file>MonClient.cc</file>
<line>1263</line>
/a/sage-2017-05-10_03:08:19-rados-wip-sage-testing2---basic-smithi/1119667
- Subject changed from osd leak of ExpiringCrpytoKey etc to osd crash during shutdown
This was actually a crash, from teuthology.log:
2017-05-10T04:50:27.383 DEBUG:tasks.ceph.osd.4:waiting for process to exit
2017-05-10T04:50:27.383 INFO:teuthology.orchestra.run:waiting for 300
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.387550 323e7700 -1 received signal: Terminated from PID: 31142 task name: /usr/bin/python /bin/daemon-helper term valgrind --
trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.4.log --time-stamp=yes --tool=memcheck ceph-os
d -f --cluster ceph -i 4 UID: 0
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.389154 323e7700 -1 osd.4 231 *** Got signal Terminated ***
2017-05-10T04:50:27.571 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
2017-05-10T04:50:30.668 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-21
98-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855
2017-05-10T04:50:30.733 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-21
98-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: ceph version 12.0.1-2198-g8760432 (87604323e6d350d533c826fdb91f2052a11d3de9)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xae6630]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 2: (MonClient::wait_auth_rotating(double)+0x63c) [0xaf746c]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x5771ec]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 4: (SimpleMessenger::get_authorizer(int, bool)+0x5b) [0xb5e54b]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 5: (Pipe::connect()+0x26f9) [0xcd92e9]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 6: (Pipe::writer()+0x531) [0xcdbee1]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 7: (Pipe::Writer::entry()+0xd) [0xcde9ad]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 8: (()+0x7dc5) [0xc729dc5]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 9: (clone()+0x6d) [0xd87573d]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- Assignee set to Brad Hubbard
-1927> 2017-05-10 04:50:27.436359 1ebc0700 20 osd.4 231 OSD::ms_dispatch: MOSDMarkMeDown(request_ack=0, target_osd=osd.4 172.21.15.130:6801/31147, fsid=db4a4e71-b01d-4cf1-bba3-b9b29f51cd70) v2
-1926> 2017-05-10 04:50:27.438216 1ebc0700 0 osd.4 231 got_stop_ack starting shutdown
-1925> 2017-05-10 04:50:27.443950 323e7700 0 osd.4 231 prepare_to_stop starting shutdown
-1924> 2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
..
-1440> 2017-05-10 04:50:27.835363 1ebc0700 1 -- 172.21.15.130:6801/31147 <== mon.2 172.21.15.19:6790/0 4 ==== mgrmap(e 4) v1 ==== 90+0+0 (1486229013 0 0) 0xf685aa0 con 0x3a461540
<< got a mgrmap via con 0x3a461540
-1439> 2017-05-10 04:50:27.835574 1ebc0700 4 mgrc handle_mgr_map Got map version 4
-1438> 2017-05-10 04:50:27.835621 1ebc0700 4 mgrc handle_mgr_map Active mgr is now 172.21.15.130:6800/12188
-1437> 2017-05-10 04:50:27.836396 1ebc0700 4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
-1411> 2017-05-10 04:50:27.843961 970d700 10 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x1846d570 sd=34 :58180 s=2 pgs=266 cs=1 l=1 c=0x1846e890).reader got message 2 0xf68b3a0 mgrconfigure() v1
..
-1403> 2017-05-10 04:50:27.845439 1ebc0700 4 mgrc handle_mgr_configure stats_period=5
..
-112> 2017-05-10 04:50:30.199612 323e7700 1 bluefs umount
..
-95> 2017-05-10 04:50:30.494950 323e7700 10 monclient: shutdown
-94> 2017-05-10 04:50:30.496994 323e7700 1 -- 172.21.15.130:6801/31147 mark_down 0x3a461540 -- 0x192e7fc0
<< we mark con 0x3a461540 now.
..
-54> 2017-05-10 04:50:30.625266 1ebc0700 4 mgrc ms_handle_reset ms_handle_reset con 0x1846e890
-53> 2017-05-10 04:50:30.627750 1ebc0700 4 mgrc reconnect Terminating session with 172.21.15.130:6800/12188
-52> 2017-05-10 04:50:30.629387 1ebc0700 1 -- 172.21.15.130:6801/31147 mark_down 0x1846e890 -- pipe dne
-51> 2017-05-10 04:50:30.631857 1ebc0700 4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
-15> 2017-05-10 04:50:30.665362 158af700 0 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x3a056d30 sd=21 :60649 s=1 pgs=0 cs=0 l=1 c=0xe2375d0).connect got BADAUTHORIZER
-14> 2017-05-10 04:50:30.667752 158af700 10 osd.4 0 OSD::ms_get_authorizer type=mgr
-13> 2017-05-10 04:50:30.752065 158af700 -1 /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2198-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In f
unction 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855
so even with https://github.com/ceph/ceph/commit/611572a6103e20d3e3abf3530ea33e539b9db371, MgrClient can still try to (re)connect to mgr after MgrClient::shutdown()
, when handling mgrmap after its mgrsession got reset.
- Status changed from New to In Progress
- Status changed from In Progress to Resolved
Also available in: Atom
PDF