Project

General

Profile

Bug #19900

osd crash during shutdown

Added by Sage Weil 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
05/10/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

    <frame>
      <ip>0xAB00DE</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::buffer::create_aligned(unsigned int, unsigned int)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir>
      <file>buffer.cc</file>
      <line>755</line>
    </frame>
    <frame>
      <ip>0xAB4C91</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::buffer::list::iterator_impl&lt;false&gt;::copy_deep(unsigned int, ceph::buffer::ptr&amp;)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/common</dir>
      <file>buffer.cc</file>
      <line>1189</line>
    </frame>
    <frame>
      <ip>0xC42015</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>CryptoKey::decode(ceph::buffer::list::iterator&amp;)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Crypto.cc</file>
      <line>389</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Crypto.h</file>
      <line>120</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Auth.h</file>
      <line>160</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Auth.h</file>
      <line>164</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode&lt;long unsigned int, ExpiringCryptoKey, std::less&lt;long unsigned int&gt;, std::allocator&lt;std::pair&lt;long unsigned int const, ExpiringCryptoKey&gt; &gt;, denc_traits&lt;long unsigned int&gt;, denc_traits&lt;ExpiringCryptoKey, void&gt; &gt;<
/fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/include</dir>
      <file>encoding.h</file>
      <line>732</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Auth.h</file>
      <line>186</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>decode</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth</dir>
      <file>Auth.h</file>
      <line>226</line>
    </frame>
    <frame>
      <ip>0xC35956</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>void decode_decrypt_enc_bl&lt;RotatingSecrets&gt;(CephContext*, RotatingSecrets&amp;, CryptoKey, ceph::buffer::list&amp;, std::string&amp;)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
      <file>CephxProtocol.h</file>
      <line>447</line>
    </frame>
    <frame>
      <ip>0xC35D8E</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>int decode_decrypt&lt;RotatingSecrets&gt;(CephContext*, RotatingSecrets&amp;, CryptoKey const&amp;, ceph::buffer::list::iterator&amp;, std::string&amp;)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
      <file>CephxProtocol.h</file>
      <line>471</line>
    </frame>
    <frame>
      <ip>0xC32EC4</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&amp;)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/auth/cephx</dir>
      <file>CephxClientHandler.cc</file>
      <line>184</line>
    </frame>
    <frame>
      <ip>0xAF8348</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>MonConnection::authenticate(MAuthReply*)</fn>
      <dir>/usr/src/debug/ceph-12.0.1-2198-g8760432/src/mon</dir>
      <file>MonClient.cc</file>
      <line>1263</line>

/a/sage-2017-05-10_03:08:19-rados-wip-sage-testing2---basic-smithi/1119667

History

#1 Updated by Josh Durgin 4 months ago

  • Subject changed from osd leak of ExpiringCrpytoKey etc to osd crash during shutdown

This was actually a crash, from teuthology.log:

2017-05-10T04:50:27.383 DEBUG:tasks.ceph.osd.4:waiting for process to exit
2017-05-10T04:50:27.383 INFO:teuthology.orchestra.run:waiting for 300
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.387550 323e7700 -1 received  signal: Terminated from  PID: 31142 task name: /usr/bin/python /bin/daemon-helper term valgrind --
trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.4.log --time-stamp=yes --tool=memcheck ceph-os
d -f --cluster ceph -i 4  UID: 0
2017-05-10T04:50:27.409 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.389154 323e7700 -1 osd.4 231 *** Got signal Terminated ***
2017-05-10T04:50:27.571 INFO:tasks.ceph.osd.4.smithi130.stderr:2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
2017-05-10T04:50:30.668 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-21
98-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855
2017-05-10T04:50:30.733 INFO:tasks.ceph.osd.4.smithi130.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-21
98-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: ceph version 12.0.1-2198-g8760432 (87604323e6d350d533c826fdb91f2052a11d3de9)
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xae6630]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 2: (MonClient::wait_auth_rotating(double)+0x63c) [0xaf746c]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x5771ec]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 4: (SimpleMessenger::get_authorizer(int, bool)+0x5b) [0xb5e54b]
2017-05-10T04:50:30.759 INFO:tasks.ceph.osd.4.smithi130.stderr: 5: (Pipe::connect()+0x26f9) [0xcd92e9]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 6: (Pipe::writer()+0x531) [0xcdbee1]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 7: (Pipe::Writer::entry()+0xd) [0xcde9ad]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 8: (()+0x7dc5) [0xc729dc5]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: 9: (clone()+0x6d) [0xd87573d]
2017-05-10T04:50:30.760 INFO:tasks.ceph.osd.4.smithi130.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#2 Updated by Brad Hubbard 4 months ago

  • Assignee set to Brad Hubbard

#3 Updated by Kefu Chai 4 months ago

 -1927> 2017-05-10 04:50:27.436359 1ebc0700 20 osd.4 231 OSD::ms_dispatch: MOSDMarkMeDown(request_ack=0, target_osd=osd.4 172.21.15.130:6801/31147, fsid=db4a4e71-b01d-4cf1-bba3-b9b29f51cd70) v2
 -1926> 2017-05-10 04:50:27.438216 1ebc0700  0 osd.4 231 got_stop_ack starting shutdown
 -1925> 2017-05-10 04:50:27.443950 323e7700  0 osd.4 231 prepare_to_stop starting shutdown
 -1924> 2017-05-10 04:50:27.448222 323e7700 -1 osd.4 231 shutdown
..
 -1440> 2017-05-10 04:50:27.835363 1ebc0700  1 -- 172.21.15.130:6801/31147 <== mon.2 172.21.15.19:6790/0 4 ==== mgrmap(e 4) v1 ==== 90+0+0 (1486229013 0 0) 0xf685aa0 con 0x3a461540
  << got a mgrmap via con 0x3a461540
 -1439> 2017-05-10 04:50:27.835574 1ebc0700  4 mgrc handle_mgr_map Got map version 4
 -1438> 2017-05-10 04:50:27.835621 1ebc0700  4 mgrc handle_mgr_map Active mgr is now 172.21.15.130:6800/12188
 -1437> 2017-05-10 04:50:27.836396 1ebc0700  4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
 -1411> 2017-05-10 04:50:27.843961 970d700 10 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x1846d570 sd=34 :58180 s=2 pgs=266 cs=1 l=1 c=0x1846e890).reader got message 2 0xf68b3a0 mgrconfigure() v1
..
 -1403> 2017-05-10 04:50:27.845439 1ebc0700  4 mgrc handle_mgr_configure stats_period=5
..
  -112> 2017-05-10 04:50:30.199612 323e7700  1 bluefs umount
..
   -95> 2017-05-10 04:50:30.494950 323e7700 10 monclient: shutdown
   -94> 2017-05-10 04:50:30.496994 323e7700  1 -- 172.21.15.130:6801/31147 mark_down 0x3a461540 -- 0x192e7fc0
   << we mark con 0x3a461540 now.
..
   -54> 2017-05-10 04:50:30.625266 1ebc0700  4 mgrc ms_handle_reset ms_handle_reset con 0x1846e890
   -53> 2017-05-10 04:50:30.627750 1ebc0700  4 mgrc reconnect Terminating session with 172.21.15.130:6800/12188
   -52> 2017-05-10 04:50:30.629387 1ebc0700  1 -- 172.21.15.130:6801/31147 mark_down 0x1846e890 -- pipe dne
   -51> 2017-05-10 04:50:30.631857 1ebc0700  4 mgrc reconnect Starting new session with 172.21.15.130:6800/12188
..
   -15> 2017-05-10 04:50:30.665362 158af700  0 -- 172.21.15.130:6801/31147 >> 172.21.15.130:6800/12188 pipe(0x3a056d30 sd=21 :60649 s=1 pgs=0 cs=0 l=1 c=0xe2375d0).connect got BADAUTHORIZER
   -14> 2017-05-10 04:50:30.667752 158af700 10 osd.4 0 OSD::ms_get_authorizer type=mgr
   -13> 2017-05-10 04:50:30.752065 158af700 -1 /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2198-g8760432/rpm/el7/BUILD/ceph-12.0.1-2198-g8760432/src/mon/MonClient.cc: In f
unction 'int MonClient::wait_auth_rotating(double)' thread 158af700 time 2017-05-10 04:50:30.668855

so even with https://github.com/ceph/ceph/commit/611572a6103e20d3e3abf3530ea33e539b9db371, MgrClient can still try to (re)connect to mgr after MgrClient::shutdown(), when handling mgrmap after its mgrsession got reset.

#4 Updated by Brad Hubbard 4 months ago

  • Status changed from New to In Progress

#5 Updated by Sage Weil 4 months ago

/a/sage-2017-05-27_03:43:09-rados-wip-sage-testing2---basic-smithi/1235367

the branch contained eb5c02d from https://github.com/ceph/ceph/pull/15192, so not fixed yet!

#6 Updated by Brad Hubbard 4 months ago

OMG.

How did we all (especially me) not realise there was a missing return statement in https://github.com/ceph/ceph/pull/15192/commits/eb5c02df634b34ca45ecbf1eaf4440ccf806845f ?

https://github.com/ceph/ceph/pull/15345

#7 Updated by Brad Hubbard 4 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF