Bug #51527

Ceph osd crashed due to segfault

Added by Evgeny Zakharov 5 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
osd crash segfault
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi everyone,

We have 9 OSD nodes with 12 daemons on each node.
Ceph is used for S3 objects and RBD images.

Ceph status:

cluster:
    id:     25d4951b-d2b2-4985-912b-81bf48bbb10a
    health: HEALTH_WARN
            1 daemons have recently crashed

For about half a year we have seen OSDs in the cluster periodically crash with a segfault.
Here is an example of the output of `ceph crash info`:

{
    "backtrace": [
        "(()+0x15420) [0x7fb8d538d420]",
        "(AsyncConnection::_stop()+0xbc) [0x556f0e98360c]",
        "(ProtocolV2::stop()+0x8b) [0x556f0e9b51eb]",
        "(ProtocolV2::_fault()+0x3ab) [0x556f0e9b57ab]",
        "(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x159) [0x556f0e9d0049]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x556f0e9b5fa1]",
        "(AsyncConnection::process()+0x6b0) [0x556f0e989ca0]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x595) [0x556f0e7ba4c5]",
        "(()+0x138f602) [0x556f0e7c0602]",
        "(()+0x1acd260) [0x556f0eefe260]",
        "(()+0x9609) [0x7fb8d5381609]",
        "(clone()+0x43) [0x7fb8d50ec293]" 
    ],
    "ceph_version": "15.2.11",
    "crash_id": "2021-07-02T08:26:54.947458Z_7828d09c-0c02-4095-8a55-df7040f62e97",
    "entity_name": "osd.73",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.2 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-osd",
    "stack_sig": "35abf74378224de0387c08fbeb133dcc63e86dfed1500aa05fcd268e9ba0a30d",
    "timestamp": "2021-07-02T08:26:54.947458Z",
    "utsname_hostname": "ceph9",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-73-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021" 
}
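
For context, a report like the one above comes from the cluster's crash module. A minimal sketch of the usual commands (the crash id is the one from this report):

    ceph crash ls          # list crashes recorded by the crash module
    ceph crash info 2021-07-02T08:26:54.947458Z_7828d09c-0c02-4095-8a55-df7040f62e97
    # archiving acknowledged crashes clears the "1 daemons have recently crashed" warning
    ceph crash archive-all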

On the systemd side, the ceph-osd daemon logs the following:

Jul 02 11:26:54 ceph9 ceph-osd[378762]: *** Caught signal (Segmentation fault) **
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  in thread 7fb8d3c64700 thread_name:msgr-worker-1
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  1: (()+0x15420) [0x7fb8d538d420]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  2: (AsyncConnection::_stop()+0xbc) [0x556f0e98360c]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  3: (ProtocolV2::stop()+0x8b) [0x556f0e9b51eb]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  4: (ProtocolV2::_fault()+0x3ab) [0x556f0e9b57ab]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  5: (ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_>
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  6: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x556f0e9b5fa1]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  7: (AsyncConnection::process()+0x6b0) [0x556f0e989ca0]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  8: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x>
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  9: (()+0x138f602) [0x556f0e7c0602]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  10: (()+0x1acd260) [0x556f0eefe260]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  11: (()+0x9609) [0x7fb8d5381609]
Jul 02 11:26:54 ceph9 ceph-osd[378762]:  12: (clone()+0x43) [0x7fb8d50ec293]

The first problems appeared with Octopus 15.2.0. We upgraded to 15.2.11 and the problem persists. We also upgraded Ubuntu to the 20.04 LTS release.
None of these steps made the error go away.
We also tried changing the osd_memory_target parameter, both increasing and decreasing it (roughly as sketched below), but that did not help either.
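
A minimal sketch of how such a change is applied, assuming the centralized config database is used; the 6 GiB value is only an example, not the exact value we tried:

    # set the memory target for all OSDs (example value: 6 GiB)
    ceph config set osd osd_memory_target 6442450944
    # check the value configured for a specific daemon, e.g. the crashed osd.73
    ceph config get osd.73 osd_memory_target
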
Please help us understand the cause of this error. Where can we read more about this problem?
I saw similar topics here, but their solutions did not apply to our case.

ceph_profiling.zip - example log of Ceph memory profiling (219 KB) Evgeny Zakharov, 07/05/2021 09:23 AM

log.gz (236 KB) J. Eric Ivancich, 09/23/2021 05:50 PM

load-bi.sh View (1.66 KB) J. Eric Ivancich, 09/29/2021 05:09 PM

pacific-osd.log.gz (265 KB) J. Eric Ivancich, 09/30/2021 02:54 PM


Related issues

Duplicates Messengers - Bug #49237: segv in AsyncConnection::_stop() Resolved

History

#1 Updated by Josh Durgin 5 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#2 Updated by J. Eric Ivancich 2 months ago

I too have run into this segfault in the two latest versions of Octopus -- 15.2.14 and 15.2.13.

I can reproduce it pretty easily, it seems. It doesn't seem to happen on master (mid-September 2021).

The one thing I'm doing a little out of the ordinary is sending SIGHUP to ceph-osd every few minutes in order to cycle the logging (roughly as sketched below). Logging is at a high level, I'm adding 1.8 million RGW bucket index entries, and I'm preventing the logging from filling the file system.
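
A rough sketch of that log-cycling loop (the interval, OSD id, and log path are assumptions, not the exact commands used):

    LOG=/var/log/ceph/ceph-osd.0.log
    while true; do
        sleep 300                      # every few minutes
        mv "$LOG" "$LOG.$(date +%s)"   # move the current log aside
        pkill -HUP -x ceph-osd         # SIGHUP makes ceph-osd reopen its log file
        gzip "$LOG".*[0-9]             # compress the rotated copies
    done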

It happens on a test vstart cluster with 1 osd, 1 mon, and 1 rgw.

#3 Updated by J. Eric Ivancich 2 months ago

  • Affected Versions v15.2.13, v15.2.14 added

#4 Updated by J. Eric Ivancich 2 months ago

Attached is an example log with the backtrace that I get.

#5 Updated by Neha Ojha 2 months ago

 -9928> 2021-09-23T04:29:45.085-0400 7f39fb4cb640  1 -- [v2:127.0.0.1:6800/3052995,v1:127.0.0.1:6801/3052995] <== client.2242424 127.0.0.1:0/165457206 59 ==== osd_op(client.2242424.0:59 3.d 3.a204812d (undecoded) ondisk+write+known_if_redirected e38) v8 ==== 217+0+0 (crc 0 0 0) 0x5574b26caf00 con 0x5574ad970000
 -9927> 2021-09-23T04:29:45.085-0400 7f39fb4cb640 15 osd.0 38 enqueue_op 0x5574a9c2bba0 prio 63 cost 0 latency 0.000058 epoch 38 osd_op(client.2242424.0:59 3.d 3.a204812d (undecoded) ondisk+write+known_if_redirected e38) v8
 -9926> 2021-09-23T04:29:45.085-0400 7f39fb4cb640 20 osd.0 op_wq(5) _enqueue OpSchedulerItem(3.d PGOpItem(op=osd_op(client.2242424.0:59 3.d 3.a204812d (undecoded) ondisk+write+known_if_redirected e38) v8) prio 63 cost 0 e38)
 -9916> 2021-09-23T04:29:45.086-0400 7f39fb4cb640  1 -- [v2:127.0.0.1:6800/3052995,v1:127.0.0.1:6801/3052995] >> 127.0.0.1:0/924890622 conn(0x5574a4728480 msgr2=0x55749ffee300 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 94
 -9915> 2021-09-23T04:29:45.086-0400 7f39fb4cb640  1 -- [v2:127.0.0.1:6800/3052995,v1:127.0.0.1:6801/3052995] >> 127.0.0.1:0/924890622 conn(0x5574a4728480 msgr2=0x55749ffee300 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
 -9914> 2021-09-23T04:29:45.086-0400 7f39fb4cb640  1 --2- [v2:127.0.0.1:6800/3052995,v1:127.0.0.1:6801/3052995] >> 127.0.0.1:0/924890622 conn(0x5574a4728480 0x55749ffee300 crc :-1 s=READY pgs=2 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
 -9912> 2021-09-23T04:29:45.086-0400 7f39fb4cb640  1 --2- [v2:127.0.0.1:6800/3052995,v1:127.0.0.1:6801/3052995] >> 127.0.0.1:0/924890622 conn(0x5574a4728480 0x55749ffee300 crc :-1 s=READY pgs=2 cs=0 l=1 rev1=1 rx=0 tx=0).stop
 -1534> 2021-09-23T04:29:45.204-0400 7f39fb4cb640 -1 *** Caught signal (Segmentation fault) **
 in thread 7f39fb4cb640 thread_name:msgr-worker-0
  7f39fb4cb640 / msgr-worker-0
 ceph version Development (no_version) octopus (stable)
 1: (()+0x2df2f50) [0x557493fe6f50]
 2: (()+0x141b0) [0x7f39fda9d1b0]
 3: (boost::intrusive_ptr<AsyncConnection>::get() const+0xc) [0x5574942df880]
 4: (bool boost::operator< <AsyncConnection>(boost::intrusive_ptr<AsyncConnection> const&, boost::intrusive_ptr<AsyncConnection> const&)+0x3b) [0x5574942e6642]
 5: (std::less<boost::intrusive_ptr<AsyncConnection> >::operator()(boost::intrusive_ptr<AsyncConnection> const&, boost::intrusive_ptr<AsyncConnection> const&) const+0x27) [0x5574942e39d3]
 6: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::_M_lower_bound(std::_Rb_tree_node<boost::intrusive_ptr<AsyncConnection> > const*, std::_Rb_tree_node_base const*, boost::intrusive_ptr<AsyncConnection> const&) const+0x54) [0x5574945b7548]
 7: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x54) [0x5574945b6a16]
 8: (std::set<boost::intrusive_ptr<AsyncConnection>, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::count(boost::intrusive_ptr<AsyncConnection> const&) const+0x42) [0x5574945b5ccc]
 9: (AsyncMessenger::unregister_conn(boost::intrusive_ptr<AsyncConnection> const&)+0x54) [0x5574945b52a8]
 10: (AsyncConnection::_stop()+0x83) [0x5574945b221b]
 11: (ProtocolV2::stop()+0x15a) [0x5574945e159a]
 12: (ProtocolV2::_fault()+0x33d) [0x5574945e2f1b]
 13: (ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x27f) [0x5574945eae71]
 14: (CtRxNode<ProtocolV2>::call(ProtocolV2*) const+0x8c) [0x5574946152d6]
 15: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x5e) [0x5574945dfc50]
 16: (()+0x33f2cfe) [0x5574945e6cfe]
 17: (()+0x340dd01) [0x557494601d01]
 18: (()+0x340d8f2) [0x5574946018f2]
 19: (()+0x340d07c) [0x55749460107c]
 20: (std::function<void (char*, long)>::operator()(char*, long) const+0x61) [0x5574945b6241]
 21: (AsyncConnection::process()+0xd3b) [0x5574945b0c55]
 22: (C_handle_read::do_request(unsigned long)+0x28) [0x5574945b555e]
 23: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x513) [0x5574942f1c15]
 24: (()+0x3105ef2) [0x5574942f9ef2]
 25: (()+0x31075a2) [0x5574942fb5a2]
 26: (()+0x3107493) [0x5574942fb493]
 27: (()+0x310737a) [0x5574942fb37a]
 28: (std::function<void ()>::operator()() const+0x32) [0x557493892fb8]
 29: (void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&)+0x20) [0x557494618a96]
 30: (std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&)+0x20) [0x557494618a4b]
 31: (void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x28) [0x5574946189f8]
 32: (std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()()+0x18) [0x5574946189cc]
 33: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x1c) [0x5574946189b0]
 34: (()+0xd6444) [0x7f39fd94a444]
 35: (()+0x93f9) [0x7f39fda923f9]
 36: (clone()+0x43) [0x7f39fd6494c3]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#6 Updated by Evgeny Zakharov 2 months ago

J. Eric Ivancich wrote:

I too have run into this segfault in the two latest versions of Octopus -- 15.2.14 and 15.2.13.

I can reproduce it pretty easily, it seems. It doesn't seem to happen on master (mid-September 2021).

The one thing I'm doing a little out of the ordinary is sending SIGHUP to ceph-osd every few minutes in order to cycle the logging. Logging is at a high level, I'm adding 1.8 million RGW bucket index entries, and I'm preventing the logging from filling the file system.

It happens on a test vstart cluster with 1 osd, 1 mon, and 1 rgw.

We continue to see periodic segfaults across all of our OSD daemons and have not yet found a solution. So far this does not affect the operation of the Ceph cluster, but we remain worried about the future of our cluster.

#7 Updated by Neha Ojha 2 months ago

Hi Eric,

Could you please share the test that reproduces this crash on a vstart cluster?

#8 Updated by Neha Ojha 2 months ago

  • Status changed from New to Need More Info

#9 Updated by J. Eric Ivancich 2 months ago

I've attached the shell script "load-bi.sh". It requires that a cluster be brought up with RGW and that a bucket be created; I use `s3cmd mb s3://<bktname>`, which requires that s3cmd be installed and configured as well.

The script is run with the bucket name (<bktname>) and a set of data files, each containing one entry per line. I had 1.8 million entries spread across 17 data files, and those 17 files were the final arguments to the script.

I don't think there's anything key in the entries in the files. I suspect you could generate "a0", "a1", through "a1800000" in one file and then use `split -n l/17 <file>` to split it into 17 smaller files. The script processes these 17 files in parallel to help increase throughput. I don't know if that's key to triggering the segfault.

For me this was a vstart cluster with 1 osd, 1 mon, and 1 rgw.
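
A sketch of the whole setup under those assumptions; the bucket name, entry/file names, and the exact vstart invocation are illustrative, and load-bi.sh is the script attached above:

    # bring up a minimal vstart cluster from a build directory (assumed layout)
    MON=1 OSD=1 MDS=0 MGR=1 RGW=1 ../src/vstart.sh -n -d

    # create the target bucket (name is arbitrary)
    s3cmd mb s3://test-bkt

    # generate ~1.8 million entries and split them across 17 files
    seq 0 1799999 | sed 's/^/a/' > entries.txt
    split -n l/17 entries.txt part.              # produces part.aa .. part.aq

    # run the attached script: bucket name followed by the 17 data files
    ./load-bi.sh test-bkt part.*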

Let me know if there are any questions, and I'd be happy to answer.

Now that I've resolved the tracker for which I created this reproduction process (https://tracker.ceph.com/issues/51767), I will see if I can reproduce on pacific.

#10 Updated by J. Eric Ivancich 2 months ago

I was able to reproduce on pacific (v16.2.6). Log attached.

#11 Updated by Neha Ojha 2 months ago

  • Status changed from Need More Info to New
  • Assignee set to Radoslaw Zarzynski

#12 Updated by Radoslaw Zarzynski about 2 months ago

  • Duplicates Bug #49237: segv in AsyncConnection::_stop() added

#13 Updated by Radoslaw Zarzynski about 2 months ago

This is a known issue that has been fixed in master by commit https://github.com/ceph/ceph/commit/d51d80b3234e17690061f65dc7e1515f4244a5a3.

The backtrace matches the code that was executing without the AsyncMessenger::lock being held:

   void unregister_conn(const AsyncConnectionRef& conn) {
     std::lock_guard l{deleted_lock};
-    if (!accepting_conns.count(conn) || anon_conns.count(conn))
-      conn->get_perf_counter()->dec(l_msgr_active_connections);
     deleted_conns.emplace(std::move(conn));
     conn->unregister();

Both members are instances of std::set<AsyncConnectionRef>:

  /**
   * list of connection are in the process of accepting
   *
   * These are not yet in the conns map.
   */
  std::set<AsyncConnectionRef> accepting_conns;

  /// anonymous outgoing connections
  std::set<AsyncConnectionRef> anon_conns;

The problem is that the backports got delayed. The one for Octopus (https://github.com/ceph/ceph/pull/43325) has already been sent (I just reviewed and approved it). For Pacific the PR is still missing (https://tracker.ceph.com/issues/50483).

#14 Updated by Radoslaw Zarzynski about 2 months ago

  • Status changed from New to Pending Backport

Just sent https://github.com/ceph/ceph/pull/43548 for pacific. Nothing more to do here, I think.

#15 Updated by Loïc Dachary about 1 month ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
