Project

General

Profile

Bug #23352

osd: segfaults under normal operation

Added by Alex Gorbachev over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
03/14/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
mimic, luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

-1> 2018-03-13 22:03:27.390956 7f42eec36700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1520993007390955, "job": 454, "event": "table_file_deletion", "file_number": 7466}
0> 2018-03-13 22:04:32.172308 7f4307c68700 -1 ** Caught signal (Segmentation fault) *
in thread 7f4307c68700 thread_name:safe_timer

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa74234) [0x557b9bdf4234]
2: (()+0x11390) [0x7f430f7d9390]
3: (SafeTimer::timer_thread()+0x44f) [0x557b9be3398f]
4: (SafeTimerThread::entry()+0xd) [0x557b9be34d9d]
5: (()+0x76ba) [0x7f430f7cf6ba]
6: (clone()+0x6d) [0x7f430e84641d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.81.log
--
end dump of recent events ---

ceph-osd.194.log.3.gz (386 KB) Alex Gorbachev, 05/22/2018 03:54 AM

ceph-osd.123.log.gz (230 KB) Beom-Seok Park, 05/25/2018 03:11 AM


Related issues

Related to RADOS - Bug #24667: osd: SIGSEGV in MMgrReport::encode_payload Can't reproduce 06/26/2018
Related to RADOS - Bug #24037: osd: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed. Resolved 05/07/2018
Duplicated by RADOS - Bug #23431: OSD Segmentation fault in thread_name:safe_timer Duplicate 03/21/2018
Duplicated by RADOS - Bug #24023: Segfault on OSD in 12.2.5 Duplicate 05/05/2018
Duplicated by RADOS - Bug #23564: OSD Segfaults Duplicate 04/05/2018
Duplicated by RADOS - Bug #23585: osd: safe_timer segfault Duplicate 04/08/2018
Duplicated by RADOS - Bug #24511: osd crushed at thread_name:safe_timer Duplicate 06/13/2018
Duplicated by RADOS - Bug #24935: SafeTimer? osd killed by kernel for Segmentation fault Duplicate 07/16/2018
Duplicated by RADOS - Bug #37978: osd killed by kernel for Segmentation fault Duplicate 01/21/2019
Copied to RADOS - Backport #26870: mimic: osd: segfaults under normal operation Resolved
Copied to RADOS - Backport #26871: luminous: osd: segfaults under normal operation Resolved

History

#1 Updated by Sage Weil over 1 year ago

  • Status changed from New to Need More Info

Do you have a core file? I haven't seen this crash before.

#2 Updated by Alex Gorbachev over 1 year ago

HI Sage, I do have the core dump. Where can I upload the file, it's rather large, 850 MB compressed.

#3 Updated by Alex Gorbachev over 1 year ago

Sage, PM'ed to you the public download link, hope it works.

#5 Updated by Chris Hoy Poy over 1 year ago

Also seeing these, no core dump but have now had 3 segfaults in 2 weeks since upgrading to 12.2.4 from a very stable 12.2.2.
Luminous 12.2.4, bluestore. I usually have logging disabled, will enable it again at next opportunity.

[dmesg]
[35103471.167728] safe_timer491476: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]

[log just after it restarted, to confirm versions]
2018-03-22 06:24:58.603146 7f796e6c3e00 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 1339770
2018-03-22 06:24:58.613610 7f796e6c3e00 0 pidfile_write: ignore empty --pid-file
2018-03-22 06:24:58.619318 7f796e6c3e00 0 load: jerasure load: lrc load: isa
2018-03-22 06:24:58.619392 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.619403 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.619609 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.619820 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes max 0.0639132 < ratio 0.99
2018-03-22 06:24:58.619837 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes cache_size 8400000000 meta 0.936087 kv 0.0639132 data 2.23517e-08
2018-03-22 06:24:58.619841 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) close
2018-03-22 06:24:58.909527 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _mount path /var/lib/ceph/osd/osceph-38
2018-03-22 06:24:58.909773 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.909779 7f796e6c3e00 1 bdev(0x5642923bad00 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.909893 7f796e6c3e00 1 bdev(0x5642923bad00 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.910062 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes max 0.0639132 < ratio 0.99
2018-03-22 06:24:58.910071 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes cache_size 8400000000 meta 0.936087 kv 0.0639132 data 2.23517e-08
2018-03-22 06:24:58.910113 7f796e6c3e00 1 bdev create path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3 type kernel
2018-03-22 06:24:58.910118 7f796e6c3e00 1 bdev(0x5642923baf40 /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3) open path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3
2018-03-22 06:24:58.910256 7f796e6c3e00 1 bdev(0x5642923baf40 /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3) open size 33554432000 (0x7d0000000, 32000 MB) block_size 4096 (4096 B) non-rotational
2018-03-22 06:24:58.910272 7f796e6c3e00 1 bluefs add_block_device bdev 1 path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3 size 32000 MB
2018-03-22 06:24:58.910486 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.910493 7f796e6c3e00 1 bdev(0x5642923bb180 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.911462 7f796e6c3e00 1 bdev(0x5642923bb180 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.911469 7f796e6c3e00 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/osceph-38/block size 2794 GB
2018-03-22 06:24:58.911514 7f796e6c3e00 1 bluefs mount
2018-03-22 06:24:58.923367 7f796e6c3e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-03-22 06:24:58.923375 7f796e6c3e00 0 set rocksdb option compression = kNoCompression
2018-03-22 06:24:58.923390 7f796e6c3e00 0 set rocksdb option max_write_buffer_number = 4
2018-03-22 06:24:58.923393 7f796e6c3e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-03-22 06:24:58.923396 7f796e6c3e00 0 set rocksdb option recycle_log_file_num = 4
2018-03-22 06:24:58.923397 7f796e6c3e00 0 set rocksdb option writable_file_max_buffer_size = 0
2018-03-22 06:24:58.923400 7f796e6c3e00 0 set rocksdb option write_buffer_size = 268435456
2018-03-22 06:24:58.923416 7f796e6c3e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-03-22 06:24:58.923418 7f796e6c3e00 0 set rocksdb option compression = kNoCompression
2018-03-22 06:24:58.923420 7f796e6c3e00 0 set rocksdb option max_write_buffer_number = 4
2018-03-22 06:24:58.923422 7f796e6c3e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-03-22 06:24:58.923424 7f796e6c3e00 0 set rocksdb option recycle_log_file_num = 4
2018-03-22 06:24:58.923425 7f796e6c3e00 0 set rocksdb option writable_file_max_buffer_size = 0
2018-03-22 06:24:58.923427 7f796e6c3e00 0 set rocksdb option write_buffer_size = 268435456
2018-03-22 06:24:58.923532 7f796e6c3e00 4 rocksdb: RocksDB version: 5.4.0

2018-03-22 06:24:58.923538 7f796e6c3e00 4 rocksdb: Git sha rocksdb_build_git_sha:0
2018-03-22 06:24:58.923540 7f796e6c3e00 4 rocksdb: Compile date Feb 26 2018

#6 Updated by Kjetil Joergensen over 1 year ago

"Me too". I had a brief look at the coredump, without becoming all that much wiser. Judging by the lock attached to the safe_timer thread, the recipient of the segfault were the osd tick timer. (OSD::tick_timer_lock). Although - what segfaults is instruction fetch, so it sort-of feels like we maybe clobbered the callback.

#7 Updated by Nathan Cutler over 1 year ago

  • Status changed from Need More Info to New
  • Priority changed from Normal to High

Raising priority because it's a possible regression in Luminous.

#8 Updated by Brad Hubbard about 1 year ago

  • Related to Bug #23431: OSD Segmentation fault in thread_name:safe_timer added

#9 Updated by Brad Hubbard about 1 year ago

Chris,

Was your stack identical to Alex's original description or was it more like the stack in #23431 ?

#10 Updated by Kefu Chai about 1 year ago

for those who wants to check the coredump. you should use apport-unpack to unpack it first.

and it crashed at /build/ceph-12.2.4/src/common/Timer.cc:97

   │95            if (!safe_callbacks)                                                                                                                                                                                                                                         │
   │96              lock.Unlock();                                                                                                                                                                                                                                             │
  >│97            callback->complete(0);                                                                                                                                                                                                                                       │
   │98            if (!safe_callbacks)                                                                                                                                                                                                                                         │
   │99              lock.Lock();

(gdb) p callback
$34 = (Context *) 0x557bd02d9f00
(gdb) p *callback
$35 = {_vptr.Context = 0x557bd0a59810}
0x557bd0a59810: 0x60    0x92    0xa5    0xd0    0x7b    0x55    0x00    0x00
0x557bd0a59818: 0x2c    0x00    0x00    0x00    0x01    0x00    0x00    0x00
0x557bd0a59820: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x557bd0a59828: 0x00    0x00    0x00    0x00    0x00    0x00    0x01    0x00
0x557bd0a59830: 0x00    0x00    0x32    0x5a    0x26    0x00    0x00    0x00
0x557bd0a59838: 0x00    0x00    0x08    0x00    0x7b    0x55    0x00    0x00
0x557bd0a59840: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x557bd0a59848: 0x00    0x00    0x01    0x00    0x7c    0x55    0x00    0x00
...
(gdb) info vtbl callback
vtable for 'Context' @ 0x557bd0a59810 (subobject @ 0x557bd02d9f00):
[0]: 0x557bd0a59260
[1]: 0x10000002c
[2]: 0x0
[3]: 0x1000000000000

the callback should be an instance of C_Tick_WithoutOSDLock, but its vtable does not look right. with a correct "osd" field:

(gdb) p *(OSD::C_Tick_WithoutOSDLock*)callback
$18 = {<Context> = {_vptr.Context = 0x557bd0a59810}, osd = 0x557ba607a000}
(gdb) p stopping
$1 = false
(gdb) thread 32
(gdb) p osd
$19 = (OSD *) 0x557ba607a000

0x557bd0a59810 is an address on heap, but i tried to inspect the mem chunk pointed by it in different ways. not recognisable ...

#11 Updated by Patrick Donnelly about 1 year ago

  • Project changed from Ceph to RADOS
  • Subject changed from OSD segfaults under normal operation to osd: segfaults under normal operation
  • Source set to Community (user)
  • Backport set to luminous
  • Component(RADOS) OSD added

#12 Updated by wei jin about 1 year ago

I run into this issue with 12.2.5, it affects cluster stability heavily.

#13 Updated by Alex Gorbachev about 1 year ago

For some reason we are also seeing more of these happening, simultaneous failures and recoveries are occurring during the day and night

#14 Updated by Alex Gorbachev about 1 year ago

The latest ones look like this, below.

Crash dump at https://drive.google.com/open?id=12v95-TCHlkrBZ16ni5UkhYkXRtJcbPXq

0> 2018-05-18 14:00:47.405486 7f1a76064700 -1 ** Caught signal (Segmentation fault) *
in thread 7f1a76064700 thread_name:msgr-worker-1
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa7cab4) [0x55b69fa49ab4]
2: (()+0x11390) [0x7f1a79be4390]
3: (MMgrReport::encode_payload(unsigned long)+0x5d2) [0x55b69fac21c2]
4: (Message::encode(unsigned long, int)+0x29) [0x55b69fac5b89]
5: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x392) [0x55b69fd759f2]
6: (AsyncConnection::handle_write()+0x5c5) [0x55b69fd7e7f5]
7: (EventCenter::process_events(int, std::chrono::duration&lt;unsigned long, std::ratio&lt;1l, 1000000000l&gt; >*)+0x1125) [0x55b69fb30645]
8: (()+0xb66eb8) [0x55b69fb33eb8]
9: (()+0xb8c80) [0x7f1a794ebc80]
10: (()+0x76ba) [0x7f1a79bda6ba]
11: (clone()+0x6d) [0x7f1a78c5141d]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter

#15 Updated by Josh Durgin about 1 year ago

  • Related to Bug #24023: Segfault on OSD in 12.2.5 added

#16 Updated by Josh Durgin about 1 year ago

  • Related to Bug #23585: osd: safe_timer segfault added

#17 Updated by Josh Durgin about 1 year ago

#18 Updated by Josh Durgin about 1 year ago

I put the core file from comment #14 and binaries from 12.2.5 in senta02:/slow/jdurgin/ceph/bugs/tracker_23352/2018-05-18 - but it'll need the right versions of libgcc/libstdc++ in another env (container/vm/different box) to read the symbols properly.

#19 Updated by Alex Gorbachev about 1 year ago

Josh, is this something I can extract from the OSD node for you, or is this an internal comment?

#20 Updated by Josh Durgin about 1 year ago

It's an internal comment for others looking at this - though if you (Alex) have an osd log to go with the 'MMgrReport::encode_payload' crash, that would help

#21 Updated by Alex Gorbachev about 1 year ago

OSD log attached

#22 Updated by Sage Weil about 1 year ago

  • Status changed from New to Need More Info

Alex, how reproducible is this for you? Could you reproduce with debug timer = 20?

#23 Updated by Alex Gorbachev about 1 year ago

Sage, I had tried to do this, but we don't know when these crashes would happen, just that they will occur. Random times, and I get paged after the fact, and since this is in production, running with debug timer=20 is likely to fill up local space, correct? Also odd is that I have not seen a crash in the last couple of days, while the week before we had 3-4 per day.

#24 Updated by Brad Hubbard about 1 year ago

Hi Alex,

I notice there are several more coredumps attached to the related bug reports. Are they all separate crashes? Please feel free to upload any more you can if it's not too much trouble as the more we have the more likely we are to see a pattern. I'm currently looking at the ones you linked to this bug and will update here when I have completed my analysis.

#26 Updated by Brad Hubbard about 1 year ago

Thanks Beom-Seok,

I've set up a centos environment to debug those cores along with the Xenial ones. I will update here with any findings going forward.

#28 Updated by Brad Hubbard about 1 year ago

Thanks,

That gives us seven cores across 12.2.4-12.2.5 on Xenial and Centos and one core from the MMgrReport::encode_payload crash. That should be enough for the time being.

#29 Updated by Brad Hubbard about 1 year ago

If we look at the coredump from 23585 and compare it to this message.

[117735.930255] safe_timer52573: segfault at 561500001000 ip 00007f156d774ccb sp 00007f15668c1f70 error 4 in libgcc_s.so.1[7f156d766000+16000]

It's clear the memory address we segfaulted at was 0x561500001000. Looking at the disassembly for line 97 in Timer.cc.

97            callback->complete(0);
   0x0000561579b76987 <+1095>:  mov    (%rbx),%rax
   0x0000561579b7698a <+1098>:  xor    %esi,%esi  <--- IGNORE
   0x0000561579b7698c <+1100>:  mov    %rbx,%rdi  <--- IGNORE
   0x0000561579b7698f <+1103>:  callq  *0x18(%rax)

The address of callback is in rbx and we dereference it and store it in rax.

(gdb) p callback
$3 = (Context *) 0x5616160cf130

(gdb) i r rbx
rbx            0x5616160cf130   94652859216176

(gdb) x/a 0x5616160cf130                                                                                                                                                                                                                     
0x5616160cf130: 0x56162de55220

(gdb) i r rax
rax            0x56162de55220   94653259272736

We then dereference rax, add 0x18 to it and call that address.

(gdb) p/x 0x56162de55220+0x18
$4 = 0x56162de55238
(gdb) x/a 0x56162de55238
0x56162de55238: 0x561500001000 <--- Segfault address

As Kefu pointed out above this is the fourth entry in the vtable for "callback".

(gdb) info vtbl callback
vtable for 'Context' @ 0x56162de55220 (subobject @ 0x5616160cf130):
[0]: 0x561609696f80
[1]: 0x561591ff9000
[2]: 0x56162de56c00
[3]: 0x561500001000

This fits since 0x18 is 24 in decimal and, since pointers are 8 bytes here, it's the offset to the fourth entry in the table (first entry offset being zero).

(gdb) l 83
78        lock.Lock();
79        ldout(cct,10) << "timer_thread starting" << dendl;
80        while (!stopping) {
81          utime_t now = ceph_clock_now();
82
83          while (!schedule.empty()) {
84            scheduled_map_t::iterator p = schedule.begin();
85
86            // is the future now?
87            if (p->first > now)
(gdb) 
88              break;
89
90            Context *callback = p->second;
91            events.erase(callback);
92            schedule.erase(p);
93            ldout(cct,10) << "timer_thread executing " << callback << dendl;
94            
95            if (!safe_callbacks)
96              lock.Unlock();
97            callback->complete(0);
(gdb) 
98            if (!safe_callbacks)
99              lock.Lock();
100         }

(gdb) p schedule
$2 = std::multimap with 0 elements

So callback comes from the multimap schedule and, in this case, it appears to have been the last element. Checking whether that is the case for all coredumps as that may be significant.

#30 Updated by Brad Hubbard about 1 year ago

I've confirmed that in all of the SafeTimer segfaults the 'schedule' multimap is empty, indicating this is the last entry in the map. Working our whether that is expected or not and whether it is important or not :)

#31 Updated by Nathan Cutler about 1 year ago

  • Related to deleted (Bug #23431: OSD Segmentation fault in thread_name:safe_timer)

#32 Updated by Nathan Cutler about 1 year ago

  • Duplicated by Bug #23431: OSD Segmentation fault in thread_name:safe_timer added

#33 Updated by Brad Hubbard about 1 year ago

  • Related to deleted (Bug #24023: Segfault on OSD in 12.2.5)

#34 Updated by Brad Hubbard about 1 year ago

  • Duplicated by Bug #24023: Segfault on OSD in 12.2.5 added

#35 Updated by Brad Hubbard about 1 year ago

  • Related to deleted (Bug #23585: osd: safe_timer segfault)

#36 Updated by Brad Hubbard about 1 year ago

  • Duplicates Bug #23585: osd: safe_timer segfault added

#37 Updated by Brad Hubbard about 1 year ago

#38 Updated by Brad Hubbard about 1 year ago

#39 Updated by Brad Hubbard about 1 year ago

  • Duplicates deleted (Bug #23585: osd: safe_timer segfault)

#40 Updated by Brad Hubbard about 1 year ago

  • Duplicated by Bug #23585: osd: safe_timer segfault added

#41 Updated by Kjetil Joergensen about 1 year ago

Brad Hubbard wrote:

I've confirmed that in all of the SafeTimer segfaults the 'schedule' multimap is empty, indicating this is the last entry in the map. Working our whether that is expected or not and whether it is important or not :)

Does threads other than the timer thread add/cancel events ? It could be that std::map/std::multimap isn't inherently thread-safe, and we're finding some corner case around that. "Works most of the time, except under these specific timing conditions".

#42 Updated by Brad Hubbard about 1 year ago

Hi Kjetil,

Sure, worth a look, but AFAICT all access is protected by SafeTimers locks.

#43 Updated by Brad Hubbard about 1 year ago

Looking at the crash in http://tracker.ceph.com/issues/23352#note-14 there's a fairly glaring problem.

(gdb) f 13
#13 MMgrReport::encode_payload (this=0x55b6beed2400, features=<optimized out>) at /build/ceph-12.2.5/src/messages/MMgrReport.h:122
122         ::encode(osd_health_metrics, payload);
(gdb) whatis osd_health_metrics
type = std::vector<OSDHealthMetric>
$47 = std::vector of length -5890450176077, capacity -5890450176077 = {{
    type = 1 '\001', 
    value = {
      {
        n1 = 0, 
        n2 = 0
      }, 
      n = 0
    }
  }, {
    type = 0 '\000', 
    value = {
      {
        n1 = 65536, 
        n2 = 65536
      }, 
      n = 281474976776192
    }
  }, {
...
gdb) p osd_health_metrics.size()
$48 = -5890450176077

We can disable the pretty printers to take a look at the implementation members.

(gdb) disable pretty-printer
141 printers disabled
0 of 141 printers enabled
(gdb) p osd_health_metrics       
$49 = {
  <std::_Vector_base<OSDHealthMetric, std::allocator<OSDHealthMetric> >> = {
    _M_impl = {
      <std::allocator<OSDHealthMetric>> = {
        <__gnu_cxx::new_allocator<OSDHealthMetric>> = {<No data fields>}, <No data fields>}, 
      members of std::_Vector_base<OSDHealthMetric, std::allocator<OSDHealthMetric> >::_Vector_impl: 
      _M_start = 0x55b7a30b84d0, 
      _M_finish = 0x0, 
      _M_end_of_storage = 0x0
    }
  }, <No data fields>}

If the vector was empty I'd expect _M_start and _M_finish to be zero and, if not, _M_start and _M_finish to not be zero.

It's possible _M_finish and _M_end_of_storage have been overwritten with zeros here or that this vector<OSDHealthMetric> has been corrupted in some other way prior to the call to MMgrReport::encode_payload. OSDHealthMetric and the supporting code were only added in v12.2.3 so seem like a good candidate but, so far, I see nothing which might explain this behaviour.

#44 Updated by Vangelis Tasoulas about 1 year ago

Experiencing the a safe_timer segfault with a freshly deployed cluster. No data on the cluster yet. Just an empty pool has been created. The cluster was only running for 5 hours before the segfault on OSD 1 occured.

# ceph osd pool get mypool all
size: 4
min_size: 2
crash_replay_interval: 0
pg_num: 1024
pgp_num: 1024
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

The ceph.log at the time of the failure:

2018-06-07 23:47:37.284748 mon.cephmon1 mon.0 192.168.1.1:6789/0 483 : cluster [INF] osd.0 failed (root=default,host=cephosd1) (connection refused reported by osd.2)
2018-06-07 23:47:37.645272 mon.cephmon1 mon.0 192.168.1.1:6789/0 530 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-06-07 23:47:37.645407 mon.cephmon1 mon.0 192.168.1.1:6789/0 531 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2018-06-07 23:47:38.670555 mon.cephmon1 mon.0 192.168.1.1:6789/0 533 : cluster [WRN] Health check failed: Reduced data availability: 9 pgs inactive, 22 pgs peering (PG_AVAILABILITY)
2018-06-07 23:47:44.643207 mon.cephmon1 mon.0 192.168.1.1:6789/0 535 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 12 pgs inactive, 41 pgs peering)
2018-06-07 23:47:58.351776 mon.cephmon1 mon.0 192.168.1.1:6789/0 538 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-06-07 23:47:58.351891 mon.cephmon1 mon.0 192.168.1.1:6789/0 539 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2018-06-07 23:47:58.351926 mon.cephmon1 mon.0 192.168.1.1:6789/0 540 : cluster [INF] Cluster is now healthy
2018-06-07 23:47:58.369415 mon.cephmon1 mon.0 192.168.1.1:6789/0 541 : cluster [INF] osd.0 192.168.1.10:6800/69949 boot

Dmesg segfault line on the host where the osd.0 is located:

[Thu Jun  7 23:47:37 2018] safe_timer[20005]: segfault at 800000006 ip 00007fc90498e688 sp 00007fc8fdf40080 error 4 in libgcc_s.so.1[7fc90497f000+16000]

OS is Debian Jessie:

# uname -r
4.9.0-0.bpo.5-amd64

And the Ceph version:

 *** 12.2.5-1~bpo80+1 0
        500 http://download.ceph.com/debian-luminous/ jessie/main amd64 Packages

I can provide further information if needed to pinpoint and fix the problem.

#45 Updated by Josh Durgin about 1 year ago

  • Assignee set to Brad Hubbard

#46 Updated by Josh Durgin about 1 year ago

  • Duplicated by Bug #24511: osd crushed at thread_name:safe_timer added

#47 Updated by Josh Durgin about 1 year ago

Another instance: http://pulpito.ceph.com/yuriw-2018-06-19_21:29:48-rados-wip-yuri-testing-2018-06-19-1953-luminous-distro-basic-smithi/2682702/

Backtrace is the same:

#0  0x00007fe8229daccb in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007fe8229dc668 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007fe822717b4f in __GI___backtrace (array=array@entry=0x7fe81835f568, size=size@entry=100) at ../sysdeps/x86_64/backtrace.c:110
#3  0x000055de72423d94 in ceph::BackTrace::BackTrace (s=0, this=0x7fe81835f560) at /build/ceph-12.2.5-482-g94ab47a/src/common/BackTrace.h:23
#4  handle_fatal_signal (signum=11) at /build/ceph-12.2.5-482-g94ab47a/src/global/signal_handler.cc:117
#5  <signal handler called>
#6  0x0000400000010000 in ?? ()
#7  0x000055de72462e02 in SafeTimer::timer_thread (this=0x55de7c5b01c0) at /build/ceph-12.2.5-482-g94ab47a/src/common/Timer.cc:97
#8  0x000055de7246420d in SafeTimerThread::entry (this=<optimized out>) at /build/ceph-12.2.5-482-g94ab47a/src/common/Timer.cc:30
#9  0x00007fe8236926ba in start_thread (arg=0x7fe818362700) at pthread_create.c:333
#10 0x00007fe82270941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

#48 Updated by Brad Hubbard about 1 year ago

In several of the crashes we are seeing lines like the following prior to the crash.

2018-06-19 23:53:56.795241 7fc42d47b700 -1 mgr.server send_report send_report osd,6.0x5639619d68d0 sent me an unknown health metric:

2018-06-20T18:15:37.109 INFO:tasks.ceph.mgr.x.argo022.stderr:2018-06-20 22:15:37.037543 7f4bf7727700 -1 mgr.server send_report send_report osd,0.0x5623242fe178 sent me an unknown health metric: �

In both cases the osd involved is the one that crashed.

This, to me, looks like a clear indication of corruption similar to what was seen in the MMgrReport::encode_payload crash above. This further indicates the osd_health_metrics which was added between 12.2.2 and 12.2.3. I'll continue to stare really hard at this code.

Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?

#49 Updated by Brad Hubbard about 1 year ago

  • Status changed from Need More Info to Verified
  • Severity changed from 3 - minor to 2 - major

#50 Updated by Serg D about 1 year ago

Yeah. We got in log mgr before segfault ceph-osd:

mgr.server send_report send_report osd,74.0x560276d34ed8 sent me an unknown health metric:

Brad Hubbard wrote:

Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?

#51 Updated by Kjetil Joergensen 12 months ago

Brad Hubbard wrote:

Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?

This is the same for us as well.

2018-06-19 12:26:52.542539 7f43cf269700 -1 mgr.server send_report send_report osd,430.0x5631bd2bc690 sent me an unknown health metric:

Around the same time, osd.430 segfaulted.

#52 Updated by Alex Gorbachev 12 months ago

Same here

2018-06-24 19:42:41.348699 7f3e53a46700 -1 mgr.server send_report send_report osd,226.0x55678069c850 sent me an unknown health metric

#53 Updated by Beom-Seok Park 12 months ago

confirmed

ceph-mgr.log
2018-06-20 08:46:05.528656 7fb998ff2700 -1 mgr.server send_report send_report osd,215.0x563a20eaced8 sent me an unknown health metric: Ð

ceph.log
2018-06-20 08:46:06.093646 mon.MON001 mon.0 10.xxx.xx.xx:6789/0 76106 : cluster [INF] osd.215 failed (root=volumes,host=OSD001) (connection refused reported by osd.170)

#54 Updated by Matt Dunavant 12 months ago

Also confirmed on Ubuntu 18.04/Ceph 13.2.0:

ceph-mgr.log

2018-06-24 11:14:47.317 7ff17b0db700 -1 mgr.server send_report send_report osd,121.0x55983610f8d0 sent me an unknown health metric: �

ceph-mon.log

2018-06-24 11:14:51.753 7fd1ae701700 1 mon.mn335@0(leader).osd e508 prepare_failure osd.121 10.xx.xx.xx:6816/1301379 from osd.92 10.xx.xx.xx:6816/1505976 is reporting failure:1

dmesg of OSD server

[3467506.385786] msgr-worker-01308098: segfault at 0 ip 00007f1320e56bcb sp 00007f131c0dff40 error 4 in libtcmalloc.so.4.3.0[7f1320e32000+45000]

uname -r

4.15.0-20-generic

ceph --version

ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)

#55 Updated by Brad Hubbard 12 months ago

Matt,

Can you provide a coredump or full backtrace?

#56 Updated by Dan van der Ster 12 months ago

We're getting a few crashes like this per week here on 12.2.5.

Here's a fileStore OSD:

    -3> 2018-06-27 01:24:05.524583 7f681f467700  1 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10
26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1).tick idle(60000913) more than 6
0000000 us, mark self fault.
    -2> 2018-06-27 01:24:05.524595 7f681f467700  1 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10
26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1).fault on lossy channel, failing
    -1> 2018-06-27 01:24:05.524603 7f681f467700  2 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10
26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1)._stop
     0> 2018-06-27 01:24:05.537841 7f681b49d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f681b49d700 thread_name:safe_timer

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (()+0xa41df1) [0x7f6825af8df1]
 2: (()+0xf6d0) [0x7f68225516d0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Coredump uploaded with ceph-post-file: 3e2a2a95-0249-4ffa-b562-f742bbcaf484

#57 Updated by Josh Durgin 12 months ago

  • Related to Bug #24667: osd: SIGSEGV in MMgrReport::encode_payload added

#58 Updated by Josh Durgin 12 months ago

  • Related to Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics added

#59 Updated by Brad Hubbard 12 months ago

  • Related to Bug #24037: osd: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed. added

#60 Updated by Brad Hubbard 12 months ago

Pretty sure this all revolves around the racy code highlighted in #24037 and, unfortunately, the PR does not fix all instances since it's in the branch used in #24667 and #24664. Still investigating but I think we are finally zeroing in.

#61 Updated by Brad Hubbard 12 months ago

We are investigating the potential race between get_health_metrics and the op_tracker code.

In the mean time, for anyone seeing a lot of these, it might be an interesting test to set osd_enable_op_tracker to false temporarily to see if it works around the problem. Please let us know results if you decide you'd like to test this.

#62 Updated by Alex Gorbachev 12 months ago

I set it Brad, watching the status. We normal get one failure in 3-4 days.

#63 Updated by Brad Hubbard 12 months ago

Thanks Alex!

#64 Updated by Dan van der Ster 12 months ago

I injected osd_enable_op_tracker=false yesterday

2018-07-03 10:47:23.329688 7f7da3d3b700  0 osd.553 1172324 do_command r=0 osd_enable_op_tracker = 'false'
2018-07-03 10:47:23.329699 7f7da3d3b700  0 log_channel(cluster) log [INF] : osd_enable_op_tracker = 'false'

but unfortunately got a thread_name::safe_timer crash during the night:

    -2> 2018-07-03 22:36:16.549276 7f7dc46e4700  2 -- 128.142.161.226:6889/3971383 >> 10.16.105.22:0/3679208441 conn(0x7f7de1b46000 :6889 s=STATE_OPEN pgs=177781170 cs=1 l=1)._stop
    -1> 2018-07-03 22:36:16.550164 7f7dc46e4700  1 -- 128.142.161.226:6889/3971383 >> - conn(0x7f7dfd607000 :6889 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=2359 -
     0> 2018-07-03 22:36:16.550379 7f7dbff19700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7dbff19700 thread_name:safe_timer

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (()+0xa41df1) [0x7f7dca574df1]
 2: (()+0xf6d0) [0x7f7dc6fcd6d0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I have the abrt dump if that can be useful.

#65 Updated by Brad Hubbard 12 months ago

Having reviewed the code in question again I was afraid that may be the case. If you can provide the crash dump Dan, I'd appreciate it.

#66 Updated by Dan van der Ster 12 months ago

Brad Hubbard wrote:

Having reviewed the code in question again I was afraid that may be the case. If you can provide the crash dump Dan, I'd appreciate it.

Here it is with ceph-post-file: 3b64cadc-3365-46d3-a983-300de3e43b95

#67 Updated by Brad Hubbard 12 months ago

  • Related to deleted (Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics)

#68 Updated by Alex Gorbachev 12 months ago

No issues so far since injecting osd_enable_op_tracker=false

#69 Updated by Brad Hubbard 12 months ago

Alex, So that's a week without issue when previously you were getting a crash every 3-4 days right?

#70 Updated by Alex Gorbachev 12 months ago

That is correct, Brad. No crashes for 7 days now.

#71 Updated by Serg D 12 months ago

We see periodically with osd_enable_op_tracker = false
Last time

[Sat Jul  7 12:56:42 2018] safe_timer[13626]: segfault at 400000004000 ip 00007fca68a91ccb sp 00007fca61bdef70 error 4 in libgcc_s.so.1[7fca68a83000+16000]

#72 Updated by Brad Hubbard 11 months ago

The latest core uploaded by Dan in comment 66 is slightly different to the others we've seen so far.

Once again the lock is an "OSD::tick_timer_lock".

(gdb) f 5
#5  0x00007f7dca5aea04 in SafeTimer::timer_thread (this=0x7f7dd3eaa190) at /usr/src/debug/ceph-12.2.5/src/common/Timer.cc:97
97            callback->complete(0);
(gdb) p lock
$10 = (Mutex &) @0x7f7dd3eaa138: {
  name = "OSD::tick_timer_lock",
...

This time callback appears to be an instance of C_handle_write though.

(gdb) p callback
$15 = (Context *) 0x7f7deb04a080
(gdb) p *callback
$16 = {
  _vptr.Context = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
}
(gdb) p *(C_handle_write*)callback
$17 = {
  <EventCallback> = {
    _vptr.EventCallback = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
  }, 
  members of C_handle_write: 
  conn = {
    px = 0x7f7de55ee800
  }
}
(gdb) info vtbl callback
vtable for 'Context' @ 0x7f7dcaf70f70 (subobject @ 0x7f7deb04a080):
[0]: 0x7f7dca864a90 <C_handle_write::do_request(int)>
[1]: 0x7f7dca863350 <C_handle_write::~C_handle_write()>
[2]: 0x7f7dca863380 <C_handle_write::~C_handle_write()>
[3]: 0x0

The "conn" member looks valid so, like the others, this appears to be just corruption of the vtable.

(gdb) p *((AsyncConnection*)0x7f7de55ee800)->write_handler
$19 = {
  _vptr.EventCallback = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
}

I'm looking at ways we can instrument the binaries to get more information about these crashes.

#73 Updated by Alex Gorbachev 11 months ago

We actually got one on July 15: Jul 14 23:54:42 roc04r-sc3a080 kernel: [6988357.283555] safe_timer19917: segfault at 7a0000080000 ip 00007f5100205ccb sp 00007f50f9352f70 error 4 in libgcc_s.so.1[7f51001f7000+16000]

crash dump in https://drive.google.com/open?id=1LImiTIBjOJgPsEXYYBgTgVbUm8LHXy8K

#74 Updated by Josh Durgin 11 months ago

  • Duplicated by Bug #24935: SafeTimer? osd killed by kernel for Segmentation fault added

#75 Updated by Alex Gorbachev 11 months ago

After the upgrade to 12.2.7 I am still seeing crashes on OSDs. Please check and advise if a separate tracker should be opened.

OSD log at: https://drive.google.com/open?id=1b_e1VFmGhYO4Y4Mj6-ommqtG5QFoPJT8

OSD crash dump at: https://drive.google.com/open?id=1VK3O6fugxOm2h4oh3mK8p2NCNuWtu6sC

#76 Updated by Brad Hubbard 11 months ago

Thanks Alex,

I'll check out the core tomorrow and let you know.

I have been working on instrumenting the ceph-osd binary to detect this earlier and perhaps more often and we are testing with that binary now.

I was also thinking that, since the OSDHealthMetric related code only triggers when there are PENDING_CREATING_PGS, these clusters may be seeing a lot of pg creates due to workload? Perhaps there is a lot of splitting going on in these clusters or some other factors at play leading to a lot of pg creates? I'd be interested in anyone's thoughts on this aspect and the workload generating this behaviour.

#77 Updated by Roberto Valverde 11 months ago

Hi Brad,

We spotted again this issue in one of our clusters, just 2 hours after we upgraded from 12.2.5 -> 12.2.7. Curiously is the first time that we see this issue on this cluster since creation (~end of January'18). It's a very small test cluster with only 5 OSD's and is not heavily used.

We spotted the same signature on the mgr just before the crash:

mgr.server send_report send_report osd,4.0x5562b44518d8 sent me an unknown health metric: P

Here is the osd log and coredump:

ceph-post-file: a13a7ebb-850f-456e-b7ec-5a35121e9ca1

I hope it helps,

Roberto

#78 Updated by Brad Hubbard 11 months ago

Thanks Roberto,

Your core, as well as the last uploaded by Alex show the now familiar corruption to the vtable of the C_Tick_WithoutOSDLock instance (callback pointer within SafeTimer). I think we have enough core dumps for now as they don't seem to be providing any new information. Thanks to all who provided data.

I am currently working on testing with an instrumented ceph-osd binary which should, in theory, capture the crash scenario at an earlier point and may therefore give us more information about this perplexing problem. I plan to push hard to reproduce this in our testing environment in the next few days and any feedback from the question in comment 76 may help me to do so. If I fail to reproduce in-house I can make the packages available to anyone willing to install them in an attempt to capture a coredump that may provide more clues.

#79 Updated by Dan van der Ster 11 months ago

Brad Hubbard wrote:

I was also thinking that, since the OSDHealthMetric related code only triggers when there are PENDING_CREATING_PGS, these clusters may be seeing a lot of pg creates due to workload? Perhaps there is a lot of splitting going on in these clusters or some other factors at play leading to a lot of pg creates? I'd be interested in anyone's thoughts on this aspect and the workload generating this behaviour.

Can we see that state from the coredump somehow? Basically none of our clusters should have ongoing pg creates -- we don't have any splitting going on.

Otherwise:
  • Echoing what others have said, I am quite sure the bug arrived between 12.2.2 and 12.2.4 -- we ran with 12.2.2 for several months but started seeing this crash just a few days after our first update to 12.2.4.
  • The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.
  • When running with debug_timer = 0/20 we can see when the timers are added and called. From this I've learned that (for all cases I've checked) the same callback address which is crashing had been used successfully several times already. e.g. here is a crash
        -2> 2018-07-06 21:17:26.357784 7fab82a5c700 20 timer(0x55d9b1092190).timer_thread awake
        -1> 2018-07-06 21:17:26.357793 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
         0> 2018-07-06 21:17:26.359394 7fab82a5c700 -1 *** Caught signal (Segmentation fault) **
     in thread 7fab82a5c700 thread_name:safe_timer
    

    and here are previous successful 0x55d9effa6f70 callbacks:
    # zgrep 0x55d9effa6f70 /var/log/ceph/ceph-osd.380.log-20180707.gz
     -7927> 2018-07-06 21:17:19.356895 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:20.356892 -> 0x55d9effa6f70
     -7537> 2018-07-06 21:17:20.356966 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
     -6859> 2018-07-06 21:17:22.357369 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:23.357367 -> 0x55d9effa6f70
     -5149> 2018-07-06 21:17:23.357436 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
     -2963> 2018-07-06 21:17:25.357724 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:26.357724 -> 0x55d9effa6f70
        -1> 2018-07-06 21:17:26.357793 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
    

    So we know that the corruption happened sometime after 21:17:23.357436. I've posted that log (ceph-post-file: e1cdcb94-8dd9-4b5c-b665-5b6e87bd47f9) in case it's useful.

#80 Updated by Dan van der Ster 11 months ago

Dan van der Ster wrote:

  • The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.

That said, this commit (landed in 12.2.3) that sends the health check sure is close to where the C_Tick_WithoutOSDLock is added:

commit bb5ee06baac34aff59578575e14fe2a6bb6ae0ab
Author: Kefu Chai <kchai@redhat.com>
Date:   Mon Oct 23 12:29:31 2017 +0800

    osd: send health-checks to mgr

...

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 8628d0c547..76aa56d058 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5326,7 +5326,7 @@ void OSD::tick_without_osd_lock()
     }
   }

-  check_ops_in_flight();
+  mgrc.update_osd_health(get_health_metrics());
   service.kick_recovery_queue();
   tick_timer_without_osd_lock.add_event_after(OSD_TICK_INTERVAL, new C_Tick_WithoutOSDLock(this));
 }
@@ -7539,6 +7539,20 @@ void OSD::sched_scrub()

#81 Updated by Brad Hubbard 11 months ago

Dan van der Ster wrote:

Can we see that state from the coredump somehow? Basically none of our clusters should have ongoing pg creates -- we don't have any splitting going on.

Yes, of course you're right, we can see the current state.

(gdb) p osd
$1 = (OSD *) 0x7f7dd3eaa000
(gdb) p osd->pending_creates_from_mon
$2 = 0
(gdb) p osd->pending_creates_from_osd
$3 = std::set with 0 elements

So this shows us there are no pending creates at the time we crashed, but not how many there were (if any) leading up to this point. The reason we are speculating about pg creates is because the sole metric being collected by get_health_metrics(), and therefore the sole metric being stored in an OSDHealthMetric vector is a count of pending creates. However, even if there are no pending creates an OSDHealthMetric containing an entry of type osd_metric::PENDING_CREATING_PGS and value zero is created. If you are seeing pending creates you should be seeing warnings like the following in the cluster and mon logs. If you aren't seeing those then I guess there is no relationship between pending creates and this crash.

Health check failed: 3 PGs pending on creation (PENDING_CREATING_PGS)
Otherwise:
  • Echoing what others have said, I am quite sure the bug arrived between 12.2.2 and 12.2.4 -- we ran with 12.2.2 for several months but started seeing this crash just a few days after our first update to 12.2.4.

Understood.

  • The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.

Agreed, and also we can't ignore the possibility that a commit added is not itself at fault, but somehow triggered a dormant issue in the existing code. We have pored over these commits but, as yet, have failed to identify any good candidates.

  • When running with debug_timer = 0/20 we can see when the timers are added and called. From this I've learned that (for all cases I've checked) the same callback address which is crashing had been used successfully several times already. e.g. here is a crash
    [...]
    and here are previous successful 0x55d9effa6f70 callbacks:
    [...]

I agree it's the same address, but not the same instance. As your code sample in comment 80 shows a new instance of C_Tick_WithoutOSDLock is created each time an event is scheduled (with add_event_after calling add_event_at with the new pointer) and when callback->complete() is called it's actually calling complete on the Context object (C_Tick_WithoutOSDLock being derived from Context) which looks like this.

  virtual void complete(int r) {
    finish(r);
    delete this;
  }

So with every cycle of an event being scheduled and the callback being executed a new instance of C_Tick_WithoutOSDLock is allocated and deleted. I believe what the logs reflect is the propensity of the memory allocator to hand back the previously allocated memory when a new allocation of the same size is requested as the following test program and its output shows.

#include <iostream>
using namespace std;
struct dummy
{
  void complete() { delete this; }
};
void add_event_at(dummy* x) {
  cout << "x = " << x << std::endl;
  x->complete();
}
int main(int argc, char **argv)
{
  for (int x = 0; x < 10; x++ ) {
    add_event_at(new dummy());
  }
    return 0;
}

$ ./a.out
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70

So we know that the corruption happened sometime after 21:17:23.357436. I've posted that log (ceph-post-file: e1cdcb94-8dd9-4b5c-b665-5b6e87bd47f9) in case it's useful.

I'd suggest that we only know it happened after 21:17:26.357724 when the C_Tick_WithoutOSDLock object was created.

I agree the patch in comment 80 is suspicious, as are the others that implement the OSDHealthMetrics collection but these have been looked at by multiple eyes without raising any red flags so far. If my current test come up empty I could offer a couple of options to anyone who wants to test with them. Option 1: An instrumented ceph-osd binary that tries to capture the issue as early as possible in an attempt to gather more information on what teh actual issue is. Option 2: A version where we back out the OSDHealthMetric changes to see whether this stops the crashes and, therefore, that these changes are somehow related to the crashes (although not necessarily the root cause).

Thank you for the assist Dan, I appreciate the feedback and data.

#82 Updated by Alex Gorbachev 11 months ago

Latest crash just happened here, no messages not in the OSD log, but crash dump is generated and dmesg shows:

[Tue Jul 31 15:46:18 2018] safe_timer10698: segfault at 10000012f ip 00007fade6921ccb sp 00007faddfc75f70 error 4 in libgcc_s.so.1[7fade6913000+16000]

in ceph.log:

2018-07-31 15:00:00.000145 mon.roc-vm-sc3c234 mon.0 10.80.3.234:6789/0 78162 : cluster [INF] overall HEALTH_OK
2018-07-31 15:46:06.369353 mon.roc-vm-sc3c234 mon.0 10.80.3.234:6789/0 78328 : cluster [INF] osd.303 failed (root=sata4t,host=roc02r-sc3a040) (2 reporters from different host after 20.000141 >= grace 20.000000)

OSD log: https://drive.google.com/open?id=1uiyQXUX_9cYDBaiAfHBhjYXSMNf84AM2

Crash dump: https://drive.google.com/open?id=1DNV1N1bLXCINOEIvj-a3LDPlE0o2GUf6

#83 Updated by Brad Hubbard 11 months ago

segfault at 10000012f

(gdb) info vtbl callback
vtable for 'Context' @ 0x55636ca8db20 (subobject @ 0x5562c1fc1e00):
[0]: 0x55636ca8acc0
[1]: 0x5562000000ba
[2]: 0x55636ca8d900
[3]: 0x10000012f <--- *

I think we can stop uploading coredumps for now as they are not providing any more information at this stage, but thanks.

I have been unsuccessful so far in trying to reproduce this although I have one more avenue to try and, of course, I will update here if I am successful or not.

#84 Updated by Richard Bade 11 months ago

Just adding another "me too" on this. I've hit this on Luminous 12.2.7 also under Ubuntu 16.04.4 with 4.15.0-24-generic #26~16.04.1-Ubuntu

Getting osd's restarting one or two a day, often at the beginning or end of the day.
Here's some dmesg entries for one node.

[Sun Jul 29 11:17:22 2018] safe_timer[4900]: segfault at 100000010000 ip 00007f0e1860fccb sp 00007f0e11963830 error 4 in libgcc_s.so.1[7f0e18601000+16000]
[Tue Jul 31 18:11:29 2018] safe_timer[4895]: segfault at 400000010000 ip 00007fdb3ff70ccb sp 00007fdb392c4830 error 4 in libgcc_s.so.1[7fdb3ff62000+16000]
[Wed Aug  1 14:30:29 2018] safe_timer[4893]: segfault at 10000008c ip 00007fa40c586ccb sp 00007fa4058da830 error 4 in libgcc_s.so.1[7fa40c578000+16000]

And some entries from syslog

2018-08-01 14:30:20.000    cstor01.XXXX systemd[1]: ceph-osd@130.service: Failed with result 'signal'.
2018-08-01 14:30:20.000    cstor01.XXXX ceph-osd[4819]: *** Caught signal (Segmentation fault) **
2018-08-01 14:30:20.000    cstor01.XXXX systemd[1]: ceph-osd@130.service: Main process exited, code=killed, status=11/SEGV
2018-08-01 14:30:20.000    cstor01.XXXX ceph-osd[4819]:  in thread 7fa4058de700 thread_name:safe_timer
2018-08-01 14:30:20.000    cstor01.XXXX kernel: [1202527.072835] safe_timer[4893]: segfault at 10000008c ip 00007fa40c586ccb sp 00007fa4058da830 error 4 in libgcc_s.so.1[7fa40c578000+16000]
2018-08-01 14:30:20.000    cstor101.XXX ceph-mgr: 2018-08-01 14:30:20.716863 7fe767e2a700  1 mgr send_beacon standby
2018-08-01 14:30:20.000    cstor01.XXXX systemd[1]: ceph-osd@130.service: Unit entered failed state.

Please let me know if I can provide any further info to help track this down.

#85 Updated by Kjetil Joergensen 11 months ago

For MgrClient::update_osd_health, does the move-assignment compile into updating a pointer to a std::vector, or does it compile into copying the internal fields from one into the other ? If it's the latter, MgrClient::_send_report() could in theory read a half-constructed std::vector.

// src/mgr/MgrClient.h
class MgrClient : public Dispatcher
{
// ....
  std::vector<OSDHealthMetric> osd_health_metrics;
// ....
}
// src/mgr/MgrClient.cc
void MgrClient::update_osd_health(std::vector<OSDHealthMetric>&& metrics)
{
  osd_health_metrics = std::move(metrics);
}

This also seems to fit with the introduction of the first use of MgrClient::update_osd_health, as previously MgrClient::osd_health_metrics would have remained unchanged. Tentatively hard to reproduce given fairly specific timings between threads, one of which is scheduled only once every OSD::OSD_TICK_INTERVAL of 1s.

#86 Updated by Brad Hubbard 11 months ago

Thanks Kjetil,
I think you are right, we should hold the lock in update_osd_health(). Not sure how we all missed that, especially me. I'm working on a test package now for anyone who's interested and, of course, will get a fix into luminous mimic and master since all are affected.

If you give me your preferred email address I'll make sure you get credited for the fix Kjetil.

#87 Updated by Kjetil Joergensen 11 months ago

Brad - you can just use

#88 Updated by Brad Hubbard 11 months ago

  • Status changed from Verified to In Progress
  • Backport changed from luminous to mimic, luminous

#89 Updated by Brad Hubbard 11 months ago

I've created a test package here based on 12.2.7 and including the one line patch above.

https://shaman.ceph.com/repos/ceph/wip-badone-testing-luminous/b234ea982830f9b635e8afc5bb4b6fb4adfbee39/

I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.

#90 Updated by Kefu Chai 11 months ago

  • Status changed from In Progress to Pending Backport

#91 Updated by Brad Hubbard 11 months ago

  • Copied to Backport #26870: mimic: osd: segfaults under normal operation added

#92 Updated by Brad Hubbard 11 months ago

  • Copied to Backport #26871: luminous: osd: segfaults under normal operation added

#94 Updated by lin zhou 11 months ago

Brad Hubbard wrote:

I've created a test package here based on 12.2.7 and including the one line patch above.

https://shaman.ceph.com/repos/ceph/wip-badone-testing-luminous/b234ea982830f9b635e8afc5bb4b6fb4adfbee39/

I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.

if i want to compile this for debian,should I replace all daemon? or just osd?

#95 Updated by Brad Hubbard 10 months ago

  • Status changed from Pending Backport to Resolved

The patch is only relevant to the osds.

#96 Updated by Phat Le Ton 10 months ago

Brad Hubbard wrote:

I've created a test package here based on 12.2.7 and including the one line patch above.

https://shaman.ceph.com/repos/ceph/wip-badone-testing-luminous/b234ea982830f9b635e8afc5bb4b6fb4adfbee39/

I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.

Hi Brad Hubbard,
I've just seen 12.2.8 release, Was your patch included in this release ?

#97 Updated by Nathan Cutler 10 months ago

Phat Le Ton wrote:

I've just seen 12.2.8 release, Was your patch included in this release ?

Yes. See https://tracker.ceph.com/issues/26871 (target version is v12.2.8, which means the patch is in that version).

#98 Updated by Josh Durgin 5 months ago

  • Duplicated by Bug #37978: osd killed by kernel for Segmentation fault added

Also available in: Atom PDF