Bug #23352
osd: segfaults under normal operation
Description
-1> 2018-03-13 22:03:27.390956 7f42eec36700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1520993007390955, "job": 454, "event": "table_file_deletion", "file_number": 7466}
0> 2018-03-13 22:04:32.172308 7f4307c68700 -1 *** Caught signal (Segmentation fault) **
in thread 7f4307c68700 thread_name:safe_timer
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa74234) [0x557b9bdf4234]
2: (()+0x11390) [0x7f430f7d9390]
3: (SafeTimer::timer_thread()+0x44f) [0x557b9be3398f]
4: (SafeTimerThread::entry()+0xd) [0x557b9be34d9d]
5: (()+0x76ba) [0x7f430f7cf6ba]
6: (clone()+0x6d) [0x7f430e84641d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.81.log
--- end dump of recent events ---
Related issues
History
#1 Updated by Sage Weil about 6 years ago
- Status changed from New to Need More Info
Do you have a core file? I haven't seen this crash before.
#2 Updated by Alex Gorbachev about 6 years ago
Hi Sage, I do have the core dump. Where can I upload the file? It's rather large: 850 MB compressed.
#3 Updated by Alex Gorbachev about 6 years ago
Sage, PM'ed to you the public download link, hope it works.
#4 Updated by Alex Gorbachev about 6 years ago
Here is the link to the core dump https://drive.google.com/open?id=1tOTqSOaS94gOhHfXmGbbfuXLNFFfOVuf
#5 Updated by Chris Hoy Poy about 6 years ago
Also seeing these, no core dump but have now had 3 segfaults in 2 weeks since upgrading to 12.2.4 from a very stable 12.2.2.
Luminous 12.2.4, bluestore. I usually have logging disabled, will enable it again at next opportunity.
[dmesg]
[35103471.167728] safe_timer[491476]: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]
[log just after it restarted, to confirm versions]
2018-03-22 06:24:58.603146 7f796e6c3e00 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 1339770
2018-03-22 06:24:58.613610 7f796e6c3e00 0 pidfile_write: ignore empty --pid-file
2018-03-22 06:24:58.619318 7f796e6c3e00 0 load: jerasure load: lrc load: isa
2018-03-22 06:24:58.619392 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.619403 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.619609 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.619820 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes max 0.0639132 < ratio 0.99
2018-03-22 06:24:58.619837 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes cache_size 8400000000 meta 0.936087 kv 0.0639132 data 2.23517e-08
2018-03-22 06:24:58.619841 7f796e6c3e00 1 bdev(0x5642923baac0 /var/lib/ceph/osd/osceph-38/block) close
2018-03-22 06:24:58.909527 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _mount path /var/lib/ceph/osd/osceph-38
2018-03-22 06:24:58.909773 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.909779 7f796e6c3e00 1 bdev(0x5642923bad00 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.909893 7f796e6c3e00 1 bdev(0x5642923bad00 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.910062 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes max 0.0639132 < ratio 0.99
2018-03-22 06:24:58.910071 7f796e6c3e00 1 bluestore(/var/lib/ceph/osd/osceph-38) _set_cache_sizes cache_size 8400000000 meta 0.936087 kv 0.0639132 data 2.23517e-08
2018-03-22 06:24:58.910113 7f796e6c3e00 1 bdev create path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3 type kernel
2018-03-22 06:24:58.910118 7f796e6c3e00 1 bdev(0x5642923baf40 /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3) open path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3
2018-03-22 06:24:58.910256 7f796e6c3e00 1 bdev(0x5642923baf40 /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3) open size 33554432000 (0x7d0000000, 32000 MB) block_size 4096 (4096 B) non-rotational
2018-03-22 06:24:58.910272 7f796e6c3e00 1 bluefs add_block_device bdev 1 path /dev/disk/by-partuuid/f6942001-a7af-4f8f-8ffd-119fbcec34b3 size 32000 MB
2018-03-22 06:24:58.910486 7f796e6c3e00 1 bdev create path /var/lib/ceph/osd/osceph-38/block type kernel
2018-03-22 06:24:58.910493 7f796e6c3e00 1 bdev(0x5642923bb180 /var/lib/ceph/osd/osceph-38/block) open path /var/lib/ceph/osd/osceph-38/block
2018-03-22 06:24:58.911462 7f796e6c3e00 1 bdev(0x5642923bb180 /var/lib/ceph/osd/osceph-38/block) open size 3000588304384 (0x2baa1000000, 2794 GB) block_size 4096 (4096 B) rotational
2018-03-22 06:24:58.911469 7f796e6c3e00 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/osceph-38/block size 2794 GB
2018-03-22 06:24:58.911514 7f796e6c3e00 1 bluefs mount
2018-03-22 06:24:58.923367 7f796e6c3e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-03-22 06:24:58.923375 7f796e6c3e00 0 set rocksdb option compression = kNoCompression
2018-03-22 06:24:58.923390 7f796e6c3e00 0 set rocksdb option max_write_buffer_number = 4
2018-03-22 06:24:58.923393 7f796e6c3e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-03-22 06:24:58.923396 7f796e6c3e00 0 set rocksdb option recycle_log_file_num = 4
2018-03-22 06:24:58.923397 7f796e6c3e00 0 set rocksdb option writable_file_max_buffer_size = 0
2018-03-22 06:24:58.923400 7f796e6c3e00 0 set rocksdb option write_buffer_size = 268435456
2018-03-22 06:24:58.923416 7f796e6c3e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-03-22 06:24:58.923418 7f796e6c3e00 0 set rocksdb option compression = kNoCompression
2018-03-22 06:24:58.923420 7f796e6c3e00 0 set rocksdb option max_write_buffer_number = 4
2018-03-22 06:24:58.923422 7f796e6c3e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-03-22 06:24:58.923424 7f796e6c3e00 0 set rocksdb option recycle_log_file_num = 4
2018-03-22 06:24:58.923425 7f796e6c3e00 0 set rocksdb option writable_file_max_buffer_size = 0
2018-03-22 06:24:58.923427 7f796e6c3e00 0 set rocksdb option write_buffer_size = 268435456
2018-03-22 06:24:58.923532 7f796e6c3e00 4 rocksdb: RocksDB version: 5.4.0
2018-03-22 06:24:58.923538 7f796e6c3e00 4 rocksdb: Git sha rocksdb_build_git_sha:0
2018-03-22 06:24:58.923540 7f796e6c3e00 4 rocksdb: Compile date Feb 26 2018
#6 Updated by Kjetil Joergensen about 6 years ago
"Me too". I had a brief look at the coredump, without becoming all that much wiser. Judging by the lock attached to the safe_timer thread, the recipient of the segfault was the osd tick timer (OSD::tick_timer_lock). Although, what segfaults is the instruction fetch, so it sort of feels like we may have clobbered the callback.
#7 Updated by Nathan Cutler about 6 years ago
- Status changed from Need More Info to New
- Priority changed from Normal to High
Raising priority because it's a possible regression in Luminous.
#8 Updated by Brad Hubbard about 6 years ago
- Related to Bug #23431: OSD Segmentation fault in thread_name:safe_timer added
#9 Updated by Brad Hubbard about 6 years ago
Chris,
Was your stack identical to Alex's original description or was it more like the stack in #23431 ?
#10 Updated by Kefu Chai about 6 years ago
For those who want to check the coredump: you should use apport-unpack to unpack it first.
It crashed at /build/ceph-12.2.4/src/common/Timer.cc:97:
 95       if (!safe_callbacks)
 96         lock.Unlock();
>97       callback->complete(0);
 98       if (!safe_callbacks)
 99         lock.Lock();
(gdb) p callback
$34 = (Context *) 0x557bd02d9f00
(gdb) p *callback
$35 = {_vptr.Context = 0x557bd0a59810}
0x557bd0a59810: 0x60 0x92 0xa5 0xd0 0x7b 0x55 0x00 0x00
0x557bd0a59818: 0x2c 0x00 0x00 0x00 0x01 0x00 0x00 0x00
0x557bd0a59820: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x557bd0a59828: 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00
0x557bd0a59830: 0x00 0x00 0x32 0x5a 0x26 0x00 0x00 0x00
0x557bd0a59838: 0x00 0x00 0x08 0x00 0x7b 0x55 0x00 0x00
0x557bd0a59840: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x557bd0a59848: 0x00 0x00 0x01 0x00 0x7c 0x55 0x00 0x00
...
(gdb) info vtbl callback
vtable for 'Context' @ 0x557bd0a59810 (subobject @ 0x557bd02d9f00):
[0]: 0x557bd0a59260
[1]: 0x10000002c
[2]: 0x0
[3]: 0x1000000000000
The callback should be an instance of C_Tick_WithoutOSDLock, but its vtable does not look right, even though its "osd" field is correct:
(gdb) p *(OSD::C_Tick_WithoutOSDLock*)callback
$18 = {<Context> = {_vptr.Context = 0x557bd0a59810}, osd = 0x557ba607a000}
(gdb) p stopping
$1 = false
(gdb) thread 32
(gdb) p osd
$19 = (OSD *) 0x557ba607a000
0x557bd0a59810 is a heap address, but I tried to inspect the memory chunk it points to in different ways and could not recognise anything ...
#11 Updated by Patrick Donnelly almost 6 years ago
- Project changed from Ceph to RADOS
- Subject changed from OSD segfaults under normal operation to osd: segfaults under normal operation
- Source set to Community (user)
- Backport set to luminous
- Component(RADOS) OSD added
#12 Updated by wei jin almost 6 years ago
I ran into this issue with 12.2.5; it affects cluster stability heavily.
#13 Updated by Alex Gorbachev almost 6 years ago
For some reason we are also seeing more of these: simultaneous failures and recoveries are occurring during the day and night.
#14 Updated by Alex Gorbachev almost 6 years ago
The latest ones look like this, below.
Crash dump at https://drive.google.com/open?id=12v95-TCHlkrBZ16ni5UkhYkXRtJcbPXq
0> 2018-05-18 14:00:47.405486 7f1a76064700 -1 *** Caught signal (Segmentation fault) **
in thread 7f1a76064700 thread_name:msgr-worker-1
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa7cab4) [0x55b69fa49ab4]
2: (()+0x11390) [0x7f1a79be4390]
3: (MMgrReport::encode_payload(unsigned long)+0x5d2) [0x55b69fac21c2]
4: (Message::encode(unsigned long, int)+0x29) [0x55b69fac5b89]
5: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x392) [0x55b69fd759f2]
6: (AsyncConnection::handle_write()+0x5c5) [0x55b69fd7e7f5]
7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1125) [0x55b69fb30645]
8: (()+0xb66eb8) [0x55b69fb33eb8]
9: (()+0xb8c80) [0x7f1a794ebc80]
10: (()+0x76ba) [0x7f1a79bda6ba]
11: (clone()+0x6d) [0x7f1a78c5141d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
#15 Updated by Josh Durgin almost 6 years ago
- Related to Bug #24023: Segfault on OSD in 12.2.5 added
#16 Updated by Josh Durgin almost 6 years ago
- Related to Bug #23585: osd: safe_timer segfault added
#17 Updated by Josh Durgin almost 6 years ago
- Related to Bug #23564: OSD Segfaults added
#18 Updated by Josh Durgin almost 6 years ago
I put the core file from comment #14 and binaries from 12.2.5 in senta02:/slow/jdurgin/ceph/bugs/tracker_23352/2018-05-18 - but it'll need the right versions of libgcc/libstdc++ in another env (container/vm/different box) to read the symbols properly.
#19 Updated by Alex Gorbachev almost 6 years ago
Josh, is this something I can extract from the OSD node for you, or is this an internal comment?
#20 Updated by Josh Durgin almost 6 years ago
It's an internal comment for others looking at this - though if you (Alex) have an osd log to go with the 'MMgrReport::encode_payload' crash, that would help
#22 Updated by Sage Weil almost 6 years ago
- Status changed from New to Need More Info
Alex, how reproducible is this for you? Could you reproduce with debug timer = 20?
#23 Updated by Alex Gorbachev almost 6 years ago
Sage, I had tried to do this, but we don't know when these crashes will happen, just that they will occur at random times, and I get paged after the fact. Since this is in production, running with debug timer = 20 is likely to fill up local space, correct? Also odd: I have not seen a crash in the last couple of days, while the week before we had 3-4 per day.
#24 Updated by Brad Hubbard almost 6 years ago
Hi Alex,
I notice there are several more coredumps attached to the related bug reports. Are they all separate crashes? Please feel free to upload any more you can if it's not too much trouble as the more we have the more likely we are to see a pattern. I'm currently looking at the ones you linked to this bug and will update here when I have completed my analysis.
#25 Updated by Beom-Seok Park almost 6 years ago
- File ceph-osd.123.log.gz added
Today two osd crashes.
coredump at:
https://drive.google.com/open?id=1rXtW0riZMBwP5OqrJ7QdRIOAsKFr-kYw
https://drive.google.com/open?id=16J-GoHRECap-jchy8gOEXMtIcI5z4Ugw
osd log attached.
#26 Updated by Brad Hubbard almost 6 years ago
Thanks Beom-Seok,
I've set up a centos environment to debug those cores along with the Xenial ones. I will update here with any findings going forward.
#27 Updated by Alex Gorbachev almost 6 years ago
Hi Brad, we had one too just now, core dump and log:
https://drive.google.com/open?id=1t1jfjqwjhUUBzWjxamos3Hr7ghjxRPg6
https://drive.google.com/open?id=1iuu0GQ8yy8UT3d2qLBITyDUseJIVYmh-
#28 Updated by Brad Hubbard almost 6 years ago
Thanks,
That gives us seven cores across 12.2.4-12.2.5 on Xenial and CentOS, and one core from the MMgrReport::encode_payload crash. That should be enough for the time being.
#29 Updated by Brad Hubbard almost 6 years ago
If we look at the coredump from #23585 and compare it to this message:
[117735.930255] safe_timer[52573]: segfault at 561500001000 ip 00007f156d774ccb sp 00007f15668c1f70 error 4 in libgcc_s.so.1[7f156d766000+16000]
It's clear the memory address we segfaulted at was 0x561500001000. Looking at the disassembly for line 97 in Timer.cc:
97              callback->complete(0);
   0x0000561579b76987 <+1095>: mov    (%rbx),%rax
   0x0000561579b7698a <+1098>: xor    %esi,%esi        <--- IGNORE
   0x0000561579b7698c <+1100>: mov    %rbx,%rdi        <--- IGNORE
   0x0000561579b7698f <+1103>: callq  *0x18(%rax)
The address of callback is in rbx and we dereference it and store it in rax.
(gdb) p callback
$3 = (Context *) 0x5616160cf130
(gdb) i r rbx
rbx            0x5616160cf130      94652859216176
(gdb) x/a 0x5616160cf130
0x5616160cf130: 0x56162de55220
(gdb) i r rax
rax            0x56162de55220      94653259272736
We then dereference rax, add 0x18 to it and call that address.
(gdb) p/x 0x56162de55220+0x18
$4 = 0x56162de55238
(gdb) x/a 0x56162de55238
0x56162de55238: 0x561500001000   <--- Segfault address
As Kefu pointed out above this is the fourth entry in the vtable for "callback".
(gdb) info vtbl callback
vtable for 'Context' @ 0x56162de55220 (subobject @ 0x5616160cf130):
[0]: 0x561609696f80
[1]: 0x561591ff9000
[2]: 0x56162de56c00
[3]: 0x561500001000
This fits since 0x18 is 24 in decimal and, since pointers are 8 bytes here, it's the offset to the fourth entry in the table (first entry offset being zero).
(gdb) l 83
78          lock.Lock();
79          ldout(cct,10) << "timer_thread starting" << dendl;
80          while (!stopping) {
81            utime_t now = ceph_clock_now();
82
83            while (!schedule.empty()) {
84              scheduled_map_t::iterator p = schedule.begin();
85
86              // is the future now?
87              if (p->first > now)
(gdb)
88                break;
89
90              Context *callback = p->second;
91              events.erase(callback);
92              schedule.erase(p);
93              ldout(cct,10) << "timer_thread executing " << callback << dendl;
94
95              if (!safe_callbacks)
96                lock.Unlock();
97              callback->complete(0);
(gdb)
98              if (!safe_callbacks)
99                lock.Lock();
100           }
(gdb) p schedule
$2 = std::multimap with 0 elements
So callback comes from the multimap schedule and, in this case, it appears to have been the last element. Checking whether that is the case for all coredumps as that may be significant.
#30 Updated by Brad Hubbard almost 6 years ago
I've confirmed that in all of the SafeTimer segfaults the 'schedule' multimap is empty, indicating this was the last entry in the map. Working out whether that is expected or not, and whether it is important or not :)
#31 Updated by Nathan Cutler almost 6 years ago
- Related to deleted (Bug #23431: OSD Segmentation fault in thread_name:safe_timer)
#32 Updated by Nathan Cutler almost 6 years ago
- Duplicated by Bug #23431: OSD Segmentation fault in thread_name:safe_timer added
#33 Updated by Brad Hubbard almost 6 years ago
- Related to deleted (Bug #24023: Segfault on OSD in 12.2.5)
#34 Updated by Brad Hubbard almost 6 years ago
- Duplicated by Bug #24023: Segfault on OSD in 12.2.5 added
#35 Updated by Brad Hubbard almost 6 years ago
- Related to deleted (Bug #23585: osd: safe_timer segfault)
#36 Updated by Brad Hubbard almost 6 years ago
- Duplicates Bug #23585: osd: safe_timer segfault added
#37 Updated by Brad Hubbard almost 6 years ago
- Related to deleted (Bug #23564: OSD Segfaults)
#38 Updated by Brad Hubbard almost 6 years ago
- Duplicated by Bug #23564: OSD Segfaults added
#39 Updated by Brad Hubbard almost 6 years ago
- Duplicates deleted (Bug #23585: osd: safe_timer segfault)
#40 Updated by Brad Hubbard almost 6 years ago
- Duplicated by Bug #23585: osd: safe_timer segfault added
#41 Updated by Kjetil Joergensen almost 6 years ago
Brad Hubbard wrote:
I've confirmed that in all of the SafeTimer segfaults the 'schedule' multimap is empty, indicating this was the last entry in the map. Working out whether that is expected or not, and whether it is important or not :)
Do threads other than the timer thread add/cancel events? std::map/std::multimap isn't inherently thread-safe, and we may be hitting some corner case around that: "works most of the time, except under these specific timing conditions".
#42 Updated by Brad Hubbard almost 6 years ago
Hi Kjetil,
Sure, worth a look, but AFAICT all access is protected by SafeTimer's locks.
#43 Updated by Brad Hubbard almost 6 years ago
Looking at the crash in http://tracker.ceph.com/issues/23352#note-14 there's a fairly glaring problem.
(gdb) f 13
#13 MMgrReport::encode_payload (this=0x55b6beed2400, features=<optimized out>) at /build/ceph-12.2.5/src/messages/MMgrReport.h:122
122         ::encode(osd_health_metrics, payload);
(gdb) whatis osd_health_metrics
type = std::vector<OSDHealthMetric>
$47 = std::vector of length -5890450176077, capacity -5890450176077 = {{
    type = 1 '\001',
    value = {{ n1 = 0, n2 = 0 }, n = 0 }
  }, {
    type = 0 '\000',
    value = {{ n1 = 65536, n2 = 65536 }, n = 281474976776192 }
  }, {
...
(gdb) p osd_health_metrics.size()
$48 = -5890450176077
We can disable the pretty printers to take a look at the implementation members.
(gdb) disable pretty-printer
141 printers disabled
0 of 141 printers enabled
(gdb) p osd_health_metrics
$49 = {
  <std::_Vector_base<OSDHealthMetric, std::allocator<OSDHealthMetric> >> = {
    _M_impl = {
      <std::allocator<OSDHealthMetric>> = {
        <__gnu_cxx::new_allocator<OSDHealthMetric>> = {<No data fields>}, <No data fields>},
      members of std::_Vector_base<OSDHealthMetric, std::allocator<OSDHealthMetric> >::_Vector_impl:
      _M_start = 0x55b7a30b84d0,
      _M_finish = 0x0,
      _M_end_of_storage = 0x0
    }
  }, <No data fields>}
If the vector were empty I'd expect both _M_start and _M_finish to be zero; if it weren't, I'd expect both to be non-zero.
It's possible _M_finish and _M_end_of_storage have been overwritten with zeros here, or that this vector<OSDHealthMetric> has been corrupted in some other way prior to the call to MMgrReport::encode_payload. OSDHealthMetric and its supporting code were only added in v12.2.3, so they seem like a good candidate, but so far I see nothing that might explain this behaviour.
#44 Updated by Vangelis Tasoulas almost 6 years ago
Experiencing a safe_timer segfault with a freshly deployed cluster. No data on the cluster yet; just an empty pool has been created. The cluster had only been running for 5 hours before the segfault on OSD 1 occurred.
# ceph osd pool get mypool all
size: 4
min_size: 2
crash_replay_interval: 0
pg_num: 1024
pgp_num: 1024
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
The ceph.log at the time of the failure:
2018-06-07 23:47:37.284748 mon.cephmon1 mon.0 192.168.1.1:6789/0 483 : cluster [INF] osd.0 failed (root=default,host=cephosd1) (connection refused reported by osd.2)
2018-06-07 23:47:37.645272 mon.cephmon1 mon.0 192.168.1.1:6789/0 530 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-06-07 23:47:37.645407 mon.cephmon1 mon.0 192.168.1.1:6789/0 531 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2018-06-07 23:47:38.670555 mon.cephmon1 mon.0 192.168.1.1:6789/0 533 : cluster [WRN] Health check failed: Reduced data availability: 9 pgs inactive, 22 pgs peering (PG_AVAILABILITY)
2018-06-07 23:47:44.643207 mon.cephmon1 mon.0 192.168.1.1:6789/0 535 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 12 pgs inactive, 41 pgs peering)
2018-06-07 23:47:58.351776 mon.cephmon1 mon.0 192.168.1.1:6789/0 538 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-06-07 23:47:58.351891 mon.cephmon1 mon.0 192.168.1.1:6789/0 539 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2018-06-07 23:47:58.351926 mon.cephmon1 mon.0 192.168.1.1:6789/0 540 : cluster [INF] Cluster is now healthy
2018-06-07 23:47:58.369415 mon.cephmon1 mon.0 192.168.1.1:6789/0 541 : cluster [INF] osd.0 192.168.1.10:6800/69949 boot
Dmesg segfault line on the host where the osd.0 is located:
[Thu Jun 7 23:47:37 2018] safe_timer[20005]: segfault at 800000006 ip 00007fc90498e688 sp 00007fc8fdf40080 error 4 in libgcc_s.so.1[7fc90497f000+16000]
OS is Debian Jessie:
# uname -r
4.9.0-0.bpo.5-amd64
And the Ceph version:
*** 12.2.5-1~bpo80+1 0 500 http://download.ceph.com/debian-luminous/ jessie/main amd64 Packages
I can provide further information if needed to pinpoint and fix the problem.
#45 Updated by Josh Durgin almost 6 years ago
- Assignee set to Brad Hubbard
#46 Updated by Josh Durgin almost 6 years ago
- Duplicated by Bug #24511: osd crushed at thread_name:safe_timer added
#47 Updated by Josh Durgin almost 6 years ago
Another instance: http://pulpito.ceph.com/yuriw-2018-06-19_21:29:48-rados-wip-yuri-testing-2018-06-19-1953-luminous-distro-basic-smithi/2682702/
Backtrace is the same:
#0  0x00007fe8229daccb in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007fe8229dc668 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007fe822717b4f in __GI___backtrace (array=array@entry=0x7fe81835f568, size=size@entry=100) at ../sysdeps/x86_64/backtrace.c:110
#3  0x000055de72423d94 in ceph::BackTrace::BackTrace (s=0, this=0x7fe81835f560) at /build/ceph-12.2.5-482-g94ab47a/src/common/BackTrace.h:23
#4  handle_fatal_signal (signum=11) at /build/ceph-12.2.5-482-g94ab47a/src/global/signal_handler.cc:117
#5  <signal handler called>
#6  0x0000400000010000 in ?? ()
#7  0x000055de72462e02 in SafeTimer::timer_thread (this=0x55de7c5b01c0) at /build/ceph-12.2.5-482-g94ab47a/src/common/Timer.cc:97
#8  0x000055de7246420d in SafeTimerThread::entry (this=<optimized out>) at /build/ceph-12.2.5-482-g94ab47a/src/common/Timer.cc:30
#9  0x00007fe8236926ba in start_thread (arg=0x7fe818362700) at pthread_create.c:333
#10 0x00007fe82270941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
#48 Updated by Brad Hubbard almost 6 years ago
In several of the crashes we are seeing lines like the following prior to the crash.
2018-06-19 23:53:56.795241 7fc42d47b700 -1 mgr.server send_report send_report osd,6.0x5639619d68d0 sent me an unknown health metric:
2018-06-20T18:15:37.109 INFO:tasks.ceph.mgr.x.argo022.stderr:2018-06-20 22:15:37.037543 7f4bf7727700 -1 mgr.server send_report send_report osd,0.0x5623242fe178 sent me an unknown health metric: �
In both cases the osd involved is the one that crashed.
This, to me, looks like a clear indication of corruption similar to what was seen in the MMgrReport::encode_payload crash above. This further implicates osd_health_metrics, which was added between 12.2.2 and 12.2.3. I'll continue to stare really hard at this code.
Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?
#49 Updated by Brad Hubbard almost 6 years ago
- Status changed from Need More Info to 12
- Severity changed from 3 - minor to 2 - major
#50 Updated by Serg D almost 6 years ago
Yeah. We got this in the mgr log before the ceph-osd segfault:
mgr.server send_report send_report osd,74.0x560276d34ed8 sent me an unknown health metric:
Brad Hubbard wrote:
Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?
#51 Updated by Kjetil Joergensen almost 6 years ago
Brad Hubbard wrote:
Can anyone confirm seeing the "unknown health metric" messages in the mgr logs prior to the segfault and confirm they are referring to the osd that crashed?
This is the same for us as well.
2018-06-19 12:26:52.542539 7f43cf269700 -1 mgr.server send_report send_report osd,430.0x5631bd2bc690 sent me an unknown health metric:
Around the same time, osd.430 segfaulted.
#52 Updated by Alex Gorbachev almost 6 years ago
Same here
2018-06-24 19:42:41.348699 7f3e53a46700 -1 mgr.server send_report send_report osd,226.0x55678069c850 sent me an unknown health metric
#53 Updated by Beom-Seok Park almost 6 years ago
confirmed
ceph-mgr.log:
2018-06-20 08:46:05.528656 7fb998ff2700 -1 mgr.server send_report send_report osd,215.0x563a20eaced8 sent me an unknown health metric: Ð
ceph.log:
2018-06-20 08:46:06.093646 mon.MON001 mon.0 10.xxx.xx.xx:6789/0 76106 : cluster [INF] osd.215 failed (root=volumes,host=OSD001) (connection refused reported by osd.170)
#54 Updated by Matt Dunavant almost 6 years ago
Also confirmed on Ubuntu 18.04/Ceph 13.2.0:
ceph-mgr.log
2018-06-24 11:14:47.317 7ff17b0db700 -1 mgr.server send_report send_report osd,121.0x55983610f8d0 sent me an unknown health metric: �
ceph-mon.log
2018-06-24 11:14:51.753 7fd1ae701700 1 mon.mn335@0(leader).osd e508 prepare_failure osd.121 10.xx.xx.xx:6816/1301379 from osd.92 10.xx.xx.xx:6816/1505976 is reporting failure:1
dmesg of OSD server
[3467506.385786] msgr-worker-01308098: segfault at 0 ip 00007f1320e56bcb sp 00007f131c0dff40 error 4 in libtcmalloc.so.4.3.0[7f1320e32000+45000]
uname -r
4.15.0-20-generic
ceph --version
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
#55 Updated by Brad Hubbard almost 6 years ago
Matt,
Can you provide a coredump or full backtrace?
#56 Updated by Dan van der Ster almost 6 years ago
We're getting a few crashes like this per week here on 12.2.5.
Here's a fileStore OSD:
-3> 2018-06-27 01:24:05.524583 7f681f467700 1 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10 26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1).tick idle(60000913) more than 60000000 us, mark self fault.
-2> 2018-06-27 01:24:05.524595 7f681f467700 1 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10 26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1).fault on lossy channel, failing
-1> 2018-06-27 01:24:05.524603 7f681f467700 2 -- 128.142.161.228:6912/3301378 >> 10.49.12.186:0/10 26606479 conn(0x7f6846a8f800 :6912 s=STATE_OPEN pgs=169384743 cs=1 l=1)._stop
0> 2018-06-27 01:24:05.537841 7f681b49d700 -1 *** Caught signal (Segmentation fault) **
in thread 7f681b49d700 thread_name:safe_timer
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa41df1) [0x7f6825af8df1]
2: (()+0xf6d0) [0x7f68225516d0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Coredump uploaded with ceph-post-file: 3e2a2a95-0249-4ffa-b562-f742bbcaf484
#57 Updated by Josh Durgin over 5 years ago
- Related to Bug #24667: osd: SIGSEGV in MMgrReport::encode_payload added
#58 Updated by Josh Durgin over 5 years ago
- Related to Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics added
#59 Updated by Brad Hubbard over 5 years ago
- Related to Bug #24037: osd: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed. added
#60 Updated by Brad Hubbard over 5 years ago
#61 Updated by Brad Hubbard over 5 years ago
We are investigating the potential race between get_health_metrics and the op_tracker code.
In the meantime, for anyone seeing a lot of these, it might be an interesting test to set osd_enable_op_tracker to false temporarily to see if it works around the problem. Please let us know the results if you decide you'd like to test this.
#62 Updated by Alex Gorbachev over 5 years ago
I set it, Brad, and am watching the status. We normally get one failure in 3-4 days.
#63 Updated by Brad Hubbard over 5 years ago
Thanks Alex!
#64 Updated by Dan van der Ster over 5 years ago
I injected osd_enable_op_tracker=false yesterday
2018-07-03 10:47:23.329688 7f7da3d3b700 0 osd.553 1172324 do_command r=0 osd_enable_op_tracker = 'false'
2018-07-03 10:47:23.329699 7f7da3d3b700 0 log_channel(cluster) log [INF] : osd_enable_op_tracker = 'false'
but unfortunately got a thread_name::safe_timer crash during the night:
-2> 2018-07-03 22:36:16.549276 7f7dc46e4700 2 -- 128.142.161.226:6889/3971383 >> 10.16.105.22:0/3679208441 conn(0x7f7de1b46000 :6889 s=STATE_OPEN pgs=177781170 cs=1 l=1)._stop
-1> 2018-07-03 22:36:16.550164 7f7dc46e4700 1 -- 128.142.161.226:6889/3971383 >> - conn(0x7f7dfd607000 :6889 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=2359 -
0> 2018-07-03 22:36:16.550379 7f7dbff19700 -1 *** Caught signal (Segmentation fault) **
in thread 7f7dbff19700 thread_name:safe_timer
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa41df1) [0x7f7dca574df1]
2: (()+0xf6d0) [0x7f7dc6fcd6d0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I have the abrt dump if that can be useful.
#65 Updated by Brad Hubbard over 5 years ago
Having reviewed the code in question again I was afraid that may be the case. If you can provide the crash dump Dan, I'd appreciate it.
#66 Updated by Dan van der Ster over 5 years ago
Brad Hubbard wrote:
Having reviewed the code in question again I was afraid that may be the case. If you can provide the crash dump Dan, I'd appreciate it.
Here it is with ceph-post-file: 3b64cadc-3365-46d3-a983-300de3e43b95
#67 Updated by Brad Hubbard over 5 years ago
- Related to deleted (Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics)
#68 Updated by Alex Gorbachev over 5 years ago
No issues so far since injecting osd_enable_op_tracker=false
#69 Updated by Brad Hubbard over 5 years ago
Alex, so that's a week without issues when previously you were getting a crash every 3-4 days, right?
#70 Updated by Alex Gorbachev over 5 years ago
That is correct, Brad. No crashes for 7 days now.
#71 Updated by Serg D over 5 years ago
We still see this periodically with osd_enable_op_tracker = false.
Last time:
[Sat Jul 7 12:56:42 2018] safe_timer[13626]: segfault at 400000004000 ip 00007fca68a91ccb sp 00007fca61bdef70 error 4 in libgcc_s.so.1[7fca68a83000+16000]
#72 Updated by Brad Hubbard over 5 years ago
The latest core uploaded by Dan in comment 66 is slightly different to the others we've seen so far.
Once again the lock is an "OSD::tick_timer_lock".
(gdb) f 5
#5  0x00007f7dca5aea04 in SafeTimer::timer_thread (this=0x7f7dd3eaa190) at /usr/src/debug/ceph-12.2.5/src/common/Timer.cc:97
97          callback->complete(0);
(gdb) p lock
$10 = (Mutex &) @0x7f7dd3eaa138: {
  name = "OSD::tick_timer_lock",
  ...
This time callback appears to be an instance of C_handle_write though.
(gdb) p callback
$15 = (Context *) 0x7f7deb04a080
(gdb) p *callback
$16 = {
  _vptr.Context = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
}
(gdb) p *(C_handle_write*)callback
$17 = {
  <EventCallback> = {
    _vptr.EventCallback = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
  },
  members of C_handle_write:
  conn = {
    px = 0x7f7de55ee800
  }
}
(gdb) info vtbl callback
vtable for 'Context' @ 0x7f7dcaf70f70 (subobject @ 0x7f7deb04a080):
[0]: 0x7f7dca864a90 <C_handle_write::do_request(int)>
[1]: 0x7f7dca863350 <C_handle_write::~C_handle_write()>
[2]: 0x7f7dca863380 <C_handle_write::~C_handle_write()>
[3]: 0x0
The "conn" member looks valid so, like the others, this appears to be just corruption of the vtable.
(gdb) p *((AsyncConnection*)0x7f7de55ee800)->write_handler
$19 = {
  _vptr.EventCallback = 0x7f7dcaf70f70 <vtable for C_handle_write+16>
}
I'm looking at ways we can instrument the binaries to get more information about these crashes.
#73 Updated by Alex Gorbachev over 5 years ago
We actually got one on July 15:

Jul 14 23:54:42 roc04r-sc3a080 kernel: [6988357.283555] safe_timer[19917]: segfault at 7a0000080000 ip 00007f5100205ccb sp 00007f50f9352f70 error 4 in libgcc_s.so.1[7f51001f7000+16000]
crash dump in https://drive.google.com/open?id=1LImiTIBjOJgPsEXYYBgTgVbUm8LHXy8K
#74 Updated by Josh Durgin over 5 years ago
- Duplicated by Bug #24935: SafeTimer? osd killed by kernel for Segmentation fault added
#75 Updated by Alex Gorbachev over 5 years ago
After the upgrade to 12.2.7 I am still seeing crashes on OSDs. Please check and advise if a separate tracker should be opened.
OSD log at: https://drive.google.com/open?id=1b_e1VFmGhYO4Y4Mj6-ommqtG5QFoPJT8
OSD crash dump at: https://drive.google.com/open?id=1VK3O6fugxOm2h4oh3mK8p2NCNuWtu6sC
#76 Updated by Brad Hubbard over 5 years ago
Thanks Alex,
I'll check out the core tomorrow and let you know.
I have been working on instrumenting the ceph-osd binary to detect this earlier and perhaps more often and we are testing with that binary now.
I was also thinking that, since the OSDHealthMetric related code only triggers when there are PENDING_CREATING_PGS, these clusters may be seeing a lot of pg creates due to workload? Perhaps there is a lot of splitting going on in these clusters or some other factors at play leading to a lot of pg creates? I'd be interested in anyone's thoughts on this aspect and the workload generating this behaviour.
#77 Updated by Roberto Valverde over 5 years ago
Hi Brad,
We spotted this issue again in one of our clusters, just 2 hours after we upgraded from 12.2.5 -> 12.2.7. Curiously, this is the first time we have seen it on this cluster since its creation (~end of January '18). It's a very small test cluster with only 5 OSDs and is not heavily used.
We spotted the same signature on the mgr just before the crash:
mgr.server send_report send_report osd,4.0x5562b44518d8 sent me an unknown health metric: P
Here is the osd log and coredump:
ceph-post-file: a13a7ebb-850f-456e-b7ec-5a35121e9ca1
I hope it helps,
Roberto
#78 Updated by Brad Hubbard over 5 years ago
Thanks Roberto,
Your core, as well as the last one uploaded by Alex, shows the now-familiar corruption of the vtable of the C_Tick_WithoutOSDLock instance (the callback pointer within SafeTimer). I think we have enough core dumps for now as they don't seem to be providing any new information. Thanks to all who provided data.
I am currently working on testing with an instrumented ceph-osd binary which should, in theory, capture the crash scenario at an earlier point and may therefore give us more information about this perplexing problem. I plan to push hard to reproduce this in our testing environment in the next few days and any feedback from the question in comment 76 may help me to do so. If I fail to reproduce in-house I can make the packages available to anyone willing to install them in an attempt to capture a coredump that may provide more clues.
#79 Updated by Dan van der Ster over 5 years ago
Brad Hubbard wrote:
I was also thinking that, since the OSDHealthMetric related code only triggers when there are PENDING_CREATING_PGS, these clusters may be seeing a lot of pg creates due to workload? Perhaps there is a lot of splitting going on in these clusters or some other factors at play leading to a lot of pg creates? I'd be interested in anyone's thoughts on this aspect and the workload generating this behaviour.
Can we see that state from the coredump somehow? Basically none of our clusters should have ongoing pg creates -- we don't have any splitting going on.
Otherwise:

- Echoing what others have said, I am quite sure the bug arrived between 12.2.2 and 12.2.4 -- we ran with 12.2.2 for several months but started seeing this crash just a few days after our first update to 12.2.4.
- The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.
- When running with debug_timer = 0/20 we can see when the timers are added and called. From this I've learned that (for all cases I've checked) the same callback address which is crashing had been used successfully several times already. e.g. here is a crash
-2> 2018-07-06 21:17:26.357784 7fab82a5c700 20 timer(0x55d9b1092190).timer_thread awake
-1> 2018-07-06 21:17:26.357793 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
 0> 2018-07-06 21:17:26.359394 7fab82a5c700 -1 *** Caught signal (Segmentation fault) ** in thread 7fab82a5c700 thread_name:safe_timer
and here are previous successful 0x55d9effa6f70 callbacks:

# zgrep 0x55d9effa6f70 /var/log/ceph/ceph-osd.380.log-20180707.gz
-7927> 2018-07-06 21:17:19.356895 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:20.356892 -> 0x55d9effa6f70
-7537> 2018-07-06 21:17:20.356966 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
-6859> 2018-07-06 21:17:22.357369 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:23.357367 -> 0x55d9effa6f70
-5149> 2018-07-06 21:17:23.357436 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
-2963> 2018-07-06 21:17:25.357724 7fab82a5c700 10 timer(0x55d9b1092190).add_event_at 2018-07-06 21:17:26.357724 -> 0x55d9effa6f70
   -1> 2018-07-06 21:17:26.357793 7fab82a5c700 10 timer(0x55d9b1092190).timer_thread executing 0x55d9effa6f70
So we know that the corruption happened sometime after 21:17:23.357436. I've posted that log (ceph-post-file: e1cdcb94-8dd9-4b5c-b665-5b6e87bd47f9) in case it's useful.
#80 Updated by Dan van der Ster over 5 years ago
Dan van der Ster wrote:
- The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.
That said, this commit (landed in 12.2.3) that sends the health check sure is close to where the C_Tick_WithoutOSDLock is added:
commit bb5ee06baac34aff59578575e14fe2a6bb6ae0ab
Author: Kefu Chai <kchai@redhat.com>
Date:   Mon Oct 23 12:29:31 2017 +0800

    osd: send health-checks to mgr
...
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 8628d0c547..76aa56d058 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5326,7 +5326,7 @@ void OSD::tick_without_osd_lock()
     }
   }
-  check_ops_in_flight();
+  mgrc.update_osd_health(get_health_metrics());
   service.kick_recovery_queue();
   tick_timer_without_osd_lock.add_event_after(OSD_TICK_INTERVAL, new C_Tick_WithoutOSDLock(this));
 }
@@ -7539,6 +7539,20 @@ void OSD::sched_scrub()
#81 Updated by Brad Hubbard over 5 years ago
Dan van der Ster wrote:
Can we see that state from the coredump somehow? Basically none of our clusters should have ongoing pg creates -- we don't have any splitting going on.
Yes, of course you're right, we can see the current state.
(gdb) p osd
$1 = (OSD *) 0x7f7dd3eaa000
(gdb) p osd->pending_creates_from_mon
$2 = 0
(gdb) p osd->pending_creates_from_osd
$3 = std::set with 0 elements
So this shows us there were no pending creates at the time we crashed, but not how many there were (if any) leading up to this point.

The reason we are speculating about pg creates is that the sole metric being collected by get_health_metrics(), and therefore the sole metric being stored in an OSDHealthMetric vector, is a count of pending creates. However, even if there are no pending creates, an OSDHealthMetric containing an entry of type osd_metric::PENDING_CREATING_PGS and value zero is created. If you are seeing pending creates you should be seeing warnings like the following in the cluster and mon logs. If you aren't seeing those, then I guess there is no relationship between pending creates and this crash.
Health check failed: 3 PGs pending on creation (PENDING_CREATING_PGS)
Otherwise:
- Echoing what others have said, I am quite sure the bug arrived between 12.2.2 and 12.2.4 -- we ran with 12.2.2 for several months but started seeing this crash just a few days after our first update to 12.2.4.
Understood.
- The OSD health metric changes sure are a juicy candidate to be the root cause -- but we shouldn't exclude the possibility that it was another commit to the osd (or common?) which is corrupting both the timer callbacks and the metrics.
Agreed, and also we can't ignore the possibility that a commit added is not itself at fault, but somehow triggered a dormant issue in the existing code. We have pored over these commits but, as yet, have failed to identify any good candidates.
- When running with debug_timer = 0/20 we can see when the timers are added and called. From this I've learned that (for all cases I've checked) the same callback address which is crashing had been used successfully several times already. e.g. here is a crash
[...]
and here are previous successful 0x55d9effa6f70 callbacks:
[...]
I agree it's the same address, but not the same instance. As your code sample in comment 80 shows, a new instance of C_Tick_WithoutOSDLock is created each time an event is scheduled (add_event_after calls add_event_at with the new pointer), and when callback->complete() is called it's actually calling complete on the Context object (C_Tick_WithoutOSDLock being derived from Context), which looks like this.
virtual void complete(int r) { finish(r); delete this; }
So with every cycle of an event being scheduled and the callback being executed, a new instance of C_Tick_WithoutOSDLock is allocated and deleted. I believe what the logs reflect is the memory allocator's propensity to hand back the previously freed memory when a new allocation of the same size is requested, as the following test program and its output show.
#include <iostream>

using namespace std;

struct dummy {
  void complete() {
    delete this;
  }
};

void add_event_at(dummy* x) {
  cout << "x = " << x << std::endl;
  x->complete();
}

int main(int argc, char **argv) {
  for (int x = 0; x < 10; x++) {
    add_event_at(new dummy());
  }
  return 0;
}

$ ./a.out
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
x = 0x7a0e70
So we know that the corruption happened sometime after 21:17:23.357436. I've posted that log (ceph-post-file: e1cdcb94-8dd9-4b5c-b665-5b6e87bd47f9) in case it's useful.
I'd suggest that we only know it happened after 21:17:26.357724 when the C_Tick_WithoutOSDLock object was created.
I agree the patch in comment 80 is suspicious, as are the others that implement the OSDHealthMetric collection, but these have been looked at by multiple eyes without raising any red flags so far. If my current tests come up empty I could offer a couple of options to anyone who wants to test with them. Option 1: an instrumented ceph-osd binary that tries to capture the issue as early as possible, in an attempt to gather more information on what the actual issue is. Option 2: a version where we back out the OSDHealthMetric changes to see whether this stops the crashes and, therefore, whether these changes are somehow related to the crashes (although not necessarily the root cause).
Thank you for the assist Dan, I appreciate the feedback and data.
#82 Updated by Alex Gorbachev over 5 years ago
The latest crash just happened here; no relevant messages in the OSD log, but a crash dump was generated and dmesg shows:
[Tue Jul 31 15:46:18 2018] safe_timer[10698]: segfault at 10000012f ip 00007fade6921ccb sp 00007faddfc75f70 error 4 in libgcc_s.so.1[7fade6913000+16000]
in ceph.log:
2018-07-31 15:00:00.000145 mon.roc-vm-sc3c234 mon.0 10.80.3.234:6789/0 78162 : cluster [INF] overall HEALTH_OK
2018-07-31 15:46:06.369353 mon.roc-vm-sc3c234 mon.0 10.80.3.234:6789/0 78328 : cluster [INF] osd.303 failed (root=sata4t,host=roc02r-sc3a040) (2 reporters from different host after 20.000141 >= grace 20.000000)
OSD log: https://drive.google.com/open?id=1uiyQXUX_9cYDBaiAfHBhjYXSMNf84AM2
Crash dump: https://drive.google.com/open?id=1DNV1N1bLXCINOEIvj-a3LDPlE0o2GUf6
#83 Updated by Brad Hubbard over 5 years ago
segfault at 10000012f

(gdb) info vtbl callback
vtable for 'Context' @ 0x55636ca8db20 (subobject @ 0x5562c1fc1e00):
[0]: 0x55636ca8acc0
[1]: 0x5562000000ba
[2]: 0x55636ca8d900
[3]: 0x10000012f  <--- *
I think we can stop uploading coredumps for now as they are not providing any more information at this stage, but thanks.
I have been unsuccessful so far in trying to reproduce this, although I have one more avenue to try and, of course, I will update here whether or not I am successful.
#84 Updated by Richard Bade over 5 years ago
Just adding another "me too" on this. I've hit it on Luminous 12.2.7 as well, under Ubuntu 16.04.4 with kernel 4.15.0-24-generic #26~16.04.1-Ubuntu.
We're getting OSDs restarting, one or two a day, often at the beginning or end of the day.
Here are some dmesg entries from one node.
[Sun Jul 29 11:17:22 2018] safe_timer[4900]: segfault at 100000010000 ip 00007f0e1860fccb sp 00007f0e11963830 error 4 in libgcc_s.so.1[7f0e18601000+16000]
[Tue Jul 31 18:11:29 2018] safe_timer[4895]: segfault at 400000010000 ip 00007fdb3ff70ccb sp 00007fdb392c4830 error 4 in libgcc_s.so.1[7fdb3ff62000+16000]
[Wed Aug 1 14:30:29 2018] safe_timer[4893]: segfault at 10000008c ip 00007fa40c586ccb sp 00007fa4058da830 error 4 in libgcc_s.so.1[7fa40c578000+16000]
And some entries from syslog
2018-08-01 14:30:20.000 cstor01.XXXX systemd[1]: ceph-osd@130.service: Failed with result 'signal'.
2018-08-01 14:30:20.000 cstor01.XXXX ceph-osd[4819]: *** Caught signal (Segmentation fault) **
2018-08-01 14:30:20.000 cstor01.XXXX systemd[1]: ceph-osd@130.service: Main process exited, code=killed, status=11/SEGV
2018-08-01 14:30:20.000 cstor01.XXXX ceph-osd[4819]: in thread 7fa4058de700 thread_name:safe_timer
2018-08-01 14:30:20.000 cstor01.XXXX kernel: [1202527.072835] safe_timer[4893]: segfault at 10000008c ip 00007fa40c586ccb sp 00007fa4058da830 error 4 in libgcc_s.so.1[7fa40c578000+16000]
2018-08-01 14:30:20.000 cstor101.XXX ceph-mgr: 2018-08-01 14:30:20.716863 7fe767e2a700 1 mgr send_beacon standby
2018-08-01 14:30:20.000 cstor01.XXXX systemd[1]: ceph-osd@130.service: Unit entered failed state.
Please let me know if I can provide any further info to help track this down.
#85 Updated by Kjetil Joergensen over 5 years ago
For MgrClient::update_osd_health, does the move-assignment compile into updating a pointer to a std::vector, or does it compile into copying the internal fields from one vector into the other? If it's the latter, MgrClient::_send_report() could in theory read a half-constructed std::vector.
// src/mgr/MgrClient.h
class MgrClient : public Dispatcher
{
// ....
std::vector<OSDHealthMetric> osd_health_metrics;
// ....
}
// src/mgr/MgrClient.cc
void MgrClient::update_osd_health(std::vector<OSDHealthMetric>&& metrics)
{
osd_health_metrics = std::move(metrics);
}
This also seems to fit with the introduction of the first use of MgrClient::update_osd_health, as previously MgrClient::osd_health_metrics would have remained unchanged. It would tentatively be hard to reproduce, given the fairly specific timing required between threads, one of which is scheduled only once every OSD::OSD_TICK_INTERVAL of 1s.
#86 Updated by Brad Hubbard over 5 years ago
Thanks Kjetil,
I think you are right; we should hold the lock in update_osd_health(). Not sure how we all missed that, especially me. I'm working on a test package now for anyone who's interested and, of course, will get a fix into luminous, mimic, and master since all are affected.
If you give me your preferred email address I'll make sure you get credited for the fix Kjetil.
#87 Updated by Kjetil Joergensen over 5 years ago
Brad - you can just use kjetil@medallia.com
#88 Updated by Brad Hubbard over 5 years ago
- Status changed from 12 to In Progress
- Backport changed from luminous to mimic, luminous
#89 Updated by Brad Hubbard over 5 years ago
I've created a test package here based on 12.2.7 and including the one line patch above.
I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.
#90 Updated by Kefu Chai over 5 years ago
- Status changed from In Progress to Pending Backport
#91 Updated by Brad Hubbard over 5 years ago
- Copied to Backport #26870: mimic: osd: segfaults under normal operation added
#92 Updated by Brad Hubbard over 5 years ago
- Copied to Backport #26871: luminous: osd: segfaults under normal operation added
#93 Updated by Yuri Weinstein over 5 years ago
#94 Updated by lin zhou over 5 years ago
Brad Hubbard wrote:
I've created a test package here based on 12.2.7 and including the one line patch above.
I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.
If I want to compile this for Debian, should I replace all the daemons, or just the OSD?
#95 Updated by Brad Hubbard over 5 years ago
- Status changed from Pending Backport to Resolved
The patch is only relevant to the osds.
#96 Updated by Phat Le Ton over 5 years ago
Brad Hubbard wrote:
I've created a test package here based on 12.2.7 and including the one line patch above.
I'm running it through our test environment now but, due to heavy load, this may take a while to complete. If you test with this please report your results.
Hi Brad Hubbard,
I've just seen the 12.2.8 release. Was your patch included in this release?
#97 Updated by Nathan Cutler over 5 years ago
Phat Le Ton wrote:
I've just seen 12.2.8 release, Was your patch included in this release ?
Yes. See https://tracker.ceph.com/issues/26871 (target version is v12.2.8, which means the patch is in that version).
#98 Updated by Josh Durgin about 5 years ago
- Duplicated by Bug #37978: osd killed by kernel for Segmentation fault added