Bug #42117

MDS: daemon and cephfs-data-scan dump core on (probably) damaged omap entry

Added by Jan Fajerski over 4 years ago. Updated about 4 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was observed with ceph-12.2.10, but as far as I can tell the code path has not changed since.

The root cause has not been established, but it seems that the MDS daemons were OOM-killed over and over again until an object in the metadata pool ended up with an omap entry that has an empty key (rados listomapvals output):


value (9 bytes) :
00000000  61 6e 79 74 68 69 6e 67  0a                       |anything.|
00000009

...

This hits an assertion in dentry_key_t::decode_helper (sn=<synthetic pointer>..., nm=..., key=...) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/mdstypes.h:1130 (see also attached log). Both the MDS and cephfs-data-scan use this.

I'd be happy to provide a fix for this (against master), but I am not sure whether this case should be handled and, if so, how. Is it valid to just skip over the omap entry with an empty key?
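
To make the "skip the entry" option concrete, here is a minimal standalone sketch. It is not the Ceph code. It assumes dentry omap keys have the form `<name>_head` or `<name>_<hex snapid>`, which is consistent with the `i != string::npos` assertion in the trace but is not spelled out in this report. `try_decode_dentry_key`, `decode_all` and `kHeadSnap` are hypothetical names, and `kHeadSnap` is only a stand-in for the real head snapid constant.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Stand-in for the snapid that marks the "head" version (assumed value).
constexpr std::uint64_t kHeadSnap = static_cast<std::uint64_t>(-2);

// Hypothetical non-asserting decode: returns false when the key has no
// '_' separator, which includes the empty key seen in the damaged omap.
bool try_decode_dentry_key(std::string_view key, std::string& name,
                           std::uint64_t& snap) {
  const std::size_t i = key.find_last_of('_');
  if (i == std::string_view::npos) {
    return false;  // the reported crash is an assert on this condition
  }
  const std::string suffix(key.substr(i + 1));
  if (suffix == "head") {
    snap = kHeadSnap;
  } else {
    snap = std::strtoull(suffix.c_str(), nullptr, 16);  // hex snapid
  }
  name = std::string(key.substr(0, i));
  return true;
}

// Decode every key, skipping the ones that cannot be parsed and
// counting them so the caller can report possible damage.
std::vector<std::pair<std::string, std::uint64_t>> decode_all(
    const std::vector<std::string>& keys, std::size_t& skipped) {
  std::vector<std::pair<std::string, std::uint64_t>> out;
  skipped = 0;
  for (const auto& k : keys) {
    std::string name;
    std::uint64_t snap = 0;
    if (try_decode_dentry_key(k, name, snap)) {
      out.emplace_back(std::move(name), snap);
    } else {
      ++skipped;
    }
  }
  return out;
}
```

Whether silently skipping is acceptable, or whether the MDS should instead flag the directory fragment as damaged, is exactly the open question above; the counter in `decode_all` is only there to show that a skipped entry does not have to go unnoticed.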

History

#1 Updated by Patrick Donnelly over 4 years ago

  • Subject changed from MDS daemon and cephfs-data-scan dump core on (probably) damaged omap entry to MDS: daemon and cephfs-data-scan dump core on (probably) damaged omap entry
  • Status changed from New to Need More Info
  • Assignee set to Jan Fajerski
  • Target version set to v15.0.0
  • Source set to Development

Do you have any core dumps available?

#2 Updated by Jan Fajerski over 4 years ago

Here's a trace of the MDS:

Core was generated by `/usr/bin/ceph-mds -f --cluster ceph --id cigw1 --setuser ceph --setgroup ceph'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f2d259f2adb in raise () from ./lib64/libpthread.so.0
[Current thread is 1 (LWP 8442)]
#bt
#0  0x00007f2d259f2adb in raise () from ./lib64/libpthread.so.0
#1  0x0000560766dd9597 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f2d249a7f67 in raise () from ./lib64/libc.so.6
#5  0x00007f2d249a933a in abort () from ./lib64/libc.so.6
#6  0x0000560766e187c0 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5607671ff12f "i != string::npos", file=file@entry=0x5607671d8060 "/home/abuild/rpmbuild/BUILD/ceph-12.2.10-543-gfc6f0c7299/src/mds/mdstypes.h", 
    line=line@entry=1130, 
    func=func@entry=0x5607671ffe80 <dentry_key_t::decode_helper(boost::basic_string_view<char, std::char_traits<char> >, std::string&, snapid_t&)::__PRETTY_FUNCTION__> "static void dentry_key_t::decode_helper(boost::string_view, std::string&, snapid_t&)") at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/common/assert.cc:66
#7  0x0000560766cef83a in dentry_key_t::decode_helper (sn=<synthetic pointer>..., nm=..., key=...) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/mdstypes.h:1130
#8  CDir::_omap_fetched (this=0x5607c2696700, hdrbl=..., omap=..., complete=true, r=r@entry=0) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/CDir.cc:1948
#9  0x0000560766cf9b38 in C_IO_Dir_OMAP_Fetched::finish (this=0x5607c4380480, r=0) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/CDir.cc:1598
#10 0x0000560766d662f1 in Context::complete (r=0, this=0x5607c4380480) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/include/Context.h:70
#11 MDSIOContextBase::complete (this=0x5607c4380480, r=0) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/MDSContext.cc:116
#12 0x0000560766e17618 in Finisher::finisher_thread_entry (this=0x560771c2e160) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/common/Finisher.cc:72
#13 0x00007f2d259ea724 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007f2d24a5fe8d in clone () from ./lib64/libc.so.6

#3 Updated by Jan Fajerski over 4 years ago

And trace for cephfs-data-scan:

Core was generated by `cephfs-data-scan scan_links'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f761d791adb in raise () from ./lib64/libpthread.so.0
[Current thread is 1 (LWP 1194841)]
#bt
#0  0x00007f761d791adb in raise () from ./lib64/libpthread.so.0
#1  0x000055d029170107 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f761c765f67 in raise () from ./lib64/libc.so.6
#5  0x00007f761c76733a in abort () from ./lib64/libc.so.6
#6  0x00007f761dc4b0a0 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from ./usr/lib64/ceph/libceph-common.so.0
#7  0x000055d028e1ca53 in dentry_key_t::decode_helper (sn=<synthetic pointer>..., nm=..., key=...) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/mds/mdstypes.h:1130
#8  DataScan::scan_links (this=this@entry=0x7ffdff123c20) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/tools/cephfs/DataScan.cc:961
#9  0x000055d028e0a8ff in DataScan::main (this=0x7ffdff123c20, args=...) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/tools/cephfs/DataScan.cc:311
#10 0x000055d028e09777 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/ceph-12.2.10-543-gfc6f0c7299/src/tools/cephfs/cephfs-data-scan.cc:37

#4 Updated by Patrick Donnelly over 4 years ago

I took a glance at the code. I don't see how that could happen even with an OOM situation.

#5 Updated by Patrick Donnelly about 4 years ago

  • Target version deleted (v15.0.0)
