Bug #38203 (closed)

ceph-mds segfault during migrator nicely exporting

Added by Dan van der Ster about 5 years ago. Updated over 4 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had a v12.2.10 MDS crash during metadata balancing:

   -22> 2019-02-06 03:49:51.932215 7f7be2bc9700  0 mds.3.migrator nicely exporting to mds.0 [dir 0x500006410dd /volumes/_nogroup/69d07521-b740-46d1-a545-7057160a8fb2/user/atlact1/www/jobs/2019-02-04/ [2,head] auth{0=1,1=1,2=1,4=1} v=744446 cv=744381/744381 REP dir_auth=3 state=1610612738|complete f(v1 m2019-02-05 10:16:58.337992 57=0+57) n(v9947 rc2019-02-06 03:49:44.347839 b110305046691 180971=180914+57)/n(v9947 rc2019-02-06 03:49:37.147924 b110298302472 180966=180909+57) hs=57+0,ss=0+0 dirty=1 | dnwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 dirty=1 waiter=0 authpin=0 0x561f5f294e00]
   -21> 2019-02-06 03:49:52.583911 7f7be63d0700  2 mds.3.cache check_memory_usage total 14002256, rss 11375468, heap 315424, baseline 315424, buffers 0, 55325 / 2141345 inodes have caps, 83015 caps, 0.0387677 caps per inode
   -20> 2019-02-06 03:49:54.692285 7f7be5bcf700 10 monclient: _send_mon_message to mon.cephflax-mon-9b406e0261 at 137.138.121.135:6789/0
   -19> 2019-02-06 03:49:56.929179 7f7be7bd3700 10 monclient: tick
   -18> 2019-02-06 03:49:56.929232 7f7be7bd3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-02-06 03:49:26.929206)
   -17> 2019-02-06 03:49:57.635358 7f7be63d0700  2 mds.3.cache check_memory_usage total 14002256, rss 11375172, heap 315424, baseline 315424, buffers 0, 55334 / 2133234 inodes have caps, 83029 caps, 0.0389217 caps per inode
   -16> 2019-02-06 03:49:58.692466 7f7be5bcf700 10 monclient: _send_mon_message to mon.cephflax-mon-9b406e0261 at 137.138.121.135:6789/0
   -15> 2019-02-06 03:50:01.357112 7f7be2bc9700  0 mds.3.migrator nicely exporting to mds.0 [dir 0x100093748b9.101* /volumes/_nogroup/f51e1a94-b1ff-4f2c-9686-c7e9835e0f6c/work/cli-build/ [2,head] auth{0=1} v=437725 cv=437725/437725 REP dir_auth=3 state=1073741826|complete f(v48 m2019-02-05 15:22:45.539666 8788=0+8788) n(v5346 rc2019-02-05 15:23:10.481882 b105348841048 24087=15299+8788) hs=8788+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 tempexporting=0 0x561f77cfa000]
   -14> 2019-02-06 03:50:01.523391 7f7be8bd5700  0 mds.3.migrator nicely exporting to mds.0 [dir 0x100093748b9.010* /volumes/_nogroup/f51e1a94-b1ff-4f2c-9686-c7e9835e0f6c/work/cli-build/ [2,head] auth{0=1,2=1} v=426986 cv=426986/426986 REP dir_auth=3 state=1073741826|complete f(v48 m2019-02-06 00:38:56.760195 8769=0+8769) n(v5346 rc2019-02-06 00:39:12.952094 b100212238080 24118=15349+8769) hs=8769+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 0x561f875d1800]
   -13> 2019-02-06 03:50:02.190203 7f7be93d6700  5 asok(0x561da983e1c0) AdminSocket: request 'get_command_descriptions' '' to 0x561da98201e0 returned 4774 bytes
   -12> 2019-02-06 03:50:02.208436 7f7be93d6700  1 do_command 'perf dump' '
   -11> 2019-02-06 03:50:02.208745 7f7be93d6700  1 do_command 'perf dump' 'result is 13156 bytes
   -10> 2019-02-06 03:50:02.208755 7f7be93d6700  5 asok(0x561da983e1c0) AdminSocket: request 'perf dump' '' to 0x561da9820040 returned 13156 bytes
    -9> 2019-02-06 03:50:02.209381 7f7be93d6700  5 asok(0x561da983e1c0) AdminSocket: request 'get_command_descriptions' '' to 0x561da98201e0 returned 4774 bytes
    -8> 2019-02-06 03:50:02.227158 7f7be93d6700  1 mds.cephflax-mds-2a4cfd0e2c asok_command: session ls (starting...)
    -7> 2019-02-06 03:50:02.234906 7f7be93d6700  1 mds.cephflax-mds-2a4cfd0e2c asok_command: session ls (complete)
    -6> 2019-02-06 03:50:02.236787 7f7be93d6700  5 asok(0x561da983e1c0) AdminSocket: request 'session ls' '' to 0x561da9820220 returned 420758 bytes
    -5> 2019-02-06 03:50:02.302902 7f7beb355700  0 -- 137.138.52.159:6800/1556042223 >> 10.32.3.223:0/1782848130 conn(0x561dd03b7800 :6800 s=STATE_OPEN pgs=1008711 cs=1 l=0).process bad tag 50
    -4> 2019-02-06 03:50:02.569959 7f7beb355700  0 -- 137.138.52.159:6800/1556042223 >> 10.32.3.223:0/1782848130 conn(0x561dbecb3800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
    -3> 2019-02-06 03:50:02.641677 7f7be63d0700  2 mds.3.cache check_memory_usage total 14002256, rss 11374236, heap 315424, baseline 315424, buffers 0, 55459 / 2107149 inodes have caps, 83164 caps, 0.0394675 caps per inode
    -2> 2019-02-06 03:50:02.692685 7f7be5bcf700 10 monclient: _send_mon_message to mon.cephflax-mon-9b406e0261 at 137.138.121.135:6789/0
    -1> 2019-02-06 03:50:04.494387 7f7beb355700  0 -- 137.138.52.159:6800/1556042223 >> 188.184.64.209:0/3134808329 conn(0x561dd0419000 :6800 s=STATE_OPEN pgs=1128257 cs=1 l=0).process bad tag 50
     0> 2019-02-06 03:50:05.914673 7f7be8bd5700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7be8bd5700 thread_name:ms_dispatch

 ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)
 1: (()+0x5ca6c1) [0x561d9f4036c1]
 2: (()+0xf5d0) [0x7f7beda135d0]
 3: (tcmalloc::CentralFreeList::ReleaseListToSpans(void*)+0x13) [0x7f7bef2c0cb3]
 4: (tcmalloc::CentralFreeList::ShrinkCache(int, bool)+0x124) [0x7f7bef2c0df4]
 5: (tcmalloc::CentralFreeList::MakeCacheSpace()+0x5d) [0x7f7bef2c0f1d]
 6: (tcmalloc::CentralFreeList::InsertRange(void*, void*, int)+0x68) [0x7f7bef2c0f98]
 7: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)+0xb8) [0x7f7bef2c4408]
 8: (tcmalloc::ThreadCache::Scavenge()+0x48) [0x7f7bef2c4798]
 9: (MDRequestImpl::~MDRequestImpl()+0x224) [0x561d9f1b5b64]
 10: (MDRequestImpl::~MDRequestImpl()+0x9) [0x561d9f1b5ce9]
 11: (OpTracker::unregister_inflight_op(TrackedOp*)+0xf2) [0x561d9f3a88e2]
 12: (Migrator::handle_export_discover_ack(MExportDirDiscoverAck*)+0x2e6) [0x561d9f2c9196]
 13: (Migrator::dispatch(Message*)+0xbb) [0x561d9f2d370b]
 14: (MDSRank::handle_deferrable_message(Message*)+0x613) [0x561d9f1079d3]
 15: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x561d9f115e13]
 16: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x561d9f116d55]
 17: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x561d9f0ff093]
 18: (DispatchQueue::entry()+0x792) [0x561d9f73ba22]
 19: (DispatchQueue::DispatchThread::entry()+0xd) [0x561d9f4cc9cd]
 20: (()+0x7dd5) [0x7f7beda0bdd5]
 21: (clone()+0x6d) [0x7f7becae8ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Log is in ceph-post-file: 3933388c-0fe6-4095-a122-11d4a4c4c6e8

A core dump is available, but it's 14 GB, so it might be more convenient if you tell me what to grab from gdb.
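For reference, a minimal gdb session against the core would be something like the following (a sketch only; it assumes the matching ceph-debuginfo for 12.2.10 is installed, and the core file path and frame number are illustrative):

    # open the core with the exact ceph-mds binary that produced it
    gdb -q /usr/bin/ceph-mds /path/to/core.ceph-mds

    (gdb) set pagination off
    (gdb) bt                      # backtrace of the crashing thread (ms_dispatch)
    (gdb) bt full                 # same, with locals for each frame
    (gdb) thread apply all bt     # backtraces of all threads
    (gdb) frame 9                 # illustrative: select the MDRequestImpl::~MDRequestImpl
                                  # / Migrator::handle_export_discover_ack frame from bt
    (gdb) info locals
    (gdb) print *this             # dump the MDRequestImpl being destroyed

Happy to run any of these (or anything else) and paste the output here.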
