Bug #40200


luminous: mds: fails assert(session->get_nref() == 1) when balancing

Added by Dan van der Ster almost 5 years ago. Updated over 4 years ago.

Status:
Rejected
Priority:
Normal
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've seen this assertion failure twice after upgrading the MDSs from v12.2.11 to v12.2.12, and due to #40190 it can be disruptive.
We have 5 active MDSs and make active use of mds pinning. The crash happens when a new (non-pinned) Manila share is created (shortly afterwards, the directory is exported).

Here is mds.4 crashing when /volumes is being exported to it:

2019-06-05 19:54:35.531836 7fda83009700  2 mds.4.cache Memory usage:  total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32597 / 124263 inodes have caps, 43703 caps, 0.351698 caps per inode
2019-06-05 19:54:38.988369 7fda8580e700  5 mds.4.bal mds.4 epoch 88 load mdsload<[7.70303,5.59759 18.8982]/[7.70303,5.59759 18.8982], req 7.32644, hr 0, qlen 0, cpu 0.929972>
2019-06-05 19:54:38.988400 7fda8580e700  5 mds.4.bal   import_map from 0 -> 18.8955
2019-06-05 19:54:38.990163 7fda8778d700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.13.144:6800/2743440898 conn(0x55b1cd098800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 64 vs existing csq=64 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.990302 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.52.159:6800/3470607014 conn(0x55b1cf76f000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 78 vs existing csq=78 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.992253 7fda8580e700  5 mds.4.bal  prep_rebalance: cluster loads are
2019-06-05 19:54:38.992268 7fda8580e700  5 mds.4.bal prep_rebalance:  my load 18.8982   target 4254.38   total 21271.9
2019-06-05 19:54:38.992275 7fda8580e700  5 mds.4.bal  mds.4 is underloaded or barely overloaded.
2019-06-05 19:54:38.992277 7fda8580e700  5 mds.4.bal  mds.3 is underloaded or barely overloaded.
2019-06-05 19:54:38.992279 7fda8580e700  5 mds.4.bal  mds.1 is underloaded or barely overloaded.
2019-06-05 19:54:38.992280 7fda8580e700  5 mds.4.bal   i am underloaded or barely overloaded, doing nothing.
2019-06-05 19:54:40.531866 7fda83009700  2 mds.4.cache Memory usage:  total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32598 / 124264 inodes have caps, 43704 caps, 0.351703 caps per inode
2019-06-05 19:54:44.239730 7fda8580e700  1 mds.cephflax-mds-f5327d21dd Updating MDS map to version 198343 from mon.2
2019-06-05 19:54:44.765660 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 188.184.94.182:0/1391033891 conn(0x55b1c1d7d800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=12303575 cs=41 l=0).process missed message?  skipped from seq 0 to 192910665
2019-06-05 19:54:44.769521 7fda8580e700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: In function 'Session* MDSRank::get_session(Message*)' thread 7fda8580e700 time 2019-06-05 19:54:44.765758
/builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1)

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b1af348590]
 2: (MDSRank::get_session(Message*)+0x7af) [0x55b1af00760f]
 3: (Server::handle_client_session(MClientSession*)+0x32) [0x55b1af060ba2]
 4: (Server::dispatch(Message*)+0x5cb) [0x55b1af0932bb]
 5: (MDSRank::handle_deferrable_message(Message*)+0x804) [0x55b1aefff044]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55b1af016d73]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0xa0) [0x55b1af017d40]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55b1aeff5e63]
 9: (DispatchQueue::entry()+0x792) [0x55b1af64c9e2]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b1af3d038d]
 11: (()+0x7dd5) [0x7fda8a644dd5]
 12: (clone()+0x6d) [0x7fda89721ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Here is the log of the MDS rank which is exporting to mds.4:

p05153026289093.cern.ch: 2019-06-05 19:54:38.990760 7fa3f1b2c700  0 mds.0.migrator nicely exporting to mds.4 [dir 0x1000023defd /volumes/ [2,head] auth{1=1,2=1,3=1,4=1} pv=124082107 v=124082104 cv=124068891/124068891 REP ap=2+4+5 state=1610612738|complete f(v8031 m2019-06-05 19:34:30.424858 198=196+2) n(v7676403 rc2019-06-06 04:13:08.000000 b82443486820215 67691330=61370142+6321188) hs=198+5,ss=0+0 dirty=7 | child=1 frozen=0 subtree=0 importing=0 replicated=1 dirty=1 waiter=0 authpin=1 tempexporting=0 0x55d4540d9800]


Related issues (1 total: 0 open, 1 closed)

Related to CephFS - Backport #37739: luminous: extend reconnect period when mds is busy (Resolved, Zheng Yan)