Bug #40200

luminous: mds: fails assert(session->get_nref() == 1) when balancing

Added by Dan van der Ster almost 5 years ago. Updated over 4 years ago.

Status:
Rejected
Priority:
Normal
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've seen this assertion fail twice after upgrading the MDSs from v12.2.11 to v12.2.12, and due to #40190 it can be disruptive.
We have 5 active MDSs and use directory pinning actively. The crash happens when a new (non-pinned) Manila share is created and the dir is exported shortly afterwards.

Here is mds.4 crashing when /volumes is being exported to it:

2019-06-05 19:54:35.531836 7fda83009700  2 mds.4.cache Memory usage:  total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32597 / 124263 inodes have caps, 43703 caps, 0.351698 caps per inode
2019-06-05 19:54:38.988369 7fda8580e700  5 mds.4.bal mds.4 epoch 88 load mdsload<[7.70303,5.59759 18.8982]/[7.70303,5.59759 18.8982], req 7.32644, hr 0, qlen 0, cpu 0.929972>
2019-06-05 19:54:38.988400 7fda8580e700  5 mds.4.bal   import_map from 0 -> 18.8955
2019-06-05 19:54:38.990163 7fda8778d700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.13.144:6800/2743440898 conn(0x55b1cd098800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 64 vs existing csq=64 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.990302 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.52.159:6800/3470607014 conn(0x55b1cf76f000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 78 vs existing csq=78 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.992253 7fda8580e700  5 mds.4.bal  prep_rebalance: cluster loads are
2019-06-05 19:54:38.992268 7fda8580e700  5 mds.4.bal prep_rebalance:  my load 18.8982   target 4254.38   total 21271.9
2019-06-05 19:54:38.992275 7fda8580e700  5 mds.4.bal  mds.4 is underloaded or barely overloaded.
2019-06-05 19:54:38.992277 7fda8580e700  5 mds.4.bal  mds.3 is underloaded or barely overloaded.
2019-06-05 19:54:38.992279 7fda8580e700  5 mds.4.bal  mds.1 is underloaded or barely overloaded.
2019-06-05 19:54:38.992280 7fda8580e700  5 mds.4.bal   i am underloaded or barely overloaded, doing nothing.
2019-06-05 19:54:40.531866 7fda83009700  2 mds.4.cache Memory usage:  total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32598 / 124264 inodes have caps, 43704 caps, 0.351703 caps per inode
2019-06-05 19:54:44.239730 7fda8580e700  1 mds.cephflax-mds-f5327d21dd Updating MDS map to version 198343 from mon.2
2019-06-05 19:54:44.765660 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 188.184.94.182:0/1391033891 conn(0x55b1c1d7d800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=12303575 cs=41 l=0).process missed message?  skipped from seq 0 to 192910665
2019-06-05 19:54:44.769521 7fda8580e700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: In function 'Session* MDSRank::get_session(Message*)' thread 7fda8580e700 time 2019-06-05 19:54:44.765758
/builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1)

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b1af348590]
 2: (MDSRank::get_session(Message*)+0x7af) [0x55b1af00760f]
 3: (Server::handle_client_session(MClientSession*)+0x32) [0x55b1af060ba2]
 4: (Server::dispatch(Message*)+0x5cb) [0x55b1af0932bb]
 5: (MDSRank::handle_deferrable_message(Message*)+0x804) [0x55b1aefff044]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55b1af016d73]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0xa0) [0x55b1af017d40]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55b1aeff5e63]
 9: (DispatchQueue::entry()+0x792) [0x55b1af64c9e2]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b1af3d038d]
 11: (()+0x7dd5) [0x7fda8a644dd5]
 12: (clone()+0x6d) [0x7fda89721ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Here is the log of the MDS rank which is exporting to mds.4:

p05153026289093.cern.ch: 2019-06-05 19:54:38.990760 7fa3f1b2c700  0 mds.0.migrator nicely exporting to mds.4 [dir 0x1000023defd /volumes/ [2,head] auth{1=1,2=1,3=1,4=1} pv=124082107 v=124082104 cv=124068891/124068891 REP ap=2+4+5 state=1610612738|complete f(v8031 m2019-06-05 19:34:30.424858 198=196+2) n(v7676403 rc2019-06-06 04:13:08.000000 b82443486820215 67691330=61370142+6321188) hs=198+5,ss=0+0 dirty=7 | child=1 frozen=0 subtree=0 importing=0 replicated=1 dirty=1 waiter=0 authpin=1 tempexporting=0 0x55d4540d9800]


Related issues

Related to CephFS - Backport #37739: luminous: extend reconnect period when mds is busy Resolved

History

#1 Updated by Dan van der Ster almost 5 years ago

We have seen 3 identical crashes so far. (Logs of the crashed MDSs are at ceph-post-file: a74beec8-0a68-44c1-bfc5-56da4872414f)

The crash is always triggered by an export of /volumes/ to an MDS, which then crashes in MDSRank::get_session for the client at 188.184.94.182. This IP is our OpenStack Manila controller (which creates/deletes a test share once per hour but is otherwise pretty idle).

Here's an example export which triggered a crash today:

2019-06-09 20:34:08.382444 7fa3f1b2c700  0 mds.0.migrator nicely exporting to mds.2 [dir 0x1000023defd /volumes/ [2,head] auth{1=3,2=3,3=1,4=2} pv=124817319 v=124817316 cv=124814979/124814979 REP ap=2+1+1 state=1610612738|complete f(v8364 m2019-06-09 20:34:02.721217 201=199+2) n(v7712004 rc2029-07-04 21:56:43.000000 b84032403688154 74034480=65652517+8381963) hs=201+4,ss=0+0 dirty=6 | dnwaiter=0 child=1 frozen=0 subtree=0 importing=0 replicated=1 dirty=1 waiter=0 authpin=1 tempexporting=0 0x55d4540d9800]

and the crash this time on mds.2:

    -5> 2019-06-09 20:34:09.111916 7f664e49d700  0 -- 137.138.149.200:6800/331189051 >> 188.184.94.182:0/3076084349 conn(0x55dd38659000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 116), sending RESETSESSION
    -4> 2019-06-09 20:34:10.438319 7f6648516700 10 monclient: _send_mon_message to mon.cephflax-mon-9b406e0261 at 137.138.121.135:6789/0
    -3> 2019-06-09 20:34:10.646909 7f6648d17700  2 mds.2.cache Memory usage:  total 17978216, rss 17503916, heap 307232, baseline 307232, buffers 0, 84434 / 5372784 inodes have caps, 85166 caps, 0.0158514 caps per inode
    -2> 2019-06-09 20:34:13.916519 7f664b51c700  1 mds.cephflax-mds-4b5d89fe52 Updating MDS map to version 198555 from mon.0
    -1> 2019-06-09 20:34:14.001817 7f664e49d700  0 -- 137.138.149.200:6800/331189051 >> 188.184.94.182:0/3076084349 conn(0x55dd38659000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=419012 cs=1 l=0).process missed message?  skipped from seq 0 to 2131513044
     0> 2019-06-09 20:34:14.005921 7f664b51c700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: In function 'Session* MDSRank::get_session(Message*)' thread 7f664b51c700 time 2019-06-09 20:34:14.001930
/builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1)

With gdb I found that session->get_nref() = 2:

(gdb) p session->nref
value has been optimized out
(gdb) p ((Session *)(*(Connection *)(m->connection)).priv).nref
$67 = {<std::__atomic_base<unsigned long>> = {_M_i = 2}, <No data fields>}

Here's more of that session in case it helps:

(gdb) p *((Session *)(*(Connection *)(m->connection)).priv)
$69 = {<RefCountedObject> = {_vptr.RefCountedObject = 0x55b1afb21c50 <vtable for Session+16>, 
    nref = {<std::__atomic_base<unsigned long>> = {_M_i = 2}, <No data fields>}, cct = 0x0}, 
  state = 0, state_seq = 0, importing_count = 0, human_name = "", 
  projected = std::deque with 0 elements, load_avg = {val = 0, delta = 0, vel = 0, last_decay = {
      tv = {tv_sec = 0, tv_nsec = 0}}, rate = {k = 0}}, load_avg_rate = {k = 0}, recall_caps = {
    val = 0, delta = 0, vel = 0, last_decay = {tv = {tv_sec = 1559757242, tv_nsec = 260002304}}, 
    rate = {k = -0.011552453009332421}}, release_caps = {val = 0, delta = 0, vel = 0, last_decay = {
      tv = {tv_sec = 1559757242, tv_nsec = 260003346}}, rate = {k = -0.011552453009332421}}, 
  recall_caps_throttle = {val = 0, delta = 0, vel = 0, last_decay = {tv = {tv_sec = 1559757242, 
        tv_nsec = 260004647}}, rate = {k = -0.2772588722239781}}, recall_limit = 0, birth_time = {
    __d = {__r = 1762031282356192}}, info = {inst = {name = {_type = 8 '\b', _num = 76672649, 
...

Looking at `MDSRank::get_session`, I wonder if it's possible that this is the only client that triggers the `if (session->is_closed())` codepath?
If so, this would be the only place we ever check that nref is 1 on a session (so maybe it is never 1 for any session??)
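
To make that refcount reasoning concrete, here is a toy standalone sketch (simplified types only, not the actual Ceph Session/Connection code): get_priv() hands out an extra reference, and a dispatcher that never calls put() leaves nref at 2, which is exactly what an assert(nref == 1) on the closed-session path would then trip over.

#include <atomic>
#include <cassert>

// Stand-in for Session's RefCountedObject behaviour, nothing more.
struct ToySession {
  // starts at 1: the reference held by the connection's priv pointer
  std::atomic<unsigned long> nref{1};

  ToySession* get() { ++nref; return this; }   // what get_priv() effectively does
  void put() { --nref; }                       // the call the proposed fix adds
  unsigned long get_nref() const { return nref.load(); }
};

int main() {
  ToySession session;

  // ms_dispatch without the fix: takes a reference and keeps it.
  ToySession* from_priv = session.get();       // nref == 2 now

  // Later, the closed-session path expects the connection priv to be the
  // only remaining holder:
  //   assert(session.get_nref() == 1);        // would fail here (nref == 2)

  // With the extra put() the dispatcher drops its reference right away:
  from_priv->put();                            // nref back to 1
  assert(session.get_nref() == 1);             // the check now passes
  return 0;
}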

I was looking at commit 5dd000eb44e39567657a346451eec6adba3efb63 and I'm wondering if that backport is missing a `session->put()` in MDSRankDispatcher::ms_dispatch?? (Every other get_priv of a Session in the luminous mds code does not carry the ref.)
So maybe the correct (luminous-only) fix would be:

diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 89b6e97b06..36a3e9367e 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -952,8 +952,10 @@ bool MDSRankDispatcher::ms_dispatch(Message *m)
 {
   if (m->get_source().is_client()) {
     Session *session = static_cast<Session*>(m->get_connection()->get_priv());
-    if (session)
+    if (session) {
+      session->put(); // do not carry ref
       session->last_seen = Session::clock::now();
+    }
   }

   inc_dispatch_depth();

For now, we're running with mds_bal_interval = 0 to disable the balancer. But of course it would be good to hear whether the above should fix it; if so, we'll compile and try it out. (But I want to be cautious, as this is happening on our prod cluster.)
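
For reference, the way we apply that workaround is just the usual config option; a sketch, assuming a standard ceph.conf deployment:

[mds]
# disable the MDS balancer so no automatic subtree exports are triggered
mds_bal_interval = 0

It can also be changed on a running daemon through the admin socket (ceph daemon mds.<name> config set mds_bal_interval 0), though that does not persist across a restart.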

#2 Updated by Zheng Yan almost 5 years ago

  • Target version set to v12.2.13

Your patch should fix the issue. Thanks for tracking it down. Could you please create a PR?

#3 Updated by Patrick Donnelly almost 5 years ago

  • Related to Backport #37739: luminous: extend reconnect period when mds is busy added

#4 Updated by Patrick Donnelly almost 5 years ago

  • Subject changed from v12.2.12 MDS does fails assert(session->get_nref() == 1) when balancing to luminous: mds: does fails assert(session->get_nref() == 1) when balancing
  • Status changed from New to Fix Under Review
  • Assignee set to Dan van der Ster
  • Start date deleted (06/07/2019)
  • Pull request ID set to 28467
  • Affected Versions v12.2.11 added
  • Affected Versions deleted (v12.2.12)

#5 Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Rejected

https://github.com/ceph/ceph/pull/28467#issuecomment-541079181

Thanks for the PR anyway, Dan. Luminous QA didn't pick it up in time.
