Bug #40200
luminous: mds: fails assert(session->get_nref() == 1) when balancing
Description
We've seen this assertion twice after upgrading MDSs from v12.2.11 to v12.2.12, and due to #40190 it can be disruptive.
We have 5 active MDSs and actively use mds pinning. The crash happens when a new (non-pinned) Manila share is created (shortly afterwards the dir is exported).
Here is mds.4 crashing when /volumes is being exported to it:
2019-06-05 19:54:35.531836 7fda83009700  2 mds.4.cache Memory usage: total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32597 / 124263 inodes have caps, 43703 caps, 0.351698 caps per inode
2019-06-05 19:54:38.988369 7fda8580e700  5 mds.4.bal mds.4 epoch 88 load mdsload<[7.70303,5.59759 18.8982]/[7.70303,5.59759 18.8982], req 7.32644, hr 0, qlen 0, cpu 0.929972>
2019-06-05 19:54:38.988400 7fda8580e700  5 mds.4.bal  import_map from 0 -> 18.8955
2019-06-05 19:54:38.990163 7fda8778d700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.13.144:6800/2743440898 conn(0x55b1cd098800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 64 vs existing csq=64 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.990302 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 137.138.52.159:6800/3470607014 conn(0x55b1cf76f000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 78 vs existing csq=78 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2019-06-05 19:54:38.992253 7fda8580e700  5 mds.4.bal prep_rebalance: cluster loads are
2019-06-05 19:54:38.992268 7fda8580e700  5 mds.4.bal prep_rebalance: my load 18.8982   target 4254.38   total 21271.9
2019-06-05 19:54:38.992275 7fda8580e700  5 mds.4.bal  mds.4 is underloaded or barely overloaded.
2019-06-05 19:54:38.992277 7fda8580e700  5 mds.4.bal  mds.3 is underloaded or barely overloaded.
2019-06-05 19:54:38.992279 7fda8580e700  5 mds.4.bal  mds.1 is underloaded or barely overloaded.
2019-06-05 19:54:38.992280 7fda8580e700  5 mds.4.bal  i am underloaded or barely overloaded, doing nothing.
2019-06-05 19:54:40.531866 7fda83009700  2 mds.4.cache Memory usage: total 831740, rss 446196, heap 307232, baseline 307232, buffers 0, 32598 / 124264 inodes have caps, 43704 caps, 0.351703 caps per inode
2019-06-05 19:54:44.239730 7fda8580e700  1 mds.cephflax-mds-f5327d21dd Updating MDS map to version 198343 from mon.2
2019-06-05 19:54:44.765660 7fda87f8e700  0 -- 137.138.44.130:6800/2077487243 >> 188.184.94.182:0/1391033891 conn(0x55b1c1d7d800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=12303575 cs=41 l=0).process missed message?  skipped from seq 0 to 192910665
2019-06-05 19:54:44.769521 7fda8580e700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: In function 'Session* MDSRank::get_session(Message*)' thread 7fda8580e700 time 2019-06-05 19:54:44.765758
/builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1)

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b1af348590]
 2: (MDSRank::get_session(Message*)+0x7af) [0x55b1af00760f]
 3: (Server::handle_client_session(MClientSession*)+0x32) [0x55b1af060ba2]
 4: (Server::dispatch(Message*)+0x5cb) [0x55b1af0932bb]
 5: (MDSRank::handle_deferrable_message(Message*)+0x804) [0x55b1aefff044]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55b1af016d73]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0xa0) [0x55b1af017d40]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55b1aeff5e63]
 9: (DispatchQueue::entry()+0x792) [0x55b1af64c9e2]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b1af3d038d]
 11: (()+0x7dd5) [0x7fda8a644dd5]
 12: (clone()+0x6d) [0x7fda89721ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Here is the log of the MDS rank which is exporting to mds.4:
p05153026289093.cern.ch: 2019-06-05 19:54:38.990760 7fa3f1b2c700 0 mds.0.migrator nicely exporting to mds.4 [dir 0x1000023defd /volumes/ [2,head] auth{1=1,2=1,3=1,4=1} pv=124082107 v=124082104 cv=124068891/124068891 REP ap=2+4+5 state=1610612738|complete f(v8031 m2019-06-05 19:34:30.424858 198=196+2) n(v7676403 rc2019-06-06 04:13:08.000000 b82443486820215 67691330=61370142+6321188) hs=198+5,ss=0+0 dirty=7 | child=1 frozen=0 subtree=0 importing=0 replicated=1 dirty=1 waiter=0 authpin=1 tempexporting=0 0x55d4540d9800]
Related issues
History
#1 Updated by Dan van der Ster almost 5 years ago
We have seen 3 identical crashes so far. (Logs of the crashed MDSs are at ceph-post-file: a74beec8-0a68-44c1-bfc5-56da4872414f)
The crash is always triggered by an export of /volumes/ to an MDS, which then crashes in MDSRank::get_session for the client at 188.184.94.182. This IP is our OpenStack Manila controller (which creates/deletes a test share once per hour but is otherwise pretty idle).
Here's an example export which triggered a crash today:
2019-06-09 20:34:08.382444 7fa3f1b2c700 0 mds.0.migrator nicely exporting to mds.2 [dir 0x1000023defd /volumes/ [2,head] auth{1=3,2=3,3=1,4=2} pv=124817319 v=124817316 cv=124814979/124814979 REP ap=2+1+1 state=1610612738|complete f(v8364 m2019-06-09 20:34:02.721217 201=199+2) n(v7712004 rc2029-07-04 21:56:43.000000 b84032403688154 74034480=65652517+8381963) hs=201+4,ss=0+0 dirty=6 | dnwaiter=0 child=1 frozen=0 subtree=0 importing=0 replicated=1 dirty=1 waiter=0 authpin=1 tempexporting=0 0x55d4540d9800]
and the crash this time on mds.2:
    -5> 2019-06-09 20:34:09.111916 7f664e49d700  0 -- 137.138.149.200:6800/331189051 >> 188.184.94.182:0/3076084349 conn(0x55dd38659000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 116), sending RESETSESSION
    -4> 2019-06-09 20:34:10.438319 7f6648516700 10 monclient: _send_mon_message to mon.cephflax-mon-9b406e0261 at 137.138.121.135:6789/0
    -3> 2019-06-09 20:34:10.646909 7f6648d17700  2 mds.2.cache Memory usage: total 17978216, rss 17503916, heap 307232, baseline 307232, buffers 0, 84434 / 5372784 inodes have caps, 85166 caps, 0.0158514 caps per inode
    -2> 2019-06-09 20:34:13.916519 7f664b51c700  1 mds.cephflax-mds-4b5d89fe52 Updating MDS map to version 198555 from mon.0
    -1> 2019-06-09 20:34:14.001817 7f664e49d700  0 -- 137.138.149.200:6800/331189051 >> 188.184.94.182:0/3076084349 conn(0x55dd38659000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=419012 cs=1 l=0).process missed message?  skipped from seq 0 to 2131513044
     0> 2019-06-09 20:34:14.005921 7f664b51c700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: In function 'Session* MDSRank::get_session(Message*)' thread 7f664b51c700 time 2019-06-09 20:34:14.001930
/builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1)
With gdb I found that session->get_nref() = 2:
(gdb) p session->nref
value has been optimized out
(gdb) p ((Session *)(*(Connection *)(m->connection)).priv).nref
$67 = {<std::__atomic_base<unsigned long>> = {_M_i = 2}, <No data fields>}
Here's more of that session in case it helps:
(gdb) p *((Session *)(*(Connection *)(m->connection)).priv) $69 = {<RefCountedObject> = {_vptr.RefCountedObject = 0x55b1afb21c50 <vtable for Session+16>, nref = {<std::__atomic_base<unsigned long>> = {_M_i = 2}, <No data fields>}, cct = 0x0}, state = 0, state_seq = 0, importing_count = 0, human_name = "", projected = std::deque with 0 elements, load_avg = {val = 0, delta = 0, vel = 0, last_decay = { tv = {tv_sec = 0, tv_nsec = 0}}, rate = {k = 0}}, load_avg_rate = {k = 0}, recall_caps = { val = 0, delta = 0, vel = 0, last_decay = {tv = {tv_sec = 1559757242, tv_nsec = 260002304}}, rate = {k = -0.011552453009332421}}, release_caps = {val = 0, delta = 0, vel = 0, last_decay = { tv = {tv_sec = 1559757242, tv_nsec = 260003346}}, rate = {k = -0.011552453009332421}}, recall_caps_throttle = {val = 0, delta = 0, vel = 0, last_decay = {tv = {tv_sec = 1559757242, tv_nsec = 260004647}}, rate = {k = -0.2772588722239781}}, recall_limit = 0, birth_time = { __d = {__r = 1762031282356192}}, info = {inst = {name = {_type = 8 '\b', _num = 76672649, ...
Looking at `MDSRank::get_session`, I wonder whether this is the only client triggering the `if (session->is_closed())` codepath?
If so, this would be the only place where we ever check that nref == 1 on a session (so maybe it is never 1 for any session??)
I was looking at commit 5dd000eb44e39567657a346451eec6adba3efb63 and I'm wondering if that backport is missing a `session->put()` in MDSRankDispatcher::ms_dispatch? (Every other place in the luminous mds code that calls get_priv() on a Session drops the ref rather than carrying it.)
So maybe the correct (luminous only) fix would be:
diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 89b6e97b06..36a3e9367e 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -952,8 +952,10 @@ bool MDSRankDispatcher::ms_dispatch(Message *m)
 {
   if (m->get_source().is_client()) {
     Session *session = static_cast<Session*>(m->get_connection()->get_priv());
-    if (session)
+    if (session) {
+      session->put(); // do not carry ref
       session->last_seen = Session::clock::now();
+    }
   }
 
   inc_dispatch_depth();
For now, we're running with mds_bal_interval = 0 to disable the balancer. But of course it would be good to hear whether the above should fix it; if so, we'll compile and try it out. (I want to be cautious, as this is happening on our prod cluster.)
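For anyone else hitting this before a fix lands, the workaround above is a config fragment on the MDS hosts (a sketch; `mds_bal_interval` is the luminous option name, and a balancer interval of 0 disables rebalancing):

```
# ceph.conf on the MDS hosts
[mds]
mds_bal_interval = 0
```

The same option can typically be applied at runtime with `ceph tell mds.* injectargs '--mds_bal_interval 0'`; verify the exact invocation against your release's documentation.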
#2 Updated by Zheng Yan almost 5 years ago
- Target version set to v12.2.13
Your patch should fix the issue. Thanks for tracking it down. Could you please create a PR?
#3 Updated by Patrick Donnelly almost 5 years ago
- Related to Backport #37739: luminous: extend reconnect period when mds is busy added
#4 Updated by Patrick Donnelly almost 5 years ago
- Subject changed from v12.2.12 MDS does fails assert(session->get_nref() == 1) when balancing to luminous: mds: does fails assert(session->get_nref() == 1) when balancing
- Status changed from New to Fix Under Review
- Assignee set to Dan van der Ster
- Start date deleted (06/07/2019)
- Pull request ID set to 28467
- Affected Versions v12.2.11 added
- Affected Versions deleted (v12.2.12)
#5 Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Rejected
https://github.com/ceph/ceph/pull/28467#issuecomment-541079181
Thanks for the PR anyway Dan. Luminous QA didn't pick it up in time.