Bug #23059


mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t)

Added by Dan van der Ster about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Zheng Yan
Category:
-
Target version:
v13.0.0
% Done:
0%

Source:
Community (user)
Tags:
Backport:
luminous,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Saw this in prod on v12.2.2 with three active MDSs. All clients had been blocked since a few minutes before this assert on mds.0.

mds.0 was on server p05153026289093 until 04:25:23:

2018-02-21 04:24:22.251609 7f718d504700  2 mds.0.cache check_memory_usage total 2652024, rss 2202716, heap 150100, baseline 150100, buffers 0, 2950 / 3671 inodes have caps, 3184 caps, 0.867339 caps per inode
2018-02-21 04:25:03.675323 7f719050a700  1 mds.p05153026289093 asok_command: session ls (starting...)
2018-02-21 04:25:23.714305 7f718fd09700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 60
2018-02-21 04:25:23.724658 7f719050a700  1 mds.p05153026289093 asok_command: session ls (complete)
2018-02-21 04:25:23.729166 7f719050a700 -1 asok(0x7f71a0d381c0) AdminSocket: error writing response length (32) Broken pipe
2018-02-21 04:25:23.805790 7f7189cfd700 -1 MDSIOContextBase: blacklisted!  Restarting...
2018-02-21 04:25:23.805837 7f7189cfd700  1 mds.p05153026289093 respawn

mds.0 moved to p05153026639750 after 04:25:12:

2018-02-21 04:25:12.502244 7f02a4387700  1 mds.0.52096 handle_mds_map i am now mds.0.52096

p05153026639750 starts the reconnects at 04:25:13:

2018-02-21 04:25:13.513549 7f02a4387700  1 mds.0.52096 reconnect_start
2018-02-21 04:25:13.513796 7f02a4387700  1 mds.0.server reconnect_clients -- 323 sessions
2018-02-21 04:25:13.513861 7f02a4387700  0 log_channel(cluster) log [DBG] : reconnect by client.24666101 10.32.4.205:0/1896164517 after 0.000027
2018-02-21 04:25:13.513911 7f02a4387700  0 log_channel(cluster) log [DBG] : reconnect by client.24666137 10.32.4.164:0/3819287346 after 0.000097
2018-02-21 04:25:13.513937 7f02a4387700  0 log_channel(cluster) log [DBG] : reconnect by client.25392090 10.32.4.163:0/4112756617 after 0.000140
...

But this new mds.0 (p05153026639750) seems to be stuck between 04:25:15 and 04:33:23 -- the symptom is that it never prints the expected mds.0.cache check_memory_usage log lines. Finally, at 04:33:23, mds.0 moves on (up:reconnect --> up:rejoin --> up:clientreplay) and evicts all clients as "unresponsive":

2018-02-21 04:33:23.262407 7f02a4387700  1 mds.0.52096 handle_mds_map i am now mds.0.52096
2018-02-21 04:33:23.262417 7f02a4387700  1 mds.0.52096 handle_mds_map state change up:reconnect --> up:rejoin
2018-02-21 04:33:23.262427 7f02a4387700  1 mds.0.52096 rejoin_start
2018-02-21 04:33:23.263209 7f02a4387700  1 mds.0.52096 rejoin_joint_start
2018-02-21 04:33:25.741109 7f02a4387700  1 mds.0.52096 rejoin_done
2018-02-21 04:33:25.802294 7f02a4387700  1 mds.0.52096 handle_mds_map i am now mds.0.52096
2018-02-21 04:33:25.802301 7f02a4387700  1 mds.0.52096 handle_mds_map state change up:rejoin --> up:clientreplay
2018-02-21 04:33:25.802315 7f02a4387700  1 mds.0.52096 recovery_done -- successful recovery!
2018-02-21 04:33:25.803570 7f02a4387700  1 mds.0.52096 clientreplay_start
2018-02-21 04:33:25.846034 7f029fb7e700  1 mds.0.52096 clientreplay_done
2018-02-21 04:33:25.978638 7f029e37b700  1 mds.0.52096 clientreplay_done
2018-02-21 04:33:26.028929 7f02a1b82700  0 log_channel(cluster) log [WRN] : evicting unresponsive client boincai09.cern.ch:boincauthid02 (29306014), after 492.425494 seconds
...

Then, immediately after evicting all the clients, mds.0 hits this assert:

    -6> 2018-02-21 04:33:26.894621 7f029d379700  1 -- 137.138.13.144:6800/1770044186 --> 128.142.161.229:6859/738454 -- osd_op(unknown.0.52096:2020 2.3a9 2:95f498d3:::200.0170143f:head [write 2493913~133 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e24720) v8 -- 0x7f02ba7d4dc0 con 0
    -5> 2018-02-21 04:33:26.894671 7f02a630e700  1 -- 137.138.13.144:6800/1770044186 >> 10.32.5.60:0/3709995754 conn(0x7f02b5812000 :6800 s=STATE_OPEN pgs=694721 cs=1 l=0).read_bulk peer close file descriptor 1102
    -4> 2018-02-21 04:33:26.894682 7f029d379700  1 -- 137.138.13.144:6800/1770044186 --> 128.142.161.229:6859/738454 -- osd_op(unknown.0.52096:2021 2.3a9 2:95f498d3:::200.0170143f:head [write 2494046~149 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e24720) v8 -- 0x7f02ba7d5340 con 0
    -3> 2018-02-21 04:33:26.894701 7f02a630e700  1 -- 137.138.13.144:6800/1770044186 >> 10.32.5.60:0/3709995754 conn(0x7f02b5812000 :6800 s=STATE_OPEN pgs=694721 cs=1 l=0).read_until read failed
    -2> 2018-02-21 04:33:26.894725 7f02a630e700  1 -- 137.138.13.144:6800/1770044186 >> 10.32.5.60:0/3709995754 conn(0x7f02b5812000 :6800 s=STATE_OPEN pgs=694721 cs=1 l=0).process read tag failed
    -1> 2018-02-21 04:33:26.894753 7f029d379700  1 -- 137.138.13.144:6800/1770044186 --> 128.142.161.229:6859/738454 -- osd_op(unknown.0.52096:2022 2.3a9 2:95f498d3:::200.0170143f:head [write 2494195~149 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e24720) v8 -- 0x7f02ba7d63c0 con 0
     0> 2018-02-21 04:33:26.894758 7f02a4387700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7f02a4387700 time 2018-02-21 04:33:26.889261
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mds/MDCache.cc: 9059: FAILED assert(p != active_requests.end())

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f02ab91f850]
 2: (MDCache::request_get(metareqid_t)+0x267) [0x7f02ab6d7967]
 3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x314) [0x7f02ab68d6c4]
 4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x9ab) [0x7f02ab68edfb]
 5: (Server::dispatch(Message*)+0x633) [0x7f02ab68fad3]
 6: (MDSRank::handle_deferrable_message(Message*)+0x804) [0x7f02ab6068f4]
 7: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x7f02ab614573]
 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f02ab6153b5]
 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f02ab5fdff3]
 10: (DispatchQueue::entry()+0x792) [0x7f02abc03be2]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f02ab9a4fbd]
 12: (()+0x7e25) [0x7f02a93f9e25]
 13: (clone()+0x6d) [0x7f02a84ca34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After this, the ceph-mds process restarts and replays again as mds.0, but then denies all reconnect attempts from the clients:

2018-02-21 04:33:41.567046 7fedbfb2f700  1 mds.0.52261 handle_mds_map i am now mds.0.52261
2018-02-21 04:33:41.567056 7fedbfb2f700  1 mds.0.52261 handle_mds_map state change up:resolve --> up:reconnect
2018-02-21 04:33:41.567066 7fedbfb2f700  1 mds.0.52261 reconnect_start
2018-02-21 04:33:41.567220 7fedbfb2f700  1 mds.0.server reconnect_clients -- 4 sessions
2018-02-21 04:33:41.567238 7fedbfb2f700  1 mds.0.server  session is closed, ignoring reconnect, sending close
2018-02-21 04:33:41.567311 7fedbfb2f700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.27156114 10.32.3.41:0/773630832 (session is closed)
...

The clients (mostly v12.2.2) don't seem to handle this and stay blocked.

Our MDSs have failed over like this several times in the past, but never before have we seen the clients get blocked like this.
Looking in the logs just before this outage, there were some trees exported from mds.1 and mds.2 to mds.0 a few minutes before the crash:

2018-02-21 04:23:12.246330 7f9b3df2c700  0 mds.1.bal    - exporting [84.0456,96.7639 277.574] 277.574 to mds.0 [dir 0x1000020e243 /user/marinat4/ [2,head] auth{2=1} v=3665791 cv=3664819/3664819 REP state=1610612738|complete f(v180 m2018-02-20 18:58:45.845994 30=20+10) n(v172224 rc2018-02-21 04:22:49.872664 b1779395068031 2687918=2684469+3449) hs=30+0,ss=0+0 dirty=1 | dnwaiter=0 child=1 frozen=0 subtree=0 importing=0 importbound=0 sticky=0 replicated=1 dirty=1 waiter=0 authpin=0 tempexporting=0 0x555ee212b400]
2018-02-21 04:24:12.518767 7f987fd8f700  0 mds.2.bal    - exporting [29.5521,0 29.5521] 29.5521 to mds.0 [dir 0x30000b15922 /volumes/_nogroup/393f2dcc-6b09-44d7-8d20-0e84b072ed26/user/thibaut/WeakMagnetHCMOctopan0-01mbar/TransportTruePressure85-100us/ [2,head] auth v=4378 cv=4376/4376 state=1610612738|complete f(v0 8=7+1)/f(v0 m2018-02-20 18:01:57.294809 8=7+1) n(v209 rc2018-02-21 03:34:19.270600 b15163969763 64=56+8) hs=8+0,ss=0+0 dirty=1 | child=1 frozen=0 dirty=1 waiter=0 authpin=0 0x5627db01d0c0]

Logs are available with ceph-post-file: 7f9ad0d9-af8f-45ff-926c-353fe11dd713


Related issues (3 total: 0 open, 3 closed)

Has duplicate: CephFS - Bug #23503: mds: crash during pressure test (Duplicate, 2018-03-29)
Copied to: CephFS - Backport #23154: luminous: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) (Resolved, Nathan Cutler)
Copied to: CephFS - Backport #23155: jewel: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) (Rejected, Patrick Donnelly)

Actions #1

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) to mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t)
  • Priority changed from Normal to High
  • Component(FS) MDS added

Thanks for the report Dan.

Actions #2

Updated by Zheng Yan about 6 years ago

There are two bugs: one caused the assertion, and another caused the eviction of clients.

The bug that caused the assertion is in MDCache::request_kill(). The MDS can request a remote authpin or remote wrlock while mdr->done_locking is still false. If the request that initiated the remote authpin/wrlock gets cleaned up, the reply to the remote authpin/wrlock triggers the assertion.

  // rollback slave requests is tricky. just let the request proceed.
  if (mdr->has_more() &&
      (!mdr->more()->witnessed.empty() || !mdr->more()->waiting_on_slave.empty())) {
    dout(10) << "request_kill " << *mdr << " -- already started slave requests, no-op" << dendl;

    assert(mdr->used_prealloc_ino == 0);
    assert(mdr->prealloc_inos.empty());

    mdr->session = NULL;
    mdr->item_session_request.remove_myself();
    return;
  }
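
To make the race concrete, here is a minimal, self-contained sketch (illustrative only, not actual Ceph code) of the sequence described above: a master request is registered, a remote authpin/wrlock is issued before done_locking, the request is then cleaned up, and the slave's reply later looks the request up by id and trips the assert. Only the names active_requests and request_get mirror MDCache; the types, the request id, and the rest of the scaffolding are made up for the example.

// Illustrative model only -- not Ceph code.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

using metareqid = std::string;                    // stand-in for metareqid_t
static std::map<metareqid, int> active_requests;  // rid -> request state

// Mirrors the shape of MDCache::request_get(): assumes the rid is still registered.
int request_get(const metareqid& rid) {
  auto p = active_requests.find(rid);
  assert(p != active_requests.end());             // the assert that fails in this report
  return p->second;
}

int main() {
  metareqid rid = "client.24666101:1234";         // made-up request id
  active_requests[rid] = 1;                       // master request created

  // A remote authpin/wrlock is sent to another rank while done_locking is
  // still false ... but the request is killed and cleaned up first (for
  // example because the client session goes away during recovery):
  active_requests.erase(rid);                     // what cleanup amounts to here

  // The slave's reply arrives afterwards; the reply path still looks the
  // request up by id (see the backtrace above), and the assert fires:
  std::cout << request_get(rid) << std::endl;     // aborts here
  return 0;
}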

The bug that caused the eviction of clients is that the MDS delays handling CEPH_SESSION_REQUEST_RENEWCAPS messages when it is not active.
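
For the second bug, here is a similar toy model (again not Ceph code; the 300-second threshold and the timestamps are assumptions chosen only to mirror the 492-second figure in the log above) of why deferring renew-caps handling during recovery makes every session look stale the moment the MDS goes active:

// Illustrative model only -- not Ceph code; the numbers are assumptions.
#include <iostream>
#include <map>
#include <string>

struct Session {
  std::string name;
  double last_renew_sec;  // last renewal the MDS actually processed
};

int main() {
  const double session_autoclose = 300.0;  // assumed eviction threshold
  const double now = 492.4;                // chosen to mirror the log above

  std::map<int, Session> sessions = {
      {29306014, {"boincai09.cern.ch:boincauthid02", 0.0}}};

  // While the MDS sits in reconnect/rejoin/clientreplay it defers the clients'
  // CEPH_SESSION_REQUEST_RENEWCAPS messages instead of refreshing last_renew_sec,
  // so by the time it goes active every session appears stale:
  for (const auto& [id, s] : sessions) {
    if (now - s.last_renew_sec > session_autoclose) {
      std::cout << "evicting unresponsive client " << s.name << " (" << id
                << "), after " << (now - s.last_renew_sec) << " seconds\n";
    }
  }
  return 0;
}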

Actions #3

Updated by Zheng Yan about 6 years ago

  • Status changed from New to Fix Under Review
Actions #4

Updated by Patrick Donnelly about 6 years ago

  • Backport set to luminous,jewel
Actions #5

Updated by Patrick Donnelly about 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Assignee set to Zheng Yan
  • Target version set to v13.0.0
  • Component(FS) deleted (ceph-fuse)
Actions #6

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #23154: luminous: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added
Actions #7

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #23155: jewel: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added
Actions #8

Updated by Patrick Donnelly about 6 years ago

  • Has duplicate Bug #23503: mds: crash during pressure test added
Actions #9

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Pending Backport to Resolved