Bug #45080


mds slow journal ops then cephfs damaged after restarting an osd host

Added by Dan van der Ster about 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had a cephfs cluster get damaged following a routine osd restart.

  • All MDSs have ms type = simple, which seems to solve a buffer_anon memory leak.
  • There are two active MDSs, with automatic balancing enabled and working well for weeks.
  • We restarted osds on one out of 3 osd hosts. (20 out of 60 osds restarted -- PGs were active throughout).
  • After all OSDs were back up and all PGs were active+clean, there were a few hundred stuck ops on both MDSs, e.g.
    2020-04-14 10:29:54.817599 mon.cephdwight-mon-1633994557 (mon.0) 1591523 : cluster [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
    2020-04-14 10:29:54.833090 mon.cephdwight-mon-1633994557 (mon.0) 1591524 : cluster [DBG] mds.1 v1:137.138.122.17:6800/871421306 up:active
    2020-04-14 10:29:54.833187 mon.cephdwight-mon-1633994557 (mon.0) 1591525 : cluster [DBG] fsmap cephfs:2 {0=cephdwightmds0=up:active,1=cephdwightmds2=up:active} 1 up:standby
    2020-04-14 10:29:53.123282 mds.cephdwightmds0 (mds.0) 137 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.240923 secs
    2020-04-14 10:29:53.123300 mds.cephdwightmds0 (mds.0) 138 : cluster [WRN] slow request 30.240922 seconds old, received at 2020-04-14 10:29:22.881996: client_request(client.893240419:110923293 create #0x10039d3aa76/.83f17681-34d5-4a23-a6be-08b6dcfad25e.tmp 2020-04-14 10:29:22.881825 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
    2020-04-14 10:29:53.123303 mds.cephdwightmds0 (mds.0) 139 : cluster [WRN] slow request 30.220358 seconds old, received at 2020-04-14 10:29:22.902560: client_request(client.893240419:110923299 create #0x10039d3aa76/..83f17681-34d5-4a23-a6be-08b6dcfad25e.tmp.crc 2020-04-14 10:29:22.902402 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
    
  • Then I restarted mds.1, which asserted during shutdown, after which the rank was damaged:
2020-04-14 10:37:19.806 7f2f93596700  0 -- v1:137.138.122.17:6800/871421306 >> v1:188.185.72.230:0/1226312550 pipe(0x55fd33ea9400 sd=205 :6800 s=2 pgs=58569 cs=1 l=0 c=0x55fd25f2b520).fault with nothing to send, going to standby
2020-04-14 10:37:20.097 7f2f942a3700  0 -- v1:137.138.122.17:6800/871421306 >> v1:188.184.84.39:0/1767657975 pipe(0x55fd15d1e800 sd=197 :6800 s=2 pgs=405022 cs=1 l=0 c=0x55fd24c89860).fault with nothing to send, going to standby
2020-04-14 10:37:22.125 7f2fb247f700  1 mds.1.77095 shutdown: shutting down rank 1
2020-04-14 10:37:22.127 7f2fabce4700  2 mds.1.cache Memory usage:  total 3199000, rss 1287996, heap 675068, baseline 773372, 11094 / 424727 inodes have caps, 12151 caps, 0.028609 caps per inode
2020-04-14 10:37:22.127 7f2fb11f3700  0 ms_deliver_dispatch: unhandled message 0x55fd128ba280 osd_map(315270..315270 src has 314665..315270) v4 from mon.2 v1:188.185.66.208:6790/0
2020-04-14 10:37:22.127 7f2fb11f3700  0 ms_deliver_dispatch: unhandled message 0x55fd2cebf680 mdsmap(e 77284) v1 from mon.2 v1:188.185.66.208:6790/0
2020-04-14 10:37:22.127 7f2fb11f3700  0 ms_deliver_dispatch: unhandled message 0x55fd9ef25d40 mdsmap(e 77285) v1 from mon.2 v1:188.185.66.208:6790/0
2020-04-14 10:37:22.129 7f2fb11f3700  0 ms_deliver_dispatch: unhandled message 0x55fd9ef24d80 mdsmap(e 77286) v1 from mon.2 v1:188.185.66.208:6790/0
2020-04-14 10:37:22.136 7f2fa9ce0700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/mds/Server.cc: In function 'virtual void C_MDS_unlink_local_finish::finish(int)' thread 7f2fa9ce0700 time 2020-04-14 10:37:22.132033
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/mds/Server.cc: 6810: FAILED ceph_assert(r == 0)

 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f2fbaf07325]
 2: (()+0x2534ed) [0x7f2fbaf074ed]
 3: (()+0x20611d) [0x55fd0f8de11d]
 4: (MDSContext::complete(int)+0x74) [0x55fd0faa3cd4]
 5: (MDSIOContextBase::complete(int)+0x155) [0x55fd0faa3f15]
 6: (MDSLogContextBase::complete(int)+0x40) [0x55fd0faa4150]
 7: (Finisher::finisher_thread_entry()+0x16f) [0x7f2fbaf9026f]
 8: (()+0x7e65) [0x7f2fb8dc7e65]
 9: (clone()+0x6d) [0x7f2fb7a7588d]

and mds.1 wouldn't restart:
2020-04-14 10:41:49.204 7fc473d64700  1 mds.cephdwightmds2 Updating MDS map to version 77344 from mon.1
2020-04-14 10:41:56.410 7fc473d64700  1 mds.cephdwightmds2 Updating MDS map to version 77345 from mon.1
2020-04-14 10:41:59.955 7fc469a38700 -1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
2020-04-14 10:41:59.955 7fc469a38700  0 mds.1.log _replay journaler got error -22, aborting
2020-04-14 10:41:59.956 7fc469a38700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid argument
2020-04-14 10:41:59.976 7fc469a38700  1 mds.cephdwightmds2 respawn!

Following the cephfs/disaster-recovery-experts docs, I reset the journals and all sessions, after which the MDSs could go active again. I am now running a recursive scrub.
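
Roughly, the steps from those docs look like this (a sketch, not a verbatim transcript: the backup file names are placeholders, both ranks are assumed to need the same treatment, and the scrub syntax varies between releases, so check ceph tell mds.<name> help first):

  # Back up each rank's journal before touching anything (placeholder file names)
  cephfs-journal-tool --rank=cephfs:0 journal export backup.rank0.bin
  cephfs-journal-tool --rank=cephfs:1 journal export backup.rank1.bin

  # Recover whatever dentries can still be read from the journals, then reset them
  cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
  cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
  cephfs-journal-tool --rank=cephfs:0 journal reset
  cephfs-journal-tool --rank=cephfs:1 journal reset

  # Wipe the session table so clients have to reconnect cleanly
  cephfs-table-tool all reset session

  # Once the ranks are active again, start the recursive scrub
  # (exact scrub syntax may differ in your release; see ceph tell mds.<name> help)
  ceph tell mds.cephdwightmds0 scrub start / recursive,repair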

Logs are all at ceph-post-file: 5d09c35d-eb9f-430c-9936-91ee3ed2ef3e

Any idea why the journal ops got blocked following this routine osd restart?
Is there a better way to unblock the slow journal ops in this scenario?
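
In case it helps with debugging, the stuck state is visible from the MDS admin socket; a minimal sketch of what I would look at (assuming the commands are run on the MDS host, and using one of the daemon names from this cluster):

  # Cluster-level summary of the slow metadata IO warning
  ceph health detail

  # Client requests stuck at "submit entry: journal_and_reply"
  ceph daemon mds.cephdwightmds0 dump_ops_in_flight

  # The MDS's own outstanding OSD operations (e.g. journal writes to the metadata pool)
  ceph daemon mds.cephdwightmds0 objecter_requests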

Actions #1

Updated by Dan van der Ster about 4 years ago

It seems this is caused by ms type = simple (and msgr2). I can reproduce the bug easily by restarting an osd: the mds gets deadlocked ops, and eventually, when you kill the mds, the journal is unreadable.

I reverted the MDSs to ms type = async, and now I can restart any osd without any problem.

(I am testing ms type = simple because async is leaking buffer_anon on my MDSs.)
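
For completeness, the setting involved is just the messenger type on the MDS daemons; a minimal sketch of the two configurations being switched between (assuming ms type is set in ceph.conf on the MDS hosts and the daemons are restarted afterwards; async+posix is the default):

  # ceph.conf on the MDS hosts -- the configuration that reproduces the problem
  [mds]
      ms type = simple

  # ...versus the revert that makes osd restarts harmless again
  [mds]
      ms type = async+posix

  # restart the MDS daemons after changing the option, e.g.
  systemctl restart ceph-mds@cephdwightmds0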

Actions #2

Updated by Greg Farnum almost 4 years ago

Hmm, so the MDS shutdown assert is a bug we've been dancing around: we were trying to clean up Context callbacks more carefully, but they couldn't handle the ESHUTDOWN return code. I don't have the reference handy, but it should be fixed now.

I am confused about why ops would be unable to complete on SimpleMessenger when OSDs restart, if the PGs remain active. But as SimpleMessenger is gone in Octopus and was already heading out the door in Nautilus, we may not be able to dig into it.
