Project

General

Profile

Bug #19706

Laggy mon daemons causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))

Added by Zheng Yan over 1 year ago. Updated 3 months ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
Correctness/Safety
Target version:
Start date:
04/20/2017
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor, qa-suite
Labels (FS):
qa, task(medium)
Pull request ID:


Related issues

Related to fs - Bug #23519: mds: mds got laggy because of MDSBeacon stuck in mqueue Resolved 03/30/2018
Related to fs - Feature #17309: qa: mon_thrash test for CephFS New 09/20/2016
Duplicated by fs - Bug #20614: [WRN] MDS daemon 'a-s' is not responding, replacing it as rank 0 with standby 'a'" in cluster log Duplicate 07/13/2017

History

#1 Updated by John Spray over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to John Spray

Should disable this check for the misc workunit, it was mainly intended for the workunits that have more sustained load like kernel builds etc.

#2 Updated by John Spray over 1 year ago

So it turns out this is exposing an underlying issue (in at least some cases):
http://qa-proxy.ceph.com/teuthology/jspray-2017-05-11_14:40:13-fs-wip-jcsp-testing-20170511-distro-basic-smithi/1124272/teuthology.log

In this one, the splitting is occurring:

2017-05-11 15:21:44.594381 7f1e4b8f0700 10 mds.0.cache fragment_finish 10000000154resultfrags.size=8

But later, the MDS is going laggy and getting blacklisted, so restarting (which resets the perf counters):

2017-05-11 15:23:14.053010 7f1e518fc700  5 mds.beacon.a is_laggy 15.016571 > 15 since last acked beacon
...
2017-05-11 15:25:46.519542 7f1e4b8f0700 -1 mds.0.journaler.mdlog(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
...
2017-05-11 15:25:48.351503 7f4d3f2b2380  0 ceph version 12.0.1-2287-gb682d02 (b682d02394335207716b6fba17a94cc1e90213a9), process ceph-mds, pid 24292

#3 Updated by John Spray over 1 year ago

The blacklisting happened here:

2017-05-11 15:25:14.292282 7f45289ad700 10 mon.b@0(leader).mds e9 e9: 1/1/1 up {0=a=up:active}, 1 up:standby
2017-05-11 15:25:14.292291 7f45289ad700 10 mon.b@0(leader).mds e9 no beacon from 4147 172.21.15.192:6808/794576702 mds.0.8 up:active since 2017-05-11 15:24:55.034467
2017-05-11 15:25:14.292296 7f45289ad700 10 mon.b@0(leader).mds e9  replacing 4147 172.21.15.192:6808/794576702 mds.0.8 up:active with 4131/a-s 172.21.15.69:6809/410483415
2017-05-11 15:25:14.292300 7f45289ad700 10 mon.b@0(leader).mds e9 fail_mds_gid 4147 mds.a role 0
2017-05-11 15:25:14.292302 7f45289ad700 10 mon.b@0(leader).osd e11 blacklist 172.21.15.192:6808/794576702 until 2017-05-11 15:49:14.292301

The most recent beacon had been dropped like this:

2017-05-11 15:25:14.292064 7f452b1b2700 10 mon.b@0(leader).paxosservice(mdsmap 1..9) dispatch 0x7f453a5e83c0 mdsbeacon(4147/a up:active seq 100 v9) v7 from mds.0 172.21.15.192:6808/794576702 con 0x7f453ab4da00
2017-05-11 15:25:14.292067 7f452b1b2700 10 mon.b@0(leader).paxosservice(mdsmap 1..9)  discarding forwarded message from previous election epoch 14 < 15

Handling slow mon elections was supposed to be fixed in this commit:

commit 9950062f90d35b20d195317e78d7ab51f851da4a
Author: John Spray <john.spray@redhat.com>
Date:   Wed Sep 21 11:45:38 2016 +0100

    mon: make MDSMonitor tolerant of slow mon elections

    Previously MDS daemons would get failed incorrectly
    when they appeared to have timed out due to
    delays in calling into MDSMonitor that were
    actually caused by e.g. slow leveldb writes leading
    to slow mon elections.

    Fixes: http://tracker.ceph.com/issues/17308
    Signed-off-by: John Spray <john.spray@redhat.com>

#4 Updated by John Spray over 1 year ago

Another one here, easily identifiable from teuthology log:

...during the ffsb script's execution...
2017-05-22T11:31:19.122 INFO:tasks.ceph.mon.c.smithi079.stderr:2017-05-22 11:31:19.128383 7fe62dd96700 -1 mon.c@2(peon).paxos(paxos updating c 1..176) lease_expire from mon.0 172.21.15.79:6789/0 is 0.306861 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2017-05-22T11:31:28.934 INFO:tasks.ceph.mds.a-s.smithi079.stdout:starting mds.a-s at -

http://qa-proxy.ceph.com/teuthology/teuthology-2017-05-22_03:15:05-fs-master---basic-smithi/1205932/

#5 Updated by John Spray over 1 year ago

The issue is that we're getting nice regular tick() calls even when we're not readable, so the code in ::tick to reset the beacon timeouts is not getting triggered. However, because we're not readable, beacons are getting ignored (in wait_for_readable state).

However, we do clear out last_beacon in on_restart(), which is supposed to be called when an election starts.

jspray-2017-05-23_11:58:06-fs-wip-jcsp-testing-20170523-distro-basic-smithi/1221142

2017-05-23 13:34:14.627396 7f77319c7700 10 mon.b@0(leader).mds e11 e11: 1/1/1 up {0=a=up:active}, 1 up:standby
2017-05-23 13:34:14.627414 7f77319c7700 10 mon.b@0(leader).mds e11 no beacon from 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active since 2017-05-23 13:33:56.375452
2017-05-23 13:34:14.627429 7f77319c7700 10 mon.b@0(leader).mds e11  replacing 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active with 4131/a-s 172.21.15.74:6808/2057197859
2017-05-23 13:34:14.627436 7f77319c7700 10 mon.b@0(leader).mds e11 fail_mds_gid 4141 mds.a role 0

The actual elections are happening after the incident at 13:34:18 and 13:34:33.

So presumably the way the mon leases are working, we're going into a non-readable state some period before an election is actually triggered, and it's during that non-readable period that the MDS daemon is getting wrongly killed.

#6 Updated by Kefu Chai over 1 year ago

2017-05-23 13:28:01.578885 7f772f1c2700 10 mon.b@0(leader).mds e9 preprocess_query mdsbeacon(4131/a-s up:standby seq 5 v7) v7 from mds.? 172.21.15.74:6808/2057197859
2017-05-23 13:28:01.578893 7f772f1c2700 20 is_capable service=mds command= exec on cap allow *
2017-05-23 13:28:01.578896 7f772f1c2700 20  allow so far , doing grant allow *
2017-05-23 13:28:01.578897 7f772f1c2700 20  allow all
2017-05-23 13:28:01.578900 7f772f1c2700 12 mon.b@0(leader).mds e9 preprocess_beacon mdsbeacon(4131/a-s up:standby seq 5 v7) v7 from mds.? 172.21.15.74:6808/2057197859 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-05-23 13:28:01.578908 7f772f1c2700  1 mon.b@0(leader).mds e9  mds mds.? 172.21.15.74:6808/2057197859 can't write to fsmap compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
...
2017-05-23 13:28:04.781652 7f77319c7700 10 mon.b@0(leader).mds e9  failing and removing 4131 172.21.15.74:6808/2057197859 mds.-1.0 up:standby
2017-05-23 13:28:04.781657 7f77319c7700 10 mon.b@0(leader).mds e9 fail_mds_gid 4131 mds.a-s role -1
2017-05-23 13:28:04.875444 7f00409b6700 10 mds.a-s      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}

so the beacon message sent from mds/a-s (172.21.15.74:6808) was dropped on the floor, and was not noted down, because it was not compatible with the mdsmap.

#7 Updated by Kefu Chai over 1 year ago

from /a/teuthology-2017-05-22_03:15:05-fs-master---basic-smithi/1205932/remote/smithi079/log/ceph-mon.c.log.gz

2017-05-22 11:31:13.822115 7fe62dd96700 10 mon.c@2(peon).paxos(paxos updating c 1..176) store_state [176..176]
2017-05-22 11:31:17.270531 7fe62cd94700  1 -- 172.21.15.79:0/6890 --> 172.21.15.185:6800/12997 -- mgrreport(+0-0 packed 1662) v2 -- 0x7fe64066a300 con 0
2017-05-22 11:31:17.270596 7fe629d8e700 10 _calc_signature seq 71 front_crc_ = 4107025821 middle_crc = 0 data_crc = 0 sig = 10256893405846455674
2017-05-22 11:31:17.270604 7fe629d8e700 20 Putting signature in client message(seq # 71): sig = 10256893405846455674
2017-05-22 11:31:19.127817 7fe62ad90700 10 _calc_signature seq 681 front_crc_ = 3760723367 middle_crc = 0 data_crc = 0 sig = 8018174443504836687
2017-05-22 11:31:19.128097 7fe62dd96700 10 mon.c@2(peon) e2 refresh_from_paxos

the peon took more than 5 seconds to apply the transaction. when it finished it, it got a lease which has already expired. so it wasted a chance to return to readable/writable state.

i think we need to figure out why it took so long to apply the transaction.

#8 Updated by John Spray over 1 year ago

My assumption was just that this test was hammering the nodes hard, as it seems to be repeatedly failing on the same CephFS workload (the ffsb benchmark), but I don't have any direct evidence that that's what caused the mon slowness.

We only started noticing this when introducing this particular test check that validates an MDS counter at the end of the test -- previously MDS daemons could bounce during the test without triggering a failure.

#9 Updated by John Spray over 1 year ago

  • Status changed from In Progress to Need Review

#10 Updated by John Spray over 1 year ago

  • Status changed from Need Review to Resolved
  • Severity changed from 3 - minor to 5 - suggestion

#11 Updated by John Spray over 1 year ago

  • Status changed from Resolved to Verified

The fix is not working in at least some cases. Here's a smoking gun failure:

http://pulpito.ceph.com/jspray-2017-06-21_17:52:06-fs-wip-jcsp-testing-20170621b-distro-basic-smithi/1312242

2017-06-21 20:36:46.323081 mon.c mon.2 172.21.15.187:6790/0 33 : cluster [INF] mon.c calling new monitor election
2017-06-21 20:36:46.323090 mon.b mon.1 172.21.15.187:6789/0 23 : cluster [INF] mon.b calling new monitor election
2017-06-21 20:36:48.338987 mon.a mon.0 172.21.15.3:6789/0 369 : cluster [WRN] MDS daemon 'a-s' is not responding, replacing it as rank 0 with standby 'a'

#12 Updated by Patrick Donnelly over 1 year ago

Here's another: /a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333703

#13 Updated by John Spray over 1 year ago

  • Subject changed from failed to set counters on mds daemons: set(['mds.dir_split']) to Laggy mon daemons trigger causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))
  • Assignee deleted (John Spray)
  • Severity changed from 5 - suggestion to 3 - minor

#14 Updated by John Spray over 1 year ago

  • Subject changed from Laggy mon daemons trigger causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split'])) to Laggy mon daemons causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))

#16 Updated by Patrick Donnelly over 1 year ago

  • Duplicated by Bug #20614: [WRN] MDS daemon 'a-s' is not responding, replacing it as rank 0 with standby 'a'" in cluster log added

#18 Updated by Patrick Donnelly 7 months ago

  • Related to Bug #23519: mds: mds got laggy because of MDSBeacon stuck in mqueue added

#19 Updated by Patrick Donnelly 7 months ago

#20 Updated by Patrick Donnelly 7 months ago

  • Category set to Correctness/Safety
  • Target version set to v14.0.0
  • Source set to Q/A
  • Component(FS) MDSMonitor, qa-suite added
  • Labels (FS) qa, task(medium) added

#21 Updated by Patrick Donnelly 7 months ago

  • Priority changed from Normal to Urgent

#22 Updated by Patrick Donnelly 6 months ago

I don't have reason to believe use of utime_t caused this issue but it's possible this could fix it: https://github.com/ceph/ceph/pull/21842/commits/1933bdf37098130342d0debacf4daa0a41011351

#23 Updated by Patrick Donnelly 3 months ago

  • Status changed from Verified to Can't reproduce

Last instance almost a year ago:

tail n 100 /a/*2017*fs*master*/*/teu* | grep -C150 'The following counters failed to' | tee READ

/ceph/teuthology-archive/teuthology-2017-09-16_03:15:02-fs-master-distro-basic-smithi/1639578/teuthology.log.gz

This has been resolved upstream somehow, apparently.

Also available in: Atom PDF