Bug #19706
Closed
Laggy mon daemons causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))
Added by Zheng Yan about 7 years ago. Updated over 5 years ago.
Description
Updated by John Spray about 7 years ago
- Status changed from New to In Progress
- Assignee set to John Spray
We should disable this check for the misc workunit; it was mainly intended for the workunits that have more sustained load, like kernel builds.
Updated by John Spray almost 7 years ago
So it turns out this is exposing an underlying issue (in at least some cases):
http://qa-proxy.ceph.com/teuthology/jspray-2017-05-11_14:40:13-fs-wip-jcsp-testing-20170511-distro-basic-smithi/1124272/teuthology.log
In this one, the splitting is occurring:
2017-05-11 15:21:44.594381 7f1e4b8f0700 10 mds.0.cache fragment_finish 10000000154 resultfrags.size=8
But later, the MDS is going laggy and getting blacklisted, so restarting (which resets the perf counters):
2017-05-11 15:23:14.053010 7f1e518fc700 5 mds.beacon.a is_laggy 15.016571 > 15 since last acked beacon
...
2017-05-11 15:25:46.519542 7f1e4b8f0700 -1 mds.0.journaler.mdlog(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
...
2017-05-11 15:25:48.351503 7f4d3f2b2380 0 ceph version 12.0.1-2287-gb682d02 (b682d02394335207716b6fba17a94cc1e90213a9), process ceph-mds, pid 24292
Updated by John Spray almost 7 years ago
The blacklisting happened here:
2017-05-11 15:25:14.292282 7f45289ad700 10 mon.b@0(leader).mds e9 e9: 1/1/1 up {0=a=up:active}, 1 up:standby
2017-05-11 15:25:14.292291 7f45289ad700 10 mon.b@0(leader).mds e9 no beacon from 4147 172.21.15.192:6808/794576702 mds.0.8 up:active since 2017-05-11 15:24:55.034467
2017-05-11 15:25:14.292296 7f45289ad700 10 mon.b@0(leader).mds e9 replacing 4147 172.21.15.192:6808/794576702 mds.0.8 up:active with 4131/a-s 172.21.15.69:6809/410483415
2017-05-11 15:25:14.292300 7f45289ad700 10 mon.b@0(leader).mds e9 fail_mds_gid 4147 mds.a role 0
2017-05-11 15:25:14.292302 7f45289ad700 10 mon.b@0(leader).osd e11 blacklist 172.21.15.192:6808/794576702 until 2017-05-11 15:49:14.292301
The most recent beacon had been dropped like this:
2017-05-11 15:25:14.292064 7f452b1b2700 10 mon.b@0(leader).paxosservice(mdsmap 1..9) dispatch 0x7f453a5e83c0 mdsbeacon(4147/a up:active seq 100 v9) v7 from mds.0 172.21.15.192:6808/794576702 con 0x7f453ab4da00
2017-05-11 15:25:14.292067 7f452b1b2700 10 mon.b@0(leader).paxosservice(mdsmap 1..9) discarding forwarded message from previous election epoch 14 < 15
Handling slow mon elections was supposed to be fixed in this commit:
commit 9950062f90d35b20d195317e78d7ab51f851da4a
Author: John Spray <john.spray@redhat.com>
Date:   Wed Sep 21 11:45:38 2016 +0100

    mon: make MDSMonitor tolerant of slow mon elections

    Previously MDS daemons would get failed incorrectly when they appeared
    to have timed out due to delays in calling into MDSMonitor that were
    actually caused by e.g. slow leveldb writes leading to slow mon
    elections.

    Fixes: http://tracker.ceph.com/issues/17308
    Signed-off-by: John Spray <john.spray@redhat.com>
Updated by John Spray almost 7 years ago
Another one here, easily identifiable from teuthology log:
...during the ffsb script's execution...
2017-05-22T11:31:19.122 INFO:tasks.ceph.mon.c.smithi079.stderr:2017-05-22 11:31:19.128383 7fe62dd96700 -1 mon.c@2(peon).paxos(paxos updating c 1..176) lease_expire from mon.0 172.21.15.79:6789/0 is 0.306861 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2017-05-22T11:31:28.934 INFO:tasks.ceph.mds.a-s.smithi079.stdout:starting mds.a-s at -
http://qa-proxy.ceph.com/teuthology/teuthology-2017-05-22_03:15:05-fs-master---basic-smithi/1205932/
Updated by John Spray almost 7 years ago
The issue is that we're getting nice regular tick() calls even when we're not readable, so the code in ::tick that resets the beacon timeouts is not getting triggered. But because we're not readable, the beacons themselves are being ignored (held in the wait_for_readable state).
We do clear out last_beacon in on_restart(), though, which is supposed to be called when an election starts.
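To make that failure mode concrete, here is a small Python model of the interaction just described. This is not Ceph source; all names (MDSMonitorModel, receive_beacon) are illustrative, and only the shape of the logic is taken from the comments above: beacons are ignored while the mon is not readable, tick() keeps running the timeout check regardless, and last_beacon is only cleared if an election actually starts.

```python
MDS_BEACON_GRACE = 15.0  # seconds; mirrors the mds_beacon_grace default

class MDSMonitorModel:
    """Toy model of the monitor-side beacon bookkeeping (illustrative only)."""

    def __init__(self):
        self.last_beacon = {}  # mds gid -> time its beacon was last accepted
        self.failed = set()

    def receive_beacon(self, gid, now, readable):
        # While the mon is not readable, beacons sit in wait_for_readable
        # and never update last_beacon.
        if readable:
            self.last_beacon[gid] = now

    def on_restart(self):
        # Called when an election starts: forget beacon ages so MDS
        # daemons are not blamed for the mon's own downtime.
        self.last_beacon.clear()

    def tick(self, now):
        # tick() keeps firing regularly even while not readable, so the
        # timeout check still runs against a stale last_beacon.
        for gid, t in self.last_beacon.items():
            if now - t > MDS_BEACON_GRACE:
                self.failed.add(gid)

mon = MDSMonitorModel()
mon.receive_beacon(4147, now=0.0, readable=True)
# The mon goes non-readable at t=1 but no election is triggered yet,
# so on_restart() never runs and last_beacon goes stale.
for t in range(1, 20):
    mon.receive_beacon(4147, now=float(t), readable=False)
    mon.tick(float(t))
print(4147 in mon.failed)  # True: the MDS is wrongly failed
```

If an election had fired (on_restart() called) before the grace expired, last_beacon would have been cleared and the daemon spared, which is exactly the non-readable-but-no-election window being described.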
jspray-2017-05-23_11:58:06-fs-wip-jcsp-testing-20170523-distro-basic-smithi/1221142
2017-05-23 13:34:14.627396 7f77319c7700 10 mon.b@0(leader).mds e11 e11: 1/1/1 up {0=a=up:active}, 1 up:standby
2017-05-23 13:34:14.627414 7f77319c7700 10 mon.b@0(leader).mds e11 no beacon from 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active since 2017-05-23 13:33:56.375452
2017-05-23 13:34:14.627429 7f77319c7700 10 mon.b@0(leader).mds e11 replacing 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active with 4131/a-s 172.21.15.74:6808/2057197859
2017-05-23 13:34:14.627436 7f77319c7700 10 mon.b@0(leader).mds e11 fail_mds_gid 4141 mds.a role 0
The actual elections are happening after the incident at 13:34:18 and 13:34:33.
So presumably, given how the mon leases work, we go into a non-readable state some period before an election is actually triggered, and it's during that non-readable window that the MDS daemon is getting wrongly failed.
Updated by Kefu Chai almost 7 years ago
2017-05-23 13:28:01.578885 7f772f1c2700 10 mon.b@0(leader).mds e9 preprocess_query mdsbeacon(4131/a-s up:standby seq 5 v7) v7 from mds.? 172.21.15.74:6808/2057197859
2017-05-23 13:28:01.578893 7f772f1c2700 20 is_capable service=mds command= exec on cap allow *
2017-05-23 13:28:01.578896 7f772f1c2700 20 allow so far , doing grant allow *
2017-05-23 13:28:01.578897 7f772f1c2700 20 allow all
2017-05-23 13:28:01.578900 7f772f1c2700 12 mon.b@0(leader).mds e9 preprocess_beacon mdsbeacon(4131/a-s up:standby seq 5 v7) v7 from mds.? 172.21.15.74:6808/2057197859 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-05-23 13:28:01.578908 7f772f1c2700 1 mon.b@0(leader).mds e9 mds mds.? 172.21.15.74:6808/2057197859 can't write to fsmap compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
...
2017-05-23 13:28:04.781652 7f77319c7700 10 mon.b@0(leader).mds e9 failing and removing 4131 172.21.15.74:6808/2057197859 mds.-1.0 up:standby
2017-05-23 13:28:04.781657 7f77319c7700 10 mon.b@0(leader).mds e9 fail_mds_gid 4131 mds.a-s role -1
2017-05-23 13:28:04.875444 7f00409b6700 10 mds.a-s my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
So the beacon message sent from mds.a-s (172.21.15.74:6808) was dropped on the floor and never recorded, because its compat set was not compatible with the mdsmap's.
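The two compat strings in the log can be read as a subset check. The sketch below is hypothetical (it is not Ceph's actual CompatSet code; writeable is an illustrative name): the beacon's incompat set lacked feature 7 ("mds uses inline data"), which the fsmap carries, hence "can't write to fsmap" and the beacon being discarded.

```python
def writeable(daemon_incompat, fsmap_incompat):
    # A daemon can only be written into the fsmap if it understands
    # every incompat feature the fsmap already carries (subset test).
    return fsmap_incompat <= daemon_incompat

# Feature ids taken from the log lines above; 7 = "mds uses inline data".
beacon_incompat = {1, 2, 3, 4, 5, 6, 8}      # the dropped beacon: no 7
fsmap_incompat = {1, 2, 3, 4, 5, 6, 7, 8}    # the fsmap requires 7

print(writeable(beacon_incompat, fsmap_incompat))  # False: "can't write to fsmap"
```

The odd part, per the next log line, is that the mds itself reports a compat set that does include feature 7, so the beacon appears to have carried a stale or default compat set rather than the daemon's real one.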
Updated by Kefu Chai almost 7 years ago
from /a/teuthology-2017-05-22_03:15:05-fs-master---basic-smithi/1205932/remote/smithi079/log/ceph-mon.c.log.gz
2017-05-22 11:31:13.822115 7fe62dd96700 10 mon.c@2(peon).paxos(paxos updating c 1..176) store_state [176..176]
2017-05-22 11:31:17.270531 7fe62cd94700 1 -- 172.21.15.79:0/6890 --> 172.21.15.185:6800/12997 -- mgrreport(+0-0 packed 1662) v2 -- 0x7fe64066a300 con 0
2017-05-22 11:31:17.270596 7fe629d8e700 10 _calc_signature seq 71 front_crc_ = 4107025821 middle_crc = 0 data_crc = 0 sig = 10256893405846455674
2017-05-22 11:31:17.270604 7fe629d8e700 20 Putting signature in client message(seq # 71): sig = 10256893405846455674
2017-05-22 11:31:19.127817 7fe62ad90700 10 _calc_signature seq 681 front_crc_ = 3760723367 middle_crc = 0 data_crc = 0 sig = 8018174443504836687
2017-05-22 11:31:19.128097 7fe62dd96700 10 mon.c@2(peon) e2 refresh_from_paxos
The peon took more than 5 seconds to apply the transaction. By the time it finished, the lease it had been granted had already expired, so it wasted its chance to return to the readable/writable state.
I think we need to figure out why it took so long to apply the transaction.
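The arithmetic behind that reading, using only the timestamps from the log excerpt above (variable names are mine; this restates the log, it is not Paxos code):

```python
# Seconds within the minute 11:31, taken from the peon log above.
store_state_start = 13.822115  # 11:31:13.822115  store_state [176..176]
refresh_done = 19.128097       # 11:31:19.128097  refresh_from_paxos

apply_time = refresh_done - store_state_start
print(round(apply_time, 3))    # 5.306: over 5 seconds to apply one transaction

# Meanwhile the lease_expire warning says the lease was already
# 0.306861 s in the past when checked, so once the peon finally
# finished, it could not transition back to readable/writable.
lease_overrun = 0.306861
print(lease_overrun > 0)       # True: the lease had already expired
```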
Updated by John Spray almost 7 years ago
My assumption was just that this test was hammering the nodes hard, as it seems to be repeatedly failing on the same CephFS workload (the ffsb benchmark), but I don't have any direct evidence that that's what caused the mon slowness.
We only started noticing this after introducing the test check that validates an MDS counter at the end of the test; previously, MDS daemons could bounce during the test without triggering a failure.
Updated by John Spray almost 7 years ago
- Status changed from In Progress to Fix Under Review
Updated by John Spray almost 7 years ago
- Status changed from Fix Under Review to Resolved
- Severity changed from 3 - minor to 5 - suggestion
Updated by John Spray almost 7 years ago
- Status changed from Resolved to 12
The fix is not working in at least some cases. Here's a smoking gun failure:
2017-06-21 20:36:46.323081 mon.c mon.2 172.21.15.187:6790/0 33 : cluster [INF] mon.c calling new monitor election
2017-06-21 20:36:46.323090 mon.b mon.1 172.21.15.187:6789/0 23 : cluster [INF] mon.b calling new monitor election
2017-06-21 20:36:48.338987 mon.a mon.0 172.21.15.3:6789/0 369 : cluster [WRN] MDS daemon 'a-s' is not responding, replacing it as rank 0 with standby 'a'
Updated by Patrick Donnelly almost 7 years ago
Here's another: /a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333703
Updated by John Spray almost 7 years ago
- Subject changed from failed to set counters on mds daemons: set(['mds.dir_split']) to Laggy mon daemons trigger causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))
- Assignee deleted (John Spray)
- Severity changed from 5 - suggestion to 3 - minor
Updated by John Spray almost 7 years ago
- Subject changed from Laggy mon daemons trigger causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split'])) to Laggy mon daemons causing MDS failover (symptom: failed to set counters on mds daemons: set(['mds.dir_split']))
Updated by Patrick Donnelly almost 7 years ago
PR for debugging: https://github.com/ceph/ceph/pull/16278
Updated by Patrick Donnelly almost 7 years ago
- Has duplicate Bug #20614: "[WRN] MDS daemon 'a-s' is not responding, replacing it as rank 0 with standby 'a'" in cluster log added
Updated by Zheng Yan over 6 years ago
Updated by Patrick Donnelly about 6 years ago
- Related to Bug #23519: mds: mds got laggy because of MDSBeacon stuck in mqueue added
Updated by Patrick Donnelly about 6 years ago
- Related to Feature #17309: qa: mon_thrash test for CephFS added
Updated by Patrick Donnelly about 6 years ago
- Category set to Correctness/Safety
- Target version set to v14.0.0
- Source set to Q/A
- Component(FS) MDSMonitor, qa-suite added
- Labels (FS) qa, task(medium) added
Updated by Patrick Donnelly about 6 years ago
- Priority changed from Normal to Urgent
Updated by Patrick Donnelly almost 6 years ago
I have no reason to believe the use of utime_t caused this issue, but it's possible this change could fix it: https://github.com/ceph/ceph/pull/21842/commits/1933bdf37098130342d0debacf4daa0a41011351
Updated by Patrick Donnelly over 5 years ago
- Status changed from 12 to Can't reproduce
Last instance almost a year ago:
tail -n 100 /a/*2017*fs*master*/*/teu* | grep -C150 'The following counters failed to' | tee READ
/ceph/teuthology-archive/teuthology-2017-09-16_03:15:02-fs-master-distro-basic-smithi/1639578/teuthology.log.gz
This has been resolved upstream somehow, apparently.