mon.a says:
2014-11-04 12:53:56.430421 7f096d00b700 10 mon.a@1(leader).mds e18 no beacon from 6220 10.214.134.2:6811/37422 mds.0.3 up:active since 2014-11-04 12:53:39.506502
2014-11-04 12:53:56.430433 7f096d00b700 10 mon.a@1(leader).mds e18 replacing 6220 10.214.134.2:6811/37422 mds.0.3 up:active with 7525/a 10.214.136.134:6811/29541
2014-11-04 12:54:36.433472 7f096d00b700 10 mon.a@1(leader).mds e18 no beacon from 6220 10.214.134.2:6811/37422 mds.0.3 up:active since 2014-11-04 12:54:18.531904
2014-11-04 12:54:36.433487 7f096d00b700 10 mon.a@1(leader).mds e18 replacing 6220 10.214.134.2:6811/37422 mds.0.3 up:active with 7525/a 10.214.136.134:6811/29541
The second message is a bit weird, because the 'since' time isn't a time at which we actually received a beacon. Also, why are we going through the same replacement twice?
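For context, the mon's laggy check works roughly like this (a minimal sketch, not the actual Ceph code; the grace constant stands in for the `mds_beacon_grace` option, which defaults to 15 s):

```python
# Hypothetical sketch of the monitor's beacon-timeout check (not the real
# Ceph implementation). The mon tracks the last beacon time per MDS GID and
# proposes a replacement for any MDS whose beacon is older than the grace.

MDS_BEACON_GRACE = 15.0  # seconds; stands in for the mds_beacon_grace option


def check_beacons(last_beacon, standbys, now):
    """Return (gid, replacement_gid) pairs for MDSs that should be replaced.

    last_beacon: dict mapping gid -> timestamp of last received beacon
    standbys: list of standby gids available as replacements
    now: current time
    """
    to_replace = []
    for gid, since in last_beacon.items():
        if now - since > MDS_BEACON_GRACE and standbys:
            # corresponds to "no beacon from <gid> ... since <since>"
            to_replace.append((gid, standbys[0]))
    return to_replace
```

With a 15 s grace, a beacon last seen at 12:53:39 would trip the check at 12:53:56, consistent with the first log line above. The puzzle is why the check fires a second time with a 'since' stamp that never appears as a received beacon.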
2014-11-04 12:53:20.869914 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 210 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:20.870211 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 50 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:20.870433 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 51 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:20.871085 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 211 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:21.424787 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 52 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:23.506505 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 212 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:25.424944 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 53 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:27.506369 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 213 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:29.425082 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 54 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:31.506457 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 214 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:33.425300 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 55 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:35.506452 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 215 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:37.425595 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 56 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:53:39.506443 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(6220/a-s up:active seq 216 v18) v3 from mds.0 10.214.134.2:6811/37422
2014-11-04 12:53:41.425912 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 57 v18) v3 from mds.? 10.214.136.134:6811/29541
2014-11-04 12:54:18.538614 7f096c80a700 10 mon.a@1(leader).paxosservice(mdsmap 1..18) dispatch mdsbeacon(7525/a up:standby seq 63 v18) v3 from mds.? 10.214.136.134:6811/29541
There's lots of "no longer laggy" spam in the MDS log, so either the mons are not responding promptly or something's wrong with the way we're reporting laggy state.
The weird thing then is that the last MDS log messages are stamped 12:54:04.056040, ~30 seconds before the mon decided to replace this MDS for being laggy.
I note that the failing MDS (a-s) was sharing a host with two mons and two OSDs.
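The gap can be confirmed with a quick timestamp subtraction (timestamps taken from the logs above):

```python
# Gap between the MDS's last log message and the mon's replacement decision.
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
last_mds_msg = datetime.strptime("2014-11-04 12:54:04.056040", fmt)
replacement = datetime.strptime("2014-11-04 12:54:36.433472", fmt)
gap = (replacement - last_mds_msg).total_seconds()
print(gap)  # → 32.377432 seconds, roughly twice the 15 s beacon grace
```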
The relevant period in the cluster log:
2014-11-04 12:54:01.464986 mon.0 10.214.134.2:6789/0 6 : cluster [INF] mdsmap e18: 1/1/1 up {0=a-s=up:active}, 1 up:standby
2014-11-04 12:54:01.465174 mon.0 10.214.134.2:6789/0 7 : cluster [INF] osdmap e9: 4 osds: 4 up, 4 in
2014-11-04 12:54:02.508375 mon.0 10.214.134.2:6789/0 11 : cluster [INF] scrub ok on 0,1,2: ScrubResult(keys {auth=64,logm=877,mdsmap=20,monmap=3,osdmap=21,pgmap=281,pgmap_meta=6,pgmap_osd=4,pgmap_pg=528} crc {auth=3204458802,logm=3129699316,mdsmap=1435440698,monmap=650486722,osdmap=3984225998,pgmap=2217722726,pgmap_meta=1976240145,pgmap_osd=1705388810,pgmap_pg=213665068})
2014-11-04 12:54:18.531924 mon.1 10.214.136.134:6789/0 44 : cluster [INF] mdsmap e18: 1/1/1 up {0=a-s=up:active}, 1 up:standby
2014-11-04 12:54:18.532039 mon.1 10.214.136.134:6789/0 45 : cluster [INF] osdmap e9: 4 osds: 4 up, 4 in
2014-11-04 12:54:36.439108 mon.1 10.214.136.134:6789/0 50 : cluster [INF] osdmap e10: 4 osds: 4 up, 4 in
2014-11-04 12:54:36.443530 mon.1 10.214.136.134:6789/0 51 : cluster [INF] mdsmap e19: 1/1/1 up {0=a=up:replay}
2014-11-04 12:54:37.455944 mon.1 10.214.136.134:6789/0 54 : cluster [INF] mdsmap e20: 1/1/1 up {0=a=up:reconnect}
2014-11-04 12:54:39.274024 mon.0 10.214.134.2:6789/0 6 : cluster [INF] mdsmap e21: 1/1/1 up {0=a=up:rejoin}
2014-11-04 12:54:39.274099 mon.0 10.214.134.2:6789/0 7 : cluster [INF] osdmap e10: 4 osds: 4 up, 4 in
2014-11-04 12:54:40.272907 mon.0 10.214.134.2:6789/0 11 : cluster [INF] scrub ok on 0,1,2: ScrubResult(keys {auth=65,logm=895,mdsmap=23,monmap=3,osdmap=23,pgmap=287,pgmap_meta=6,pgmap_osd=4,pgmap_pg=528} crc {auth=4175571516,logm=1239312095,mdsmap=136519059,monmap=650486722,osdmap=1911257321,pgmap=1019801681,pgmap_meta=930017212,pgmap_osd=2681313588,pgmap_pg=3075239165})
In the audit log nothing but a bunch of scrubs:
2014-11-04 12:54:02.461793 mon.1 10.214.136.134:6789/0 38 : audit [INF] from='client.? 10.214.136.134:0/1014326' entity='client.admin' cmd=[{"prefix": "scrub"}]: dispatch
2014-11-04 12:54:02.464953 mon.0 10.214.134.2:6789/0 10 : audit [INF] from='client.7924 :/0' entity='forwarded-request' cmd=[{"prefix": "scrub"}]: dispatch
2014-11-04 12:54:40.233713 mon.1 10.214.136.134:6789/0 60 : audit [INF] from='client.? 10.214.136.134:0/1014757' entity='client.admin' cmd=[{"prefix": "scrub"}]: dispatch
2014-11-04 12:54:40.235130 mon.0 10.214.134.2:6789/0 10 : audit [INF] from='client.7939 :/0' entity='forwarded-request' cmd=[{"prefix": "scrub"}]: dispatch