Bug #10151
mds client cache pressure health warning oscillates on/off (closed)
Description
Seeing this on the lab cluster. Not sure whether it is a problem in the MDS health reporting or in the mon, but the warning goes on and off every few seconds. Probably depends on whether you hit the leader mon?
Updated by John Spray over 9 years ago
Yes -- the leader is reporting the health warning but the peons are not.
The warning is "Client 2922132 failing to respond to cache pressure", and the session state is:
[
  { "id": 5443238, "num_leases": 0, "num_caps": 50766, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.5443238 10.214.131.141:0\/27885", "client_metadata": {}},
  { "id": 2922132, "num_leases": 0, "num_caps": 150, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.2922132 10.214.131.102:0\/951298617", "client_metadata": {}},
  { "id": 1756771, "num_leases": 0, "num_caps": 94, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.1756771 10.214.137.25:0\/1841820156", "client_metadata": {}},
  { "id": 4894101, "num_leases": 5476, "num_caps": 104401, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.4894101 10.214.137.23:0\/2571774570", "client_metadata": {}},
  { "id": 1756816, "num_leases": 0, "num_caps": 1, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.1756816 10.214.137.27:0\/2508210603", "client_metadata": {}}
]
So aside from the inconsistency between mons, the warning looks bogus, as the named session only has 150 caps.
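To check that reading of the dump, here is a small sketch that loads the session list above (trimmed to the fields it uses) and looks up the caps held by the client named in the warning. It assumes the dump parses as plain JSON as shown; the helper name `caps_for_client` is hypothetical and not part of any Ceph tooling.

```python
import json

# Session list from the comment above, trimmed to the fields used here.
SESSIONS_JSON = """
[
  {"id": 5443238, "num_leases": 0, "num_caps": 50766, "state": "open"},
  {"id": 2922132, "num_leases": 0, "num_caps": 150, "state": "open"},
  {"id": 1756771, "num_leases": 0, "num_caps": 94, "state": "open"},
  {"id": 4894101, "num_leases": 5476, "num_caps": 104401, "state": "open"},
  {"id": 1756816, "num_leases": 0, "num_caps": 1, "state": "open"}
]
"""


def caps_for_client(sessions, client_id):
    """Return num_caps for the session with the given id (hypothetical helper)."""
    for session in sessions:
        if session["id"] == client_id:
            return session["num_caps"]
    raise KeyError(f"no session for client {client_id}")


sessions = json.loads(SESSIONS_JSON)
named = caps_for_client(sessions, 2922132)
largest = max(sessions, key=lambda s: s["num_caps"])

print(f"client.2922132 holds {named} caps")
print(f"largest holder: client.{largest['id']} with {largest['num_caps']} caps")
```

Run against this dump, it reports 150 caps for client.2922132, while the largest holder is client.4894101 with 104401 caps, which is why the warning naming 2922132 looks suspect.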
Updated by John Spray over 9 years ago
- Status changed from New to In Progress
Reproduced this locally by running a vstart cluster with 3 mons and following the procedure from the mds_client_limits/_test_client_pin test.
Updated by John Spray over 9 years ago
- Status changed from In Progress to Fix Under Review
Updated by John Spray over 9 years ago
Opened the PR against master instead of next by mistake. The PR for next is https://github.com/ceph/ceph/pull/2996
Updated by Greg Farnum over 9 years ago
- Status changed from Fix Under Review to Pending Backport
Merged to master as of commit:aa4d1478647ce416e9cf4e8fcd32411230639f40. I like to let things go through testing before backporting, so I'll leave the backport to you, John.
Updated by John Spray over 9 years ago
- Status changed from Pending Backport to Resolved
The version on next has a pass on client-limits (the one that exercises health): http://pulpito.front.sepia.ceph.com/sage-2014-12-01_11:11:17-fs-next-distro-basic-multi/628932/
Merged backport to giant:
commit c8b46d68c71f66d4abbda1230741cc4c7284193b
Author: John Spray <john.spray@redhat.com>
Date:   Mon Nov 24 11:00:25 2014 +0000

    mon: fix MDS health status from peons

    The health data was there, but we were attempting
    to enumerate MDS GIDs from pending_mdsmap (empty on
    peons) instead of mdsmap (populated from paxos updates)

    Fixes: #10151
    Backport: giant

    Signed-off-by: John Spray <john.spray@redhat.com>
    (cherry picked from commit 0c33930e3a90f3873b7c7b18ff70dec2894fce29)

    Conflicts:
        src/mon/MDSMonitor.cc
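The commit message pins the root cause on which map the health code walked: `pending_mdsmap`, which is empty on peons, rather than the committed `mdsmap` that every mon receives through paxos. The sketch below is a simplified Python model of that difference, not Ceph's actual MDSMonitor code; the `MonModel` class, the `health_buggy`/`health_fixed` functions, the GID 4100 and the mon names are all hypothetical stand-ins. It also illustrates the on/off symptom: a client that reaches a different mon each time only sees the warning when the leader answers.

```python
from dataclasses import dataclass, field


@dataclass
class MonModel:
    """Toy stand-in for one monitor's MDS state (hypothetical, not Ceph's types)."""
    name: str
    is_leader: bool
    # Committed map, replicated to every mon through paxos: gid -> MDS name.
    mdsmap: dict = field(default_factory=dict)
    # Proposal under construction; per the commit message, empty on peons.
    pending_mdsmap: dict = field(default_factory=dict)
    # Health messages reported by each MDS daemon, keyed by gid.
    health: dict = field(default_factory=dict)


def health_buggy(mon):
    # Enumerates GIDs from pending_mdsmap: peons find no GIDs, so report nothing.
    return [msg for gid in mon.pending_mdsmap for msg in mon.health.get(gid, [])]


def health_fixed(mon):
    # Enumerates GIDs from the committed mdsmap, which every mon holds.
    return [msg for gid in mon.mdsmap for msg in mon.health.get(gid, [])]


def make_quorum():
    # Hypothetical quorum: one leader and two peons sharing the same
    # committed map and the same MDS health data.
    committed = {4100: "a"}
    warning = {4100: ["Client 2922132 failing to respond to cache pressure"]}
    leader = MonModel("a", True, dict(committed), dict(committed), dict(warning))
    peons = [MonModel(n, False, dict(committed), {}, dict(warning)) for n in ("b", "c")]
    return [leader] + peons


for mon in make_quorum():
    role = "leader" if mon.is_leader else "peon"
    print(mon.name, role, len(health_buggy(mon)), len(health_fixed(mon)))
```

With the buggy enumeration only the leader reports the warning (counts 1, 0, 0 across the three mons), matching what was seen on the lab cluster; with the committed map all three mons report it.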