Bug #10151 (closed): mds client cache pressure health warning oscillates on/off

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Severity: 3 - minor

Description

Seeing this on the lab cluster. Not sure if it is a problem in the MDS health reporting or the mon, but it goes on and off every few seconds. It probably depends on whether you hit the leader mon?
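
A quick way to see the disagreement is to query each monitor directly; the Python sketch below (monitor addresses are placeholders, not the lab cluster's) polls health through each mon in turn using the -m option, so you can see which one is reporting the warning at any moment.

#!/usr/bin/env python3
# Hypothetical diagnostic sketch: poll health through each monitor directly
# to see whether only one of them (the leader) reports the warning.
# The monitor addresses below are placeholders.
import subprocess
import time

MON_ADDRS = ["10.214.131.1:6789", "10.214.131.2:6789", "10.214.131.3:6789"]

while True:
    for mon in MON_ADDRS:
        # "ceph -m <addr>" sends the command to a specific monitor.
        out = subprocess.check_output(
            ["ceph", "-m", mon, "health", "detail"]).decode().strip()
        print(f"{mon} -> {out}")
    print("---")
    time.sleep(5)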

Actions #1

Updated by John Spray over 9 years ago

Yes -- the leader is reporting the health warning but the peons are not.

The warning is "Client 2922132 failing to respond to cache pressure"; the session state is:

[
    { "id": 5443238,
      "num_leases": 0,
      "num_caps": 50766,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.5443238 10.214.131.141:0\/27885",
      "client_metadata": {}},
    { "id": 2922132,
      "num_leases": 0,
      "num_caps": 150,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.2922132 10.214.131.102:0\/951298617",
      "client_metadata": {}},
    { "id": 1756771,
      "num_leases": 0,
      "num_caps": 94,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.1756771 10.214.137.25:0\/1841820156",
      "client_metadata": {}},
    { "id": 4894101,
      "num_leases": 5476,
      "num_caps": 104401,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.4894101 10.214.137.23:0\/2571774570",
      "client_metadata": {}},
    { "id": 1756816,
      "num_leases": 0,
      "num_caps": 1,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.1756816 10.214.137.27:0\/2508210603",
      "client_metadata": {}}]

So aside from the inconsistency between mons, the warning looks bogus, as the named session only has 150 caps.
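
For reference, a dump like the one above comes from the MDS admin socket ("ceph daemon mds.<name> session ls"). Here is a minimal Python sketch, assuming an MDS daemon named "a" and an arbitrary threshold, that sorts the sessions by cap count so the heaviest cap holders stand out:

#!/usr/bin/env python3
# Sketch: list client sessions via the MDS admin socket and sort them by
# num_caps, to cross-check which client is actually holding the most caps.
# The MDS daemon name ("a") and the threshold are assumptions.
import json
import subprocess

MDS_NAME = "a"
THRESHOLD = 10000

out = subprocess.check_output(
    ["ceph", "daemon", f"mds.{MDS_NAME}", "session", "ls"])
sessions = json.loads(out)

for s in sorted(sessions, key=lambda s: s["num_caps"], reverse=True):
    flag = "  <-- holding many caps" if s["num_caps"] > THRESHOLD else ""
    print(f"client.{s['id']}: num_caps={s['num_caps']} "
          f"num_leases={s['num_leases']} state={s['state']}{flag}")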

Actions #2

Updated by John Spray over 9 years ago

  • Status changed from New to In Progress

Reproduced this locally just by running a vstart cluster with 3 mons and following the procedure from the mds_client_limits/_test_client_pin test.
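
For anyone without ceph-qa-suite handy, the same idea can be approximated by hand: hold enough files open on a CephFS mount that the client pins more caps than the MDS cache limit allows, then compare health as reported by each mon. A rough Python sketch follows; the mount point, file count, and a suitably small mds_cache_size are assumptions, not the exact test procedure.

#!/usr/bin/env python3
# Rough reproduction sketch, not the actual test_client_pin code.
# Assumptions: a CephFS mount at MOUNT, and an mds_cache_size small enough
# that NUM_FILES open files exceed it (configured separately). Keeping the
# files open pins their caps, so the client cannot release them when the
# MDS applies cache pressure, which should raise the health warning.
import os

MOUNT = "/mnt/cephfs"    # assumed CephFS mount point
NUM_FILES = 20000        # assumed to exceed the configured MDS cache size

handles = []
for i in range(NUM_FILES):
    f = open(os.path.join(MOUNT, f"pin_{i}"), "w")
    f.write("x")
    handles.append(f)    # keep the handle open so the caps stay pinned

input("Caps pinned; compare 'ceph health detail' on each mon, "
      "then press Enter to release...")

for f in handles:
    f.close()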

Actions #3

Updated by John Spray over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #4

Updated by John Spray over 9 years ago

Opened the PR against master instead of next by mistake. The PR against next is https://github.com/ceph/ceph/pull/2996

Actions #5

Updated by Greg Farnum over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

Merged to master as of commit:aa4d1478647ce416e9cf4e8fcd32411230639f40. I like to let things go through testing before backporting, so I'll let you do that, John.

Actions #6

Updated by John Spray over 9 years ago

  • Status changed from Pending Backport to Resolved

The version on next has a pass on client-limits (the one that exercises health): http://pulpito.front.sepia.ceph.com/sage-2014-12-01_11:11:17-fs-next-distro-basic-multi/628932/

Merged backport to giant:

commit c8b46d68c71f66d4abbda1230741cc4c7284193b
Author: John Spray <john.spray@redhat.com>
Date:   Mon Nov 24 11:00:25 2014 +0000

    mon: fix MDS health status from peons

    The health data was there, but we were attempting
    to enumerate MDS GIDs from pending_mdsmap (empty on
    peons) instead of mdsmap (populated from paxos updates)

    Fixes: #10151
    Backport: giant

    Signed-off-by: John Spray <john.spray@redhat.com>
    (cherry picked from commit 0c33930e3a90f3873b7c7b18ff70dec2894fce29)

    Conflicts:
        src/mon/MDSMonitor.cc
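
To make the failure mode described in the commit message concrete, here is a toy Python model (not Ceph code; the class, field names, and GID are invented for illustration) of why enumerating GIDs from the pending map yields nothing on a peon, while the committed map gives the same answer everywhere:

# Toy illustration only -- not Ceph source. The committed mdsmap is
# replicated to every mon through paxos, while the pending map is only
# built up on the leader as it prepares the next epoch.
class ToyMon:
    def __init__(self, is_leader, committed_gids, health_by_gid):
        self.mdsmap_gids = committed_gids                        # populated on all mons
        self.pending_gids = committed_gids if is_leader else []  # empty on peons
        self.health_by_gid = health_by_gid                       # health data present on all mons

    def health_before_fix(self):
        # Bug: enumerate GIDs from the pending map -> peons find none.
        return [self.health_by_gid[g] for g in self.pending_gids]

    def health_after_fix(self):
        # Fix: enumerate GIDs from the committed map -> leader and peons agree.
        return [self.health_by_gid[g] for g in self.mdsmap_gids]


health = {4107: "failing to respond to cache pressure"}   # GID is made up
leader = ToyMon(True, [4107], health)
peon = ToyMon(False, [4107], health)

print(leader.health_before_fix())   # ['failing to respond to cache pressure']
print(peon.health_before_fix())     # [] -> warning flaps depending on which mon you ask
print(peon.health_after_fix())      # ['failing to respond to cache pressure']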
