Bug #21604

open

mds: may recall all client caps (?) because dirty/pinned metadata is not flushed

Added by Patrick Donnelly over 6 years ago. Updated about 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Testing tasks.cephfs.test_client_limits.TestClientLimits.test_client_pin with this patch:

$ git diff
diff --git a/qa/tasks/cephfs/test_client_limits.py b/qa/tasks/cephfs/test_client_limits.py
index 14ac2c4..1789b7f 100644
--- a/qa/tasks/cephfs/test_client_limits.py
+++ b/qa/tasks/cephfs/test_client_limits.py
@@ -90,8 +90,8 @@ class TestClientLimits(CephFSTestCase):
                 raise
             elif num_caps == mds_min_caps_per_client:
                 return True
-            elif num_caps == int((1.0-mds_max_ratio_caps_per_client)*(open_files+2)):
-                return True
+            #elif num_caps == int((1.0-mds_max_ratio_caps_per_client)*(open_files+2)):
+                #return True
             else:
                 return False
$ ../src/vstart.sh -d -b -l -n
$ python2 ../qa/tasks/vstart_runner.py --interactive tasks.cephfs.test_client_limits.TestClientLimits.test_client_pin
[...]
2017-09-29 10:27:13,124.124 INFO:tasks.cephfs.test_client_limits:Terminating process holding files open
2017-09-29 10:27:13,124.124 INFO:__main__:kill
2017-09-29 10:27:13,125.125 INFO:__main__:kill: killing pid 8534 (['python', '-c', '\nimport sys\nimport time\nimport os\n\nn = 800\nabs_path = "/tmp/tmpXH62es/mnt.0/subdir/mount_a"\n\nif not os.path.exists(os.path.dirname(abs_path)):\n    os.makedirs(os.path.dirname(abs_path))\n\nhandles = []\nfor i in range(0, n):\n    fname = "{0}_{1}".format(abs_path, i)\n    handles.append(open(fname, \'w\'))\n\nwhile True:\n    time.sleep(1)\n'])
2017-09-29 10:27:13,143.143 INFO:__main__:run args=['./bin/ceph', 'fs', 'dump', '--format=json']
2017-09-29 10:27:13,143.143 INFO:__main__:Running ['./bin/ceph', 'fs', 'dump', '--format=json']
2017-09-29 10:27:13,483.483 INFO:__main__:run args=['./bin/ceph', 'daemon', 'mds.c', 'session', 'ls']
2017-09-29 10:27:13,484.484 INFO:__main__:Running ['./bin/ceph', 'daemon', 'mds.c', 'session', 'ls']
2017-09-29 10:27:13,662.662 INFO:tasks.cephfs.filesystem:_json_asok output: [
    {
        "id": 4227,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 1,
        "reconnecting": false,
        "inst": "client.4227 127.0.0.1:0/1930023096",
        "client_metadata": {
            "ceph_sha1": "1aef50a1edbd2a41e7694ffb61bfd546e5648bf9",
            "ceph_version": "ceph version 12.1.2-2473-g1aef50a (1aef50a1edbd2a41e7694ffb61bfd546e5648bf9) mimic (dev)",
            "entity_id": "1",
            "hostname": "senta02",
            "mount_point": "/tmp/tmpXH62es/mnt.1",
            "pid": "8112",
            "root": "/" 
        }   
    },  
    {
        "id": 4229,
        "num_leases": 0,
        "num_caps": 802,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 800,
        "reconnecting": false,
        "inst": "client.4229 127.0.0.1:0/3107993577",
        "client_metadata": {
            "ceph_sha1": "1aef50a1edbd2a41e7694ffb61bfd546e5648bf9",
            "ceph_version": "ceph version 12.1.2-2473-g1aef50a (1aef50a1edbd2a41e7694ffb61bfd546e5648bf9) mimic (dev)",
            "entity_id": "0",
            "hostname": "senta02",
            "mount_point": "/tmp/tmpXH62es/mnt.0",
            "pid": "8075",
            "root": "/" 
        }
    }
]

2017-09-29 10:27:18,667.667 INFO:__main__:run args=['./bin/ceph', 'fs', 'dump', '--format=json']
2017-09-29 10:27:18,742.742 INFO:__main__:Running ['./bin/ceph', 'fs', 'dump', '--format=json']
2017-09-29 10:27:20,075.075 INFO:__main__:run args=['./bin/ceph', 'daemon', 'mds.c', 'session', 'ls']
2017-09-29 10:27:20,075.075 INFO:__main__:Running ['./bin/ceph', 'daemon', 'mds.c', 'session', 'ls']
2017-09-29 10:27:20,245.245 INFO:tasks.cephfs.filesystem:_json_asok output: [
    {
        "id": 4227,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 1,
        "reconnecting": false,
        "inst": "client.4227 127.0.0.1:0/1930023096",
        "client_metadata": {
            "ceph_sha1": "1aef50a1edbd2a41e7694ffb61bfd546e5648bf9",
            "ceph_version": "ceph version 12.1.2-2473-g1aef50a (1aef50a1edbd2a41e7694ffb61bfd546e5648bf9) mimic (dev)",
            "entity_id": "1",
            "hostname": "senta02",
            "mount_point": "/tmp/tmpXH62es/mnt.1",
            "pid": "8112",
            "root": "/" 
        }
    },
    {
        "id": 4229,
        "num_leases": 0,
        "num_caps": 100,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 800,
        "reconnecting": false,
        "inst": "client.4229 127.0.0.1:0/3107993577",
        "client_metadata": {
            "ceph_sha1": "1aef50a1edbd2a41e7694ffb61bfd546e5648bf9",
            "ceph_version": "ceph version 12.1.2-2473-g1aef50a (1aef50a1edbd2a41e7694ffb61bfd546e5648bf9) mimic (dev)",
            "entity_id": "0",
            "hostname": "senta02",
            "mount_point": "/tmp/tmpXH62es/mnt.0",
            "pid": "8075",
            "root": "/" 
        }
    }
]

2017-09-29 10:27:20,245.245 INFO:__main__:run args=['./bin/ceph', 'log', 'Ended test tasks.cephfs.test_client_limits.TestClientLimits.test_client_pin']
2017-09-29 10:27:20,246.246 INFO:__main__:Running ['./bin/ceph', 'log', 'Ended test tasks.cephfs.test_client_limits.TestClientLimits.test_client_pin']

The MDS is recalling the client's caps down to the minimum because it does not (or cannot) flush its dirty metadata, which leaves the inodes pinned in cache:

2017-09-29 10:27:18.441826 7f19cf4e5700  7 mds.0.cache trim bytes_used=1MB limit=1GB reservation=0.05% count=0
2017-09-29 10:27:18.441841 7f19cf4e5700  7 mds.0.cache trim_lru trimming 0 items from LRU size=811 mid=0 pintail=811 pinned=811
2017-09-29 10:27:18.441851 7f19cf4e5700  7 mds.0.cache trim_lru trimmed 0 items
2017-09-29 10:27:18.441855 7f19cf4e5700 10 mds.0.cache trim_client_leases
2017-09-29 10:27:18.442331 7f19cf4e5700  2 mds.0.cache check_memory_usage total 430556, rss 48376, heap 313848, baseline 313848, buffers 0, 160 / 813 inodes have caps, 161 caps, 0.198032 caps per inode
2017-09-29 10:27:18.442367 7f19cf4e5700 10 mds.0.server recall_client_state 0.2, caps per client 100-128
2017-09-29 10:27:18.442373 7f19cf4e5700 10 mds.0.server  session client.4229 127.0.0.1:0/3107993577 caps 160, leases 0
2017-09-29 10:27:18.442381 7f19cf4e5700 10 mds.0.16 send_message_client client.4229 127.0.0.1:0/3107993577 client_session(recall_state max_caps 100 max_leases 0) v2
2017-09-29 10:27:18.442387 7f19cf4e5700  1 -- 127.0.0.1:6848/563895976 --> 127.0.0.1:0/3107993577 -- client_session(recall_state max_caps 100 max_leases 0) v2 -- 0x557ffeed2d80 con 0

Once the client reaches the minimum number of caps, the test ends.
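
For context, a rough, hedged sketch (not the MDS code) of how the numbers in the log above line up, assuming the per-session recall target is clamped between mds_min_caps_per_client and a maximum derived from mds_max_ratio_caps_per_client times the total cap count, then scaled by the recall ratio (0.2 here):

# Illustration only, NOT the MDS implementation. It merely reproduces
# "caps per client 100-128" and "recall_state max_caps 100" from the log above,
# using values visible there: 161 caps total, client.4229 holding 160.
def recall_limit(session_caps, total_caps, ratio,
                 mds_min_caps_per_client=100,
                 mds_max_ratio_caps_per_client=0.8):
    max_caps = int(total_caps * mds_max_ratio_caps_per_client)  # 0.8 * 161 = 128
    newlim = min(int(session_caps * ratio), max_caps)           # min(160 * 0.2, 128) = 32
    return max(newlim, mds_min_caps_per_client)                 # clamped up to the minimum: 100

print(recall_limit(session_caps=160, total_caps=161, ratio=0.2))  # -> 100

Note that trimming cannot relieve the pressure here: the trim_lru line above reports pinned=811 and "trimmed 0 items", because the unflushed dirty metadata keeps every inode pinned.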

Using "flush journal" on the admin socket causes the MDS to finally flush the dirty metadata and unpin the inodes allowing the cache to shrink.

#1

Updated by Greg Farnum over 6 years ago

This sounds like something controlled by config options. The MDS wants to batch up data when flushing to avoid gratuitous numbers of IOs. I don't know what the default sizes are, but if you're running a test with only one client and a very small cache, those numbers may be much smaller than the default minimum size.
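
The ticket does not name the options; a guess at the relevant flush-batching knobs on the vstart cluster above would be mds_log_max_segments and mds_log_events_per_segment, e.g.:

import subprocess

CEPH = "./bin/ceph"   # vstart path from the log above
MDS = "mds.c"

# These option names are a guess at the flush-batching knobs meant above;
# they are not identified anywhere in this ticket.
for opt in ("mds_log_max_segments", "mds_log_events_per_segment"):
    # "config get" over the admin socket prints the option's current value.
    print(subprocess.check_output([CEPH, "daemon", MDS, "config", "get", opt]))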

#2

Updated by Patrick Donnelly over 6 years ago

  • Assignee set to Patrick Donnelly
  • Priority changed from High to Immediate
#3

Updated by Patrick Donnelly over 6 years ago

  • Priority changed from Immediate to Normal
#4

Updated by Patrick Donnelly about 5 years ago

  • Assignee deleted (Patrick Donnelly)