Bug #22487 (closed): mds: setattr blocked when metadata pool is full

Added by Patrick Donnelly over 6 years ago. Updated about 6 years ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With both of these PRs applied on mimic-dev1:

https://github.com/ceph/ceph/pull/19588
https://github.com/ceph/ceph/pull/19602

$ python2 ../qa/tasks/vstart_runner.py --interactive tasks.cephfs.test_full.TestClusterFull.test_full_different_file
...
2017-12-19 16:49:11,247.247 INFO:tasks.cephfs.test_full:Writing 66MB should fill this cluster
2017-12-19 16:49:11,247.247 INFO:__main__:run args=['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:11,248.248 INFO:__main__:Running ['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:11,259.259 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /tmp/tmp7IWgkx/mnt.0
2017-12-19 16:49:11,260.260 INFO:__main__:run args=['dd', 'if=/dev/urandom', 'of=large_file_a', 'bs=1M', 'conv=fdatasync', 'count=33', 'seek=0']
2017-12-19 16:49:11,260.260 INFO:__main__:Running ['dd', 'if=/dev/urandom', 'of=large_file_a', 'bs=1M', 'conv=fdatasync', 'count=33', 'seek=0']
2017-12-19 16:49:15,319.319 INFO:__main__:run args=['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:15,320.320 INFO:__main__:Running ['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:15,331.331 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /tmp/tmp7IWgkx/mnt.0
2017-12-19 16:49:15,331.331 INFO:__main__:run args=['dd', 'if=/dev/urandom', 'of=large_file_b', 'bs=1M', 'conv=fdatasync', 'count=33', 'seek=0']
2017-12-19 16:49:15,331.331 INFO:__main__:Running ['dd', 'if=/dev/urandom', 'of=large_file_b', 'bs=1M', 'conv=fdatasync', 'count=33', 'seek=0']
2017-12-19 16:49:19,304.304 INFO:tasks.cephfs.test_full:Writing file B succeeded (full status will happen soon)
2017-12-19 16:49:19,305.305 INFO:__main__:run args=['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2017-12-19 16:49:19,305.305 INFO:__main__:Running ['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2017-12-19 16:49:24,644.644 INFO:__main__:run args=['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2017-12-19 16:49:24,645.645 INFO:__main__:Running ['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2017-12-19 16:49:24,996.996 INFO:__main__:run args=['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:24,996.996 INFO:__main__:Running ['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmp7IWgkx/mnt.0']
2017-12-19 16:49:25,008.008 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /tmp/tmp7IWgkx/mnt.0
2017-12-19 16:49:25,008.008 INFO:__main__:run args=['dd', 'if=/dev/urandom', 'of=large_file_b', 'bs=1M', 'conv=fdatasync', 'count=50', 'seek=33']
2017-12-19 16:49:25,008.008 INFO:__main__:Running ['dd', 'if=/dev/urandom', 'of=large_file_b', 'bs=1M', 'conv=fdatasync', 'count=50', 'seek=33']
$ bin/ceph --admin-daemon /tmp/ceph-asok.pu3EX7/mds.c.asok dump_blocked_ops
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
{
    "ops": [
        {
            "description": "client_request(client.4230:8 setattr size=34603008 #0x10000000001 2017-12-19 16:49:25.020565 caller_uid=1163, caller_gid=1163{27,119,1163,})",
            "initiated_at": "2017-12-19 16:49:25.020873",
            "age": 133.987241,
            "duration": 133.987273,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": "client.4230:8",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.4230",
                    "tid": 8
                },
                "events": [
                    {
                        "time": "2017-12-19 16:49:25.020873",
                        "event": "initiated" 
                    },
                    {
                        "time": "2017-12-19 16:49:25.021836",
                        "event": "failed to xlock, waiting" 
                    },
                    {
                        "time": "2017-12-19 16:49:25.023219",
                        "event": "failed to xlock, waiting" 
                    }
                ]
            }
        }
    ],
    "complaint_time": 30.000000,
    "num_blocked_ops": 1
}
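
For anyone re-running this: a minimal sketch (mine, not part of the original report) of polling the MDS admin socket for blocked ops with the same dump_blocked_ops command used above. The socket path is a placeholder for whatever vstart generated (e.g. the /tmp/ceph-asok.* path in the transcript), and the mds name is an example.

#!/usr/bin/env python2
# Sketch: poll an MDS admin socket until a blocked op shows up.
# ASOK is a placeholder; substitute the mds asok path from your vstart run.
import json
import subprocess
import time

ASOK = '/tmp/ceph-asok.XXXXXX/mds.c.asok'  # hypothetical path

def dump_blocked_ops(asok):
    # Same command as in the transcript: ceph --admin-daemon <asok> dump_blocked_ops
    out = subprocess.check_output(['./bin/ceph', '--admin-daemon', asok, 'dump_blocked_ops'])
    return json.loads(out[out.index('{'):])  # skip any vstart banner text

while True:
    dump = dump_blocked_ops(ASOK)
    if dump.get('num_blocked_ops', 0) > 0:
        for op in dump['ops']:
            print('%s: %s' % (op['type_data']['flag_point'], op['description']))
        break
    time.sleep(5)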

Related issues: 1 (1 open, 0 closed)

Related to CephFS - Bug #22488: mds: unlink blocks on large file when metadata pool is full (Status: New, 12/19/2017)

Actions #1

Updated by Patrick Donnelly over 6 years ago

  • Related to Bug #22488: mds: unlink blocks on large file when metadata pool is full added
Actions #2

Updated by Zheng Yan over 6 years ago

I reproduced this locally.

It was caused by a stuck log flush:

[root@zhyan-alpha build]# bin/ceph daemon mds.a dump_ops_in_flight
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
{
    "ops": [
        {
            "description": "client_request(client.4155:8 setattr size=557842432 #0x10000000001 2017-12-21 10:17:59.001781 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2017-12-21 10:17:59.001834",
            "age": 371.770464,
            "duration": 371.770485,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": "client.4155:8",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.4155",
                    "tid": 8
                },
                "events": [
                    {
                        "time": "2017-12-21 10:17:59.001834",
                        "event": "initiated" 
                    },
                    {
                        "time": "2017-12-21 10:17:59.001921",
                        "event": "failed to xlock, waiting" 
                    },
                    {
                        "time": "2017-12-21 10:17:59.002146",
                        "event": "failed to xlock, waiting" 
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
[root@zhyan-alpha build]# bin/ceph daemon mds.a objecter_requests
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
{
    "ops": [
        {
            "tid": 67,
            "pg": "7.6e5f474",
            "osd": 0,
            "object_id": "200.00000001",
            "object_locator": "@7",
            "target_object_id": "200.00000001",
            "target_object_locator": "@7",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "262226s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-12-21 10:17:59.0.001474s",
            "osd_ops": [
                "write 75939~4420 [fadvise_dontneed]" 
            ]
        },
        {
            "tid": 68,
            "pg": "7.844f3494",
            "osd": 0,
            "object_id": "200.00000000",
            "object_locator": "@7",
            "target_object_id": "200.00000000",
            "target_object_locator": "@7",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "262226s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-12-21 10:17:59.0.001545s",
            "osd_ops": [
                "writefull 0~90 [fadvise_dontneed]" 
            ]
        },
        {
            "tid": 69,
            "pg": "7.6e5f474",
            "osd": 0,
            "object_id": "200.00000001",
            "object_locator": "@7",
            "target_object_id": "200.00000001",
            "target_object_locator": "@7",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "262226s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-12-21 10:17:59.0.002128s",
            "osd_ops": [
                "write 80359~1935 [fadvise_dontneed]" 
            ]
        }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
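
All three pending ops above are writes to 200.* objects, i.e. the rank 0 MDS journal, in pool 7 (which here appears to be the metadata pool), consistent with a journal flush that cannot complete while the pool is full. As an aside, a small sketch (mine, not from this comment) of flagging such stuck journal writes from the same objecter_requests output; the mds name is an example:

# Sketch (python2): list MDS journal writes still pending in the objecter.
import json
import subprocess

out = subprocess.check_output(['./bin/ceph', 'daemon', 'mds.a', 'objecter_requests'])
reqs = json.loads(out[out.index('{'):])  # skip any vstart banner text

for op in reqs.get('ops', []):
    # Rank 0's journal objects are named 200.<segment>, as in the dump above.
    if op['object_id'].startswith('200.'):
        print('pending journal write: tid=%s object=%s ops=%s' % (
            op['tid'], op['object_id'], op['osd_ops']))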

The cap flush process held the wrlock:

[inode 0x10000000001 [2,head] /large_file_b auth v54 pv58 ap=4+0 dirtyparent s=439676928 n(v0 b439676928 1=1+0) (iauth excl) (ifile lock->excl w=2) (ixattr excl) (iversion lock w=1 last_client=4155) cr={4155=0-880803840@1} caps={4155=pAsxLsXsxFcb/pAsxXsxFsxcrwb@19},l=4155 | ptrwaiter=2 request=1 lock=3 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x55555eff4700]

I could only reproduce this with the vstart cluster's default conf:

        osd failsafe full ratio = .99
        mon osd nearfull ratio = .99
        mon osd backfillfull ratio = .99
        mon osd full ratio = .99

I can't reproduce it with the full test's config:

        mon osd nearfull ratio: 0.6
        mon osd backfillfull ratio: 0.6
        mon osd full ratio: 0.7
        osd failsafe full ratio: 1.0
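
For context, only the ratios differ between the two configs. A minimal sketch (mine, assuming a running vstart cluster and luminous-style OSDMap field names) of printing the ratios the cluster is actually using, via the same ceph osd dump call the test makes above:

# Sketch (python2): show the full ratios carried by the current OSDMap.
import json
import subprocess

out = subprocess.check_output(['./bin/ceph', 'osd', 'dump', '--format=json'])
osdmap = json.loads(out[out.index('{'):])  # skip any vstart banner text

for key in ('full_ratio', 'backfillfull_ratio', 'nearfull_ratio'):
    print('%s: %s' % (key, osdmap.get(key)))  # fields assumed present in luminous+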

Actions #3

Updated by Patrick Donnelly over 6 years ago

Presumably that is because with the vstart config the MDS writes cannot actually complete, whereas with the test config (0.7 full ratio) the writes can still proceed because the MDS is treated specially. Right?

Actions #4

Updated by Zheng Yan over 6 years ago

Right. The full test should have no problem.

Actions #5

Updated by Patrick Donnelly about 6 years ago

  • Status changed from New to Rejected