Bug #58846

Large snapshot delete causing locking "issues".

Added by Brian Woods about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sorry for the generic title, but several things seem broken, so I am not 100% sure how to title this one.

I have a rather large folder (about 19 TB) that had a half-dozen snapshots on it. The other night I deleted them. Things got busy as expected, but a few apps that were doing IO in that folder just stopped. I killed and restarted them, but they locked up again.

I started a tar of the folder to /dev/zero and it was going along fine until it hit some file in that folder (I didn't check which file, but can if wanted), and then it also just stopped.
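
(The read test was roughly the following; the path is a placeholder:)

# tar cf /dev/zero /path/to/the/folder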

So I checked 'ceph -s' and the snaptrim count was catching up (I think there were about 60 PGs in snaptrim states), so I left it until the morning, as this isn't a critical folder. About 1.3 TB of RAW space was recovered, but in the morning the tar command was still hung and some PGs appeared to be stuck.

This has been the state for over 24 hours (see after the list for how this per-PG detail can be pulled):

 - 15.79 active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
 - 15.68 active+clean+snaptrim_wait - snaptrim_duration: 594.114416225 - Queue Len: 7
 - 15.60 active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
 - 15.0 active+clean+snaptrim       - snaptrim_duration: 0.050225681   - Queue Len: 8
 - 15.14 active+clean+snaptrim      - snaptrim_duration: 5.939379017   - Queue Len: 8
 - 15.2a active+clean+snaptrim_wait - snaptrim_duration: 86.089696106  - Queue Len: 7
 - 15.2e active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
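
(For reference, per-PG detail like the above can be pulled with something along these lines; a rough sketch, since exact column names vary by release, and 15.68 is just one of the stuck PGs:)

# ceph pg ls snaptrim snaptrim_wait
# ceph pg dump pgs | grep snaptrim      # the SNAPTRIMQ_LEN column is the queue length
# ceph pg 15.68 query                   # full per-PG detail, including the snap trim queue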

I did a rolling restart of all of the MDSs and MON servers, but no change.
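
(Assuming a cephadm deployment, which the MDS daemon naming suggests, the rolling restart amounts to something like the following; the mon daemon name is a placeholder:)

# ceph orch ps --daemon-type mds
# ceph orch daemon restart mds.mds-default.SERVERNAME.ptkjle
# ceph orch ps --daemon-type mon
# ceph orch daemon restart mon.SERVERNAME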

Here is the current ceph -s:

    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            64 pgs not deep-scrubbed in time
            71 pgs not scrubbed in time
            2 pools have too few placement groups
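
(ceph health detail breaks those warnings down further; it names the client that is failing to release caps and lists the affected PGs. I can paste that output if it is useful:)

# ceph health detail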

I did find this KB:
https://access.redhat.com/solutions/3399771

I found "failed to rdlock, waiting" and tried kicking clients per the article, but that didn't seem to help.

# ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.55807368:11991113 
getattr pAsLsXsFs #0x10000e3fb93 2023-02-24T02:57:53.743251+0000
RETRY=1 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2023-02-24T18:51:32.633289+0000",
            "age": 4303.5721312469996,
            "duration": 4303.5721652660004,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.55807368:11991113",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.55807368",
                    "tid": 11991113
                },
                "events": [
                    {
                        "time": "2023-02-24T18:51:32.633289+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633290+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633289+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633302+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633348+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2023-02-24T18:51:35.531203+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
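
(In case it helps, the stuck ops can be pulled out of that dump with something like this, assuming jq is available:)

# ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_ops_in_flight | \
    jq '.ops[] | select(.type_data.flag_point == "failed to rdlock, waiting") | {client: .type_data.client_info.client, age, description}'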

The only other thing I can think to do is a complete cluster shutdown and restart.
