Bug #58846

open

Large snapshot delete causing locking "issues".

Added by Brian Woods about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sorry for the generic title, but several things seem broken, so I am not 100% sure how to title this one.

I have a rather large folder (about 19 TB) that had a half-dozen snapshots on it. The other night I deleted them. Things got busy as expected, but a few apps that were doing IO in that folder just stopped. I killed and restarted them, but they locked up again.

I started a tar of the folder to /dev/zero, and it was going along fine until it hit some file in that folder (I didn't check which file, but can if wanted) and it also just stopped.

So I checked 'ceph -s' and the snaptrim count was catching up (I think there were about 60 PGs), so I left it until the morning since this isn't a critical folder. About 1.3 TB of RAW space was recovered, but in the morning the tar command was still locked and some PGs appeared to be stuck.

This has been the state for over 24 hours:

 - 15.79 active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
 - 15.68 active+clean+snaptrim_wait - snaptrim_duration: 594.114416225 - Queue Len: 7
 - 15.60 active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
 - 15.0 active+clean+snaptrim       - snaptrim_duration: 0.050225681   - Queue Len: 8
 - 15.14 active+clean+snaptrim      - snaptrim_duration: 5.939379017   - Queue Len: 8
 - 15.2a active+clean+snaptrim_wait - snaptrim_duration: 86.089696106  - Queue Len: 7
 - 15.2e active+clean+snaptrim_wait - snaptrim_duration: 0             - Queue Len: 8
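
For reference, a listing like the one above can be pulled straight from the cluster with something roughly like this (it assumes the pg stats JSON exposes snaptrimq_len and snaptrim_duration, as recent releases do; field names may differ on older versions):

    # list PGs that are trimming or waiting to trim, with queue length and
    # the duration of the last trim pass
    ceph pg ls snaptrim snaptrim_wait -f json | \
      jq -r '(.pg_stats // .)[] |
             "\(.pgid) \(.state) queue=\(.snaptrimq_len) dur=\(.snaptrim_duration)"'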

I did a rolling restart of all of the MDSs and MON servers, but no change.

Here is the current ceph -s:

    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            64 pgs not deep-scrubbed in time
            71 pgs not scrubbed in time
            2 pools have too few placement groups

I did find this KB:
https://access.redhat.com/solutions/3399771

I found "failed to rdlock, waiting" and tried kicking clients per the article, but that didn't seem to help.
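
In case it helps anyone else landing here, the "kick the client" step amounts to roughly the following; the MDS name and client id are the ones from this cluster's op dump below, and eviction blocklists the client, so it is very much a last resort:

    # list the MDS sessions and find the one sitting on the caps
    ceph tell mds.mds-default.SERVERNAME.ptkjle session ls
    # evict that session (the client will be blocklisted until it remounts)
    ceph tell mds.mds-default.SERVERNAME.ptkjle client evict id=55807368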

# ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.55807368:11991113 
getattr pAsLsXsFs #0x10000e3fb93 2023-02-24T02:57:53.743251+0000
RETRY=1 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2023-02-24T18:51:32.633289+0000",
            "age": 4303.5721312469996,
            "duration": 4303.5721652660004,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.55807368:11991113",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.55807368",
                    "tid": 11991113
                },
                "events": [
                    {
                        "time": "2023-02-24T18:51:32.633289+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633290+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633289+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633302+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2023-02-24T18:51:32.633348+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2023-02-24T18:51:35.531203+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
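
The same admin socket should also answer the two narrower queries below; the command names follow the standard ops-tracker interface shared with the OSDs, so treat their availability on this release as an assumption:

    # only the ops that are currently blocked
    ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_blocked_ops
    # recently completed slow ops, for comparison
    ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_historic_ops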

The only other thing I can think to do is a complete cluster shutdown and restart.

Actions #1

Updated by Brian Woods about 1 year ago

I forgot to add that this is a cephadm 17.2.4 deployment on Ubuntu 20.04 hosts.

Actions #2

Updated by Brian Woods about 1 year ago

Update: after about 3 days, the trimming PGs suddenly cleared without intervention... Not a clue why.

I still have MDS reporting slow requests, but I assume that is a stuck client.
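
If the multi-day trim time comes up again, the OSD options that pace snaptrim are worth a look; the option names below exist in Quincy, but whether tuning them would have made a difference here is untested:

    # sleep between individual trim operations, per device class
    ceph config get osd osd_snap_trim_sleep_hdd
    ceph config get osd osd_snap_trim_sleep_ssd
    # how many PGs a single OSD will trim at the same time
    ceph config get osd osd_pg_max_concurrent_snap_trims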

Actions #3

Updated by Brian Woods about 1 year ago

After mounting and re-mounting the volume, I am still hitting a problem where attempting to read a file locks up the thread.

For posterity, the file that locks up is:
/DFS/MediaStore/Backups/Dockers/data/1/1386
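
To tie that path back to what the MDS is complaining about, the inode can be looked up from a client mount and then inspected on the MDS. The "dump inode" command is assumed to be available on this release; if it is not, skip that step:

    # decimal inode of the problem file, from any client that has it mounted
    ino=$(stat -c %i /DFS/MediaStore/Backups/Dockers/data/1/1386)
    printf '0x%x\n' "$ino"
    # ask the MDS what it knows about that inode (caps, locks, which clients);
    # "dump inode" availability is an assumption here
    ceph tell mds.mds-default.SERVERNAME.ptkjle dump inode "0x$(printf '%x' "$ino")"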

And I still show this in my ceph -s:

    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            1 clients failing to respond to capability release
            1 MDSs report slow requests
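
The client behind the "failing to respond to capability release" warning can usually be pinned down like this; the jq field names are an assumption about the session-ls JSON layout:

    # health detail names the client id behind each warning
    ceph health detail
    # map that id to a hostname and cap count
    ceph tell mds.mds-default.SERVERNAME.ptkjle session ls | \
      jq '.[] | {id, host: .client_metadata.hostname, caps: .num_caps}'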

My next debugging step would be a complete cluster shutdown.

Actions #4

Updated by Brian Woods about 1 year ago

Logs from the active MDS:

debug 2023-02-27T22:44:08.254+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273155.621904 secs
debug 2023-02-27T22:44:13.254+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273160.622113 secs
debug 2023-02-27T22:44:18.254+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273165.622187 secs
debug 2023-02-27T22:44:23.254+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273170.622272 secs
debug 2023-02-27T22:44:28.269+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 273175.638000 secs
debug 2023-02-27T22:44:28.269+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : slow request 244.219832 seconds old,
received at 2023-02-27T22:40:24.051456+0000: client_request(client.57381755:297524 lookup #0x10000720e25/1387
2023-02-27T22:40:24.030295+0000 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting
debug 2023-02-27T22:44:33.269+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273180.638081 secs
debug 2023-02-27T22:44:38.269+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273185.638163 secs
debug 2023-02-27T22:44:43.269+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273190.638250 secs
debug 2023-02-27T22:44:48.285+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273195.653936 secs
debug 2023-02-27T22:44:53.285+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273200.654011 secs
debug 2023-02-27T22:44:58.285+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273205.654087 secs
debug 2023-02-27T22:45:03.284+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273210.654182 secs
debug 2023-02-27T22:45:08.300+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273215.669778 secs
debug 2023-02-27T22:45:13.300+0000 7fea0174e700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273220.669847 secs
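
If this comes back, more detail on why the rdlock never resolves could be captured by temporarily raising the MDS debug level; these are standard config options, and the level here is just an example since the extra logging is very chatty:

    # raise MDS debug logging while reproducing the hang
    ceph config set mds debug_mds 10
    # ...reproduce, collect the MDS container log, then put it back
    ceph config set mds debug_mds 1/5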

Actions #5

Updated by Brian Woods about 1 year ago

A cold restart of the cluster cleared the issue. From what I am hearing, this isn't an uncommon problem when deleting large snapshots... :/

Apparently an OSD becomes "stuck" and a rolling restart will clear it. I wanted to patch and swap some hardware, so I just did a cold start.
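
For anyone who wants to try the rolling route instead of a cold start, with cephadm it is roughly the following; the OSD id is a placeholder, and noout keeps data from rebalancing while daemons bounce:

    # avoid data movement while OSDs restart
    ceph osd set noout
    # restart OSDs one at a time, waiting for active+clean in between
    ceph orch daemon restart osd.12   # placeholder id; repeat per OSD
    ceph -s
    # once everything is back
    ceph osd unset noout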

The underlying problem still exists, but I am no longer in a locked state.
