Bug #58846 (open)
Large snapshot delete causing locking "issues"
Description
Sorry for the generic title, but several things seem broken, so I am not 100% sure how to title this one.
I have a rather large folder (about 19TBs) that had a half dozen snapshots on it. The other night I deleted them. Things got busy as expected, but a few apps that were doing IO in that folder just stopped. Killed and restarted them, but they locked again.
Started a tar command on the folder to /dev/zero and it was going along fine until it hit some file in that folder (didn't check which file, but I can if wanted) and it also just stopped.
So I checked 'ceph -s' and the snaptrim count was catching up (I think there were about 60 PGs), so I left it for the morning as this isn't a critical folder. About 1.3TB of RAW space was recovered, but in the morning the tar command was still locked and some PGs appeared to be stuck.
This has been the state for over 24 hours:
- 15.79 active+clean+snaptrim_wait - snaptrim_duration: 0 - Queue Len: 8
- 15.68 active+clean+snaptrim_wait - snaptrim_duration: 594.114416225 - Queue Len: 7
- 15.60 active+clean+snaptrim_wait - snaptrim_duration: 0 - Queue Len: 8
- 15.0 active+clean+snaptrim - snaptrim_duration: 0.050225681 - Queue Len: 8
- 15.14 active+clean+snaptrim - snaptrim_duration: 5.939379017 - Queue Len: 8
- 15.2a active+clean+snaptrim_wait - snaptrim_duration: 86.089696106 - Queue Len: 7
- 15.2e active+clean+snaptrim_wait - snaptrim_duration: 0 - Queue Len: 8
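For anyone wanting to pull the same per-PG snaptrim view themselves, the queue lengths are in the JSON form of `ceph pg dump`. Below is a sketch of the filter run against a small inline sample shaped like that output (the `snaptrimq_len` field name and nesting are assumptions and may differ between Ceph releases); on a live cluster you would pipe `ceph pg dump --format json` into the same jq filter instead.

```shell
# Sketch: list PGs that still have a snaptrim queue. The sample JSON
# mimics only the fields the filter touches (pgid, state, snaptrimq_len);
# on a live cluster: ceph pg dump --format json | jq -r '<filter below>'
sample='{"pg_map":{"pg_stats":[
  {"pgid":"15.79","state":"active+clean+snaptrim_wait","snaptrimq_len":8},
  {"pgid":"15.0","state":"active+clean+snaptrim","snaptrimq_len":8},
  {"pgid":"15.3","state":"active+clean","snaptrimq_len":0}]}}'

out=$(echo "$sample" | jq -r '.pg_map.pg_stats[]
  | select(.snaptrimq_len > 0)
  | "\(.pgid) \(.state) - Queue Len: \(.snaptrimq_len)"')
printf '%s\n' "$out"
```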
I did a rolling restart of all of the MDSs and MON servers, but no change.
Here is the current ceph -s:
health: HEALTH_WARN
        1 pools have many more objects per pg than average
        1 clients failing to respond to capability release
        1 MDSs report slow requests
        64 pgs not deep-scrubbed in time
        71 pgs not scrubbed in time
        2 pools have too few placement groups
I did find this KB:
https://access.redhat.com/solutions/3399771
I found "failed to rdlock, waiting" and tried kicking clients per the article, but that didn't seem to help.
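For reference, the "kick the clients" step boils down to finding the stuck session id and evicting it. Here is a dry-run sketch against a sample of `session ls` output (real output has many more fields; the MDS daemon name matches the redacted one in this ticket and is otherwise hypothetical). Eviction forcibly revokes the client's caps, so only aim it at a client you have confirmed is stuck.

```shell
# Dry-run sketch of evicting a stuck CephFS client session.
# Sample mimics only the fields used here; real `session ls` is larger.
sessions='[{"id":55807368,"num_caps":12000,
            "client_metadata":{"hostname":"app-host-1"}}]'

# Pick the session id to evict (here: the only one in the sample).
evict_id=$(echo "$sessions" | jq -r '.[0].id')

# Commands you would run against the live MDS:
echo "ceph tell mds.mds-default.SERVERNAME.ptkjle session ls"
echo "ceph tell mds.mds-default.SERVERNAME.ptkjle session evict id=$evict_id"
```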
# ceph daemon mds.mds-default.SERVERNAME.ptkjle dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.55807368:11991113 getattr pAsLsXsFs #0x10000e3fb93 2023-02-24T02:57:53.743251+0000 RETRY=1 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2023-02-24T18:51:32.633289+0000",
            "age": 4303.5721312469996,
            "duration": 4303.5721652660004,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.55807368:11991113",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.55807368",
                    "tid": 11991113
                },
                "events": [
                    { "time": "2023-02-24T18:51:32.633289+0000", "event": "initiated" },
                    { "time": "2023-02-24T18:51:32.633290+0000", "event": "throttled" },
                    { "time": "2023-02-24T18:51:32.633289+0000", "event": "header_read" },
                    { "time": "2023-02-24T18:51:32.633302+0000", "event": "all_read" },
                    { "time": "2023-02-24T18:51:32.633348+0000", "event": "dispatched" },
                    { "time": "2023-02-24T18:51:35.531203+0000", "event": "failed to rdlock, waiting" }
                ]
            }
        }
    ],
    "num_ops": 1
}
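When there are many in-flight ops, a jq filter makes it easy to pull out just the ones stuck at the rdlock flag point. A sketch, run here against a trimmed inline sample carrying only the fields the filter reads; on the live MDS you would pipe the real `dump_ops_in_flight` output into the same filter.

```shell
# Sketch: show ops stuck on "failed to rdlock, waiting".
# Live MDS: ceph daemon mds.<name> dump_ops_in_flight | jq -r '<filter>'
ops='{"ops":[{"duration":4303.572,
  "type_data":{"flag_point":"failed to rdlock, waiting",
               "reqid":"client.55807368:11991113"}}],"num_ops":1}'

out=$(echo "$ops" | jq -r '.ops[]
  | select(.type_data.flag_point | contains("rdlock"))
  | "\(.type_data.reqid) blocked \(.duration)s at: \(.type_data.flag_point)"')
printf '%s\n' "$out"
```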
The only other thing I can think to do is a complete cluster shutdown and restart.
Updated by Brian Woods about 1 year ago
I forgot to add that this is a cephadm 17.2.4 deployment on Ubuntu 20.04 hosts.
Updated by Brian Woods about 1 year ago
Update: after about 3 days, the trimming PGs suddenly cleared without intervention... Not a clue why.
I still have MDS reporting slow requests, but I assume that is a stuck client.
Updated by Brian Woods about 1 year ago
After unmounting and re-mounting the volume, I am still hitting a problem where attempting to read a file locks up the thread.
For posterity, the file that locks up is:
/DFS/MediaStore/Backups/Dockers/data/1/1386
And I still show this in my ceph -s:
health: HEALTH_WARN
        1 pools have many more objects per pg than average
        1 clients failing to respond to capability release
        1 MDSs report slow requests
My next debugging step would be a complete cluster shutdown.
Updated by Brian Woods about 1 year ago
Logs from the active MDS:
debug 2023-02-27T22:44:08.254+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273155.621904 secs
debug 2023-02-27T22:44:13.254+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273160.622113 secs
debug 2023-02-27T22:44:18.254+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273165.622187 secs
debug 2023-02-27T22:44:23.254+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273170.622272 secs
debug 2023-02-27T22:44:28.269+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 273175.638000 secs
debug 2023-02-27T22:44:28.269+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : slow request 244.219832 seconds old, received at 2023-02-27T22:40:24.051456+0000: client_request(client.57381755:297524 lookup #0x10000720e25/1387 2023-02-27T22:40:24.030295+0000 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting
debug 2023-02-27T22:44:33.269+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273180.638081 secs
debug 2023-02-27T22:44:38.269+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273185.638163 secs
debug 2023-02-27T22:44:43.269+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273190.638250 secs
debug 2023-02-27T22:44:48.285+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273195.653936 secs
debug 2023-02-27T22:44:53.285+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273200.654011 secs
debug 2023-02-27T22:44:58.285+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273205.654087 secs
debug 2023-02-27T22:45:03.284+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273210.654182 secs
debug 2023-02-27T22:45:08.300+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273215.669778 secs
debug 2023-02-27T22:45:13.300+0000 7fea0174e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 273220.669847 secs
Updated by Brian Woods about 1 year ago
A cold restart of the cluster cleared the issue. From what I am hearing, this isn't an uncommon problem when deleting large snapshots... :/
Apparently an OSD becomes "stuck" and a rolling restart will clear it. I wanted to patch and swap some hardware, so I just did a cold start.
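For anyone hitting this who can't afford a cold start, the rolling-restart approach can be sketched roughly as below for a cephadm deployment. This is a dry-run sketch, not a tested procedure: the OSD ids are placeholders for `ceph osd ls` output, and waiting for HEALTH_OK between restarts is a conservative choice to keep recovery from overlapping.

```shell
# Dry-run sketch of a rolling OSD restart on a cephadm cluster.
# DRY_RUN=1 only prints the commands; drop it to actually execute.
DRY_RUN=1
for osd_id in 0 1 2; do              # live cluster: $(ceph osd ls)
  cmd="ceph orch daemon restart osd.$osd_id"
  if [ -n "$DRY_RUN" ]; then
    echo "$cmd"
  else
    $cmd
    # Let the cluster settle before touching the next OSD.
    until ceph health | grep -q HEALTH_OK; do sleep 10; done
  fi
done
```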
This problem is still an issue, but I am no longer in a locked state.