Bug #12159
setfilelock requests stuck (Closed)
Description
Sometimes on our cluster fcntl lock requests get stuck. Restarting the MDS server does not help.
The mdsc file contains the following:
931466  mds0    setfilelock     #10000568fb1
931467  mds0    setfilelock     #10000568fab
931468  mds0    setfilelock     #10000568fa6
Locks and file access from another client to the same file still work.
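For reference, the mdsc listing above comes from debugfs; a minimal way to dump it on a client is shown below, assuming debugfs is mounted at the usual /sys/kernel/debug (the fsid and client id in the path differ per mount):

# pending MDS requests, one line per outstanding request: tid, mds rank, op, inode
cat /sys/kernel/debug/ceph/*/mdsc
# pending OSD requests, for comparison
cat /sys/kernel/debug/ceph/*/osdc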
I am using Ceph 0.94.2 together with Linux kernel 3.18.10 and the following patches:
ceph: exclude setfilelock requests when calculating oldest tid (e8a7b8b12b13831467c6158c1e82801e25b5dd98)
ceph: fix dentry leaks (5cba372c0fe78d24e83d9e0556ecbeb219625c15)
libceph: kfree() in put_osd() shouldn't depend on authorizer (b28ec2f37e6a2bbd0bdf74b39cb89c74e4ad17f3)
libceph: request a new osdmap if lingering request maps to no osd (b0494532214bdfbf241e94fabab5dd46f7b82631)
ceph: re-send requests when MDS enters reconnecting stage (3de22be6771353241eaec237fe594dfea3daf30f)
Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" (521a04d06a729e5971cdee7f84080387ed320527)
ceph: introduce global empty snap context (97c85a828f36bbfffe9d77b977b65a5872b6cad4)
Is there any other patch available that I might try?
Updated by Zheng Yan almost 9 years ago
Please enable mds debugging, then restart the mds and send the mds log to me.
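In case it is useful, a rough sketch of one way to do this at runtime through the admin socket; mds.<name> stands for the actual MDS id and the log path is only the usual default:

# raise the debug level on the running mds (takes effect immediately)
ceph daemon mds.<name> config set debug_mds 20
ceph daemon mds.<name> config set debug_ms 1
# then restart / fail the mds and collect its log, e.g. /var/log/ceph/ceph-mds.<name>.log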
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.001 added
Here are the mds logs. I had to split them up because of the upload limit.
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.002 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.003 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.004 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.005 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.006 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.007 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.008 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.009 added
Updated by Markus Blank-Burian almost 9 years ago
- File setfilelock.txt.xz.010 added
Updated by Zheng Yan almost 9 years ago
Which client has the hung lock request?
(/sys/kernel/debug/ceph/46574da0-8084-4405-95fe-cb546a197920.clientXXXXX)
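One possible way to answer that on each client node is to look for a non-empty mdsc file under debugfs; the directory name contains the client id. A small sketch, for illustration only:

for d in /sys/kernel/debug/ceph/*.client*; do
    # a non-empty mdsc means this mount still has MDS requests outstanding
    if [ -n "$(cat "$d/mdsc")" ]; then
        echo "outstanding mds requests for ${d##*/}:"
        cat "$d/mdsc"
    fi
done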
Updated by Zheng Yan almost 9 years ago
Please enable dynamic debug for the ceph kernel client with the hung requests
(echo module ceph +p > /sys/kernel/debug/dynamic_debug/control),
then restart the MDS. Collect the kernel messages after the MDS recovers and send them to me.
thanks
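A sketch of that collection procedure on an affected client; enabling libceph debugging as well is an addition of mine and may be more verbose than needed:

# turn on debug messages of the ceph (and optionally libceph) kernel modules
echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
# ... restart the MDS and wait until it is active again ...
# save the kernel messages, then switch debugging back off
dmesg > ceph-client-debug.log
echo 'module ceph -p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control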
Updated by Markus Blank-Burian almost 9 years ago
I rebooted the clients yesterday, so I expect them to have different IDs. As far as I remember, the IPs of the clients were 192.168.1.30 and 192.168.1.47.
Updated by Zheng Yan almost 9 years ago
There is no 192.168.1.30 or 192.168.1.47 in the mds log. Maybe the sessions between them and the MDS were closed for some unknown reason.
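If it helps to map client ids to addresses instead of guessing IPs, a rough sketch, assuming the running Ceph version already provides the session ls admin-socket command:

# list the sessions the active mds currently knows about (client id, address, state)
ceph daemon mds.<name> session ls
# or search the mds logs for the suspected client addresses
grep '192\.168\.1\.30\|192\.168\.1\.47' /var/log/ceph/ceph-mds.*.log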
Updated by Markus Blank-Burian almost 9 years ago
- File kaa-50.log added
I made a log of another client, which currently has two outstanding setfilelock requests, during mds failover.
Updated by Zheng Yan almost 9 years ago
Please provide both the mds log and the client kernel log during mds failover.
Updated by Markus Blank-Burian almost 9 years ago
- File kaa-50.log added
Ok, here are the combined log files from client.7322026 and the mds.
Updated by Zheng Yan almost 9 years ago
The mds debug level is not enough. Please set debug_mds to 20 and try again.
Updated by Markus Blank-Burian almost 9 years ago
- File kaa-50.log added
debug mds was set to 20, but within the [mds] section. So here are some better logs.
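For completeness, a sketch of checking where that setting actually lands; the ceph.conf snippet is only the conventional form and mds.<name> is a placeholder:

# the level can be set persistently in /etc/ceph/ceph.conf on the mds host:
#
#   [mds]
#       debug mds = 20
#       debug ms = 1
#
# verify what the running daemon actually uses:
ceph daemon mds.<name> config get debug_mds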
Updated by Zheng Yan almost 9 years ago
The mds log shows that the session for client.7322026 has been closed; there is no clue why the mds closed it. Is there any message like "closing stale session client.1046359 10.0.0.10:0/14716 after 300.034073" in your cluster log?
Updated by Markus Blank-Burian almost 9 years ago
yes, there are such entries:
# zgrep "stale session client.7322026" *mds*
ceph-mds-bagheera2.log-20150703.gz:2015-06-30T15:47:40+02:00 bagheera2 ceph-mds[32447]: 2015-06-30 15:47:40.586782 7f68f3fff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 14065.952572
ceph-mds-bagheera2.log-20150703.gz:2015-07-02T16:57:47+02:00 bagheera2 ceph-mds[4354]: 2015-07-02 16:57:47.418309 7f6811bff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 18233.697692
ceph-mds-bagheera.log-20150703.gz:2015-07-02T09:17:32+02:00 bagheera ceph-mds[6555]: 2015-07-02 09:17:32.425285 7f8763bff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 106203.943828
ceph-mds-bagheera.log-20150704.gz:2015-07-03T10:32:49+02:00 bagheera ceph-mds[10306]: 2015-07-03 10:32:49.606605 7f0ad53ff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 56260.511592
Updated by Zheng Yan almost 9 years ago
Maybe there is a network problem between 192.168.1.50 and the MDS. Do other clients suffer the same issue?
Updated by Zheng Yan almost 9 years ago
Please try commit:
ceph: cleanup unsafe requests when reconnecting is denied
1c841a96b5f369cbb0b169d13825c7ff7d0fba8d
It improves the handling of the case where the session gets closed.
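In case it saves someone a step, a sketch of pulling that commit into a local kernel tree; the ceph-client repository URL is my assumption of where the commit lives, and a cherry-pick onto 3.18 may need manual conflict resolution:

# inside the kernel source tree that the clients are built from
git remote add ceph-client https://github.com/ceph/ceph-client.git
git fetch ceph-client
git cherry-pick 1c841a96b5f369cbb0b169d13825c7ff7d0fba8d
# rebuild and reinstall the ceph and libceph modules, then reboot the clients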
Updated by Markus Blank-Burian almost 9 years ago
Many clients had this issue. Network problems are unlikely; I never noticed any problems in the logs from our switches. I will test this patch over the next days, thanks!
Updated by Markus Blank-Burian almost 9 years ago
Looking through the logs, there were indeed some cases of "ceph: mds0 reconnect denied" per client.
Updated by Zheng Yan almost 9 years ago
Maybe you should increase mds_reconnect_timeout. Besides, please don't leave debug_mds enabled during normal operation.
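A rough example of raising that timeout; 120 seconds is only an illustrative value and mds.<name> a placeholder:

# at runtime on the active mds
ceph daemon mds.<name> config set mds_reconnect_timeout 120
# or persistently, by putting "mds reconnect timeout = 120" into the [mds]
# section of ceph.conf on the mds hosts and restarting the mds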
Updated by Markus Blank-Burian almost 9 years ago
Yesterday, while testing the patch, I saw a process stuck in a write request (shown by strace), with no entry in osdc or mdsc. This happened immediately after a triggered failover (ceph mds fail) from an active to a standby mds. Since the load on the cluster dropped massively at this time and did not recover, I suspect that many programs had the same problem.
Today I tested the same with a single mds and had no problem. All processes survived multiple fails.
Updated by Zheng Yan almost 9 years ago
I can't reproduce this hang. If it happens again, please check /proc/<PID>/stack to find where it is waiting.
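A minimal sketch of that check, plus a way to spot other processes stuck in uninterruptible sleep (<PID> being the hung process reported by strace):

# kernel stack of the hung process: shows which function it is blocked in
cat /proc/<PID>/stack
# list every process currently in uninterruptible (D) sleep, with its wait channel
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'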
Updated by Zheng Yan almost 8 years ago
- Status changed from New to Need More Info
Updated by Zheng Yan almost 7 years ago
- Status changed from Need More Info to Can't reproduce