Bug #12159

closed

setfilelock requests stuck

Added by Markus Blank-Burian almost 9 years ago. Updated almost 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Sometimes on our cluster fcntl lock requests get stuck. Restarting the MDS server does not help.
The mdsc file contains the following:

931466  mds0    setfilelock      #10000568fb1
931467  mds0    setfilelock      #10000568fab
931468  mds0    setfilelock      #10000568fa6

Locks and file access from another client to the same file still work.

I am using Ceph 0.94.2 together with Linux kernel 3.18.10 and the following patches:

ceph: exclude setfilelock requests when calculating oldest tid
e8a7b8b12b13831467c6158c1e82801e25b5dd98

ceph: fix dentry leaks
5cba372c0fe78d24e83d9e0556ecbeb219625c15

libceph: kfree() in put_osd() shouldn't depend on authorizer
b28ec2f37e6a2bbd0bdf74b39cb89c74e4ad17f3

libceph: request a new osdmap if lingering request maps to no osd
b0494532214bdfbf241e94fabab5dd46f7b82631

ceph: re-send requests when MDS enters reconnecting stage
3de22be6771353241eaec237fe594dfea3daf30f

Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" 
521a04d06a729e5971cdee7f84080387ed320527

ceph: introduce global empty snap context
97c85a828f36bbfffe9d77b977b65a5872b6cad4

Is there any other patch available that I might try?


Files

setfilelock.txt.xz.001 (768 KB) setfilelock.txt.xz.001 Markus Blank-Burian, 06/30/2015 09:32 AM
setfilelock.txt.xz.002 (768 KB) setfilelock.txt.xz.002 Markus Blank-Burian, 06/30/2015 09:34 AM
setfilelock.txt.xz.003 (768 KB) setfilelock.txt.xz.003 Markus Blank-Burian, 06/30/2015 09:34 AM
setfilelock.txt.xz.004 (768 KB) setfilelock.txt.xz.004 Markus Blank-Burian, 06/30/2015 09:34 AM
setfilelock.txt.xz.005 (768 KB) setfilelock.txt.xz.005 Markus Blank-Burian, 06/30/2015 09:34 AM
setfilelock.txt.xz.006 (768 KB) setfilelock.txt.xz.006 Markus Blank-Burian, 06/30/2015 09:34 AM
setfilelock.txt.xz.007 (768 KB) setfilelock.txt.xz.007 Markus Blank-Burian, 06/30/2015 09:35 AM
setfilelock.txt.xz.008 (768 KB) setfilelock.txt.xz.008 Markus Blank-Burian, 06/30/2015 09:35 AM
setfilelock.txt.xz.009 (768 KB) setfilelock.txt.xz.009 Markus Blank-Burian, 06/30/2015 09:35 AM
setfilelock.txt.xz.010 (239 KB) setfilelock.txt.xz.010 Markus Blank-Burian, 06/30/2015 09:35 AM
kaa-50.log (77.2 KB) kaa-50.log Markus Blank-Burian, 07/02/2015 03:07 PM
kaa-50.log (219 KB) kaa-50.log Markus Blank-Burian, 07/02/2015 05:05 PM
ceph-mds-bagheera2.log.xz.001 (768 KB) ceph-mds-bagheera2.log.xz.001 Markus Blank-Burian, 07/02/2015 05:05 PM
ceph-mds-bagheera2.log.xz.002 (768 KB) ceph-mds-bagheera2.log.xz.002 Markus Blank-Burian, 07/02/2015 05:05 PM
ceph-mds-bagheera2.log.xz.003 (768 KB) ceph-mds-bagheera2.log.xz.003 Markus Blank-Burian, 07/02/2015 05:05 PM
ceph-mds-bagheera2.log.xz.004 (546 KB) ceph-mds-bagheera2.log.xz.004 Markus Blank-Burian, 07/02/2015 05:05 PM
kaa-50.log (42.6 KB) kaa-50.log Markus Blank-Burian, 07/03/2015 08:49 AM
ceph-mds-bagheera2.log.xz.001 (900 KB) ceph-mds-bagheera2.log.xz.001 Markus Blank-Burian, 07/03/2015 08:49 AM
ceph-mds-bagheera2.log.xz.002 (900 KB) ceph-mds-bagheera2.log.xz.002 Markus Blank-Burian, 07/03/2015 08:50 AM
ceph-mds-bagheera2.log.xz.003 (900 KB) ceph-mds-bagheera2.log.xz.003 Markus Blank-Burian, 07/03/2015 08:50 AM
ceph-mds-bagheera2.log.xz.004 (900 KB) ceph-mds-bagheera2.log.xz.004 Markus Blank-Burian, 07/03/2015 08:50 AM
ceph-mds-bagheera2.log.xz.005 (900 KB) ceph-mds-bagheera2.log.xz.005 Markus Blank-Burian, 07/03/2015 08:50 AM
ceph-mds-bagheera2.log.xz.006 (365 KB) ceph-mds-bagheera2.log.xz.006 Markus Blank-Burian, 07/03/2015 08:50 AM
Actions #1

Updated by Zheng Yan almost 9 years ago

Please enable MDS debugging, then restart the MDS and send the MDS log to me.
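
For reference, a minimal sketch of how this could be done on the MDS host. The daemon name mds.bagheera2, the init script, and the log path are assumptions based on the hostnames and defaults seen later in this thread; adjust them to your setup:

# Raise the MDS debug level at runtime (alternatively put "debug mds = 20"
# into the [mds] or [global] section of ceph.conf before restarting).
ceph tell mds.bagheera2 injectargs '--debug_mds 20'

# Restart the MDS so the hung setfilelock requests go through recovery again
# (sysvinit syntax as commonly used with Hammer; systemd setups differ).
/etc/init.d/ceph restart mds.bagheera2

# The resulting log is in the default location unless configured otherwise.
ls -lh /var/log/ceph/ceph-mds-*.log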

Actions #2

Updated by Markus Blank-Burian almost 9 years ago

Here are the MDS logs. I had to split them up because of the upload limit.

Actions #12

Updated by Zheng Yan almost 9 years ago

Which client has the hung lock request?

(/sys/kernel/debug/ceph/46574da0-8084-4405-95fe-cb546a197920.clientXXXXX)
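
A small sketch for locating the affected mount; it assumes debugfs is mounted and simply checks every kernel-client instance for pending setfilelock requests (the directory name encodes the cluster fsid and the client id asked about above):

# Run on each client node.
for d in /sys/kernel/debug/ceph/*.client*; do
    echo "== $d"
    grep setfilelock "$d/mdsc"
done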

Actions #13

Updated by Zheng Yan almost 9 years ago

Please enable dynamic debug for the Ceph kernel client with hung requests
(echo module ceph +p > /sys/kernel/debug/dynamic_debug/control),

then restart the MDS. Collect the kernel messages after the MDS recovers and send them to me.

thanks
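
Putting the whole sequence together, a sketch for the affected client; enabling the libceph module as well is an extra assumption here, only the ceph module was requested above:

# Enable dynamic debug output for the kernel client.
echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

# ... restart the MDS on the server and wait until it is active again ...

# Capture the kernel messages and switch the debugging off again.
dmesg > /tmp/ceph-client-debug.log
echo 'module ceph -p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control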

Actions #14

Updated by Markus Blank-Burian almost 9 years ago

I rebooted the clients yesterday, so I expect them to have different IDs. As far as I remember, the IPs of the clients were 192.168.1.30 and 192.168.1.47.

Actions #15

Updated by Zheng Yan almost 9 years ago

There is no 192.168.1.30 or 192.168.1.47 in the MDS log. Maybe the sessions between them and the MDS were closed for some unknown reason.

Actions #16

Updated by Markus Blank-Burian almost 9 years ago

I made a log of another client, which currently has two outstanding setfilelock requests, during MDS failover.

Actions #17

Updated by Zheng Yan almost 9 years ago

Please provide both the MDS log and the client kernel log during MDS failover.

Actions #18

Updated by Markus Blank-Burian almost 9 years ago

OK, here are the combined log files from client.7322026 and the MDS.

Actions #23

Updated by Zheng Yan almost 9 years ago

The MDS debug level is not high enough. Please set debug_mds to 20 and try again.

Actions #24

Updated by Markus Blank-Burian almost 9 years ago

debug mds was set to 20, but within the [mds] section. So here are some better logs.

Actions #31

Updated by Zheng Yan almost 9 years ago

The MDS log shows that the session for client.7322026 has been closed; there is no clue why the MDS closed the session. Is there any message like "closing stale session client.1046359 10.0.0.10:0/14716 after 300.034073" in your cluster log?

Actions #32

Updated by Markus Blank-Burian almost 9 years ago

Yes, there are such entries:

# zgrep "stale session client.7322026" *mds*
ceph-mds-bagheera2.log-20150703.gz:2015-06-30T15:47:40+02:00 bagheera2 ceph-mds[32447]: 2015-06-30 15:47:40.586782 7f68f3fff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 14065.952572
ceph-mds-bagheera2.log-20150703.gz:2015-07-02T16:57:47+02:00 bagheera2 ceph-mds[4354]: 2015-07-02 16:57:47.418309 7f6811bff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 18233.697692
ceph-mds-bagheera.log-20150703.gz:2015-07-02T09:17:32+02:00 bagheera ceph-mds[6555]: 2015-07-02 09:17:32.425285 7f8763bff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 106203.943828
ceph-mds-bagheera.log-20150704.gz:2015-07-03T10:32:49+02:00 bagheera ceph-mds[10306]: 2015-07-03 10:32:49.606605 7f0ad53ff700  0 log_channel(cluster) log [INF] : closing stale session client.7322026 192.168.1.50:0/1488184409 after 56260.511592

Actions #33

Updated by Zheng Yan almost 9 years ago

Maybe there is a network problem between 192.168.1.50 and the MDS. Do other clients suffer from the same issue?

Actions #34

Updated by Zheng Yan almost 9 years ago

Please try this commit:
ceph: cleanup unsafe requests when reconnecting is denied
1c841a96b5f369cbb0b169d13825c7ff7d0fba8d

It improves handling of the case where the session gets closed.
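
A hedged sketch of applying that commit on top of the patched 3.18.10 tree; the ceph-client repository URL is an assumption, and cherry-picking onto an older kernel may need minor conflict resolution:

# Inside the kernel source tree used to build the client kernel.
git remote add ceph-client https://github.com/ceph/ceph-client.git
git fetch ceph-client
git cherry-pick 1c841a96b5f369cbb0b169d13825c7ff7d0fba8d
# Rebuild the kernel (or at least the ceph/libceph modules) and reboot the clients.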

Actions #35

Updated by Markus Blank-Burian almost 9 years ago

Many clients had this issue. Network problems are unlikely; I never noticed any problems in the logs from our switches. I will test this patch over the next few days, thanks!

Actions #36

Updated by Markus Blank-Burian almost 9 years ago

Looking through the logs, there were indeed some cases of "ceph: mds0 reconnect denied" per client.

Actions #37

Updated by Zheng Yan almost 9 years ago

Maybe you should increase mds_reconnect_timeout. Besides, please don't leave MDS debugging enabled during normal operation.
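
A sketch of raising the timeout; 120 seconds is an arbitrary example value, and the daemon name is again an assumption:

# Temporarily, on the running MDS (the default is 45 seconds):
ceph tell mds.bagheera2 injectargs '--mds_reconnect_timeout 120'

# Persistently, in the [mds] (or [global]) section of ceph.conf:
#   mds reconnect timeout = 120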

Actions #38

Updated by Markus Blank-Burian almost 9 years ago

Yesterday, while testing the patch, I saw a process stuck in a write request (shown by strace), with no entry in osdc or mdsc. This happened immediately after a triggered failover (ceph mds fail) from an active to a standby MDS. Since the load on the cluster dropped massively at that time and did not recover, I suspect that many programs had the same problem.
Today I tested the same with a single MDS and had no problem. All processes survived multiple failovers.

Actions #39

Updated by Zheng Yan almost 9 years ago

I can't reproduce this hang. If it happens again, please check /proc/<PID>/stack to find where it is waiting.
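
A small sketch for collecting that the next time it happens: list processes in uninterruptible sleep (state D) and dump their kernel stacks (needs root):

for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $pid ($(cat /proc/$pid/comm))"
    cat /proc/$pid/stack
done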

Actions #40

Updated by Zheng Yan almost 8 years ago

  • Status changed from New to Need More Info
Actions #41

Updated by Zheng Yan almost 7 years ago

  • Status changed from Need More Info to Can't reproduce