Project

General

Profile

Actions

Bug #6914

closed

Accessing specific file on CephFS causes client hang

Added by Rolf Fokkens over 10 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When accessing a specific file (/disco/herbert/disco-data/output/video-conversion) the client hangs. Any kind of fstat on the file results in a hang, but doing a plain ls (not ls -l) on /disco/herbert/disco-data/output works.

[root@ceph-mon01 init.d]# ceph mds tell 0 dumpcache /tmp/dump.txt
ok
[root@ceph-mon01 init.d]# grep herbert/disco-data/output/video /tmp/dump.txt
[inode 10000013022 [2,head] /disco/herbert/disco-data/output/video-conversion auth v34 ap=6+0 s=36 n(v0 b36 1=1+0) (ifile mix->sync) (iversion lock) cr={9726=0-4194304@1,9729=0-4194304@1,9732=0-4194304@1,10155=0-4194304@1} caps={9726=pAsLsXsFr/pAsxXsxFsxcrwb@2,9729=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2,9732=pAsLsXsFr/pAsxXsxFsxcrwb@2,10155=pAsLsXsFr/pAsxXsxFsxcrwb@2,10602=pAsLsXsFr/-@1} | request=5 lock=1 caps=1 waiter=1 authpin=1 0x88a7378]
[dentry #1/disco/herbert/disco-data/output/video-conversion [2,head] auth (dversion lock) v=34 ap=0+6 inode=0x88a7378 | inodepin=1 0x9470f18]
[root@ceph-mon01 init.d]#

Seems to present relevant information: The request=5 (iversion lock) and the caps=... suggest (is there any documentation?) that clients 9726, 9729, 9732, 10155 and 10602 are involved. Possibly holding the lock, definitely some are waiting.

[root@ceph-mon01 init.d]# for i in 9726 9729 9732 10155 10602; do grep client.$i /var/log/ceph/ceph-mds.ceph-mon01.log | head; done
2013-11-30 14:20:52.706846 7fe144d7b700 0 log [DBG] : reconnect by client.9726 X.Y.23.154:0/3869725424 after 0.003437
2013-11-30 14:27:29.018735 7f2dbcc0b700 1 mds.0.server reconnect gave up on client.9726 X.Y.23.154:0/3869725424
2013-11-30 14:31:44.025454 7f2dbcc0b700 0 log [INF] : closing stale session client.9726 X.Y.23.154:0/3869725424 after 304.961662
2013-11-30 14:20:52.716187 7fe144d7b700 0 log [DBG] : reconnect by client.9729 X.Y.23.155:0/2487236169 after 0.012752
2013-11-30 14:27:29.018748 7f2dbcc0b700 1 mds.0.server reconnect gave up on client.9729 X.Y.23.155:0/2487236169
2013-11-30 14:31:44.025951 7f2dbcc0b700 0 log [INF] : closing stale session client.9729 X.Y.23.155:0/2487236169 after 304.961658
2013-11-30 14:21:41.356094 7fe142375700 1 mds.0.server reconnect gave up on client.9732 X.Y.23.156:0/216645998
2013-11-30 14:25:56.362768 7fe142375700 0 log [INF] : closing stale session client.9732 X.Y.23.156:0/216645998 after 304.957062
2013-11-30 14:21:41.356105 7fe142375700 1 mds.0.server reconnect gave up on client.10155 X.Y.23.157:0/4097063187
2013-11-30 14:25:56.362644 7fe142375700 0 log [INF] : closing stale session client.10155 X.Y.23.157:0/4097063187 after 304.957067
2013-11-30 14:20:52.703747 7fe144d7b700 0 log [DBG] : reconnect by client.10602 X.Y.23.168:0/7237118 after 0.000324
2013-11-30 14:26:40.419783 7f2dbf611700 0 log [DBG] : reconnect by client.10602 X.Y.23.168:0/7237118 after 0.000973
2013-11-30 14:35:04.030942 7f2dbcc0b700 0 log [INF] : closing stale session client.10602 X.Y.23.168:0/7237118 after 300.897098
[root@ceph-mon01 init.d]#

This confirms that the numbers probably (documentation?) represent client numbers: any of those numbers was client number in the past.

[root@ceph-mon01 init.d]# for i in X.Y.23.154 X.Y.23.155 X.Y.23.156 X.Y.23.157 X.Y.23.168; do grep $i /var/log/ceph/ceph-mds.ceph-mon01.log | grep client[.] | tail -n 1; done
2013-11-30 15:13:08.316764 7ff871889700 0 log [DBG] : reconnect by client.10641 X.Y.23.154:0/3851316717 after 0.001849
2013-11-30 15:13:08.316515 7ff871889700 0 log [DBG] : reconnect by client.10623 X.Y.23.155:0/1362068526 after 0.001678
2013-11-30 15:13:08.318742 7ff871889700 0 log [DBG] : reconnect by client.10638 X.Y.23.156:0/3825825779 after 0.003957
2013-11-30 15:13:08.319199 7ff871889700 0 log [DBG] : reconnect by client.10635 X.Y.23.157:0/2941121077 after 0.004415
2013-11-30 15:13:08.318027 7ff871889700 0 log [DBG] : reconnect by client.10629 X.Y.23.168:0/914977541 after 0.003219
[root@ceph-mon01 init.d]#

This suggests that all clients now have different ID's. This is consistent with the fact that fact that the clients have all been rebooted, and the mds has been restarted afterwards as well. The odd thing however is this:

[root@ceph-mon01 init.d]# ceph mds tell 0 dumpcache /tmp/dump.txt
ok

[root@ceph-mon01 init.d]# grep herbert/disco-data/output/video /tmp/dump.txt
[inode 10000013022 [2,head] /disco/herbert/disco-data/output/video-conversion auth v34 ap=6+0 s=36 n(v0 b36 1=1+0) (ifile mix->sync) (iversion lock) cr={9726=0-4194304@1,9729=0-4194304@1,9732=0-4194304@1,10155=0-4194304@1} caps={9726=pAsLsXsFr/pAsxXsxFsxcrwb@2,9729=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2,9732=pAsLsXsFr/pAsxXsxFsxcrwb@2,10155=pAsLsXsFr/pAsxXsxFsxcrwb@2,10602=pAsLsXsFr/-@1} | request=5 lock=1 caps=1 waiter=1 authpin=1 0x88a7378]
[dentry #1/disco/herbert/disco-data/output/video-conversion [2,head] auth (dversion lock) v=34 ap=0+6 inode=0x88a7378 | inodepin=1 0x9470f18]
[root@ceph-mon01 init.d]#

So after reboots, and apparent client ID changes the MDS seems (no documentation?) to attribute locks to the old client ID's.

We're using CentOS 6.4 and the ceph-0.72.1-0.el6.x86_64 software on the Ceph cluster. Client nodes are also CentOS 6.4 but are using a newer kernel-ml-3.12.1-1.el6.elrepo.x86_64 for ceph support


Files

mds-20131216.dump.bz2 (553 KB) mds-20131216.dump.bz2 The cache dump Rolf Fokkens, 12/19/2013 12:52 AM
mds-20131219.dump.bz2 (553 KB) mds-20131219.dump.bz2 A more recent mds cache dump Rolf Fokkens, 12/19/2013 01:00 AM
mds-ceph-parent-setxattr.patch (1.24 KB) mds-ceph-parent-setxattr.patch Zheng Yan, 12/23/2013 11:46 PM
0001-ceph-check-inode-caps-in-ceph_d_revalidate.patch (2.4 KB) 0001-ceph-check-inode-caps-in-ceph_d_revalidate.patch Zheng Yan, 12/24/2013 09:41 PM
pool.patch (539 Bytes) pool.patch Zheng Yan, 12/30/2013 09:20 PM
0001-mds-always-store-backtrace-xattr-in-the-default-pool.patch (4.19 KB) 0001-mds-always-store-backtrace-xattr-in-the-default-pool.patch Zheng Yan, 12/31/2013 07:06 PM
ceph.spec (20.2 KB) ceph.spec I'm currently rebuilding the RPM's using this ceph.spec Rolf Fokkens, 01/02/2014 02:08 AM
Actions

Also available in: Atom PDF