Bug #6914

closed

Accessing specific file on CephFS causes client hang

Added by Rolf Fokkens over 10 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When accessing a specific file (/disco/herbert/disco-data/output/video-conversion) the client hangs. Any kind of fstat on the file results in a hang, but doing a plain ls (not ls -l) on /disco/herbert/disco-data/output works.

[root@ceph-mon01 init.d]# ceph mds tell 0 dumpcache /tmp/dump.txt
ok
[root@ceph-mon01 init.d]# grep herbert/disco-data/output/video /tmp/dump.txt
[inode 10000013022 [2,head] /disco/herbert/disco-data/output/video-conversion auth v34 ap=6+0 s=36 n(v0 b36 1=1+0) (ifile mix->sync) (iversion lock) cr={9726=0-4194304@1,9729=0-4194304@1,9732=0-4194304@1,10155=0-4194304@1} caps={9726=pAsLsXsFr/pAsxXsxFsxcrwb@2,9729=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2,9732=pAsLsXsFr/pAsxXsxFsxcrwb@2,10155=pAsLsXsFr/pAsxXsxFsxcrwb@2,10602=pAsLsXsFr/-@1} | request=5 lock=1 caps=1 waiter=1 authpin=1 0x88a7378]
[dentry #1/disco/herbert/disco-data/output/video-conversion [2,head] auth (dversion lock) v=34 ap=0+6 inode=0x88a7378 | inodepin=1 0x9470f18]
[root@ceph-mon01 init.d]#

This seems to present relevant information: the request=5, (iversion lock) and caps={...} entries suggest (is there any documentation on this format?) that clients 9726, 9729, 9732, 10155 and 10602 are involved: some possibly holding the lock, and some definitely waiting.

[root@ceph-mon01 init.d]# for i in 9726 9729 9732 10155 10602; do grep client.$i /var/log/ceph/ceph-mds.ceph-mon01.log | head; done
2013-11-30 14:20:52.706846 7fe144d7b700 0 log [DBG] : reconnect by client.9726 X.Y.23.154:0/3869725424 after 0.003437
2013-11-30 14:27:29.018735 7f2dbcc0b700 1 mds.0.server reconnect gave up on client.9726 X.Y.23.154:0/3869725424
2013-11-30 14:31:44.025454 7f2dbcc0b700 0 log [INF] : closing stale session client.9726 X.Y.23.154:0/3869725424 after 304.961662
2013-11-30 14:20:52.716187 7fe144d7b700 0 log [DBG] : reconnect by client.9729 X.Y.23.155:0/2487236169 after 0.012752
2013-11-30 14:27:29.018748 7f2dbcc0b700 1 mds.0.server reconnect gave up on client.9729 X.Y.23.155:0/2487236169
2013-11-30 14:31:44.025951 7f2dbcc0b700 0 log [INF] : closing stale session client.9729 X.Y.23.155:0/2487236169 after 304.961658
2013-11-30 14:21:41.356094 7fe142375700 1 mds.0.server reconnect gave up on client.9732 X.Y.23.156:0/216645998
2013-11-30 14:25:56.362768 7fe142375700 0 log [INF] : closing stale session client.9732 X.Y.23.156:0/216645998 after 304.957062
2013-11-30 14:21:41.356105 7fe142375700 1 mds.0.server reconnect gave up on client.10155 X.Y.23.157:0/4097063187
2013-11-30 14:25:56.362644 7fe142375700 0 log [INF] : closing stale session client.10155 X.Y.23.157:0/4097063187 after 304.957067
2013-11-30 14:20:52.703747 7fe144d7b700 0 log [DBG] : reconnect by client.10602 X.Y.23.168:0/7237118 after 0.000324
2013-11-30 14:26:40.419783 7f2dbf611700 0 log [DBG] : reconnect by client.10602 X.Y.23.168:0/7237118 after 0.000973
2013-11-30 14:35:04.030942 7f2dbcc0b700 0 log [INF] : closing stale session client.10602 X.Y.23.168:0/7237118 after 300.897098
[root@ceph-mon01 init.d]#

This confirms that the numbers probably (documentation?) represent client IDs: each of those numbers was a client ID at some point in the past.

[root@ceph-mon01 init.d]# for i in X.Y.23.154 X.Y.23.155 X.Y.23.156 X.Y.23.157 X.Y.23.168; do grep $i /var/log/ceph/ceph-mds.ceph-mon01.log | grep client[.] | tail -n 1; done
2013-11-30 15:13:08.316764 7ff871889700 0 log [DBG] : reconnect by client.10641 X.Y.23.154:0/3851316717 after 0.001849
2013-11-30 15:13:08.316515 7ff871889700 0 log [DBG] : reconnect by client.10623 X.Y.23.155:0/1362068526 after 0.001678
2013-11-30 15:13:08.318742 7ff871889700 0 log [DBG] : reconnect by client.10638 X.Y.23.156:0/3825825779 after 0.003957
2013-11-30 15:13:08.319199 7ff871889700 0 log [DBG] : reconnect by client.10635 X.Y.23.157:0/2941121077 after 0.004415
2013-11-30 15:13:08.318027 7ff871889700 0 log [DBG] : reconnect by client.10629 X.Y.23.168:0/914977541 after 0.003219
[root@ceph-mon01 init.d]#

This suggests that all clients now have different IDs. That is consistent with the fact that the clients have all been rebooted and the MDS has been restarted afterwards as well. The odd thing, however, is this:

[root@ceph-mon01 init.d]# ceph mds tell 0 dumpcache /tmp/dump.txt
ok

[root@ceph-mon01 init.d]# grep herbert/disco-data/output/video /tmp/dump.txt
[inode 10000013022 [2,head] /disco/herbert/disco-data/output/video-conversion auth v34 ap=6+0 s=36 n(v0 b36 1=1+0) (ifile mix->sync) (iversion lock) cr={9726=0-4194304@1,9729=0-4194304@1,9732=0-4194304@1,10155=0-4194304@1} caps={9726=pAsLsXsFr/pAsxXsxFsxcrwb@2,9729=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2,9732=pAsLsXsFr/pAsxXsxFsxcrwb@2,10155=pAsLsXsFr/pAsxXsxFsxcrwb@2,10602=pAsLsXsFr/-@1} | request=5 lock=1 caps=1 waiter=1 authpin=1 0x88a7378]
[dentry #1/disco/herbert/disco-data/output/video-conversion [2,head] auth (dversion lock) v=34 ap=0+6 inode=0x88a7378 | inodepin=1 0x9470f18]
[root@ceph-mon01 init.d]#

So after the reboots and the apparent client ID changes, the MDS still seems (again, documentation?) to attribute the locks to the old client IDs.

We're using CentOS 6.4 with ceph-0.72.1-0.el6.x86_64 on the Ceph cluster. The client nodes are also CentOS 6.4, but run the newer kernel-ml-3.12.1-1.el6.elrepo.x86_64 for CephFS support.


Files

mds-20131216.dump.bz2 (553 KB) mds-20131216.dump.bz2 The cache dump Rolf Fokkens, 12/19/2013 12:52 AM
mds-20131219.dump.bz2 (553 KB) mds-20131219.dump.bz2 A more recent mds cache dump Rolf Fokkens, 12/19/2013 01:00 AM
mds-ceph-parent-setxattr.patch (1.24 KB) mds-ceph-parent-setxattr.patch Zheng Yan, 12/23/2013 11:46 PM
0001-ceph-check-inode-caps-in-ceph_d_revalidate.patch (2.4 KB) 0001-ceph-check-inode-caps-in-ceph_d_revalidate.patch Zheng Yan, 12/24/2013 09:41 PM
pool.patch (539 Bytes) pool.patch Zheng Yan, 12/30/2013 09:20 PM
0001-mds-always-store-backtrace-xattr-in-the-default-pool.patch (4.19 KB) 0001-mds-always-store-backtrace-xattr-in-the-default-pool.patch Zheng Yan, 12/31/2013 07:06 PM
ceph.spec (20.2 KB) ceph.spec I'm currently rebuilding the RPM's using this ceph.spec Rolf Fokkens, 01/02/2014 02:08 AM
Actions #1

Updated by Rolf Fokkens over 10 years ago

Well, actually the client hang now seems to be gone. Still, the MDS seems to attribute the locks to the old client IDs.

Actions #2

Updated by Herbert XYZ over 10 years ago

As an addition to the problem Rolf is experiencing, please note this:

The 'ls' process is in state D (uninterruptible sleep) as reported by /proc/PID/status, and its kernel stack trace (/proc/PID/stack) includes ceph_mdsc_do_request, ceph_do_getattr and ceph_getattr.
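
For reference, a minimal way to inspect a hung process like this (the PID lookup is hypothetical; the /proc paths are standard):

PID=$(pgrep -x ls | head -n 1)
grep State /proc/$PID/status    # expect "State: D (disk sleep)"
cat /proc/$PID/stack            # expect ceph_mdsc_do_request / ceph_do_getattr / ceph_getattr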

Actions #3

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

Well, actually now the client hang seems to be gone. Still the MDS seems to attribute the locks to the old client ID's.

"ceph mds tell 0 dumpcache /tmp/dump.txt" does not overwite existing file

Actions #4

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

[inode 10000013022 [2,head] /disco/herbert/disco-data/output/video-conversion auth v34 ap=6+0 s=36 n(v0 b36 1=1+0) (ifile mix->sync) (iversion lock) cr={9726=0-4194304@1,9729=0-4194304@1,9732=0-4194304@1,10155=0-4194304@1} caps={9726=pAsLsXsFr/pAsxXsxFsxcrwb@2,9729=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2,9732=pAsLsXsFr/pAsxXsxFsxcrwb@2,10155=pAsLsXsFr/pAsxXsxFsxcrwb@2,10602=pAsLsXsFr/-@1} | request=5 lock=1 caps=1 waiter=1 authpin=1 0x88a7378]

client.9732 refused to release the Fw cap, which is strange. Next time you encounter a hang like this, please check whether there is a hung OSD request (cat /sys/kernel/debug/ceph/*/osdc).

Actions #5

Updated by Rolf Fokkens over 10 years ago

At first my response would be: there is no /sys/kernel/debug/ceph directory. But it later became clear that debugfs needs to be mounted, which isn't done by default on CentOS. It may be useful to document this (mounting debugfs) on this page: http://ceph.com/docs/master/dev/kernel-client-troubleshooting/
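
For completeness, mounting debugfs on CentOS looks roughly like this (standard commands, not Ceph-specific):

mount -t debugfs none /sys/kernel/debug
# or persistently via /etc/fstab:
# debugfs  /sys/kernel/debug  debugfs  defaults  0 0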

Actions #6

Updated by Jasper Siero over 10 years ago

We added an extra MDS on another host and turned the primary MDS off. It took a while (8 minutes, I think), but after that ceph status showed the health as OK. When running ls on the Ceph client everything looks fine, except for two specific directories on which ls hangs (the terminal session stops responding).
In /sys/kernel/debug/ceph/554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18675:

[root@ceph-client 554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18675]# cat mdsc
11 mds0 getattr #1000000041e
12 mds0 getattr #1000000041e
15 mds0 getattr #10000000446

[root@ceph-client 554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18675]# cat osdc
[root@ceph-client 554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18675]#

[root@ceph-client 554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18675]# cat mdsmap
epoch 4906
root 0
session_timeout 60
session_autoclose 300
mds0 10.1.2.2:6815 (up:active)
18

When running ceph mds tell 0 dumpcache /tmp/dump20131217.txt, the inodes #1000000041e and #10000000446 can't be found in that file. There is an old entry (dated 2013-11-25) for the directory I'm trying to ls in the dump20131217 file, which is possibly related to this problem:

[dentry #1/disco/jasper.siero [2,head] auth (dversion lock) v=8528 inode=0x919d358 | inodepin=1 0x620a528]
[inode 1000000041c [...2,head] /disco/jasper.siero/ auth v8528 f(v4 m2013-11-25 10:08:21.230300 37=37+0) n(v0 rc2013-11-25 10:08:21.230300 b3260545330 38=37+1) (inest lock) (iversion lock) caps={18675=pAsLsXsFs/-@2},l=18675 | request=0 lock=0 caps=1 authpin=0 0x919d358]
Actions #7

Updated by Jasper Siero over 10 years ago

After forcing Ceph to use the original MDS again, by stopping the new MDS, ceph -w shows:
2013-12-18 13:37:15.670336 mon.0 [INF] mds.0 10.1.2.1:6800/7061 up:reconnect
2013-12-18 13:37:15.670416 mon.0 [INF] mdsmap e4909: 1/1/1 up {0=ceph-mon01=up:reconnect}
2013-12-18 13:37:15.682754 mon.0 [INF] mds.0 10.1.2.1:6800/7061 up:rejoin
2013-12-18 13:37:15.682786 mon.0 [INF] mdsmap e4910: 1/1/1 up {0=ceph-mon01=up:rejoin}
2013-12-18 13:37:15.735397 mon.0 [INF] mds.0 10.1.2.1:6800/7061 up:active
2013-12-18 13:37:15.735436 mon.0 [INF] mdsmap e4911: 1/1/1 up {0=ceph-mon01=up:active}
2013-12-18 13:37:16.738495 mon.0 [INF] pgmap v51413: 792 pgs: 792 active+clean; 12469 MB data, 26348 MB used, 409 GB / 434 GB avail; 6 B/s rd, 0 op/s

After approximately 10 minutes:

2013-12-18 13:37:15.745101 mds.0 [WRN] failed to reconnect caps for missing inodes:
2013-12-18 13:37:15.745124 mds.0 [WRN] ino 10000000419
2013-12-18 13:37:15.745142 mds.0 [WRN] ino 1000000042a
2013-12-18 13:37:15.745160 mds.0 [WRN] ino 10000000445
2013-12-18 13:37:15.745178 mds.0 [WRN] ino 10000008914
2013-12-18 13:37:15.745196 mds.0 [WRN] ino 10000008915
2013-12-18 13:37:15.745214 mds.0 [WRN] ino 1000000891f
2013-12-18 13:37:15.745232 mds.0 [WRN] ino 10000008920
2013-12-18 13:37:15.745259 mds.0 [WRN] ino 10000008921
2013-12-18 13:37:15.745278 mds.0 [WRN] ino 10000010a3d
2013-12-18 13:37:15.745295 mds.0 [WRN] ino 10000010a3e
2013-12-18 13:37:15.745313 mds.0 [WRN] ino 10000010a46
2013-12-18 13:37:15.745331 mds.0 [WRN] ino 10000010a49
2013-12-18 13:37:15.745356 mds.0 [WRN] ino 10000010a4d
2013-12-18 13:37:15.745374 mds.0 [WRN] ino 10000010a4e
2013-12-18 13:37:15.745391 mds.0 [WRN] ino 10000010e06
2013-12-18 13:37:15.745409 mds.0 [WRN] ino 10000010e08
2013-12-18 13:37:15.745426 mds.0 [WRN] ino 10000010e09
2013-12-18 13:37:15.745489 mds.0 [WRN] ino 10000010e0a
2013-12-18 13:37:15.745534 mds.0 [WRN] ino 10000010e0c
2013-12-18 13:37:15.745553 mds.0 [WRN] ino 100000111f5
2013-12-18 13:37:15.745571 mds.0 [WRN] ino 100000111f8
2013-12-18 13:37:15.745589 mds.0 [WRN] ino 100000111f9
2013-12-18 13:37:15.745608 mds.0 [WRN] ino 100000111fa
2013-12-18 13:37:15.745626 mds.0 [WRN] ino 100000111fb
2013-12-18 13:37:15.745644 mds.0 [WRN] ino 100000111fc
2013-12-18 13:37:15.745662 mds.0 [WRN] ino 100000115e6

Actions #8

Updated by Jasper Siero over 10 years ago

Some similarity with http://tracker.ceph.com/issues/6087? We use the Emperor release, so that should already be resolved.

Actions #9

Updated by Greg Farnum over 10 years ago

Can you upload the cache dump? How did you search through it for those inodes? The only way I can think of to make a request hang without returning is if there's a bug in the MDS affecting cached data (by far the most likely), or the underlying RADOS pool isn't healthy.

Actions #11

Updated by Rolf Fokkens over 10 years ago

Also note the MDS log:

2013-12-18 17:03:43.775909 7f5e99e39700 1 mds.0.62 active_start
2013-12-18 17:03:43.782390 7f5e99e39700 0 log [WRN] : failed to reconnect caps for missing inodes:
2013-12-18 17:03:43.782423 7f5e99e39700 0 log [WRN] : ino 10000000415
2013-12-18 17:03:43.782466 7f5e99e39700 0 log [WRN] : ino 10000000416
2013-12-18 17:03:43.782486 7f5e99e39700 0 log [WRN] : ino 10000000417
2013-12-18 17:03:43.782504 7f5e99e39700 0 log [WRN] : ino 1000000041d
2013-12-18 17:03:43.782528 7f5e99e39700 0 log [WRN] : ino 1000000041e
2013-12-18 17:03:43.782546 7f5e99e39700 0 log [WRN] : ino 1000000041f
2013-12-18 17:03:43.782564 7f5e99e39700 0 log [WRN] : ino 10000000420
2013-12-18 17:03:43.782581 7f5e99e39700 0 log [WRN] : ino 10000000421
2013-12-18 17:03:43.782599 7f5e99e39700 0 log [WRN] : ino 10000000422
2013-12-18 17:03:43.782617 7f5e99e39700 0 log [WRN] : ino 10000000423
2013-12-18 17:03:43.782656 7f5e99e39700 0 log [WRN] : ino 10000000424
2013-12-18 17:03:43.782675 7f5e99e39700 0 log [WRN] : ino 10000000425
2013-12-18 17:03:43.782697 7f5e99e39700 0 log [WRN] : ino 10000000426
2013-12-18 17:03:43.782748 7f5e99e39700 0 log [WRN] : ino 10000000427
2013-12-18 17:03:43.782766 7f5e99e39700 0 log [WRN] : ino 10000000428
2013-12-18 17:03:43.782784 7f5e99e39700 0 log [WRN] : ino 1000000042a
2013-12-18 17:03:43.782800 7f5e99e39700 0 log [WRN] : ino 1000000042b
2013-12-18 17:03:43.782825 7f5e99e39700 0 log [WRN] : ino 1000000042c
2013-12-18 17:03:43.782843 7f5e99e39700 0 log [WRN] : ino 1000000042d
2013-12-18 17:03:43.782861 7f5e99e39700 0 log [WRN] : ino 1000000042e
2013-12-18 17:03:43.782878 7f5e99e39700 0 log [WRN] : ino 1000000042f
2013-12-18 17:03:43.782916 7f5e99e39700 0 log [WRN] : ino 10000000430
2013-12-18 17:03:43.782934 7f5e99e39700 0 log [WRN] : ino 10000000431
2013-12-18 17:03:43.782951 7f5e99e39700 0 log [WRN] : ino 10000000432
2013-12-18 17:03:43.782969 7f5e99e39700 0 log [WRN] : ino 10000000433
2013-12-18 17:03:43.782986 7f5e99e39700 0 log [WRN] : ino 10000000434
2013-12-18 17:03:43.783003 7f5e99e39700 0 log [WRN] : ino 10000000435
2013-12-18 17:03:43.783021 7f5e99e39700 0 log [WRN] : ino 10000000436
2013-12-18 17:03:43.783037 7f5e99e39700 0 log [WRN] : ino 10000000437
2013-12-18 17:03:43.783054 7f5e99e39700 0 log [WRN] : ino 10000000438
2013-12-18 17:03:43.783072 7f5e99e39700 0 log [WRN] : ino 10000000439
2013-12-18 17:03:43.783089 7f5e99e39700 0 log [WRN] : ino 1000000043a
2013-12-18 17:03:43.783106 7f5e99e39700 0 log [WRN] : ino 1000000043b
2013-12-18 17:03:43.783122 7f5e99e39700 0 log [WRN] : ino 1000000043c
2013-12-18 17:03:43.783139 7f5e99e39700 0 log [WRN] : ino 1000000043d
2013-12-18 17:03:43.783156 7f5e99e39700 0 log [WRN] : ino 1000000043e
2013-12-18 17:03:43.783173 7f5e99e39700 0 log [WRN] : ino 1000000043f
2013-12-18 17:03:43.783190 7f5e99e39700 0 log [WRN] : ino 10000000440
2013-12-18 17:03:43.783211 7f5e99e39700 0 log [WRN] : ino 10000000441
2013-12-18 17:03:43.783230 7f5e99e39700 0 log [WRN] : ino 10000000442
2013-12-18 17:03:43.783247 7f5e99e39700 0 log [WRN] : ino 10000000443
2013-12-18 17:03:43.783273 7f5e99e39700 0 log [WRN] : ino 10000000445
2013-12-18 17:03:43.783291 7f5e99e39700 0 log [WRN] : ino 10000000446
2013-12-18 17:03:43.783307 7f5e99e39700 0 log [WRN] : ino 10000000447
2013-12-18 17:03:43.783324 7f5e99e39700 0 log [WRN] : ino 10000000448
2013-12-18 17:03:43.783343 7f5e99e39700 0 log [WRN] : ino 10000000449
2013-12-18 17:03:43.783364 7f5e99e39700 0 log [WRN] : ino 1000000044a
2013-12-18 17:03:43.783381 7f5e99e39700 0 log [WRN] : ino 1000000044b
2013-12-18 17:03:43.783398 7f5e99e39700 0 log [WRN] : ino 1000000044c
2013-12-18 17:03:43.783419 7f5e99e39700 1 mds.0.62 cluster recovered.
2013-12-18 17:20:56.522134 7f5e99e39700 1 mds.0.cache dump_cache to /tmp/dump20131218.rolf

And this:

[root@ceph-client ~]# cat /sys/kernel/debug/ceph/554b69c8-bdca-4d63-b45c-f5fd16b5a836.client18750/mdsc
13 mds0 getattr #1000000042a
15 mds0 getattr #1000000041e
16 mds0 getattr #1000000041e
[root@ceph-client ~]#

mdsc reports (hanging?) inodes that also appear in the MDS log under "failed to reconnect caps for missing inodes".
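
A rough way to cross-check this (a sketch; the awk field assumes the mdsc format shown above, "<tid> mds0 getattr #<ino>"):

for ino in $(awk '{print $4}' /sys/kernel/debug/ceph/*/mdsc | tr -d '#'); do
    grep "ino $ino" /var/log/ceph/ceph-mds.ceph-mon01.log
done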

Actions #13

Updated by Rolf Fokkens over 10 years ago

Searching for the inodes was done like this:

[rolf@ceph-mon01 ~]$ bzip2 -d < /tmp/mds-20131219.dump.bz2 | grep 1000000041e
[rolf@ceph-mon01 ~]$ bzip2 -d < /tmp/mds-20131219.dump.bz2 | grep 1000000042a
[rolf@ceph-mon01 ~]$

Rendering no results.

Actions #14

Updated by Rolf Fokkens over 10 years ago

Waiting for this to be sorted out is taking a while, so to be able to keep using CephFS I did the following:

mv /data/disco /data/disco.broken
mkdir /data/disco
ceph osd pool rename cephfs cephfs.broken
ceph osd pool create cephfs 400 400
ceph mds add_data_pool cephfs
cephfs /data/disco set_layout --pool 5

So the old "cephfs.broken" pool can still be mounted, but we can now try to move on with a fresh pool.
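
As a sanity check (the pool numbering here matches the lspools output further down in this ticket), the pool ID passed to set_layout can be confirmed with:

ceph osd lspools
# 0 data,1 metadata,2 rbd,3 cephfs.broken,4 fatsheep,5 cephfs,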

Actions #15

Updated by Zheng Yan over 10 years ago

Jasper Siero wrote:

Some simularity with http://tracker.ceph.com/issues/6087? We use the Emperor release so that should be resolved.

When was the FS created? Before Dumpling?

Actions #16

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

Waiting for this to be sorted out is taking a while, so to be able to keep using CephFS I did the following:

mv /data/disco /data/disco.broken
mkdir /data/disco
ceph osd pool rename cephfs cephfs.broken
ceph osd pool create cephfs 400 400
ceph mds add_data_pool cephfs
cephfs /data/disco set_layout --pool 5

So the old "cephfs.broken" pool can still be mounted, but we can now try to move on with a fresh pool.

You can heal the FS by following:

https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg17336.html

Actions #17

Updated by Rolf Fokkens over 10 years ago

The Ceph cluster was built using Ceph 0.67. On Nov 13 it was upgraded to Ceph 0.72.

Actions #18

Updated by Rolf Fokkens over 10 years ago

I mounted the broken CephFS and did this (because it's easier than locating the specific inodes):

find -type d -exec setfattr -n ceph.parent {} \; -print

The MDS log shows:

2013-12-24 12:05:07.421844 7f5e96a32700 1 mds.0.62 suicide. wanted down:dne, now up:active
2013-12-24 12:06:27.732728 7fb294bda7a0 0 ceph version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de), process ceph-mds, pid 7216
2013-12-24 12:06:27.769510 7fb291efa700 1 mds.-1.0 handle_mds_map standby
2013-12-24 12:06:27.781003 7fb291efa700 1 mds.0.63 handle_mds_map i am now mds.0.63
2013-12-24 12:06:27.781006 7fb291efa700 1 mds.0.63 handle_mds_map state change up:standby --> up:replay
2013-12-24 12:06:27.781018 7fb291efa700 1 mds.0.63 replay_start
2013-12-24 12:06:27.781024 7fb291efa700 1 mds.0.63 recovery set is
2013-12-24 12:06:27.781026 7fb291efa700 1 mds.0.63 need osdmap epoch 8082, have 8081
2013-12-24 12:06:27.781028 7fb291efa700 1 mds.0.63 waiting for osdmap 8082 (which blacklists prior instance)
2013-12-24 12:06:27.854393 7fb291efa700 0 mds.0.cache creating system inode with ino:100
2013-12-24 12:06:27.854580 7fb291efa700 0 mds.0.cache creating system inode with ino:1
2013-12-24 12:11:18.409009 7fb28d4e6700 1 mds.0.63 replay_done
2013-12-24 12:11:18.409094 7fb28d4e6700 1 mds.0.63 making mds journal writeable
2013-12-24 12:11:18.416221 7fb291efa700 1 mds.0.63 handle_mds_map i am now mds.0.63
2013-12-24 12:11:18.416224 7fb291efa700 1 mds.0.63 handle_mds_map state change up:replay --> up:reconnect
2013-12-24 12:11:18.416226 7fb291efa700 1 mds.0.63 reconnect_start
2013-12-24 12:11:18.416228 7fb291efa700 1 mds.0.63 reopen_log
2013-12-24 12:11:18.416238 7fb291efa700 1 mds.0.63 reconnect_done
2013-12-24 12:11:18.420394 7fb291efa700 1 mds.0.63 handle_mds_map i am now mds.0.63
2013-12-24 12:11:18.420396 7fb291efa700 1 mds.0.63 handle_mds_map state change up:reconnect --> up:rejoin
2013-12-24 12:11:18.420398 7fb291efa700 1 mds.0.63 rejoin_start
2013-12-24 12:11:18.421493 7fb291efa700 1 mds.0.63 rejoin_joint_start
2013-12-24 12:11:18.441659 7fb291efa700 1 mds.0.63 rejoin_done
2013-12-24 12:11:18.449265 7fb291efa700 1 mds.0.63 handle_mds_map i am now mds.0.63
2013-12-24 12:11:18.449299 7fb291efa700 1 mds.0.63 handle_mds_map state change up:rejoin --> up:active
2013-12-24 12:11:18.449318 7fb291efa700 1 mds.0.63 recovery_done -- successful recovery!
2013-12-24 12:11:18.450921 7fb291efa700 1 mds.0.63 active_start
2013-12-24 12:11:18.452526 7fb291efa700 1 mds.0.63 cluster recovered.

So that seems OK!

Actions #19

Updated by Rolf Fokkens over 10 years ago

We'll try to access the repaired FS and see if it still hangs.

Actions #20

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

The Ceph cluster was built using Ceph 0.67. On nov 13 it was upgraded to Ceph 0.72

Looks like the hang was not due to missing backtraces. Please try the attached kernel patch if you still experience the hang.

Actions #21

Updated by Rolf Fokkens over 10 years ago

I managed to get a whole new list of "failed to reconnect caps for missing inodes" messages by doing the following:

- rebuild the requested kernel on CephFS on a Ceph client (a nice test case)
- while rebuilding the kernel, shut down the Ceph storage cluster
- upgrade the cluster from Ceph 0.72.1 to 0.72.2
- reboot the cluster

After the MDS has been in state "replay" for a while, an impressive list of "failed to reconnect caps for missing inodes" is presented in the MDS log. And the client hangs, with mdsc containing "456488 mds0 getattr #1000001de96", which refers to one of the inodes mentioned in the MDS log.

So it looks like this problem is the result of rebooting the ceph cluster while one of the clients is fiddling with files/directories on cephfs.

This hang situation is nice though since it allows me to test the patch.

Actions #22

Updated by Rolf Fokkens over 10 years ago

FYI, one other thing: performance is really bad when building a kernel on CephFS. "ceph status" generally shows < 30 op/s, mostly < 15 op/s and occasionally 100 op/s. mdsc on the client sometimes shows 1000 entries during the initial extraction of the kernel archive, during which "ceph status" also shows "HEALTH_WARN # requests are blocked > 32 sec; # osds have slow requests". I don't see any network latency or I/O wait on the OSDs, but I need to look into this in further detail.

This may be totally unrelated, but I guess it might be worth mentioning.

Actions #23

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

I managed to get a whole new list of "failed to reconnect caps for missing inodes" messages by doing the following:

- rebuild the requested kernel on CephFS on a Ceph client (a nice test case)
- while rebuilding the kernel, shut down the Ceph storage cluster
- upgrade the cluster from Ceph 0.72.1 to 0.72.2
- reboot the cluster

After the MDS has been in state "replay" for a while, an impressive list of "failed to reconnect caps for missing inodes" is presented in the MDS log. And the client hangs, with mdsc containing "456488 mds0 getattr #1000001de96", which refers to one of the inodes mentioned in the MDS log.

Please check whether these missing inodes exist in the object store. If they exist, do they have the parent xattr? (For example: rados -p data/metadata getxattr parent 1000001de96.00000000)

So it looks like this problem is the result of rebooting the ceph cluster while one of the clients is fiddling with files/directories on cephfs.

This hang situation is nice though since it allows me to test the patch.

Actions #24

Updated by Zheng Yan over 10 years ago

Rolf Fokkens wrote:

FYI, one other thing: performance is really bad when building a kernel on CephFS. "ceph status" generally shows < 30 op/s, mostly < 15 op/s and occasionally 100 op/s. mdsc on the client sometimes shows 1000 entries during the initial extraction of the kernel archive, during which "ceph status" also shows "HEALTH_WARN # requests are blocked > 32 sec; # osds have slow requests". I don't see any network latency or I/O wait on the OSDs, but I need to look into this in further detail.

Are the CPU usages of ceph-mds and ceph-osd high?

This may be totally unrelated, but I guess it might be worth mentioning.

Actions #25

Updated by Rolf Fokkens over 10 years ago

Zheng Yan wrote:

Rolf Fokkens wrote:

After the MDS has been in state "replay" for a while, an impressive list of "failed to reconnect caps for missing inodes" is presented in the MDS log. And the client hangs, with mdsc containing "456488 mds0 getattr #1000001de96", which refers to one of the inodes mentioned in the MDS log.

Please check whether these missing inodes exist in the object store. If they exist, do they have the parent xattr? (For example: rados -p data/metadata getxattr parent 1000001de96.00000000)

Because rebuilding the kernel removes all files and then creates new ones, inodes like 1000001de96 seem to be gone. Restarting the MDS, though, produces a new list:
2013-12-30 10:08:33.529417 7f85eae09700 0 log [WRN] : failed to reconnect caps for missing inodes:
2013-12-30 10:08:33.529449 7f85eae09700 0 log [WRN] : ino 10000000472
2013-12-30 10:08:33.529469 7f85eae09700 0 log [WRN] : ino 10000000473
2013-12-30 10:08:33.529522 7f85eae09700 0 log [WRN] : ino 10000016885
2013-12-30 10:08:33.529547 7f85eae09700 0 log [WRN] : ino 10000016886
2013-12-30 10:08:33.529566 7f85eae09700 0 log [WRN] : ino 10000016887
2013-12-30 10:08:33.529584 7f85eae09700 0 log [WRN] : ino 10000016888
2013-12-30 10:08:33.529602 7f85eae09700 0 log [WRN] : ino 1000001688d
2013-12-30 10:08:33.529621 7f85eae09700 0 log [WRN] : ino 1000002e413
2013-12-30 10:08:33.529639 7f85eae09700 0 log [WRN] : ino 1000002e414
2013-12-30 10:08:33.529668 7f85eae09700 0 log [WRN] : ino 1000002e415

[root@ceph-mon01 ceph]# ceph osd lspools
0 data,1 metadata,2 rbd,3 cephfs.broken,4 fatsheep,5 cephfs,

rados ls -p <pool> | grep 10000000472

This only returns results for the pool cephfs (which is the pool we use for CephFS):
[root@ceph-mon01 ceph]# rados ls -p cephfs | grep 10000000472
10000000472.00000001
10000000472.00000000
[root@ceph-mon01 ceph]#
[root@ceph-mon01 ceph]# rados -p cephfs getxattr parent 10000000472.00000000
error getting xattr cephfs/parent/10000000472.00000000: No such file or directory
[root@ceph-mon01 ceph]# rados -p cephfs getxattr parent 10000000472.00000001
error getting xattr cephfs/parent/10000000472.00000001: No such file or directory

But, for other objects:
[root@ceph-mon01 ceph]# rados ls -p cephfs | grep 10000012f69
10000012f69.00000000
[root@ceph-mon01 ceph]# rados -p cephfs getxattr parent 10000012f69.00000000
error getting xattr cephfs/parent/10000012f69.00000000: No such file or directory
[root@ceph-mon01 ceph]#

Same result. What am I doing wrong?
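
One hedged guess, judging from the error path "cephfs/parent/10000000472.00000000": rados appears to have taken "parent" as the object name, so the argument order may need to be object first, then xattr name, e.g.:

rados -p cephfs getxattr 10000000472.00000000 parent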

Actions #26

Updated by Rolf Fokkens over 10 years ago

Zheng Yan wrote:

Are the CPU usages of ceph-mds and ceph-osd high?

Hardly any CPU usage; idle is > 95% on all cluster nodes, and iowait is very low too.

Some potentially relevant stats:

[root@ceph-mon01 ceph]# ceph osd perf
osdid  fs_commit_latency(ms)  fs_apply_latency(ms)
    0                     23                    54
    1                     27                     2
    2                     28                     2
    3                      2                     5
    4                     23                    26
    5                     16                     6
    6                     20                    25
    7                     23                    71
    8                     30                    75
    9                     27                    66
   10                      4                     3
   11                      3                     3
   12                    176                    74
   13                    242                   114
   14                    258                   124

[root@ceph-mon01 ceph]#

Network latency between the cluster nodes is about 1ms.

Actions #27

Updated by Rolf Fokkens over 10 years ago

We plan to do a clean reinstall of the cluster next week (January 6). Of course, we're more than happy to provide any information on the cluster until then. Because it's a test cluster, we can also give SSH access to the people involved.

Actions #28

Updated by Zheng Yan over 10 years ago

For building the kernel, try increasing the number of jobs that run simultaneously.
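
For example (the exact job count is just an illustration):

make -j8    # or: make -j"$(nproc)"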

Actions #29

Updated by Zheng Yan over 10 years ago

I think I know what was going on. Please try the attached patch; it should solve your issue. Thank you for your detailed bug report.

Actions #30

Updated by Zheng Yan over 10 years ago

The previous patch doesn't work. Please try the new one, thank you.

Actions #31

Updated by Rolf Fokkens over 10 years ago

Actions #32

Updated by Rolf Fokkens over 10 years ago

I'm currently building the RPMs using the ceph.spec mentioned above, which includes the patch.

Actions #33

Updated by Rolf Fokkens over 10 years ago

error: Installed (but unpackaged) file(s) found:
/usr/lib64/erasure-code/libec_example.so
/usr/lib64/erasure-code/libec_example.so.0
/usr/lib64/erasure-code/libec_example.so.0.0.0
/usr/lib64/erasure-code/libec_fail_to_initialize.so
/usr/lib64/erasure-code/libec_fail_to_initialize.so.0
/usr/lib64/erasure-code/libec_fail_to_initialize.so.0.0.0
/usr/lib64/erasure-code/libec_fail_to_register.so
/usr/lib64/erasure-code/libec_fail_to_register.so.0
/usr/lib64/erasure-code/libec_fail_to_register.so.0.0.0
/usr/lib64/erasure-code/libec_hangs.so
/usr/lib64/erasure-code/libec_hangs.so.0
/usr/lib64/erasure-code/libec_hangs.so.0.0.0
/usr/lib64/erasure-code/libec_jerasure.so
/usr/lib64/erasure-code/libec_jerasure.so.1
/usr/lib64/erasure-code/libec_jerasure.so.1.0.0
/usr/lib64/erasure-code/libec_missing_entry_point.so
/usr/lib64/erasure-code/libec_missing_entry_point.so.0
/usr/lib64/erasure-code/libec_missing_entry_point.so.0.0.0

I fixed the spec file (following https://github.com/ceph/ceph/blob/master/ceph.spec.in) and am rebuilding...

Actions #34

Updated by Rolf Fokkens over 10 years ago

The RPMs built fine now. I upgraded all nodes in the Ceph cluster.

Actions #35

Updated by Rolf Fokkens over 10 years ago

First I tested the kernel rebuild (-j 8 for the make command) twice; this went fine (except for performance: one kernel build took 18 hours).

Then I initiated the kernel rebuild again, and stopped and started the MDS a few times during the build. So far so good: no more "failed to reconnect caps for missing inodes" messages in the MDS log, and the rebuild continued after the MDS was back online. No more stuck inodes in the client's /sys/kernel/debug/ceph/.../mdsc.
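
A quick way to confirm that, reusing the checks from earlier in this ticket (same paths and log strings as before):

cat /sys/kernel/debug/ceph/*/mdsc    # should stay empty once the MDS is back
grep "failed to reconnect caps" /var/log/ceph/ceph-mds.ceph-mon01.log    # should show no new entries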

Actions #36

Updated by Zheng Yan over 10 years ago

  • Status changed from 12 to Resolved

Fixed by commit f292992b82d042a9f738e4698802e3b205b82c83.

Actions #37

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added