Bug #16013 (closed)

Failing file operations on kernel based cephfs mount point leaves unaccessible file behind on hammer 0.94.7

Added by Burkhard Linke almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: jewel,hammer
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(FS): -
Labels (FS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

After some number of file operations (which we could not trace well enough to reproduce), we ended up with a broken directory:

  # ls -al
    ls: cannot access DensityMap: Invalid argument
    total 0
    drwxr-sr-x 1 XXXX XXXX 0 May 24 17:50 .
    drwxr-sr-x 1 XXXX XXXX 260874151 May 24 17:55 ..
    l????????? ? ? ? ? ? DensityMap

Accessing the directory results in a kernel error message:

  # uname -a
    Linux waas 4.4.0-22-generic #39~14.04.1-Ubuntu SMP Thu May 5 19:19:06 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[Wed May 25 09:08:23 2016] ------------[ cut here ]------------
[Wed May 25 09:08:23 2016] WARNING: CPU: 8 PID: 9700 at /build/linux-lts-xenial-7RlTta/linux-lts-xenial-4.4.0/fs/ceph/inode.c:811 fill_inode.isra.16+0xb2a/0xc00 [ceph]()
[Wed May 25 09:08:23 2016] Modules linked in: tcp_diag(E) inet_diag(E) ceph(E) libceph(E) usblp(E) parport_pc(E) ppdev(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) iptable_filter ip_tables x_tables cts 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache intel_rapl x86_pkg_temp_thermal ipmi_ssif intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw input_leds hpilo gf128mul glue_helper joydev ablk_helper shpchp ioatdma cryptd lpc_ich sb_edac serio_raw edac_core acpi_power_meter 8250_fintek wmi mac_hid ipmi_si ipmi_msghandler lp parport btrfs xor raid6_pq hid_generic usbhid hid ixgbe dca psmouse vxlan tg3 ip6_udp_tunnel hpsa udp_tunnel ptp pps_core scsi_transport_sas mdio fjes [last unloaded: libceph]
[Wed May 25 09:08:23 2016] CPU: 8 PID: 9700 Comm: kworker/8:0 Tainted: G W E 4.4.0-22-generic #39~14.04.1-Ubuntu
[Wed May 25 09:08:23 2016] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 07/20/2015
[Wed May 25 09:08:23 2016] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Wed May 25 09:08:23 2016] 0000000000000000 ffff881fde657a28 ffffffff813cde6c 0000000000000000
[Wed May 25 09:08:23 2016] ffffffffc08003b0 ffff881fde657a60 ffffffff8107d856 00000000ffffffea
[Wed May 25 09:08:23 2016] 0000000000000000 0000000000000000 ffff880422556308 ffff881f1e5e053a
[Wed May 25 09:08:23 2016] Call Trace:
[Wed May 25 09:08:23 2016] [<ffffffff813cde6c>] dump_stack+0x63/0x87
[Wed May 25 09:08:23 2016] [<ffffffff8107d856>] warn_slowpath_common+0x86/0xc0
[Wed May 25 09:08:23 2016] [<ffffffff8107d94a>] warn_slowpath_null+0x1a/0x20
[Wed May 25 09:08:23 2016] [<ffffffffc07d899a>] fill_inode.isra.16+0xb2a/0xc00 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07d65e0>] ? ceph_mount+0x810/0x810 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07d9604>] ceph_readdir_prepopulate+0x224/0x8c0 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07f7342>] handle_reply+0xa32/0xca0 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07f918e>] dispatch+0xae/0xaf0 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc0618ac3>] try_read+0x443/0x1120 [libceph]
[Wed May 25 09:08:23 2016] [<ffffffff81036a09>] ? sched_clock+0x9/0x10
[Wed May 25 09:08:23 2016] [<ffffffff810b31c5>] ? put_prev_entity+0x35/0x670
[Wed May 25 09:08:23 2016] [<ffffffff8102c696>] ? __switch_to+0x1d6/0x570
[Wed May 25 09:08:23 2016] [<ffffffffc0619852>] ceph_con_workfn+0xb2/0x5d0 [libceph]
[Wed May 25 09:08:23 2016] [<ffffffff810959cd>] process_one_work+0x14d/0x3f0
[Wed May 25 09:08:23 2016] [<ffffffff8109614a>] worker_thread+0x11a/0x470
[Wed May 25 09:08:23 2016] [<ffffffff817ebe19>] ? __schedule+0x359/0x970
[Wed May 25 09:08:23 2016] [<ffffffff81096030>] ? rescuer_thread+0x310/0x310
[Wed May 25 09:08:23 2016] [<ffffffff8109b882>] kthread+0xd2/0xf0
[Wed May 25 09:08:23 2016] [<ffffffff8109b7b0>] ? kthread_park+0x50/0x50
[Wed May 25 09:08:23 2016] [<ffffffff817f004f>] ret_from_fork+0x3f/0x70
[Wed May 25 09:08:23 2016] [<ffffffff8109b7b0>] ? kthread_park+0x50/0x50
[Wed May 25 09:08:23 2016] ---[ end trace 99ae552d517bb8d0 ]---
[Wed May 25 09:08:23 2016] ceph: fill_inode badness on ffff880422556308
[Wed May 25 09:08:23 2016] ------------[ cut here ]------------
[Wed May 25 09:08:23 2016] WARNING: CPU: 8 PID: 9700 at /build/linux-lts-xenial-7RlTta/linux-lts-xenial-4.4.0/fs/ceph/inode.c:811 fill_inode.isra.16+0xb2a/0xc00 [ceph]()
[Wed May 25 09:08:23 2016] Modules linked in: tcp_diag(E) inet_diag(E) ceph(E) libceph(E) usblp(E) parport_pc(E) ppdev(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) iptable_filter ip_tables x_tables cts 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache intel_rapl x86_pkg_temp_thermal ipmi_ssif intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw input_leds hpilo gf128mul glue_helper joydev ablk_helper shpchp ioatdma cryptd lpc_ich sb_edac serio_raw edac_core acpi_power_meter 8250_fintek wmi mac_hid ipmi_si ipmi_msghandler lp parport btrfs xor raid6_pq hid_generic usbhid hid ixgbe dca psmouse vxlan tg3 ip6_udp_tunnel hpsa udp_tunnel ptp pps_core scsi_transport_sas mdio fjes [last unloaded: libceph]
[Wed May 25 09:08:23 2016] CPU: 8 PID: 9700 Comm: kworker/8:0 Tainted: G W E 4.4.0-22-generic #39~14.04.1-Ubuntu
[Wed May 25 09:08:23 2016] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 07/20/2015
[Wed May 25 09:08:23 2016] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Wed May 25 09:08:23 2016] 0000000000000000 ffff881fde657a50 ffffffff813cde6c 0000000000000000
[Wed May 25 09:08:23 2016] ffffffffc08003b0 ffff881fde657a88 ffffffff8107d856 00000000ffffffea
[Wed May 25 09:08:23 2016] 0000000000000000 0000000000000000 ffff880422556e48 ffff881f1e5e0130
[Wed May 25 09:08:23 2016] Call Trace:
[Wed May 25 09:08:23 2016] [<ffffffff813cde6c>] dump_stack+0x63/0x87
[Wed May 25 09:08:23 2016] [<ffffffff8107d856>] warn_slowpath_common+0x86/0xc0
[Wed May 25 09:08:23 2016] [<ffffffff8107d94a>] warn_slowpath_null+0x1a/0x20
[Wed May 25 09:08:23 2016] [<ffffffffc07d899a>] fill_inode.isra.16+0xb2a/0xc00 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffff812167c5>] ? inode_init_always+0x105/0x1b0
[Wed May 25 09:08:23 2016] [<ffffffffc07d8b87>] ceph_fill_trace+0x117/0x970 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07f6d39>] handle_reply+0x429/0xca0 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc07f918e>] dispatch+0xae/0xaf0 [ceph]
[Wed May 25 09:08:23 2016] [<ffffffffc0618ac3>] try_read+0x443/0x1120 [libceph]
[Wed May 25 09:08:23 2016] [<ffffffff810b31c5>] ? put_prev_entity+0x35/0x670
[Wed May 25 09:08:23 2016] [<ffffffff8102c696>] ? __switch_to+0x1d6/0x570
[Wed May 25 09:08:23 2016] [<ffffffffc0619852>] ceph_con_workfn+0xb2/0x5d0 [libceph]
[Wed May 25 09:08:23 2016] [<ffffffff810959cd>] process_one_work+0x14d/0x3f0
[Wed May 25 09:08:23 2016] [<ffffffff8109614a>] worker_thread+0x11a/0x470
[Wed May 25 09:08:23 2016] [<ffffffff817ebe19>] ? __schedule+0x359/0x970
[Wed May 25 09:08:23 2016] [<ffffffff81096030>] ? rescuer_thread+0x310/0x310
[Wed May 25 09:08:23 2016] [<ffffffff8109b882>] kthread+0xd2/0xf0
[Wed May 25 09:08:23 2016] [<ffffffff8109b7b0>] ? kthread_park+0x50/0x50
[Wed May 25 09:08:23 2016] [<ffffffff817f004f>] ret_from_fork+0x3f/0x70
[Wed May 25 09:08:23 2016] [<ffffffff8109b7b0>] ? kthread_park+0x50/0x50
[Wed May 25 09:08:23 2016] ---[ end trace 99ae552d517bb8d1 ]---
[Wed May 25 09:08:23 2016] ceph: fill_inode badness ffff880422556e48 100041dee77.fffffffffffffffe

Attempts to remove the affected file, or to move the directory itself out of the way, also fail with similar error messages.

How do I resolve this problem?
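
For anyone hitting the same symptom, here is a minimal sketch of how the breakage shows up on individual syscalls and how to tie it back to the kernel log; the commands and error wording are illustrative, and the file/inode names are simply the ones from this report:

  # stat DensityMap                       (fails with "Invalid argument", like the ls output above)
  # rm DensityMap                         (also returns EINVAL; the entry cannot be removed)
  # dmesg | grep 'fill_inode badness'     (shows the affected inode, here 100041dee77)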


Related issues (3 total, 0 open, 3 closed)

Related to CephFS - Bug #16983: mds: handle_client_open failing on open (Resolved, Patrick Donnelly, 08/10/2016)

Copied to CephFS - Backport #16625: jewel: Failing file operations on kernel based cephfs mount point leaves unaccessible file behind on hammer 0.94.7 (Resolved, Nathan Cutler)
Copied to CephFS - Backport #16626: hammer: Failing file operations on kernel based cephfs mount point leaves unaccessible file behind on hammer 0.94.7 (Resolved, Nathan Cutler)
Actions #1

Updated by Burkhard Linke almost 8 years ago

Additional information:

- the Ceph cluster reported the client failing to release capabilities prior to the current state
- attempts to terminate the processes working on the file on the client failed (number of concurrent clients unknown)
- access to the directory from other clients was blocked
- after some time spent trying to terminate all processes on the client, the lock was released
- the directory was in the state described above
- an attempt to unmount the file system resulted in an MDS crash:
-10> 2016-05-24 18:59:44.676372 7fb57683e700 5 - op tracker -- seq: 103094532, time: 2016-05-24 18:59:44.676372, event: done, op: client_request(client.2755773:7974969 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:17.345798)
-9> 2016-05-24 18:59:44.676401 7fb57683e700 5 - op tracker -- seq: 103094659, time: 2016-05-24 18:59:44.676401, event: acquired locks, op: client_request(client.2459020:4018182 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:36.970343)
-8> 2016-05-24 18:59:44.676422 7fb57683e700 5 - op tracker -- seq: 103094659, time: 2016-05-24 18:59:44.676422, event: replying, op: client_request(client.2459020:4018182 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:36.970343)
-7> 2016-05-24 18:59:44.676433 7fb57683e700 1 - 192.168.6.129:6801/12700 --> 134.176.31.195:0/3862619230 -- client_reply(???:4018182 = 0 (0) Success) v1 -- ?+0 0x51fdd9600 con 0x52c6100
-6> 2016-05-24 18:59:44.676455 7fb57683e700 5 - op tracker -- seq: 103094659, time: 2016-05-24 18:59:44.676455, event: finishing request, op: client_request(client.2459020:4018182 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:36.970343)
-5> 2016-05-24 18:59:44.676468 7fb57683e700 5 - op tracker -- seq: 103094659, time: 2016-05-24 18:59:44.676468, event: cleaned up request, op: client_request(client.2459020:4018182 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:36.970343)
-4> 2016-05-24 18:59:44.676608 7fb57683e700 5 - op tracker -- seq: 103094659, time: 2016-05-24 18:59:44.676608, event: done, op: client_request(client.2459020:4018182 getattr pAsLsXsFs #1000415faaa 2016-05-24 18:54:36.970343)
-3> 2016-05-24 18:59:44.676685 7fb57683e700 1 - 192.168.6.129:6801/12700 --> 134.176.31.195:0/3862619230 -- client_caps(grant ino 100023ff2a0 269530785 seq 3 caps=pAsLsXsFs dirty=- wanted=pAsLsXsFsx follows 0 size 0/0 ts 1 mtime 2016-04-28 16:25:57.293052) v5 -- ?+0 0x511064400 con 0x52c6100
-2> 2016-05-24 18:59:44.677056 7fb57683e700 5 - op tracker -- seq: 103090515, time: 2016-05-24 18:59:44.677056, event: cleaned up request, op: client_request(client.2459020:4018181 rename #100023ff2a0/bin #1000415faaa/bin 2016-05-24 18:30:15.609041)
-1> 2016-05-24 18:59:44.683497 7fb55002c700 1 - 192.168.6.129:6801/12700 <== osd.20 192.168.6.137:6816/10708 12536 ==== osd_op_reply(55226564 300.00065741 [write 3902322~35604] v372342'228652 uv228652 ondisk = 0) v6 ==== 179+0+0 (3698124487 0 0) 0x41eb66000 con 0x519389b80
0> 2016-05-24 18:59:44.831074 7fb57683e700 -1 mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7fb57683e700 time 2016-05-24 18:59:44.681534
mds/MDCache.cc: 8554: FAILED assert(p != active_requests.end())

ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x968c15]
2: (MDCache::request_get(metareqid_t)+0x2a8) [0x64a818]
3: (Server::journal_close_session(Session*, int, Context*)+0x7a2) [0x5f4e92]
4: (Server::handle_client_session(MClientSession*)+0xf9b) [0x5f605b]
5: (Server::dispatch(Message*)+0x583) [0x6262f3]
6: (MDS::handle_deferrable_message(Message*)+0x837) [0x59ddd7]
7: (MDS::_dispatch(Message*)+0x97) [0x5ba337]
8: (MDS::ms_dispatch(Message*)+0xa3) [0x5bb713]
9: (DispatchQueue::entry()+0x62a) [0xa5835a]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x950b2d]
11: (()+0x7dc5) [0x7fb57bc4ddc5]
12: (clone()+0x6d) [0x7fb57a93328d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

(The MDS version at that time was 0.94.6; after the restart it was 0.94.7 due to a pending update.)

- rebooted the client
- further attempts to access the directory result in the error messages above
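
Regarding the NOTE at the end of the MDS backtrace above: the frames here are already symbolized, but if they were not, a rough sketch of resolving the raw addresses would look like the following (assuming the matching ceph-mds 0.94.6 binary and its debug symbols are available; the path is illustrative):

  # objdump -rdS /usr/bin/ceph-mds > ceph-mds.asm
  # addr2line -Cfe /usr/bin/ceph-mds 0x64a818     (should resolve to MDCache::request_get if the binary matches)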

Actions #2

Updated by Burkhard Linke almost 8 years ago

Another addition:

- mounted cephfs using ceph-fuse
- directory is listed correctly:
  # ls -al bin
    total 1
    drwxr-sr-x 1 XXXX XXXX 0 May 24 17:50 .
    drwxr-sr-x 1 XXXX XXXX 5547911 May 25 09:48 ..
    lrwxrwxrwx 1 XXXX XXXX 0 May 24 17:48 DensityMap -> ../share/DensityMap/DensityMap.pl
- removing the directory with rm -r blocks
- MDS log:
    2016-05-25 09:53:12.733145 7f49809bd700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.750201 secs
    2016-05-25 09:53:12.733175 7f49809bd700 0 log_channel(cluster) log [WRN] : slow request 34.750201 seconds old, received at 2016-05-25 09:52:37.982900: client_request(client.2779204:8 unlink #100041dee2e/DensityMap 2016-05-25 09:52:37.982202) currently failed to xlock, waiting
    2016-05-25 09:53:42.674361 7f49809bd700 0 log_channel(cluster) log [WRN] : client.2643258 isn't responding to mclientcaps(revoke), ino 100041dee77 pending pAsXsFscr issued pAsLsXsFscr, sent 64.690293 seconds ago
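
For diagnosing the "failed to xlock" and stuck cap revoke above, a sketch of the MDS admin-socket queries one could run on the MDS host (the daemon name mds.a is a placeholder; these commands exist in hammer/jewel, though output formats vary by version):

  # ceph daemon mds.a dump_ops_in_flight    (shows the blocked unlink request and what it is waiting for)
  # ceph daemon mds.a dump_historic_ops     (recently completed slow requests)
  # ceph daemon mds.a session ls            (lists client sessions, e.g. client.2643258 holding the caps under revoke)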
Actions #3

Updated by Zheng Yan almost 8 years ago

  • Status changed from New to 12

The warning is caused by a bad inode size for the symlink; commit https://github.com/ceph/ceph-client/commit/8e876900de5196216f193e747d79fa727ed1453f removes the warning.

The bad symlink i_size is likely caused by the 'open truncate symlink' bug, which is fixed by https://github.com/ceph/ceph/pull/8778

To avoid the warning, just delete the symlink from a ceph-fuse mount, then re-create it.
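
A sketch of that workaround using the names from this report (the ceph-fuse mount point, monitor address, and directory path are placeholders; the symlink target is the one shown in the ceph-fuse listing above):

  # ceph-fuse -m mon-host:6789 /mnt/cephfs-fuse
  # cd /mnt/cephfs-fuse/path/to/bin
  # readlink DensityMap                     (target here: ../share/DensityMap/DensityMap.pl)
  # rm DensityMap
  # ln -s ../share/DensityMap/DensityMap.pl DensityMap
  # fusermount -u /mnt/cephfs-fuse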

Actions #4

Updated by Burkhard Linke almost 8 years ago

Thanks for the fast reply.

I had to restart the MDS server due to pending locks (which also blocked the ceph-fuse client), but afterwards I was able to remove the stale symlink. The issue is solved now.

Actions #5

Updated by Zheng Yan almost 8 years ago

  • Status changed from 12 to Resolved
Actions #6

Updated by Nathan Cutler almost 8 years ago

  • Backport set to jewel
Actions #7

Updated by Nathan Cutler almost 8 years ago

  • Status changed from Resolved to Pending Backport
Actions #9

Updated by Nathan Cutler almost 8 years ago

  • Backport changed from jewel to jewel,hammer
Actions #10

Updated by Nathan Cutler almost 8 years ago

  • Project changed from Linux kernel client to CephFS
Actions #11

Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #16625: jewel: Failing file operations on kernel based cephfs mount point leaves unaccessible file behind on hammer 0.94.7 added
Actions #12

Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #16626: hammer: Failing file operations on kernel based cephfs mount point leaves unaccessible file behind on hammer 0.94.7 added
Actions #13

Updated by Greg Farnum over 7 years ago

  • Category set to Correctness/Safety
Actions #14

Updated by Patrick Donnelly over 7 years ago

A user is seeing an assertion failure in the MDS in v10.2.1:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012228.html

Issue for that filed here: http://tracker.ceph.com/issues/16983

Actions #15

Updated by Patrick Donnelly over 7 years ago

  • Related to Bug #16983: mds: handle_client_open failing on open added
Actions #16

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved