Bug #62604: write() hangs forever in ceph_get_caps
Related to: CephFS - Bug #56067: Cephfs data loss with root_squash enabled (closed)
Description
Hi,
While doing some tests after setting up a Ceph cluster with 2 nodes on NixOS, each with 4 OSDs, I observe dd hanging while writing to a large file. I'm using the kernel cephfs module from another client node, and I can reproduce it on every invocation.
Here is the command I'm using:
hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress
...
08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55
08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432
08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached
 <detached ...>
21+0 records in
20+0 records out
671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s
I left it hanging for one day. Here is the stack:
hut# cat /proc/1402180/stack
[<0>] wait_woken+0x54/0x70
[<0>] ceph_get_caps+0x4b3/0x6f0 [ceph]
[<0>] ceph_write_iter+0x316/0xdc0 [ceph]
[<0>] vfs_write+0x22e/0x3f0
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x3e/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1
I tested with Ceph 17.2.5 and 18.2.0, both times on Linux kernel 6.4.11 on all nodes.
This is similar to https://tracker.ceph.com/issues/54044, but if it is the same problem, it does not seem to be fixed in 6.4.11.
The cluster status stays HEALTH_OK during the hang, so I suspect the problem is on the client side.
I enabled the dynamic debug for the caps file in the ceph kernel module with:
hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control
I attach what I see in dmesg (AFAIK dmesg and strace use different clocks, so the timestamps don't match exactly).
Here is the status:
bay$ sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 9h)
    mgr: bay(active, since 9h)
    mds: 1/1 daemons up, 1 standby
    osd: 8 osds: 8 up (since 9h), 8 in (since 2d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 545 pgs
    objects: 1.98k objects, 7.6 GiB
    usage:   24 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     545 active+clean
Here is the fs:
bay$ sudo ceph fs dump
e3468
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch   3468
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-08-02T11:55:26.535585+0200
modified        2023-08-27T23:51:49.313474+0200
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  1010
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=244132}
failed
damaged
stopped
data_pools      [5]
metadata_pool   6
inline_data     disabled
balancer
bal_rank_mask   -1
standby_count_wanted    1
[mds.mds1{0:244132} state up:active seq 5 addr [v2:10.0.40.40:6802/2653564626,v1:10.0.40.40:6803/2653564626] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds0{-1:254107} state up:standby seq 1 addr [v2:10.0.40.40:6800/1345562638,v1:10.0.40.40:6801/1345562638] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 3468
Pools:
bay$ sudo ceph osd pool ls detail
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 119 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 7.89
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 131 lfor 0/0/94 flags hashpspool stripe_width 0 application rgw read_balance_score 3.50
pool 5 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 926 lfor 0/0/751 flags hashpspool,bulk stripe_width 0 target_size_ratio 0.8 application cephfs read_balance_score 2.00
pool 6 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 938 lfor 0/0/936 flags hashpspool,bulk stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 target_size_ratio 0.2 application cephfs read_balance_score 1.78
I modified the CRUSH rule to replicate among OSDs instead of nodes, as I don't have enough nodes yet for a pool size of 3. Here is my ceph config:
bay$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required=cephx
auth cluster required=cephx
auth service required=cephx
cluster name=ceph
cluster network=10.0.40.40/24
err_to_stderr=true
fsid=9c8d06e0-485f-4aaf-b16b-06d6daf1232b
log_file=/dev/null
log_to_file=false
log_to_stderr=true
max open files=131072
mgr module path=/nix/store/c7xa89zq3ns2557mw7gqr2rx63mfjvq6-ceph-18.2.0-lib/lib/ceph/mgr
mon host=10.0.40.40
mon initial members=bay
mon_cluster_log_file=/dev/null
rgw mime types file=/nix/store/q723yrnx2nkwz3a0f7i5yb9pzj942cf8-mailcap-2.1.53/etc/mime.types

[mds]
host=bay

[osd]
osd crush chooseleaf type=0
osd journal size=10000
osd pool default min size=2
osd pool default pg num=200
osd pool default pgp num=200
osd pool default size=3
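(The OSD-level replication here comes from the "osd crush chooseleaf type=0" line above. For reference, a sketch of an equivalent CLI change, with an illustrative rule name, would be something like:)

# Create a replicated rule that picks OSDs instead of hosts, then point the pools at it
bay$ sudo ceph osd crush rule create-replicated replicated-osd default osd
bay$ sudo ceph osd pool set cephfs_data crush_rule replicated-osd
bay$ sudo ceph osd pool set cephfs_metadata crush_rule replicated-osd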
Also, I deployed Ceph manually, so I may have made some mistakes along the way. I can provide more information if you are unable to reproduce it, and I can also test patches for the kernel or Ceph, as we build both from source.
Updated by Venky Shankar 8 months ago
- Project changed from CephFS to Linux kernel client
- Category changed from Correctness/Safety to fs/ceph
- Assignee set to Xiubo Li
Updated by Venky Shankar 8 months ago
Looks like a kclient thing. Handing it over to Xiubo for analysis.
Updated by Xiubo Li 8 months ago
- Status changed from New to In Progress
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
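(For reference, that output comes from the kernel client's debugfs; assuming debugfs is mounted on the client, something along these lines:)

hut# cat /sys/kernel/debug/ceph/*/caps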
Thanks
Updated by Rodrigo Arias 8 months ago
- File config added
- File ceph-build.log added
Xiubo Li wrote:
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
I attach our kernel config and the build log of ceph.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
Yes, I forgot to add it before:
total           27
avail           24
used            3
reserved        0
min             1024

ino              mds  issued            implemented
--------------------------------------------------
0x1                0  pAsLsXsFs         pAsLsXsFs
0x10000000000      0  pAsLsXsFsx        pAsLsXsFsx
0x10000000005      0  pAsxLsXsxFsxcrwb  pAsxLsXsxFsxcrwb

Waiters:
--------
tgid     ino              need  want
-----------------------------------------------------
1483674  0x10000000005    Fw    Fb
Updated by Rodrigo Arias 8 months ago
As I commented in #62664, with the ceph-fuse client (after some tweaking) I was able to run some I/O benchmarks and it seems to work okay. So I suspect this is an interaction between the kernel module and the MDS server.
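(For reference, the ceph-fuse mount was along these lines; a sketch, the exact options used may have differed:)

hut% sudo ceph-fuse -n client.user --client_fs cephfs /ceph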
If you are unable to reproduce it, we can consider giving you access to the machine so you can debug it there (contact me by email at rodrigo.arias@bsc.es to discuss the details). On our cluster we can reproduce it 100% of the time with dd.
In fact, it always gets stuck after writing 4 MiB:
hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=2M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=4M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=4097k count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4195328
Updated by Xiubo Li 8 months ago
Rodrigo Arias wrote:
Xiubo Li wrote:
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
I attach our kernel config and the build log of ceph.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
Yes, I forgot to add it before:
[...]
Okay, it seems we miss the wakeup in some corner cases. Let me check.
Thanks
Updated by Xiubo Li 8 months ago
When writing to the file, max_size was reached, and the client then issued requests to the MDS:
2023-08-28T08:45:35,769295+02:00 ceph: i_size approaching max_size
2023-08-28T08:45:35,769299+02:00 ceph: __mark_caps_flushing flushing Fw, flushing_caps - -> Fw
2023-08-28T08:45:35,769302+02:00 ceph: inode 000000001380e341 now !dirty
2023-08-28T08:45:35,769307+02:00 ceph: __prep_cap 000000001380e341 cap 00000000823d5d94 session 000000005bcf35b0 pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb (revoking -)
2023-08-28T08:45:35,769319+02:00 ceph: encode_cap_msg update 1 10000000006 caps pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb dirty Fw seq 52/50 tid 86/86 mseq 0 follows 1 size 341835776/167772160 xattr_ver 0 xattr_len 0
2023-08-28T08:45:35,769337+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:35,769341+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcwb dirty - flushing Fw issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
...
2023-08-28T08:45:38,614115+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614119+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614123+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614128+02:00 ceph: get_cap_refs 000000001380e341 ret -27 got -
2023-08-28T08:45:38,614132+02:00 ceph: write 000000001380e341 at large endoff 704643072, req max_size
2023-08-28T08:45:38,614136+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614140+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
2023-08-28T08:45:38,614146+02:00 ceph: mds0 cap 00000000823d5d94 used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
2023-08-28T08:45:38,614150+02:00 ceph: requesting new max_size
2023-08-28T08:45:38,614154+02:00 ceph: __mark_caps_flushing flushing Fw, flushing_caps - -> Fw
2023-08-28T08:45:38,614157+02:00 ceph: inode 000000001380e341 now !dirty
2023-08-28T08:45:38,614161+02:00 ceph: __prep_cap 000000001380e341 cap 00000000823d5d94 session 000000005bcf35b0 pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb (revoking -)
2023-08-28T08:45:38,614170+02:00 ceph: encode_cap_msg update 1 10000000006 caps pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb dirty Fw seq 52/50 tid 90/90 mseq 0 follows 1 size 671088640/704643072 xattr_ver 0 xattr_len 0
2023-08-28T08:45:38,614184+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614188+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcb dirty - flushing Fw issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
2023-08-28T08:45:38,614193+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614196+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614200+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614204+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
2023-08-28T08:45:38,614208+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614211+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614214+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614217+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
2023-08-28T08:45:38,616047+02:00 ceph: handle_caps from mds0
...
2023-08-28T08:46:58,532590+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:46:58,532603+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:46:58,532609+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:46:58,532615+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
...
But in the dmesg logs I didn't see any handle_cap_grant reply for that. Since there are no MDS debug logs and not enough kclient debug logs, I don't know what happened.
Could you enable the MDS debug logs (debug_mds = 25 and debug_ms = 1), and also the whole dynamic debug logs for the ceph.ko module?
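(Something along these lines should do it; a sketch, adjust to your daemon and host names:)

# On a node with admin credentials, raise the MDS debug levels
bay$ sudo ceph config set mds debug_mds 25
bay$ sudo ceph config set mds debug_ms 1
# On the client, enable dynamic debug for the whole ceph module
hut# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control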
Thanks
- Xiubo
Updated by Xiubo Li 8 months ago
- Status changed from In Progress to Rejected
We talked via email; I will paste the MDS logs here:
I repeated the dd test (writing zeroes with 1MB blocks) after enabling all the debug_mds* settings to level 20. This time it allowed more writes to succeed:

hut% strace -tt -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=100
08:09:45.175365 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.177158 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.178336 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.179479 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.180626 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.181789 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.182952 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.184096 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.185245 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.186459 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.187695 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.188943 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.190183 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.191431 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.192745 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.194004 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.195257 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.196507 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.197761 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.199023 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.200268 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.201577 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.202825 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.204071 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.205315 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.206588 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.207825 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.209116 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.210604 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576^C) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 75920 detached
29+0 records in
28+0 records out
29360128 bytes (29 MB, 28 MiB) copied, 50,0488 s, 587 kB/s

I attach the whole log from the journal. These lines seem to be relevant:

Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 mds.0.locker _do_cap_update dirty Fw issued pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb on [inode 0x1000000000b [2,head] /rarias/kk auth v1432 DIRTYPARENT s=0 n(v0 rc2023-09-04T08:09:45.166826+0200 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={494177=0-29360128@1} caps={494177=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@9},l=494177 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=0 0x7fdcb9ee8c00]
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.locker inode is file
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.locker client has write caps; m->get_max_size=5242880; old_max=29360128
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.5743 get_session have 0x7fdcaec15f00 client.494177 v1:10.0.40.7:0/1970188192 state open
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 Session check_access path /rarias/kk
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 mds.0.locker check_access failed, dropping cap update on [inode 0x1000000000b [2,head] /rarias/kk auth v1432 DIRTYPARENT s=0 n(v0 rc2023-09-04T08:09:45.166826+0200 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={494177=0-29360128@1} caps={494177=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@9},l=494177 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=0 0x7fdcb9ee8c00]
root_squash was enabled and the caller was 0:0, the root user. check_access then rejected it, so the MDS dropped the cap update.
This is not a bug.
Updated by Rodrigo Arias 8 months ago
root_squash was enabled and the caller was 0:0, the root user. check_access then rejected it, so the MDS dropped the cap update.
This is not a bug.
I issued the dd command as my non-root user (rarias) against the /ceph mountpoint, which is mounted with the "user" credentials that have root_squash enabled:
bay% sudo ceph auth list
client.foo
        key: ...
        caps: [mds] allow rw fsname=cephfs
        caps: [mon] allow r fsname=cephfs
        caps: [osd] allow rw tag cephfs data=cephfs
client.user
        key: ...
        caps: [mds] allow rw fsname=cephfs root_squash
        caps: [mon] allow r fsname=cephfs
        caps: [osd] allow rw tag cephfs data=cephfs
...

hut% sudo mount -t ceph user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /ceph
hut% id
uid=1880(rarias) gid=564(Computational) groups=564(Computational),1(wheel)
hut% strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=$((32*1024)) status=progress
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576   <-- stuck there
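(For reference, MDS caps with root_squash like client.user's above can be created with something along these lines; a sketch, syntax as I understand it:)

bay% sudo ceph fs authorize cephfs client.user / rw root_squash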
When I try it as root, it properly reports a permission-denied error, as expected:
sudo dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=$((32*1024)) status=progress
dd: failed to open '/ceph/rarias/kk': Permission denied
It shouldn't hang with root_squash while using my non-root user.
However, when mounted using the foo keyring without root_squash, it writes properly with my user and root:
hut% sudo mount -t ceph foo@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /foo
hut% dd if=/dev/urandom of=/foo/rarias/kk bs=1M count=2048 status=progress
2142240768 bytes (2,1 GB, 2,0 GiB) copied, 27 s, 79,3 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 27,0711 s, 79,3 MB/s
hut% sudo dd if=/dev/urandom of=/foo/rarias/kk bs=1M count=2048 status=progress
2071986176 bytes (2,1 GB, 1,9 GiB) copied, 27 s, 76,7 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 27,996 s, 76,7 MB/s
But then the /ceph mountpoint also starts behaving properly, even after unmounting /foo:
hut% dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
2068840448 bytes (2,1 GB, 1,9 GiB) copied, 26 s, 79,5 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 26,9828 s, 79,6 MB/s
hut% sudo umount /foo
hut% dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
2034237440 bytes (2,0 GB, 1,9 GiB) copied, 17 s, 120 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 17,9453 s, 120 MB/s
So I think the problem is that those credentials shouldn't be 0:0 at the beginning.
If I reboot the client node, and then try again with dd, it hangs:
hut% strace -f -e trace=write dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
...
write(1, "\26\203m\213\356T\215\330\375uz\301_\315$\353HJ\"\203\365\216{ln\315\231F\230\177\1\252"..., 1048576) = 1048576
write(1, "i\303\272\203\364\357X\215\252\235E\247\307\23 2K*\344\215\363\2\325\260\212\214\275Z\254\226|\367"..., 1048576) = 1048576
write(1, "m\240mhe|\323\314U\250\177\330m[\375\36I\326\230\25\0035\315z\233t\310i\315(\336e"..., 1048576) = 1048576
write(1, "\347y\374\220J\322_\333\274\372\25\224\335\v-}\322\274\365%!@\206\307&6\335\254b2\21i"..., 1048576) = 1048576
write(1, "\35/\360\256\252\246\224\rMU\333\251[9\225\25\266A\253\371\261\240l\177\250\213\4\37Sx/\202"..., 1048576) = 1048576
write(1, "~\232\351A\325\365\345\241X\5{\323Z!\3376\277\23\304\363\374\5\6\3008<8\24\f%\225\276"..., 1048576) = 1048576
write(1, ";\271\320^\224\33\177\223sn\roVqI;\346\2\302\313\261\37\276\247\5\202e\tB^\r\0"..., 1048576
And here are the is_capable calls:
bay$ sudo journalctl -f | grep MDSAuthCap
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.185+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.186+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.319+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.382+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
...
Now, if I mount the /foo filesystem again, without root_squash, while keeping dd stuck in the write:
hut% sudo mount -t ceph foo@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /foo
hut% md5sum /foo/rarias/kk
d41d8cd98f00b204e9800998ecf8427e  /foo/rarias/kk
I can see more entries in the journal, with mixed "caller 0:0" and "caller 1880:564" lines:
SAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.815+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.815+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.817+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.916+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.918+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.946+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.947+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:50 bay ceph-mds[8761]: 2023-09-04T10:53:50.975+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:53 bay ceph-mds[8761]: 2023-09-04T10:53:53.496+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:54 bay ceph-mds[8761]: 2023-09-04T10:53:54.657+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
And dd continues to write a bit more, until it hangs again:
write(2, "346030080 bytes (346 MB, 330 MiB"..., 57346030080 bytes (346 MB, 330 MiB) copied, 302 s, 1,1 MB/s) = 57
write(1, "\350\374\332\261~\246M\315\301\347Q1\267o\207n\316C\32\232\257\220\227N\305-M\244\21&\321\10"..., 1048576) = 1048576
write(1, "\377\303\\\223\324\246\200\232;\371\30T[\355\226\265\33}\7y\3\313H\t7\16\37\202\346\247_\347"..., 1048576) = 1048576
write(1, "\233\363\230E\325\356DS\24\305\2x\204\341\365\274\311.y\203x\347\25\30\35\372\203\264\355\22\365\0"..., 1048576) = 1048576
write(1, "UO\252\271\333w\226\0\365S\fL\307\304\307;\346\300u\223\273o\376\2328\21\375B^U\225\315"..., 1048576) = 1048576
write(1, "\372\247(\30\363L\0\260\272\333\17O\216ZW\323\302\216kFu\360S\266b^{\337m\273\227\205"..., 1048576) = 1048576
write(1, "oi=F$\213\231.\373D-Z^\341\231\33\354\374=\2\311B:\226\2G\vb\27.\240\226"..., 1048576) = 1048576
write(1, "1\347\373\5\243\2607\351\323o\tu\0327\273\326\315v\234T\266\356N\245\371B\30 \315\351\32\r"..., 1048576) = 1048576
write(1, "uk\251<`\346\23\372\376H\3\30\5&#\272\201\262\256[e\342v\214\312\212\217\313\342R\177\204"..., 1048576) = 1048576
write(1, "\355\364\17}=\366\22%\242\270|F\376\221M\374\313T\316Z\357\217\235\254(\350,?\344\301k\200"..., 1048576) = 1048576
write(1, "\21\262\260\362(\21\365\352\244\0M\27\204\307J\325@\240C\232\250\322%{K\224d\25\331\310\31\332"..., 1048576) = 1048576
write(1, "|\277\325eP\240\3\n@Xc\223j\264h\276\343&e\177\37\356a\\|I\"k\262d\257\246"..., 1048576) = 1048576
write(1, "\353y-\336jT\211z\230\232$_\334'\f0\f\343\337\336\363|\201\231J\342\337\2019\336\334i"..., 1048576) = 1048576
write(1, "\"\247qh\326\362\256r\303\236\3265\251\n\332(0'`,\323\241Q\353!sT\334\243\7\311X"..., 1048576) = 1048576
...
From my side, I can keep it mounted without root_squash, but I think this behavior is not correct.
I will reboot the client and restart ceph.target on the other nodes. Then I will enable the kernel debug and the MDS debug level, run the dd command again, and upload the logs.
Updated by Rodrigo Arias 8 months ago
- File ceph.dmesg.log added
- File ceph.mds.log.gz added
I attach the logs. I followed these steps:
# /ceph gets mounted on boot, so unmount
hut% sudo umount /ceph

# Enable kernel client debug
hut% echo 'file fs/ceph/* +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Find which mds is operating
bay$ sudo systemctl status ceph-mds-mds\* | grep standby
Sep 04 11:08:55 bay ceph-mds[13741]: 2023-09-04T11:08:55.954+0200 7fa9d85fe6c0 1 mds.mds1 Monitors have assigned me to become a standby.

# mds1 is standby, so use mds0
bay$ sudo ceph daemon mds.mds0 config set debug_mds 25
{
    "success": ""
}
bay$ sudo ceph daemon mds.mds0 config set debug_ms 1
{
    "success": ""
}

# Mount again
hut% sudo mount -t ceph user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /ceph

# Ensure we are not root
hut% id
uid=1880(rarias) gid=564(Computational) groups=564(Computational),1(wheel)

# Run the dd command under strace
hut% strace -f -e trace=write dd if=/dev/urandom of=/ceph/rarias/test1 bs=1M count=2048 status=progress
write(1, "E\343>\310m\22\37\247}\376\374\351\271E\314s\36d\23210\251\347\252\323\253\253i\270W,X"..., 1048576) = 1048576
write(1, "\365\16N\222WC\342z\305\203|\177D\336q\30g\357\312\264\217i\247\204\302\304\355\273?6\312w"..., 1048576) = 1048576
write(1, "\365\311\2676\vK\273%\30\243\207\276\5+\255c\305`\246\224\1 \200\177\30Pc\25191\177\276"..., 1048576) = 1048576
write(1, "\305\222\207\242\331\337m&5H\226G9+\215\345}e\204=;\347\33\267Pa\254\212\rRA\31"..., 1048576) = 1048576
write(1, "C\336L\203\213\312\r\34Z;\207\177\345\3535\212\205\305\2\352om\20\245\274X\303\330\25\320\362r"..., 1048576

# Collecting logs now
hut% sudo journalctl --since 11:20:31 > ceph.dmesg.log
bay$ sudo journalctl --since 11:08:00 > ceph.mds.log

# The mds log is too large, so compress
% gzip ceph.mds.log
Updated by Xiubo Li 8 months ago
Please check https://tracker.ceph.com/issues/57154 and https://tracker.ceph.com/issues/56067; they should be the same issue.
Please test https://github.com/ceph/ceph/pull/48027.