Bug #62604: write() hangs forever in ceph_get_caps
Related issue: CephFS - Bug #56067: Cephfs data loss with root_squash enabled (closed)
Description
Hi,
While doing some tests after setting up a Ceph cluster with 2 nodes on NixOS, each with 4 OSDs, I observe dd hanging while writing to a large file. I'm using the kernel cephfs module from another client node, and I can reproduce it on every invocation.
Here is the command I'm using:
hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress
...
08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55
08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432
08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached
 <detached ...>
21+0 records in
20+0 records out
671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s
I left it hanging for one day. Here is the stack of the blocked process:
hut# cat /proc/1402180/stack
[<0>] wait_woken+0x54/0x70
[<0>] ceph_get_caps+0x4b3/0x6f0 [ceph]
[<0>] ceph_write_iter+0x316/0xdc0 [ceph]
[<0>] vfs_write+0x22e/0x3f0
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x3e/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1
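If useful, I can also dump the kernel client's cap and request state from debugfs while it hangs (a sketch, assuming debugfs is mounted; the per-mount directory is named <fsid>.client<id>):
hut# cat /sys/kernel/debug/ceph/*/caps   # caps held and cap waiters
hut# cat /sys/kernel/debug/ceph/*/mdsc   # in-flight MDS requests
hut# cat /sys/kernel/debug/ceph/*/osdc   # in-flight OSD requests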
I tested with Ceph 17.2.5 and 18.2.0, both times with Linux kernel 6.4.11 on all nodes.
This looks similar to https://tracker.ceph.com/issues/54044, but if it is the same problem, it does not seem to be fixed in 6.4.11.
The cluster status remains HEALTH_OK during the hang, so I suspect the problem is on the client side.
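To see whether the MDS is stuck revoking caps or sitting on a blocked operation, I can also query the active MDS during the hang (a sketch; mds1 is the active daemon name from the fs dump below):
bay$ sudo ceph tell mds.mds1 dump_blocked_ops
bay$ sudo ceph tell mds.mds1 ops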
I enabled dynamic debug for the caps code of the ceph kernel module with:
hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control
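If more verbosity helps, I can instrument the rest of the module the same way, for example the MDS client code or the whole module (a sketch using the same dynamic debug interface):
hut# echo 'file fs/ceph/mds_client.c +p' > /sys/kernel/debug/dynamic_debug/control
hut# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control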
I attach what I see in dmesg (AFAIK they use different clocks, so the timestamps don't match the strace ones exactly).
Here is the status:
bay$ sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 9h)
    mgr: bay(active, since 9h)
    mds: 1/1 daemons up, 1 standby
    osd: 8 osds: 8 up (since 9h), 8 in (since 2d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 545 pgs
    objects: 1.98k objects, 7.6 GiB
    usage:   24 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     545 active+clean
Here is the fs:
bay$ sudo ceph fs dump
e3468
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch   3468
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-08-02T11:55:26.535585+0200
modified        2023-08-27T23:51:49.313474+0200
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  1010
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=244132}
failed
damaged
stopped
data_pools      [5]
metadata_pool   6
inline_data     disabled
balancer
bal_rank_mask   -1
standby_count_wanted    1
[mds.mds1{0:244132} state up:active seq 5 addr [v2:10.0.40.40:6802/2653564626,v1:10.0.40.40:6803/2653564626] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds0{-1:254107} state up:standby seq 1 addr [v2:10.0.40.40:6800/1345562638,v1:10.0.40.40:6801/1345562638] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 3468
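If the per-client session state on the MDS would help (for example the caps count of the hung client), I can dump it with something like (sketch):
bay$ sudo ceph tell mds.mds1 session ls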
Pools:
bay$ sudo ceph osd pool ls detail
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 119 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 7.89
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 131 lfor 0/0/94 flags hashpspool stripe_width 0 application rgw read_balance_score 3.50
pool 5 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 926 lfor 0/0/751 flags hashpspool,bulk stripe_width 0 target_size_ratio 0.8 application cephfs read_balance_score 2.00
pool 6 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 938 lfor 0/0/936 flags hashpspool,bulk stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 target_size_ratio 0.2 application cephfs read_balance_score 1.78
I modified the CRUSH rule to replicate among OSDs instead of nodes, as I don't have enough nodes yet for a pool size of 3. Here is my ceph config:
bay$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required=cephx
auth cluster required=cephx
auth service required=cephx
cluster name=ceph
cluster network=10.0.40.40/24
err_to_stderr=true
fsid=9c8d06e0-485f-4aaf-b16b-06d6daf1232b
log_file=/dev/null
log_to_file=false
log_to_stderr=true
max open files=131072
mgr module path=/nix/store/c7xa89zq3ns2557mw7gqr2rx63mfjvq6-ceph-18.2.0-lib/lib/ceph/mgr
mon host=10.0.40.40
mon initial members=bay
mon_cluster_log_file=/dev/null
rgw mime types file=/nix/store/q723yrnx2nkwz3a0f7i5yb9pzj942cf8-mailcap-2.1.53/etc/mime.types

[mds]
host=bay

[osd]
osd crush chooseleaf type=0
osd journal size=10000
osd pool default min size=2
osd pool default pg num=200
osd pool default pgp num=200
osd pool default size=3
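The OSD-level replication corresponds to the "osd crush chooseleaf type=0" setting above; on a live cluster an equivalent change would look roughly like this (a sketch; the rule name "rep-osd" is illustrative):
bay$ sudo ceph osd crush rule create-replicated rep-osd default osd
bay$ sudo ceph osd pool set cephfs_data crush_rule rep-osd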
Also, I deployed Ceph manually, so I may have made some mistakes along the way. I can provide more information if you are unable to reproduce it, and I can also test patches in the kernel or in Ceph, as we build both from source.