Bug #62604 (closed): write() hangs forever in ceph_get_caps

Duplicate of: CephFS - Bug #56067: Cephfs data loss with root_squash enabled

Added by Rodrigo Arias 9 months ago. Updated 8 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hi,

While doing some tests after setting up a Ceph cluster with 2 nodes on NixOS, each with 4 OSDs, I observe dd hanging while writing to a large file. I'm using the kernel cephfs module from another client node and I can reproduce the hang on every invocation.

Here is the command I'm using:

hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress
...
08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55
08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432
08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached
 <detached ...>
21+0 records in
20+0 records out
671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s

I left it hanging for one day. Here is the stack of the blocked process:

hut# cat /proc/1402180/stack
[<0>] wait_woken+0x54/0x70
[<0>] ceph_get_caps+0x4b3/0x6f0 [ceph]
[<0>] ceph_write_iter+0x316/0xdc0 [ceph]
[<0>] vfs_write+0x22e/0x3f0
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x3e/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1
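
The stack suggests the writer is sleeping in ceph_get_caps(), waiting for the MDS to grant the capabilities needed to continue the write. If it helps, I can also collect the kernel client's debugfs state while it is stuck; this is roughly what I would dump (assuming debugfs is mounted at /sys/kernel/debug and there is a single cephfs mount, so the wildcard matches one <fsid>.client<id> directory):

hut# cat /sys/kernel/debug/ceph/*/caps   # caps held and waited for, per inode
hut# cat /sys/kernel/debug/ceph/*/mdsc   # MDS requests still in flight
hut# cat /sys/kernel/debug/ceph/*/osdc   # OSD requests still in flight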

I tested it with Ceph 17.2.5 and 18.2.0, both with Linux kernel 6.4.11 on all nodes.

This is similar to https://tracker.ceph.com/issues/54044, but if it is the same problem, it does not seem to be fixed in 6.4.11.

The cluster status stays HEALTH_OK during the hang, so I suspect the problem is on the client side.

I enabled dynamic debug for caps.c in the ceph kernel module with:

hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control

I attach what I see in dmesg (AFAIK the clocks differ, so the timestamps don't match the strace output exactly).
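
If more verbose traces would help, I can widen the dynamic debug selection beyond caps.c; a sketch using the standard dynamic_debug syntax (again assuming debugfs at /sys/kernel/debug) would be:

hut# echo 'file fs/ceph/mds_client.c +p' > /sys/kernel/debug/dynamic_debug/control
hut# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control   # everything in fs/ceph
hut# echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control   # switch it off again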

Here is the status:

bay$ sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 9h)
    mgr: bay(active, since 9h)
    mds: 1/1 daemons up, 1 standby
    osd: 8 osds: 8 up (since 9h), 8 in (since 2d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 545 pgs
    objects: 1.98k objects, 7.6 GiB
    usage:   24 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     545 active+clean
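
Even though the status is HEALTH_OK, I can also watch the server side while the write hangs to see if anything is flagged or whether client I/O simply stops; a sketch of what I would check (cephfs_data is the data pool shown below):

bay$ sudo ceph health detail                 # in case anything is reported beyond HEALTH_OK
bay$ sudo ceph osd pool stats cephfs_data    # client I/O on the data pool should drop to zero during the hang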

Here is the fs:

bay$ sudo ceph fs dump
e3468
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch   3468
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-08-02T11:55:26.535585+0200
modified        2023-08-27T23:51:49.313474+0200
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  1010
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=244132}
failed
damaged
stopped
data_pools      [5]
metadata_pool   6
inline_data     disabled
balancer
bal_rank_mask   -1
standby_count_wanted    1
[mds.mds1{0:244132} state up:active seq 5 addr [v2:10.0.40.40:6802/2653564626,v1:10.0.40.40:6803/2653564626] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds0{-1:254107} state up:standby seq 1 addr [v2:10.0.40.40:6800/1345562638,v1:10.0.40.40:6801/1345562638] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 3468
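
In case the MDS side matters after all, I can also dump its view of the client session and any stuck requests; a sketch against the active daemon (mds1 above), assuming I got the tell syntax right:

bay$ sudo ceph tell mds.mds1 session ls            # client sessions and the caps they hold
bay$ sudo ceph tell mds.mds1 dump_ops_in_flight    # MDS requests that have not completed
bay$ sudo ceph tell mds.mds1 dump_blocked_ops      # requests blocked for a long time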

Pools:

bay$ sudo ceph osd pool ls detail
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 119 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 7.89
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 131 lfor 0/0/94 flags hashpspool stripe_width 0 application rgw read_balance_score 3.50
pool 5 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 926 lfor 0/0/751 flags hashpspool,bulk stripe_width 0 target_size_ratio 0.8 application cephfs read_balance_score 2.00
pool 6 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 938 lfor 0/0/936 flags hashpspool,bulk stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 target_size_ratio 0.2 application cephfs read_balance_score 1.78

I modified the CRUSH rule to replicate across OSDs instead of hosts, as I don't yet have enough nodes for a pool size of 3 (the change is sketched after the config below). Here is my ceph config:

bay$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required=cephx
auth cluster required=cephx
auth service required=cephx
cluster name=ceph
cluster network=10.0.40.40/24
err_to_stderr=true
fsid=9c8d06e0-485f-4aaf-b16b-06d6daf1232b
log_file=/dev/null
log_to_file=false
log_to_stderr=true
max open files=131072
mgr module path=/nix/store/c7xa89zq3ns2557mw7gqr2rx63mfjvq6-ceph-18.2.0-lib/lib/ceph/mgr
mon host=10.0.40.40
mon initial members=bay
mon_cluster_log_file=/dev/null
rgw mime types file=/nix/store/q723yrnx2nkwz3a0f7i5yb9pzj942cf8-mailcap-2.1.53/etc/mime.types

[mds]
host=bay

[osd]
osd crush chooseleaf type=0
osd journal size=10000
osd pool default min size=2
osd pool default pg num=200
osd pool default pgp num=200
osd pool default size=3
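
For reference, this is roughly how I switched the replication failure domain from host to OSD; the rule name replicated_osd is just what I called it, so take the exact commands as a sketch:

bay$ sudo ceph osd crush rule create-replicated replicated_osd default osd
bay$ sudo ceph osd pool set cephfs_data crush_rule replicated_osd
bay$ sudo ceph osd pool set cephfs_metadata crush_rule replicated_osd

The osd crush chooseleaf type=0 setting in the config above has a similar effect for the default rule created at cluster setup.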

Also, I deployed Ceph manually, so I may have made some mistakes along the way. I can provide more information if you are unable to reproduce it, and I can also test patches in the kernel or Ceph, as we build both from source.


Files

dmesg.log (116 KB) - Rodrigo Arias, 08/28/2023 08:00 AM
config (261 KB) - Kernel configuration - Rodrigo Arias, 08/30/2023 06:54 AM
ceph-build.log (896 KB) - Ceph build log - Rodrigo Arias, 08/30/2023 06:56 AM
ceph.dmesg.log (662 KB) - Rodrigo Arias, 09/04/2023 09:31 AM
ceph.mds.log.gz (381 KB) - Rodrigo Arias, 09/04/2023 09:33 AM