Bug #62604: write() hangs forever in ceph_get_caps
Related to: CephFS - Bug #56067: Cephfs data loss with root_squash enabled (closed)
Description
Hi,
While doing some tests after setting up a Ceph cluster with 2 nodes on NixOS, each with 4 OSDs, I observe dd hanging while writing to a large file. I'm using the kernel cephfs module from another client node, and I can reproduce it on every invocation.
Here is the command I'm using:
hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress
...
08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55
08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432
08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached
 <detached ...>
21+0 records in
20+0 records out
671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s
I left it hanging for one day. Here is the stack:
hut# cat /proc/1402180/stack
[<0>] wait_woken+0x54/0x70
[<0>] ceph_get_caps+0x4b3/0x6f0 [ceph]
[<0>] ceph_write_iter+0x316/0xdc0 [ceph]
[<0>] vfs_write+0x22e/0x3f0
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x3e/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1
I tested with Ceph 17.2.5 and 18.2.0, both times on Linux kernel 6.4.11 on all nodes.
This is similar to https://tracker.ceph.com/issues/54044, but if it is the same problem, it does not seem to be fixed in 6.4.11.
The cluster status stays HEALTH_OK during the hang, so I suspect the problem is on the client side.
I enabled the dynamic debug for the caps file in the ceph kernel module with:
hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control
I attach what I see in dmesg (AFAIK dmesg and strace use different clocks, so the timestamps don't match exactly).
Here is the status:
bay$ sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 9h)
    mgr: bay(active, since 9h)
    mds: 1/1 daemons up, 1 standby
    osd: 8 osds: 8 up (since 9h), 8 in (since 2d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 545 pgs
    objects: 1.98k objects, 7.6 GiB
    usage:   24 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     545 active+clean
Here is the fs:
bay$ sudo ceph fs dump
e3468
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch   3468
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-08-02T11:55:26.535585+0200
modified        2023-08-27T23:51:49.313474+0200
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  1010
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=244132}
failed
damaged
stopped
data_pools      [5]
metadata_pool   6
inline_data     disabled
balancer
bal_rank_mask   -1
standby_count_wanted    1
[mds.mds1{0:244132} state up:active seq 5 addr [v2:10.0.40.40:6802/2653564626,v1:10.0.40.40:6803/2653564626] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds0{-1:254107} state up:standby seq 1 addr [v2:10.0.40.40:6800/1345562638,v1:10.0.40.40:6801/1345562638] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 3468
Pools:
bay$ sudo ceph osd pool ls detail
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 119 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 7.89
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 131 lfor 0/0/94 flags hashpspool stripe_width 0 application rgw read_balance_score 3.50
pool 5 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 926 lfor 0/0/751 flags hashpspool,bulk stripe_width 0 target_size_ratio 0.8 application cephfs read_balance_score 2.00
pool 6 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 938 lfor 0/0/936 flags hashpspool,bulk stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 target_size_ratio 0.2 application cephfs read_balance_score 1.78
I modified the CRUSH rule to replicate among OSDs instead of nodes, as I don't have enough nodes yet for a pool size of 3. Here is my ceph config:
bay$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required=cephx
auth cluster required=cephx
auth service required=cephx
cluster name=ceph
cluster network=10.0.40.40/24
err_to_stderr=true
fsid=9c8d06e0-485f-4aaf-b16b-06d6daf1232b
log_file=/dev/null
log_to_file=false
log_to_stderr=true
max open files=131072
mgr module path=/nix/store/c7xa89zq3ns2557mw7gqr2rx63mfjvq6-ceph-18.2.0-lib/lib/ceph/mgr
mon host=10.0.40.40
mon initial members=bay
mon_cluster_log_file=/dev/null
rgw mime types file=/nix/store/q723yrnx2nkwz3a0f7i5yb9pzj942cf8-mailcap-2.1.53/etc/mime.types

[mds]
host=bay

[osd]
osd crush chooseleaf type=0
osd journal size=10000
osd pool default min size=2
osd pool default pg num=200
osd pool default pgp num=200
osd pool default size=3
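(The OSD-level replication here comes from the "osd crush chooseleaf type=0" line above. For reference, a sketch of an equivalent CLI change, with an illustrative rule name, would be something like:)

# Create a replicated rule that picks OSDs instead of hosts, then point the pools at it
bay$ sudo ceph osd crush rule create-replicated replicated-osd default osd
bay$ sudo ceph osd pool set cephfs_data crush_rule replicated-osd
bay$ sudo ceph osd pool set cephfs_metadata crush_rule replicated-osd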
Also, I deployed Ceph manually, so I may have made some mistakes along the way. I can provide more information if you are unable to reproduce it, and I can also test patches for the kernel or Ceph, as we build both from source.
Updated by Venky Shankar 8 months ago
- Project changed from CephFS to Linux kernel client
- Category changed from Correctness/Safety to fs/ceph
- Assignee set to Xiubo Li
Updated by Venky Shankar 8 months ago
Looks like a kclient thing. Handing it over to Xiubo for analysis.
Updated by Xiubo Li 8 months ago
- Status changed from New to In Progress
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
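(For reference, that output comes from the kernel client's debugfs; assuming debugfs is mounted on the client, something along these lines:)

hut# cat /sys/kernel/debug/ceph/*/caps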
Thanks
Updated by Rodrigo Arias 8 months ago
- File config added
- File ceph-build.log added
Xiubo Li wrote:
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
I attach our kernel config and the build log of ceph.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
Yes, I forgot to add it before:
total           27
avail           24
used            3
reserved        0
min             1024

ino              mds  issued            implemented
--------------------------------------------------
0x1                0  pAsLsXsFs         pAsLsXsFs
0x10000000000      0  pAsLsXsFsx        pAsLsXsFsx
0x10000000005      0  pAsxLsXsxFsxcrwb  pAsxLsXsxFsxcrwb

Waiters:
--------
tgid     ino              need  want
-----------------------------------------------------
1483674  0x10000000005    Fw    Fb
Updated by Rodrigo Arias 8 months ago
As I commented in #62664, with the ceph-fuse client (after some tweaking) I was able to run some I/O benchmarks and it seems to work okay. So I suspect this is an interaction between the kernel module and the MDS server.
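(For reference, the ceph-fuse mount was along these lines; a sketch, the exact options used may have differed:)

hut% sudo ceph-fuse -n client.user --client_fs cephfs /ceph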
If you are unable to reproduce it, we can consider giving you access to the machine so you can debug it there (contact me by email at rodrigo.arias@bsc.es to discuss the details). On our cluster we can reproduce it 100% of the time with dd.
In fact, it always gets stuck after writing 4 MiB:
hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=2M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=4M count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304

hut$ strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=4097k count=100
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4195328
Updated by Xiubo Li 8 months ago
Rodrigo Arias wrote:
Xiubo Li wrote:
6.4.11 already includes that fix, please see https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v6.4.11/fs/ceph/caps.c.
Locally I am trying to reproduce it but have not seen it yet.
I attach our kernel config and the build log of ceph.
Could you reproduce it again and provide the caps output as Jeff did in https://tracker.ceph.com/issues/54044?
Yes, I forgot to add it before:
[...]
Okay, it seems we miss the wakeup in some corner cases. Let me check.
Thanks
Updated by Xiubo Li 8 months ago
When writing to the file, max_size was reached, and the client then issued requests to the MDS:
2023-08-28T08:45:35,769295+02:00 ceph: i_size approaching max_size
2023-08-28T08:45:35,769299+02:00 ceph: __mark_caps_flushing flushing Fw, flushing_caps - -> Fw
2023-08-28T08:45:35,769302+02:00 ceph: inode 000000001380e341 now !dirty
2023-08-28T08:45:35,769307+02:00 ceph: __prep_cap 000000001380e341 cap 00000000823d5d94 session 000000005bcf35b0 pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb (revoking -)
2023-08-28T08:45:35,769319+02:00 ceph: encode_cap_msg update 1 10000000006 caps pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb dirty Fw seq 52/50 tid 86/86 mseq 0 follows 1 size 341835776/167772160 xattr_ver 0 xattr_len 0
2023-08-28T08:45:35,769337+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:35,769341+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcwb dirty - flushing Fw issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
...
2023-08-28T08:45:38,614115+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614119+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614123+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614128+02:00 ceph: get_cap_refs 000000001380e341 ret -27 got -
2023-08-28T08:45:38,614132+02:00 ceph: write 000000001380e341 at large endoff 704643072, req max_size
2023-08-28T08:45:38,614136+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614140+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
2023-08-28T08:45:38,614146+02:00 ceph: mds0 cap 00000000823d5d94 used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
2023-08-28T08:45:38,614150+02:00 ceph: requesting new max_size
2023-08-28T08:45:38,614154+02:00 ceph: __mark_caps_flushing flushing Fw, flushing_caps - -> Fw
2023-08-28T08:45:38,614157+02:00 ceph: inode 000000001380e341 now !dirty
2023-08-28T08:45:38,614161+02:00 ceph: __prep_cap 000000001380e341 cap 00000000823d5d94 session 000000005bcf35b0 pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb (revoking -)
2023-08-28T08:45:38,614170+02:00 ceph: encode_cap_msg update 1 10000000006 caps pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb dirty Fw seq 52/50 tid 90/90 mseq 0 follows 1 size 671088640/704643072 xattr_ver 0 xattr_len 0
2023-08-28T08:45:38,614184+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614188+02:00 ceph: check_caps 10000000006.fffffffffffffffe file_want pAsxXsxFxwb used Fcb dirty - flushing Fw issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl AUTHONLY
2023-08-28T08:45:38,614193+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614196+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614200+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614204+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
2023-08-28T08:45:38,614208+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:45:38,614211+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:45:38,614214+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:45:38,614217+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
2023-08-28T08:45:38,616047+02:00 ceph: handle_caps from mds0
...
2023-08-28T08:46:58,532590+02:00 ceph: get_cap_refs 000000001380e341 need Fw want Fb
2023-08-28T08:46:58,532603+02:00 ceph: __ceph_caps_issued 000000001380e341 cap 00000000823d5d94 issued pAsxLsXsxFsxcrwb
2023-08-28T08:46:58,532609+02:00 ceph: get_cap_refs 000000001380e341 endoff 704643072 > maxsize 683671552
2023-08-28T08:46:58,532615+02:00 ceph: get_cap_refs 000000001380e341 ret 0 got -
...
But in the dmesg logs I didn't see any handle_cap_grant reply for that. Since there are no MDS debug logs and not enough kclient debug logs, I don't know what happened.
Could you enable the MDS debug logs (debug_mds = 25 and debug_ms = 1), and also the whole dynamic debug logs for the ceph.ko module?
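(Something along these lines should do it; a sketch, adjust to your daemon and host names:)

# On a node with admin credentials, raise the MDS debug levels
bay$ sudo ceph config set mds debug_mds 25
bay$ sudo ceph config set mds debug_ms 1
# On the client, enable dynamic debug for the whole ceph module
hut# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control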
Thanks
- Xiubo
Updated by Xiubo Li 8 months ago
- Status changed from In Progress to Rejected
We talked via email; I will paste the MDS logs here:
I repeated the dd test (writing zeroes with 1MB blocks) after enabling all the debug_mds* settings to level 20. This time it allowed more writes to succeed:

hut% strace -tt -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=100
08:09:45.175365 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.177158 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.178336 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.179479 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.180626 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.181789 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.182952 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.184096 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.185245 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.186459 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.187695 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.188943 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.190183 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.191431 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.192745 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.194004 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.195257 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.196507 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.197761 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.199023 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.200268 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.201577 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.202825 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.204071 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.205315 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.206588 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.207825 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.209116 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
08:09:45.210604 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576^C) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 75920 detached
29+0 records in
28+0 records out
29360128 bytes (29 MB, 28 MiB) copied, 50,0488 s, 587 kB/s

I attach the whole log from the journal. These lines seem to be relevant:

Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 mds.0.locker _do_cap_update dirty Fw issued pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb on [inode 0x1000000000b [2,head] /rarias/kk auth v1432 DIRTYPARENT s=0 n(v0 rc2023-09-04T08:09:45.166826+0200 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={494177=0-29360128@1} caps={494177=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@9},l=494177 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=0 0x7fdcb9ee8c00]
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.locker inode is file
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.locker client has write caps; m->get_max_size=5242880; old_max=29360128
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 mds.0.5743 get_session have 0x7fdcaec15f00 client.494177 v1:10.0.40.7:0/1970188192 state open
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 20 Session check_access path /rarias/kk
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 08:09:45 bay ceph-mds[1624]: 2023-09-04T08:09:45.200+0200 7fdcbbbfe6c0 10 mds.0.locker check_access failed, dropping cap update on [inode 0x1000000000b [2,head] /rarias/kk auth v1432 DIRTYPARENT s=0 n(v0 rc2023-09-04T08:09:45.166826+0200 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={494177=0-29360128@1} caps={494177=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@9},l=494177 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=0 0x7fdcb9ee8c00]
root_squash was enabled and the caller was 0:0, the root user. check_access then rejected it, so the MDS dropped the cap update.
This is not a bug.
Updated by Rodrigo Arias 8 months ago
root_squash was enabled and the caller was 0:0, the root user. check_access then rejected it, so the MDS dropped the cap update.
This is not a bug.
I issued the dd command as my non-root user (rarias) against the /ceph mountpoint, which is mounted with the "user" credentials that have root_squash enabled:
bay% sudo ceph auth list
client.foo
        key: ...
        caps: [mds] allow rw fsname=cephfs
        caps: [mon] allow r fsname=cephfs
        caps: [osd] allow rw tag cephfs data=cephfs
client.user
        key: ...
        caps: [mds] allow rw fsname=cephfs root_squash
        caps: [mon] allow r fsname=cephfs
        caps: [osd] allow rw tag cephfs data=cephfs
...

hut% sudo mount -t ceph user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /ceph
hut% id
uid=1880(rarias) gid=564(Computational) groups=564(Computational),1(wheel)
hut% strace -f -e trace=write dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=$((32*1024)) status=progress
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576   <-- stuck there
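(For reference, MDS caps with root_squash like client.user's above can be created with something along these lines; a sketch, syntax as I understand it:)

bay% sudo ceph fs authorize cephfs client.user / rw root_squash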
When I try it as root, it properly reports a permission-denied error, as expected:
sudo dd if=/dev/zero of=/ceph/rarias/kk bs=1M count=$((32*1024)) status=progress
dd: failed to open '/ceph/rarias/kk': Permission denied
It shouldn't hang with root_squash while using my non-root user.
However, when mounted using the foo keyring without root_squash, it writes properly with my user and root:
hut% sudo mount -t ceph foo@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /foo
hut% dd if=/dev/urandom of=/foo/rarias/kk bs=1M count=2048 status=progress
2142240768 bytes (2,1 GB, 2,0 GiB) copied, 27 s, 79,3 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 27,0711 s, 79,3 MB/s
hut% sudo dd if=/dev/urandom of=/foo/rarias/kk bs=1M count=2048 status=progress
2071986176 bytes (2,1 GB, 1,9 GiB) copied, 27 s, 76,7 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 27,996 s, 76,7 MB/s
But then the /ceph mountpoint also starts behaving properly, even after unmounting /foo:
hut% dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
2068840448 bytes (2,1 GB, 1,9 GiB) copied, 26 s, 79,5 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 26,9828 s, 79,6 MB/s
hut% sudo umount /foo
hut% dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
2034237440 bytes (2,0 GB, 1,9 GiB) copied, 17 s, 120 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 17,9453 s, 120 MB/s
So I think the problem is that those credentials shouldn't be 0:0 at the beginning.
If I reboot the client node, and then try again with dd, it hangs:
hut% strace -f -e trace=write dd if=/dev/urandom of=/ceph/rarias/kk bs=1M count=2048 status=progress
...
write(1, "\26\203m\213\356T\215\330\375uz\301_\315$\353HJ\"\203\365\216{ln\315\231F\230\177\1\252"..., 1048576) = 1048576
write(1, "i\303\272\203\364\357X\215\252\235E\247\307\23 2K*\344\215\363\2\325\260\212\214\275Z\254\226|\367"..., 1048576) = 1048576
write(1, "m\240mhe|\323\314U\250\177\330m[\375\36I\326\230\25\0035\315z\233t\310i\315(\336e"..., 1048576) = 1048576
write(1, "\347y\374\220J\322_\333\274\372\25\224\335\v-}\322\274\365%!@\206\307&6\335\254b2\21i"..., 1048576) = 1048576
write(1, "\35/\360\256\252\246\224\rMU\333\251[9\225\25\266A\253\371\261\240l\177\250\213\4\37Sx/\202"..., 1048576) = 1048576
write(1, "~\232\351A\325\365\345\241X\5{\323Z!\3376\277\23\304\363\374\5\6\3008<8\24\f%\225\276"..., 1048576) = 1048576
write(1, ";\271\320^\224\33\177\223sn\roVqI;\346\2\302\313\261\37\276\247\5\202e\tB^\r\0"..., 1048576
And here are the is_capable calls:
bay$ sudo journalctl -f | grep MDSAuthCap
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.185+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.186+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.319+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:48:49 bay ceph-mds[8761]: 2023-09-04T10:48:49.382+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
...
Now, if I mount the /foo filesystem again, without root_squash, while keeping dd stuck in the write:
hut% sudo mount -t ceph foo@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /foo
hut% md5sum /foo/rarias/kk
d41d8cd98f00b204e9800998ecf8427e  /foo/rarias/kk
I can see more entries in the journal, with mixed "caller 0:0" and "caller 1880:564" lines:
SAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.815+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.815+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:44 bay ceph-mds[8761]: 2023-09-04T10:53:44.817+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.916+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.918+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.946+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:45 bay ceph-mds[8761]: 2023-09-04T10:53:45.947+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 1880:564 mask 1 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs]
Sep 04 10:53:50 bay ceph-mds[8761]: 2023-09-04T10:53:50.975+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:53 bay ceph-mds[8761]: 2023-09-04T10:53:53.496+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
Sep 04 10:53:54 bay ceph-mds[8761]: 2023-09-04T10:53:54.657+0200 7fb8559fe6c0 10 MDSAuthCap is_capable inode(path /rarias/kk owner 1880:564 mode 0100644) by caller 0:0 mask 2 new 0:0 cap: MDSAuthCaps[allow rw fsname=cephfs root_squash]
And dd continues to write a bit more, until it hangs again:
write(2, "346030080 bytes (346 MB, 330 MiB"..., 57346030080 bytes (346 MB, 330 MiB) copied, 302 s, 1,1 MB/s) = 57
write(1, "\350\374\332\261~\246M\315\301\347Q1\267o\207n\316C\32\232\257\220\227N\305-M\244\21&\321\10"..., 1048576) = 1048576
write(1, "\377\303\\\223\324\246\200\232;\371\30T[\355\226\265\33}\7y\3\313H\t7\16\37\202\346\247_\347"..., 1048576) = 1048576
write(1, "\233\363\230E\325\356DS\24\305\2x\204\341\365\274\311.y\203x\347\25\30\35\372\203\264\355\22\365\0"..., 1048576) = 1048576
write(1, "UO\252\271\333w\226\0\365S\fL\307\304\307;\346\300u\223\273o\376\2328\21\375B^U\225\315"..., 1048576) = 1048576
write(1, "\372\247(\30\363L\0\260\272\333\17O\216ZW\323\302\216kFu\360S\266b^{\337m\273\227\205"..., 1048576) = 1048576
write(1, "oi=F$\213\231.\373D-Z^\341\231\33\354\374=\2\311B:\226\2G\vb\27.\240\226"..., 1048576) = 1048576
write(1, "1\347\373\5\243\2607\351\323o\tu\0327\273\326\315v\234T\266\356N\245\371B\30 \315\351\32\r"..., 1048576) = 1048576
write(1, "uk\251<`\346\23\372\376H\3\30\5&#\272\201\262\256[e\342v\214\312\212\217\313\342R\177\204"..., 1048576) = 1048576
write(1, "\355\364\17}=\366\22%\242\270|F\376\221M\374\313T\316Z\357\217\235\254(\350,?\344\301k\200"..., 1048576) = 1048576
write(1, "\21\262\260\362(\21\365\352\244\0M\27\204\307J\325@\240C\232\250\322%{K\224d\25\331\310\31\332"..., 1048576) = 1048576
write(1, "|\277\325eP\240\3\n@Xc\223j\264h\276\343&e\177\37\356a\\|I\"k\262d\257\246"..., 1048576) = 1048576
write(1, "\353y-\336jT\211z\230\232$_\334'\f0\f\343\337\336\363|\201\231J\342\337\2019\336\334i"..., 1048576) = 1048576
write(1, "\"\247qh\326\362\256r\303\236\3265\251\n\332(0'`,\323\241Q\353!sT\334\243\7\311X"..., 1048576) = 1048576
...
From my side, I can keep it mounted without root_squash, but I think this behavior is not correct.
I will reboot the client and restart ceph.target on the other nodes. Then I will enable the kernel debug and the MDS debug level, run the dd command again, and upload the logs.
Updated by Rodrigo Arias 8 months ago
- File ceph.dmesg.log added
- File ceph.mds.log.gz added
I attach the logs. I followed these steps:
# /ceph gets mounted on boot, so unmount
hut% sudo umount /ceph

# Enable kernel client debug
hut% echo 'file fs/ceph/* +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Find which mds is operating
bay$ sudo systemctl status ceph-mds-mds\* | grep standby
Sep 04 11:08:55 bay ceph-mds[13741]: 2023-09-04T11:08:55.954+0200 7fa9d85fe6c0 1 mds.mds1 Monitors have assigned me to become a standby.

# mds1 is standby, so use mds0
bay$ sudo ceph daemon mds.mds0 config set debug_mds 25
{
    "success": ""
}
bay$ sudo ceph daemon mds.mds0 config set debug_ms 1
{
    "success": ""
}

# Mount again
hut% sudo mount -t ceph user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/ /ceph

# Ensure we are not root
hut% id
uid=1880(rarias) gid=564(Computational) groups=564(Computational),1(wheel)

# Run the dd command under strace
hut% strace -f -e trace=write dd if=/dev/urandom of=/ceph/rarias/test1 bs=1M count=2048 status=progress
write(1, "E\343>\310m\22\37\247}\376\374\351\271E\314s\36d\23210\251\347\252\323\253\253i\270W,X"..., 1048576) = 1048576
write(1, "\365\16N\222WC\342z\305\203|\177D\336q\30g\357\312\264\217i\247\204\302\304\355\273?6\312w"..., 1048576) = 1048576
write(1, "\365\311\2676\vK\273%\30\243\207\276\5+\255c\305`\246\224\1 \200\177\30Pc\25191\177\276"..., 1048576) = 1048576
write(1, "\305\222\207\242\331\337m&5H\226G9+\215\345}e\204=;\347\33\267Pa\254\212\rRA\31"..., 1048576) = 1048576
write(1, "C\336L\203\213\312\r\34Z;\207\177\345\3535\212\205\305\2\352om\20\245\274X\303\330\25\320\362r"..., 1048576

# Collecting logs now
hut% sudo journalctl --since 11:20:31 > ceph.dmesg.log
bay$ sudo journalctl --since 11:08:00 > ceph.mds.log

# The mds log is too large, so compress
% gzip ceph.mds.log
Updated by Xiubo Li 8 months ago
Please check https://tracker.ceph.com/issues/57154 and https://tracker.ceph.com/issues/56067; they should be the same issue.
Please test https://github.com/ceph/ceph/pull/48027.