Bug #63906

open

Inconsistent file mode across two clients

Added by Tao Lyu 4 months ago. Updated 8 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I'm running ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable), and the client is Kclient (Linux 5.15).

After executing two sequences of syscalls on two clients respectively, the file mode retrieved by an immediate stat syscall on the two clients is different.
This violates the strong consistency model of CephFS, which is illustrated here: https://docs.ceph.com/en/reef/cephfs/posix/?highlight=atomicity#bottom-line

Reproduction steps:
1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Client1: Create a regular file "file0" with syscall(__NR_open, "./file0", 0x40, 0xa);
3. Client2: Open "file0" with syscall(__NR_open, "./file0", 0x80e82, 0x0);
4. Client2: syscall(__NR_chmod, "./file0", 0x0);
5. Client1 and client2 execute the stat syscall concurrently and immediately after step 4. Then you will see file mode 32768 from client1 and 32776 from client2.

One thing to note is that, from step 2 to step 5, there shouldn't be a long time gap. We are using shared memory to communicate among VM nodes for immediate and fast execution.

And if we wait for a while before running stat on both clients, the file mode is consistent.
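
For reference, here is a minimal, self-contained sketch (not part of the attached reproducers) that decodes the two st_mode values reported in step 5: both are regular files, but 32768 carries the post-chmod permission bits 0000, while 32776 still carries the pre-chmod bits 0010 (presumably the create mode 0xa masked by the default umask).

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* st_mode values reported by the two clients in step 5 above */
    unsigned int mode_client1 = 32768;  /* 0100000: S_IFREG, permission bits 0000 (after chmod 0) */
    unsigned int mode_client2 = 32776;  /* 0100010: S_IFREG, permission bits 0010 (stale, pre-chmod) */

    printf("client1: 0%o regular=%d perms=0%03o\n",
           mode_client1, S_ISREG(mode_client1), mode_client1 & 07777);
    printf("client2: 0%o regular=%d perms=0%03o\n",
           mode_client2, S_ISREG(mode_client2), mode_client2 & 07777);
    return 0;
}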


Files

poc.c (1.47 KB) poc.c Tao Lyu, 01/05/2024 09:15 AM
memset.c (598 Bytes) memset.c Tao Lyu, 01/05/2024 09:15 AM
fix.patch (2.26 KB) fix.patch Xiubo Li, 01/10/2024 10:21 AM
Actions #1

Updated by Tao Lyu 4 months ago

And if we wait for a while to stat from both clients, then the file mode is consistent.

Actions #2

Updated by Tao Lyu 4 months ago

Tao Lyu wrote:

Hi,

I'm running ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable), and the client is Kclient (Linux 5.15).

After executing two sequences of syscalls on two clients respectively, the file mode retrieved by an immediate stat syscall on the two clients is different.
This violates the strong consistency model of CephFS, which is illustrated here: https://docs.ceph.com/en/reef/cephfs/posix/?highlight=atomicity#bottom-line

Reproduction steps:
1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Client1: Create a regular file "file0" with syscall(__NR_open, "./file0", 0x40, 0xa);
3. Client2: Open "file0" with syscall(__NR_open, "./file0", 0x80e82, 0x0);
4. Client2: syscall(__NR_chmod, "./file0", 0x0);
5. Client1 and client2 execute the stat syscall concurrently and immediately after step 4. Then you will see file mode 32768 from client1 and 32776 from client2.

One thing to note is that, from step 2 to step 5, there shouldn't be a long time gap. We are using shared memory to communicate among VM nodes for immediate and fast execution.

And if we wait for a while before running stat on both clients, the file mode is consistent.

Actions #3

Updated by Venky Shankar 4 months ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

Xiubo, this could be the same issue you were working on (reported by Frank for 15.x version). PTAL.

Actions #4

Updated by Tao Lyu 4 months ago

Sorry, there is a typo in the reproduction steps.

It should be:

1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Client1: Create a regular file "file0" with syscall(__NR_open, "./file0", 0x40, 0xa);
3. Client2: Open "file0" with syscall(__NR_open, "./file0", 0x80e82, 0x0);
4. Client1: syscall(__NR_chmod, "./file0", 0x0);
5. Client1 and client2 execute the stat syscall concurrently and immediately after step 4. Then you will see file mode 32768 from client1 and 32776 from client2.

Actions #5

Updated by Xiubo Li 4 months ago

Tao Lyu wrote:

Sorry, there is a typo in the reproduction steps.

It should be:

1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Client1: Create a regular file "file0" with syscall(__NR_open, "./file0", 0x40, 0xa);
3. Client2: Open "file0" with syscall(__NR_open, "./file0", 0x80e82, 0x0);
4. Client1: syscall(__NR_chmod, "./file0", 0x0);
5. Client1 and client2 execute the stat syscall concurrently and immediately after step 4. Then you will see file mode 32768 from client1 and 32776 from client2.

BTW, the above looks like strace output. Could you share the test script?

Thanks

Actions #6

Updated by Tao Lyu 4 months ago

Sure.

Firstly, we use IVSHM among VM nodes to synchronize syscall executions on multiple clients. So, you need to specify the arguments during booting.
Here is the command we use in QEMU:

sudo qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3262 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off -netdev bridge,id=hn44 -device virtio-net,netdev=hn44,mac=e6:c8:ff:09:76:42 -hda focal.qcow2 -kernel bzImage -append "root=/dev/sda console=ttyS0 net.ifnames=0 ip=192.168.0.13" -snapshot -object memory-backend-file,size=1M,share=on,mem-path=/dev/shm/shm12,id=shm12 -device ivshmem-plain,memdev=shm12,bus=pci.0,addr=0x10,master=on

Two c files "poc.c" and "memset.c" have been attached.
You can compile them into individual binaries, poc and memset.

Then reproducing steps should be:
1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Upload the memset and poc into two mounted clients.
3. Execute ./memset on both clients
4. On client 1, execute "./poc 1" under the mounted client root dir.
5. On client 2, execute "./poc 2" under the mounted client root dir.

If you repeat reproductions, you need to execute step 3-5 repeatedly instead of step 4-5 only.

Please feel free to contact me if you meet any issues during the reproduction.

Best,
Tao

Actions #7

Updated by Xiubo Li 4 months ago

Tao Lyu wrote:

Sure.

Firstly, we use IVSHM among VM nodes to synchronize syscall executions on multiple clients. So, you need to specify the arguments during booting.
Here is the command we use in QEMU:

sudo qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3262 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off -netdev bridge,id=hn44 -device virtio-net,netdev=hn44,mac=e6:c8:ff:09:76:42 -hda focal.qcow2 -kernel bzImage -append "root=/dev/sda console=ttyS0 net.ifnames=0 ip=192.168.0.13" -snapshot -object memory-backend-file,size=1M,share=on,mem-path=/dev/shm/shm12,id=shm12 -device ivshmem-plain,memdev=shm12,bus=pci.0,addr=0x10,master=on

Two c files "poc.c" and "memset.c" have been attached.
You can compile them into individual binaries, poc and memset.

Then reproducing steps should be:
1. Set up a CephFS instance with 7 nodes, including 1 monitor, 2 OSDs, 2 MDSes, 2 clients mounted by Kclient. Each node runs in one VM.
2. Upload the memset and poc into two mounted clients.
3. Execute ./memset on both clients
4. On client 1, execute "./poc 1" under the mounted client root dir.
5. On client 2, execute "./poc 2" under the mounted client root dir.

If you repeat reproductions, you need to execute step 3-5 repeatedly instead of step 4-5 only.

Please feel free to contact me if you meet any issues during the reproduction.

Best,
Tao

Hi Tao,

It is not easy for me to reproduce this with the QEMU setup; I do not have the resources for it.

BTW, could you help reproduce this with the debug logs enabled, and then provide the debug logs?

1, MDS:

debug_mds = 25
debug_ms = 1

Then provide the mds logs.

2, Kclient node:

echo "module ceph +p" | sudo tee /sys/kernel/debug/dynamic_debug/control

Then provide the dmesg logs.

Locally I ran a similar test but didn't reproduce it. Have you tried a newer Ceph release? I see you are using Octopus; we have fixed several similar issues in later releases, so this may already be fixed there.

Thanks

Actions #8

Updated by Xiubo Li 4 months ago

Xiubo Li wrote:

Tao Lyu wrote:

[...]

Hi Tao,

It is not easy for me to reproduce this with the QEMU setup; I do not have the resources for it.

BTW, could you help reproduce this with the debug logs enabled, and then provide the debug logs?

1, MDS:

[...]

Then provide the mds logs.

2, Kclient node:

[...]

Then provide the dmesg logs.

Locally I ran a similar test but didn't reproduce it. Have you tried a newer Ceph release? I see you are using Octopus; we have fixed several similar issues in later releases, so this may already be fixed there.

Thanks

Please just ignore this; I reproduced it locally using your poc.c with a few small changes:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv){

    int clt = atoi(argv[1]);

    // map the shared file used to synchronize the two clients
    int mfd = open("/data/lxb11", O_RDWR);
    if (mfd < 0) printf("open IVSHM error\n");
    volatile char *msg = (char *)mmap(0, 1 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (msg == MAP_FAILED) printf("mmap failed\n");

    if (clt == 1) {
        // wait for client 2 to start
        while (!msg[1]) {}
        __sync_synchronize();
        // create file0
        syscall(__NR_open, "./file0", 0x40, 0xa);
        __sync_synchronize();
        // tell client 2 that file0 exists
        msg[0] = 1;
        __sync_synchronize();
        // wait for client 2 to finish its open
        while (!msg[2]) {}
        __sync_synchronize();
        syscall(__NR_chmod, "./file0", 0x0);
        msg[3] = 1;

        struct stat stat_buf;
        stat("./file0", &stat_buf);
        printf("stat: %d, %d\n", stat_buf.st_mode, S_ISREG(stat_buf.st_mode));
    } else if (clt == 2) {
        msg[1] = 1;
        __sync_synchronize();
        // wait for client 1 to create file0
        while (!msg[0]) {}
        __sync_synchronize();
        int r0 = syscall(__NR_open, "./file0", 0x80e82, 0x0);
        msg[2] = 1;
        __sync_synchronize();
        // wait for client 1 to finish its chmod
        while (!msg[3]) {}
        __sync_synchronize();
        struct stat stat_buf;
        stat("./file0", &stat_buf);
        printf("stat: %d, %d\n", stat_buf.st_mode, S_ISREG(stat_buf.st_mode));
    }
    return 0;
}

Then in client1 run:

$ dd if=/dev/zero of=/data/lxb11 bs=2M count=1; rm -f file0; ./poc 1
1+0 records in
1+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.0252125 s, 83.2 MB/s
stat: 32768, 1

And in client2 run:

$ ./poc 2
stat: 32776, 1
Actions #9

Updated by Xiubo Li 4 months ago

I am using the latest ceph and kclient code:

Firstly, the setattr request came from client.4164:

2262 2024-01-10T09:24:11.714+0800 7f712b239640  7 mds.0.server dispatch_client_request client_request(client.4164:1811 setattr mode=0100000 #0x1000000039b 2024-01-10T09:24:11.712124+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2263 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.server rdlock_path_pin_ref request(client.4164:1811 nref=2 cr=0x557d371eac00) #0x1000000039b
2264 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.locker acquire_locks request(client.4164:1811 nref=2 cr=0x557d371eac00)           

And it sent the early reply while holding the xlock on the authlock:

2318 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.server early_reply 0 ((0) Success) client_request(client.4164:1811 setattr mode=0100000 #0x1000000039b 2024-01-10T09:24:11.712124+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6   
2319 2024-01-10T09:24:11.714+0800 7f712b239640 20 mds.0.server set_trace_dist snapid head
2320 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2024-01-10T08:23:45.518860+0800 change_attr 0 0x557d370c2fc0) len=48
2321 2024-01-10T09:24:11.714+0800 7f712b239640 20 mds.0.cache.ino(0x1000000039b)  pfile 0 pauth 1 plink 0 pxattr 0 plocal 1 mtime 2024-01-10T09:24:11.571173+0800 ctime 2024-01-10T09:24:11.712124+0800 change_attr 1 valid=1
2322 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat issuing pLsXsFrw seq 6
2323 2024-01-10T09:24:11.714+0800 7f712b239640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat caps pLsXsFrw seq 6 mseq 0 xattrv 0
2324 2024-01-10T09:24:11.715+0800 7f712b239640 20 mds.0.server set_trace_dist added snap head in [inode 0x1000000039b [2,head] /file0 auth v2662 pv2666 ap=3 DIRTYPARENT s=0 n(v0 1=1+0) (iauth xlockdone x=1) (ifile mix w=1) (iversion lock w=1 last_client=4164) caps={4164=pLsXsFrw/pFscr@6,4165=pLsXsFrw/pAsxXsxFsxcrwb@3} | ptrwaiter=1 request=1 lock=4 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x557d37463b80]                                                                           

And then client.4164 and client.4165 sent getattr requests, and the MDS batched them:

2349 2024-01-10T09:24:11.716+0800 7f712b239640  1 -- v1:192.168.0.108:6814/2994867683 <== client.4164 v1:192.168.0.108:0/441632341 6430 ==== client_request(client.4164:1812 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6 ==== 164+0+0 (unknown 3833158978 0 0) 0x557d371ed200 con 0x557d37249b00                                                                                                      
2350 2024-01-10T09:24:11.716+0800 7f712b239640  4 mds.0.server handle_client_request client_request(client.4164:1812 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2351 2024-01-10T09:24:11.716+0800 7f712b239640 20 mds.0.4 get_session have 0x557d37211900 client.4164 v1:192.168.0.108:0/441632341 state open
2352 2024-01-10T09:24:11.716+0800 7f712b239640 15 mds.0.server  oldest_client_tid=1811

...

2400 2024-01-10T09:24:11.717+0800 7f712b239640  1 -- v1:192.168.0.108:6814/2994867683 <== client.4165 v1:192.168.0.108:0/1431932272 4511 ==== client_request(client.4165:226 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6 ==== 164+0+0 (unknown 339279103 0 0) 0x557d376d9500 con 0x557d3724a400
2401 2024-01-10T09:24:11.717+0800 7f712b239640  4 mds.0.server handle_client_request client_request(client.4165:226 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2402 2024-01-10T09:24:11.717+0800 7f712b239640 20 mds.0.4 get_session have 0x557d37212300 client.4165 v1:192.168.0.108:0/1431932272 state open
2403 2024-01-10T09:24:11.717+0800 7f712b239640 15 mds.0.server  oldest_client_tid=226
2404 2024-01-10T09:24:11.717+0800 7f712b239640  7 mds.0.cache request_start request(client.4165:226 nref=2 cr=0x557d376d9500)
2405 2024-01-10T09:24:11.717+0800 7f712b239640  7 mds.0.server dispatch_client_request client_request(client.4165:226 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2406 2024-01-10T09:24:11.717+0800 7f712b239640  7 mds.0.cache traverse: opening base ino 0x1000000039b snap head
2407 2024-01-10T09:24:11.717+0800 7f712b239640 20 mds.0.4 get_session have 0x557d37212300 client.4165 v1:192.168.0.108:0/1431932272 state open
2408 2024-01-10T09:24:11.717+0800 7f712b239640  7 mds.0.cache no tail dentry, base [inode 0x1000000039b [2,head] /file0 auth v2662 pv2666 ap=5 DIRTYPARENT s=0 n(v0 1=1+0) (iauth xlockdone x=1) (ifile mix->sync w=1) (iversion lock w=1 last_client=4164) caps={4164=pLsXsFr/pLsXsFrw/pFscr@7,4165=pLsXsFr/pLsXsFrw/pAsxXsxFsxcrwb@4} | ptrwaiter=1 request=2 lock=4 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x557d37463b80]
2409 2024-01-10T09:24:11.717+0800 7f712b239640  7 mds.0.cache found target [inode 0x1000000039b [2,head] /file0 auth v2662 pv2666 ap=5 DIRTYPARENT s=0 n(v0 1=1+0) (iauth xlockdone x=1) (ifile mix->sync w=1) (iversion lock w=1 last_client=4164) caps={4164=pLsXsFr/pLsXsFrw/pFscr@7,4165=pLsXsFr/pLsXsFrw/pAsxXsxFsxcrwb@4} | ptrwaiter=1 request=2 lock=4 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x557d37463b80]
2410 2024-01-10T09:24:11.717+0800 7f712b239640 10 mds.0.cache path_traverse finish on snapid head
2411 2024-01-10T09:24:11.717+0800 7f712b239640 20 mds.0.server handle_client_getattr: GETATTR op, wait for previous same getattr ops to respond. request(client.4165:226 nref=3 cr=0x557d376d9500)  

Then these two getattr requests finished, but since client.4164 was still holding the xlock on the authlock, set_trace_dist() used the projected inode (pi) to fill the auth_i for it, while for client.4165 it used the old inode (oi) instead:

2487 2024-01-10T09:24:11.915+0800 7f712522d640  7 mds.0.server dispatch_client_request client_request(client.4164:1812 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2488 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server rdlock_path_pin_ref request(client.4164:1812 nref=3 cr=0x557d371ed200) #0x1000000039b
2489 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.locker acquire_locks request(client.4164:1812 nref=3 cr=0x557d371ed200)                       

...

2504 2024-01-10T09:24:11.915+0800 7f712522d640 20 Session check_access path /file0
2505 2024-01-10T09:24:11.915+0800 7f712522d640 10 MDSAuthCap is_capable inode(path /file0 owner 1000:1000 mode 0100010) by caller 1000:1000 mask 1 new 0:0 cap: MDSAuthCaps[allow *]
2506 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.bal hit_dir 0 pop is 3.86044, frag * size 2 [pop IRD:[C 3.86e+00] IWR:[C 3.35e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:10.6]
2507 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server reply to stat on client_request(client.4164:1812 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2508 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server respond_to_request batch head request(client.4164:1812 nref=3 cr=0x557d371ed200)
2509 2024-01-10T09:24:11.915+0800 7f712522d640 20 respond: responding to batch ops with result=0: [batch front=request(client.4164:1812 nref=3 cr=0x557d371ed200)]
2510 2024-01-10T09:24:11.915+0800 7f712522d640  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4165:226 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2511 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
2512 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server lat 0.199208
2513 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server set_trace_dist snapid head
2514 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2024-01-10T08:23:45.518860+0800 change_attr 0 0x557d370c2fc0) len=48
2515 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.cache.ino(0x1000000039b)  pfile 0 pauth 0 plink 0 pxattr 0 plocal 0 mtime 2024-01-10T09:24:11.571173+0800 ctime 2024-01-10T09:24:11.571173+0800 change_attr 0 valid=1
2516 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat issuing pLsXsFscr seq 5
2517 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat caps pLsXsFscr seq 5 mseq 0 xattrv 0
2518 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server set_trace_dist added snap head in [inode 0x1000000039b [2,head] /file0 auth v2664 pv2666 ap=4 DIRTYPARENT s=0 n(v0 1=1+0) (iauth xlockdone r=1 x=1) (ilink sync r=1) (ifile sync r=1) (ixattr sync r=1) (iversion lock w=1 last_client=4164) cr={4165=0-4194304@1} caps={4164=pLsXsFr/pFscr@7,4165=pLsXsFscr/pAsxXsxFsxcrwb@5} | ptrwaiter=1 request=3 lock=7 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x557d37463b80]
2519 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.4 send_message_client client.4165 v1:192.168.0.108:0/1431932272 client_reply(???:226 = 0 (0) Success) v1
2520 2024-01-10T09:24:11.915+0800 7f712522d640  1 -- v1:192.168.0.108:6814/2994867683 --> v1:192.168.0.108:0/1431932272 -- client_reply(???:226 = 0 (0) Success) v1 -- 0x557d374cbc00 con 0x557d3724a400
2521 2024-01-10T09:24:11.915+0800 7f712522d640  7 mds.0.cache request_finish request(client.4165:226 nref=2 cr=0x557d376d9500)
2522 2024-01-10T09:24:11.915+0800 7f712522d640 15 mds.0.cache request_cleanup request(client.4165:226 nref=2 cr=0x557d376d9500)
2523 2024-01-10T09:24:11.915+0800 7f712522d640  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4164:1812 getattr AsLsXsFs #0x1000000039b 2024-01-10T09:24:11.716210+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6
2524 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
2525 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server lat 0.199439
2526 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server set_trace_dist snapid head
2527 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2024-01-10T08:23:45.518860+0800 change_attr 0 0x557d370c2fc0) len=48
2528 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.cache.ino(0x1000000039b)  pfile 0 pauth 1 plink 0 pxattr 0 plocal 1 mtime 2024-01-10T09:24:11.571173+0800 ctime 2024-01-10T09:24:11.712124+0800 change_attr 1 valid=1
2529 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat issuing pLsXsFscr seq 8
2530 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.ino(0x1000000039b) encode_inodestat caps pLsXsFscr seq 8 mseq 0 xattrv 1
2531 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.ino(0x1000000039b) including xattrs version 1
2532 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.server set_trace_dist added snap head in [inode 0x1000000039b [2,head] /file0 auth v2664 pv2666 ap=4 DIRTYPARENT s=0 n(v0 1=1+0) (iauth xlockdone r=1 x=1) (ilink sync r=1) (ifile sync r=1) (ixattr sync r=1) (iversion lock w=1 last_client=4164) cr={4165=0-4194304@1} caps={4164=pLsXsFscr/pFscr@8,4165=pLsXsFscr/pAsxXsxFsxcrwb@5} | ptrwaiter=1 request=2 lock=7 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x557d37463b80]
2533 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.4 send_message_client client.4164 v1:192.168.0.108:0/441632341 client_reply(???:1812 = 0 (0) Success) v1
2534 2024-01-10T09:24:11.915+0800 7f712522d640  1 -- v1:192.168.0.108:6814/2994867683 --> v1:192.168.0.108:0/441632341 -- client_reply(???:1812 = 0 (0) Success) v1 -- 0x557d374ca000 con 0x557d37249b00 

Since the new mode 0100000 was still only projected in the pi inode, client.4164 could get the new mode, while client.4165 got the old mode.

Then the safe reply for the setattr request was sent, popping the projected inode:

2585 2024-01-10T09:24:11.915+0800 7f712522d640 15 mds.0.cache.dir(0x1) pop_and_dirty_projected_fnode 0x557d376afb20 v2667
2586 2024-01-10T09:24:11.915+0800 7f712522d640 10 mds.0.cache.dir(0x1) _mark_dirty (already dirty) [dir 0x1 / [2,head] auth v=2667 cv=223/223 REP dir_auth=0 ap=1+3 state=1610874881|complete f(v0 m2024-01-10T09:24:11.571173+0800 2=2+0) n(v0 rc2024-01-10T09:24:11.712124+0800 b26088 2=2+0) hs=2+0,ss=0+0 dirty=2 | child=1 subtree=1 dirty=1 authpin=1 0x557d37160900] version 2667
2587 2024-01-10T09:24:11.915+0800 7f712522d640 20 mds.0.bal hit_dir 1 pop is 5.3481, frag * size 2 [pop IRD:[C 3.86e+00] IWR:[C 5.35e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:14.6]
2588 2024-01-10T09:24:11.915+0800 7f712522d640  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4164:1811 setattr mode=0100000 #0x1000000039b 2024-01-10T09:24:11.712124+0800 caller_uid=1000, caller_gid=1000{10,1000,}) v6                                                        

So this is why you can see the new mode on both clients just by waiting for a while. To be exact, if you wait 5s you will always see the new mode; the 5s is because the MDS flushes its journal logs to RADOS.
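
To make the above concrete, here is a small illustrative sketch in C (this is not MDS code; the struct and function names are invented): while the setattr is still being journaled, the MDS holds both the committed inode values and the projected ones, and which of the two a getattr reply is encoded from depends on whether the requesting client is the one holding the xlock.

#include <stdio.h>

/* Illustrative sketch only -- not MDS code; all names here are hypothetical. */
struct inode_view {
    unsigned int oi_mode;   /* committed ("old") mode: 0100010 (32776) */
    unsigned int pi_mode;   /* projected mode from the pending setattr: 0100000 (32768) */
};

/* The client holding the xlock (client.4164, which issued the chmod) is
 * answered from the projected inode; any other client (client.4165) is
 * answered from the committed inode until the journal entry is flushed
 * and the projection is popped. */
static unsigned int mode_for_reply(const struct inode_view *in, int is_xlock_holder)
{
    return is_xlock_holder ? in->pi_mode : in->oi_mode;
}

int main(void)
{
    struct inode_view file0 = { 0100010, 0100000 };
    printf("reply to client.4164: 0%o\n", mode_for_reply(&file0, 1));  /* sees the new mode */
    printf("reply to client.4165: 0%o\n", mode_for_reply(&file0, 0));  /* sees the old mode */
    return 0;
}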

Actions #10

Updated by Tao Lyu 4 months ago

Hi Xiubo,

Thanks a lot for your reproduction and analysis.

Can I rephrase the root cause of this issue as:
The chmod issued by client 1 returns before it is done.
Then the stat from client 1 can read the latest mode, as it is issued from the same client as the chmod,
while the stat from client 2 can only read the outdated mode.

If my understanding above is correct,
then I'm confused with two points:

1. Since client 1 grabs the xlock for authlock,
which is an exclusive lock (https://docs.ceph.com/en/latest/dev/mds_internals/locking/),
the stats issued after it should not be able to proceed until it releases that authlock.
But why do these two stats still proceed?

2. In the documentation, ceph claims to be a strongly consistent file system. But this issue violates the strong cache coherence and should be a bug, right?
https://docs.ceph.com/en/reef/cephfs/posix/
CephFS aims to adhere to POSIX semantics wherever possible. For example, in contrast to many other common network file systems like NFS, CephFS maintains strong cache coherency across clients.

Actions #11

Updated by Xiubo Li 4 months ago

Tao Lyu wrote:

Hi Xiubo,

Thanks a lot for your reproduction and analysis.

Can I rephrase the root cause of this issue as:
The chmod issued by client 1 returns before it is done.
Then the stat from client 1 can read the latest mode, as it is issued from the same client as the chmod,
while the stat from client 2 can only read the outdated mode.

Correct.

If my understanding above is correct,
then I'm confused with two points:

1. Since client 1 grabs the xlock for authlock,
which is an exclusive lock (https://docs.ceph.com/en/latest/dev/mds_internals/locking/),
the stats issued after it should not be able to proceed until it releases that authlock.
But why do these two stats still proceed?

Yeah, and this is why client2 returned the old metadata: client1 was still holding the xlock.

2. In the documentation, ceph claims to be a strongly consistent file system. But this issue violates the strong cache coherence and should be a bug, right?

Certainly this is a bug. Locally I have a draft patch to fix it and it works well. But I still need more time to improve it.

https://docs.ceph.com/en/reef/cephfs/posix/
CephFS aims to adhere to POSIX semantics wherever possible. For example, in contrast to many other common network file systems like NFS, CephFS maintains strong cache coherency across clients.

Yeah, this is just a very rare corner-case bug and it's very hard to reproduce normally. Your use case is very special and can trigger it very easily.

Actions #12

Updated by Tao Lyu 4 months ago

Thank you so much for the quick reply.
Good to hear your confirmation.

This bug was found by an academic research tool, and we are currently preparing the paper for submission.
It would be nice if you could share your local patch with us, to help me describe this bug in detail in the paper.

Have a nice day!

Actions #13

Updated by Xiubo Li 4 months ago

Tao Lyu wrote:

Thank you so much for the quick reply.
Good to hear your confirmation.

This bug was found by an academic research tool, and we are currently preparing the paper for submission.
It would be nice if you could share your local patch with us, to help me describe this bug in detail in the paper.

Have a nice day!

Here it is, please note it's just a draft patch.

Actions #14

Updated by Tao Lyu 4 months ago

Xiubo Li wrote:

Here it is, please note it's just a draft patch.

No worries. Thanks a lot.

Actions #15

Updated by Xiubo Li about 1 month ago

  • Pull request ID set to 56602
Actions #16

Updated by Xiubo Li about 1 month ago

  • Status changed from Triaged to Fix Under Review
Actions #17

Updated by Leonid Usov 9 days ago

Hi, I came here from the PR.

I'd like to suggest that the issue here is not a bug in the MDS. While I can't find an explicit mention about this in the POSIX standard, it appears that `chmod` would require an `fsync` if you want to make sure that the change is persisted.

The observed behavior is generally expected for a change that hasn't been flushed. Metadata is also subject to local caching and flushing by the kernel, just as the data.

Here is a BZ I found that mentions an issue with chmod where the changes weren't persisted even after an fsync call, which is one piece of indirect evidence that an fsync is expected after changing metadata: https://bugzilla.kernel.org/show_bug.cgi?id=202485

Another piece of indirect evidence is the documentation for https://linux.die.net/man/2/fdatasync, which mentions that it doesn't flush the file metadata, unlike `fsync()`.

Finally, this POSIX documentation about chmod https://pubs.opengroup.org/onlinepubs/9699919799/ doesn't mention that the operation must be synchronous. Furthermore, it mentions the following:

Upon successfully changing the file mode bits of a file, the chmod utility shall mark for update the last file status change timestamp of the file.

suggesting that at least some of the effects of this command are asynchronous.

Please update your test to issue an fsync syscall after the chmod and before notifying the second client. If that still doesn't give consistent results from both clients, then it is a bug. Otherwise, I believe it's working as designed.
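
For clarity, a minimal sketch of the change being suggested here, based on the poc.c quoted earlier in this thread (the helper name is made up and error handling is omitted):

#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: chmod the file, then fsync it through an open fd
 * before the test signals the second client (i.e. before msg[3] = 1 in
 * the poc.c above). */
void chmod_and_flush(const char *path)
{
    int fd = open(path, O_RDWR);
    syscall(__NR_chmod, path, 0x0);
    if (fd >= 0) {
        fsync(fd);      /* flush the chmod before signalling the second client */
        close(fd);
    }
}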

Actions #18

Updated by Tao Lyu 9 days ago

Hi Leonid,

Thanks for your questions. I address your points below:

I'd like to suggest that the issue here is not a bug in the MDS. While I can't find an explicit mention about this in the POSIX standard, it appears that `chmod` would require an `fsync` if you want to make sure that the change is persisted.

The observed behavior is generally expected for a change that hasn't been flushed. Metadata is also subject to local caching and flushing by the kernel, just as the data.

This bug is not related to fsync.
Fsync is for persistence, not for cache coherency.

This bug violates the strong cache coherency across clients that CephFS guarantees.
See the CephFS documentation below.

https://docs.ceph.com/en/reef/cephfs/posix/:
CephFS aims to adhere to POSIX semantics wherever possible.
For example, in contrast to many other common network file systems like NFS,
CephFS maintains strong cache coherency across clients.

The documentation above means that file operations across clients behave exactly as they would on a single node,
because CephFS ensures strong cache coherency across clients.
For example, a chmod followed by a stat on one client should behave the same as a chmod on client1 followed by a stat on client2.

POSIX follows strong consistency;
that is, every data/metadata read should see the latest data/metadata written by the latest FS operations.

Thus, whether on one client or across multiple clients, a stat after a chmod should always see the new mode.
However, this bug violates this semantic.

Here is a BZ I found that mentions an issue with chmod where the changes weren't persisted even after an fsync call, which is one piece of indirect evidence that an fsync is expected after changing metadata: https://bugzilla.kernel.org/show_bug.cgi?id=202485

Another piece of indirect evidence is the documentation for https://linux.die.net/man/2/fdatasync, which mentions that it doesn't flush the file metadata, unlike `fsync()`.

That is a crash-consistency bug, which is a different type of bug.
A crash-consistency bug happens when data/metadata explicitly flushed by fsync/fdatasync is lost after a system crash,
which violates the persistence semantics of fsync/fdatasync.

Finally, this POSIX documentation about chmod https://pubs.opengroup.org/onlinepubs/9699919799/ doesn't mention that the operation must be synchronous. Furthermore, it mentions the following:

Upon successfully changing the file mode bits of a file, the chmod utility shall mark for update the last file status change timestamp of the file.

suggesting that at least some of the effects of this command are asynchronous.

The last file access time is an exception in file systems:
it is updated asynchronously to avoid turning a read-only operation into a write operation just to update the timestamp.

Please refer to: https://lwn.net/Articles/244829/

Actions #19

Updated by Leonid Usov 9 days ago

Tao, I appreciate your quick and detailed response. I reviewed the client code for setxattr, and the code doesn't use caching; it performs a synchronous call to the MDS. This changes my opinion about this issue.

suggesting that at least some of the effects of this command are asynchronous.

The last file access time is an exception in file systems: it is updated asynchronously to avoid turning a read-only operation into a write operation just to update the timestamp.

OK, this is interesting, though I don't see how it applies to chmod, which is already a write operation.

Actions #20

Updated by Tao Lyu 8 days ago

Hi Leonid,

Thanks. Good to hear.
