Bug #50083
Updated by Patrick Donnelly about 3 years ago
Ceph cluster is running 14.2.9 (nautilus), a 3-node containerised cluster: 1 active MDS, 2 standby. Using the ceph kernel driver on 5.11.11-1.el7.elrepo.x86_64 (also tested on 5.10.10-1.el7.elrepo.x86_64). Since moving to the 5.11.11 and 5.10.10 kernels, we've noticed files on cephfs mounts are being overwritten with null bytes: normal text files are full of "^@" instead of the written content. Additionally, the metadata for these files isn't correct; the last-modified time seems to be slow to update:

<pre>
[root@svr02 albacore] /opt/dcl/deploy/log> echo test >> cmd.210331.log ; date
Wed Mar 31 14:14:20 BST 2021
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39012 Mar 31 14:03 cmd.210331.log
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39012 Mar 31 14:03 cmd.210331.log
[root@svr02 albacore] /opt/dcl/deploy/log> echo test >> cmd.210331.log ; date
Wed Mar 31 14:15:05 BST 2021
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39017 Mar 31 14:03 cmd.210331.log
</pre>

We didn't experience these issues when running on 5.8.10-1.el7.elrepo.x86_64.
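The manual append-then-ls check above can be sketched as a small script. This is a hypothetical helper, not something from the report; the file path is a placeholder and should point at a file on the affected cephfs mount. On a healthy mount it prints OK; on the affected kernels stat kept returning stale size/mtime after a write.

```shell
#!/bin/sh
# Hypothetical repro helper: append to a file, then check whether stat
# reports the new size. Uses GNU coreutils stat (-c %s).
check_append() {
    # $1: path to a test file (point it at a file on the cephfs mount)
    f="$1"
    touch "$f"
    before=$(stat -c %s "$f")
    echo test >> "$f"
    after=$(stat -c %s "$f")
    if [ "$after" -gt "$before" ]; then
        echo "OK: size updated ($before -> $after)"
    else
        echo "STALE: size still $after after append"
    fi
}

check_append "${1:-/tmp/cephfs-append-check}"
```

Running it in a loop while the cluster is under normal load would show whether the stale metadata correlates with particular clients or mounts.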
Ceph cluster is healthy:

<pre>
[qs-admin@albacore_sc0 metaswitch]$ ceph -s
++ sudo docker ps --filter name=ceph-mon- -q
++ sudo docker exec d384020a8fc1 ceph
  cluster:
    id:     e4e508a2-21fd-4495-9645-2a7ac1521481
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum albacore_sc0,albacore_sc1,albacore_sc2 (age 11d)
    mgr: albacore_sc2(active, since 6d), standbys: albacore_sc0, albacore_sc1
    mds: cephfs:1 {0=albacore_sc0=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 11d), 3 in (since 3M)
    rgw: 6 daemons active (albacore_sc0.pubsub, albacore_sc0.rgw0, albacore_sc1.pubsub, albacore_sc1.rgw0, albacore_sc2.pubsub, albacore_sc2.rgw0)

  data:
    pools:   13 pools, 136 pgs
    objects: 1.25M objects, 76 GiB
    usage:   238 GiB used, 62 GiB / 300 GiB avail
    pgs:     136 active+clean

  io:
    client: 40 KiB/s rd, 26 KiB/s wr, 39 op/s rd, 37 op/s wr
</pre>

We have two client machines, each with 21 cephfs mounts, so a total of 42 clients according to ceph. Our mount config:

<pre>
10.225.41.221,10.225.41.222,10.225.41.223:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
</pre>

No warnings or slow requests. No trace of hanging ops on the client or server. No ops stuck in flight:

<pre>
[qs-admin@albacore_sc0 ~]$ ceph daemon mds.albacore_sc0 dump_ops_in_flight
++ sudo docker ps --filter name=ceph-mon- -q
++ sudo docker exec d384020a8fc1 ceph
{
    "ops": [],
    "num_ops": 0
}
</pre>

No issues reported in dmesg on the client (attached). Some evictions logged by the MDS on the ceph servers (MDS output attached). No obvious errors in the MON logs, but frequent calls to _set_new_cache_sizes, which I don't recall seeing before (MON output attached).
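To inventory which files have been hit by the "^@" (NUL byte) corruption described above, a sketch like the following could be run against the affected mounts. This is a hypothetical helper, not from the report; the default path is a placeholder, and it relies on GNU grep's `-P` (PCRE) mode to match a NUL byte.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that contain NUL bytes,
# i.e. candidates for the corruption described in the report.
scan_nul() {
    dir="$1"
    # grep -l prints only the names of matching files; -P enables \x00
    find "$dir" -type f -exec grep -lP '\x00' {} + 2>/dev/null
}

scan_nul "${1:-/opt/dcl/deploy/log}"
```

Note that NUL bytes are legitimate in binary files, so this is only a useful signal when pointed at directories expected to hold plain-text logs.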