Bug #50083
Updated by Patrick Donnelly about 3 years ago
Ceph cluster is running 14.2.9 (nautilus), a 3-node containerised cluster: 1 active MDS, 2 standby. Using the ceph kernel driver on 5.11.11-1.el7.elrepo.x86_64 (also tested on 5.10.10-1.el7.elrepo.x86_64). Since moving to the 5.11.11 and 5.10.10 kernels, we've noticed files on cephfs mounts are being overwritten with null bytes: normal text files are full of "^@" instead of the written content. Additionally, the metadata for these files isn't correct; the last-modified time seems to be slow to update:

<pre>
[root@svr02 albacore] /opt/dcl/deploy/log> echo test >> cmd.210331.log ; date
Wed Mar 31 14:14:20 BST 2021
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39012 Mar 31 14:03 cmd.210331.log
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39012 Mar 31 14:03 cmd.210331.log
[root@svr02 albacore] /opt/dcl/deploy/log> echo test >> cmd.210331.log ; date
Wed Mar 31 14:15:05 BST 2021
[root@svr02 albacore] /opt/dcl/deploy/log> ls -ltr cmd.210331.log
-rw-rw---- 1 dcmbox dcl 39017 Mar 31 14:03 cmd.210331.log
</pre>

We didn't experience these issues when running on 5.8.10-1.el7.elrepo.x86_64.
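The manual append-then-ls check above can be sketched as a small script. This is a hypothetical helper, not something from the report; the file path is a placeholder and should point at a file on the affected cephfs mount. On a healthy mount it prints OK; on the affected kernels stat kept returning stale size/mtime after a write.

```shell
#!/bin/sh
# Hypothetical repro helper: append to a file, then check whether stat
# reports the new size. Uses GNU coreutils stat (-c %s).
check_append() {
    # $1: path to a test file (point it at a file on the cephfs mount)
    f="$1"
    touch "$f"
    before=$(stat -c %s "$f")
    echo test >> "$f"
    after=$(stat -c %s "$f")
    if [ "$after" -gt "$before" ]; then
        echo "OK: size updated ($before -> $after)"
    else
        echo "STALE: size still $after after append"
    fi
}

check_append "${1:-/tmp/cephfs-append-check}"
```

Running it in a loop while the cluster is under normal load would show whether the stale metadata correlates with particular clients or mounts.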
Ceph cluster is healthy:

<pre>
[qs-admin@albacore_sc0 metaswitch]$ ceph -s
++ sudo docker ps --filter name=ceph-mon- -q
++ sudo docker exec d384020a8fc1 ceph
  cluster:
    id:     e4e508a2-21fd-4495-9645-2a7ac1521481
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum albacore_sc0,albacore_sc1,albacore_sc2 (age 11d)
    mgr: albacore_sc2(active, since 6d), standbys: albacore_sc0, albacore_sc1
    mds: cephfs:1 {0=albacore_sc0=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 11d), 3 in (since 3M)
    rgw: 6 daemons active (albacore_sc0.pubsub, albacore_sc0.rgw0, albacore_sc1.pubsub, albacore_sc1.rgw0, albacore_sc2.pubsub, albacore_sc2.rgw0)

  data:
    pools:   13 pools, 136 pgs
    objects: 1.25M objects, 76 GiB
    usage:   238 GiB used, 62 GiB / 300 GiB avail
    pgs:     136 active+clean

  io:
    client: 40 KiB/s rd, 26 KiB/s wr, 39 op/s rd, 37 op/s wr
</pre>

We have two client machines, each with 21 cephfs mounts, so a total of 42 clients according to ceph. Our mount config:

<pre>
10.225.41.221,10.225.41.222,10.225.41.223:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
</pre>

No warnings or slow requests. No trace of hanging ops on the client or server. No ops stuck in flight:

<pre>
[qs-admin@albacore_sc0 ~]$ ceph daemon mds.albacore_sc0 dump_ops_in_flight
++ sudo docker ps --filter name=ceph-mon- -q
++ sudo docker exec d384020a8fc1 ceph
{
    "ops": [],
    "num_ops": 0
}
</pre>

No issues reported in dmesg on the client (attached). Some evictions logged by the MDS on the ceph servers (MDS output attached). No obvious errors in the MON logs, but frequent calls to _set_new_cache_sizes, which I don't recall seeing before (MON output attached).
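To inventory which files have been hit by the "^@" (NUL byte) corruption described above, a sketch like the following could be run against the affected mounts. This is a hypothetical helper, not from the report; the default path is a placeholder, and it relies on GNU grep's `-P` (PCRE) mode to match a NUL byte.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that contain NUL bytes,
# i.e. candidates for the corruption described in the report.
scan_nul() {
    dir="$1"
    # grep -l prints only the names of matching files; -P enables \x00
    find "$dir" -type f -exec grep -lP '\x00' {} + 2>/dev/null
}

scan_nul "${1:-/opt/dcl/deploy/log}"
```

Note that NUL bytes are legitimate in binary files, so this is only a useful signal when pointed at directories expected to hold plain-text logs.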