Bug #24405
I/O errors reported by client copy apps (rsync/cp)
Status: Closed
Description
Hello,
Globally, CephFS is working fine (thanks for that!), except for one point: sometimes, after a random amount of copy work, the filesystem returns an I/O error to the program.
(Mounted with "mount -t ceph" on a client that also acts as OSD/MGR/MON/MDS.)
During a large copy (~30TB), this kills rsync, or makes cp skip files:
tom@telegeo02:~> rsync -avvv /somewhere/* /mnt/cephfs/pool_21p3/
...
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_14QPF____20161228_QLT_R2.DBL.TIF
     90,464,592 100%   63.53MB/s    0:00:01 (xfr#31253, ir-chk=1014/45890)
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_14QPF____20161228_QLT_R2.HDR
          2,946 100%    8.01kB/s    0:00:00 (xfr#31254, ir-chk=1013/45890)
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QPF____20161228_FRE_R1.DBL.TIF
    521,175,040  54%   18.79MB/s    0:00:23
rsync: [sender] write error: Broken pipe (32)
rsync: write failed on "/mnt/cephfs/pool_21p3/S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QPF____20161228_FRE_R1.DBL.TIF": Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(389) [receiver=3.1.0]
tom@telegeo02:~> cp -rnv /somewhere/* /mnt/cephfs/pool_21p3/ > /export/miro/tom/logs/log_ceph_testcp
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170223T165311_N0204_R026_T14QNE_20170223T170837.SAFE/S2A_OPER_SSC_L2VALD_14QNE____20170223.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QNE____20170223_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170223T165311_N0204_R026_T14QQK_20170223T165747.SAFE/S2A_OPER_SSC_L2VALD_14QQK____20170223.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QQK____20170223_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15PXT_20170224T163738.SAFE/S2A_OPER_SSC_L2VALD_15PXT____20170224.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_15PXT____20170224_QLT_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15QZE_20170224T162512.SAFE/S2A_OPER_SSC_L2VALD_15QZE____20170224.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_15QZE____20170224_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15QZV_20170224T163738.SAFE/S2A_OPER_SSC_L2VALD_15QZV____20170224.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_15QZV____20170224_FRE_R1.DBL.TIF': Input/output error
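As a workaround on our side (a sketch, not a fix): since the EIO seems transient, a small retry wrapper lets the copy continue instead of dying. The function name retry_copy and the RETRY_DELAY variable are ours; combined with rsync's --partial, interrupted files are resumed rather than restarted.

```shell
#!/bin/sh
# retry_copy MAX_TRIES CMD [ARGS...]
# Re-runs CMD until it succeeds or MAX_TRIES is reached.
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry_copy() {
    max=$1; shift
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-5}"
        echo "retry $n/$max: $*" >&2
    done
}

# Example (hypothetical invocation):
# retry_copy 5 rsync -a --partial /somewhere/ /mnt/cephfs/pool_21p3/
```

This obviously only papers over the bug; it does not explain where the EIO comes from.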
I'm using decommissioned hardware to run "real-world" stress tests, and as you can see (in the linked status file) 3 OSDs are already dead. When they died, such an error appeared; that is a logical cause, but not acceptable for production. The same happened when OSDs were killed by an out-of-memory error (160% of the configured maximum!); that issue is solved now (limit set to 512MB => 1G used). However, the error is still present, and I find no trace of it in any log (neither syslog nor the Ceph logs, checked on all machines).
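For reference, these are the kinds of checks we run after each EIO (standard ceph/Linux commands; the grep patterns and the function name check_cluster_health are ours, and `ceph -s` assumes the CLI can reach the monitors):

```shell
#!/bin/sh
# Sketch of post-EIO triage on a cluster node.
check_cluster_health() {
    # Cluster status and any OSDs currently marked down:
    if command -v ceph >/dev/null 2>&1; then
        timeout 10 ceph -s || echo "ceph -s failed or timed out"
        timeout 10 ceph osd tree | grep -w down || echo "no OSD marked down"
    else
        echo "ceph CLI not available on this host"
    fi
    # OOM kills of ceph-osd recorded by the kernel (this is how we
    # caught the 512MB memory-limit problem):
    dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' | grep -i ceph \
        || echo "no ceph OOM kill in dmesg"
}
check_cluster_health
```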
For me, there are two issues:
- any disk error raised before the OSD goes down should be handled, if not at the OSD level, then at least by the Ceph client. Only if the error is "real" (e.g. not enough OSDs available to write the data) should it be reported to filesystem users as an I/O error.
- there is literally no information to locate the source of the remaining issue (a malfunctioning disk? a network/memory overload?). The info in /sys/kernel/debug/ceph/ does not help (it is empty most of the time, so this does not seem to be a "lag" issue).
Any idea to locate the issue, or any fix, is welcome!
Thanks!