Bug #24405 (Closed): I/O errors reported by client copy apps (rsync/cp)

Added by Thomas De Maet almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Overall, CephFS is working fine (thanks for that!), except for one point: sometimes, after a random amount of copy work, the filesystem returns an I/O error to the program.

(Mounted with "mount -t ceph" on a client that also acts as an OSD/MGR/MON/MDS node.)
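
For reference, a kernel-client mount along these lines is assumed; the monitor address, user name and secret file below are placeholders, not taken from this report:

# Assumed kernel CephFS client mount (placeholders for monitor, user, keyfile):
mount -t ceph 192.168.0.10:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret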

During a copy of about 30 TB, this kills rsync or makes cp skip files:

tom@telegeo02:~> rsync -avvv /somewhere/* /mnt/cephfs/pool_21p3/
...
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_14QPF____20161228_QLT_R2.DBL.TIF
     90,464,592 100%   63.53MB/s    0:00:01 (xfr#31253, ir-chk=1014/45890)
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_14QPF____20161228_QLT_R2.HDR
          2,946 100%    8.01kB/s    0:00:00 (xfr#31254, ir-chk=1013/45890)
S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QPF____20161228_FRE_R1.DBL.TIF
    521,175,040  54%   18.79MB/s    0:00:23  
rsync: [sender] write error: Broken pipe (32)
rsync: write failed on "/mnt/cephfs/pool_21p3/S2A_MSIL2A_20161228T170712_N0204_R069_T14QPF_20161228T171337.SAFE/S2A_OPER_SSC_L2VALD_14QPF____20161228.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QPF____20161228_FRE_R1.DBL.TIF": Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(389) [receiver=3.1.0]
tom@telegeo02:~> cp -rnv /somewhere/* /mnt/cephfs/pool_21p3/ > /export/miro/tom/logs/log_ceph_testcp
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170223T165311_N0204_R026_T14QNE_20170223T170837.SAFE/S2A_OPER_SSC_L2VALD_14QNE____20170223.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QNE____20170223_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170223T165311_N0204_R026_T14QQK_20170223T165747.SAFE/S2A_OPER_SSC_L2VALD_14QQK____20170223.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_14QQK____20170223_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15PXT_20170224T163738.SAFE/S2A_OPER_SSC_L2VALD_15PXT____20170224.DBL.DIR/S2A_OPER_SSC_PDTANX_L2VALD_15PXT____20170224_QLT_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15QZE_20170224T162512.SAFE/S2A_OPER_SSC_L2VALD_15QZE____20170224.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_15QZE____20170224_FRE_R1.DBL.TIF': Input/output error
cp: error writing '/mnt/cephfs/pool_21p3/S2A_MSIL2A_20170224T162331_N0204_R040_T15QZV_20170224T163738.SAFE/S2A_OPER_SSC_L2VALD_15QZV____20170224.DBL.DIR/S2A_OPER_SSC_PDTIMG_L2VALD_15QZV____20170224_FRE_R1.DBL.TIF': Input/output error
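
To record exactly when these failures happen, independently of rsync or cp, something like the following write loop could be run against the mount; the mount path, file sizes and log location are assumptions, not from this report:

#!/bin/bash
# Sketch: repeatedly write and fsync test files on the CephFS mount and log
# a timestamp plus dd's error output whenever a write fails, so the failure
# can be correlated with OSD restarts or OOM kills on the cluster nodes.
MNT=/mnt/cephfs/pool_21p3          # assumed mount point
LOG=/var/tmp/cephfs_write_errors.log

i=0
while true; do
    i=$((i + 1))
    # dd exits non-zero if the write or the final fsync fails (e.g. with EIO)
    if ! err=$(dd if=/dev/zero of="$MNT/io_test_$i" bs=4M count=256 conv=fsync 2>&1); then
        echo "$(date -Is) write #$i failed: $err" >>"$LOG"
    fi
    rm -f "$MNT/io_test_$i"
done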

I'm using decommissioned hardware for these "real-world" stress tests, and as you can see in the attached status file, 3 OSDs are already dead. When they died, such an error appeared; that is an understandable cause, but it is not acceptable for production. The same thing happened when an OSD was killed by an out-of-memory error (160% of the configured maximum!); that particular issue is solved now (the memory limit was set to 512 MB, with about 1 GB actually used). However, the I/O error still occurs, and I can find no trace of it in any log (neither syslog nor the Ceph logs, checked on all machines).
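
For what it's worth, the kernel CephFS client tends to report its errors in the kernel log rather than in the Ceph daemon logs, so checks along these lines right after a failed copy might narrow down the moment of failure (a sketch only; the time window and grep patterns are assumptions):

# Sketch: on the client, look for kernel-client messages and OOM kills
# around the failure time, then check whether the cluster itself saw an
# OSD go down at that moment.
dmesg -T | grep -iE 'libceph|ceph:|out of memory|killed process'
ceph health detail
ceph osd tree | grep -i down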

For me, there are 2 issues:

- any disk error that occurs before the OSD goes down should be handled, if not at the OSD level, then at least by the Ceph client. Only if the error is "real" (e.g. not enough OSDs left to write the data) should it be reported to filesystem users as an I/O error.

- there is literally no information to locate the source of the remaining issue (a failing disk? a network/memory overload?). The information in /sys/kernel/debug/ceph/ does not help (it is empty most of the time, so it does not seem to be a "lag" issue); see the debug sketch after this list.
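
If more client-side visibility is needed, one option (assuming the kernel was built with CONFIG_DYNAMIC_DEBUG and debugfs is mounted) is to turn on verbose logging in the ceph and libceph kernel modules and watch the per-mount debugfs files while the copy runs:

# Sketch: enable verbose kernel-client logging (very chatty) and inspect
# the per-mount debugfs files; osdc/mdsc list requests currently in flight
# to the OSDs/MDS, so a stuck or failing request should show up there.
echo "module ceph +p"    > /sys/kernel/debug/dynamic_debug/control
echo "module libceph +p" > /sys/kernel/debug/dynamic_debug/control
cat /sys/kernel/debug/ceph/*/osdc
cat /sys/kernel/debug/ceph/*/mdsc
dmesg -w | grep -iE 'libceph|ceph:'

# To turn the verbose logging off again:
echo "module ceph -p"    > /sys/kernel/debug/dynamic_debug/control
echo "module libceph -p" > /sys/kernel/debug/dynamic_debug/control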

Any idea on how to locate the issue, or any fix, is welcome!

Thanks!


Files

status.txt (9.61 KB) - Thomas De Maet, 06/04/2018 11:35 AM
cmd_debug_mode.txt (10.7 KB) - commands used - Thomas De Maet, 06/04/2018 01:36 PM
trace_ceph_io_error (679 KB) - 10000 lines after 1st bug - Thomas De Maet, 06/04/2018 01:36 PM
trace_ceph_io_error_2 (720 KB) - 10000 lines after 2nd bug - Thomas De Maet, 06/04/2018 01:36 PM
trace_ceph_io_error_3 (674 KB) - 10000 lines after 3rd bug - Thomas De Maet, 06/04/2018 01:36 PM
trace_ceph_io_error_4 (675 KB) - 10000 lines after 4th bug - Thomas De Maet, 06/04/2018 01:36 PM