Bug #13256

closed

I/O error with cephfs accessing root .snap directory on v9.0.3

Added by Eric Eastman over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am running a test Ceph cluster on Ceph v9.0.3, with all kernels at 4.2.0 on Ubuntu Trusty. Snapshots are enabled, and a cron job creates a snapshot every hour from a system that uses a fuse mount. On two other systems that use kernel mounts, I am now seeing:

 
# ls -l /cephfs/.snap
ls: reading directory /cephfs/.snap: Input/output error
total 0
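
For anyone who wants to check other kernel-mount clients for the same symptom, here is a minimal sketch that tries to list a directory and reports the errno name on failure. The helper name `probe_dir` and its `lister` parameter are hypothetical and exist only for this illustration. On an affected client the expectation would be `EIO`, matching the "Input/output error" that `ls` prints above.

```python
import errno
import os


def probe_dir(path, lister=os.listdir):
    """Return 'ok' if the directory can be listed, else the errno name.

    `lister` is injectable only so the error path can be exercised
    without a broken mount; it defaults to os.listdir.
    """
    try:
        lister(path)
    except OSError as exc:
        # errno.errorcode maps numeric codes to symbolic names such as 'EIO'.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"
```

Calling `probe_dir("/cephfs/.snap")` on each client gives a quick per-host result; the path is the one from this report and would differ on other setups.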

When this happens, I get the following error message in the /var/log/kern.log file:

Sep 26 11:46:42 dfgw02 kernel: [642211.597617] ceph: dir contents are larger than expected
Sep 26 11:46:42 dfgw02 kernel: [642211.597651] ------------[ cut here ]------------
Sep 26 11:46:42 dfgw02 kernel: [642211.597670] WARNING: CPU: 5 PID: 122627 at /home/kernel/COD/linux/fs/ceph/mds_client.c:188 handle_reply+0xafe/0xbd0 [ceph]()
Sep 26 11:46:42 dfgw02 kernel: [642211.597672] Modules linked in: ipmi_devintf ipmi_ssif ttm drm_kms_helper ceph drm i2c_algo_bit gpio_ich libceph coretemp input_leds kvm acpi_power_meter 8250_fintek i7core_edac hpilo libcrc32c fscache serio_raw edac_core ipmi_si ipmi_msghandler lpc_ich shpchp mac_hid bonding lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic usbhid hid psmouse bnx2 mlx4_core hpsa
Sep 26 11:46:42 dfgw02 kernel: [642211.597697] CPU: 5 PID: 122627 Comm: kworker/5:0 Tainted: G        W I     4.2.0-040200-generic #201508301530
Sep 26 11:46:42 dfgw02 kernel: [642211.597699] Hardware name: HP ProLiant DL360 G6, BIOS P64 01/22/2015
Sep 26 11:46:42 dfgw02 kernel: [642211.597710] Workqueue: ceph-msgr con_work [libceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597712]  ffffffffc03c0838 ffff880c0115fb68 ffffffff817a1b43 0000000000000000
Sep 26 11:46:42 dfgw02 kernel: [642211.597714]  0000000000000000 ffff880c0115fba8 ffffffff8107719a ffff880c0115fbc8
Sep 26 11:46:42 dfgw02 kernel: [642211.597716]  ffff880c03b66c00 ffff880c00c67e00 ffffc9000dd85830 ffff880601545008
Sep 26 11:46:42 dfgw02 kernel: [642211.597718] Call Trace:
Sep 26 11:46:42 dfgw02 kernel: [642211.597726]  [<ffffffff817a1b43>] dump_stack+0x45/0x57
Sep 26 11:46:42 dfgw02 kernel: [642211.597732]  [<ffffffff8107719a>] warn_slowpath_common+0x8a/0xc0
Sep 26 11:46:42 dfgw02 kernel: [642211.597734]  [<ffffffff8107728a>] warn_slowpath_null+0x1a/0x20
Sep 26 11:46:42 dfgw02 kernel: [642211.597742]  [<ffffffffc03af07e>] handle_reply+0xafe/0xbd0 [ceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597750]  [<ffffffffc03b0c7e>] dispatch+0xae/0xc10 [ceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597755]  [<ffffffffc02b9af8>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597761]  [<ffffffffc02bdac1>] try_read+0x3d1/0x1060 [libceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597766]  [<ffffffff8101dc6b>] ? native_sched_clock+0x2b/0x80
Sep 26 11:46:42 dfgw02 kernel: [642211.597768]  [<ffffffff8101dcc9>] ? sched_clock+0x9/0x10
Sep 26 11:46:42 dfgw02 kernel: [642211.597772]  [<ffffffff810abbaf>] ? put_prev_entity+0x2f/0x4a0
Sep 26 11:46:42 dfgw02 kernel: [642211.597777]  [<ffffffffc02be802>] con_work+0xb2/0x5f0 [libceph]
Sep 26 11:46:42 dfgw02 kernel: [642211.597783]  [<ffffffff8108f21e>] process_one_work+0x14e/0x3d0
Sep 26 11:46:42 dfgw02 kernel: [642211.597785]  [<ffffffff8108f8ca>] worker_thread+0x11a/0x470
Sep 26 11:46:42 dfgw02 kernel: [642211.597787]  [<ffffffff8108f7b0>] ? rescuer_thread+0x310/0x310
Sep 26 11:46:42 dfgw02 kernel: [642211.597790]  [<ffffffff81094e29>] kthread+0xc9/0xe0
Sep 26 11:46:42 dfgw02 kernel: [642211.597792]  [<ffffffff81094d60>] ? kthread_create_on_node+0x180/0x180
Sep 26 11:46:42 dfgw02 kernel: [642211.597796]  [<ffffffff817a925f>] ret_from_fork+0x3f/0x70
Sep 26 11:46:42 dfgw02 kernel: [642211.597798]  [<ffffffff81094d60>] ? kthread_create_on_node+0x180/0x180
Sep 26 11:46:42 dfgw02 kernel: [642211.597799] ---[ end trace cdeabe6cb4c8bb2d ]---
Sep 26 11:46:42 dfgw02 kernel: [642211.597800] ceph: problem parsing dir contents -5
Sep 26 11:46:42 dfgw02 kernel: [642211.597825] ceph: mds parse_reply err -5
Sep 26 11:46:42 dfgw02 kernel: [642211.597849] ceph: mdsc_handle_reply got corrupt reply mds0(tid:2010)
Sep 26 11:46:42 dfgw02 kernel: [642211.597878] header: 00000000: bc 05 00 00 00 00 00 00 da 07 00 00 00 00 00 00  ................
Sep 26 11:46:42 dfgw02 kernel: [642211.597879] header: 00000010: 1a 00 7f 00 01 00 30 b8 00 00 00 00 00 00 00 00  ......0.........

There are many more error lines in the kern.log file, so I have attached it to this report.
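
Since the full log is long, a rough filter may help when reading it. The sketch below pulls out only the kernel lines whose message starts with `ceph:` and, where a message ends in a negative number, looks that number up as an errno. It assumes the syslog prefix seen in the excerpt above and that the trailing number is a negated errno (the usual kernel convention, so `-5` would be `EIO`). The names `ceph_messages`, `LINE_RE` and `ERR_RE` are made up for this example.

```python
import errno
import re

# Syslog prefix as in the excerpt: "<date> <host> kernel: [<uptime>] <message>"
LINE_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] (?P<msg>.*)$")
# A trailing negative integer, e.g. "ceph: mds parse_reply err -5"
ERR_RE = re.compile(r"-(\d+)$")


def ceph_messages(lines):
    """Yield (message, errno_name_or_None) for kernel lines starting with 'ceph:'."""
    for line in lines:
        match = LINE_RE.search(line.rstrip("\n"))
        if not match or not match.group("msg").startswith("ceph:"):
            continue  # skip stack trace, module list, hex dump, etc.
        msg = match.group("msg")
        err = ERR_RE.search(msg)
        name = errno.errorcode.get(int(err.group(1))) if err else None
        yield msg, name
```

With the attached log unpacked, something like `for msg, name in ceph_messages(open("kern.log")): print(name, msg)` would print just the ceph lines; the file name here is a placeholder.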

I have attached a file with the ls -l of the /cephfs/.snap directory taken from the system taking the snapshots using a fuse mount.

System info:

keeper@dfgw02:~$ uname -a
Linux dfgw02 4.2.0-040200-generic #201508301530 SMP Sun Aug 30 19:31:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
keeper@dfgw02:~$ ceph -v
ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da)

Kernel mount:
root@dfgw02:~# mount | grep cephfs
10.16.51.21,10.16.51.22,10.16.51.23:/ on /cephfs type ceph (name=cephfs,key=client.cephfs)

Fuse mount on the separate system that performs the snapshots:
root@dfadm01:~# mount | grep ceph
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,noatime,_netdev)

root@dfgw02:~# ceph -s
    cluster c261c2dc-5e29-11e5-98ba-68b599c50db0
     health HEALTH_WARN
            21 requests are blocked > 32 sec
     monmap e1: 3 mons at {dfmon01=10.16.51.21:6789/0,dfmon02=10.16.51.22:6789/0,dfmon03=10.16.51.23:6789/0}
            election epoch 6, quorum 0,1,2 dfmon01,dfmon02,dfmon03
     mdsmap e3222: 1/1/1 up {0=dfmds02=up:active}, 1 up:standby
     osdmap e5901: 176 osds: 169 up, 169 in
      pgmap v351926: 18496 pgs, 4 pools, 46873 GB data, 11909 kobjects
            137 TB used, 108 TB / 246 TB avail
               18496 active+clean
  client io 60516 kB/s rd, 49 op/s
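
For triage context, the OSD counts can be read out of the status text above. This is only a sketch that assumes the plain-text `osdmap` line format shown in this report; other Ceph releases may print status differently, and a structured output format would be more robust where available. The helper name `osd_counts` is hypothetical. Whether the OSDs that are not up have anything to do with the `.snap` error is not established here.

```python
import re

# Matches the plain-text line shown above, e.g.
# "osdmap e5901: 176 osds: 169 up, 169 in"
OSD_RE = re.compile(r"osdmap e\d+: (\d+) osds: (\d+) up, (\d+) in")


def osd_counts(status_text):
    """Return total/up/in OSD counts plus the number not up, or None if no match."""
    match = OSD_RE.search(status_text)
    if match is None:
        return None
    total, up, in_ = map(int, match.groups())
    return {"total": total, "up": up, "in": in_, "not_up": total - up}
```

Fed the status output from this report, the function reports 176 OSDs in total with 169 up, which leaves 7 not up.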

Files

kern-26-sep-15.log.gz (200 KB): kern.log (Eric Eastman, 09/26/2015 05:07 PM)
snapshot-list-26-sep-15.txt (13.5 KB): ls -l output of /cephfs/.snap taken from the system using the fuse mount (Eric Eastman, 09/26/2015 05:07 PM)

#1

Updated by Zheng Yan over 8 years ago

  • Status changed from New to Fix Under Review
#2

Updated by Greg Farnum over 8 years ago

  • Status changed from Fix Under Review to 7
#3

Updated by Greg Farnum over 8 years ago

  • Status changed from 7 to Resolved