
Bug #10313

IO dropouts/timeouts on one of the rbd clients

Added by Srinivasula Reddy Maram over 9 years ago. Updated about 9 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
libceph
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Seeing IO dropouts/timeouts in one of the rbd clients:

Setup details:

OS : ubuntu 14.04 LTS

kernel: 3.13.0-24-generic
------------

Steps:

1. Created a 2-node cluster with 64 OSDs.
2. Created 4 pools with 2048 PGs each.
3. Created 2 rbds from each pool and mapped them on each client node.
4. Ran some performance tests after preconditioning.
5. After 3 or 4 days, saw continuous IO dropouts with the stack traces below.
6. After a reboot the system was normal.
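For reference, the setup in steps 1-3 corresponds roughly to the following commands (pool names, image names, and image sizes are illustrative placeholders, not taken from the report):

```shell
# Rough sketch of the reported setup: 4 pools with 2048 PGs each,
# 2 rbd images per pool, mapped with the kernel rbd client.
# All names and the image size below are assumed, not from the report.
for pool in pool1 pool2 pool3 pool4; do
    ceph osd pool create "$pool" 2048 2048
    for img in img1 img2; do
        rbd create --size 102400 "$pool/$img"   # size is a guess
        rbd map "$pool/$img"
    done
done
```

The mapped devices (`/dev/rbd*`) would then be exercised by the performance workload on each client node.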

Dec 10, 2014 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
10:49:50.042 1 31.40 1.96 65536 100.00 45.501 45.501 0.000 498.706 132.416 63.4 8.1 7.7
10:50:00.051 2 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 63.9 5.1 5.0
10:50:10.047 3 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 64.2 5.1 5.0
10:50:20.050 4 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 63.8 5.1 5.0
10:50:30.050 5 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 64.0 5.1 5.0
10:50:40.045 6 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 64.2 5.1 5.0
10:50:50.048 7 0.00 0.00 0 0.00 0.000 0.000 0.000 0.000 0.000 63.8 5.0 5.0

Please find the attached syslogs:

Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540650] Modules linked in: rbd libceph(OF) libcrc32c ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi gpio_ich dcdbas x86_pkg_temp_thermal intel_powerclamp coretemp joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul sb_edac glue_helper ablk_helper cryptd edac_core shpchp lpc_ich mei_me mei wmi ipmi_si mac_hid acpi_power_meter lp parport mlx4_en vxlan ip_tunnel ses enclosure hid_generic tg3 ahci usbhid ptp hid libahci mlx4_core megaraid_sas pps_core
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540676] CPU: 2 PID: 29169 Comm: java Tainted: GF O 3.13.0-24-generic #46-Ubuntu
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540677] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.2.2 01/16/2014
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540678] task: ffff8800b8f1c7d0 ti: ffff880fe1c48000 task.ti: ffff880fe1c48000
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540679] RIP: 0010:[<ffffffffa0305068>] [<ffffffffa0305068>] __map_request+0x3b8/0x640 [libceph]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540686] RSP: 0018:ffff880fe1c49890 EFLAGS: 00000202
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540687] RAX: ffff880fccf01818 RBX: 0000000000000000 RCX: 0000000000000000
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540688] RDX: 000000000000000c RSI: ffff880fe1c498bc RDI: ffff880f83dd4b74
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540689] RBP: ffff880fe1c49918 R08: 0000000000000002 R09: ffff880fd3f4ab00
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540690] R10: ffff880fd3f4a440 R11: 0000000000000038 R12: 0000000000000000
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540691] R13: ffff880fe1c498b0 R14: ffff880fe4053480 R15: ffff880fe1c49830
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540692] FS: 00007f0227cfc700(0000) GS:ffff88103f040000(0000) knlGS:0000000000000000
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540694] CR2: 00007f2ab404a000 CR3: 0000000f8641f000 CR4: 00000000001407e0
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540694] Stack:
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540695] ffffffffa0303dca 00000000003b0000 0000000000000001 ffff880f3bd4c436
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540698] 0000001c0000001d 0000000000000036 0000000100000010 ffff880f83dd4af0
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540700] ffff880fe0f96bc8 0000000000000000 0000000000000000 ffff8800b477dad0
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540702] Call Trace:
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540707] [<ffffffffa0303dca>] ? ceph_osdc_build_request+0x1fa/0x510 [libceph]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540711] [<ffffffffa03066a4>] ceph_osdc_start_request+0x64/0x130 [libceph]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540714] [<ffffffffa034490a>] rbd_obj_request_submit+0x2a/0x60 [rbd]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540717] [<ffffffffa0347925>] rbd_img_obj_request_submit+0x185/0x440 [rbd]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540719] [<ffffffffa0347c34>] rbd_img_request_submit+0x54/0x90 [rbd]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540721] [<ffffffffa0348f7a>] rbd_request_fn+0x2ea/0x3a0 [rbd]
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540725] [<ffffffff813317d3>] __blk_run_queue+0x33/0x40
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540726] [<ffffffff8133188a>] queue_unplugged+0x2a/0xa0
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540729] [<ffffffff813350e0>] blk_flush_plug_list+0x1f0/0x230
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540730] [<ffffffff81335494>] blk_finish_plug+0x14/0x40
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540733] [<ffffffff811f6b81>] do_blockdev_direct_IO+0x1af1/0x2910
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540735] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540737] [<ffffffff811f79f5>] __blockdev_direct_IO+0x55/0x60
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540738] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540740] [<ffffffff811f22d6>] blkdev_direct_IO+0x56/0x60
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540741] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540745] [<ffffffff81150b5b>] generic_file_aio_read+0x69b/0x700
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540748] [<ffffffff810d7ad6>] ? wake_futex+0x66/0x90
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540749] [<ffffffff811f275b>] blkdev_aio_read+0x4b/0x70
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540752] [<ffffffff811b8d1a>] do_sync_read+0x5a/0x90
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540754] [<ffffffff811b93b5>] vfs_read+0x95/0x160
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540756] [<ffffffff811ba032>] SyS_pread64+0x72/0xb0
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540759] [<ffffffff8172663f>] tracesys+0xe1/0xe6
Dec 10 11:22:28 rack3-client-1 kernel: [661597.540759] Code: 41 83 fe ff 0f 84 81 00 00 00 45 31 ed e9 a1 fd ff ff 90 d1 e8 89 c1 83 e1 01 f6 c2 10 74 ac e9 5b fd ff ff 0f 1f 80 00 00 00 00 <0f> 8e 82 00 00 00 48 8b 40 08 e9 01 fe ff ff 66 0f 1f 84 00 00

syslogs_krbd_io_errors.zip (159 KB) Srinivasula Reddy Maram, 12/14/2014 09:38 PM

History

#1 Updated by Anand Bhat over 9 years ago

Looks like it was fixed by the following commit:

commit ff513ace9b772e75e337f8e058cc7f12816843fe
Author: Ilya Dryomov <>
Date: Mon Feb 3 13:56:33 2014 +0200

libceph: take map_sem for read in handle_reply()
Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case. (Lock ordering is:
map_sem, request_mutex, crush_mutex.)
Signed-off-by: Ilya Dryomov <>

#2 Updated by Ilya Dryomov over 9 years ago

  • Assignee set to Ilya Dryomov

From a quick look, I doubt it. Anand, are you the reporter's colleague?
That commit went into 3.14-rc2, and it fixes an issue with code that went into 3.14-rc1. IOW, there is no way to hit that issue unless you are running 3.14-rc1.
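Since the fix applies only to 3.14-rc1, a first sanity check on the affected client is simply the running kernel version (the report states 3.13.0-24-generic, which predates the code that commit fixes):

```shell
# Print the running kernel release; on the reporter's client this
# would show 3.13.0-24-generic, i.e. not a 3.14-rc kernel.
uname -r
```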

#3 Updated by Ilya Dryomov about 9 years ago

  • Status changed from New to Need More Info
