Bug #19650

rbd-nbd: client reboot if ceph cluster down

Added by François Blondel about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
Running

rbd-nbd map rbd/block1
mount /dev/nbd0 /mnt
dd if=/data/test.tar.gz of=/mnt/test.tar.gz status=progress

and stopping all ceph-mon services during the dd copy leads to a hard reboot of the rbd-nbd client machine, after about 6 minutes.

Is this "normal" behaviour?

We would like to use RBD block devices to do backups of some production servers.
These prod machines should not reboot if the ceph cluster goes down.

We have been seeing this behaviour since Jewel.
Tested again today with:
ceph version 12.0.1 (5456408827a1a31690514342624a4ff9b66be1d5)
Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Many thanks for your work,
François

Actions #1

Updated by Nathan Cutler about 7 years ago

  • Project changed from Ceph to rbd
  • Subject changed from rbd-ndb: client reboot if ceph cluster down to rbd-nbd: client reboot if ceph cluster down
  • Category deleted (librbd)
Actions #2

Updated by Jason Dillaman about 7 years ago

  • Status changed from New to Need More Info
  • Release deleted (jewel)
  • Release deleted (master)
  • Release deleted (kraken)
  • Affected Versions deleted (v12.0.0)

@François: sounds like you encountered a kernel panic -- which we don't have any control over (it isn't our code rebooting the machine). Did the kernel provide any backtrace information?

Actions #3

Updated by François Blondel about 7 years ago

Hi,
The issue was due to our kernel config:

kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
kernel.panic = 60
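
These three values would account for the delay of about 6 minutes reported in the description: a task blocked for 300 seconds triggers the hung-task panic, and the kernel then reboots 60 seconds after the panic. A minimal shell sketch of that arithmetic follows. The function name `hung_task_reboot_delay` is hypothetical, and the result should be read as a lower bound, since the hung-task detector only scans periodically.

```shell
#!/bin/sh
# Hypothetical helper: estimate the seconds between a task first blocking
# and the machine rebooting, from the three sysctl values in this report.
# Usage: hung_task_reboot_delay <hung_task_panic> <hung_task_timeout_secs> <panic>
hung_task_reboot_delay() {
    panic_on_hang=$1
    timeout_secs=$2
    panic_secs=$3
    # No automatic reboot if hung tasks do not panic, if hung-task detection
    # is disabled (timeout 0), or if a panic never reboots (kernel.panic = 0).
    if [ "$panic_on_hang" -ne 1 ] || [ "$timeout_secs" -eq 0 ] || [ "$panic_secs" -eq 0 ]; then
        echo "no-reboot"
        return 0
    fi
    # A negative kernel.panic means "reboot immediately after the panic".
    if [ "$panic_secs" -lt 0 ]; then
        panic_secs=0
    fi
    echo $((timeout_secs + panic_secs))
}

# Values from this report: prints 360 (about 6 minutes).
hung_task_reboot_delay 1 300 60
```

On a live Linux host the same three values can be read from `/proc/sys/kernel/hung_task_panic`, `/proc/sys/kernel/hung_task_timeout_secs` and `/proc/sys/kernel/panic`.
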

We set kernel.hung_task_panic = 0 and now get the following errors in dmesg:

[Wed Apr 19 15:05:43 2017] INFO: task jbd2/nbd0-8:32390 blocked for more than 60 seconds.
[Wed Apr 19 15:05:43 2017] Not tainted 4.4.0-72-generic #93-Ubuntu
[Wed Apr 19 15:05:43 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Apr 19 15:05:43 2017] jbd2/nbd0-8 D ffff880233407ad8 0 32390 2 0x00000000
[Wed Apr 19 15:05:43 2017] ffff880233407ad8 ffff880232f7e000 ffff880236250000 ffff880231811980
[Wed Apr 19 15:05:43 2017] ffff880233408000 ffff88023fc56dc0 7fffffffffffffff ffffffff81838d60
[Wed Apr 19 15:05:43 2017] ffff880233407c30 ffff880233407af0 ffffffff81838565 0000000000000000
[Wed Apr 19 15:05:43 2017] Call Trace:
[Wed Apr 19 15:05:43 2017] [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017] [<ffffffff81838565>] schedule+0x35/0x80
[Wed Apr 19 15:05:43 2017] [<ffffffff8183b6b5>] schedule_timeout+0x1b5/0x270
[Wed Apr 19 15:05:43 2017] [<ffffffff813c3eb3>] ? __blk_run_queue+0x33/0x40
[Wed Apr 19 15:05:43 2017] [<ffffffff8106428e>] ? kvm_clock_get_cycles+0x1e/0x20
[Wed Apr 19 15:05:43 2017] [<ffffffff8106428e>] ? kvm_clock_get_cycles+0x1e/0x20
[Wed Apr 19 15:05:43 2017] [<ffffffff810f611c>] ? ktime_get+0x3c/0xb0
[Wed Apr 19 15:05:43 2017] [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017] [<ffffffff81837a94>] io_schedule_timeout+0xa4/0x110
[Wed Apr 19 15:05:43 2017] [<ffffffff81838d7b>] bit_wait_io+0x1b/0x70
[Wed Apr 19 15:05:43 2017] [<ffffffff8183890d>] __wait_on_bit+0x5d/0x90
[Wed Apr 19 15:05:43 2017] [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017] [<ffffffff818389c2>] out_of_line_wait_on_bit+0x82/0xb0
[Wed Apr 19 15:05:43 2017] [<ffffffff810c4250>] ? autoremove_wake_function+0x40/0x40
[Wed Apr 19 15:05:43 2017] [<ffffffff812467f2>] __wait_on_buffer+0x32/0x40
[Wed Apr 19 15:05:43 2017] [<ffffffff812ef34f>] jbd2_journal_commit_transaction+0x10cf/0x1870
[Wed Apr 19 15:05:43 2017] [<ffffffff810ecfce>] ? try_to_del_timer_sync+0x5e/0x90
[Wed Apr 19 15:05:43 2017] [<ffffffff812f370a>] kjournald2+0xca/0x250
[Wed Apr 19 15:05:43 2017] [<ffffffff810c4210>] ? wake_atomic_t_function+0x60/0x60
[Wed Apr 19 15:05:43 2017] [<ffffffff812f3640>] ? commit_timeout+0x10/0x10
[Wed Apr 19 15:05:43 2017] [<ffffffff810a0be8>] kthread+0xd8/0xf0
[Wed Apr 19 15:05:43 2017] [<ffffffff810a0b10>] ? kthread_create_on_node+0x1e0/0x1e0
[Wed Apr 19 15:05:43 2017] [<ffffffff8183ca0f>] ret_from_fork+0x3f/0x70
[Wed Apr 19 15:05:43 2017] [<ffffffff810a0b10>] ? kthread_create_on_node+0x1e0/0x1e0

...
and finally:

[Wed Apr 19 15:07:45 2017] block nbd0: NBD_DISCONNECT

Running rbd-nbd unmap /dev/nbd0 took some time but worked. We finally got:

[Wed Apr 19 16:31:27 2017] block nbd0: Receive control failed (result -32)
[Wed Apr 19 16:31:27 2017] block nbd0: shutting down socket
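
To check a kernel log for hung-task reports like the one above, something along these lines could be used. It is only a sketch: it assumes the message format shown in this log ("INFO: task NAME:PID blocked for more than N seconds."), assumes task names without spaces, and the function name `hung_nbd_tasks` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the NAME:PID
# of every task reported as hung whose name mentions an nbd device.
hung_nbd_tasks() {
    sed -n 's/.*INFO: task \([^ ]*nbd[^ ]*\) blocked for more than [0-9]* seconds\..*/\1/p'
}

# Example with the line from this report; prints "jbd2/nbd0-8:32390".
echo '[Wed Apr 19 15:05:43 2017] INFO: task jbd2/nbd0-8:32390 blocked for more than 60 seconds.' | hung_nbd_tasks
```

On a live host the input would come from `dmesg | hung_nbd_tasks`.
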

So it seems to be working well.
Thanks,
François

Actions #4

Updated by Mykola Golub about 7 years ago

  • Status changed from Need More Info to Closed