Bug #19650: rbd-nbd: client reboot if ceph cluster down - rbd - Ceph

Bug #19650

closed

rbd-nbd: client reboot if ceph cluster down

Added by François Blondel about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Severity: 3 - minor

Description

Hi,

Running

  rbd-nbd map rbd/block1
  mount /dev/nbd0 /mnt
  dd if=/data/test.tar.gz of=/mnt/test.tar.gz status=progress

and then stopping all ceph-mon services while the dd copy is in progress leads to a hard reboot of the rbd-nbd client machine after about 6 minutes.

Is this "normal" behaviour?

We would like to use RBD block devices to do backups of some production servers.
These prod machines should not reboot if the ceph cluster goes down.

We have been seeing this behaviour since Jewel.
Tested again today with:
ceph version 12.0.1 (5456408827a1a31690514342624a4ff9b66be1d5)
Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Many thanks for your work,
François

Updated by Nathan Cutler about 7 years ago

Project changed from Ceph to rbd
Subject changed from rbd-ndb: client reboot if ceph cluster down to rbd-nbd: client reboot if ceph cluster down
Category deleted (~~librbd~~)

Updated by Jason Dillaman about 7 years ago

Status changed from New to Need More Info
Release deleted (~~jewel~~)
Release deleted (~~master~~)
Release deleted (~~kraken~~)
Affected Versions deleted (~~v12.0.0~~)

@François: sounds like you encountered a kernel panic -- which we don't have any control over (it isn't our code rebooting the machine). Did the kernel provide any backtrace information?

Updated by François Blondel about 7 years ago

Hi,

The issue was due to our kernel config:

kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
kernel.panic = 60
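
As a rough sketch of how these three settings combine, the helper below estimates the minimum delay between a task hanging and the machine rebooting. The function name and structure are hypothetical and only illustrate the arithmetic; the interpretation of the sysctl values follows the kernel sysctl documentation, and the real delay can be longer because the hung-task detector only checks periodically.

```python
from typing import Optional


def min_seconds_until_reboot(hung_task_panic: int,
                             hung_task_timeout_secs: int,
                             panic: int) -> Optional[int]:
    """Estimate the minimum delay from a task hanging to an automatic reboot.

    Returns None when these settings alone would not cause a reboot.
    This is an illustration, not a model of the kernel's exact timing.
    """
    # hung_task_panic = 0: hung tasks are only reported in the kernel log.
    if hung_task_panic == 0:
        return None
    # hung_task_timeout_secs = 0: hung-task checking is disabled.
    if hung_task_timeout_secs == 0:
        return None
    # kernel.panic = 0: the kernel stays halted after a panic, no reboot.
    if panic == 0:
        return None
    # kernel.panic < 0: reboot immediately after the panic.
    if panic < 0:
        return hung_task_timeout_secs
    # kernel.panic > 0: wait that many seconds after the panic, then reboot.
    return hung_task_timeout_secs + panic


# Settings from this report: 300 s hung-task timeout plus 60 s panic delay.
delay = min_seconds_until_reboot(1, 300, 60)
print(delay, "seconds, i.e.", delay // 60, "minutes")
```

With the values above this gives 360 seconds, which is consistent with the reboot observed after about 6 minutes.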

We changed to kernel.hung_task_panic = 0 and now get hung-task warnings in dmesg instead:

[Wed Apr 19 15:05:43 2017] INFO: task jbd2/nbd0-8:32390 blocked for more than 60 seconds.
[Wed Apr 19 15:05:43 2017]       Not tainted 4.4.0-72-generic #93-Ubuntu
[Wed Apr 19 15:05:43 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Apr 19 15:05:43 2017] jbd2/nbd0-8     D ffff880233407ad8     0 32390      2 0x00000000
[Wed Apr 19 15:05:43 2017]  ffff880233407ad8 ffff880232f7e000 ffff880236250000 ffff880231811980
[Wed Apr 19 15:05:43 2017]  ffff880233408000 ffff88023fc56dc0 7fffffffffffffff ffffffff81838d60
[Wed Apr 19 15:05:43 2017]  ffff880233407c30 ffff880233407af0 ffffffff81838565 0000000000000000
[Wed Apr 19 15:05:43 2017] Call Trace:
[Wed Apr 19 15:05:43 2017]  [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017]  [<ffffffff81838565>] schedule+0x35/0x80
[Wed Apr 19 15:05:43 2017]  [<ffffffff8183b6b5>] schedule_timeout+0x1b5/0x270
[Wed Apr 19 15:05:43 2017]  [<ffffffff813c3eb3>] ? __blk_run_queue+0x33/0x40
[Wed Apr 19 15:05:43 2017]  [<ffffffff8106428e>] ? kvm_clock_get_cycles+0x1e/0x20
[Wed Apr 19 15:05:43 2017]  [<ffffffff8106428e>] ? kvm_clock_get_cycles+0x1e/0x20
[Wed Apr 19 15:05:43 2017]  [<ffffffff810f611c>] ? ktime_get+0x3c/0xb0
[Wed Apr 19 15:05:43 2017]  [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017]  [<ffffffff81837a94>] io_schedule_timeout+0xa4/0x110
[Wed Apr 19 15:05:43 2017]  [<ffffffff81838d7b>] bit_wait_io+0x1b/0x70
[Wed Apr 19 15:05:43 2017]  [<ffffffff8183890d>] __wait_on_bit+0x5d/0x90
[Wed Apr 19 15:05:43 2017]  [<ffffffff81838d60>] ? bit_wait+0x60/0x60
[Wed Apr 19 15:05:43 2017]  [<ffffffff818389c2>] out_of_line_wait_on_bit+0x82/0xb0
[Wed Apr 19 15:05:43 2017]  [<ffffffff810c4250>] ? autoremove_wake_function+0x40/0x40
[Wed Apr 19 15:05:43 2017]  [<ffffffff812467f2>] __wait_on_buffer+0x32/0x40
[Wed Apr 19 15:05:43 2017]  [<ffffffff812ef34f>] jbd2_journal_commit_transaction+0x10cf/0x1870
[Wed Apr 19 15:05:43 2017]  [<ffffffff810ecfce>] ? try_to_del_timer_sync+0x5e/0x90
[Wed Apr 19 15:05:43 2017]  [<ffffffff812f370a>] kjournald2+0xca/0x250
[Wed Apr 19 15:05:43 2017]  [<ffffffff810c4210>] ? wake_atomic_t_function+0x60/0x60
[Wed Apr 19 15:05:43 2017]  [<ffffffff812f3640>] ? commit_timeout+0x10/0x10
[Wed Apr 19 15:05:43 2017]  [<ffffffff810a0be8>] kthread+0xd8/0xf0
[Wed Apr 19 15:05:43 2017]  [<ffffffff810a0b10>] ? kthread_create_on_node+0x1e0/0x1e0
[Wed Apr 19 15:05:43 2017]  [<ffffffff8183ca0f>] ret_from_fork+0x3f/0x70
[Wed Apr 19 15:05:43 2017]  [<ffffffff810a0b10>] ? kthread_create_on_node+0x1e0/0x1e0

...
and finally:

[Wed Apr 19 15:07:45 2017] block nbd0: NBD_DISCONNECT

Running rbd-nbd unmap /dev/nbd0 took a long time but worked. We finally got:

[Wed Apr 19 16:31:27 2017] block nbd0: Receive control failed (result -32)
[Wed Apr 19 16:31:27 2017] block nbd0: shutting down socket

So it seems to be working well.
Thanks,
François
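
For anyone who wants to watch for the same symptom, hung-task warnings like the one quoted above can be picked out of saved dmesg output with a small script. This is only a sketch: the function name and the sample text are illustrative, and the pattern is based solely on the message format shown in this report, which may differ on other kernel versions.

```python
import re

# Matches lines such as:
#   INFO: task jbd2/nbd0-8:32390 blocked for more than 60 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (task name, pid, threshold in seconds) for each hung-task line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append((match.group("comm"),
                         int(match.group("pid")),
                         int(match.group("secs"))))
    return hits


# Sample input copied from the dmesg output in this report.
SAMPLE = (
    "[Wed Apr 19 15:05:43 2017] INFO: task jbd2/nbd0-8:32390 "
    "blocked for more than 60 seconds.\n"
    "[Wed Apr 19 15:07:45 2017] block nbd0: NBD_DISCONNECT\n"
)

hung = find_hung_tasks(SAMPLE)
```

A task name containing nbd, as with jbd2/nbd0-8 here, suggests that I/O on an rbd-nbd device is stalled, for example because the cluster is unreachable.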

Updated by Mykola Golub about 7 years ago

Status changed from Need More Info to Closed
