Bug #39945: RBD I/O error leads to ghost-mapped RBD

Added by Cliff Pajaro about 1 month ago. Updated 10 days ago.

Status: New
Priority: Normal
Assignee:
Category: rbd
Target version:
Start date: 05/15/2019
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

I have attached a couple of kernel logs (from the same system) showing the I/O errors.

On the system that had the RBDs mapped, the RBDs don't appear in "rbd showmapped" and are not present under "/sys/bus/rbd/devices". Attempting "echo # > /sys/bus/rbd/remove" doesn't work.

The RBDs do still appear under /sys/kernel/debug/block/rbd7 and /sys/kernel/debug/block/rbd8, and "rbd status" still reports watchers for them:

Watchers:
        watcher=10.10.4.63:0/876863094 client.44641389 cookie=18446462598732841001

Watchers:
        watcher=10.10.4.63:0/876863094 client.44641389 cookie=18446462598733070026
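
For reference, a sketch of the checks described above, with device id 7 and a placeholder pool/image spec (substitute the actual values):

# rbd showmapped
# ls /sys/bus/rbd/devices
# echo 7 > /sys/bus/rbd/remove
# ls /sys/kernel/debug/block/ | grep rbd
# rbd status <pool>/<image>

The first two show no trace of the stuck devices, the write to /sys/bus/rbd/remove has no effect, and the last two still list rbd7/rbd8 and the watchers shown above.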

It seems the only way to release the RBDs is to reboot the system. The images themselves can still be mapped, mounted, and read with no problems (tried both on a different system and on this same system).
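
A sketch of that verification, with a placeholder pool/image spec and mount point ("rbd map" prints the new device node, shown here as /dev/rbd9):

# rbd map <pool>/<image>
/dev/rbd9
# mount /dev/rbd9 /mnt/test
# dd if=/mnt/test/somefile of=/dev/null bs=1M
# umount /mnt/test
# rbd unmap /dev/rbd9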

# ps aux | grep [r]bd
root     3011042  0.0  0.0      0     0 ?        I<   Apr20   0:00 [rbd]
root     3373133  0.0  0.0      0     0 ?        I<   May09   0:00 [rbd8-tasks]
root     3957315  0.0  0.0      0     0 ?        I<   Apr26   0:00 [rbd7-tasks]
# cat /proc/3011042/stack 
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff
# cat /proc/3373133/stack        
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff
# cat /proc/3957315/stack       
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff

Version information:

# ceph version
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
# ceph --version
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
# uname -r
4.19.34-vanilla-cephsb-2

var_log_messages_rbd7.txt (7.98 KB) Cliff Pajaro, 05/15/2019 05:43 PM

var_log_messages_rbd8.txt (8.19 KB) Cliff Pajaro, 05/15/2019 05:43 PM

History

#1 Updated by Ilya Dryomov 10 days ago

  • Assignee set to Ilya Dryomov

On the system that had the RBDs mapped, the RBDs don't appear in "rbd showmapped" and are not present under "/sys/bus/rbd/devices". Attempting "echo # > /sys/bus/rbd/remove" doesn't work.

This is because someone issued "rbd unmap" for these devices:

__ioc_clear_queue+0x36/0x60
ioc_clear_queue+0x91/0xd0
blk_exit_queue+0x15/0x40
blk_cleanup_queue+0xd5/0x140
rbd_free_disk+0x19/0x40 [rbd]
rbd_dev_device_release+0x2c/0x50 [rbd]
do_rbd_remove.isra.18+0x197/0x220 [rbd]
kernfs_fop_write+0x105/0x180
__vfs_write+0x33/0x1d0
vfs_write+0xa9/0x1a0
ksys_write+0x4f/0xb0
do_syscall_64+0x3e/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

"rbd unmap" removed them from the list of mapped devices and then crashed in the block layer:

BUG: unable to handle kernel paging request at ffffffff8139c2a0
RIP: 0010:ioc_destroy_icq+0x37/0xb0

Cliff, was it regular "rbd unmap" or "rbd unmap -o force"? Did you attempt to unmount ext4 before unmapping?
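
(For reference, with a placeholder device path: the regular sequence unmounts the filesystem first, while the forced variant unmaps even if the device is still open or in use.)

# umount /dev/rbd7
# rbd unmap /dev/rbd7

versus

# rbd unmap -o force /dev/rbd7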

Just to get an idea of how to reproduce, where are the I/O errors coming from? Is this the kernel where you replaced the "write to read-only mapping" assert with an error?
