Bug #39945
RBD I/O error leads to ghost-mapped RBD
Description
I have attached a couple of kernel logs (same system) showing the I/O errors.
On the system that had the RBDs mapped, the RBDs don't appear in "rbd showmapped" and are not inside "/sys/bus/rbd/devices". Attempting "echo # > /sys/bus/rbd/remove" doesn't work.
The RBDs do appear in /sys/kernel/debug/block/rbd7 and /sys/kernel/debug/block/rbd8.
The RBDs appear when calling "rbd status".
Watchers:
	watcher=10.10.4.63:0/876863094 client.44641389 cookie=18446462598732841001
	watcher=10.10.4.63:0/876863094 client.44641389 cookie=18446462598733070026
It seems the only way to release the RBDs is to reboot the system. The RBDs can be mapped, mounted, read, etc. with no problems (attempted on a different system and on the same system).
# ps aux | grep [r]bd
root     3011042  0.0  0.0      0     0 ?        I<   Apr20   0:00 [rbd]
root     3373133  0.0  0.0      0     0 ?        I<   May09   0:00 [rbd8-tasks]
root     3957315  0.0  0.0      0     0 ?        I<   Apr26   0:00 [rbd7-tasks]
# cat /proc/3011042/stack
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff
# cat /proc/3373133/stack
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff
# cat /proc/3957315/stack
[<0>] rescuer_thread+0x2ef/0x330
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30
[<0>] 0xffffffffffffffff
Version information:
# ceph version
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
# ceph --version
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
# uname -r
4.19.34-vanilla-cephsb-2
History
#1 Updated by Ilya Dryomov almost 5 years ago
- Assignee set to Ilya Dryomov
On the system that had the RBDs mapped, the RBDs don't appear in "rbd showmapped" and are not inside "/sys/bus/rbd/devices". Attempting "echo # > /sys/bus/rbd/remove" doesn't work.
This is because someone issued "rbd unmap" for these devices:
__ioc_clear_queue+0x36/0x60
ioc_clear_queue+0x91/0xd0
blk_exit_queue+0x15/0x40
blk_cleanup_queue+0xd5/0x140
rbd_free_disk+0x19/0x40 [rbd]
rbd_dev_device_release+0x2c/0x50 [rbd]
do_rbd_remove.isra.18+0x197/0x220 [rbd]
kernfs_fop_write+0x105/0x180
__vfs_write+0x33/0x1d0
vfs_write+0xa9/0x1a0
ksys_write+0x4f/0xb0
do_syscall_64+0x3e/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"rbd unmap" removed them from the list of mapped devices and then crashed in the block layer:
BUG: unable to handle kernel paging request at ffffffff8139c2a0
RIP: 0010:ioc_destroy_icq+0x37/0xb0
Cliff, was it regular "rbd unmap" or "rbd unmap -o force"? Did you attempt to unmount ext4 before unmapping?
Just to get an idea of how to reproduce, where are the I/O errors coming from? Is this the kernel where you replaced the "write to read-only mapping" assert with an error?
#2 Updated by Ilya Dryomov over 4 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to Low
#3 Updated by Cliff Pajaro about 4 years ago
Hi Ilya,
This issue has popped up again and I have a similar `rbd25-tasks` process still alive, with the rbd status command showing a watcher remaining.
In this case we have a python script using subprocess to call a normal `rbd unmap`.
When I finally take a look at the system (because it has an RBD that is stuck), the python script is waiting for the `rbd unmap` process to return:
Thread 0x7fdd0b69f700 (LWP 12982) "python3" 0x00007fdd0d50024e in __waitpid (pid=122759, stat_loc=0x7fdd0b69b614, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
The `rbd unmap` process is in a zombie state. This process only goes away if the python script (and the java application that started it) are stopped. Even after the `rbd unmap` process disappears the `rbd25-tasks` process remains.
#4 Updated by Ilya Dryomov about 4 years ago
Hi Cliff,
What kernel are you running? Is it the same one or something newer?
Is there anything in dmesg? Previously there were I/O errors and an "unable to handle kernel paging request" crash in the block layer.
#5 Updated by Cliff Pajaro about 4 years ago
Hi Ilya,
We are now running a newer kernel: 4.19.68-vanilla-ams-2
I just noticed traces in /var/log/messages.
Occurrence #1:
BUG: unable to handle kernel paging request at ffffffff814d68a0
PGD 240c067 P4D 240c067 PUD 240d063 PMD 14001e1
Oops: 0003 [#1] SMP NOPTI
CPU: 2 PID: 80583 Comm: rbd Tainted: G O 4.19.68-vanilla-ams-2 #1
Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
RIP: 0010:ioc_destroy_icq+0x37/0xb0
Code: 89 fb 48 8b 6f 08 48 8b 50 18 48 63 b0 ac 01 00 00 48 8d 7d 28 4c 8b 22 e8 66 17 53 00 48 8b 43 28 48 85 c0 74 20 48 8b 53 20 <48> 89 10 48 85 d2 74 04 48 89 42 08 48 c7 43 20 00 00 00 00 48 c7
RSP: 0018:ffffc9000558bd28 EFLAGS: 00010086
RAX: ffffffff814d68a0 RBX: ffff88813106c000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000d RDI: ffff888856505d28
RBP: ffff888856505d00 R08: 0000000000000000 R09: ffffffff814f9f80
R10: ffff8885c0ceaa00 R11: 0000000000000000 R12: ffffffff824eca20
R13: 0000000000000206 R14: ffffc9000558bf08 R15: ffff88872659d020
FS: 00007fe763d900c0(0000) GS:ffff88885f880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff814d68a0 CR3: 000000071dd2e006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __ioc_clear_queue+0x36/0x60
 ioc_clear_queue+0x91/0xd0
 blk_exit_queue+0x15/0x40
 blk_cleanup_queue+0xd5/0x140
 rbd_free_disk+0x19/0x40 [rbd]
 rbd_dev_device_release+0x2c/0x50 [rbd]
 do_rbd_remove.isra.23+0x197/0x220 [rbd]
 ? x86_setup_perfctr+0xe5/0x140
 kernfs_fop_write+0x116/0x190
 __vfs_write+0x36/0x1e0
 vfs_write+0xa9/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x48/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fe7648133fd
Code: 80 00 00 00 00 48 89 55 d8 48 89 75 e0 89 7d e8 e8 b8 fc ff ff 48 8b 55 d8 48 8b 75 e0 41 89 c0 8b 7d e8 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 48 89 45 e8 e8 ef fc ff ff 48 8b
RSP: 002b:00007ffdaedc1d70 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe7648133fd
RDX: 0000000000000002 RSI: 00007ffdaedc1fa0 RDI: 0000000000000007
RBP: 00007ffdaedc1da0 R08: 0000000000000000 R09: 000055743cb70810
R10: fffffffffffff86a R11: 0000000000000293 R12: 00007ffdaedc1fa0
R13: 0000000000000007 R14: 00007ffdaedc1e90 R15: 00007ffdaedc1f90
Modules linked in: ipt_MASQUERADE usb_storage rbd libceph xt_UDPLB(O) xt_nat xt_multiport xt_addrtype iptable_security iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat bpfilter sb_edac x86_pkg_temp_thermal intel_pch_thermal ioatdma gkuart(O) usbserial dca
CR2: ffffffff814d68a0
---[ end trace 58804622e1ce6039 ]---
RIP: 0010:ioc_destroy_icq+0x37/0xb0
Code: 89 fb 48 8b 6f 08 48 8b 50 18 48 63 b0 ac 01 00 00 48 8d 7d 28 4c 8b 22 e8 66 17 53 00 48 8b 43 28 48 85 c0 74 20 48 8b 53 20 <48> 89 10 48 85 d2 74 04 48 89 42 08 48 c7 43 20 00 00 00 00 48 c7
RSP: 0018:ffffc9000558bd28 EFLAGS: 00010086
RAX: ffffffff814d68a0 RBX: ffff88813106c000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000d RDI: ffff888856505d28
RBP: ffff888856505d00 R08: 0000000000000000 R09: ffffffff814f9f80
R10: ffff8885c0ceaa00 R11: 0000000000000000 R12: ffffffff824eca20
R13: 0000000000000206 R14: ffffc9000558bf08 R15: ffff88872659d020
FS: 00007fe763d900c0(0000) GS:ffff88885f880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff814d68a0 CR3: 000000071dd2e006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Occurrence #2 (completely different data center):
BUG: unable to handle kernel paging request at ffffffff814d68a0
PGD 240c067 P4D 240c067 PUD 240d063 PMD 14001e1
Oops: 0003 [#1] SMP NOPTI
CPU: 7 PID: 122759 Comm: rbd Tainted: G O 4.19.68-vanilla-ams-2 #1
Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
RIP: 0010:ioc_destroy_icq+0x37/0xb0
Code: 89 fb 48 8b 6f 08 48 8b 50 18 48 63 b0 ac 01 00 00 48 8d 7d 28 4c 8b 22 e8 66 17 53 00 48 8b 43 28 48 85 c0 74 20 48 8b 53 20 <48> 89 10 48 85 d2 74 04 48 89 42 08 48 c7 43 20 00 00 00 00 48 c7
RSP: 0018:ffffc90004e1fd28 EFLAGS: 00010086
RAX: ffffffff814d68a0 RBX: ffff88812e45f000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000001c RDI: ffff888853617b88
RBP: ffff888853617b60 R08: 0000000000000000 R09: ffffffff814f9f80
R10: ffff8881207d79c0 R11: 0000000000000000 R12: ffffffff824eca20
R13: 0000000000000206 R14: ffffc90004e1ff08 R15: ffff888147b4a920
FS: 00007f5901f540c0(0000) GS:ffff88885f9c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff814d68a0 CR3: 0000000794d3c003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __ioc_clear_queue+0x36/0x60
 ioc_clear_queue+0x91/0xd0
 blk_exit_queue+0x15/0x40
 blk_cleanup_queue+0xd5/0x140
 rbd_free_disk+0x19/0x40 [rbd]
 rbd_dev_device_release+0x2c/0x50 [rbd]
 do_rbd_remove.isra.23+0x197/0x220 [rbd]
 ? x86_setup_perfctr+0xe5/0x140
 kernfs_fop_write+0x116/0x190
 __vfs_write+0x36/0x1e0
 vfs_write+0xa9/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x48/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f59029d73fd
Code: 80 00 00 00 00 48 89 55 d8 48 89 75 e0 89 7d e8 e8 b8 fc ff ff 48 8b 55 d8 48 8b 75 e0 41 89 c0 8b 7d e8 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 48 89 45 e8 e8 ef fc ff ff 48 8b
RSP: 002b:00007fffa6406050 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f59029d73fd
RDX: 0000000000000002 RSI: 00007fffa6406280 RDI: 0000000000000007
RBP: 00007fffa6406080 R08: 0000000000000000 R09: 0000558bee491860
R10: fffffffffffff86a R11: 0000000000000293 R12: 00007fffa6406280
R13: 0000000000000007 R14: 00007fffa6406170 R15: 00007fffa6406270
Modules linked in: ipt_MASQUERADE rbd libceph xt_nat xt_UDPLB(O) xt_multiport xt_addrtype iptable_security iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat bpfilter sb_edac x86_pkg_temp_thermal intel_pch_thermal gkuart(O) usbserial ioatdma dca
CR2: ffffffff814d68a0
---[ end trace f6752c709f61d42f ]---
RIP: 0010:ioc_destroy_icq+0x37/0xb0
Code: 89 fb 48 8b 6f 08 48 8b 50 18 48 63 b0 ac 01 00 00 48 8d 7d 28 4c 8b 22 e8 66 17 53 00 48 8b 43 28 48 85 c0 74 20 48 8b 53 20 <48> 89 10 48 85 d2 74 04 48 89 42 08 48 c7 43 20 00 00 00 00 48 c7
RSP: 0018:ffffc90004e1fd28 EFLAGS: 00010086
RAX: ffffffff814d68a0 RBX: ffff88812e45f000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000001c RDI: ffff888853617b88
RBP: ffff888853617b60 R08: 0000000000000000 R09: ffffffff814f9f80
R10: ffff8881207d79c0 R11: 0000000000000000 R12: ffffffff824eca20
R13: 0000000000000206 R14: ffffc90004e1ff08 R15: ffff888147b4a920
FS: 00007f5901f540c0(0000) GS:ffff88885f9c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff814d68a0 CR3: 0000000794d3c003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
From what I can tell, the stack trace is identical between the two new occurrences and the one you pointed out above (from 7 months ago).
#6 Updated by Cliff Pajaro about 4 years ago
To reply to your questions from before:
was it regular "rbd unmap" or "rbd unmap -o force"?
This was a regular "rbd unmap".
Did you attempt to unmount ext4 before unmapping?
Yes, there are successful ext4 filesystem unmounts before rbd unmapping.
Just to get an idea of how to reproduce, where are the I/O errors coming from? Is this the kernel where you replaced the "write to read-only mapping" assert with an error?
This kernel does have a fix to prevent "trim/write to read-only devices". The I/O errors I reported before are no longer appearing (if I am recalling this correctly).
#7 Updated by Ilya Dryomov about 4 years ago
Looks like it crashed on a corrupted ioc->icq_list:
  18:  e8 66 17 53 00      callq  0x531783          # radix_tree_delete()
  1d:  48 8b 43 28         mov    0x28(%rbx),%rax   # icq->ioc_node->pprev
  21:  48 85 c0            test   %rax,%rax         # hlist_unhashed()
  24:  74 20               je     0x46
  26:  48 8b 53 20         mov    0x20(%rbx),%rdx   # icq->ioc_node->next
  2a:* 48 89 10            mov    %rdx,(%rax)       # *pprev = next

  63 static void ioc_destroy_icq(struct io_cq *icq)
  64 {
  65         struct io_context *ioc = icq->ioc;
  66         struct request_queue *q = icq->q;
  67         struct elevator_type *et = q->elevator->type;
  68
  69         lockdep_assert_held(&ioc->lock);
  70
  71         radix_tree_delete(&ioc->icq_tree, icq->q->id);
  72         hlist_del_init(&icq->ioc_node);    <--
  73         list_del_init(&icq->q_node);

 691 static inline void hlist_del_init(struct hlist_node *n)
 692 {
 693         if (!hlist_unhashed(n)) {
 694                 __hlist_del(n);    <--
 695                 INIT_HLIST_NODE(n);
 696         }
 697 }

 674 static inline void __hlist_del(struct hlist_node *n)
 675 {
 676         struct hlist_node *next = n->next;
 677         struct hlist_node **pprev = n->pprev;
 678
 679         WRITE_ONCE(*pprev, next);    <--
 680         if (next)
 681                 next->pprev = pprev;
 682 }
WRITE_ONCE(*pprev, next) is basically *pprev = next, and pprev is busted. It's notable that you got the same busted address (ffffffff814d68a0) in two completely different data centers.
Unfortunately, there is not much else I can do without a crash dump. This list is managed by the block layer, and so far I don't see anything wrong on the rbd side.
Where did this 4.19.68-vanilla-ams-2 kernel come from? Are you building your own kernels? If so, are there any modifications or just config changes?
#8 Updated by Cliff Pajaro about 4 years ago
Thanks for the explanation. Yes, we build our own kernels and we have some modifications.
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 585378bc988c..5516c52edecf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3652,6 +3652,15 @@ static void rbd_queue_workfn(struct work_struct *work)
goto err;
}
+ /* snapshot mounts are readonly, so don't allow discards */
+ if (rbd_dev->spec->snap_id != CEPH_NOSNAP && op_type == OBJ_OP_DISCARD) {
+ result = -EBADF;
+ goto err;
+ } else if (rbd_dev->spec->snap_id != CEPH_NOSNAP && op_type == OBJ_OP_WRITE) {
+ result = -EROFS;
+ goto err;
+ }
+
/* Ignore/skip any zero-length requests */
if (!length) {
and a couple of things from upstream:
From: Ilya Dryomov <idryomov@gmail.com>
[ Upstream commit 187df76325af5d9e12ae9daec1510307797e54f0 ]
A bvec can now consist of multiple physically contiguous pages.
This means that bvec_iter_advance() can move to a different page while
staying in the same bvec (i.e. ->bi_bvec_done != 0).
The messenger works in terms of segments which can now be defined as
the smaller of a bvec and a page. The "more bytes to process in this
segment" condition holds only if bvec_iter_advance() leaves us in the
same bvec _and_ in the same page. On next bvec (possibly in the same
page) and on next page (possibly in the same bvec) we may need to set
->last_piece.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
---
net/ceph/messenger.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f7d7f32ac673..ef5216206bdf 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -870,6 +870,7 @@ static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
struct ceph_bio_iter *it = &cursor->bio_iter;
+ struct page *page = bio_iter_page(it->bio, it->iter);
BUG_ON(bytes > cursor->resid);
BUG_ON(bytes > bio_iter_len(it->bio, it->iter));
@@ -881,7 +882,8 @@ static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
return false; /* no more data */
}
- if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done))
+ if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done &&
+ page == bio_iter_page(it->bio, it->iter)))
return false; /* more bytes to process in this segment */
if (!it->iter.bi_size) {
@@ -929,6 +931,7 @@ static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
+ struct page *page = bvec_iter_page(bvecs, cursor->bvec_iter);
BUG_ON(bytes > cursor->resid);
BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter));
@@ -940,7 +943,8 @@ static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
return false; /* no more data */
}
- if (!bytes || cursor->bvec_iter.bi_bvec_done)
+ if (!bytes || (cursor->bvec_iter.bi_bvec_done &&
+ page == bvec_iter_page(bvecs, cursor->bvec_iter)))
return false; /* more bytes to process in this segment */
BUG_ON(cursor->last_piece);
From: Al Viro <viro@zeniv.linux.org.uk>
[ Upstream commit daf5cc27eed99afdea8d96e71b89ba41f5406ef6 ]
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
---
fs/ceph/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4055ab4d5c52..3e518c2ae2bf 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct ceph_inode_info *ci = ceph_inode(inode);
+ kfree(ci->i_symlink);
kmem_cache_free(ceph_inode_cachep, ci);
}
@@ -561,7 +562,6 @@ void ceph_destroy_inode(struct inode *inode)
ceph_put_snap_realm(mdsc, realm);
}
- kfree(ci->i_symlink);
while ((n = rb_first(&ci->i_fragtree)) != NULL) {
frag = rb_entry(n, struct ceph_inode_frag, node);
rb_erase(n, &ci->i_fragtree);
--
2.19.1
#9 Updated by Ilya Dryomov about 4 years ago
These modifications shouldn't have any effect on the block layer.
Perhaps try to configure kdump and capture a crash dump next time? It would allow us to examine the ioc->icq_list and related data structures for inconsistencies. The locking around them appears to be quite complex...