Bug #39010


Hang with read-only RBD

Added by Cliff Pajaro about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Ilya Dryomov
Category:
rbd
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I noticed some filesystem commands hang on a read-only mapped (and mounted) RBD.

I attached an strace captured while running "tree" on the mounted RBD. The command hangs at the getdents() syscall ("getdents(3" in the trace).
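
Roughly, the sequence that reproduces this looks like the following (a sketch only; pool/image names are placeholders, and the image carries the ext4 filesystem with the bad directory checksum shown in dmesg below):

device=$(rbd map --read-only $pool/$rbd)   # prints e.g. /dev/rbd0
mount -o ro "$device" /mnt/rbd
tree /mnt/rbd                              # hangs in getdents()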

This is information about the "tree" process that is hung (note the "D" state, i.e. uninterruptible sleep, in the third field of stat):

# cat /proc/122962/stat
122962 (tree) D 122960 122951 71163 34821 122951 4210944 101 0 0 0 0 0 0 0 20 0 1 0 15070459 6254592 219 18446744073709551615 94183868289024 94183868356608 140734750163120 0 0 0 0 0 0 1 0 0 17 3 0 0 0 0 0 94183870454192 94183870457808 94183896190976 140734750164935 140734750164963 140734750164963 140734750167018 0
# cat /proc/122962/stack
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x114/0x150
[<0>] pagecache_get_page+0x168/0x1f0
[<0>] __getblk_gfp+0xe9/0x2a0
[<0>] ext4_getblk+0xba/0x1b0 [ext4]
[<0>] ext4_bread+0x1e/0xa0 [ext4]
[<0>] ext4_readdir+0x1dd/0xa70 [ext4]
[<0>] iterate_dir+0x8d/0x190
[<0>] __se_sys_getdents+0xa0/0x130
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

Kernel:

Linux Kernel 4.19.21

Ceph versions:

$ sudo ceph version  
ceph version 12.2.8_p1 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
$ sudo ceph --version
ceph version 12.2.8_p1 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

This is what I see in dmesg:

[Tue Mar 26 20:57:36 2019] Key type dns_resolver registered
[Tue Mar 26 20:57:36 2019] Key type ceph registered
[Tue Mar 26 20:57:36 2019] libceph: loaded (mon/osd proto 15/24)
[Tue Mar 26 20:57:36 2019] rbd: loaded (major 251)
[Tue Mar 26 20:57:36 2019] libceph: mon2 10.15.138.61:6789 session established
[Tue Mar 26 20:57:36 2019] libceph: client155756057 fsid 4c69c40e-8e9e-5748-80c0-63871a5207ab
[Tue Mar 26 20:57:36 2019] rbd: rbd0: capacity 21474836480 features 0x1
[Tue Mar 26 20:57:36 2019] EXT4-fs (rbd0): mounted filesystem without journal. Opts: (null)
[Tue Mar 26 20:57:36 2019] EXT4-fs warning (device rbd0): ext4_dirent_csum_verify:355: inode #524512: comm python3.6: No space for directory leaf checksum. Please run e2fsck -D.
[Tue Mar 26 20:57:36 2019] EXT4-fs error (device rbd0): htree_dirblock_to_tree:979: inode #524512: comm python3.6: Directory block failed checksum
[Tue Mar 26 20:57:36 2019] ------------[ cut here ]------------
[Tue Mar 26 20:57:36 2019] generic_make_request: Trying to write to read-only block-device rbd0 (partno 0)
[Tue Mar 26 20:57:36 2019] WARNING: CPU: 1 PID: 47946 at block/blk-core.c:2174 generic_make_request_checks+0x10b/0x440
[Tue Mar 26 20:57:36 2019] Modules linked in: rbd libceph dns_resolver bonding x86_pkg_temp_thermal intel_pch_thermal ie31200_edac ipmi_ssif ext4 mbcache jbd2 fscrypto raid1 ixgbe igb i2c_algo_bit crc32c_intel dca mdio
[Tue Mar 26 20:57:36 2019] CPU: 1 PID: 47946 Comm: python3.6 Not tainted 4.19.21-vanilla-base-1 #1
[Tue Mar 26 20:57:36 2019] Hardware name: Quanta Cloud Technology Inc. QuantaGrid S31A-1U/S3A, BIOS S3A_3A11 07/15/2016
[Tue Mar 26 20:57:36 2019] RIP: 0010:generic_make_request_checks+0x10b/0x440
[Tue Mar 26 20:57:36 2019] Code: 54 03 00 00 48 8d 74 24 08 48 89 df c6 05 9a a3 f1 00 01 e8 67 6d 01 00 48 c7 c7 60 1e fc 89 48 89 c6 44 89 e2 e8 05 e2 d6 ff <0f> 0b 8b 43 30 4c 8b 63 08 c1 e8 09 49 8b 74 24 50 85 c0 74 16 48
[Tue Mar 26 20:57:36 2019] RSP: 0018:ffffaa87ca2979d8 EFLAGS: 00010282
[Tue Mar 26 20:57:36 2019] RAX: 0000000000000000 RBX: ffff89ef548239c0 RCX: 0000000000000006
[Tue Mar 26 20:57:36 2019] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff89ef8fa955a0
[Tue Mar 26 20:57:36 2019] RBP: ffff89ef866e4c00 R08: 000000000000035a R09: 0000000000000001
[Tue Mar 26 20:57:36 2019] R10: ffffaa87ca297ab0 R11: 0000000000000001 R12: 0000000000000000
[Tue Mar 26 20:57:36 2019] R13: 0000000000020000 R14: 0000000000000008 R15: ffff89eebbd921a0
[Tue Mar 26 20:57:36 2019] FS:  00007fc003950540(0000) GS:ffff89ef8fa80000(0000) knlGS:0000000000000000
[Tue Mar 26 20:57:36 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Mar 26 20:57:36 2019] CR2: 00007fc0016bb018 CR3: 00000007b15f4003 CR4: 00000000003606e0
[Tue Mar 26 20:57:36 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue Mar 26 20:57:36 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue Mar 26 20:57:36 2019] Call Trace:
[Tue Mar 26 20:57:36 2019]  ? sched_clock+0x5/0x10
[Tue Mar 26 20:57:36 2019]  generic_make_request+0x64/0x390
[Tue Mar 26 20:57:36 2019]  ? submit_bio+0x6c/0x140
[Tue Mar 26 20:57:36 2019]  ? mod_timer+0x14f/0x390
[Tue Mar 26 20:57:36 2019]  submit_bio+0x6c/0x140
[Tue Mar 26 20:57:36 2019]  ? guard_bio_eod+0x2c/0xf0
[Tue Mar 26 20:57:36 2019]  submit_bh_wbc.isra.17+0x12f/0x150
[Tue Mar 26 20:57:36 2019]  __sync_dirty_buffer+0x3f/0xa0
[Tue Mar 26 20:57:36 2019]  ext4_commit_super+0x206/0x2c0 [ext4]
[Tue Mar 26 20:57:36 2019]  __ext4_error_inode+0xe0/0x240 [ext4]
[Tue Mar 26 20:57:36 2019]  ? out_of_line_wait_on_bit+0x91/0xb0
[Tue Mar 26 20:57:36 2019]  __ext4_read_dirblock+0x1fa/0x2c0 [ext4]
[Tue Mar 26 20:57:36 2019]  htree_dirblock_to_tree+0x8e/0x2c0 [ext4]
[Tue Mar 26 20:57:36 2019]  ext4_htree_fill_tree+0xe6/0x300 [ext4]
[Tue Mar 26 20:57:36 2019]  ? filename_lookup+0x105/0x1a0
[Tue Mar 26 20:57:36 2019]  ext4_readdir+0x6a4/0xa70 [ext4]
[Tue Mar 26 20:57:36 2019]  iterate_dir+0x8d/0x190
[Tue Mar 26 20:57:36 2019]  __se_sys_getdents+0xa0/0x130
[Tue Mar 26 20:57:36 2019]  ? __ia32_compat_sys_getdents+0x130/0x130
[Tue Mar 26 20:57:36 2019]  ? do_syscall_64+0x4e/0x100
[Tue Mar 26 20:57:36 2019]  ? __se_sys_getdents+0x130/0x130
[Tue Mar 26 20:57:36 2019]  do_syscall_64+0x4e/0x100
[Tue Mar 26 20:57:36 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Tue Mar 26 20:57:36 2019] RIP: 0033:0x7fc002d2fde8
[Tue Mar 26 20:57:36 2019] Code: 04 00 41 57 41 56 41 55 41 54 55 53 48 89 f3 48 83 ec 18 64 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 b8 4e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 4c 8d 2c 06 49 89 c4 4c 39 ee 73 36 0f 1f
[Tue Mar 26 20:57:36 2019] RSP: 002b:00007ffe20731cb0 EFLAGS: 00000246 ORIG_RAX: 000000000000004e
[Tue Mar 26 20:57:36 2019] RAX: ffffffffffffffda RBX: 000055cd6dcc3c40 RCX: 00007fc002d2fde8
[Tue Mar 26 20:57:36 2019] RDX: 0000000000008000 RSI: 000055cd6dcc3c40 RDI: 0000000000000004
[Tue Mar 26 20:57:36 2019] RBP: 000055cd6dcc3c40 R08: 00007ffe20731cd8 R09: 0000000000000034
[Tue Mar 26 20:57:36 2019] R10: 0000000000000231 R11: 0000000000000246 R12: ffffffffffffff80
[Tue Mar 26 20:57:36 2019] R13: 0000000000000000 R14: 00007fc0039504c0 R15: 000000000000005d
[Tue Mar 26 20:57:36 2019] ---[ end trace 246b9ec67d062105 ]---
[Tue Mar 26 20:57:36 2019] 
                           Assertion failure in rbd_queue_workfn() at line 3664:

                                rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);

Patrick Mclean, an SRE here, says this is an fstrim issue: the VFS layer does not treat a DISCARD operation as a write, so the discard is allowed through even though the filesystem is mounted read-only.
He suggested the attached patch (untitled.diff) to resolve the assertion failure by returning an error to userspace rather than failing an assert.
We are unsure whether that is also the reason for the hang.
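
The rough shape of that change (a hypothetical sketch, not necessarily the attached untitled.diff; function and label names follow the 4.19 driver) is to fail the request instead of asserting. Since rbd_assert() ends in BUG(), the dying worker presumably never unlocks the page that the getdents() stack above is waiting on:

/* Sketch only -- see the attached untitled.diff for the real change.
 * In rbd_queue_workfn(), replace the rbd_assert() with an error path so
 * a write aimed at a read-only snapshot mapping fails the request
 * instead of BUG()ing the worker thread. */
if (op_type != OBJ_OP_READ && rbd_dev->spec->snap_id != CEPH_NOSNAP) {
	rbd_warn(rbd_dev, "%s on read-only snapshot", obj_op_name(op_type));
	result = -EIO;
	goto err_rq;
}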

Please advise on the patch for the assertion and the hang.


Files

strace_tree.txt (14.1 KB), Cliff Pajaro, 03/28/2019 06:42 PM
untitled.diff (1.36 KB), Cliff Pajaro, 03/28/2019 06:45 PM
Actions #1

Updated by Tim Marx about 5 years ago

AFAIK the idea was that this is something the block layer should handle. Looking at the corresponding conversation [0], this exact example (discard) was mentioned as a problem because the block layer isn't/wasn't taking care of it. The proposed solution was to enforce this at the block layer as well, but judging by your bug report this is still an issue, and I can reproduce it too.

[0] https://marc.info/?l=ceph-devel&m=151006119215391&w=2
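
For context, the 4.19 block-layer check that produced the WARNING above only logs and then lets the bio through (a paraphrased sketch of bio_check_ro() from block/blk-core.c, not verbatim kernel source):

/* Paraphrased sketch of bio_check_ro() in 4.19 block/blk-core.c.
 * part->policy is the partition's read-only flag.  On a write the
 * function warns but still returns false, so the bio is not failed
 * and travels on to the driver -- here rbd, where the assert fires. */
static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
{
	if (part->policy && op_is_write(bio_op(bio))) {
		WARN_ONCE(1, "generic_make_request: Trying to write to read-only block-device\n");
		return false;	/* older lvm-tools break if the write is failed here */
	}
	return false;
}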

Actions #2

Updated by Cliff Pajaro almost 5 years ago

We've made some patches to the kernel for these issues, but have decided to use device-mapper as an extra layer on top of the normal krbd device to absorb the unwanted writes:

device=$(rbd map "$pool/$rbd")
# 16M loop-backed file receives any copied-on-write chunks
truncate -s 16M /tmp/cow.bin
losetup /dev/loop0 /tmp/cow.bin
# 41943040 sectors x 512 bytes = 20 GiB, the image capacity; N = the
# snapshot is non-persistent, 8 = chunk size in sectors.  Double quotes
# are required so $device expands (single quotes pass it literally).
dmsetup create dmrbd --table "0 41943040 snapshot $device /dev/loop0 N 8"
mount /dev/mapper/dmrbd temp
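
For completeness, the matching teardown (a sketch, assuming the names used above):

umount temp
dmsetup remove dmrbd
losetup -d /dev/loop0
rbd unmap "$device"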

Actions #3

Updated by Ilya Dryomov almost 5 years ago

  • Assignee set to Ilya Dryomov
Actions #5

Updated by Ilya Dryomov almost 5 years ago

  • Status changed from New to Resolved