
Bug #40822

rbd-nbd: reproducible crashes on nbd request timeout

Added by Marc Schöchlin 3 months ago. Updated about 1 month ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
Start date:
07/18/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

On my system, rbd-nbd currently crashes in a reproducible way.
I have never experienced anything like this with krbd or with rbd-nbd on my Xen system (12.2.5 on the Xen hypervisor, https://github.com/vico-research-and-consulting/RBDSR).

I invoke the following command on a freshly mapped rbd, and rbd-nbd crashes:

# find . -type f -name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;
gzip: ./deprecated_data/data_archive.done/entry_search_201232.sql.gz already exists; do you wish to overwrite (y or n)? y
./deprecated_data/data_archive.done/entry_search_201232.sql:         84.1% -- replaced with ./deprecated_data/data_archive.done/entry_search_201232.sql.gz
./deprecated_data/data_archive.done/entry_search_201233.sql:
gzip: ./deprecated_data/data_archive.done/entry_search_201233.sql: Input/output error
gzip: ./deprecated_data/data_archive.done/entry_search_201234.sql: Input/output error
gzip: ./deprecated_data/data_archive.done/entry_search_201235.sql: Input/output error
gzip: ./deprecated_data/data_archive.done/entry_search_201236.sql: Input/output error
....

The execution crashes while compressing a 140 MB SQL file.

dmesg output:

[579763.020890] block nbd0: Connection timed out
[579763.020926] block nbd0: shutting down sockets
[579763.020943] print_req_error: I/O error, dev nbd0, sector 3221296950
[579763.020946] block nbd0: Receive data failed (result -32)
[579763.020952] print_req_error: I/O error, dev nbd0, sector 4523172248
[579763.021001] XFS (nbd0): metadata I/O error: block 0xc0011736 ("xlog_iodone") error 5 numblks 512
[579763.021031] XFS (nbd0): xfs_do_force_shutdown(0x2) called from line 1261 of file /build/linux-hwe-xJVMkx/linux-hwe-4.15.0/fs/xfs/xfs_log.c.  Return address = 0x00000000918af758
[579763.021046] print_req_error: I/O error, dev nbd0, sector 4523172248
[579763.021161] XFS (nbd0): Log I/O Error Detected.  Shutting down filesystem
[579763.021176] XFS (nbd0): Please umount the filesystem and rectify the problem(s)
[579763.176834] print_req_error: I/O error, dev nbd0, sector 3221296969
[579763.176856] print_req_error: I/O error, dev nbd0, sector 2195113096
[579763.176869] XFS (nbd0): metadata I/O error: block 0xc0011749 ("xlog_iodone") error 5 numblks 512
[579763.176884] XFS (nbd0): xfs_do_force_shutdown(0x2) called from line 1261 of file /build/linux-hwe-xJVMkx/linux-hwe-4.15.0/fs/xfs/xfs_log.c.  Return address = 0x00000000918af758
[579763.252836] print_req_error: I/O error, dev nbd0, sector 2195113352
[579763.252859] print_req_error: I/O error, dev nbd0, sector 2195113608
[579763.252869] print_req_error: I/O error, dev nbd0, sector 2195113864
[579763.356841] print_req_error: I/O error, dev nbd0, sector 2195114120
[579763.356885] print_req_error: I/O error, dev nbd0, sector 2195114376
[579763.358040] XFS (nbd0): writeback error on sector 2195119688
[579763.916813] block nbd0: Connection timed out
[579768.140839] block nbd0: Connection timed out
[579768.140859] print_req_error: 21 callbacks suppressed
[579768.140860] print_req_error: I/O error, dev nbd0, sector 2195112840
[579768.141101] XFS (nbd0): writeback error on sector 219511559

/var/log/ceph/ceph-client.archiv.log
(I activated the debug options "--debug-rbd=20 --debug-ms=1"; see the attached logfile)

2019-07-18 14:52:55.387784 7fffcf7fe700 10 librbd::Watcher: 0x7fffc0001010 unregister_watch:
2019-07-18 14:52:55.387815 7fffcf7fe700  1 -- 10.23.27.200:0/3920476044 --> 10.23.27.151:6806/2322641 -- osd_op(unknown.0.0:1853 34.132 34:4cb446f4:::rbd_header.6c73776b8b4567:head [watch unwatch cookie 140736414969824] snapc 0=[] ondisk+write+known_if_redirected e256219) v8 -- 0x7fffc803a340 con 0
2019-07-18 14:52:55.388656 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 90 ==== watch-notify(notify (1) cookie 140736414969824 notify 1100452225614816 ret 0) v3 ==== 68+0+0 (1852866777 0 0) 0x7fffe187b4c0 con 0x7fffc00054d0
2019-07-18 14:52:55.388738 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 91 ==== osd_op_reply(1852 rbd_header.6c73776b8b4567 [notify cookie 140736550101040] v0'0 uv2102967 ondisk = 0) v8 ==== 169+0+8 (3077247585 0 3199212159) 0x7fffe0002ef0 con 0x7fffc00054d0
2019-07-18 14:52:55.388815 7fffcffff700  5 librbd::Watcher: 0x7fffc0001010 notifications_blocked: blocked=1
2019-07-18 14:52:55.388904 7fffcffff700  1 -- 10.23.27.200:0/3920476044 --> 10.23.27.151:6806/2322641 -- osd_op(unknown.0.0:1854 34.132 34:4cb446f4:::rbd_header.6c73776b8b4567:head [notify-ack cookie 0] snapc 0=[] ondisk+read+known_if_redirected e256219) v8 -- 0x7fffc00600a0 con 0
2019-07-18 14:52:55.389594 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 92 ==== osd_op_reply(1853 rbd_header.6c73776b8b4567 [watch unwatch cookie 140736414969824] v256219'2102968 uv2102967 ondisk = 0) v8 ==== 169+0+0 (242862078 0 0) 0x7fffe0002ef0 con 0x7fffc00054d0
2019-07-18 14:52:55.389838 7fffcd7fa700 10 librbd::image::CloseRequest: 0x555555946390 handle_unregister_image_watcher: r=0
2019-07-18 14:52:55.389849 7fffcd7fa700 10 librbd::image::CloseRequest: 0x555555946390 send_flush_readahead
2019-07-18 14:52:55.389848 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 93 ==== watch-notify(notify_complete (2) cookie 140736550101040 notify 1100452225614816 ret 0) v3 ==== 42+0+8 (2900410459 0 0) 0x7fffe09888e0 con 0x7fffc00054d0
2019-07-18 14:52:55.389895 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 94 ==== osd_op_reply(1854 rbd_header.6c73776b8b4567 [notify-ack cookie 0] v0'0 uv2102967 ondisk = 0) v8 ==== 169+0+0 (3363304947 0 0) 0x7fffe0002ef0 con 0x7fffc00054d0
2019-07-18 14:52:55.389944 7fffcffff700 20 librbd::watcher::Notifier: 0x7fffc00010c0 handle_notify: r=0
2019-07-18 14:52:55.389955 7fffcffff700 20 librbd::watcher::Notifier: 0x7fffc00010c0 handle_notify: pending=0
2019-07-18 14:52:55.389974 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 handle_flush_readahead: r=0
2019-07-18 14:52:55.389978 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 send_shut_down_cache
2019-07-18 14:52:55.390119 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 handle_shut_down_cache: r=0
2019-07-18 14:52:55.390125 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 send_flush_op_work_queue
2019-07-18 14:52:55.390130 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 handle_flush_op_work_queue: r=0
2019-07-18 14:52:55.390135 7fffcf7fe700 10 librbd::image::CloseRequest: 0x555555946390 handle_flush_image_watcher: r=0
2019-07-18 14:52:55.390168 7fffcf7fe700 10 librbd::ImageState: 0x555555947170 0x555555947170 handle_close: r=0
2019-07-18 14:52:55.391016 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6806/2322641 conn(0x7fffc00054d0 :-1 s=STATE_OPEN pgs=823187 cs=1 l=1).mark_down
2019-07-18 14:52:55.391054 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6804/3417032 conn(0x7fffc8108a70 :-1 s=STATE_OPEN pgs=517757 cs=1 l=1).mark_down
2019-07-18 14:52:55.391070 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6814/2366227 conn(0x7fffc807dcb0 :-1 s=STATE_OPEN pgs=529918 cs=1 l=1).mark_down
2019-07-18 14:52:55.391098 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6804/2366065 conn(0x7fffc80fb770 :-1 s=STATE_OPEN pgs=912307 cs=1 l=1).mark_down
2019-07-18 14:52:55.391120 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6810/2322644 conn(0x7fffc81e2130 :-1 s=STATE_OPEN pgs=612830 cs=1 l=1).mark_down
2019-07-18 14:52:55.391129 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6818/2366231 conn(0x7fffc81fa980 :-1 s=STATE_OPEN pgs=791458 cs=1 l=1).mark_down
2019-07-18 14:52:55.391145 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6812/2366206 conn(0x555555954480 :-1 s=STATE_OPEN pgs=528620 cs=1 l=1).mark_down
2019-07-18 14:52:55.391153 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6818/2322735 conn(0x7fffc80d42f0 :-1 s=STATE_OPEN pgs=696439 cs=1 l=1).mark_down
2019-07-18 14:52:55.391161 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6802/2322629 conn(0x7fffc805b6a0 :-1 s=STATE_OPEN pgs=623470 cs=1 l=1).mark_down
2019-07-18 14:52:55.391477 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.391566 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6802/2366074 conn(0x7fffc8068670 :-1 s=STATE_OPEN pgs=757332 cs=1 l=1).mark_down
2019-07-18 14:52:55.392068 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392086 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392093 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6814/2322637 conn(0x7fffc8077360 :-1 s=STATE_OPEN pgs=680587 cs=1 l=1).mark_down
2019-07-18 14:52:55.392114 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6816/2322731 conn(0x7fffc8061f00 :-1 s=STATE_OPEN pgs=605415 cs=1 l=1).mark_down
2019-07-18 14:52:55.392129 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6820/51139 conn(0x7fffc811d5e0 :-1 s=STATE_OPEN pgs=343039 cs=1 l=1).mark_down
2019-07-18 14:52:55.392138 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6800/3416932 conn(0x7fffc820fa10 :-1 s=STATE_OPEN pgs=781718 cs=1 l=1).mark_down
2019-07-18 14:52:55.392147 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6806/3417034 conn(0x7fffc80533c0 :-1 s=STATE_OPEN pgs=536148 cs=1 l=1).mark_down
2019-07-18 14:52:55.392165 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6808/3417140 conn(0x7fffc8134810 :-1 s=STATE_OPEN pgs=761314 cs=1 l=1).mark_down
2019-07-18 14:52:55.392181 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6816/3416957 conn(0x7fffc8146610 :-1 s=STATE_OPEN pgs=837393 cs=1 l=1).mark_down
2019-07-18 14:52:55.392214 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392228 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392256 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392286 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392436 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392448 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392452 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.392194 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6810/2366205 conn(0x7fffc810f360 :-1 s=STATE_OPEN pgs=587338 cs=1 l=1).mark_down
2019-07-18 14:52:55.392882 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.150:6806/2366079 conn(0x7fffc8005320 :-1 s=STATE_OPEN pgs=588682 cs=1 l=1).mark_down
2019-07-18 14:52:55.393119 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.152:6822/3417160 conn(0x7fffc813b3d0 :-1 s=STATE_OPEN pgs=667392 cs=1 l=1).mark_down
2019-07-18 14:52:55.393136 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.151:6812/2322646 conn(0x7fffc8148070 :-1 s=STATE_OPEN pgs=671116 cs=1 l=1).mark_down
2019-07-18 14:52:55.393150 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6810/3215426 conn(0x7fffc800f2c0 :-1 s=STATE_OPEN pgs=432372 cs=1 l=1).mark_down
2019-07-18 14:52:55.393165 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6822/3215746 conn(0x7fffc8166d60 :-1 s=STATE_OPEN pgs=363273 cs=1 l=1).mark_down
2019-07-18 14:52:55.393179 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6814/3215437 conn(0x7fffc8045cd0 :-1 s=STATE_OPEN pgs=360091 cs=1 l=1).mark_down
2019-07-18 14:52:55.393187 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6806/3215303 conn(0x7fffc80f4e90 :-1 s=STATE_OPEN pgs=301563 cs=1 l=1).mark_down
2019-07-18 14:52:55.393195 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6802/3215397 conn(0x7fffc806eec0 :-1 s=STATE_OPEN pgs=288488 cs=1 l=1).mark_down
2019-07-18 14:52:55.393644 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.393717 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.393732 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.393736 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.393741 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.393208 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6804/3215221 conn(0x7fffc81259b0 :-1 s=STATE_OPEN pgs=321005 cs=1 l=1).mark_down
2019-07-18 14:52:55.393758 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6808/3215396 conn(0x7fffc812d9a0 :-1 s=STATE_OPEN pgs=361415 cs=1 l=1).mark_down
2019-07-18 14:52:55.393767 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.161:6818/3215660 conn(0x7fffc817ca00 :-1 s=STATE_OPEN pgs=357502 cs=1 l=1).mark_down
2019-07-18 14:52:55.393933 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.153:6801/2546 conn(0x7fffd000ca40 :-1 s=STATE_OPEN pgs=2798975 cs=1 l=1).mark_down
2019-07-18 14:52:55.393965 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.154:6789/0 conn(0x7fffa801a900 :-1 s=STATE_OPEN pgs=7851695 cs=1 l=1).mark_down
2019-07-18 14:52:55.394004 7fffe993c700  1 -- 10.23.27.200:0/3920476044 reap_dead start
2019-07-18 14:52:55.396891 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 shutdown_connections
2019-07-18 14:52:55.397293 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 shutdown_connections
2019-07-18 14:52:55.397384 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 wait complete.
2019-07-18 14:52:55.397405 7ffff7fe3000  1 -- 10.23.27.200:0/3920476044 >> 10.23.27.200:0/3920476044 conn(0x555555938f80 :-1 s=STATE_NONE pgs=0 cs=0 l=0).mark_down

The problem has now appeared on every one of three reproduction attempts.
The filesystem is inaccessible afterwards.

A simple read/write of a file larger than the rbd cache works without a problem:

$ dd if=/dev/zero of=la bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 27.9281 s, 38.4 MB/s

$ dd if=la of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 93.6031 s, 11.5 MB/s

Environment:

Kernel: 4.15.0-45-generic
Ceph Client: 12.2.11
XFS Mount options: rw,relatime,attr2,discard,largeio,inode64,allocsize=4096k,logbufs=8,logbsize=256k,noquota,_netdev

Volume

rbd image 'archive-001-srv_ec':
    size 3TiB in 786432 objects
    order 22 (4MiB objects)
    data_pool: rbd_hdd_ec
    block_name_prefix: rbd_data.34.6c73776b8b4567
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
    flags:
    create_timestamp: Thu Jul 11 17:59:00 2019

Client configuration

[client]
    rbd cache = true
    rbd cache size = 536870912
    rbd cache max dirty = 268435456
    rbd cache target dirty = 134217728
    rbd cache max dirty age = 15
    rbd readahead max bytes = 4194304

Pool created by

ceph osd erasure-code-profile set archive_profile \
     k=2 \
     m=2 \
     crush-device-class=hdd \
     crush-failure-domain=host

ceph osd pool create rbd_hdd_ec 64 64 erasure archive_profile
ceph osd pool set rbd_hdd_ec  allow_ec_overwrites true
ceph osd pool application enable rbd_hdd_ec rbd

ceph-client.archiv.log.gz - rbd-nbd log of the crash (795 KB) Marc Schöchlin, 07/18/2019 05:42 PM

History

#1 Updated by Jason Dillaman 3 months ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)

#2 Updated by Marc Schöchlin 3 months ago

From the mailinglist:

Am 18.07.19 um 20:10 schrieb Jason Dillaman:
> On Thu, Jul 18, 2019 at 1:47 PM Marc Schöchlin <ms@256bit.org> wrote:
>>
>> Hello cephers,
>>
>> rbd-nbd crashes in a reproducible way here.
>
> I don't see a crash report in the log below. Is it really crashing or
> is it shutting down? If it is crashing and it's reproducable, can you
> install the debuginfo packages, attach gdb, and get a full backtrace
> of the crash?

I do not get a crash report from rbd-nbd.
It seems that "rbd-nbd" just terminates, which crashes the XFS filesystem because the nbd device is no longer available.
("rbd nbd ls" no longer shows a mapped device)
>
>
> It seems like your cluster cannot keep up w/ the load and the nbd
> kernel driver is timing out the IO and shutting down. There is a
> "--timeout" option on "rbd-nbd" that you can use to increase the
> kernel IO timeout for nbd.
>
I also have a 36 TB XFS (non-EC) volume on this virtual system, mapped via krbd, which is under really heavy read/write usage.
I have never experienced problems like this on that system with similar usage patterns.

The volume involved in the problem only handles a really low load, and I was able to create the error situation using the simple "find . -type f -name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;" command.
I copied and read ~1.5 TB of data to this volume without a problem - it seems that the gzip command provokes an I/O pattern which leads to the error situation.

As described, I use a Luminous "12.2.11" client, which does not support that "--timeout" option (btw, a backport would be nice).
Our ceph system runs under a heavy write load, therefore we already set a 60-second timeout using the following code:
(https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c)
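The linked nbd_set_timeout.c essentially performs a single ioctl on the mapped device. Below is my own minimal sketch of that idea in Python, not the linked code itself; it assumes the NBD_SET_TIMEOUT ioctl number from <linux/nbd.h> (_IO(0xab, 9)) and an already-mapped nbd device:

```python
import fcntl
import os

# NBD_SET_TIMEOUT = _IO(0xab, 9) from <linux/nbd.h>; the argument is
# the request timeout in seconds.
NBD_SET_TIMEOUT = 0xab09

def set_nbd_timeout(dev: str, seconds: int) -> bool:
    """Set the kernel-side nbd request timeout for an already-mapped device.

    Returns True on success, False if the device cannot be opened or the
    ioctl fails (e.g. not an nbd device, insufficient privileges).
    """
    try:
        fd = os.open(dev, os.O_RDWR)
    except OSError:
        return False
    try:
        fcntl.ioctl(fd, NBD_SET_TIMEOUT, seconds)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

Running `set_nbd_timeout("/dev/nbd0", 60)` would match the 60-second timeout described above; it needs root and an existing nbd mapping.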

We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5, kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel 4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.
We only experience problems like this with rbd-nbd > 12.2.5 on Ubuntu 16.04 (kernel 4.15) or Ubuntu 18.04 (kernel 4.15), with or without erasure coding.

Regards
Marc

#3 Updated by Mykola Golub 3 months ago

Here is the end of `grep rbd-nbd:` from the attached log:

2019-07-18 14:51:53.936995 7fffbffff700 20 rbd-nbd: [3500000000000000 FLUSH 0~0 0]: start
2019-07-18 14:51:53.937072 7fffbffff700 20 rbd-nbd: reader_entry: waiting for nbd request
2019-07-18 14:51:53.960371 7fffcf7fe700 20 rbd-nbd: aio_callback: [3300000000000000 READ 21b387f3000~20000 0]
2019-07-18 14:51:53.960414 7fffbf7fe700 20 rbd-nbd: writer_entry: got: [3300000000000000 READ 21b387f3000~20000 0]
2019-07-18 14:51:53.960604 7fffbf7fe700 20 rbd-nbd: [3300000000000000 READ 21b387f3000~20000 0]: finish
2019-07-18 14:51:53.960622 7fffbf7fe700 20 rbd-nbd: writer_entry: waiting for io request
2019-07-18 14:51:53.961537 7fffcf7fe700 20 rbd-nbd: aio_callback: [3400000000000000 READ 21b38813000~20000 0]
2019-07-18 14:51:53.961559 7fffbf7fe700 20 rbd-nbd: writer_entry: got: [3400000000000000 READ 21b38813000~20000 0]
2019-07-18 14:51:53.961698 7fffbf7fe700 20 rbd-nbd: [3400000000000000 READ 21b38813000~20000 0]: finish
2019-07-18 14:51:53.961704 7fffbf7fe700 20 rbd-nbd: writer_entry: waiting for io request
2019-07-18 14:51:53.967587 7fffbffff700 20 rbd-nbd: [3400000000000000 READ 21b38833000~20000 0]: start
2019-07-18 14:51:53.967628 7fffbffff700 20 rbd-nbd: reader_entry: waiting for nbd request
2019-07-18 14:51:53.971134 7fffcf7fe700 20 rbd-nbd: aio_callback: [3400000000000000 READ 21b38833000~20000 0]
2019-07-18 14:51:53.971165 7fffbf7fe700 20 rbd-nbd: writer_entry: got: [3400000000000000 READ 21b38833000~20000 0]
2019-07-18 14:51:53.973706 7fffbffff700 20 rbd-nbd: [3600000000000000 READ 21b38853000~20000 0]: start
2019-07-18 14:51:53.973754 7fffbffff700 20 rbd-nbd: reader_entry: waiting for nbd request
2019-07-18 14:52:54.148726 7fffbffff700 -1 rbd-nbd: failed to read nbd request header: (33) Numerical argument out of domain
2019-07-18 14:52:54.148950 7ffff7fe3000 10 rbd-nbd: stop: terminating
2019-07-18 14:52:54.150668 7fffbf7fe700 -1 rbd-nbd: [3400000000000000 READ 21b38833000~20000 0]: failed to write replay data: (32) Broken pipe
2019-07-18 14:52:54.185349 7fffcf7fe700 20 rbd-nbd: aio_callback: [3600000000000000 READ 21b38853000~20000 0]
2019-07-18 14:52:55.177928 7fffcf7fe700 20 rbd-nbd: aio_callback: [3500000000000000 FLUSH 0~0 0]

As you can see, at 14:51:53 a flush request was received, which completed only at 14:52:55 (i.e. after a bit more than one minute). That exceeded the 60-second timeout at 14:52:54 (observed as the "failed to read nbd request header: (33) Numerical argument out of domain" error).

This does not look like an issue with rbd-nbd (librbd); it simply did not have enough time to process the flush request, either because the cluster was slow or because the flush was heavy in your case due to the rather large cache.

I would suggest either increasing the timeout or decreasing the rbd cache size.
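For reference, a reduced cache configuration could look like this in ceph.conf (the values below are purely illustrative, not a tested recommendation; the original report uses a 512 MiB cache):

```ini
[client]
    rbd cache = true
    # Smaller cache so a flush has less dirty data to write back
    # within the nbd request timeout (illustrative values):
    rbd cache size = 67108864          ; 64 MiB
    rbd cache max dirty = 33554432     ; 32 MiB
    rbd cache target dirty = 16777216  ; 16 MiB
```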

#4 Updated by Marc Schöchlin 3 months ago

Thank you for the comment; I will try out smaller rbd cache sizes now.

I just enabled the rbd cache by configuration and used the default memory settings (just "rbd cache = true" and remapping - nothing else).
At first that looked promising: the test ran for 30 minutes without a crash, at a single-stream rate of 200 IOPS ("find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;"). Then it crashed again with the same messages :(

I really think there is something wrong with rbd-nbd or the kernel.

Additionally, I am wondering:
- why does a heavily loaded krbd volume on the same system run without any crash (30-60% iowait, mixed read/write load) for many months (its whole lifetime)?
- why is the nbd device/mapping always gone when the crash appears, and why does it need to be remapped to get it working again?
(from my point of view: if this problem were caused by XFS filesystem code, the nbd device should still work despite an XFS problem)
- if this problem is really caused by the cache configuration (from my current point of view this is not the case): is rbd-nbd so fragile that an improper rbd cache size setting can lead to crashes of the device?

#5 Updated by Marc Schöchlin 3 months ago

- i disabled rbd cache with "rbd cache = false"
- unmounted and unmapped the image
- mapped and mounted the image
- re-executed my test

find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;

It took several hours, but in the end I got the same error situation.

#6 Updated by Mykola Golub 3 months ago

In the attached log, after the flush was requested, I see many write ops (for dirty cache objects) sent to different osds, like this one:

2019-07-18 14:51:53.961783 7fffcf7fe700  1 -- 10.23.27.200:0/3920476044 --> 10.23.27.150:6814/2366227 -- osd_op(unknown.0.0:1816 36.12s0 36:489d4b84:::rbd_data.34.6c73776b8b4567.00000000000416b7:head [writefull 0~4194304] snapc a8=[a8,a4,a0] ondisk+write+known_if_redirected e256219) v8 -- 0x7fffc8181b90 con 0

But there was no response from the osds. Then, after about 1 minute, there was a "watch notify disconnect" request from osd.17 (for the image header watcher), which caused the watcher to reconnect:
2019-07-18 14:52:54.162191 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== osd.17 10.23.27.151:6806/2322641 63 ==== watch-notify(disconnect (3) cookie 140736414969824 notify 0 ret 0) v3 ==== 42+0+0 (1115836510 0 0) 0x7fffe0a2da30 con 0x7fffc00054d0
2019-07-18 14:52:54.162411 7fffcffff700 -1 librbd::ImageWatcher: 0x7fffc0001010 image watch failed: 140736414969824, (107) Transport endpoint is not connected

Shortly after this we started to receive osd_op_reply messages for our previous write requests, from osd.17 and then from the other osds.

Right now I don't see what could be wrong with the librbd (nbd) layer. To me it looks like the osds were just overloaded with requests. Or maybe it is network related? It looks a little strange (if it is not a coincidence) that it "unfroze" after the "watch notify disconnect".

Do the osds report slow requests at the time the error occurs? If they do, could you please provide details (e.g. `ceph daemon osd.X dump_historic_ops`)?

#7 Updated by Marc Schöchlin 3 months ago

As described, I repeated my test with the rbd cache disabled.
This delayed the problem, but it reproduced all the same.

Our ceph cluster network consists of 10G SFP+ equipment (Juniper EX4600), and the virtual machines are connected by a 4x1 Gbit connection (Juniper EX4300).
All network interfaces of the hypervisor are part of a bonding interface; network load is distributed equally, and no network errors are visible on the deeply monitored network and hypervisor equipment.

From my point of view, I suppose there is an rbd-nbd, librbd, or nbd.ko problem, because:
a much more heavily loaded krbd interface (99% utilization, according to "iostat -d -x 120"), which is also used on that Linux system, runs for months without a problem.

#8 Updated by Marc Schöchlin 3 months ago

Can you rename the issue to "reproducible rbd-nbd crashes" to help the search engines? :-)

#9 Updated by Mykola Golub 3 months ago

  • Subject changed from reproducable rbd-nbd crashes to rbd-nbd: reproducible crashes on nbd request timeout

#10 Updated by Marc Schöchlin 3 months ago

I installed kernel 4.4.0-154.181 (from the Ubuntu package sources) and performed the crash reproduction, because my XenServer has run that release without any problems for months now. The problem also re-appeared with that kernel release.

A bunch of 10 gunzip processes threw 1600 write and 330 read IOPS against the cluster/the rbd_ec volume, with a transfer rate of 290 MB/sec, for 10 minutes.
After that the same problem re-appeared.

What should we do now?

Testing with a 10.2.5 librbd/rbd-nbd is currently not that easy for me, because the ceph apt source does not contain that version.
Do you know of a package source?

#11 Updated by Mykola Golub 3 months ago

Marc Schöchlin wrote:

What should we do now?

It seems you missed my question about whether ceph reported slow ops when the issue was observed. As a first step I would like to make sure it is not a server-side issue. If it is not, then it looks like some sort of contention or throttling on the client (rbd-nbd) side. In that case, if you could rerun your initial test with more debugging enabled ("--debug-rbd=20 --debug-ms=5 --debug-objecter=20"), it would produce much larger output but might give some clue.

Also, it would be nice to see perf stats from the rbd-nbd admin socket:

ceph --admin-daemon /path/to/rbd-nbd/client.admin.asok perf dump

preferably collected just before or at the moment when the issue occurs (you can try increasing the timeout to have more time to catch that moment).

#12 Updated by Marc Schöchlin 3 months ago

Oh sorry, I missed that question about slow requests.
No, we definitely do not have any slow ops when this problem occurs - I checked this multiple times.

From my current point of view, our ceph system is healthy.
The health log/the dashboard does not report anything that points to problems.
The only thing logged is sporadic slow requests while our snapshot creation/deletion process is in progress.
A few months ago we had a higher number of slow requests, but after upgrading to 12.2.10 it decreased to 1-5 slow-request log messages per day.

I will collect the requested debug data tonight or tomorrow morning.

#13 Updated by Mykola Golub 3 months ago

I just read some additional information from the user mailing list about your setup, which I missed before.

So, the setup where the problem is not observed:

our virtual machines in our Xen environments run completely on rbd-nbd devices. Every host runs dozens of rbd-nbd maps, which are visible as Xen disks in the virtual systems.

The setup where the problem is observed:

the problematic rbd-nbd runs on a virtual system which utilizes only a single nbd and a single krbd device.

The fact that you are running `rbd-nbd` inside a VM may be more important than the rbd-nbd (or kernel) version. In the Xen environment, rbd-nbd uses the host network stack, while in the VM setup it uses a virtualized network stack, which is not as performant and (at least in the past) was known for its issues. So I would check whether that is the case here.

Another thing to consider is that in the Xen setup, io requests go through the virt io stack before reaching rbd-nbd. They might be throttled/queued in this layer, so rbd-nbd is not hit with as many requests as in the VM setup, where the io requests go directly to rbd-nbd.

#14 Updated by Jason Dillaman 3 months ago

  • Status changed from New to Need More Info

#15 Updated by Marc Schöchlin 3 months ago

Sorry for the delay, I was a bit ill over the weekend :-)

As described, I run a krbd device and(!) an rbd-nbd device in the same virtual machine: the krbd device has run without any problem for months now, while the rbd-nbd device can be crashed in a reproducible way. The fact that rbd-nbd crashes in the same way with both an old and an up-to-date kernel makes me think there is a problem with rbd-nbd or librbd.

As requested, I reproduced the problem with "ceph --admin-daemon /path/to/rbd-nbd/client.admin.asok perf dump" invocations and the requested debug/log level.

Shell 1:

$ rmmod nbd
$ rbd-nbd --debug-rbd=20 --debug-ms=5 --debug-objecter=20 --id archiv map rbd_hdd/archiv-001-srv_ec --device /dev/nbd0
/dev/nbd0
$ nbd_set_ioctl /dev/nbd0 120
$ mount /dev/nbd0 /srv_ec
$ mount |grep /srv_ec
/dev/nbd0 on /srv_ec type xfs (rw,relatime,attr2,inode64,noquota)
$ while true; do date; ceph --admin-daemon /var/run/ceph/ceph-client.archiv.asok perf dump > /var/log/ceph/perf/perfdump-`date --date="today" "+%Y-%m-%d_%H-%M-%S"`; sleep 5; done

Shell 2:

$ dmesg -c
$ find /srv_ec -name "*.sql" -exec gzip {} \;
$ find /srv_ec -name "*.sql" -print0 |xargs -0 -n 2 -P 10 gzip -v
...
gzip: ./forums_current/redacted1.sql: Input/output error
gzip: ./forums_current/redacted2.sql: Input/output error
gzip: 
gzip: ./forums_current/redacted3.sql: Input/output error

$ dmesg
$ dmesg -T > dmesg-crash

Unfortunately the logfiles are too big for uploading to the ceph tracker :-)
Therefore I uploaded them to my own site.

URI: https://www.256bit.org/nextcloud/index.php/s/nzkHGxzXmrXxzpj
Password: AiT7gah;

  • File 1: "test-with-timeout.zip" the problem produced by the procedure listed above (see dmesg output, problem started at Mon Jul 29 21:57:16 2019)
  • File 2: "test-without-timeout.zip" almost the same procedure, but without any timeout settings

#16 Updated by Mykola Golub 3 months ago

I have no idea about the cause, but here are some observations from the attached log.

For the "with timeout" case, there is an exactly two-minute gap (between 21:55:21 and 21:57:21) without any entries in the log:

...
2019-07-29 21:55:21.385551 7fffcf7fe700 20 librbd::BlockGuard: 0x7fffc80f1d10 detain: block_start=607907, block_end=607908, free_slots=32
2019-07-29 21:55:21.385553 7fffcf7fe700 20 librbd::ObjectMap: 0x7fffc80dcd40 detained_aio_update: in-flight update cell: 0x7fffc8256928
2019-07-29 21:55:21.385555 7fffcf7fe700 20 librbd::ManagedLock: 0x7fffc0060470 is_lock_owner=1
2019-07-29 21:55:21.385557 7fffcf7fe700 20 librbd::ObjectMap: 0x7fffc80dcd40 aio_update: start=607907, end=607908, ->1
2019-07-29 21:55:21.385560 7fffcf7fe700 20 librbd::object_map::UpdateRequest: 0x7fffc8171730 update_object_map: ictx=0x555555950e20, oid=rbd_object_map.6c73776b8b4567, [607907,607908) = ->1
2019-07-29 21:55:21.385574 7fffcf7fe700 10 client.24411159.objecter _op_submit op 0x7fffc8357270
2019-07-29 21:55:21.385576 7fffcf7fe700 20 client.24411159.objecter _calc_target epoch 263014 base rbd_object_map.6c73776b8b4567 @34 precalc_pgid 0 pgid 0.0 is_write
2019-07-29 21:57:21.784562 7fffe893a700  5 -- 10.23.27.200:0/1915353349 >> 10.23.27.153:6789/0 conn(0x55555594bdc0 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1066715 cs=1 l=1). rx mon.0 seq 206 0x7fff988e1c80 osd_map(263015..263015 src has 262285..263015) v3
2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000 READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
2019-07-29 21:57:21.786462 7fffe37fe700  1 -- 10.23.27.200:0/1915353349 <== mon.0 10.23.27.153:6789/0 206 ==== osd_map(263015..263015 src has 262285..263015) v3 ==== 3981+0+0 (1660450611 0 0) 0x7fff988e1c80 con 0x55555594bdc0
2019-07-29 21:57:21.786487 7fffe37fe700 10 client.24411159.objecter ms_dispatch 0x5555557c66b0 osd_map(263015..263015 src has 262285..263015) v3
2019-07-29 21:57:21.787677 7fffe913b700  5 -- 10.23.27.200:0/1915353349 >> 10.23.27.151:6818/2322735 conn(0x7fffc810ce30 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=799246 cs=1 l=1). rx osd.30 seq 2997 0x7fff9ca940f0 osd_op_reply(154170 rbd_object_map.6c73776b8b4567 [call,call] v263014'3161446 uv3161446 ondisk = 0) v8
2019-07-29 21:57:21.787694 7fffe913b700  1 -- 10.23.27.200:0/1915353349 <== osd.30 10.23.27.151:6818/2322735 2997 ==== osd_op_reply(154170 rbd_object_map.6c73776b8b4567 [call,call] v263014'3161446 uv3161446 ondisk = 0) v8 ==== 215+0+0 (3739644802 0 0) 0x7fff9ca940f0 con 0x7fffc810ce30
2019-07-29 21:57:21.787702 7fffe913b700 10 client.24411159.objecter ms_dispatch 0x5555557c66b0 osd_op_reply(154170 rbd_object_map.6c73776b8b4567 [call,call] v263014'3161446 uv3161446 ondisk = 0) v8
2019-07-29 21:57:21.787706 7fffe913b700 10 client.24411159.objecter in handle_osd_op_reply
2019-07-29 21:57:21.795963 7fffe993c700  5 -- 10.23.27.200:0/1915353349 >> 10.23.27.150:6804/2366065 conn(0x7fffc81430d0 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1054476 cs=1 l=1). rx osd.21 seq 7433 0x7fffddb47010 osd_map(263015..263026 src has 262285..263026) v3
...

So it looks like the rbd-nbd process "froze" and "unfroze" when the nbd driver hit the timeout.

Note, it differs a little from the case reported earlier, as in that log there was still some activity on the rados (messenger) layer recorded in that period.

In the "without timeout" case, it looks like everything worked well until exactly '13:00:00', and after that no nbd requests were recorded, though there was still some activity on the objecter/messenger layer related to the watcher. Is that the moment when the test got stuck? Otherwise it just looks like there were no more requests to rbd-nbd after 13:00.

#17 Updated by Marc Schöchlin 3 months ago

Hello Jason,

it seems that something is wrong in the rbd-nbd implementation.
(I also added this information at https://tracker.ceph.com/issues/40822)

The problem does not seem to be related to kernel releases, filesystem types, or the ceph and network setup.
Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to have the described problem.

Last night an 18-hour test run with the following procedure was successful:

#!/bin/bash
set -x
while true; do
   date
   find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 gzip -v
   date
   find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 -n 2 gunzip -v
done

Previous tests with "-P 1" (single-threaded gzip/gunzip) crashed in a reproducible manner after a few minutes up to 45 minutes.

Overview of my tests:

  • SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
    => 18 hour testrun was successful, no dmesg output
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    => failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
    => parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    => failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
    => parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
    => failed after < 10 minutes
    => system runs in a high system load, system is almost unusable, unable to shutdown the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
    => failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

All device timeouts were set separately using the nbd_set_ioctl tool, because the luminous rbd-nbd does not provide an option to define timeouts.
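For reference, setting the timeout on an already-mapped device boils down to a single NBD_SET_TIMEOUT ioctl on the /dev/nbdX node. This is only a minimal sketch of what a tool like nbd_set_ioctl presumably does; the helper name `set_nbd_timeout` is mine, not from that tool:

```python
import fcntl
import os

# From <linux/nbd.h>: NBD_SET_TIMEOUT = _IO(0xab, 9) = (0xab << 8) | 9
NBD_SET_TIMEOUT = 0xAB09

def set_nbd_timeout(device: str, seconds: int) -> None:
    """Set the per-request timeout (in seconds) on a mapped nbd device."""
    fd = os.open(device, os.O_RDWR)
    try:
        # The nbd driver arms a timer per request; a request not completed
        # within `seconds` makes the driver tear down the socket (the
        # "Connection timed out" / "shutting down sockets" dmesg messages).
        fcntl.ioctl(fd, NBD_SET_TIMEOUT, seconds)
    finally:
        os.close(fd)

# Example (requires a mapped device and root):
# set_nbd_timeout("/dev/nbd0", 120)
```

Running it against an unmapped or nonexistent device simply fails on open(), so the ioctl never reaches the driver in that case.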

What's next? Is it a good idea to do a binary search between 12.2.5 and 12.2.12?

From my point of view (without in-depth knowledge of rbd-nbd/librbd), my assumption is that this problem might be caused by rbd-nbd code and not by librbd.
The probability that a bug like this would survive undiscovered in librbd for such a long time seems low to me :-)

Regards
Marc

#18 Updated by Marc Schöchlin 2 months ago

Due to the absence of a coworker I had almost no capacity to run deeper tests on this problem.
But I can say that I also reproduced the problem with release 12.2.12.

The new (updated) list:

  • SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
    -> 18 hour testrun was successful, no dmesg output
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
    -> parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
    -> parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
    -> failed after < 10 minutes
    -> system runs in a high system load, system is almost unusable, unable to shutdown the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  • FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

#19 Updated by Marc Schöchlin 2 months ago

There is something new compared to yesterday: three days ago I downgraded a production system to client version 12.2.5.

Last night this machine crashed as well. So it seems that rbd-nbd is broken in general, also with release 12.2.5 and potentially earlier.

The new (updated) list:

  • FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s device timeout
    -> crashed in production while snapshot trimming is running on that pool
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
    -> parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
    -> parallel krbd device usage with 99% io usage worked without a problem while running the test
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
    -> failed after < 10 minutes
    -> system runs in a high system load, system is almost unusable, unable to shutdown the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device
  • FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  • FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
    -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

Characteristics of the crashed VM:

  • Ubuntu 18.04, with kernel 4.15, Ceph Client 12.2.5
  • Services: NFS kernel Server, nothing else
  • Crash behavior
    • a daily task for snapshot creation/deletion started at 19:00
    • a daily database backup started at 19:00, which created:
    • 120 write IOPS and 1 read IOPS
    • 22K sectors/second written, 0 sectors/second read
    • 97 Mbit/s inbound and 97 Mbit/s outbound network usage (nfs server, inbound=nfs, outbound=rbd)
    • we had slow requests at the time of the crash
    • the rbd-nbd process terminated 25 minutes later without a segfault
    • the nfs usage created a 5-minute load average of 10 from the start, with 5K context switches/sec
    • memory usage (kernel+userspace) was 10% of the system
    • no swap usage
    • ceph.conf
          [client]
          rbd cache = true
          rbd cache size = 67108864
          rbd cache max dirty = 33554432
          rbd cache target dirty = 25165824
          rbd cache max dirty age = 3
          rbd readahead max bytes = 4194304
          admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
          
    • 4 CPUs
    • 6 GB RAM
    • Non default Sysctl Settings
          vm.swappiness = 1
          fs.aio-max-nr = 262144
          fs.file-max = 1000000
          kernel.pid_max = 4194303
          vm.zone_reclaim_mode = 0
          kernel.randomize_va_space = 0
          kernel.panic = 0
          kernel.panic_on_oops = 0
          

#20 Updated by Marc Schöchlin 2 months ago

I'm now on vacation until 8 October 2019.
My coworker will have a look at this issue in case you have additional questions.

#21 Updated by Marc Schöchlin about 2 months ago

Any updates here?

I think the severity of this problem (currently "minor") does not match its consequences.

This reproducible problem (as described on the list by Mike) can cause:

  • random service outage
  • data corruption
  • long recovery procedures on huge filesystems

Would it be adequate to increase the severity to major or critical?

#22 Updated by Jason Dillaman about 2 months ago

@Marc: in all honesty, changing the priority / severity won't affect the timeline. Mike is already looking into the issue and it appears there is a bug in XFS (but still more issues w/ memory pressure and rbd-nbd). Perhaps retest using ext4 or another filesystem in the meantime.

#23 Updated by Mike Christie about 1 month ago

The kernel-side fix for this issue has been posted here:

https://lkml.org/lkml/2019/9/9/549

On the ceph side, the rbd-nbd reader/writer and its msgr-worker threads just set that setting, along with mlock (not really needed unless we are doing rbd-nbd swap, but it simplifies the IO path) and an OOM score adjustment to prevent the OOM killer from killing us when some other process/driver needs to free memory by writing to us.

I will post a link to the ceph PR once the kernel interface is approved, because I am guessing the kernel interface will change a lot.
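The mlock and OOM-score parts can be illustrated at the syscall level. This is only a hedged sketch of the mitigations described above, not the actual ceph PR; the function name is mine, and the constants come from the standard Linux interfaces (`mlockall(2)`, `/proc/<pid>/oom_score_adj`):

```python
import ctypes
import os

# Flags from <sys/mman.h>
MCL_CURRENT = 1  # lock pages currently mapped
MCL_FUTURE = 2   # lock pages mapped in the future

def protect_io_daemon() -> None:
    """Sketch: pin an IO-path daemon's memory and shield it from the OOM killer."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    # Pin current and future pages so the daemon itself cannot be paged out
    # while the kernel is flushing dirty pages through it.
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        raise OSError(ctypes.get_errno(), "mlockall failed (needs CAP_IPC_LOCK)")
    # -1000 disables OOM killing for this process entirely
    # (requires CAP_SYS_RESOURCE to lower the score).
    with open(f"/proc/{os.getpid()}/oom_score_adj", "w") as f:
        f.write("-1000")
```

Both steps need elevated privileges, which fits a daemon like rbd-nbd that is typically started as root.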

#24 Updated by Marc Schöchlin about 1 month ago

As described in my last mail on the mailing list, I converted the filesystem to ext4, set "sysctl vm.dirty_background_ratio=0", and put the regular workload on the filesystem (used as an NFS mount).
That seems to have prevented crashes for an entire week now (before this, the nbd device crashed after hours to about one day).

XFS on top of nbd devices really seems to add additional instability.

The current workaround causes very high CPU load (40-50 on a 4-CPU virtual system) and up to ~95% iowait if a single client puts a 20GB file on that volume.

What is your current state in correcting this problem?
Can we support you by running tests with custom kernel or rbd-nbd builds?

Regards
Marc
