Bug #36482

High amount of Read I/O on BlueFS/DB when listing omap keys

Added by Wido den Hollander over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source: Community (dev)
Tags: osd,bluestore,bluefs,read_random,omap,rbd
Backport: nautilus
Regression: No
Severity: 2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm not sure how best to describe this issue, but I've been observing various problems with Luminous 12.2.4 ~ 12.2.8 OSDs.

It started with the StupidAllocator dumping messages, as described here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030546.html

2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2078000~34000
2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd22cc000~24000
2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2300000~20000
2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2324000~24000
2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd26c0000~24000
2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2704000~30000

After we offloaded data to other OSDs (a CRUSH migration), these messages went away.
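(The offload itself isn't shown here; a common way to do such a CRUSH migration is to lower the CRUSH weight of the affected OSDs, for example:)

ceph osd crush reweight osd.246 1.0    # illustrative weight only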

A few days later we observed OSDs driving their disks (Samsung PM863a 1.92TB SSDs) to ~90% utilization and reading heavily from them, like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00   13.00   43.00    60.00   236.00    10.57     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     4.00   12.00   98.00    52.00   820.00    15.85     0.00    0.04    0.00    0.04   0.04   0.40
sdc               0.00     1.00   25.00  145.00   120.00  1336.00    17.13     0.04    0.24    0.00    0.28   0.12   2.00
sdd               0.00     7.00   27.00  117.00   112.00  1076.00    16.50     0.01    0.06    0.00    0.07   0.06   0.80
sde               0.00     0.00 6839.00   69.00 53464.00   388.00    15.59     0.93    0.13    0.14    0.12   0.13  92.80
sdf               0.00     0.00   17.00  102.00   192.00   756.00    15.93     0.00    0.03    0.24    0.00   0.03   0.40
sdg               0.00     0.00   11.00   89.00    44.00   516.00    11.20     0.02    0.20    0.00    0.22   0.12   1.20
sdh               0.00     0.00   11.00  121.00    68.00   676.00    11.27     0.01    0.06    0.00    0.07   0.06   0.80
sdi               0.00    11.00    0.00    5.00     0.00    64.00    25.60     0.00    0.80    0.00    0.80   0.80   0.40
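(The per-device stats above come from iostat; mapping the busy device back to an OSD can be done roughly like this, assuming the default BlueStore data paths:)

iostat -x 1
ls -l /var/lib/ceph/osd/ceph-*/block    # the 'block' symlink points at each OSD's data device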

/dev/sde is osd.246 in this case, and it's showing this in its logs:

2018-10-17 13:32:09.050155 7f54713d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60
2018-10-17 13:32:09.050160 7f54713d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54571b2700' had timed out after 60
2018-10-17 13:32:09.050714 7f54713d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60
2018-10-17 13:32:09.050719 7f54713d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54571b2700' had timed out after 60
2018-10-17 13:32:09.073876 7f5470bd6700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60

I increased this timeout from 15 to 60 as the OSDs were committing suicide.
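(The option isn't named above; assuming it is osd_op_thread_timeout, which defaults to 15 seconds, the bump can be applied at runtime like this:)

ceph tell osd.246 injectargs '--osd_op_thread_timeout 60'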

Setting debug_bluefs to 10 shows me:

2018-10-17 13:33:10.680782 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235c56c~ecf from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.680932 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235d43b~f75 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681075 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235e3b0~eb7 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681229 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235f267~e8f from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681370 7f54571b2700 10 bluefs _read_random h 0x55b695996b80 0x3782d9f~fb8 from file(ino 214919 size 0x445eac0 mtime 2018-10-16 01:55:01.588751 bdev 1 allocated 4500000 extents [1:0xe3f4b00000+4500000])
2018-10-17 13:33:10.681523 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x23600f6~ea6 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681654 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2360f9c~edd from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681798 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2361e79~f57 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681940 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2362dd0~ed1 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682088 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2363ca1~f7b from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682232 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2364c1c~f1c from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682393 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2365b38~f48 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682551 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2366a80~fdd from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
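(For reference, the debug level can be raised at runtime without restarting the OSD:)

ceph tell osd.246 injectargs '--debug_bluefs 10'
ceph daemon osd.246 config set debug_bluefs 10    # same thing via the admin socket on the OSD host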

I spent a few hours with Igor from SUSE debugging this problem, as it is showing up on multiple OSDs.

We debugged with osd.240 (not 246) and we could trigger the problem using this command:

root@mon01:~# time rados -p rbd-ssd-c03-p02 listomapkeys rbd_header.0904ed238e1f29
features
object_prefix
order
size
snap_seq
snapshot_0000000000000431

real    2m16.678s
user    0m0.064s
sys    0m0.028s
root@mon01:~#

As you can see, it takes over 2 minutes (!!) to list the omap keys for this RBD header. During those two minutes the OSD becomes unusable and causes massive slow requests, as it is simply blocking.
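(While the listing runs, the blocking can be confirmed from the OSD's admin socket and the cluster health, e.g.:)

ceph daemon osd.240 dump_ops_in_flight    # the stuck omap op shows up here
ceph health detail                        # reports the resulting slow requests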

The image in this case:

root@mon01:~# rbd -p rbd-ssd-c03-p02 info fa2d5398-08de-4f9b-b519-28e6258bc5f3
rbd image 'fa2d5398-08de-4f9b-b519-28e6258bc5f3':
    size 50GiB in 12800 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.0904ed238e1f29
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags: 
root@mon01:~#
root@mon01:~# ceph osd map rbd-ssd-c03-p02 rbd_header.0904ed238e1f29
osdmap e623992 pool 'rbd-ssd-c03-p02' (29) object 'rbd_header.0904ed238e1f29' -> pg 29.9fd75a23 (29.223) -> up ([240,183,176], p240) acting ([240,183,176], p240)
root@mon01:~# 

Doing this causes the OSD (240) to jump to ~90% utilization on its SSD and to generate massive slow/blocked requests for more than 60 seconds.

In the logs this is shown:

2018-10-16 21:07:15.112839 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) get_omap_iterator 29.223_head #29:c45aebf9:::rbd_header.0904ed238e1f29:head#
2018-10-16 21:07:15.112840 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) get_omap_iterator has_omap = 1
2018-10-16 21:07:15.112904 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22df502~f09 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113079 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e040b~efd from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113246 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e1308~fa2 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113414 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e22aa~f37 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113570 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e31e1~f5a from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113733 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e413b~ecc from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113897 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e5007~f84 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.114033 7f37d26f2700 10 bluefs _read_random h 0x55dd99ea9f00 0xfc0~fd0 from file(ino 201588 size 0x44a99d7 mtime 2018-10-16 11:22:53.020935 bdev 1 allocated 4500000 extents [1:0xe010800000+4500000])
...
...
...
2018-10-16 21:09:06.190232 7f37f091f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
2018-10-16 21:09:06.190259 7f37f011e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
...
...
2018-10-16 21:09:29.761930 7f37d26f2700 10 bluefs _read_random h 0x55ddf45bcc80 0x2adbe12~fc9 from file(ino 201517 size 0x44c87af mtime 2018-10-16 06:57:28.419862 bdev 1 allocated 4500000 extents [1:0xe66dd00000+4500000])
2018-10-16 21:09:29.762090 7f37d26f2700 10 bluefs _read_random h 0x55dd9422b400 0x2b812c2~fb6 from file(ino 69377 size 0x41b57a8 mtime 2018-07-10 15:29:54.444569 bdev 1 allocated 4200000 extents [1:0xe057600000+4200000])
2018-10-16 21:09:29.762166 7f37f091f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
...
2018-10-16 21:09:30.812418 7f37d26f2700 10 bluefs _read_random h 0x55ddeb9a3e00 0x8d6201~fc9 from file(ino 201734 size 0xd6c969 mtime 2018-10-16 21:02:55.145115 bdev 1 allocated e00000 extents [1:0xe0b8d00000+e00000])
2018-10-16 21:09:30.812603 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean] do_osd_op 29:c45aebf9:::rbd_header.0904ed238e1f29:head [omap-get-vals-by-keys]
2018-10-16 21:09:30.812618 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean] do_osd_op  omap-get-vals-by-keys
2018-10-16 21:09:30.812623 7f37d26f2700 15 bluestore(/var/lib/ceph/osd/ceph-240) omap_get_values 29.223_head oid #29:c45aebf9:::rbd_header.0904ed238e1f29:head#
2018-10-16 21:09:30.812669 7f37d26f2700 10 bluefs _read_random h 0x55dd99b3f980 0x4000b42~16ea88 from file(ino 201731 size 0x42c0c5f mtime 2018-10-16 20:41:03.911976 bdev 1 allocated 4300000 extents [1:0xe0b0100000+4300000])
2018-10-16 21:09:30.816931 7f37d26f2700 10 bluefs _read_random h 0x55ddf0180d80 0x40005d5~42bac6 from file(ino 201722 size 0x452f1f4 mtime 2018-10-16 20:02:42.995010 bdev 1 allocated 4600000 extents [1:0xdff2100000+4600000])
2018-10-16 21:09:30.830286 7f37d26f2700 10 bluefs _read_random h 0x55ddf0180d80 0x442c440~102d7f from file(ino 201722 size 0x452f1f4 mtime 2018-10-16 20:02:42.995010 bdev 1 allocated 4600000 extents [1:0xdff2100000+4600000])
2018-10-16 21:09:30.833395 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x400084c~a4673 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.835353 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x40a5260~102e98 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.838791 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22de541~fc1 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.839002 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) omap_get_values 29.223_head oid #29:c45aebf9:::rbd_header.0904ed238e1f29:head# = 0
...
...
2018-10-16 21:09:30.839387 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean]  dropping ondisk_read_lock
2018-10-16 21:09:30.839413 7f37d26f2700 10 osd.240 620465 dequeue_op 0x55de0be44540 finish
2018-10-16 21:09:30.839446 7f37d26f2700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60

The OSD in this case seems to be reading its whole database (20GB), scanning through all of it to find these omap keys.
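(The ~20GB figure can be cross-checked against the BlueFS perf counters, e.g. db_used_bytes in:)

ceph daemon osd.240 perf dump bluefs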

It could be triggered on this specific RBD header; other objects on the same OSD worked just fine.

We tested by stopping osd.240, which caused osd.183 to become the primary for PG 29.223. It did NOT suffer from the same problem; it returned the omap keys in just a few milliseconds.
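(That test amounts to roughly the following, run on the osd.240 host and a monitor respectively:)

systemctl stop ceph-osd@240
ceph pg map 29.223    # confirm osd.183 is now the acting primary
time rados -p rbd-ssd-c03-p02 listomapkeys rbd_header.0904ed238e1f29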

Doing the same with ceph-objectstore-tool was also very slow:

root@ceph37:~# time ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-240 rbd_header.0904ed238e1f29 list-omap
features
object_prefix
order
size
snap_seq
snapshot_0000000000000431

real    3m31.117s
user    0m19.016s
sys    0m8.524s
root@ceph37:~#

While doing so I could see the SSD at about 90% utilization, reading heavily.

Some information:

- Cluster deployed in September 2013 with Dumpling
- Upgraded from Jewel to Luminous in March
- OSDs migrated to BlueStore in March 2018
- Running a mixture of 12.2.4, 12.2.5 and 12.2.8
- Ubuntu 16.04.5
- Samsung PM863a 1.92TB SSDs

Offloading data from these OSDs to other OSDs made things better, but did not solve the problem.

I made an image of osd.240 with 'dd' in the state it was broken in. osd.240 is running again and is backfilling data to other OSDs (which is painfully slow), but the image allows us to wipe the OSD and mkfs it again at a later point.
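(Roughly along these lines; the device and destination path below are placeholders, the actual ones are not in this report:)

systemctl stop ceph-osd@240
dd if=/dev/sdX of=/backup/osd-240.img bs=4M status=progress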

On request I have available:

- Logs of osd.240
- Image of osd.240 (1.92TB in size)


Related issues (4: 0 open, 4 closed)

Related to bluestore - Bug #41213: BlueStore OSD taking more than 60 minutes to boot (Duplicate, 08/12/2019)
Has duplicate bluestore - Bug #34526: OSD crash in KernelDevice::direct_read_unaligned while scrubbing (Duplicate, 08/30/2018)
Copied to bluestore - Backport #40632: nautilus: High amount of Read I/O on BlueFS/DB when listing omap keys (Resolved, Igor Fedotov)
Copied to bluestore - Backport #42428: mimic: High amount of Read I/O on BlueFS/DB when listing omap keys (Rejected)