Ceph : Issues - https://tracker.ceph.com/ - 2019-11-12T07:24:32Z
Ceph - Bug #42761 (New): systemd restarts OSD too fast after failure
https://tracker.ceph.com/issues/42761 - 2019-11-12T07:24:32Z - Wido den Hollander (wido@42on.com)
<p>While upgrading a cluster from Mimic to Nautilus I noticed that after a reboot the OSDs wouldn't start on almost all servers.</p>
<p>I checked journalctl and this showed me on all servers:</p>
<pre>
Nov 12 07:04:51 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:51 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2400]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2400]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2400]: 2019-11-12 07:04:52.146 7fc9a84f8dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2400]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:52 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:52 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2533]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2533]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2533]: 2019-11-12 07:04:52.709 7fb510d8cdc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2533]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:53 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:53 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:53 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY ceph-osd[2678]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:53 XXYY ceph-osd[2678]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:53 XXYY ceph-osd[2678]: 2019-11-12 07:04:53.323 7f0e1e0b5dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:53 XXYY ceph-osd[2678]: failed to fetch mon config (--no-mon-config to skip)
</pre>
<p>Here you can see that a start of the OSD was attempted at:</p>
<p>- Nov 12 07:04:51<br />- Nov 12 07:04:52<br />- Nov 12 07:04:53</p>
<p>The failure was due to the fact that the (IPv6) network was not yet online while the system was booting.</p>
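<p>As an operator-side workaround (a sketch only, not necessarily the fix this ticket asks for), one could space out the restart attempts and order the OSDs after network-online; the RestartSec value below is just an example:</p>
<pre>
# systemd drop-in for all OSD instances (values are examples)
mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/override.conf <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target

[Service]
RestartSec=20
EOF
systemctl daemon-reload
</pre>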
Ceph - Documentation #38253 (New): doc: Documentation about daemon restart with SELinux
https://tracker.ceph.com/issues/38253 - 2019-02-11T08:14:44Z - Wido den Hollander (wido@42on.com)
<p>ceph-selinux restarts all MON and OSD daemons on package upgrade.</p>
<p>See:<br />- <a class="external" href="https://tracker.ceph.com/issues/21672">https://tracker.ceph.com/issues/21672</a><br />- <a class="external" href="https://github.com/ceph/ceph/pull/25960">https://github.com/ceph/ceph/pull/25960</a></p>
<p>We should document that daemons are being restarted when upgrading this package.</p>
Ceph - Feature #38138 (Resolved): mgr: add get_latest_counter() to C++ -> Python interface.
https://tracker.ceph.com/issues/38138 - 2019-02-01T07:43:09Z - Wido den Hollander (wido@42on.com)
<p>This vastly improves performance of the Mgr when fetching performance counters.</p>
<p>This improves the performance of these modules:</p>
<p>- Influx<br />- Telegraf<br />- Prometheus</p>
<p>PR: <a class="external" href="https://github.com/ceph/ceph/pull/23978">https://github.com/ceph/ceph/pull/23978</a></p>
bluestore - Bug #36482 (Resolved): High amount of Read I/O on BlueFS/DB when listing omap keys
https://tracker.ceph.com/issues/36482 - 2018-10-17T11:54:10Z - Wido den Hollander (wido@42on.com)
<p>I don't know how best to describe this issue, but I've been observing various problems with Luminous 12.2.4 through 12.2.8 OSDs.</p>
<p>It started with the stupidallocator dumping messages as described here: <a class="external" href="http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030546.html">http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030546.html</a></p>
<pre>
2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd2078000~34000
2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd22cc000~24000
2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd2300000~20000
2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd2324000~24000
2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd26c0000~24000
2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
dump 0x15cd2704000~30000
</pre>
<p>After we offloaded data (CRUSH migration) to other OSDs these messages went away.</p>
<p>A few days later we observed OSDs utilizing their disks (Samsung PM863a 1.92TB SSDs) at ~90% and reading a lot from them, like this:</p>
<pre>
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 13.00 43.00 60.00 236.00 10.57 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 4.00 12.00 98.00 52.00 820.00 15.85 0.00 0.04 0.00 0.04 0.04 0.40
sdc 0.00 1.00 25.00 145.00 120.00 1336.00 17.13 0.04 0.24 0.00 0.28 0.12 2.00
sdd 0.00 7.00 27.00 117.00 112.00 1076.00 16.50 0.01 0.06 0.00 0.07 0.06 0.80
sde 0.00 0.00 6839.00 69.00 53464.00 388.00 15.59 0.93 0.13 0.14 0.12 0.13 92.80
sdf 0.00 0.00 17.00 102.00 192.00 756.00 15.93 0.00 0.03 0.24 0.00 0.03 0.40
sdg 0.00 0.00 11.00 89.00 44.00 516.00 11.20 0.02 0.20 0.00 0.22 0.12 1.20
sdh 0.00 0.00 11.00 121.00 68.00 676.00 11.27 0.01 0.06 0.00 0.07 0.06 0.80
sdi 0.00 11.00 0.00 5.00 0.00 64.00 25.60 0.00 0.80 0.00 0.80 0.80 0.40
</pre>
<p>/dev/sde is osd.246 in this case, and it is showing this in its logs:</p>
<pre>
2018-10-17 13:32:09.050155 7f54713d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60
2018-10-17 13:32:09.050160 7f54713d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54571b2700' had timed out after 60
2018-10-17 13:32:09.050714 7f54713d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60
2018-10-17 13:32:09.050719 7f54713d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54571b2700' had timed out after 60
2018-10-17 13:32:09.073876 7f5470bd6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f54531aa700' had timed out after 60
</pre>
<p>I increased this timeout from 15 to 60 seconds because the OSDs were committing suicide.</p>
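<p>For reference, the timeout in question appears to be osd_op_thread_timeout (that option name is an assumption on my part, based on the default of 15 seconds). A sketch of raising it at runtime on one OSD via the admin socket:</p>
<pre>
# assumption: osd_op_thread_timeout is the relevant option; osd.246 is the affected OSD
ceph daemon osd.246 config set osd_op_thread_timeout 60
</pre>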
<p>Setting debug_bluefs to 10 shows me:</p>
<pre>
2018-10-17 13:33:10.680782 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235c56c~ecf from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.680932 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235d43b~f75 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681075 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235e3b0~eb7 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681229 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x235f267~e8f from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681370 7f54571b2700 10 bluefs _read_random h 0x55b695996b80 0x3782d9f~fb8 from file(ino 214919 size 0x445eac0 mtime 2018-10-16 01:55:01.588751 bdev 1 allocated 4500000 extents [1:0xe3f4b00000+4500000])
2018-10-17 13:33:10.681523 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x23600f6~ea6 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681654 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2360f9c~edd from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681798 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2361e79~f57 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.681940 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2362dd0~ed1 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682088 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2363ca1~f7b from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682232 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2364c1c~f1c from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682393 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2365b38~f48 from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
2018-10-17 13:33:10.682551 7f54571b2700 10 bluefs _read_random h 0x55b695997100 0x2366a80~fdd from file(ino 211403 size 0x41af2a4 mtime 2018-10-15 18:41:40.702314 bdev 1 allocated 4200000 extents [1:0xe753f00000+4200000])
</pre>
<p>I spent a few hours with Igor from SUSE debugging this problem, as it is showing up on multiple OSDs.</p>
<p>We debugged with osd.240 (not 246) and we could trigger the problem using this command:</p>
<pre>
root@mon01:~# time rados -p rbd-ssd-c03-p02 listomapkeys rbd_header.0904ed238e1f29
features
object_prefix
order
size
snap_seq
snapshot_0000000000000431
real 2m16.678s
user 0m0.064s
sys 0m0.028s
root@mon01:~#
</pre>
<p>As you can see, it takes 2 minutes (!!) to list the omap values for this RBD header. In these two minutes the OSD becomes unusable and causes massive slow requests as it's simply blocking.</p>
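<p>For reference, a rough shell loop like the following can be used to time listomapkeys for every rbd_header object in the pool and show which OSDs are acting for it (sketch only; the pool name is taken from the example above, and listing all objects can itself be slow on a large pool):</p>
<pre>
pool=rbd-ssd-c03-p02                        # pool name from the example above
for obj in $(rados -p "$pool" ls | grep '^rbd_header\.'); do
    echo "== $obj =="
    ceph osd map "$pool" "$obj"             # shows the acting/primary OSDs
    time rados -p "$pool" listomapkeys "$obj" > /dev/null
done
</pre>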
<p>The image in this case:</p>
<pre>
root@mon01:~# rbd -p rbd-ssd-c03-p02 info fa2d5398-08de-4f9b-b519-28e6258bc5f3
rbd image 'fa2d5398-08de-4f9b-b519-28e6258bc5f3':
size 50GiB in 12800 objects
order 22 (4MiB objects)
block_name_prefix: rbd_data.0904ed238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
root@mon01:~#
</pre>
<pre>
root@mon01:~# ceph osd map rbd-ssd-c03-p02 rbd_header.0904ed238e1f29
osdmap e623992 pool 'rbd-ssd-c03-p02' (29) object 'rbd_header.0904ed238e1f29' -> pg 29.9fd75a23 (29.223) -> up ([240,183,176], p240) acting ([240,183,176], p240)
root@mon01:~#
</pre>
<p>Doing this causes the OSD (240) to jump to 90% utilization on its SSD and to cause massive slow/blocked requests for more than 60 seconds.</p>
<p>In the logs this is shown:</p>
<pre>
2018-10-16 21:07:15.112839 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) get_omap_iterator 29.223_head #29:c45aebf9:::rbd_header.0904ed238e1f29:head#
2018-10-16 21:07:15.112840 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) get_omap_iterator has_omap = 1
2018-10-16 21:07:15.112904 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22df502~f09 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113079 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e040b~efd from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113246 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e1308~fa2 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113414 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e22aa~f37 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113570 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e31e1~f5a from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113733 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e413b~ecc from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.113897 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22e5007~f84 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:07:15.114033 7f37d26f2700 10 bluefs _read_random h 0x55dd99ea9f00 0xfc0~fd0 from file(ino 201588 size 0x44a99d7 mtime 2018-10-16 11:22:53.020935 bdev 1 allocated 4500000 extents [1:0xe010800000+4500000])
...
...
...
2018-10-16 21:09:06.190232 7f37f091f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
2018-10-16 21:09:06.190259 7f37f011e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
...
...
2018-10-16 21:09:29.761930 7f37d26f2700 10 bluefs _read_random h 0x55ddf45bcc80 0x2adbe12~fc9 from file(ino 201517 size 0x44c87af mtime 2018-10-16 06:57:28.419862 bdev 1 allocated 4500000 extents [1:0xe66dd00000+4500000])
2018-10-16 21:09:29.762090 7f37d26f2700 10 bluefs _read_random h 0x55dd9422b400 0x2b812c2~fb6 from file(ino 69377 size 0x41b57a8 mtime 2018-07-10 15:29:54.444569 bdev 1 allocated 4200000 extents [1:0xe057600000+4200000])
2018-10-16 21:09:29.762166 7f37f091f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
...
2018-10-16 21:09:30.812418 7f37d26f2700 10 bluefs _read_random h 0x55ddeb9a3e00 0x8d6201~fc9 from file(ino 201734 size 0xd6c969 mtime 2018-10-16 21:02:55.145115 bdev 1 allocated e00000 extents [1:0xe0b8d00000+e00000])
2018-10-16 21:09:30.812603 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean] do_osd_op 29:c45aebf9:::rbd_header.0904ed238e1f29:head [omap-get-vals-by-keys]
2018-10-16 21:09:30.812618 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean] do_osd_op omap-get-vals-by-keys
2018-10-16 21:09:30.812623 7f37d26f2700 15 bluestore(/var/lib/ceph/osd/ceph-240) omap_get_values 29.223_head oid #29:c45aebf9:::rbd_header.0904ed238e1f29:head#
2018-10-16 21:09:30.812669 7f37d26f2700 10 bluefs _read_random h 0x55dd99b3f980 0x4000b42~16ea88 from file(ino 201731 size 0x42c0c5f mtime 2018-10-16 20:41:03.911976 bdev 1 allocated 4300000 extents [1:0xe0b0100000+4300000])
2018-10-16 21:09:30.816931 7f37d26f2700 10 bluefs _read_random h 0x55ddf0180d80 0x40005d5~42bac6 from file(ino 201722 size 0x452f1f4 mtime 2018-10-16 20:02:42.995010 bdev 1 allocated 4600000 extents [1:0xdff2100000+4600000])
2018-10-16 21:09:30.830286 7f37d26f2700 10 bluefs _read_random h 0x55ddf0180d80 0x442c440~102d7f from file(ino 201722 size 0x452f1f4 mtime 2018-10-16 20:02:42.995010 bdev 1 allocated 4600000 extents [1:0xdff2100000+4600000])
2018-10-16 21:09:30.833395 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x400084c~a4673 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.835353 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x40a5260~102e98 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.838791 7f37d26f2700 10 bluefs _read_random h 0x55ddaca32680 0x22de541~fc1 from file(ino 201593 size 0x41a812d mtime 2018-10-16 11:22:59.508274 bdev 1 allocated 4200000 extents [1:0xe048900000+4200000])
2018-10-16 21:09:30.839002 7f37d26f2700 10 bluestore(/var/lib/ceph/osd/ceph-240) omap_get_values 29.223_head oid #29:c45aebf9:::rbd_header.0904ed238e1f29:head# = 0
...
...
2018-10-16 21:09:30.839387 7f37d26f2700 10 osd.240 pg_epoch: 620465 pg[29.223( v 620465'88140843 (620465'88139287,620465'88140843] local-lis/les=620388/620389 n=3800 ec=45900/45900 lis/c 620388/620388 les/c/f 620389/620389/0 620386/620388/620388) [240,183,176] r=0 lpr=620388 crt=620465'88140843 lcod 620465'88140842 mlcod 620465'88140842 active+clean] dropping ondisk_read_lock
2018-10-16 21:09:30.839413 7f37d26f2700 10 osd.240 620465 dequeue_op 0x55de0be44540 finish
2018-10-16 21:09:30.839446 7f37d26f2700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f37d26f2700' had timed out after 60
</pre>
<p>The OSD in this case seems to be reading its whole database (20GB), scanning through all of it to find these OMAP keys.</p>
<p>It could be triggered on this specific RBD header; other objects worked just fine on the same OSD.</p>
<p>We tested by stopping osd.240, which caused osd.183 to become the primary for PG 29.223. It did NOT suffer from the same problem; it responded with the OMAP keys in just a few ms.</p>
<p>Trying to use ceph-objectstore-tool and do the same was also very slow:</p>
<pre>
root@ceph37:~# time ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-240 rbd_header.0904ed238e1f29 list-omap
features
object_prefix
order
size
snap_seq
snapshot_0000000000000431
real 3m31.117s
user 0m19.016s
sys 0m8.524s
root@ceph37:~#
</pre>
<p>While doing so I could observe that the SSD was at about 90% utilization and reading heavily.</p>
<p>Some information:</p>
<p>- Cluster deployed in September 2013 with Dumpling<br />- Upgraded from Jewel to Luminous in March<br />- OSDs migrated to BlueStore in March 2018<br />- Running a mixture of 12.2.4, 12.2.5 and 12.2.8<br />- Ubuntu 16.04.5<br />- Samsung PM863a 1.92TB SSDs</p>
<p>Offloading data from these OSDs to other OSDs made things better, but did not solve the problem.</p>
<p>I made an image with 'dd' of osd.240 in the broken state. osd.240 is running again and is backfilling data to other OSDs (which is painfully slow), but this allows us to wipe the OSD at a later point and mkfs it again.</p>
<p>On request I have available:</p>
<p>- Logs of osd.240<br />- Image of osd.240 (1.92TB in size)</p>
Ceph - Bug #23467 (Won't Fix): ceph-disk: Destroyed OSDs keeps old CRUSH weight if new device is ...
https://tracker.ceph.com/issues/23467 - 2018-03-27T07:14:25Z - Wido den Hollander (wido@42on.com)
<p>Before I go to the bug, let me explain where this is coming from.</p>
<p>In a cluster with 3.84TB Samsung SSDs an SSD failed. The Ceph operator asked the datacenter to replace the SSD and they did.</p>
<p>The employee in the datacenter made a mistake: Instead of replacing the SSD with a 3.84TB he installed a 960GB SSD in that slot.</p>
<p>The Ceph operator did not notice this mistake and he took the following steps:</p>
<p>- Destroy the old OSD with 'ceph osd destroy X'<br />- Prepare the new SSD as the old OSD ID using 'ceph-disk prepare --osd-id X /dev/sdX'</p>
<p>The OSD was added again, but it kept its weight of 3.48700 in CRUSH while the OSD is only 960GB large.</p>
<p>Backfilling started and Ceph went to HEALTH_OK. The admin thought everything was going just fine and let the cluster continue.</p>
<p>Suddenly the system went into HEALTH_ERR and I/O stopped because this OSD was 95% full. This is expected behavior for Ceph, but it was not expected here.</p>
<p>The admin then found out this OSD was only 960GB in size. The OSD was stopped and the problem was resolved. The datacenter later swapped the SSD again for a properly sized one.</p>
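<p>In such a situation, a possible manual workaround (sketch only; osd.X and the weight are examples) is to reset the CRUSH weight to match the new device. CRUSH weights are expressed in TiB, so roughly 0.87 for a 960GB SSD:</p>
<pre>
ceph osd crush reweight osd.X 0.87   # example weight for a 960GB device
ceph osd tree                        # verify the new weight
</pre>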
<p>To verify this behavior of Luminous I built a very simple Ceph cluster with a few OSDs:</p>
<pre>
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03918 root default
-3 0.01959 host alpha
0 ssd 0.00980 osd.0 up 1.00000 1.00000
3 ssd 0.00980 osd.3 up 1.00000 1.00000
-5 0.00980 host bravo
1 ssd 0.00980 osd.1 up 1.00000 1.00000
-7 0.00980 host charlie
2 ssd 0.00980 osd.2 up 1.00000 1.00000
</pre>
<p>In this case all the OSDs are 10GB running inside Virtual Machines.</p>
<p>I stopped and <strong>destroyed</strong> <em>osd.3</em> and re-added it with a 5GB disk.</p>
<p>The CRUSH tree however remained the same, but 'ceph osd df' shows the new OSD size:</p>
<pre>
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 ssd 0.00980 1.00000 10236M 1065M 9170M 10.41 0.88 68
3 ssd 0.00980 1.00000 5116M 1054M 4061M 20.61 1.74 60
1 ssd 0.00980 1.00000 10236M 1061M 9174M 10.37 0.88 128
2 ssd 0.00980 1.00000 10236M 1061M 9174M 10.37 0.88 128
TOTAL 35824M 4242M 31581M 11.84
MIN/MAX VAR: 0.88/1.74 STDDEV: 4.56
</pre>
<p>Take a close look: the <em>size</em> of the OSD is <strong>5116M</strong> while its weight is still <strong>0.00980</strong>.</p>
<p>Looking at the logs of the MON I see that the OSD ran the correct <em>crush create-or-move</em> with the proper weights:</p>
<pre>
2018-03-26 16:14:09.622816 7faf99315700 0 mon.alpha@0(leader).osd e28 create-or-move crush item name 'osd.3' initial_weight 0.0098 at location {host=alpha,root=default}
2018-03-26 16:17:11.380765 7f9ee734f700 0 mon.alpha@0(leader).osd e41 create-or-move crush item name 'osd.3' initial_weight 0.0049 at location {host=alpha,root=default}
</pre>
<p><em>initial_weight</em> is only used when the OSD is completely new, but it's not taken into account when the OSD is destroyed.</p>
<p>Shouldn't we accept <em>initial_weight</em> when the OSD is <strong>destroyed</strong> and when it's booted for the first time?</p>
Ceph - Bug #22784 (Fix Under Review): Key exported from auth keyring is not imported properly
https://tracker.ceph.com/issues/22784 - 2018-01-24T11:01:59Z - Wido den Hollander (wido@42on.com)
<p>Exporting a key whose caps contain a double quote (") to a file works, but importing this key again only appears to succeed: the mon caps are missing.</p>
<p>The resulting file contains:</p>
<pre>
[client.rbd]
key = AQA5R8dWPYvJCxAAvdhxx7UIKzFofSyRb4VrwQ==
caps mon = "allow r, allow command "osd blacklist""
caps osd = "allow class-read object_prefix rbd_children, allow rwx pool=rbd"
</pre>
<p>If you use this command:</p>
<pre>
ceph auth import -i client.rbd
</pre>
<p>The resulting key becomes:</p>
<pre>
client.rbd
key: AQA5R8dWPYvJCxAAvdhxx7UIKzFofSyRb4VrwQ==
caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=rbd
</pre>
<p>The 'mon' caps are lost in this case and that is a problem.</p>
<p>The double quotes are causing problems here.</p>
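<p>A possible workaround until the import is fixed (sketch): re-apply the full caps, including the quoted mon cap, directly with 'ceph auth caps' instead of relying on the imported file (caps strings taken from the keyring above):</p>
<pre>
ceph auth caps client.rbd \
    mon 'allow r, allow command "osd blacklist"' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd'
ceph auth get client.rbd   # verify both mon and osd caps are present
</pre>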
Ceph - Bug #21813 (Resolved): OSD bind to IPv6 link-local address
https://tracker.ceph.com/issues/21813 - 2017-10-16T13:31:19Z - Wido den Hollander (wido@42on.com)
<p>Just observed this behavior on a cluster when upgrading to Luminous:</p>
<pre>
osd.2 up in weight 1 up_from 175547 up_thru 175711 down_at 175546 last_clean_interval [175531,175545) [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6808/2302 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6828/1002302 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6828/1002302 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6829/1002302 exists,up 7bdbcb99-fd7f-4880-859a-9e54d26c96da
osd.5 up in weight 1 up_from 175700 up_thru 175712 down_at 175699 last_clean_interval [175527,175698) [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6800/1658 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6809/1001658 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6809/1001658 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6810/1001658 exists,up c3e13f69-43b6-4441-922b-aef5d2bfe262
osd.30 up in weight 1 up_from 175677 up_thru 175845 down_at 175676 last_clean_interval [175665,175675) [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6800/1662 [2a04:XXX:1:5:ec4:7aff:fe1e:3f3c]:6808/1001662 [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6808/1001662 [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6809/1001662 exists,up 3c1aeb5b-0ace-49cf-84f8-c85dfedd7c2f
</pre>
<p>In this case OSDs 2, 5 and 30 bound to a link-local IPv6 (fe80:XX:XX) address after they booted.</p>
<p>This is probably some form of race condition where the unicast 2a04:X address isn't online yet when the OSDs boot.</p>
<p>These fe80 addresses should however not qualify as addresses to bind to, as they can't be routed and thus break traffic.</p>
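<p>A quick way to spot affected OSDs and recover them once the unicast address is configured might look like this (sketch; the OSD id is an example):</p>
<pre>
ceph osd dump | grep fe80      # list OSDs that registered a link-local address
systemctl restart ceph-osd@5   # restart an affected OSD so it rebinds
</pre>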
Ceph - Bug #20430 (Resolved): crush_location_hook does not work with Luminous
https://tracker.ceph.com/issues/20430 - 2017-06-27T14:11:20Z - Wido den Hollander (wido@42on.com)
<p>Trying with the Luminous RC, it seems that the <strong>crush_location_hook</strong> isn't invoked.</p>
<p>The OSDs always seem to add themselves to root=default host=xyz</p>
<p>Looking at the source code the culprit seems to be in crush/CrushLocation.h</p>
<pre>
public:
CrushLocation(CephContext *c) : cct(c) {
update_from_conf();
}
int update_from_conf(); ///< refresh from config
int update_from_hook(); ///< call hook, if present
int init_on_startup();
std::multimap<std::string,std::string> get_location() {
std::lock_guard<std::mutex> l(lock);
return loc;
}
};
</pre>
<p>update_from_conf() is called in the constructor, but to me it seems that init_on_startup() should be called.</p>
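<p>For reference, a minimal location hook might look like the sketch below (path and rack naming are hypothetical); it is pointed to via 'crush location hook' in ceph.conf and must print the desired location as key=value pairs on stdout:</p>
<pre>
#!/bin/sh
# Hypothetical /usr/local/bin/ceph-crush-location, referenced in ceph.conf as:
#   crush location hook = /usr/local/bin/ceph-crush-location
echo "root=default rack=rack1 host=$(hostname -s)"
</pre>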
Ceph - Feature #18978 (New): ceph-disk: Implement easy replace of OSD
https://tracker.ceph.com/issues/18978 - 2017-02-17T18:58:53Z - Wido den Hollander (wido@42on.com)
<p>As discussed here: <a class="external" href="http://marc.info/?l=ceph-devel&m=148717797822608&w=2">http://marc.info/?l=ceph-devel&m=148717797822608&w=2</a></p>
<p>Maybe something like:</p>
<pre>ceph-disk prepare --replace --bluestore /dev/sdX</pre>
<p>This will then fetch the current OSD's UUID, wipe the whole disk, reformat it with BlueStore (if supplied) and start the OSD again.</p>
<p>This needs some work in the Monitors to allow the bootstrap-osd key to do this; this ticket is here to track that work.</p>
Ceph - Feature #18966 (New): Add mon_osd_down_out_subtree_max_osd
https://tracker.ceph.com/issues/18966 - 2017-02-17T06:06:43Z - Wido den Hollander (wido@42on.com)
<p>In addition to mon_osd_down_out_subtree_limit there should be a config option mon_osd_down_out_subtree_max_osd</p>
<p>It would limit the number of OSDs that will be marked out in the given subtree.</p>
<p>OSDs could be failing in a staggered way, and by limiting the number of OSDs that are allowed to be marked out in a subtree we can prevent a snowball effect where we still end up marking all OSDs except the last one as out.</p>
Ceph - Bug #18962 (Resolved): ceph-disk: Zap disk doesn't clear OSD journal data
https://tracker.ceph.com/issues/18962 - 2017-02-16T21:49:38Z - Wido den Hollander (wido@42on.com)
<p>When zapping a disk and re-creating an OSD on the same disk, there is a chance that the partition table will be identical to the one before.</p>
<p>If so, the journal might contain 'old' data from the previous OSD, which prevents the new OSD from starting afterwards.</p>
<p>Having ceph-disk write 10M to all partitions on the disk prior to zapping them should fix this.</p>
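<p>Done manually today, that would look roughly like this (sketch; replace sdX with the actual device):</p>
<pre>
for part in /dev/sdX[0-9]*; do
    dd if=/dev/zero of="$part" bs=1M count=10 oflag=direct
done
ceph-disk zap /dev/sdX
</pre>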
Ceph - Bug #18638 (Resolved): OSD metadata reports filestore when using bluestore
https://tracker.ceph.com/issues/18638 - 2017-01-23T19:29:06Z - Wido den Hollander (wido@42on.com)
<p>When testing with Kraken 11.2.0 I checked the metadata of my OSDs running BlueStore, but found out they report filestore as objectstore.</p>
<pre>root@alpha:~# ceph osd metadata 0
{
"id": 0,
"arch": "x86_64",
"back_addr": "[2001:db8::100]:6801\/3634",
"ceph_version": "ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)",
"cpu": "Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz",
"distro": "ubuntu",
"distro_description": "Ubuntu 16.04.1 LTS",
"distro_version": "16.04",
"front_addr": "[2001:db8::100]:6800\/3634",
"hb_back_addr": "[2001:db8::100]:6802\/3634",
"hb_front_addr": "[2001:db8::100]:6803\/3634",
"hostname": "alpha",
"kernel_description": "#50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016",
"kernel_version": "4.4.0-31-generic",
"mem_swap_kb": "522236",
"mem_total_kb": "500192",
"os": "Linux",
"osd_data": "\/var\/lib\/ceph\/osd\/ceph-0",
"osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
"osd_objectstore": "filestore"
}
root@alpha:~#</pre>
<p>It seems that the OSD uses the osd_objectstore configuration directive and reports that in the metadata.</p>
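<p>A quick way to confirm what an OSD is actually running is to read the 'type' file in its data directory rather than trusting the reported metadata (sketch; assumes the default data path):</p>
<pre>
cat /var/lib/ceph/osd/ceph-0/type   # prints "bluestore" or "filestore"
</pre>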
Ceph - Bug #18635 (Resolved): systemd restarts Ceph Mon too quickly after failing to start
https://tracker.ceph.com/issues/18635 - 2017-01-23T07:15:51Z - Wido den Hollander (wido@42on.com)
<p>In the systemd profile of the ceph-mon service RestartSec is <strong>not</strong> set, which makes it default to <em>100ms</em>.</p>
<p>On systems where IPv6 connectivity is still waiting for Duplicate Address Detection (DAD) or Router Advertisements (RA), the Monitor might fail to start.</p>
<p>A higher RestartSec value will mitigate this by giving the host time to bring the network up properly.</p>
Ceph - Feature #18400 (New): Have OSD mark itself down while still running
https://tracker.ceph.com/issues/18400 - 2017-01-03T13:03:14Z - Wido den Hollander (wido@42on.com)
<p>In some cases you might want to mark an OSD as down while keeping the daemon running, e.g. when you want to perform (network) maintenance on the machine it is running on.</p>
<p>Currently you can set the noup flag and then mark the OSD as down, but it would also be useful if you could set an OSD to non-active using the admin socket.</p>
<pre>ceph daemon osd.0 active false</pre>
<p>Now the OSD will mark itself as down while it keeps running. You can now safely perform maintenance.</p>
<p>A different use-case is where you want to run fstrim on the backing SSD, which will drive the OSD to 100% utilization for a few minutes. By marking it down you prevent I/O from going to it.</p>
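<p>For reference, the existing workaround mentioned above might look like this (sketch; osd.0 is an example):</p>
<pre>
ceph osd set noup      # prevent the OSD from being marked up again
ceph osd down 0        # mark the OSD down while the daemon keeps running
# ... perform the maintenance or fstrim ...
ceph osd unset noup    # allow the OSD to come back up
</pre>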
Ceph - Bug #17545 (Duplicate): Data corruption using RBD with caching enabled
https://tracker.ceph.com/issues/17545 - 2016-10-10T08:47:23Z - Wido den Hollander (wido@42on.com)
<p>This was reported on Launchpad, but I think it's better suited to be reported here: <a class="external" href="https://bugs.launchpad.net/mos/+bug/1627775">https://bugs.launchpad.net/mos/+bug/1627775</a></p>
<p>The situation is that when running Windows 2012R2 on top of RBD with caching enabled, Windows complains about page corruptions.</p>
<p>Tested with both Firefly and Hammer, it only happens on RBD-backed volumes with caching enabled. When the writeback cache is disabled the problem does NOT occur.</p>
<p>The issue is not reproducible on LVM/file based storage.</p>
<p>Steps to reproduce: run SQL Server on Windows 2012R2, or SQLioSim (a stress test utility emulating SQL Server)</p>
<p>Expected results: no errors</p>
<p>Actual result:<br />Expected FileId: 0x0<br />Received FileId: 0x0<br />Expected PageId: 0xCB19C<br />Received PageId: 0xCB19A (does not match expected)<br />Received CheckSum: 0x9F444071<br />Calculated CheckSum: 0x89603EC9 (does not match expected)<br />Received Buffer Length: 0x2000</p>
<p>Reproducibility: steadily reproducible with SQLioSim</p>
<p>As mentioned, the workaround currently is to disable RBD caching, but that kills the performance of the system completely.</p>
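<p>For reference, disabling the writeback cache on the client/hypervisor side might look like this (sketch; the section and path are the usual defaults, and running guests need to be restarted or migrated before it takes effect):</p>
<pre>
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd cache = false
EOF
</pre>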
<p>The issue has been reproduced using OpenStack on Ubuntu 12.04 and 14.04, but also on Proxmox. This hints towards an RBD issue and not so much a Qemu issue.</p>
<p>We still have to test this with the Jewel client (librbd) on the systems, but so far Firefly and Hammer have the same result.</p>