Bug #58022

Fragmentation score rising by seemingly stuck thread

Added by Kevin Fox 3 months ago. Updated 1 day ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Due to issue https://tracker.ceph.com/issues/57672 we've been monitoring our clusters closely to make sure our other clusters don't run into the same issue. We have a cluster running 16.2.9 that is showing weird/bad behavior.

We've noticed that some OSDs suddenly start increasing their fragmentation score at a constant rate until they are restarted. After a restart they settle down and reduce their fragmentation very slowly.

Talking with @Vikhyat a bit, the theory was that maybe compaction was kicking in repeatedly. We ran ceph_rocksdb_log_parser.py on one of the runaway OSDs and didn't see a significant number of compaction events during the period of runaway fragmentation, so that is unlikely to be the cause.
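For reference, the kind of tally the parser was used for can be sketched in a few lines. This is only a rough stand-in for ceph_rocksdb_log_parser.py (which does much more); the "compaction_started" event name is RocksDB's EVENT_LOG convention, and the sample lines are invented:

```python
# Rough stand-in for counting compaction events in a RocksDB LOG file.
# RocksDB emits EVENT_LOG_v1 JSON lines; compactions log "compaction_started".
def count_compactions(log_lines):
    return sum(1 for line in log_lines if "compaction_started" in line)

lines = [
    '2022/11/14-08:00:01 ... EVENT_LOG_v1 {"event": "compaction_started"}',
    '2022/11/14-08:00:05 ... EVENT_LOG_v1 {"event": "flush_started"}',
]
print(count_compactions(lines))  # 1
```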

Please see the attached screenshot. You can see the runaway OSDs climb over multiple days and then, when we restart them, level off and slowly decrease.

If it were workload related, we would expect fragmentation to keep rising after the restart as the workload continues. But the behavior stops immediately on restart, so it feels like some thread in the OSD is doing something unusual until it is restarted.

Screenshot from 2022-11-14 08-53-56.png (396 KB) Kevin Fox, 11/14/2022 05:03 PM

Screenshot from 2022-11-28 11-15-20.png - Newer picture, after I had just restarted the current batch of runaways. (436 KB) Kevin Fox, 11/28/2022 07:17 PM

Screenshot from 2023-01-18 09-11-04-1.png (437 KB) Kevin Fox, 01/18/2023 05:15 PM

Screenshot from 2023-01-20 09-02-13.png (185 KB) Kevin Fox, 01/20/2023 05:04 PM

Screenshot from 2023-01-20 09-36-41.png (219 KB) Kevin Fox, 01/20/2023 05:37 PM

Screenshot from 2023-01-25 09-50-11.png (232 KB) Kevin Fox, 01/25/2023 05:58 PM

Screenshot from 2023-01-25 09-50-41.png (228 KB) Kevin Fox, 01/25/2023 05:58 PM

osd.3.gz (554 KB) Kevin Fox, 01/27/2023 05:52 PM

osd.4.gz (501 KB) Kevin Fox, 01/27/2023 05:52 PM

History

#1 Updated by Neha Ojha 2 months ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)

#2 Updated by Kevin Fox 2 months ago

Saw this on 8 more osds over the weekend.

#3 Updated by Kevin Fox 2 months ago

11 more osds started doing this over the holiday weekend.

#4 Updated by Kevin Fox 2 months ago

Newer picture, after I had just restarted the current batch of runaways.

#5 Updated by Vikhyat Umrao about 2 months ago

Igor/Adam - "But the behavior stops immediately on restart. So it feels like some thread in the OSD is doing something unusual until restarted." This is pretty important information. How come a restart helps? Maybe we compact on restart?

Kevin - do these OSDs have a heavy delete workload?

#6 Updated by Kevin Fox about 2 months ago

These runaway OSDs are not on the heavy-delete-workload cluster. It's a relatively lightly loaded cluster, though I can't say it doesn't experience brief moments of busyness. I haven't kept that close an eye on it, so I'm not sure I could tell.

Is there a way to get a thread dump or something the next time we get a runaway one? That might help shed light on the issue.

#7 Updated by Kevin Fox about 2 months ago

Another 10 are doing it now.

#8 Updated by Kevin Fox 10 days ago

We ended up slowly reformatting all of our OSDs and re-adding them. Things settled out to a fragmentation score of < 0.20.

I checked today and several OSDs have been slowly but steadily rising over the last 3 weeks of data I have. So it looks like this may still be a problem, even with freshly formatted OSDs.

#9 Updated by Kevin Fox 10 days ago

Here's a picture.

#10 Updated by Kevin Fox 9 days ago

If I strace a runaway OSD, it shows up with 59 threads. If I strace one that is not runaway, it also shows up with 59 threads. So, no extra thread...
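(For anyone repeating this check: the thread count can also be read directly from /proc without attaching strace. A minimal sketch, assuming a Linux /proc layout:)

```python
import os

def thread_count(pid):
    """Count threads of a process by listing /proc/<pid>/task (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))

# Demonstrate on our own process; a ceph-osd pid would be passed instead.
print(thread_count(os.getpid()))
```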

#11 Updated by Igor Fedotov 9 days ago

Vikhyat Umrao wrote:

Igor/Adam - "But the behavior stops immediately on restart. So it feels like some thread in the OSD is doing something unusual until restarted." This is pretty important information. How come a restart helps? Maybe we compact on restart?

One potential explanation could be pretty trivial: the allocator keeps a sort of history (hints) of where to start the next allocation from, based on the previous allocation result(s). On restart this history is reset and allocations start over. Just a hypothesis...
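A toy illustration of that hypothesis (this is not BlueStore's actual allocator; the class and free-space layout below are invented for the example): a first-fit allocator that resumes scanning from a saved hint can keep carving space out of a fragmented tail region, while resetting the hint, as a restart effectively does, lets it reuse the large hole at the front.

```python
# Toy first-fit allocator with a "resume from last position" hint.
class ToyAllocator:
    def __init__(self, free_extents):
        # free_extents: list of (offset, length) tuples
        self.free = list(free_extents)
        self.hint = 0  # index at which to resume the first-fit scan

    def alloc(self, want):
        n = len(self.free)
        for i in range(n):  # scan from the hint, wrapping around once
            idx = (self.hint + i) % n
            off, length = self.free[idx]
            if length >= want:
                self.free[idx] = (off + want, length - want)
                self.hint = idx  # remember where space was found
                return off
        return None

# One big hole at the front, several tiny holes far behind it.
layout = [(0, 1 << 20)] + [(i * 4096, 4096) for i in range(1000, 1010)]

a = ToyAllocator(layout)
a.hint = 5            # stale hint pointing into the tiny-hole region
print(a.alloc(4096))  # carves from a tiny tail hole, not the big one

b = ToyAllocator(layout)  # "restart": hint reset to 0
print(b.alloc(4096))      # → 0 (uses the big hole at the front)
```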

#12 Updated by Igor Fedotov 9 days ago

As a first step I'd like to see allocation stats probes from the OSD logs. Here is an example:

2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) allocation stats probe 0: cnt: 1411 frags: 1411 size: 463032320
2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) probe -1: 0, 0, 0
2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) probe -2: 0, 0, 0
2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) probe -4: 0, 0, 0
2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) probe -8: 0, 0, 0
2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(/home/if/ceph.2/build/dev/osd0) probe -16: 0, 0, 0

This shows the number of allocation requests (cnt), the number of returned fragments (frags), and the total allocated size for the last day, the day before, and 2/4/8/16 days before.

These counters are reset on OSD restart. With a bunch of such probes we'll hopefully see how large the difference (in terms of fragmentation) is between allocations before and after a restart, as well as how things get worse over time without a restart...

So please go through the existing OSD logs and collect these allocation probes, along with information about OSD restarts; the relevant OSD log line looks like this (the actual version might vary):
2023-01-19T16:28:36.273+0300 7fab0fa05500 0 ceph version Development (no_version) reef (dev), process ceph-osd, pid 22499
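The probe lines can be collected mechanically. A minimal sketch (the regex is an assumption derived from the sample lines above, not an official format):

```python
import re

# Matches a probe line such as:
# "... allocation stats probe 0: cnt: 1411 frags: 1411 size: 463032320"
PROBE_RE = re.compile(
    r"allocation stats probe (?P<day>\d+): "
    r"cnt: (?P<cnt>\d+) frags: (?P<frags>\d+) size: (?P<size>\d+)"
)

def parse_probe(line):
    """Return (probe_no, cnt, frags, avg_alloc_size) or None."""
    m = PROBE_RE.search(line)
    if not m:
        return None
    cnt, frags, size = int(m["cnt"]), int(m["frags"]), int(m["size"])
    # frags/cnt > 1 would mean allocations are being split on average
    return (int(m["day"]), cnt, frags, (size / cnt if cnt else 0.0))

line = ("2023-01-20T16:28:41.464+0300 7fab001a66c0 0 bluestore(...) "
        "allocation stats probe 0: cnt: 1411 frags: 1411 size: 463032320")
print(parse_probe(line))
```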

#13 Updated by Kevin Fox 9 days ago

I can get some more, but here's an initial bit.

osd.4 has been running away for a long time (at least 2 weeks; based on the slope, maybe 3). osd.5 has not. Both were started at the same time:
ELAPSED CMD
23-15:15:09 /usr/bin/ceph-osd -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug
23-15:14:13 /usr/bin/ceph-osd -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug

$ journalctl -u | grep probe
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) allocation stats probe 22: cnt: 47025 frags: 47025 size: 51463442432
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -1: 47765, 47765, 51807432704
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -2: 51034, 51034, 54404734976
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -6: 48928, 48928, 51115958272
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -14: 36777, 36777, 45640888320
    Jan 19 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-20T01:40:46.719+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -22: 543961, 543961, 1790549340160
$ journalctl -u | grep probe
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) allocation stats probe 22: cnt: 48968 frags: 48968 size: 50902216704
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) probe -1: 44796, 44796, 51117088768
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) probe -2: 47106, 47106, 52133576704
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) probe -6: 47240, 47240, 48791560192
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) probe -14: 39224, 39224, 47587168256
    Jan 19 17:41:38 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-52103576: debug 2023-01-20T01:41:38.199+0000 7faacea1e700 0 bluestore(/var/lib/ceph/osd/ceph-5) probe -22: 529677, 529677, 1732095647744

#14 Updated by Kevin Fox 8 days ago

Here is a runaway one I restarted 2 days ago.

ELAPSED CMD
2-00:09:13 /usr/bin/ceph-osd -n osd.3 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug

Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) allocation stats probe 0: cnt: 54513 frags: 54513 size: 51361972224
Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -1: 0, 0, 0
Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -2: 0, 0, 0
Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -4: 0, 0, 0
Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -8: 0, 0, 0
Jan 19 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-19T17:25:36.555+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -16: 0, 0, 0
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) allocation stats probe 1: cnt: 48357 frags: 48357 size: 48756817920
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -1: 54513, 54513, 51361972224
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -3: 0, 0, 0
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -5: 0, 0, 0
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -9: 0, 0, 0
Jan 20 09:25:36 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-33620262: debug 2023-01-20T17:25:36.583+0000 7f638b0e3700 0 bluestore(/var/lib/ceph/osd/ceph-3) probe -17: 0, 0, 0

#15 Updated by Kevin Fox 6 days ago

Here's osd.4, which was still running away this morning. I just restarted it. These are the metrics from right before the restart; will get more soon.
[root@rcs3 ~]# journalctl -u | grep probe
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) allocation stats probe 25: cnt: 42495 frags: 42495 size: 45438136320
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -1: 42484, 42484, 44923281408
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -3: 47025, 47025, 51463442432
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -5: 51034, 51034, 54404734976
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -9: 48928, 48928, 51115958272
Jan 22 17:40:46 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-42099039: debug 2023-01-23T01:40:46.765+0000 7fe1350e5700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -25: 543961, 543961, 1790549340160

#16 Updated by Kevin Fox 4 days ago

After restart:
[root@rcs3 ~]# journalctl -u | grep probe
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) allocation stats probe 0: cnt: 44417 frags: 44417 size: 52236275712
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -1: 0, 0, 0
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -2: 0, 0, 0
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -4: 0, 0, 0
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -8: 0, 0, 0
Jan 24 08:57:33 rcs3.cloud.pnnl.gov ceph-b15015c8-af07-4973-b35d-28c3bfd2af22-osd-41524224: debug 2023-01-24T16:57:33.217+0000 7f1c9fb26700 0 bluestore(/var/lib/ceph/osd/ceph-4) probe -16: 0, 0, 0

#17 Updated by Adam Kupczyk 3 days ago

Hi Kevin,

I have two questions:
1) Is the rising fragmentation score related to a change in free space?
If no other method is available, free space can be read with the admin socket command "bluestore bluefs device info".
2) Would it be possible to collect a few snapshots of "bluestore allocator dump block" from osd.3 or osd.4
(the OSDs that showed increasing fragmentation that reversed trend after restart)?

The overall question I am targeting is whether the fragmentation calculation algorithm is too sensitive in some contexts.
It is possible that the algorithm behaves in the following way:
a) 10MB free / 1KB data / 490MB free = fragmentation 0.08
b) 250MB free / 1KB data / 250MB free = fragmentation 0.14
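(Ceph's actual scoring is more involved; a toy score that penalizes free space split into similar-sized pieces reproduces the direction of that example. The formula below is purely an illustration, not the BlueStore algorithm:)

```python
# Illustrative fragmentation score: 1 - sum(len_i^2) / total^2.
# One dominant free extent scores near 0; equal-sized pieces score higher,
# even though the total free space is the same in both cases.
MB = 1 << 20

def toy_score(free_extents):
    total = sum(free_extents)
    return 1.0 - sum(e * e for e in free_extents) / (total * total)

a = toy_score([10 * MB, 490 * MB])   # case a) above: skewed split
b = toy_score([250 * MB, 250 * MB])  # case b) above: even split
print(round(a, 3), round(b, 3))  # → 0.039 0.5
```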

Regards,
Adam

#18 Updated by Kevin Fox 3 days ago

For question 1, here are a couple of screenshots with consumed space and fragmentation added to both. Utilization is pretty constant in both cases.

I'll work on answering question 2.

#20 Updated by Kevin Fox 2 days ago

The patch implies that the calculation may be wrong? But why would the behavior change on restart?

Thanks,
Kevin

#21 Updated by Adam Kupczyk 2 days ago

Igor Fedotov wrote:

One potential explanation can be pretty trivial: allocator keeps tracking a sort of history(hints) where to start the next allocation from based on the previous allocation result(s). On restart this history is reset and allocations start over. Just a hypothesis...

Kevin Fox wrote:

The patch implies that the calculation may be wrong? But why would behavior change on restart?

Thanks,
Kevin

The calculation was wrong in the sense that 100% fragmentation was unachievable (part 1 of the PR).
I believed Igor's hypothesis enough that I decided to dampen the fragmentation calculation algorithm a bit (part 2 of the PR).

To prove it I need "bluestore allocator dump"s.

#22 Updated by Kevin Fox 1 day ago

Please find dumps attached.

#23 Updated by Kevin Fox 1 day ago

Just looking at the lengths, there are lots of pretty small ones:
[kfox@zathras tmp]$ jq '.extents[].length' osd.3 | sort | uniq -c | sort -k 1 -n | tail -n 4
3458 "0x20000"
4410 "0x70000"
9544 "0x1000"
55281 "0x10000"
[kfox@zathras tmp]$ jq '.extents[].length' osd.4 | sort | uniq -c | sort -k 1 -n | tail -n 4
2619 "0x20000"
4237 "0x70000"
11238 "0x1000"
44461 "0x10000"
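The same tally can be done without jq. A sketch, assuming the dump JSON has the {"extents": [{"length": "0x10000", ...}, ...]} shape the queries above suggest:

```python
import json
from collections import Counter

def length_histogram(dump_path):
    """Count free-extent lengths in a 'bluestore allocator dump' JSON file."""
    with open(dump_path) as f:
        dump = json.load(f)
    # Lengths appear as hex strings like "0x10000" (64 KiB).
    return Counter(int(e["length"], 16) for e in dump["extents"])

# Example with an inline dump instead of a file:
sample = {"extents": [{"length": "0x10000"}, {"length": "0x10000"},
                      {"length": "0x1000"}]}
hist = Counter(int(e["length"], 16) for e in sample["extents"])
print(hist.most_common(2))  # [(65536, 2), (4096, 1)]
```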