Bug #54434

open

HDD OSDs crashing after Pacific upgrade

Added by Maximilian Stinsky about 2 years ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In the last couple of months we upgraded all of our Ceph clusters from Nautilus to Pacific.
After upgrading our last cluster, which also hosts an S3 service on an EC pool backed by HDDs, we have the problem that every day a couple of those HDD OSDs crash.

What we can observe is that most of the time the OSDs crash in roughly the same timeframe every day.

As I said, this is only happening in one of our 5 clusters, and only on HDD OSDs.
We upgraded from 14.2.22 to 16.2.7 and the upgrade is completely finished; no open tasks are left from the manual upgrade guide.

The log message that seems to be the reason for the OSDs crashing is `1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd2a1f44700' had timed out after 15.000000954s`.
We see a lot of those messages until the cluster removes the OSD; then everything goes back to normal and the OSD rejoins the cluster as healthy.

I attached a small part of the log from one crashing OSD. The log shows timestamps around 11:25, but the problem for that specific OSD started at 11:08 and the same pattern repeated for a couple of minutes until it joined the cluster again.
The issue we are seeing always lasts for around 10-20 minutes, causing slow ops in the cluster and affecting several OSDs in that timeframe. The OSDs seem to fail in a serial manner.
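For anyone who wants to check their own cluster for the same pattern, the timeouts and the resulting down-markings can be found with something like this (log paths are the packaging defaults, adjust as needed):

~# grep -F "OSD::osd_op_tp thread" /var/log/ceph/ceph-osd.*.log
~# grep "mark osd" /var/log/ceph/ceph-mon.*.log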


Files

osd-log.csv (814 KB) - Maximilian Stinsky, 03/01/2022 11:15 AM
Actions #1

Updated by Dan van der Ster about 2 years ago

It's a long shot, but since you said it happens at the same time daily, it makes me think of https://tracker.ceph.com/issues/54313
Can you reproduce these osd thread timeouts by running `ceph tell osd.0 smart` ?

Otherwise, how busy is the disk while the osd is stuck like that? If it's pinned to 100% in iostat, maybe you need to compact the OSDs?
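For example, something along these lines on the OSD host while it is stuck (just a sketch, osd.0 is a placeholder id):

~# iostat -x 1                  # watch %util of the OSD's data device
~# ceph tell osd.0 compact      # online RocksDB compaction of that OSD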

Actions #2

Updated by Maximilian Stinsky about 2 years ago

We tested `ceph tell osd.${id} smart` and this does not trigger the problem.

Looking into the metrics, we don't see anything suspicious in iostat for the failing OSDs; the only thing is that CPU usage for those OSDs rises quite a lot, which is most likely expected when threads start to time out.
After digging through a couple of mailing list entries and bug reports, we are going to try compacting all of the HDD OSDs to see if this has any effect on the cluster.

One thing we also found is that since our upgrade to Pacific we see a lot of messages like `1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f1706267700' had timed out after 0.000000000s` in the ceph-mons. Not sure whether this is related to the issue or not.

Actions #3

Updated by Maximilian Stinsky about 2 years ago

So we compacted all HDD OSDs via `ceph tell osd.${id} compact`, but today we still have the same issue with crashing HDD OSDs.

We are not sure what else we could try to get rid of this problem, or how to gather more information.
Any ideas?

Actions #4

Updated by Boris B about 2 years ago

We have the same issue with a heavily loaded S3 cluster after upgrading from Nautilus to Octopus.
We are still searching for the cause.

What we've temporarily tried so far:
  • disable swap
  • more swap
  • disable bluefs_buffered_io
  • disable write cache for all disks
  • disable scrubbing
What we permanently changed:
  • reinstall with new OS (from centos7 to ubuntu 20.04)
  • disable cluster_network (so there is only one way to communicate)
  • increase txqueuelen on the network interfaces
Some facts about the cluster:
  • Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
  • Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
  • We only use the frontend network.
  • All disks are spinning, some have block.db devices.
  • All disks are bluestore
  • configs are mostly defaults
  • we've set the OSDs to restart=always without a limit, because we had problems with unavailable PGs when two OSDs that share PGs were marked offline (a systemd sketch follows this list)
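The restart=always change from the list is just a systemd drop-in, roughly like this (sketch; the exact file path and option placement may differ between systemd versions):

# /etc/systemd/system/ceph-osd@.service.d/override.conf
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always

followed by a `systemctl daemon-reload`.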

Currently we are redoing all OSDs by syncing them out, wiping the disks, removing the OSDs from the crushmap, and re-adding them without the block.db SSDs (10 OSDs share one SSD, which is a huge blast radius if that SSD fails; also there is not a lot of traffic, so we just go with "all spinning").

But this also does not help, as we just had an OSD flap on a host where all disks were already redone. The OSD that was restarted is still backfilling.

Doing a `grep -F 'OSD::osd_op_tp thread' /var/log/ceph/ceph-osd.*.log` reveals that three OSDs had this problem but only one was marked as offline three times and got restarted by systemd (as we've set systemd to restart always)

Actions #5

Updated by Maximilian Stinsky about 2 years ago

I just wanted to add that we also did not find any fix yet.

At the moment we are also redeploying every HDD OSD, but it will take a lot of time until we are finished with this.

What I wanted to add is that we definitely see a daily rhythm in the failures, which leads us to believe that this has to do with a scheduled job in Ceph.
Every morning around ~6am the problem starts: we see thread timeout errors at e.g. 6:20, 7:20 and 8:20, then it suddenly stops and starts again at around 12:20, 13:20 and 14:20, and most of the time it stops at 15:20 and does not fail again until the next morning.
Does anyone know whether there are any new scheduled tasks in Ceph Pacific compared to Nautilus, or which scheduled task could cause this?

Actions #6

Updated by Maximilian Stinsky about 2 years ago

I think we just found the issue.
We can trigger the problem by issuing `radosgw-admin gc list`, so it's related to garbage collection in RGW.

After we understood this we instantly found the following mailing list post: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/BRWF2CCF2NYQGLMEFSHVF4RIJTYIFT6H/

As Boris hits the same problem on Octopus, something in the OSD code seems to have changed since Nautilus that makes the threads time out when querying or removing bulk data.
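For reference, this is the trigger plus the GC-related tunables that look relevant (just a sketch of what we are inspecting, not settings we have changed):

~# radosgw-admin gc list --include-all | head -20   # this alone triggers the OSD thread timeouts for us
~# ceph config get client.rgw rgw_gc_max_objs       # number of GC shard objects
~# ceph config get client.rgw rgw_gc_processor_period
~# ceph config get client.rgw rgw_gc_obj_min_wait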

Actions #7

Updated by Boris B about 2 years ago

We are currently in the process of reinstalling all OSDs without block.db, and even when they are freshly synced in they tend to flap.
Maybe it really is the GC, but how do we work around it?

Actions #8

Updated by Boris B about 2 years ago

So, we've seen the same problem on a nearly fresh Octopus cluster with 12x8TB disks without block.db.

How can I generate more debug output to help get this issue fixed?
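Unless someone suggests better flags, what I can do myself is temporarily raise the OSD debug levels around the window where the flapping happens (sketch; osd.0 is a placeholder):

~# ceph tell osd.0 config set debug_osd 10
~# ceph tell osd.0 config set debug_bluestore 10
# ...wait for the timeouts to happen, then revert to the defaults
~# ceph tell osd.0 config set debug_osd 1/5
~# ceph tell osd.0 config set debug_bluestore 1/5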

Actions #9

Updated by Simon Stephan almost 2 years ago

Hi,

I am working on the same cluster as Maximilian and we are also still facing the issue.
We compared configurations and the things we tested and tried with Boris; here is a summary of what we did on our cluster so far (a rough command-level sketch follows the list):

  • manual compaction of each OSD
  • redeployment of each OSD
  • increased the BlueStore cache size from 2 to 5 GB
  • removed `bluestore cache autotune = 0` and switched to `osd memory target` to enable cache autotune
  • lowered the garbage collection times to reduce the amount of GC for each run
  • moved the rgw.buckets.index pool to SSD devices
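Roughly, the config side of the list above looked like this (the values, pool name and CRUSH rule name here are only examples):

~# ceph config set osd osd_memory_target 5368709120                       # ~5 GB per OSD
~# ceph config rm osd bluestore_cache_autotune                            # drop the old autotune=0 override
~# ceph config set client.rgw rgw_gc_processor_period 1800                # example value for shorter GC runs
~# ceph osd pool set default.rgw.buckets.index crush_rule replicated-ssd  # example rule name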

Unfortunately none of that changed the behavior of the problem, and we still have the timeouts with restarting OSDs on a daily basis.

We are also running our OSDs with SSDs for the block.db: 3x 12TB HDDs share one 1.8TB SSD that is sliced into 580GB for each HDD. Each OSD host has 12 HDDs and 4 SSDs installed.

What leads us to the assumption that this is most likely a bug or a performance regression is that we were running the exact same configuration on Nautilus without the issue, and our usage pattern has not changed since then.

Actions #10

Updated by Simon Stephan almost 2 years ago

Hi,
We think this might be caused by https://tracker.ceph.com/issues/55324
As this should be fixed in 16.2.8, we plan to upgrade our clusters. We have already upgraded our lab environment, but as we never saw the problem there, we can confirm our hypothesis in two weeks at the earliest.

Actions #11

Updated by Simon Stephan almost 2 years ago

We have now also upgraded our production cluster to 16.2.9; unfortunately that did not change anything about the issue.
We still see a lot of `1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd2a1f44700' had timed out after 15.000000954s` messages and our OSDs are also still restarting on a daily basis.

Actions #12

Updated by Boris B over 1 year ago

So, we have moved further:

We added some SSDs to the cluster and moved all pools except the data pools to them, and now the SSDs are "flapping".

~# grep "mark osd" /var/log/ceph/ceph-mon.s3db1.log
2022-07-29T00:26:57.496+0000 7fb9de667700  1 mon.s3db1@0(leader).osd e642120  we have enough reporters to mark osd.268 down
2022-07-29T00:29:23.542+0000 7fb9de667700  1 mon.s3db1@0(leader).osd e642124  we have enough reporters to mark osd.268 down
2022-07-29T00:30:33.905+0000 7fb9de667700  1 mon.s3db1@0(leader).osd e642128  we have enough reporters to mark osd.239 down
2022-07-29T00:54:44.511+0000 7fb9de667700  1 mon.s3db1@0(leader).osd e642133  we have enough reporters to mark osd.268 down

This seems to happen when our backup center removes large portions of data from the S3 cluster (we store RBD snapshots in the cluster).

I have the feeling that the GC pool is responsible for that.
I am going to migrate the GC pool back to our HDDs and check whether something else happens.

In terms of GC: we have 15 RGW daemons running. Three of them are dedicated to GC, in case the other RGW daemons are busy handling traffic.
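For context, such a split is usually done with `rgw_enable_gc_threads`, roughly like this in ceph.conf (the section names are made up):

# traffic-serving RGWs: do not run GC threads
[client.rgw.traffic1]
rgw_enable_gc_threads = false

# dedicated GC daemons: keep the default (GC threads enabled)
[client.rgw.gc1]
rgw_enable_gc_threads = true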

Actions #13

Updated by Simon Stephan over 1 year ago

I just found this bug report (https://tracker.ceph.com/issues/53585) for Octopus, which matches our issue; they also confirmed that they still see it in 16.2.9.
So I think we have a duplicate here, but unfortunately there is also no fix in sight.

Actions #14

Updated by Maximilian Stinsky about 1 year ago

We just upgraded our cluster to 16.2.11 and the issue is still present. We still have crashing OSDs every day.

Actions #15

Updated by xiaobao wen about 1 year ago

Maximilian Stinsky wrote:

We just upgraded our cluster to 16.2.11 and the issue is still present. We still have crashing OSDs every day.

Could an OSD config option like `--osd-op-thread-timeout` solve your problem?
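e.g. something like this (the defaults are 15s for the op thread timeout and 150s for the suicide timeout; raising them only hides the symptom, it does not fix the underlying slowness):

~# ceph config set osd osd_op_thread_timeout 60
~# ceph config set osd osd_op_thread_suicide_timeout 300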

Actions #16

Updated by Igor Fedotov about 1 year ago

Given that the slow request warning in the original log snippet mentions "call rgw_gc.rgw_gc_queue_list_entries", e.g.

"Mar 1, 2022 @ 11:24:06.000","2022-03-01T10:24:06.696+0000 7fd2baf76700 0 log_channel(cluster) log [WRN] : slow request osd_op(client.2167284078.0:4423109 11.29 11:954be8d2:gc::gc.5:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=10 ondisk+retry+read+known_if_redirected e567925) initiated 2022-03-01T10:23:53.691219+0000 currently started"

the following ticket might be relevant too: https://tracker.ceph.com/issues/58190

It's still pending backport to Pacific, though.

Actions #17

Updated by Maximilian Stinsky 7 months ago

Greetings.

We upgraded our Ceph cluster to 16.2.14, which included the backport of https://github.com/ceph/ceph/pull/49313 that @Igor Gajowiak thought might help with the issue here.
And we are quite certain now that this indeed fixed it. Since we finished the upgrade on Tuesday, Oct 10, we have not had a single OSD fail with heartbeat timeouts.

A side question regarding this issue: I saw some people on the mailing list with what might be the same issue, as well as @Boris here in this ticket.
Could we have done anything differently, e.g. provided more logs or other details, to get more focus on this?
We were living with daily crashing OSDs for over 1.5 years, and this only got fixed in the last release of Pacific.

Actions #18

Updated by Boris B 7 months ago

Indeed,

we just rolled out the latest Pacific release two weeks ago, and since then everything has been stable.
