Bug #54434: hdd osd's crashing after pacific upgrade - Ceph - Ceph

Actions

Copy link

Bug #54434

open

hdd osd's crashing after pacific upgrade

Added by Maximilian Stinsky about 2 years ago. Updated 7 months ago.

Status:

New

Priority:

Normal

Assignee:

Category:

OSD

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

v16.2.7

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

In the last couple of month we upgraded all of our ceph clusters from nautilus to pacific.
After upgrading our last cluster which is also hosting an s3 service on an ec pool which is backed by hdd's we have the problem that everyday a couple of those hdd osd's are crashing.

What we can observe is that most of the time the osd's crash in roughly the same timeframe everyday.

As I said this is only happening in one of our 5 clusters and is only happening on hdd osds.
We upgraded from 14.2.22 to 16.2.7 and the upgrade is completely finished no open tasks are left from the manual upgrade guide.

The log message that seems to be the reason for osds crashing is `1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd2a1f44700' had timed out after 15.000000954s`
We see a lot of those messages until the cluster removes the osd and everything goes back to normal and the osd rejoins the cluster as healthy.

I attached a small part of one osd log that is crashing. The log shows timestamps of around 11:25 but the problem for that specific osd started at 11:08 and repeated the same pattern for a couple of minutes until joining the cluster again.
The issue we are seeing always lasts for around 10-20 minutes causing slow ops in the cluster and effecting several osd's in that timeframe. It seems that failing osd's happening in a serial manner.

Files

osd-log.csv (814 KB) osd-log.csv

Maximilian Stinsky, 03/01/2022 11:15 AM

Actions

Copy link

Updated by Dan van der Ster about 2 years ago

It's a long shot, but since you said it happens at the same time daily, it makes me think of https://tracker.ceph.com/issues/54313
Can you reproduce these osd thread timeouts by running `ceph tell osd.0 smart` ?

Otherwise, how busy is the disk while the osd is stuck like that? If it's pinned to 100% in iostat, maybe you need to compact the OSDs?

Actions

Copy link

Updated by Maximilian Stinsky about 2 years ago

We tested `ceph tell osd.${id} smart` and this does not trigger the problem.

Looking into metrics we dont see any suspicious iostat for the failing osds - the only thing is that cpu usage for those osds rises quite a lot which is most likely expected when threads start to timeout.
After digging through a couple of mailing list entries and bug reports we are going to try and compact all of the hdd osd's to see if this has any effect on the cluster.

One thing that we also found is that we see a log of messages like `1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f1706267700' had timed out after 0.000000000s` in the ceph-mons since our upgrade to pacific. Not sure if this is related to the issue or not.

Actions

Copy link

Updated by Maximilian Stinsky about 2 years ago

So we compacted all hdd osd's via `ceph tell osd.${id} compact`, but today we still have the same issue with crashing hdd osd's.

We are not sure what else we could try to get rid of this problem or what to try to get more information.
Any ideas?

Actions

Copy link

Updated by Boris B about 2 years ago

We have the same issue with a heavy loaded s3 cluster after upgrading from nautilus to octopus.
We are still searching for the cause.

What we've temporary tried until now:

disable swap
more swap
disable bluefs_buffered_io
disable write cache for all disks
disable scrubbing

What we permanently changed:

reinstall with new OS (from centos7 to ubuntu 20.04)
disable cluster_network (so there is only one way to communicate)
increase txqueuelen on the network interfaces

Some facts about the cluster:

Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
We only use the frontend network.
All disks are spinning, some have block.db devices.
All disks are bluestore
configs are mostly defaults
we've set the OSDs to restart=always without a limit, because we had the problem with unavailable PGs when two OSDs are marked as offline and the share PGs.

Currently we are redoing all OSDs, by syncing them out, wipe the disk, remove the OSDs from crushmap and readd them without the block.db SSDs (10 OSDs share one SSD and this is a huge blast radius if an SSD fails. Also there is not a lot of traffic so we just go with "all spinning")

But this also does not help, as we just had an OSD on a hosts where all disks where redone. The disk that was restarted is still backfilling.

Doing a `grep -F 'OSD::osd_op_tp thread' /var/log/ceph/ceph-osd.*.log` reveals that three OSDs had this problem but only one was marked as offline three times and got restarted by systemd (as we've set systemd to restart always)

Actions

Copy link

Updated by Maximilian Stinsky about 2 years ago

I just wanted to add that we also did not find any fix yet.

At the moment we are also redeploying every hdd osd but this takes a lot of time until we are finished with this.

What I wanted to add is that we definitely have a rhythm in the failure every day which leads us to believe that this has to do we a scheduled job in ceph.
Every morning around ~6am the problem starts we see thread timeout error like at 6:20, 7:20, 8:20 then it suddenly stops and starts at around 12:20, 13:20, 14:20 and most of the time stops at 15:20 and does not fail until the next morning again.
Does anyone know if there are any new scheduled tasks in ceph pacific from nautilus or what scheduled task this may cause.

Actions

Copy link

Updated by Maximilian Stinsky about 2 years ago

I think we just found the issue.
We can trigger the problem by issuing `radosgw-admin gc list` so its related to garbage collection of the rgw.

After we understood this we instantly found the following mailing list post: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/BRWF2CCF2NYQGLMEFSHVF4RIJTYIFT6H/

As Boris also hits the same problem on octopus something in the osd code seems to have changed from nautilius which makes the threads timeout when querying or removing bulk data.

Actions

Copy link

Updated by Boris B about 2 years ago

We are currently in the process of reinstalling all OSDs without block.db and even when they are freshly synced in they tend to flap.
Maybe it is really the GC, but how to work around it?

Actions

Actions

Copy link

#16

Updated by Igor Fedotov about 1 year ago

Given that slow request warning in the original log snippet mentions "call rgw_gc.rgw_gc_queue_list_entries", e.g.

"Mar 1, 2022 @ 11:24:06.000","2022-03-01T10:24:06.696+0000 7fd2baf76700 0 log_channel(cluster) log [WRN] : slow request osd_op(client.2167284078.0:4423109 11.29 11:954be8d2:gc::gc.5:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=10 ondisk+retry+read+known_if_redirected e567925) initiated 2022-03-01T10:23:53.691219+0000 currently started"

the following ticket might be relevant too: https://tracker.ceph.com/issues/58190

It's still pending backport to Pacific though..

Actions

Copy link

#17

Updated by Maximilian Stinsky 7 months ago

Greetings.

We upgraded our ceph cluster to 16.2.14 which included the backport for https://github.com/ceph/ceph/pull/49313 that @Igor Gajowiak thought might help with the issue here.
And we are quite certain now that this indeed fixed it. Since we finished the upgrade on tuesday the 10.oct we did not have a single failed osd with heartbeat timeouts.

Side question regarding this issue. I saw some people on the mailing list with what might be the same issue as well as @Boris here in this ticket.
Could we have done anything different, e.g. provide more logs or any other details to get more focus on this?
We were living with daily crashing osd's for over 1,5y and this only got fixed in the last release of pacfic.

Actions

Copy link

#18

Updated by Boris B 7 months ago

Indeed,

we just rolled out latest pacific two weeks ago, and since then everything is stable.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph

Custom queries

Bug #54434

hdd osd's crashing after pacific upgrade

Updated by Dan van der Ster about 2 years ago

Updated by Maximilian Stinsky about 2 years ago

Updated by Maximilian Stinsky about 2 years ago

Updated by Boris B about 2 years ago

Updated by Maximilian Stinsky about 2 years ago

Updated by Maximilian Stinsky about 2 years ago

Updated by Boris B about 2 years ago

Updated by Boris B about 2 years ago

Updated by Simon Stephan almost 2 years ago

Updated by Simon Stephan almost 2 years ago

Updated by Simon Stephan almost 2 years ago

Updated by Boris B over 1 year ago

Updated by Simon Stephan over 1 year ago

Updated by Maximilian Stinsky about 1 year ago

Updated by xiaobao wen about 1 year ago

Updated by Igor Fedotov about 1 year ago

Updated by Maximilian Stinsky 7 months ago

Updated by Boris B 7 months ago