Bug #46969
Octopus OSDs deadlock with slow ops and make the whole cluster unresponsive
Status: open
Description
Hi,
I have another unpleasant bug to report.
Right after I upgraded my cluster to Octopus 15.2.4, I started to experience deadlocks that result in a lot of "slow ops" and make the whole cluster unresponsive until I restart some OSDs.
It happens every day or so; yesterday it happened twice. It's usually caused by one specific OSD, osd.7, and restarting that OSD is usually sufficient. However, yesterday evening it was a different OSD, and I ended up restarting the whole cluster.
The cluster has 3 SAS SSDs and 11 NVMe drives, with 1 OSD per SSD/NVMe. All drives look healthy according to SMART. Also, I'm using a configuration that seems to be a "bug bingo": EC 2+1 plus compression.
ceph daemon osd.7 dump_blocked_ops shows a number of blocked ops in the "queued for pg" state and one "started" operation (see attachment). Other OSDs also show a lot of blocked ops. Sometimes it's obvious that they're waiting for osd.7 (there's something like "waiting for sub ops from 7"), sometimes not.
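To compare dumps across OSDs, the output of ceph daemon osd.N dump_blocked_ops can be tallied by op state. Below is a minimal Python sketch, not taken from the attachment. It assumes the dump is JSON with a top-level "ops" list whose entries carry the state under "type_data" / "flag_point", and that waits on peers show up somewhere in the op as text like "waiting for subops from 7" (the exact wording may differ). The helper names fetch_blocked_ops and summarize are hypothetical.

```python
import json
import re
import subprocess
from collections import Counter

# Assumed wording; tolerates both "subops" and "sub ops".
SUBOP_RE = re.compile(r"waiting for sub ?ops from ([\d,\s]+)")


def fetch_blocked_ops(osd_id):
    """Run the admin-socket command from the report on the OSD's host."""
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_blocked_ops"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize(dump):
    """Count blocked ops per state and per peer OSD they appear to wait on."""
    states = Counter()
    waited_on = Counter()
    for op in dump.get("ops", []):
        type_data = op.get("type_data")
        if not isinstance(type_data, dict):
            type_data = {}  # assumed layout not present; state is unknown
        states[type_data.get("flag_point", "unknown")] += 1

        # Search the whole op so the event list is covered as well.
        peers = set()
        for match in SUBOP_RE.finditer(json.dumps(op)):
            peers.update(int(n) for n in re.findall(r"\d+", match.group(1)))
        # Each op counts at most once per peer OSD.
        waited_on.update(peers)
    return states, waited_on
```

Calling summarize(fetch_blocked_ops(7)) on the host that runs osd.7 returns two counters: one keyed by state (for example "queued for pg") and one keyed by the peer OSD id that ops appear to be waiting on. Running it against every OSD during an incident should show whether the waits converge on a single OSD.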
What other details should I provide so that you can start looking into this bug?
I now basically have to restart my Octopus cluster every day, which is pretty annoying :)
Files