Octopus OSDs deadlock with slow ops and make the whole cluster unresponsive
I have another unpleasant bug to report.
Right after upgrading my cluster to Octopus 15.2.4, I started experiencing deadlocks that produce a lot of "slow ops" and make the whole cluster unresponsive until I restart some OSDs.
It happens every day or so; yesterday it happened twice. It's usually caused by one specific OSD, osd.7, and restarting it is usually sufficient. However, yesterday evening it was a different OSD, and I ended up restarting the whole cluster.
The cluster has 3 SAS SSDs + 11 NVMe drives, 1 OSD per drive, and all drives look healthy according to SMART. Also, I'm using a configuration that looks like a "bug bingo": EC 2+1 + compression.
ceph daemon osd.7 dump_blocked_ops shows a number of blocked ops in the "queued for pg" state and one "started" operation (see attachment). Other OSDs also show a lot of blocked ops; sometimes it's obvious that they're waiting for osd.7 (there's something like "waiting for sub ops from 7"), sometimes not.
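To make it easier to see which OSD the others are waiting on, the dump can be summarized with a small script. This is only a rough sketch: it assumes the JSON has a top-level "ops" list whose entries carry "type_data" with a "flag_point" string and an "events" list of {"event": ...} items, which is how the output looks to me on Octopus but may differ between versions. The function name summarize_blocked_ops is made up for this example.

```python
import re
from collections import Counter


def summarize_blocked_ops(dump):
    """Count blocked ops per state and per OSD they are waiting on.

    `dump` is the parsed JSON of `ceph daemon osd.N dump_blocked_ops`.
    The field names used here ("ops", "type_data", "flag_point",
    "events", "event") are assumptions about that JSON layout.
    """
    states = Counter()
    waiting_on = Counter()
    for op in dump.get("ops", []):
        type_data = op.get("type_data", {})
        flag_point = str(type_data.get("flag_point", "unknown"))
        states[flag_point] += 1

        # Collect every string that may mention "waiting for sub ops from".
        texts = [flag_point]
        for event in type_data.get("events", []):
            if isinstance(event, dict):
                texts.append(str(event.get("event", "")))

        # Use a set so one op counts at most once per OSD it waits on.
        osds = set()
        for text in texts:
            match = re.search(r"waiting for sub ops from ([\d,\s]+)", text)
            if match:
                osds.update(int(n) for n in re.findall(r"\d+", match.group(1)))
        for osd in osds:
            waiting_on[osd] += 1
    return states, waiting_on


# Usage (reads the dump from stdin):
#   import json, sys
#   states, waiting_on = summarize_blocked_ops(json.load(sys.stdin))
#   print(states.most_common(), waiting_on.most_common())
```

Running it against the dump of every OSD and comparing the waiting_on counters should show whether a single OSD (osd.7 in my case) is the common dependency.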
What other details should I provide to help start looking into this bug?
Right now I basically restart my Octopus cluster every day, which is pretty annoying :)
#1 Updated by Vitaliy Filippov 11 months ago
It seems the problem has gone away after removing the following non-default variables from the configuration:
#bluestore_prefer_deferred_size_ssd = 16384
#bluestore_sync_submit_transaction = true
#bdev_enable_discard = true
#bdev_async_discard = true
#bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=8,recycle_log_file_num=32,write_buffer_size=33554432,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
At least the cluster has been alive for several days without restarts. Before these changes it required manual intervention every day.
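For anyone trying to read the bluestore_rocksdb_options line above, here is a minimal sketch that splits it into key/value pairs and converts the sizes to MiB. The helper name parse_rocksdb_options is made up for this example, and the "upper bound" figure assumes the usual RocksDB meaning of max_write_buffer_number (maximum number of memtables held in memory per column family).

```python
def parse_rocksdb_options(value):
    """Split a comma-separated key=value option string into a dict."""
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        options[key.strip()] = val.strip()
    return options


# The exact value that was removed from the configuration.
OPTIONS = (
    "compression=kNoCompression,max_write_buffer_number=32,"
    "min_write_buffer_number_to_merge=8,recycle_log_file_num=32,"
    "write_buffer_size=33554432,writable_file_max_buffer_size=0,"
    "compaction_readahead_size=2097152"
)

MIB = 1024 * 1024
opts = parse_rocksdb_options(OPTIONS)
write_buffer_mib = int(opts["write_buffer_size"]) // MIB
readahead_mib = int(opts["compaction_readahead_size"]) // MIB
# Assumed upper bound on memtable memory: buffer size times buffer count.
max_memtable_mib = write_buffer_mib * int(opts["max_write_buffer_number"])

print(f"write buffer: {write_buffer_mib} MiB")
print(f"compaction readahead: {readahead_mib} MiB")
print(f"memtable upper bound: {max_memtable_mib} MiB")
```

So the removed settings allowed up to 32 write buffers of 32 MiB each, which is quite far from the defaults. I can't say whether that is related to the deadlock.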
#2 Updated by Vitaliy Filippov 11 months ago
Oops, sorry, there was one more change: I changed shards*threads from the default 2*8 to 1*16:
osd_op_num_threads_per_shard = 16
osd_op_num_shards = 1
It could also be the thing that helped.
I did it after looking at https://github.com/ceph/ceph/pull/36032/commits/51d3e7f4877b97717bce15e93f691f273da325df and seeing the word "wakeup" :) Where a wakeup is missing, there may be deadlocks too... :)