OSD used for Metadata / MDS storage constantly entering heartbeat timeout
After our stress test creating 100,000,000 small files on cephfs, and now finally deleting all those files, now 2 of the 4 OSDs crash continously.
They enter heartbeat timeouts and are finally killed.
The other 2 of the 4 OSDs (replica 4) were recreated and backfilled during the deletion process.
# ceph osd df | head ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 ssd 0.21829 1.00000 223G 4692M 218G 2.05 1.74 128 1 ssd 0.21829 1.00000 223G 4218M 219G 1.84 1.56 128 2 ssd 0.21819 1.00000 223G 12007M 211G 5.25 4.46 128 3 ssd 0.21819 1.00000 223G 13314M 210G 5.82 4.94 128
osd.0 and osd.1 have been backfilled and are running stable, but osd.2 and osd.3 are affected by the issue.
They shortly managed to synchronize and cluster was healthy at some point, so they all contain the same "information".
Main difference is that osd.2 and osd.4 have lived through the mess of 100,000,000 files created and deleted,
while osd.0 and osd.1 are still rather fresh.
I have uploaded a debug log with log level 20 for osd.3 here:
It after 2018-02-27 21:00, it contains several of those crashes with debug level 1, while the last one I captured with debug level 20.
Basically, I just see many:
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0xXXXXXXXXX' had timed out after 15
before a suicide timeout and abort.
While I could now (and likely will) just recreate those OSDs and backfill them from the more healthy ones,
I hope the information collected in this ticket and log will help to solve the underyling issue.
#1 Updated by Oliver Freyermuth almost 3 years ago
To clarify, since this was maybe not clear from the ticket text:
Those 4 OSDs were only used for the metadata pool.
They are full SSDs, bluestore. osd.0 and osd.1 are made with ceph-volume, osd.2 and osd.3 still with ceph-disk, but this should not really matter.
#2 Updated by Oliver Freyermuth almost 3 years ago
It also seems:
# ceph osd df | head ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 ssd 0.21829 1.00000 223G 4638M 219G 2.03 1.73 128 1 ssd 0.21829 1.00000 223G 4169M 219G 1.82 1.55 128 2 ssd 0.21819 1.00000 223G 11983M 211G 5.24 4.46 0 3 ssd 0.21819 1.00000 223G 13276M 210G 5.80 4.94 0
storage usage is slowly dwindling across the crashes. Maybe this is RocksDB compaction or something like that, which is so heavy that the heartbeats can not be met,
and waiting a long time and many crashes things will heal?
#4 Updated by Oliver Freyermuth almost 3 years ago
After a night of waiting and some million files written and deleted, the sitation has stabilized.
The space-usage pattern:
# ceph osd df | head ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 ssd 0.21829 1.00000 223G 2644M 220G 1.16 1.02 128 1 ssd 0.21829 1.00000 223G 2632M 220G 1.15 1.02 128 2 ssd 0.21819 1.00000 223G 9478M 214G 4.14 3.67 128 3 ssd 0.21819 1.00000 223G 10475M 213G 4.58 4.06 128
still looks a bit strange, though. Potentially forcing a DB compaction could help here?