Bug #23165
Status: Closed
OSD used for Metadata / MDS storage constantly entering heartbeat timeout
Description
After our stress test that created 100,000,000 small files on CephFS, we are now finally deleting all those files, and 2 of the 4 OSDs crash continuously.
They enter heartbeat timeouts and are finally killed.
The other 2 of the 4 OSDs (replica 4) were recreated and backfilled during the deletion process.
# ceph osd df | head
ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL %USE VAR  PGS
 0 ssd   0.21829 1.00000  223G 4692M  218G  2.05 1.74 128
 1 ssd   0.21829 1.00000  223G 4218M  219G  1.84 1.56 128
 2 ssd   0.21819 1.00000  223G 12007M 211G  5.25 4.46 128
 3 ssd   0.21819 1.00000  223G 13314M 210G  5.82 4.94 128
osd.0 and osd.1 have been backfilled and are running stable, but osd.2 and osd.3 are affected by the issue.
They briefly managed to synchronize and the cluster was healthy at some point, so they all contain the same "information".
The main difference is that osd.2 and osd.3 have lived through the mess of 100,000,000 files being created and deleted,
while osd.0 and osd.1 are still rather fresh.
I have uploaded a debug log with log level 20 for osd.3 here:
29275fcd-0dd3-4f0f-bacf-33d8482d85a3
After 2018-02-27 21:00, it contains several of those crashes captured with debug level 1, while the last one I captured with debug level 20.
Basically, I just see many:
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0xXXXXXXXXX' had timed out after 15
before a suicide timeout and abort.
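For context, the mechanism behind those log lines can be sketched roughly as follows. This is a simplified illustration, not Ceph's actual implementation: each worker thread periodically refreshes a deadline in a shared heartbeat map; `is_healthy` reports false once a thread has missed its grace period (the "had timed out after 15" messages), and after a much longer suicide grace the daemon aborts itself. The struct, member names, and timeouts here are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical sketch of a heartbeat map: workers call touch() while
// making progress; a checker thread calls is_healthy()/past_suicide().
struct HeartbeatMap {
    using Clock = std::chrono::steady_clock;
    struct Entry {
        Clock::time_point deadline;          // miss this -> "had timed out"
        Clock::time_point suicide_deadline;  // miss this -> daemon aborts
    };
    std::map<std::string, Entry> threads;

    // Called by a healthy worker thread at the top of each work loop.
    void touch(const std::string& name, int grace_s, int suicide_grace_s) {
        auto now = Clock::now();
        threads[name] = {now + std::chrono::seconds(grace_s),
                         now + std::chrono::seconds(suicide_grace_s)};
    }

    // A thread stuck on slow I/O stops touching its entry and the
    // grace deadline eventually passes, producing the timeout logs.
    bool is_healthy(Clock::time_point now) const {
        for (const auto& [name, e] : threads)
            if (now > e.deadline) return false;
        return true;
    }

    // Once the longer suicide grace also passes, the daemon kills itself
    // on the assumption that it is wedged beyond recovery.
    bool past_suicide(Clock::time_point now) const {
        for (const auto& [name, e] : threads)
            if (now > e.suicide_deadline) return true;
        return false;
    }
};
```

The point of the two-tier grace is that a briefly stalled thread only produces warnings, while a thread stuck long enough to miss the suicide deadline takes the whole OSD down, which matches the crash pattern seen here.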
While I could now (and likely will) just recreate those OSDs and backfill them from the healthier ones,
I hope the information collected in this ticket and log will help to solve the underlying issue.