Project

General

Profile

Actions

Bug #23165

closed

OSD used for Metadata / MDS storage constantly entering heartbeat timeout

Added by Oliver Freyermuth about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After our stress test creating 100,000,000 small files on cephfs, and now finally deleting all those files, now 2 of the 4 OSDs crash continously.
They enter heartbeat timeouts and are finally killed.

The other 2 of the 4 OSDs (replica 4) were recreated and backfilled during the deletion process.

# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS 
  0   ssd 0.21829  1.00000  223G  4692M  218G 2.05 1.74 128 
  1   ssd 0.21829  1.00000  223G  4218M  219G 1.84 1.56 128 
  2   ssd 0.21819  1.00000  223G 12007M  211G 5.25 4.46 128 
  3   ssd 0.21819  1.00000  223G 13314M  210G 5.82 4.94 128

osd.0 and osd.1 have been backfilled and are running stable, but osd.2 and osd.3 are affected by the issue.
They shortly managed to synchronize and cluster was healthy at some point, so they all contain the same "information".
Main difference is that osd.2 and osd.4 have lived through the mess of 100,000,000 files created and deleted,
while osd.0 and osd.1 are still rather fresh.

I have uploaded a debug log with log level 20 for osd.3 here:
29275fcd-0dd3-4f0f-bacf-33d8482d85a3
It after 2018-02-27 21:00, it contains several of those crashes with debug level 1, while the last one I captured with debug level 20.
Basically, I just see many:

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0xXXXXXXXXX' had timed out after 15

before a suicide timeout and abort.

While I could now (and likely will) just recreate those OSDs and backfill them from the more healthy ones,
I hope the information collected in this ticket and log will help to solve the underyling issue.

Actions

Also available in: Atom PDF