Project

General

Profile

Bug #23165

OSD used for Metadata / MDS storage constantly entering heartbeat timeout

Added by Oliver Freyermuth over 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

After our stress test creating 100,000,000 small files on cephfs, and now finally deleting all those files, now 2 of the 4 OSDs crash continously.
They enter heartbeat timeouts and are finally killed.

The other 2 of the 4 OSDs (replica 4) were recreated and backfilled during the deletion process.

# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS 
  0   ssd 0.21829  1.00000  223G  4692M  218G 2.05 1.74 128 
  1   ssd 0.21829  1.00000  223G  4218M  219G 1.84 1.56 128 
  2   ssd 0.21819  1.00000  223G 12007M  211G 5.25 4.46 128 
  3   ssd 0.21819  1.00000  223G 13314M  210G 5.82 4.94 128

osd.0 and osd.1 have been backfilled and are running stable, but osd.2 and osd.3 are affected by the issue.
They shortly managed to synchronize and cluster was healthy at some point, so they all contain the same "information".
Main difference is that osd.2 and osd.4 have lived through the mess of 100,000,000 files created and deleted,
while osd.0 and osd.1 are still rather fresh.

I have uploaded a debug log with log level 20 for osd.3 here:
29275fcd-0dd3-4f0f-bacf-33d8482d85a3
It after 2018-02-27 21:00, it contains several of those crashes with debug level 1, while the last one I captured with debug level 20.
Basically, I just see many:

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0xXXXXXXXXX' had timed out after 15

before a suicide timeout and abort.

While I could now (and likely will) just recreate those OSDs and backfill them from the more healthy ones,
I hope the information collected in this ticket and log will help to solve the underyling issue.

History

#1 Updated by Oliver Freyermuth over 2 years ago

To clarify, since this was maybe not clear from the ticket text:
Those 4 OSDs were only used for the metadata pool.
They are full SSDs, bluestore. osd.0 and osd.1 are made with ceph-volume, osd.2 and osd.3 still with ceph-disk, but this should not really matter.

#2 Updated by Oliver Freyermuth over 2 years ago

It also seems:

# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS 
  0   ssd 0.21829  1.00000  223G  4638M  219G 2.03 1.73 128 
  1   ssd 0.21829  1.00000  223G  4169M  219G 1.82 1.55 128 
  2   ssd 0.21819  1.00000  223G 11983M  211G 5.24 4.46   0 
  3   ssd 0.21819  1.00000  223G 13276M  210G 5.80 4.94   0

storage usage is slowly dwindling across the crashes. Maybe this is RocksDB compaction or something like that, which is so heavy that the heartbeats can not be met,
and waiting a long time and many crashes things will heal?

#3 Updated by Oliver Freyermuth over 2 years ago

Ah, and this is Luminous 12.2.4, just upgraded from 12.2.3 a few hours ago.

#4 Updated by Oliver Freyermuth over 2 years ago

After a night of waiting and some million files written and deleted, the sitation has stabilized.

The space-usage pattern:

# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS 
  0   ssd 0.21829  1.00000  223G  2644M  220G 1.16 1.02 128 
  1   ssd 0.21829  1.00000  223G  2632M  220G 1.15 1.02 128 
  2   ssd 0.21819  1.00000  223G  9478M  214G 4.14 3.67 128 
  3   ssd 0.21819  1.00000  223G 10475M  213G 4.58 4.06 128 

still looks a bit strange, though. Potentially forcing a DB compaction could help here?

#5 Updated by Sage Weil almost 2 years ago

  • Status changed from New to Resolved

This sounds like the cephfs directories weren't well fragmented, leading to very large omap objects.

Also available in: Atom PDF