Bug #61483

"ceph tell osd.X compact" times out and retries forever

Added by Hector Martin 10 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

When an OSD compaction takes long enough, something along the way appears to time out and reissue the asok command. If the OSDs are busy and producing new RocksDB tables, this results in the OSDs compacting over and over, until a compaction cycle happens to complete without any new tables being produced (if that ever happens).

This is how it looks when you do get lucky and the loop eventually terminates:

osd.13 10998 triggering manual compaction
osd.13 11009 finished manual compaction in 3022.82 seconds
osd.13 11009 triggering manual compaction
osd.13 11011 finished manual compaction in 3554.88 seconds
osd.13 11011 triggering manual compaction
osd.13 11011 finished manual compaction in 3376.6 seconds
osd.13 11011 triggering manual compaction
osd.13 11013 finished manual compaction in 2700.78 seconds
osd.13 11013 triggering manual compaction
osd.13 11017 finished manual compaction in 2704.95 seconds
osd.13 11017 triggering manual compaction
osd.13 11019 finished manual compaction in 3202.42 seconds
osd.13 11019 triggering manual compaction
osd.13 11019 finished manual compaction in 526.873 seconds
osd.13 11019 triggering manual compaction
osd.13 11019 finished manual compaction in 17.2247 seconds
osd.13 11019 triggering manual compaction
osd.13 11019 finished manual compaction in 0 seconds
osd.13 11019 triggering manual compaction
osd.13 11019 finished manual compaction in 0 seconds
osd.13 11019 triggering manual compaction
osd.13 11019 finished manual compaction in 0 seconds
osd.13 11019 triggering manual compaction
[many more lines of 0-second compactions, presumably queued while the above compactions were running]

That was the result of issuing a single `ceph tell osd.13 compact` command, which blocked until all the compactions finished.

When the loop doesn't terminate on its own, the OSD just keeps compacting forever. I tried ^Cing the `ceph tell osd.X compact` command and restarting the mon, but it seems the commands were already queued on the OSDs, so that didn't break the loop. I had to restart a couple of OSDs that just would not give up in order to get them to stop compacting.
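
For anyone trying to gauge how much work a single command ended up triggering, here is a minimal Python sketch that tallies the "finished manual compaction" lines per OSD. It assumes the log lines have exactly the shape shown in the excerpt above; the function name `summarize_compactions` is hypothetical and not part of Ceph.

```python
import re

# Assumed line shape, taken from the excerpt in this report:
#   "osd.13 11009 finished manual compaction in 3022.82 seconds"
FINISHED = re.compile(
    r"(osd\.\d+) \d+ finished manual compaction in ([0-9.]+) seconds"
)


def summarize_compactions(lines):
    """Return {osd_name: (finished_count, total_seconds)}.

    Hypothetical helper for illustration only; lines that do not match
    the assumed format (e.g. "triggering manual compaction") are skipped.
    """
    summary = {}
    for line in lines:
        match = FINISHED.search(line)
        if match is None:
            continue
        osd = match.group(1)
        seconds = float(match.group(2))
        count, total = summary.get(osd, (0, 0.0))
        summary[osd] = (count + 1, total + seconds)
    return summary


if __name__ == "__main__":
    # First few lines of the excerpt quoted in this report.
    sample = [
        "osd.13 10998 triggering manual compaction",
        "osd.13 11009 finished manual compaction in 3022.82 seconds",
        "osd.13 11009 triggering manual compaction",
        "osd.13 11011 finished manual compaction in 3554.88 seconds",
    ]
    for osd, (count, total) in summarize_compactions(sample).items():
        print(f"{osd}: {count} finished compactions, {total:.2f} s total")
```

Feeding it the full OSD log would show how many compaction cycles one `ceph tell osd.X compact` invocation produced and how long they took in total.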
