Bug #6528

btrfs osd gets kicked out while removing large pg

Added by Alexandre Oliva over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: dumpling
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

btrfs isn't exactly blazingly fast for file removal; so much is well known to me. However, there are two issues at hand here that don't make sense to me.

The context is a pg that has finished being moved elsewhere, so the osd proceeds to remove its local copy. After a while, it starts showing OSD::disk_tp heartbeat timeouts in its log. After a longer while, other osds kick it out of the cluster.

The osd, however, does not die because of the timeout. It notifies the cluster that it was incorrectly marked as down, and keeps on removing the pg. However, it only reconnects to the cluster after removal is complete. This is the first oddity: ideally it should succeed in reconnecting before the pg removal is complete (it can take a while to remove 100k files), but suicide would be fine too, so that a pid-based monitor can bring it back up.

The other oddity is that, if I restart the osd manually when it's in this timed-out disk_tp heartbeat state, it will often quickly rejoin the cluster and keep on removing the pg files, but without hitting the heartbeat timeouts any more! What gives? Couldn't it do whatever it does to remove the pg files after a restart without the restart, so as to avoid being kicked out in the first place?

Associated revisions

Revision c658258d (diff)
Added by Samuel Just over 10 years ago

OSD: ping tphandle during pg removal

Fixes: #6528
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>

Revision 24711cd4 (diff)
Added by Samuel Just about 10 years ago

OSD: ping tphandle during pg removal

Fixes: #6528
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>

(cherry picked from commit c658258d9e2f590054a30c0dee14a579a51bda8c)

Conflicts:
src/osd/OSD.cc

History

#1 Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to High

#2 Updated by Samuel Just over 10 years ago

Ah, we don't ping the tphandle during that process. Creating a patch!
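To illustrate the diagnosis above, here is a minimal sketch of the general pattern, not the actual patch. It assumes a heartbeat handle that counts as timed out once it has gone longer than a grace period without a ping. The class and function names (`FakeClock`, `HeartbeatHandle`, `remove_pg`, `simulate`), the 60-second grace and the 0.01 s per-file cost are all hypothetical; only the 100k file count comes from the report.

```python
class FakeClock:
    """Manually advanced clock so the simulation is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class HeartbeatHandle:
    """Hypothetical stand-in for a worker thread's heartbeat handle."""

    def __init__(self, clock, grace):
        self.clock = clock
        self.grace = grace
        self.deadline = clock.now + grace

    def ping(self):
        # Signal progress: push the deadline one grace period into the future.
        self.deadline = self.clock.now + self.grace

    def is_healthy(self):
        return self.clock.now <= self.deadline


def remove_pg(files, handle, clock, seconds_per_file, ping_every=None):
    """Simulate removing files one by one.

    Returns True if the handle stayed healthy for the whole removal.
    """
    stayed_healthy = True
    for count, _ in enumerate(files, start=1):
        clock.advance(seconds_per_file)  # simulated slow unlink
        if not handle.is_healthy():
            stayed_healthy = False
        if ping_every and count % ping_every == 0:
            handle.ping()
    return stayed_healthy


def simulate(ping_every):
    """Run one removal of 100k files with assumed timings."""
    clock = FakeClock()
    handle = HeartbeatHandle(clock, grace=60.0)  # assumed grace period
    return remove_pg(range(100_000), handle, clock, 0.01, ping_every)


if __name__ == "__main__":
    print("no pings:        healthy =", simulate(None))
    print("ping every 100:  healthy =", simulate(100))
```

Under these assumed timings the whole removal takes far longer than the grace period, so the run without pings trips the timeout, while pinging periodically inside the loop keeps the handle healthy for the full removal.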

#3 Updated by Samuel Just over 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Samuel Just
  • Priority changed from High to Urgent

#4 Updated by Samuel Just over 10 years ago

  • Backport set to dumpling

#5 Updated by Sage Weil over 10 years ago

  • Status changed from In Progress to Fix Under Review

#6 Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved
