Bug #6528
btrfs osd gets kicked out while removing large pg
Description
btrfs isn't exactly blazingly fast for file removal; so much is well known to me. However, there are two issues at hand here that don't make sense to me.
Context is a pg that finished being moved elsewhere, and the osd proceeds to remove its local copy. After a while, it starts showing OSD::disk_t heartbeat timeouts in its log. After a longer while, other osds kick it out of the cluster.
The osd, however, does not die because of the timeout. It notifies the cluster that it was incorrectly marked as down, and keeps on removing the pg. However, it will only reconnect to the cluster after removal is complete. This is the first oddity: ideally it should succeed in reconnecting before the pg removal is complete (removing 100k files can take a while), but suicide would be fine too, so that a pid-based monitor can bring it back up.
The other oddity is that, if I restart the osd manually when it's in this timed-out disk_t heartbeat state, it will often quickly rejoin the cluster and keep on removing the pg files, but without hitting the heartbeat timeouts any more! What gives? Couldn't it do whatever it does to remove the pg files after a restart without the restart, so as to avoid being kicked out in the first place?
Associated revisions
OSD: ping tphandle during pg removal
Fixes: #6528
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
OSD: ping tphandle during pg removal
Fixes: #6528
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c658258d9e2f590054a30c0dee14a579a51bda8c)
Conflicts:
src/osd/OSD.cc
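The fix named in these commits, pinging the thread-pool handle during pg removal, can be illustrated with a small model. The sketch below is not Ceph code (the real change is in src/osd/OSD.cc). It is a Python illustration under stated assumptions: `HeartbeatHandle`, `FakeClock`, `remove_objects`, and `simulate` are hypothetical names, and the 60-second grace period and one-second cost per removal are arbitrary values chosen for the example.

```python
import time


class HeartbeatHandle:
    """Hypothetical stand-in for a thread-pool handle with a watchdog deadline."""

    def __init__(self, grace, clock=time.monotonic):
        self.grace = grace
        self.clock = clock
        self.deadline = clock() + grace

    def reset_timeout(self):
        # Push the deadline forward: the worker is alive and making progress.
        self.deadline = self.clock() + self.grace

    def is_healthy(self):
        return self.clock() < self.deadline


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def remove_objects(objects, remove_one, handle=None):
    """Remove objects one by one, pinging the handle (if any) after each one."""
    for obj in objects:
        remove_one(obj)
        if handle is not None:
            handle.reset_timeout()


def simulate(ping, n_objects=100, cost_per_object=1.0, grace=60.0):
    """Return how many watchdog checks fail during one long removal."""
    clock = FakeClock()
    handle = HeartbeatHandle(grace, clock)
    missed = []

    def remove_one(obj):
        clock.advance(cost_per_object)  # pretend the removal took this long
        if not handle.is_healthy():     # what a heartbeat check would observe
            missed.append(obj)

    remove_objects(range(n_objects), remove_one, handle if ping else None)
    return len(missed)
```

In this model, `simulate(ping=False)` reports missed checks once the removal outlasts the grace period, while `simulate(ping=True)` reports none. That matches the symptom in the description: a long removal loop that never resets its deadline looks hung to the heartbeat check even though it is making progress.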
History
#1 Updated by Sage Weil over 10 years ago
- Priority changed from Normal to High
#2 Updated by Samuel Just over 10 years ago
Ah, we don't ping the tphandle during that process. Creating a patch!
#3 Updated by Samuel Just over 10 years ago
- Status changed from New to In Progress
- Assignee set to Samuel Just
- Priority changed from High to Urgent
#4 Updated by Samuel Just over 10 years ago
- Backport set to dumpling
#5 Updated by Sage Weil over 10 years ago
- Status changed from In Progress to Fix Under Review
#6 Updated by Sage Weil over 10 years ago
- Status changed from Fix Under Review to Resolved