Bug #9285

osd: promoted object can get evicted before promotion completes

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Date: Fri, 29 Aug 2014 06:43:22 +0000
From: "Wang, Zhiqiang" <zhiqiang.wang@intel.com>
To: "'ceph-devel@vger.kernel.org'" <ceph-devel@vger.kernel.org>
Subject: Cache tiering slow request issue: currently waiting for rw locks
Parts/Attachments:
   1 Shown   ~35 lines  Text
   2          44 KB     Application, "slow_request.log" 
----------------------------------------

Hi all,

I've run into this slow request issue some time ago. The problem is like this:
when running with cache tiering, there are 'slow request' warning messages in the
log file like below.

2014-08-29 10:18:24.669763 7f9b20f1b700  0 log [WRN] : 1 slow requests, 1
included below; oldest blocked for > 30.996595 secs
2014-08-29 10:18:24.669768 7f9b20f1b700  0 log [WRN] : slow request 30.996595
seconds old, received at 2014-08-29 10:17:53.673142:
osd_op(client.114176.0:144919 rb.0.17f56.6b8b4567.000000000935 [sparse-read
3440640~4096] 45.cf45084b ack+read e26168) v4 currently waiting for rw locks

Recently I made some changes to the logging, captured this problem, and finally
figured out its root cause. You can check the attachment for the logs.

Here is the root cause:
There is a cache miss when doing a read. During promotion, after copying the data
from the base tier OSD, the cache tier primary OSD replicates the data to the other
cache tier OSDs. Sometimes this takes quite a long time. During this period, the
promoted object may be evicted because the cache tier is full. When the primary OSD
finally gets the replication response and restarts the original read request, it
doesn't find the object in the cache tier and does the promotion again. This loops
several times, and we see the 'slow request' warnings in the logs. Theoretically,
this could loop forever, and the request from the client would never finish.
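
To make the failure mode concrete, here is a minimal sketch of the retry loop
described above. This is illustrative code only, not Ceph code; the names and
the retry cap are assumptions made for clarity.

// Illustrative sketch of the promotion/eviction race described above.
// Not Ceph code; names and the retry cap are made up for clarity.
#include <iostream>

int main() {
  bool object_in_cache_tier = false;  // client read misses in the cache tier
  int promotions = 0;

  // In the bug report this can, in theory, repeat indefinitely.
  while (!object_in_cache_tier && promotions < 5) {
    ++promotions;
    // 1. Promote: copy the object from the base tier to the cache tier primary.
    object_in_cache_tier = true;
    // 2. Replicate the object to the other cache tier OSDs (may take a while).
    // 3. Meanwhile the cache tier fills up and the agent evicts the object.
    object_in_cache_tier = false;
    // 4. The restarted client read misses again, so promotion starts over.
  }

  std::cout << "promotion attempts: " << promotions << "\n";
}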

There is a simple fix for this:
Add a field to the object state indicating the status of the promotion. It is set
to true after the data has been copied from the base tier and before the replication
starts. It is reset to false after the replication completes and the original client
request starts to execute. Eviction is not allowed while this field is true.
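
A rough sketch of what such a flag could look like, assuming a hypothetical
'promoting' field and eviction check (names are illustrative, not the actual
Ceph data structures):

// Hypothetical sketch of the proposed fix; not the actual Ceph object state.
struct ObjectStateSketch {
  // Set to true after the data has been copied from the base tier and before
  // the replication to the other cache tier OSDs starts; reset to false once
  // replication finishes and the original client request is restarted.
  bool promoting = false;
};

// Hypothetical eviction check in the cache tiering agent.
bool agent_may_evict(const ObjectStateSketch &obs) {
  if (obs.promoting)
    return false;  // never evict an object whose promotion is still in flight
  return true;     // otherwise the normal eviction criteria apply
}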

What do you think?
