Bug #5216 (closed)

restarted or failed osd resulted in a lot of caller_ops.size error messages and stalled I/O

Added by Stefan Priebe almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm running upstream/cuttlefish 85ad65e294f2b3d4bd1cfef6ae613e31d1cea635

I've seen the following today while just restarting a single osd.

/etc/init.d/ceph restart osd.3

ceph -w immediately started to show the following:
2013-05-31 13:42:44.342142 osd.54 [ERR] 4.97d caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.362058 osd.11 [ERR] 4.d15 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.365393 osd.11 [ERR] 4.af4 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.381021 osd.11 [ERR] 4.bc9 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.432011 osd.54 [ERR] 4.651 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.451333 osd.54 [ERR] 4.332 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.458137 osd.11 [ERR] 4.b87 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.552218 osd.54 [ERR] 4.9ea caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.573992 osd.54 [ERR] 4.d56 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.576582 osd.11 [ERR] 4.8ce caller_ops.size 3002 > log
...

and then after a few seconds I saw a lot of messages like these:
2013-05-31 13:43:40.732750 osd.3 [WRN] slow request 45.152737 seconds old, received at 2013-05-31 13:42:55.579934: osd_op(client.9439559.0:3387070 rbd_data.5039256b8b4567.000000000000290f [write 2043904~8192] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg
2013-05-31 13:43:40.732752 osd.3 [WRN] slow request 45.152716 seconds old, received at 2013-05-31 13:42:55.579955: osd_op(client.9439559.0:3387071 rbd_data.5039256b8b4567.000000000000290f [write 2568192~12288] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg

Then my VMs (Qemu with rbd) started to hang.
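
For context, the sketch below is a minimal illustrative model, not Ceph's actual code: the type names (ReqId, LogEntry, IndexedLog) are invented, and the real indexing criterion is the reqid_is_indexed() check that appears in the backported fix later in this ticket. It only shows the relationship the [ERR] lines are complaining about: each PG log entry may register its client request id in a caller_ops index used to detect duplicate ops, so in this model the index should never hold more entries than the log itself.

// Minimal standalone sketch (not Ceph code) of the structure behind the
// "caller_ops.size N > log size M" errors.  All names here are invented.
#include <cstdint>
#include <cstdio>
#include <list>
#include <map>
#include <string>
#include <tuple>

struct ReqId {
    std::string client;   // e.g. "client.9439559"
    uint64_t    tid = 0;  // per-client op sequence number
    bool operator<(const ReqId& o) const {
        return std::tie(client, tid) < std::tie(o.client, o.tid);
    }
};

struct LogEntry {
    ReqId       reqid;
    std::string oid;      // object the op touched, e.g. an rbd_data object
};

struct IndexedLog {
    std::list<LogEntry>        log;         // pointers into a list stay valid
    std::map<ReqId, LogEntry*> caller_ops;  // reqid -> entry, for dup-op detection

    void add(const LogEntry& e) {
        log.push_back(e);
        // Indexing every entry unconditionally is what the fix backported
        // in #5 below changes: entries whose reqid should not be indexed
        // must be skipped, or the index drifts out of sync with the log.
        caller_ops[e.reqid] = &log.back();
    }

    // The kind of consistency check that produced the [ERR] messages above.
    bool check() const {
        if (caller_ops.size() > log.size()) {
            std::fprintf(stderr, "caller_ops.size %zu > log size %zu\n",
                         caller_ops.size(), log.size());
            return false;
        }
        return true;
    }
};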

Actions #1

Updated by Stefan Priebe almost 11 years ago

Then the whole Ceph storage became unstable until the OSD was up and running again and had recovered.

Actions #2

Updated by Stefan Priebe almost 11 years ago

To me it seems that the OSD marks itself online / available before it is really ready, which then results in slow I/O.

Starting an OSD results in 2-3 minutes of 200% CPU usage and a lot of I/O, but it marks itself online / available right after the start.

Actions #3

Updated by Stefan Priebe almost 11 years ago

So I get the caller_ops.size 3002 > log size 3001 messages while the OSD is offline, and I get the slow request messages when it comes online again.

Actions #4

Updated by Samuel Just almost 11 years ago

  • Status changed from New to Pending Backport
  • Assignee set to Samuel Just

pushed fix to master, needs backport to cuttlefish

Note, this probably did not cause the IO hang.

Actions #5

Updated by Stefan Priebe almost 11 years ago

this is the backport:

commit 2af3f1d40b9c64f58d1a05232c52b2a47426fef5
Author: Samuel Just <>
Date:   Fri May 31 11:08:47 2013 -0700

    PGLog: only add entry to caller_ops in add() if reqid_is_indexed()

    Fixes: #5216
    Signed-off-by: Samuel Just <>
    Reviewed-by: Sage Weil <>
    (cherry picked from commit e6ad9da03a17e9bfa12386ecaf0a2ab257327da6)

    Conflicts:
        src/osd/PGLog.h

diff --git a/src/osd/PG.h b/src/osd/PG.h
index fe89879..015fc87 100644
--- a/src/osd/PG.h
+++ b/src/osd/PG.h
@@ -279,7 +279,8 @@ public:

     // to our index
     objects[e.soid] = &(log.back());
-    caller_ops[e.reqid] = &(log.back());
+    if (e.reqid_is_indexed())
+      caller_ops[e.reqid] = &(log.back());
   }
   void trim(ObjectStore::Transaction &t, hobject_t& oid, eversion_t s);
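
To make the effect of that one-line guard concrete, here is another small illustrative sketch (again not Ceph's code; the Entry/Log types and the plain string reqids are made up) of why the caller_ops index has to stay in step with the log: it is consulted when a client replays an op (the RETRY=1 requests in the description), and trimming has to remove exactly the entries that add() indexed.

// Toy model, not Ceph's PGLog; all names invented for illustration.
#include <list>
#include <map>
#include <string>

struct Entry {
    std::string reqid;   // empty => not a client-indexed op
    std::string oid;
    bool indexed() const { return !reqid.empty(); }
};

struct Log {
    std::list<Entry>                    log;
    std::map<std::string, const Entry*> caller_ops;

    void add(const Entry& e) {
        log.push_back(e);
        if (e.indexed())                       // the backported guard
            caller_ops[e.reqid] = &log.back();
    }

    // A replayed client op is looked up here so it is not applied twice.
    bool seen(const std::string& reqid) const {
        return caller_ops.count(reqid) > 0;
    }

    // Trimming must drop index entries together with their log entries,
    // and only for entries that add() actually indexed; otherwise the two
    // containers drift apart, which is the kind of inconsistency the
    // size-mismatch [ERR] messages report.
    void trim_front() {
        if (log.empty())
            return;
        const Entry& e = log.front();
        if (e.indexed()) {
            auto it = caller_ops.find(e.reqid);
            if (it != caller_ops.end() && it->second == &e)
                caller_ops.erase(it);
        }
        log.pop_front();
    }
};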
Actions #6

Updated by Stefan Priebe almost 11 years ago

The stalled I/O seems to come from the freshly started OSDs. They seem to tell Ceph "hey, I can handle I/O" while they are still in the start phase, scanning their OSD folder or flushing the journal or whatever...

Actions #7

Updated by Samuel Just almost 11 years ago

  • Status changed from Pending Backport to Resolved