Bug #5216
restarted or failed osd resulted in a lot of caller_ops.size error messages and stalled I/O
Status: Closed
Description
I'm running upstream/cuttlefish 85ad65e294f2b3d4bd1cfef6ae613e31d1cea635
I've seen the following today while just restarting a single osd.
/etc/init.d/ceph restart osd.3
ceph -w started to show the following immediately:
2013-05-31 13:42:44.342142 osd.54 [ERR] 4.97d caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.362058 osd.11 [ERR] 4.d15 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.365393 osd.11 [ERR] 4.af4 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.381021 osd.11 [ERR] 4.bc9 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.432011 osd.54 [ERR] 4.651 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.451333 osd.54 [ERR] 4.332 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.458137 osd.11 [ERR] 4.b87 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.552218 osd.54 [ERR] 4.9ea caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.573992 osd.54 [ERR] 4.d56 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.576582 osd.11 [ERR] 4.8ce caller_ops.size 3002 > log size 3001
...
And then after a few seconds I saw a lot of messages like these:
2013-05-31 13:43:40.732750 osd.3 [WRN] slow request 45.152737 seconds old, received at 2013-05-31 13:42:55.579934: osd_op(client.9439559.0:3387070 rbd_data.5039256b8b4567.000000000000290f [write 2043904~8192] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg
2013-05-31 13:43:40.732752 osd.3 [WRN] slow request 45.152716 seconds old, received at 2013-05-31 13:42:55.579955: osd_op(client.9439559.0:3387071 rbd_data.5039256b8b4567.000000000000290f [write 2568192~12288] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg
Then my VMs (Qemu with rbd) started to hang.
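For context on the error message: caller_ops is the PG log's per-request index. It maps a client request id (reqid) to the log entry that request produced, so that replayed duplicate requests can be detected. Since each log entry contributes at most one index entry, caller_ops.size() should never exceed log.size(); the OSD reports exactly this error when that invariant is violated. A minimal sketch of the structure (simplified stand-in types, not Ceph's actual code):

#include <list>
#include <map>
#include <string>

// Simplified sketch of the PG log index (stand-in types, not Ceph's code).
struct Entry {
  std::string reqid;  // client.X.0:N in the real log; empty if not client-generated
  // ... object id, version, op type, etc.
};

struct IndexedLog {
  std::list<Entry> log;                      // the PG log itself
  std::map<std::string, Entry*> caller_ops;  // reqid -> log entry, for dup detection

  // The invariant behind the error message: every caller_ops entry points
  // at some log entry, so the index can never outgrow the log.
  bool consistent() const {
    return caller_ops.size() <= log.size();
  }
};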
Updated by Stefan Priebe almost 11 years ago
Then the whole ceph storage became unstable until the osd was up and running again and had recovered.
Updated by Stefan Priebe almost 11 years ago
To me it seems that the osd sets itself online / available before it is really ready, which then results in slow I/O.
Starting an OSD results in 2-3 minutes of 200% CPU and a lot of I/O, but it sets itself online / available right after starting.
Updated by Stefan Priebe almost 11 years ago
So I get the caller_ops.size 3002 > log size 3001 messages while the osd is offline, and the slow request messages when it comes back online.
Updated by Samuel Just almost 11 years ago
- Status changed from New to Pending Backport
- Assignee set to Samuel Just
Pushed a fix to master; it needs a backport to cuttlefish.
Note: this probably did not cause the I/O hang.
Updated by Stefan Priebe almost 11 years ago
This is the backport:
commit 2af3f1d40b9c64f58d1a05232c52b2a47426fef5
Author: Samuel Just <sam.just@inktank.com>
Date: Fri May 31 11:08:47 2013 -0700
PGLog: only add entry to caller_ops in add() if reqid_is_indexed()
Fixes: #5216
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e6ad9da03a17e9bfa12386ecaf0a2ab257327da6)
Conflicts:
src/osd/PGLog.h
diff --git a/src/osd/PG.h b/src/osd/PG.h
index fe89879..015fc87 100644
--- a/src/osd/PG.h
+++ b/src/osd/PG.h
@@ -279,7 +279,8 @@ public:
 
       // to our index
       objects[e.soid] = &(log.back());
-      caller_ops[e.reqid] = &(log.back());
+      if (e.reqid_is_indexed())
+        caller_ops[e.reqid] = &(log.back());
     }
 
     void trim(ObjectStore::Transaction &t, hobject_t& oid, eversion_t s);
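Applied to the sketch earlier in this ticket, the change amounts to guarding the index insert. Again a simplified illustration, not the real code: reqid_is_indexed() is the actual predicate named in the patch, mocked here as "the entry carries a client reqid at all".

// Continuing the earlier sketch: add() with the backported guard.
// Assumption (mock): internally generated entries carry a blank reqid
// and are not worth indexing for duplicate-request detection.
bool reqid_is_indexed(const Entry &e) {
  return !e.reqid.empty();
}

void add(IndexedLog &l, const Entry &e) {
  l.log.push_back(e);
  // std::list never invalidates pointers to existing elements, so it is
  // safe to index the address of log.back().
  if (reqid_is_indexed(l.log.back()))
    l.caller_ops[l.log.back().reqid] = &l.log.back();
}

With the guard in place, entries that do not represent indexable client requests stay out of caller_ops, which matches the commit's one-line description.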
Updated by Stefan Priebe almost 11 years ago
The stalled I/O seems to come from the freshly started OSDs. They seem to tell Ceph "hey, I can handle I/O" while they are still in their start phase, scanning their osd folder or flushing the journal or whatever...
Updated by Samuel Just almost 11 years ago
- Status changed from Pending Backport to Resolved