Bug #5216

closed

restarted or failed osd resulted in a lot of caller_ops.size error messages and stalled I/O

Added by Stefan Priebe almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm running upstream/cuttlefish 85ad65e294f2b3d4bd1cfef6ae613e31d1cea635

I've seen the following today while just restarting a single osd.

/etc/init.d/ceph restart osd.3

ceph -w immediately started to show the following:
2013-05-31 13:42:44.342142 osd.54 [ERR] 4.97d caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.362058 osd.11 [ERR] 4.d15 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.365393 osd.11 [ERR] 4.af4 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.381021 osd.11 [ERR] 4.bc9 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.432011 osd.54 [ERR] 4.651 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.451333 osd.54 [ERR] 4.332 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.458137 osd.11 [ERR] 4.b87 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.552218 osd.54 [ERR] 4.9ea caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.573992 osd.54 [ERR] 4.d56 caller_ops.size 3002 > log size 3001
2013-05-31 13:42:44.576582 osd.11 [ERR] 4.8ce caller_ops.size 3002 > log
...
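
For triage, the errors above can be tallied per OSD. The following is only a minimal sketch, assuming the `ceph -w` output has been saved as text; the regular expression is derived from the lines quoted in this report, and `count_caller_ops_errors` is a hypothetical helper, not part of Ceph.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report (an assumption,
# not an official Ceph log grammar). "\s+" between "log" and "size" also
# accepts lines that were wrapped when pasted.
ERR_RE = re.compile(
    r"(?P<osd>osd\.\d+) \[ERR\] (?P<pg>[0-9a-f]+\.[0-9a-f]+) "
    r"caller_ops\.size (?P<ops>\d+) > log\s+size (?P<log>\d+)"
)


def count_caller_ops_errors(text):
    """Return a Counter mapping OSD name to its number of caller_ops.size errors."""
    return Counter(m.group("osd") for m in ERR_RE.finditer(text))


# Sample input copied from the log excerpt in this report.
sample = (
    "2013-05-31 13:42:44.362058 osd.11 [ERR] 4.d15 caller_ops.size 3002 > log size 3001\n"
    "2013-05-31 13:42:44.365393 osd.11 [ERR] 4.af4 caller_ops.size 3002 > log size 3001\n"
    "2013-05-31 13:42:44.432011 osd.54 [ERR] 4.651 caller_ops.size 3002 > log size 3001\n"
)

if __name__ == "__main__":
    for osd, n in count_caller_ops_errors(sample).most_common():
        print(osd, n)
```

In the excerpt above, only osd.11 and osd.54 report the error, although osd.3 was the one restarted.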

Then, after a few seconds, I saw a lot of messages like these:
2013-05-31 13:43:40.732750 osd.3 [WRN] slow request 45.152737 seconds old, received at 2013-05-31 13:42:55.579934: osd_op(client.9439559.0:3387070 rbd_data.5039256b8b4567.000000000000290f [write 2043904~8192] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg
2013-05-31 13:43:40.732752 osd.3 [WRN] slow request 45.152716 seconds old, received at 2013-05-31 13:42:55.579955: osd_op(client.9439559.0:3387071 rbd_data.5039256b8b4567.000000000000290f [write 2568192~12288] 4.f4a3cae1 RETRY=1 snapc 4da2=[] e90858) v4 currently reached pg

Then my VMs (QEMU with rbd) started to hang.
