Project

General

Profile

Actions

Bug #3722

closed

osd: indefinitely hung request on stable cluster

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

0.48.2argonaut, rbd workload.

occasional requests are blocked indefinitely.

may be osd down/up cycles (due to network blip) with non-restarting ceph-osd daemon; need to verify from logs.

Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Need More Info
Actions #2

Updated by Sage Weil over 11 years ago

  • Assignee set to Sage Weil
Actions #3

Updated by Sage Weil over 11 years ago

the trigger is a brief osd reset due to an intermittent network outage. no actual ceph-osd daemons restart.

2013-01-04 05:19:41.210134 mon.0 10.10.10.4:6789/0 492076 : [INF] pgmap v13047564: 12680 pgs: 12676 active+clean, 4 active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:41.960109 mon.0 10.10.10.4:6789/0 492077 : [INF] osd.30 10.10.10.8:6806/26264 failed (by osd.63 10.10.10.24:6800/18335)
2013-01-04 05:19:41.966086 mon.0 10.10.10.4:6789/0 492078 : [INF] osd.31 10.10.10.8:6803/24001 failed (by osd.63 10.10.10.24:6800/18335)
2013-01-04 05:19:41.970664 mon.0 10.10.10.4:6789/0 492079 : [INF] osd.32 10.10.10.8:6805/29556 failed (by osd.63 10.10.10.24:6800/18335)
2013-01-04 05:19:41.975062 mon.0 10.10.10.4:6789/0 492080 : [INF] osd.33 10.10.10.8:6809/14499 failed (by osd.63 10.10.10.24:6800/18335)
2013-01-04 05:19:42.184149 mon.0 10.10.10.4:6789/0 492081 : [INF] pgmap v13047565: 12680 pgs: 12676 active+clean, 4 active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:42.380137 mon.0 10.10.10.4:6789/0 492082 : [INF] osd.30 10.10.10.8:6806/26264 failed (by osd.3 10.10.10.26:6806/29622)
2013-01-04 05:19:42.386106 mon.0 10.10.10.4:6789/0 492083 : [INF] osd.31 10.10.10.8:6803/24001 failed (by osd.3 10.10.10.26:6806/29622)
2013-01-04 05:19:42.390876 mon.0 10.10.10.4:6789/0 492084 : [INF] osd.32 10.10.10.8:6805/29556 failed (by osd.3 10.10.10.26:6806/29622)
2013-01-04 05:19:42.395409 mon.0 10.10.10.4:6789/0 492085 : [INF] osd.33 10.10.10.8:6809/14499 failed (by osd.3 10.10.10.26:6806/29622)
2013-01-04 05:19:43.183205 mon.0 10.10.10.4:6789/0 492086 : [INF] pgmap v13047566: 12680 pgs: 12676 active+clean, 4 active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:43.257549 mon.0 10.10.10.4:6789/0 492087 : [INF] osd.30 10.10.10.8:6806/26264 failed (by osd.40 10.10.10.10:6803/25287)
2013-01-04 05:19:43.263381 mon.0 10.10.10.4:6789/0 492088 : [INF] osd.31 10.10.10.8:6803/24001 failed (by osd.40 10.10.10.10:6803/25287)
2013-01-04 05:19:43.267941 mon.0 10.10.10.4:6789/0 492089 : [INF] osd.32 10.10.10.8:6805/29556 failed (by osd.40 10.10.10.10:6803/25287)
2013-01-04 05:19:43.272340 mon.0 10.10.10.4:6789/0 492090 : [INF] osd.33 10.10.10.8:6809/14499 failed (by osd.40 10.10.10.10:6803/25287)
2013-01-04 05:19:43.339231 mon.0 10.10.10.4:6789/0 492091 : [INF] osdmap e24227: 24 osds: 20 up, 24 in
2013-01-04 05:19:43.401341 mon.0 10.10.10.4:6789/0 492092 : [INF] pgmap v13047567: 12680 pgs: 10505 active+clean, 2171 stale+active+clean, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:44.327985 mon.0 10.10.10.4:6789/0 492093 : [INF] osdmap e24228: 24 osds: 20 up, 24 in
2013-01-04 05:19:44.366346 mon.0 10.10.10.4:6789/0 492094 : [INF] pgmap v13047568: 12680 pgs: 10505 active+clean, 2171 stale+active+clean, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:48.171594 mon.0 10.10.10.4:6789/0 492095 : [INF] osdmap e24229: 24 osds: 21 up, 24 in
2013-01-04 05:19:48.177605 mon.0 10.10.10.4:6789/0 492096 : [INF] osd.31 10.10.10.8:6803/24001 boot
2013-01-04 05:19:48.218748 mon.0 10.10.10.4:6789/0 492097 : [INF] pgmap v13047569: 12680 pgs: 10505 active+clean, 2171 stale+active+clean, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:49.173983 mon.0 10.10.10.4:6789/0 492098 : [INF] osdmap e24230: 24 osds: 21 up, 24 in
2013-01-04 05:19:43.782674 osd.33 10.10.10.8:6809/14499 35 : [WRN] map e24227 wrongly marked me down
2013-01-04 05:19:49.214385 mon.0 10.10.10.4:6789/0 492099 : [INF] pgmap v13047570: 12680 pgs: 10505 active+clean, 2171 stale+active+clean, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16021 GB used, 28644 GB / 44666 GB avail
2013-01-04 05:19:53.408027 mon.0 10.10.10.4:6789/0 492100 : [INF] pgmap v13047571: 12680 pgs: 73 active, 10358 active+clean, 1425 stale+active+clean, 543 peering, 277 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16021 GB used, 28645 GB / 44666 GB avail; 62190/446323
2 degraded (1.393%)
2013-01-04 05:19:53.805811 mon.0 10.10.10.4:6789/0 492101 : [INF] osdmap e24231: 24 osds: 22 up, 24 in
2013-01-04 05:19:44.375800 osd.31 10.10.10.8:6803/24001 320 : [WRN] map e24228 wrongly marked me down
2013-01-04 05:19:45.405770 osd.32 10.10.10.8:6805/29556 76 : [WRN] map e24227 wrongly marked me down
2013-01-04 05:19:48.562974 osd.30 10.10.10.8:6806/26264 533 : [WRN] map e24229 wrongly marked me down
2013-01-04 05:19:53.810840 mon.0 10.10.10.4:6789/0 492102 : [INF] osd.33 10.10.10.8:6809/14499 boot
2013-01-04 05:19:53.897423 mon.0 10.10.10.4:6789/0 492103 : [INF] pgmap v13047572: 12680 pgs: 353 active, 9374 active+clean, 915 stale+active+clean, 556 peering, 1478 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28645 GB / 44666 GB avail; 209857/44632
32 degraded (4.702%)
2013-01-04 05:19:54.790001 mon.0 10.10.10.4:6789/0 492104 : [INF] osdmap e24232: 24 osds: 22 up, 24 in
2013-01-04 05:19:54.836895 mon.0 10.10.10.4:6789/0 492105 : [INF] pgmap v13047573: 12680 pgs: 368 active, 9270 active+clean, 825 stale+active+clean, 556 peering, 1657 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28646 GB / 44666 GB avail; 243323/44632
32 degraded (5.452%)
2013-01-04 05:19:55.937034 mon.0 10.10.10.4:6789/0 492106 : [INF] osdmap e24233: 24 osds: 22 up, 24 in
2013-01-04 05:19:55.983413 mon.0 10.10.10.4:6789/0 492107 : [INF] pgmap v13047574: 12680 pgs: 368 active, 9270 active+clean, 825 stale+active+clean, 556 peering, 1657 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28646 GB / 44666 GB avail; 243323/44632
32 degraded (5.452%)
2013-01-04 05:19:56.941925 mon.0 10.10.10.4:6789/0 492108 : [INF] osdmap e24234: 24 osds: 23 up, 24 in
2013-01-04 05:19:56.948025 mon.0 10.10.10.4:6789/0 492109 : [INF] osd.32 10.10.10.8:6805/29556 boot
2013-01-04 05:19:56.994924 mon.0 10.10.10.4:6789/0 492110 : [INF] pgmap v13047575: 12680 pgs: 368 active, 9270 active+clean, 825 stale+active+clean, 556 peering, 1657 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28646 GB / 44666 GB avail; 243323/44632
32 degraded (5.452%)
2013-01-04 05:19:57.947225 mon.0 10.10.10.4:6789/0 492111 : [INF] osdmap e24235: 24 osds: 24 up, 24 in
2013-01-04 05:19:57.954479 mon.0 10.10.10.4:6789/0 492112 : [INF] osd.30 10.10.10.8:6806/26264 boot
2013-01-04 05:19:58.000971 mon.0 10.10.10.4:6789/0 492113 : [INF] pgmap v13047576: 12680 pgs: 368 active, 9270 active+clean, 825 stale+active+clean, 556 peering, 1657 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28646 GB / 44666 GB avail; 243323/44632
32 degraded (5.452%)
2013-01-04 05:19:58.953169 mon.0 10.10.10.4:6789/0 492114 : [INF] osdmap e24236: 24 osds: 24 up, 24 in
2013-01-04 05:20:01.536428 osd.4 10.10.10.7:6804/10210 19 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.692572 secs
2013-01-04 05:20:01.536442 osd.4 10.10.10.7:6804/10210 20 : [WRN] slow request 30.692572 seconds old, received at 2013-01-04 05:19:30.843793: osd_op(client.284333.0:400169 rb.0.2d6ac.622d206b.000000000fc1 [write 1831936~4096] 65.b6a9ae06 RETRY) v4 currently delayed
2013-01-04 05:19:59.005985 mon.0 10.10.10.4:6789/0 492115 : [INF] pgmap v13047577: 12680 pgs: 758 active, 9270 active+clean, 825 stale+active+clean, 166 peering, 1657 active+degraded, 2 active+clean+inconsistent, 2 stale+active+clean+inconsistent; 7925 GB data, 16020 GB used, 28646 GB / 44666 GB avail; 243323/44632
32 degraded (5.452%)

the first slow request report above is the one that gets stuck:
2013-01-04 06:24:51.747672 osd.4 10.10.10.7:6804/10210 217 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 3920.903820 secs
2013-01-04 06:24:51.747688 osd.4 10.10.10.7:6804/10210 218 : [WRN] slow request 3920.903820 seconds old, received at 2013-01-04 05:19:30.843793: osd_op(client.284333.0:400169 rb.0.2d6ac.622d206b.000000000fc1 [write 1831936~4096] 65.b6a9ae06 RETRY) v4 currently delayed

the request appears to have been received by a primary at 05:19:30.843793, about 10 seconds before the replica was marked down (05:19:41.960109-ish). from that i would expect it to be waiting for the subop, but it shows up as 'currently delayed'.

Actions #4

Updated by Sage Weil over 11 years ago

  • Assignee changed from Sage Weil to Samuel Just
Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to 7
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved
Actions

Also available in: Atom PDF