Bug #11334 (closed)

osd: op with newer epoch stuck in optracker

Added by Sage Weil about 9 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The log shows:

2015-04-04 20:21:19.807227 7f628f9e9700  1 -- 10.214.136.136:6800/24788 <== mds.0 10.214.134.10:6800/9790 125281 ==== osd_op(mds.0.122:43197967 1000a878d0c.00000000 [create 0~0,setxattr parent (342)] 0.47732881 ondisk+write+known_if_redirected e406836) v5 ==== 217+0+348 (1980847806 0 3494425867) 0x201c9500 con 0x1e92ac60
2015-04-04 20:21:19.807306 7f628f9e9700  1 -- 10.214.136.136:6800/24788 --> 10.214.133.104:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0,osdmap=406836}) v2 -- ?+0 0x38d7400 con 0x1f163c60

and then the OSD starts complaining about a blocked op.
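
Here is a minimal, hypothetical C++ sketch of the mechanism the log suggests (the names MiniOSD, waiting_on_map, subscribe_to_map, etc. are simplified stand-ins, not the actual Ceph OSD code): an op stamped with an epoch newer than the OSD's current map is parked on the session's waiting-for-map list and a map subscription is sent, and the op stays visible to the op tracker the whole time it is parked.

// Hypothetical, simplified sketch (not the actual Ceph OSD code) of the
// dispatch path the log suggests: an op stamped with epoch e406836 arrives
// while the OSD's map is older, so the op is parked on the session's
// waiting-for-map list and a subscription for the newer map is sent; the op
// stays registered with the op tracker for as long as it is parked.
#include <cstdint>
#include <deque>
#include <iostream>
#include <memory>
#include <string>

using epoch_t = uint32_t;

struct Op {
  epoch_t map_epoch;   // epoch stamped on the request (e406836 in the log)
  std::string desc;
};
using OpRef = std::shared_ptr<Op>;

struct Session {
  std::deque<OpRef> waiting_on_map;   // holds a ref until the map arrives
};

struct MiniOSD {
  epoch_t current_epoch = 406800;     // pretend the OSD is a few epochs behind
  Session session;

  void subscribe_to_map(epoch_t want) {
    std::cout << "subscribe: need osdmap >= " << want << "\n";
  }

  // Dispatch an incoming op; if it needs a newer map, park it.
  void dispatch(const OpRef& op) {
    if (op->map_epoch > current_epoch) {
      session.waiting_on_map.push_back(op);   // op is now "stuck" here
      subscribe_to_map(op->map_epoch);
      return;
    }
    std::cout << "serving " << op->desc << "\n";
  }

  // Called when a newer osdmap finally arrives: drain the waiting list.
  void handle_osdmap(epoch_t e) {
    current_epoch = e;
    while (!session.waiting_on_map.empty() &&
           session.waiting_on_map.front()->map_epoch <= current_epoch) {
      OpRef op = session.waiting_on_map.front();
      session.waiting_on_map.pop_front();
      dispatch(op);                           // now served, ref released
    }
  }
};

int main() {
  MiniOSD osd;
  osd.dispatch(std::make_shared<Op>(Op{406836, "osd_op(mds.0.122:43197967)"}));
  // Until handle_osdmap(406836) runs, the op sits in waiting_on_map and an
  // op tracker would keep reporting it as blocked.
  osd.handle_osdmap(406836);
}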

Actions #1

Updated by Sage Weil about 9 years ago

  • Description updated (diff)
Actions #2

Updated by Sage Weil about 9 years ago

  • Priority changed from Urgent to High
Actions #3

Updated by Kefu Chai almost 9 years ago

  • Regression set to No

So it seems that sometimes we fail to release the OpRequestRef in a timely manner,

and hence the finished op is not removed from the tracker before complaint_time, which leads to false alarms like the log below (a small illustration follows the excerpt):

2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 513.611379 secs
2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
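
A minimal, hypothetical sketch of the behaviour Kefu describes, under the assumption that slow-request age is measured from the time the op was received (the OpTracker/TrackedOp names here are simplified stand-ins, not the real implementation): if the last reference to a finished op is not dropped promptly, the periodic check still reports it as a slow request.

// Minimal, hypothetical sketch of the behaviour described above (simplified
// names, not the real OpTracker): the tracker measures an op's age from the
// time it was received, so if the last reference to a finished op is not
// dropped promptly, the periodic check still sees an "old" op and emits a
// slow-request warning even though the client was already answered.
#include <chrono>
#include <iostream>
#include <list>
#include <memory>
#include <string>

using Clock = std::chrono::steady_clock;

struct TrackedOp {
  std::string desc;
  Clock::time_point received;
};
using TrackedOpRef = std::shared_ptr<TrackedOp>;

class OpTracker {
  std::list<std::weak_ptr<TrackedOp>> ops_;
public:
  TrackedOpRef create(std::string desc) {
    auto op = std::make_shared<TrackedOp>();
    op->desc = std::move(desc);
    op->received = Clock::now();
    ops_.push_back(op);
    return op;
  }

  // Periodic check: any op still referenced and older than complaint_time is
  // reported as slow, regardless of whether it has already finished.
  void check_ops_in_flight(std::chrono::seconds complaint_time) {
    const auto now = Clock::now();
    for (auto it = ops_.begin(); it != ops_.end();) {
      if (auto op = it->lock()) {
        auto age = std::chrono::duration_cast<std::chrono::seconds>(
            now - op->received);
        if (age >= complaint_time)
          std::cout << "[WRN] slow request " << age.count()
                    << " seconds old: " << op->desc << "\n";
        ++it;
      } else {
        it = ops_.erase(it);   // last ref dropped -> op leaves the tracker
      }
    }
  }
};

int main() {
  OpTracker tracker;
  // Some holder (e.g. a waiting list) keeps the ref alive past completion.
  TrackedOpRef held = tracker.create("osd_op(client.92008.1:3101508 write)");
  tracker.check_ops_in_flight(std::chrono::seconds(0));   // false alarm
  held.reset();                                           // release the ref
  tracker.check_ops_in_flight(std::chrono::seconds(0));   // nothing reported
}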
Actions #4

Updated by Kefu Chai almost 9 years ago

The only holder of the TrackedOpRef is session.waiting_on_map, and once an updated osdmap is received, the request in waiting_on_map is released after it is served.

So it seems it simply took too long to receive the osdmap. No clues...
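
A small, hypothetical illustration of that lifecycle (simplified, not the actual code): the entry in waiting_on_map is the only strong reference, so the op only disappears from the tracker's view once the newer osdmap arrives and the entry is erased.

// Hypothetical sketch (names simplified, not the real Ceph code) tying the
// two mechanisms above together: while the request sits in
// session.waiting_on_map the shared ref keeps it visible to the tracker;
// only when the newer osdmap arrives and the op is served is the entry
// erased, which drops the last reference and lets the tracker forget it.
#include <cassert>
#include <deque>
#include <iostream>
#include <memory>

struct Request { unsigned map_epoch; };
using RequestRef = std::shared_ptr<Request>;

int main() {
  std::deque<RequestRef> waiting_on_map;          // the only strong holder
  std::weak_ptr<Request> tracked;                 // tracker-style weak view

  {
    auto req = std::make_shared<Request>(Request{406836});
    tracked = req;
    waiting_on_map.push_back(std::move(req));     // parked until the map comes
  }

  // The op is still "in flight" as far as the tracker can tell.
  assert(!tracked.expired());
  std::cout << "waiting for osdmap, tracker still sees the op\n";

  // Newer osdmap arrives: serve the request and release the reference.
  waiting_on_map.pop_front();

  // Nothing holds the request any more, so it vanishes from the tracker.
  assert(tracked.expired());
  std::cout << "op served and released; tracker no longer reports it\n";
}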

Actions #5

Updated by Sage Weil over 7 years ago

  • Status changed from New to Can't reproduce