Bug #9389

closed

ec pg stuck peering, did not send query for one shard

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  "recovery_state": [
        { "name": "Started\/Primary\/Peering\/GetInfo",
          "enter_time": "2014-09-08 08:10:05.258543",
          "requested_info_from": [
                { "osd": "2(0)"}]},
...
          "probing_osds": [
                "0(1)",
                "1(2)",
                "2(0)",
                "4(0)",
                "5(3)"],

and it tries to send it:

2014-09-08 08:10:05.258639 7f8545562700 10 osd.5 pg_epoch: 825 pg[1.1es3( v 785'235 (0'0,785'235] local-les=812 n=0 ec=11 les/c 812/808 822/822/818) [2,0,1,5] r=3 lpr=825 pi=730-821/8 crt=753'233 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.2(0)

but when the pg_query goes out, it only has 1.21s3 and 1.fds1, not 1.1es3:
2014-09-08 08:10:05.259864 7f8545562700  7 osd.5 825 do_queries querying osd.2 on 2 PGs
2014-09-08 08:10:05.259865 7f8545562700  1 -- 10.214.136.6:6811/49073 --> 10.214.133.10:6801/51165 -- pg_query(1.21s3,1.fds1 epoch 825) v3 -- ?+0 0x7b14d00 con 0x7137a20
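
To check for this symptom in other logs, a small script can pull the PG list out of a pg_query message line and test whether the expected PG is in it. The following is a minimal sketch, assuming the log line format shown above; the regex and the helper name `parse_pg_query` are illustrative and not part of Ceph.

```python
import re

# Matches the "pg_query(<pgid>,<pgid>,... epoch <N>)" fragment seen in
# messenger debug lines. The pattern is an assumption based on the
# log excerpt in this report, not a format guaranteed by Ceph.
PG_QUERY_RE = re.compile(r"pg_query\(([^\s)]+) epoch (\d+)\)")


def parse_pg_query(line):
    """Return (list_of_pgids, epoch) for a pg_query log line, or None."""
    match = PG_QUERY_RE.search(line)
    if match is None:
        return None
    pgids = match.group(1).split(",")
    return pgids, int(match.group(2))


if __name__ == "__main__":
    line = (
        "2014-09-08 08:10:05.259865 7f8545562700  1 -- "
        "10.214.136.6:6811/49073 --> 10.214.133.10:6801/51165 -- "
        "pg_query(1.21s3,1.fds1 epoch 825) v3 -- ?+0 0x7b14d00 con 0x7137a20"
    )
    pgids, epoch = parse_pg_query(line)
    # The stuck PG from this report is 1.1es3; report whether it was queried.
    print("epoch", epoch, "pgids", pgids, "has 1.1es3:", "1.1es3" in pgids)
```

Run against the line above, the stuck PG 1.1es3 does not appear in the parsed list, which matches the observation in this report.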

ubuntu@teuthology:/a/teuthology-2014-09-08_02:32:01-rados-master-testing-basic-multi/472310


Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #9821: failed to recover before timeout expired (Resolved, Samuel Just, 10/19/2014)

Actions #1

Updated by Samuel Just over 9 years ago

/a/samuelj-2014-09-20_19:00:23-rados-wip-sam-testing-firefly2-wip-testing-old-vanilla-basic-multi/501557

probably related, in GetMissing though.

Actions #2

Updated by Samuel Just over 9 years ago

At least on that one, it looks like do_queries doesn't send the query. That can happen if the osd is down as of the osd epoch (almost certainly not the case here), if up_from for the osd is after the peering_wq working map epoch (also probably not true in this case), or if the messenger's get_connection somehow returned NULL.
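
The three conditions listed above can be written down as a small decision function, which may help when reading the extra debug output. This is only a sketch of the reasoning in this comment, not the actual do_queries code: the `OsdState` type, the `query_skip_reason` function, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OsdState:
    """Hypothetical view of a peer OSD as seen in an OSD map."""
    is_up: bool   # whether the peer is up as of the OSD's current epoch
    up_from: int  # epoch at which the peer most recently came up


def query_skip_reason(peer: OsdState,
                      working_map_epoch: int,
                      connection: Optional[object]) -> Optional[str]:
    """Return why a query would be skipped, or None if it would be sent.

    Models the three cases named in the comment above; the real code
    may check them differently or in a different order.
    """
    if not peer.is_up:
        return "osd is down as of the osd epoch"
    if peer.up_from > working_map_epoch:
        return "up_from is after the peering_wq working map epoch"
    if connection is None:
        return "get_connection returned NULL"
    return None


if __name__ == "__main__":
    # Example values are made up; only epoch 825 comes from the log above.
    peer = OsdState(is_up=True, up_from=800)
    print(query_skip_reason(peer, working_map_epoch=825, connection=None))
```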

Actions #3

Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info

d851c3f2338e8d17dfd78d631b9f7977365356aa adds better debug output (and cleans up a bit)

Actions #4

Updated by Samuel Just over 9 years ago

  • Status changed from Need More Info to Duplicate