Bug #4853 (closed): ceph-fuse hang on mount getattr

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: cuttlefish
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

100% reproducible with this job file:

machine_type: plana
interactive-on-error: true
overrides:
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    branch: wip-up
  workunit:
    branch: wip-up
roles:
- - mon.a
  - mds.0
  - osd.0
  - osd.1
- - mon.b
  - osd.2
  - osd.3
- - client.0
tasks:
- chef: null
- clock.check: null
- install:
    branch: bobtail
- ceph:
    conf:
      client:
        debug client: 20
        debug ms: 1
      mds:
        debug mds: 20
        debug ms: 1
- install.upgrade:
    all:
      branch: next
- ceph.restart: null
- ceph-fuse:
  - client.0
- workunit:
    clients:
      client.0:
      - suites/dbench.sh

(This is from the upgrade suite.)

wip-up adds some debug output that I haven't had time to look at yet. I'm afraid this is a regression from the replay cap release fix.

Actions #1

Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
Actions #2

Updated by Greg Farnum almost 11 years ago

/a/teuthology-2013-04-28_21:32:40-fs-next-testing-basic/2662

That's an fsstress run that hung; I copied the client log into that dir.

Actions #3

Updated by Greg Farnum almost 11 years ago

Ignore that, wrong bug — sorry.

Actions #4

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review

Fix is in wip-up.

Here is the client-side log showing that we send the getattr twice. We only process the first reply and drop the second, which is how the caps state gets screwed up. (A toy sketch of this sequence follows the log below.)

2013-04-29 10:24:46.879319 7f57b68c5780  2 client.4204 mounted: have osdmap 0 and mdsmap 0
2013-04-29 10:24:46.879356 7f57b68c5780 10 client.4204 did not get mds through better means, so chose random mds -1
2013-04-29 10:24:46.879361 7f57b68c5780 20 client.4204 mds is -1
2013-04-29 10:24:46.879362 7f57b68c5780 10 client.4204  target mds.-1 not active, waiting for new mdsmap
2013-04-29 10:24:46.880366 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 5 ==== osd_map(6..6 src has 1..6) v3 ==== 3104+0+0 (4129061429 0 0) 0x22776c0 con 0x2290420
2013-04-29 10:24:46.880504 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 6 ==== mdsmap(e 5) v1 ==== 583+0+0 (929839905 0 0) 0x2281600 con 0x2290420
2013-04-29 10:24:46.880527 7f57b0d09700  1 client.4204 handle_mds_map epoch 5
2013-04-29 10:24:46.880563 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 7 ==== osd_map(6..6 src has 1..6) v3 ==== 3104+0+0 (4129061429 0 0) 0x2277b40 con 0x2290420
2013-04-29 10:24:46.880578 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2192033517 0 0) 0x22911c0 con 0x2290420
2013-04-29 10:24:46.880646 7f57b68c5780 10 client.4204 did not get mds through better means, so chose random mds 0
2013-04-29 10:24:46.880656 7f57b68c5780 20 client.4204 mds is 0
2013-04-29 10:24:46.880666 7f57b68c5780 10 client.4204 _open_mds_session mds.0
2013-04-29 10:24:46.880732 7f57b68c5780  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3185 -- client_session(request_open) v1 -- ?+0 0x2291700 con 0x2290dc0
2013-04-29 10:24:46.880748 7f57b68c5780 10 client.4204 waiting for session to mds.0 to open
2013-04-29 10:24:46.881508 7f57af405700  0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53754 s=1 pgs=0 cs=0 l=0).connect claims to be 10.214.131.32:6806/3663 not 10.214.131.32:6806/3185 - wrong node!
2013-04-29 10:24:46.881585 7f57af405700  0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53754 s=1 pgs=0 cs=0 l=0).fault
2013-04-29 10:24:46.882149 7f57af405700  0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53755 s=1 pgs=0 cs=0 l=0).connect claims to be 10.214.131.32:6806/3663 not 10.214.131.32:6806/3185 - wrong node!
2013-04-29 10:24:46.994039 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 9 ==== mdsmap(e 6) v1 ==== 575+0+0 (907463370 0 0) 0x2281e00 con 0x2290420
2013-04-29 10:24:46.994070 7f57b0d09700  1 client.4204 handle_mds_map epoch 6
2013-04-29 10:24:46.994084 7f57b0d09700  1 -- 10.214.131.31:0/21262 mark_down 10.214.131.32:6806/3185 -- 0x2292280
2013-04-29 10:24:47.498407 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 10 ==== mdsmap(e 7) v1 ==== 583+0+0 (2807803663 0 0) 0x2281600 con 0x2290420
2013-04-29 10:24:47.498437 7f57b0d09700  1 client.4204 handle_mds_map epoch 7
2013-04-29 10:24:47.498451 7f57b0d09700  1 -- 10.214.131.31:0/21262 mark_down 10.214.131.32:6806/3185 -- pipe dne
2013-04-29 10:24:47.879467 7f57b1d0b700 10 client.4204 renew_caps()
2013-04-29 10:24:47.879482 7f57b1d0b700 15 client.4204 renew_caps requesting from mds.0
2013-04-29 10:25:01.302268 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 11 ==== mdsmap(e 8) v1 ==== 583+0+0 (2888500851 0 0) 0x2281a00 con 0x2290420
2013-04-29 10:25:01.302298 7f57b0d09700  1 client.4204 handle_mds_map epoch 8
2013-04-29 10:25:01.302355 7f57b0d09700 10 client.4204 send_reconnect to mds.0
2013-04-29 10:25:01.302375 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.302380 7f57b0d09700  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_reconnect(0 caps) v3 -- ?+0 0x2281600 con 0x2290dc0
2013-04-29 10:25:01.306807 7f57b0d09700 10 client.4204 ms_handle_connect on 10.214.131.32:6806/3663
2013-04-29 10:25:01.307535 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mds.0 10.214.131.32:6806/3663 1 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x2291700 con 0x2290dc0
2013-04-29 10:25:01.307569 7f57b0d09700 10 client.4204 handle_client_session client_session(open) v1 from mds.0
2013-04-29 10:25:01.307578 7f57b0d09700 10 client.4204 renew_caps mds.0
2013-04-29 10:25:01.307580 7f57b0d09700  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_session(request_renewcaps seq 1) v1 -- ?+0 0x22911c0 con 0x2290dc0
2013-04-29 10:25:01.307598 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.307685 7f57b68c5780 10 client.4204 send_request rebuilding request 1 for mds.0
2013-04-29 10:25:01.307717 7f57b68c5780 20 client.4204 encode_cap_releases enter (req: 0x22a0000, mds: 0)
2013-04-29 10:25:01.307722 7f57b68c5780 20 client.4204 send_request set sent_stamp to 2013-04-29 10:25:01.307722
2013-04-29 10:25:01.307727 7f57b68c5780 10 client.4204 send_request client_request(unknown.0:1 getattr pAsLsXsFs #1) v1 to mds.0
2013-04-29 10:25:01.307740 7f57b68c5780  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_request(client.4204:1 getattr pAsLsXsFs #1) v1 -- ?+0 0x22ab500 con 0x2290dc0
2013-04-29 10:25:01.307761 7f57b68c5780 20 client.4204 awaiting reply|forward|kick on 0x7fffed882b50
2013-04-29 10:25:01.438798 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mds.0 10.214.131.32:6806/3663 2 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x22911c0 con 0x2290dc0
2013-04-29 10:25:01.438828 7f57b0d09700 10 client.4204 handle_client_session client_session(open) v1 from mds.0
2013-04-29 10:25:01.438835 7f57b0d09700 10 client.4204 renew_caps mds.0
2013-04-29 10:25:01.438837 7f57b0d09700  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_session(request_renewcaps seq 2) v1 -- ?+0 0x2291700 con 0x2290dc0
2013-04-29 10:25:01.438853 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.570996 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 12 ==== mdsmap(e 9) v1 ==== 583+0+0 (1513426008 0 0) 0x2281800 con 0x2290420
2013-04-29 10:25:01.571026 7f57b0d09700  1 client.4204 handle_mds_map epoch 9
2013-04-29 10:25:01.687263 7f57b0d09700  1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 13 ==== mdsmap(e 10) v1 ==== 583+0+0 (2013845237 0 0) 0x2281600 con 0x2290420
2013-04-29 10:25:01.687293 7f57b0d09700  1 client.4204 handle_mds_map epoch 10
2013-04-29 10:25:01.687306 7f57b0d09700 10 client.4204 kick_requests for mds.0
2013-04-29 10:25:01.687309 7f57b0d09700 10 client.4204 send_request rebuilding request 1 for mds.0
2013-04-29 10:25:01.687311 7f57b0d09700 20 client.4204 encode_cap_releases enter (req: 0x22a0000, mds: 0)
2013-04-29 10:25:01.687313 7f57b0d09700 10 client.4204 send_request client_request(unknown.0:1 getattr pAsLsXsFs #1) v1 to mds.0
2013-04-29 10:25:01.687321 7f57b0d09700  1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_request(client.4204:1 getattr pAsLsXsFs #1) v1 -- ?+0 0x2292c80 con 0x2290dc0
2013-04-29 10:25:01.687344 7f57b0d09700 10 client.4204 kick_flushing_caps mds.0
2013-04-29 10:25:01.687346 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
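
A minimal, self-contained sketch (plain standalone C++17, not the actual Client.cc code; ToyClient, PendingRequest, handle_reply and the cap bit values are made-up names for illustration) of the failure mode the log shows: the same tid is resent by kick_requests() after the MDS restart, two replies eventually come back for it, and if only the first one is applied while the second is dropped, whatever cap state the second reply carried never reaches the client:

// Toy model of the race in this ticket, NOT the real Client.cc logic.
// All names here (ToyClient, PendingRequest, ...) are invented for
// illustration; the cap values are arbitrary bit flags.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct PendingRequest {
  uint64_t tid = 0;
  std::string op;          // e.g. "getattr pAsLsXsFs #1"
  int resend_count = 0;
  bool completed = false;
};

struct Reply {
  uint64_t tid;
  unsigned caps_issued;    // caps the MDS granted along with this reply
};

struct ToyClient {
  std::map<uint64_t, PendingRequest> mds_requests;  // in-flight, keyed by tid
  unsigned caps_cached = 0;                         // what we think we hold

  void send_request(uint64_t tid, const std::string& op) {
    mds_requests[tid] = PendingRequest{tid, op};
    std::cout << "send_request tid " << tid << " (" << op << ")\n";
  }

  // After an MDS restart the client resends every incomplete request,
  // which is why the log above shows the same getattr going out twice.
  void kick_requests() {
    for (auto& [tid, req] : mds_requests) {
      if (!req.completed) {
        ++req.resend_count;
        std::cout << "kick_requests: resending tid " << tid << "\n";
      }
    }
  }

  // Behaviour modelled on the ticket: only the first reply for a tid is
  // applied; a later reply for the same tid is dropped, so any cap state
  // it carried is silently lost and the client/MDS views diverge.
  void handle_reply(const Reply& r) {
    auto it = mds_requests.find(r.tid);
    if (it == mds_requests.end() || it->second.completed) {
      std::cout << "handle_reply: dropping reply for tid " << r.tid
                << ", caps 0x" << std::hex << r.caps_issued << std::dec
                << " are lost\n";
      return;
    }
    caps_cached |= r.caps_issued;
    it->second.completed = true;
    std::cout << "handle_reply: tid " << r.tid << " done, caps_cached 0x"
              << std::hex << caps_cached << std::dec << "\n";
  }
};

int main() {
  ToyClient c;
  c.send_request(1, "getattr pAsLsXsFs #1");
  c.kick_requests();          // MDS restarted: tid 1 is sent a second time
  c.handle_reply({1, 0x1});   // first reply is applied
  c.handle_reply({1, 0x3});   // second reply is dropped; the extra caps it
                              // granted never reach caps_cached, matching the
                              // "caps state gets screwed up" description above
}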

Actions #5

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
  • Backport set to cuttlefish

This is not a new race; it is only triggered when an MDS session open and request race with an MDS restart. It is not a cuttlefish blocker, although we should backport the fix if it doesn't make the release.

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved

commit:ee553ac279664b7f1b527a0b1b56768134cf5157
