Bug #4853
closed
ceph-fuse hang on mount getattr
Added by Sage Weil about 11 years ago.
Updated about 11 years ago.
Description
100% reproducible with this job file
machine_type: plana
interactive-on-error: true
overrides:
ceph:
conf:
mon:
debug mon: 20
debug ms: 20
debug paxos: 20
log-whitelist:
- slow request
branch: wip-up
workunit:
branch: wip-up
roles:
- - mon.a
- mds.0
- osd.0
- osd.1
- - mon.b
- osd.2
- osd.3
- - client.0
tasks:
- chef: null
- clock.check: null
- install:
branch: bobtail
- ceph:
conf:
client:
debug client: 20
debug ms: 1
mds:
debug mds: 20
debug ms: 1
- install.upgrade:
all:
branch: next
- ceph.restart: null
- ceph-fuse:
- client.0
- workunit:
clients:
client.0:
- suites/dbench.sh
(this is from the upgrade suite).
wip-up adds some debug output htat i haven't had time to look at yet. afraid this is a regression from the replay cap release fix
- Status changed from New to In Progress
- Assignee set to Sage Weil
/a/teuthology-2013-04-28_21:32:40-fs-next-testing-basic/2662
That's an fsstress run that got hung, I copied the client log into that dir.
Ignore that, wrong bug — sorry.
- Status changed from In Progress to Fix Under Review
fix in wip-up
here is the client-side log that shows we send the getattr twice. we only process the first reply, and drop the second, which is how the caps state gets screwed up.
2013-04-29 10:24:46.879319 7f57b68c5780 2 client.4204 mounted: have osdmap 0 and mdsmap 0
2013-04-29 10:24:46.879356 7f57b68c5780 10 client.4204 did not get mds through better means, so chose random mds -1
2013-04-29 10:24:46.879361 7f57b68c5780 20 client.4204 mds is -1
2013-04-29 10:24:46.879362 7f57b68c5780 10 client.4204 target mds.-1 not active, waiting for new mdsmap
2013-04-29 10:24:46.880366 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 5 ==== osd_map(6..6 src has 1..6) v3 ==== 3104+0+0 (4129061429 0 0) 0x22776c0 con 0x2290420
2013-04-29 10:24:46.880504 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 6 ==== mdsmap(e 5) v1 ==== 583+0+0 (929839905 0 0) 0x2281600 con 0x2290420
2013-04-29 10:24:46.880527 7f57b0d09700 1 client.4204 handle_mds_map epoch 5
2013-04-29 10:24:46.880563 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 7 ==== osd_map(6..6 src has 1..6) v3 ==== 3104+0+0 (4129061429 0 0) 0x2277b40 con 0x2290420
2013-04-29 10:24:46.880578 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2192033517 0 0) 0x22911c0 con 0x2290420
2013-04-29 10:24:46.880646 7f57b68c5780 10 client.4204 did not get mds through better means, so chose random mds 0
2013-04-29 10:24:46.880656 7f57b68c5780 20 client.4204 mds is 0
2013-04-29 10:24:46.880666 7f57b68c5780 10 client.4204 _open_mds_session mds.0
2013-04-29 10:24:46.880732 7f57b68c5780 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3185 -- client_session(request_open) v1 -- ?+0 0x2291700 con 0x2290dc0
2013-04-29 10:24:46.880748 7f57b68c5780 10 client.4204 waiting for session to mds.0 to open
2013-04-29 10:24:46.881508 7f57af405700 0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53754 s=1 pgs=0 cs=0 l=0).connect claims to be 10.214.131.32:6806/3663 not 10.214.131.32:6806/3185 - wrong node!
2013-04-29 10:24:46.881585 7f57af405700 0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53754 s=1 pgs=0 cs=0 l=0).fault
2013-04-29 10:24:46.882149 7f57af405700 0 -- 10.214.131.31:0/21262 >> 10.214.131.32:6806/3185 pipe(0x2292280 sd=7 :53755 s=1 pgs=0 cs=0 l=0).connect claims to be 10.214.131.32:6806/3663 not 10.214.131.32:6806/3185 - wrong node!
2013-04-29 10:24:46.994039 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 9 ==== mdsmap(e 6) v1 ==== 575+0+0 (907463370 0 0) 0x2281e00 con 0x2290420
2013-04-29 10:24:46.994070 7f57b0d09700 1 client.4204 handle_mds_map epoch 6
2013-04-29 10:24:46.994084 7f57b0d09700 1 -- 10.214.131.31:0/21262 mark_down 10.214.131.32:6806/3185 -- 0x2292280
2013-04-29 10:24:47.498407 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 10 ==== mdsmap(e 7) v1 ==== 583+0+0 (2807803663 0 0) 0x2281600 con 0x2290420
2013-04-29 10:24:47.498437 7f57b0d09700 1 client.4204 handle_mds_map epoch 7
2013-04-29 10:24:47.498451 7f57b0d09700 1 -- 10.214.131.31:0/21262 mark_down 10.214.131.32:6806/3185 -- pipe dne
2013-04-29 10:24:47.879467 7f57b1d0b700 10 client.4204 renew_caps()
2013-04-29 10:24:47.879482 7f57b1d0b700 15 client.4204 renew_caps requesting from mds.0
2013-04-29 10:25:01.302268 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 11 ==== mdsmap(e 8) v1 ==== 583+0+0 (2888500851 0 0) 0x2281a00 con 0x2290420
2013-04-29 10:25:01.302298 7f57b0d09700 1 client.4204 handle_mds_map epoch 8
2013-04-29 10:25:01.302355 7f57b0d09700 10 client.4204 send_reconnect to mds.0
2013-04-29 10:25:01.302375 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.302380 7f57b0d09700 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_reconnect(0 caps) v3 -- ?+0 0x2281600 con 0x2290dc0
2013-04-29 10:25:01.306807 7f57b0d09700 10 client.4204 ms_handle_connect on 10.214.131.32:6806/3663
2013-04-29 10:25:01.307535 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mds.0 10.214.131.32:6806/3663 1 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x2291700 con 0x2290dc0
2013-04-29 10:25:01.307569 7f57b0d09700 10 client.4204 handle_client_session client_session(open) v1 from mds.0
2013-04-29 10:25:01.307578 7f57b0d09700 10 client.4204 renew_caps mds.0
2013-04-29 10:25:01.307580 7f57b0d09700 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_session(request_renewcaps seq 1) v1 -- ?+0 0x22911c0 con 0x2290dc0
2013-04-29 10:25:01.307598 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.307685 7f57b68c5780 10 client.4204 send_request rebuilding request 1 for mds.0
2013-04-29 10:25:01.307717 7f57b68c5780 20 client.4204 encode_cap_releases enter (req: 0x22a0000, mds: 0)
2013-04-29 10:25:01.307722 7f57b68c5780 20 client.4204 send_request set sent_stamp to 2013-04-29 10:25:01.307722
2013-04-29 10:25:01.307727 7f57b68c5780 10 client.4204 send_request client_request(unknown.0:1 getattr pAsLsXsFs #1) v1 to mds.0
2013-04-29 10:25:01.307740 7f57b68c5780 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_request(client.4204:1 getattr pAsLsXsFs #1) v1 -- ?+0 0x22ab500 con 0x2290dc0
2013-04-29 10:25:01.307761 7f57b68c5780 20 client.4204 awaiting reply|forward|kick on 0x7fffed882b50
2013-04-29 10:25:01.438798 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mds.0 10.214.131.32:6806/3663 2 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x22911c0 con 0x2290dc0
2013-04-29 10:25:01.438828 7f57b0d09700 10 client.4204 handle_client_session client_session(open) v1 from mds.0
2013-04-29 10:25:01.438835 7f57b0d09700 10 client.4204 renew_caps mds.0
2013-04-29 10:25:01.438837 7f57b0d09700 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_session(request_renewcaps seq 2) v1 -- ?+0 0x2291700 con 0x2290dc0
2013-04-29 10:25:01.438853 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
2013-04-29 10:25:01.570996 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 12 ==== mdsmap(e 9) v1 ==== 583+0+0 (1513426008 0 0) 0x2281800 con 0x2290420
2013-04-29 10:25:01.571026 7f57b0d09700 1 client.4204 handle_mds_map epoch 9
2013-04-29 10:25:01.687263 7f57b0d09700 1 -- 10.214.131.31:0/21262 <== mon.0 10.214.131.32:6789/0 13 ==== mdsmap(e 10) v1 ==== 583+0+0 (2013845237 0 0) 0x2281600 con 0x2290420
2013-04-29 10:25:01.687293 7f57b0d09700 1 client.4204 handle_mds_map epoch 10
2013-04-29 10:25:01.687306 7f57b0d09700 10 client.4204 kick_requests for mds.0
2013-04-29 10:25:01.687309 7f57b0d09700 10 client.4204 send_request rebuilding request 1 for mds.0
2013-04-29 10:25:01.687311 7f57b0d09700 20 client.4204 encode_cap_releases enter (req: 0x22a0000, mds: 0)
2013-04-29 10:25:01.687313 7f57b0d09700 10 client.4204 send_request client_request(unknown.0:1 getattr pAsLsXsFs #1) v1 to mds.0
2013-04-29 10:25:01.687321 7f57b0d09700 1 -- 10.214.131.31:0/21262 --> 10.214.131.32:6806/3663 -- client_request(client.4204:1 getattr pAsLsXsFs #1) v1 -- ?+0 0x2292c80 con 0x2290dc0
2013-04-29 10:25:01.687344 7f57b0d09700 10 client.4204 kick_flushing_caps mds.0
2013-04-29 10:25:01.687346 7f57b0d09700 10 client.4204 connect_mds_targets for mds.0
- Priority changed from Urgent to High
- Backport set to cuttlefish
this is not a new race, and is only triggered when a mds session open and request race with an mds restart. not a cuttlefish blocker, although we should backport if it doesn't make the release.
- Status changed from Fix Under Review to Resolved
commit:ee553ac279664b7f1b527a0b1b56768134cf5157
Also available in: Atom
PDF