Bug #51069
mds: mkdir on ephemerally pinned directory sometimes blocked on journal flush
Description
2021-06-02T03:57:06.273 DEBUG:teuthology.orchestra.run.smithi005:> (cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c '
2021-06-02T03:57:06.273 DEBUG:teuthology.orchestra.run.smithi005:> set -ex
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:> mkdir -p tree
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:>
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:> setfattr -n ceph.dir.pin.distributed -v 1 tree
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:>
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:> for ((i = 0; i < 1000; i++)); do
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:> mkdir -p "tree/$i"
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:> echo file > "tree/$i/file"
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:> done
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:> ')
...
2021-06-02T03:57:08.446 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:12.542 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:12.542 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:12.543 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/347
2021-06-02T03:57:12.548 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:17.487 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:17.488 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:17.488 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/348
2021-06-02T03:57:17.592 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:20.843 DEBUG:teuthology.orchestra.run.smithi005:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-06-02T03:57:20.846 DEBUG:teuthology.orchestra.run.smithi076:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-06-02T03:57:22.474 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.475 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.475 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/349
2021-06-02T03:57:22.556 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/350
2021-06-02T03:57:22.593 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.606 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.606 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.607 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/351
2021-06-02T03:57:22.613 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.619 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.620 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.620 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/352
2021-06-02T03:57:22.624 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/353
2021-06-02T03:57:27.517 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/354
2021-06-02T03:57:27.530 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:32.459 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
...
From: /ceph/teuthology-archive/pdonnell-2021-06-02_03:31:59-fs-master-distro-basic-smithi/6145427/teuthology.log
The 5 second turnaround usually is a symptom of something blocking on journal flush.
History
#1 Updated by Patrick Donnelly almost 3 years ago
- Description updated (diff)
- Status changed from New to Triaged
- Assignee set to Xiubo Li
#2 Updated by Xiubo Li almost 3 years ago
The journal entry was submitted very early:
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.openfiles remove_inode [inode 0x10000000561 [...2,head] /tree/352/ auth v844 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) caps={8439=-/pAsxXsxFsxcrwb@0},l=8439 | caps=1 0x55eb8b98bb80]
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.server journal_and_reply tracei 0x55eb8b98bb80 tracedn 0x55eb8ba0a000
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.locker set_xlocks_done on (dn xlock x=1 by 0x55eb8b73d400) [dentry #0x1/tree/352 [2,head] auth NULL (dn xlock x=1 by 0x55eb8b73d400) (dversion lock w=1 last_client=8439) pv=844 v=843 ap=2 ino=(nil) state=1073741888|bottomlru | request=1 lock=2 authpin=1 0x55eb8ba0a000]
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.8439:1308 mkdir #0x100000002a0/352 2021-06-02T03:57:22.623214+0000 caller_uid=0, caller_gid=0{0,}) v5
...
2021-06-02T03:57:22.624+0000 7f41cc3eb700 5 mds.0.log _submit_thread 8317390~2221 : EUpdate mkdir [metablob 0x1, 3 dirs]
...
But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is false: the ifile lock is not locked:
2021-06-02T03:57:22.622+0000 7f41d2bf8700 10 mds.0.locker nudge_log (ifile sync->mix g=1,2) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1630 ap=2 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:22.612022+0000 b1730 701=348+353) (isnap sync r=6) (inest mix w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile sync->mix g=1,2) (iversion lock) caps={8439=pAsLsXs/-@475} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
And the next nudge_log() call that actually reaches mdlog->flush() comes 5 seconds later:
2021-06-02T03:57:27.435+0000 7f41d13f5700 10 mds.0.locker nudge_log (ifile mix->sync w=1 scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1630 pv1632 ap=5 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352)->f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.612022+0000 b1730 701=348+353)->n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=2 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock w=1 last_client=8439) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=5 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
The journal flush was skipped in journal_and_reply() because the early reply had already been sent:
...
      early_reply(mdr, in, dn);

      mdr->committing = true;
      submit_mdlog_entry(le, fin, mdr, __func__);

      if (mdr->client_request && mdr->client_request->is_queued_for_replay()) {
        if (mds->queue_one_replay()) {
          dout(10) << " queued next replay op" << dendl;
        } else {
          dout(10) << " journaled last replay op" << dendl;
        }
      } else if (mdr->did_early_reply)
        mds->locker->drop_rdlocks_for_early_reply(mdr.get());
      else
        mdlog->flush();
    }
But after that, the flush is never triggered in time.
I think the reason mdlog->flush() is delayed here is the assumption that, once the early reply has been sent, the safe reply is no longer urgent. I haven't found anywhere that mds->locker->drop_rdlocks_for_early_reply() triggers mdlog->flush(), directly or indirectly.
Maybe we should follow all the other call sites, which call mdlog->flush() immediately after submit_mdlog_entry()?
#3 Updated by Patrick Donnelly almost 3 years ago
Thanks for analyzing this Xiubo!
Xiubo Li wrote:
The journal entry was submitted very early:
[...]
But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is false: the ifile lock is not locked:
[...]
And the next nudge_log() call that actually reaches mdlog->flush() comes 5 seconds later:
[...]
The journal flush was skipped in journal_and_reply() because the early reply had already been sent:
[...]
But after that, the flush is never triggered in time.
I think the reason mdlog->flush() is delayed here is the assumption that, once the early reply has been sent, the safe reply is no longer urgent.
It is, actually. The client mount won't complete the mkdir syscall until the safe reply is received.
I haven't found anywhere that mds->locker->drop_rdlocks_for_early_reply() triggers mdlog->flush(), directly or indirectly.
I'm not sure but clearly it needs to gather some locks from other mds before it can flush the entry? We need to nudge that along to ensure these delays do not occur.
Maybe we should follow all the others who are calling the mdlog->flush() immediately just after submit_mdlog_entry() ?
I don't think so. I wish this was commented -- I'm not sure why it's done that way.
#4 Updated by Xiubo Li almost 3 years ago
Patrick Donnelly wrote:
Thanks for analyzing this Xiubo!
Xiubo Li wrote:
The journal entry was submitted very early:
[...]
But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is false: the ifile lock is not locked:
[...]
And the next nudge_log() call that actually reaches mdlog->flush() comes 5 seconds later:
[...]
The journal flush was skipped in journal_and_reply() because the early reply had already been sent:
[...]
But after that, the flush is never triggered in time.
I think the reason mdlog->flush() is delayed here is the assumption that, once the early reply has been sent, the safe reply is no longer urgent.
It is, actually. The client mount won't complete the mkdir syscall until the safe reply is received.
Yeah, the client is waiting for the safe reply, which is sent only after the journal is successfully flushed.
I haven't found anywhere that mds->locker->drop_rdlocks_for_early_reply() triggers mdlog->flush(), directly or indirectly.
I'm not sure, but clearly it needs to gather some locks from other MDSs before it can flush the entry? We need to nudge that along to ensure these delays do not occur.
The thing is, other threads can trigger mdlog->flush() to flush the journal regardless of whether the current thread has finished gathering the locks.
I don't see any constraint that prevents us from doing the flush immediately here.
Maybe we should follow all the other call sites, which call mdlog->flush() immediately after submit_mdlog_entry()?
I don't think so. I wish this was commented -- I'm not sure why it's done that way.
I didn't find any explanation for this in the commit history.
#5 Updated by Xiubo Li almost 3 years ago
Xiubo Li wrote:
Patrick Donnelly wrote:
Thanks for analyzing this Xiubo!
Xiubo Li wrote:
The journal entry was submitted very early:
[...]
But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is false: the ifile lock is not locked:
[...]
And the next nudge_log() call that actually reaches mdlog->flush() comes 5 seconds later:
[...]
The journal flush was skipped in journal_and_reply() because the early reply had already been sent:
[...]
But after that, the flush is never triggered in time.
I think the reason mdlog->flush() is delayed here is the assumption that, once the early reply has been sent, the safe reply is no longer urgent.
It is, actually. The client mount won't complete the mkdir syscall until the safe reply is received.
Yeah, the client is waiting for the safe reply, which is sent only after the journal is successfully flushed.
I think we were both wrong here.
Checking the client code and logs, mkdir does not wait for the safe reply; the unsafe one is enough. That is why mdlog->flush() isn't triggered right after early_reply(): it is unnecessary there.
From the client logs:
...
2021-06-02T03:57:22.624+0000 7f96af7fe700 10 client.8439 _mkdir result is 0
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 trim_cache size 707 max 16384
2021-06-02T03:57:22.624+0000 7f96af7fe700 8 client.8439 _mkdir(#0x100000002a0/352, 040755) = 0
2021-06-02T03:57:22.624+0000 7f96af7fe700 10 client.8439 fill_stat on 0x10000000561 snap/devhead mode 040755 mtime 2021-06-02T03:57:22.623214+0000 ctime 2021-06-02T03:57:22.623214+0000
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 _ll_get 0x7f96b816a6e0 0x10000000561 -> 1
2021-06-02T03:57:22.624+0000 7f96af7fe700 3 client.8439 ll_mkdir 0x100000002a0.head 352 = 0 (10000000561)
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 put_inode on 0x10000000561.head(faked_ino=0 nref=7 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-02T03:57:22.623214+0000 mtime=2021-06-02T03:57:22.623214+0000 ctime=2021-06-02T03:57:22.623214+0000 caps=pAsxLsXsxFsx(0=pAsxLsXsxFsx) COMPLETE parents=0x100000002a0.head["352"] 0x7f96b816a6e0) n = 1
2021-06-02T03:57:22.624+0000 7f96af7fe700 8 client.8439 _ll_forget 0x100000002a0 1
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 _ll_put 0x7f96b8007fa0 0x100000002a0 1 -> 1060
2021-06-02T03:57:22.624+0000 7f96affff700 3 client.8439 ll_lookup 0x1.head tree
...
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 awaiting reply|forward|kick on 0x7f96afffcda0
...
2021-06-02T03:57:27.449+0000 7f96beffd700 20 client.8439 handle_client_reply signalling caller 0x7f96afffcda0
2021-06-02T03:57:27.449+0000 7f96beffd700 20 client.8439 handle_client_reply awaiting kickback on tid 1309 0x7f96beffb070
...
2021-06-02T03:57:27.449+0000 7f96affff700 20 client.8439 _ll_get 0x7f96b8007fa0 0x100000002a0 -> 1061
2021-06-02T03:57:27.449+0000 7f96affff700 3 client.8439 ll_lookup 0x1.head tree -> 0 (100000002a0)
We can see that mkdir("352/") succeeded, but the subsequent lookup of the "tree" directory got stuck...
#6 Updated by Xiubo Li almost 3 years ago
The "tree/" directory is auth-pinned on the mds.0 daemon. After the "tree/352/" directory was successfully created, when running:
2021-06-02T03:57:22.624 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
the client needs to look up "tree/352/file", but for the "tree/" component it sent the lookup request to mds.2 instead. From the fuse client log:
2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 choose_target_mds from dirfragtree hash
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 mds is 2
2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 send_request rebuilding request 1309 for mds.2
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 encode_cap_releases enter (req: 0x7f96a00e7030, mds: 2)
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 send_request set sent_stamp to 2021-06-02T03:57:22.626116+0000
2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 send_request client_request(unknown.0:1309 lookup #0x1/tree 2021-06-02T03:57:22.626101+0000 caller_uid=0, caller_gid=0{0,}) v5 to mds.2
But the lookup request got stuck waiting for the ifile lock to become STABLE:
2021-06-02T03:57:22.626+0000 7f0779622700 7 mds.2.locker rdlock_start on (ifile mix->sync(2)) on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1630 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest mix dirty) (ifile mix->sync(2)) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=1 importing=0 dirfrag=11 caps=1 waiter=0 distepin 0x562c0fde0b00]
2021-06-02T03:57:22.626+0000 7f0779622700 7 mds.2.locker rdlock_start waiting on (ifile mix->sync(2)) on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1630 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest mix dirty) (ifile mix->sync(2)) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=1 importing=0 dirfrag=11 caps=1 waiter=0 distepin 0x562c0fde0b00]
2021-06-02T03:57:22.626+0000 7f0779622700 10 mds.2.cache.ino(0x100000002a0) add_waiter tag 2000000040000000 0x562c0fe300c0 !ambig 1 !frozen 1 !freezing 1
It stayed stuck until 5 seconds later, when the ifile lock became STABLE:
2021-06-02T03:57:27.447+0000 7f0779622700 1 -- [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] <== mds.0 v2:172.21.15.5:6835/1368589064 347 ==== lock(a=sync ifile 0x100000002a0.head) v1 ==== 13495+0+0 (unknown 0 0 0) 0x562c0efb2000 con 0x562c0fe08800
...
2021-06-02T03:57:27.448+0000 7f0779622700 7 mds.2.locker issue_caps allowed=pAsLsXsFscr, xlocker allowed=pAsLsXsFscr on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1632 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest lock flushing) (ifile sync r=1) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=2 importing=0 dirfrag=11 caps=1 waiter=1 distepin 0x562c0fde0b00]
2021-06-02T03:57:27.448+0000 7f0779622700 20 mds.2.locker client.8439 pending pAsLsXs allowed pAsLsXsFs wanted -
2021-06-02T03:57:27.448+0000 7f0779622700 10 MDSContext::complete: 18C_MDS_RetryRequest
2021-06-02T03:57:27.448+0000 7f0779622700 7 mds.2.server dispatch_client_request client_request(client.8439:1309 lookup #0x1/tree 2021-06-02T03:57:22.626101+0000 caller_uid=0, caller_gid=0{0,}) v5
This is just after the "mkdir tree/352" journal entry was successfully flushed on mds.0:
2021-06-02T03:57:27.447+0000 7f41ccbec700 7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.8439:1308 mkdir #0x100000002a0/352 2021-06-02T03:57:22.623214+0000 caller_uid=0, caller_gid=0{0,}) v5
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x10000000561
2021-06-02T03:57:27.447+0000 7f41ccbec700 20 mds.0.sessionmap mark_dirty s=0x55eb8b79cf00 name=client.8439 v=880
2021-06-02T03:57:27.447+0000 7f41ccbec700 7 mds.0.locker wrlock_finish on (inest mix->lock(2) g=1 w=1 dirty) on [inode 0x1 [...2,head] / auth{1=1,2=1} v124 ap=1 snaprealm=0x55eb8b5a7600 DIRTYPARENT f(v0 m2021-06-02T03:57:06.305406+0000 1=0+1) n(v2 rc2021-06-02T03:57:22.453553+0000 b1730 699=348+351)/n(v0 rc2021-06-02T03:56:51.974090+0000 1=0+1) (isnap sync r=6) (inest mix->lock(2) g=1 w=1 dirty) (ipolicy sync r=5) (iversion lock) caps={8439=pAsLsXsFs/-@130} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55eb8b6d4000]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker eval_gather (inest mix->lock(2) g=1 dirty) on [inode 0x1 [...2,head] / auth{1=1,2=1} v124 ap=1 snaprealm=0x55eb8b5a7600 DIRTYPARENT f(v0 m2021-06-02T03:57:06.305406+0000 1=0+1) n(v2 rc2021-06-02T03:57:22.453553+0000 b1730 699=348+351)/n(v0 rc2021-06-02T03:56:51.974090+0000 1=0+1) (isnap sync r=6) (inest mix->lock(2) g=1 dirty) (ipolicy sync r=5) (iversion lock) caps={8439=pAsLsXsFs/-@130} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55eb8b6d4000]
2021-06-02T03:57:27.447+0000 7f41ccbec700 7 mds.0.locker local_wrlock_finish on (iversion lock w=1 last_client=8439) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock w=1 last_client=8439) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=5 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 7 mds.0.locker wrlock_finish on (ifile mix->sync w=1 scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=4 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker eval_gather (ifile mix->sync scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker next state is sync issued/allows loner /scrl xlocker /scrl other /scrl
2021-06-02T03:57:27.447+0000 7f41ccbec700 7 mds.0.locker eval_gather finished gather on (ifile mix->sync scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 15 mds.0.cache.ino(0x100000002a0) encode_lock_ifile inode.dirstat is f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353)
...
2021-06-02T03:57:27.448+0000 7f41ccbec700 1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] send_to--> mds [v2:172.21.15.5:6834/2757933374,v1:172.21.15.5:6836/2757933374] -- lock(a=sync ifile 0x100000002a0.head) v1 -- ?+0 0x55eb8b71cc40
2021-06-02T03:57:27.448+0000 7f41ccbec700 1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] --> [v2:172.21.15.5:6834/2757933374,v1:172.21.15.5:6836/2757933374] -- lock(a=sync ifile 0x100000002a0.head) v1 -- 0x55eb8b71cc40 con 0x55eb8b7b0000
2021-06-02T03:57:27.448+0000 7f41ccbec700 1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] send_to--> mds [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] -- lock(a=sync ifile 0x100000002a0.head) v1 -- ?+0 0x55eb8b71ce00
2021-06-02T03:57:27.448+0000 7f41ccbec700 1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] --> [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] -- lock(a=sync ifile 0x100000002a0.head) v1 -- 0x55eb8b71ce00 con 0x55eb8bb02800
So the lookup request on mds.2 was waiting for the journal to be flushed on the mds.0 daemon.
So the mdlog->flush() was necessary even though the early reply had been sent. Alternatively, we could ask mds.0 to do a nudge_log() or something similar to trigger the mdlog->flush().
Patrick, any ideas?
#7 Updated by Xiubo Li almost 3 years ago
5 seconds is also the value of "mds_tick_interval" in the MDS daemon; the periodic tick calls scatter_tick(), which eventually calls mdlog->flush().
#8 Updated by Xiubo Li almost 3 years ago
- Pull request ID set to 41849
#9 Updated by Xiubo Li almost 3 years ago
- Status changed from Triaged to Fix Under Review
#10 Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to pacific
#11 Updated by Backport Bot over 2 years ago
- Copied to Backport #51412: pacific: mds: mkdir on ephemerally pinned directory sometimes blocked on journal flush added
#12 Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".