Bug #51069

mds: mkdir on ephemerally pinned directory sometimes blocked on journal flush

Added by Patrick Donnelly almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: Xiubo Li
Category: Performance/Resource Usage
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): task(medium)
Pull request ID: 41849
Crash signature (v1):
Crash signature (v2):

Description

2021-06-02T03:57:06.273 DEBUG:teuthology.orchestra.run.smithi005:> (cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c '
2021-06-02T03:57:06.273 DEBUG:teuthology.orchestra.run.smithi005:> set -ex
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:> mkdir -p tree
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:>
2021-06-02T03:57:06.274 DEBUG:teuthology.orchestra.run.smithi005:> setfattr -n ceph.dir.pin.distributed -v 1 tree
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:>
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:> for ((i = 0; i < 1000; i++)); do
2021-06-02T03:57:06.275 DEBUG:teuthology.orchestra.run.smithi005:>     mkdir -p "tree/$i" 
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:>     echo file > "tree/$i/file" 
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:> done
2021-06-02T03:57:06.276 DEBUG:teuthology.orchestra.run.smithi005:> ')
...
2021-06-02T03:57:08.446 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:12.542 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:12.542 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:12.543 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/347
2021-06-02T03:57:12.548 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:17.487 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:17.488 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:17.488 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/348
2021-06-02T03:57:17.592 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:20.843 DEBUG:teuthology.orchestra.run.smithi005:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-06-02T03:57:20.846 DEBUG:teuthology.orchestra.run.smithi076:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-06-02T03:57:22.474 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.475 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.475 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/349
2021-06-02T03:57:22.556 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.574 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/350
2021-06-02T03:57:22.593 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.606 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.606 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.607 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/351
2021-06-02T03:57:22.613 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:22.619 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:22.620 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:22.620 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/352
2021-06-02T03:57:22.624 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:27.462 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/353
2021-06-02T03:57:27.517 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i < 1000 ))
2021-06-02T03:57:27.524 INFO:teuthology.orchestra.run.smithi005.stderr:+ mkdir -p tree/354
2021-06-02T03:57:27.530 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file
2021-06-02T03:57:32.459 INFO:teuthology.orchestra.run.smithi005.stderr:+ (( i++ ))
...

From: /ceph/teuthology-archive/pdonnell-2021-06-02_03:31:59-fs-master-distro-basic-smithi/6145427/teuthology.log

The 5-second turnaround is usually a symptom of something blocking on a journal flush.
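
For reference, the core of the teuthology command above can be run by hand against any CephFS deployment. A minimal sketch, assuming a multi-active-MDS cluster mounted at /mnt/cephfs (the mount point and the per-iteration timing are illustrative additions, not part of the original test):

#!/usr/bin/env bash
# Reproducer sketch: distributed ephemeral pin + mkdir/create loop.
# Timing each iteration makes the ~5s stalls easy to spot.
set -ex
cd /mnt/cephfs

mkdir -p tree
# Enable distributed ephemeral pinning on the parent directory.
setfattr -n ceph.dir.pin.distributed -v 1 tree

for ((i = 0; i < 1000; i++)); do
    start=$SECONDS
    mkdir -p "tree/$i"
    echo file > "tree/$i/file"
    echo "iteration $i: $((SECONDS - start))s" >&2
done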


Related issues

Copied to CephFS - Backport #51412: pacific: mds: mkdir on ephemerally pinned directory sometimes blocked on journal flush Resolved

History

#1 Updated by Patrick Donnelly almost 3 years ago

  • Description updated (diff)
  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

#2 Updated by Xiubo Li almost 3 years ago

The journal entry was submitted very early:


2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.openfiles remove_inode [inode 0x10000000561 [...2,head] /tree/352/ auth v844 f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) caps={8439=-/pAsxXsxFsxcrwb@0},l=8439 | caps=1 0x55eb8b98bb80]
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.server journal_and_reply tracei 0x55eb8b98bb80 tracedn 0x55eb8ba0a000
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.locker set_xlocks_done on (dn xlock x=1 by 0x55eb8b73d400) [dentry #0x1/tree/352 [2,head] auth NULL (dn xlock x=1 by 0x55eb8b73d400) (dversion lock w=1 last_client=8439) pv=844 v=843 ap=2 ino=(nil) state=1073741888|bottomlru | request=1 lock=2 authpin=1 0x55eb8ba0a000]
2021-06-02T03:57:22.624+0000 7f41d2bf8700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.8439:1308 mkdir #0x100000002a0/352 2021-06-02T03:57:22.623214+0000 caller_uid=0, caller_gid=0{0,}) v5

...

2021-06-02T03:57:22.624+0000 7f41cc3eb700  5 mds.0.log _submit_thread 8317390~2221 : EUpdate mkdir [metablob 0x1, 3 dirs]

...

But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is not true: the ifile lock is not locked:


2021-06-02T03:57:22.622+0000 7f41d2bf8700 10 mds.0.locker nudge_log (ifile sync->mix g=1,2) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1630 ap=2 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:22.612022+0000 b1730 701=348+353) (isnap sync r=6) (inest mix w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile sync->mix g=1,2) (iversion lock) caps={8439=pAsLsXs/-@475} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]

And the next nudge_log() call that actually calls the mdlog->flush() comes 5 seconds later:


2021-06-02T03:57:27.435+0000 7f41d13f5700 10 mds.0.locker nudge_log (ifile mix->sync w=1 scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1630 pv1632 ap=5 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352)->f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.612022+0000 b1730 701=348+353)->n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=2 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock w=1 last_client=8439) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=5 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]

The journal flush was delayed in journal_and_reply() because the early reply was done:

  ...
  early_reply(mdr, in, dn);

  mdr->committing = true;
  submit_mdlog_entry(le, fin, mdr, __func__);

  if (mdr->client_request && mdr->client_request->is_queued_for_replay()) {
    if (mds->queue_one_replay()) {
      dout(10) << " queued next replay op" << dendl;
    } else {
      dout(10) << " journaled last replay op" << dendl;
    }
  } else if (mdr->did_early_reply)
    mds->locker->drop_rdlocks_for_early_reply(mdr.get());
  else
    mdlog->flush();
}

But it was never called after that in time.

I think the reason the mdlog->flush() is delayed here is that, since the early reply has already been sent, the safe reply is assumed not to be that urgent. I haven't yet confirmed whether the mds->locker->drop_rdlocks_for_early_reply() triggers the mdlog->flush() directly or indirectly.

Maybe we should follow all the other call sites that call the mdlog->flush() immediately after submit_mdlog_entry()?

#3 Updated by Patrick Donnelly almost 3 years ago

Thanks for analyzing this Xiubo!

Xiubo Li wrote:

The journal entry was submitted very early:

[...]

But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is not true: the ifile lock is not locked:

[...]

And the next nudge_log() call that actually calls the mdlog->flush() comes 5 seconds later:

[...]

The journal flush was delayed in journal_and_reply() because the early reply was done:

[...]

But it was never called after that in time.

I think the reason the mdlog->flush() is delayed here is that, since the early reply has already been sent, the safe reply is assumed not to be that urgent.

It is actually. The client mount won't complete the mkdir syscall until the safe reply is received.

I haven't yet confirmed whether the mds->locker->drop_rdlocks_for_early_reply() triggers the mdlog->flush() directly or indirectly.

I'm not sure but clearly it needs to gather some locks from other mds before it can flush the entry? We need to nudge that along to ensure these delays do not occur.

Maybe we should follow all the other call sites that call the mdlog->flush() immediately after submit_mdlog_entry()?

I don't think so. I wish this was commented -- I'm not sure why it's done that way.

#4 Updated by Xiubo Li almost 3 years ago

Patrick Donnelly wrote:

Thanks for analyzing this Xiubo!

Xiubo Li wrote:

The journal entry was submitted very early:

[...]

But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is not true: the ifile lock is not locked:

[...]

And the next nudge_log() call that actually calls the mdlog->flush() comes 5 seconds later:

[...]

The journal flush was delayed in journal_and_reply() because the early reply was done:

[...]

But it was never called after that in time.

I think the reason the mdlog->flush() is delayed here is that, since the early reply has already been sent, the safe reply is assumed not to be that urgent.

It is actually. The client mount won't complete the mkdir syscall until the safe reply is received.

Yeah, the client is waiting for the safe reply, which will be sent after the journal log is successfully flushed.

I haven't yet confirmed whether the mds->locker->drop_rdlocks_for_early_reply() triggers the mdlog->flush() directly or indirectly.

I'm not sure but clearly it needs to gather some locks from other mds before it can flush the entry? We need to nudge that along to ensure these delays do not occur.

The problem is that other threads could trigger the mdlog->flush() to flush the journal log regardless of whether the current thread has finished gathering the locks.

I didn't see any restriction that would prevent us from doing the flush immediately here.

Maybe we should follow all the other call sites that call the mdlog->flush() immediately after submit_mdlog_entry()?

I don't think so. I wish this was commented -- I'm not sure why it's done that way.

I didn't see any comments explaining this in the commit history.

#5 Updated by Xiubo Li almost 3 years ago

Xiubo Li wrote:

Patrick Donnelly wrote:

Thanks for analyzing this Xiubo!

Xiubo Li wrote:

The journal entry was submitted very early:

[...]

But it didn't flush the journal in nudge_log(), because the check "if (lock->get_parent()->is_auth() && lock->is_unstable_and_locked())" is not true: the ifile lock is not locked:

[...]

And the next nudge_log() call that actually calls the mdlog->flush() comes 5 seconds later:

[...]

The journal flush was delayed in journal_and_reply() because the early reply was done:

[...]

But it was never called after that in time.

I think the reason the mdlog->flush() is delayed here is that, since the early reply has already been sent, the safe reply is assumed not to be that urgent.

It is actually. The client mount won't complete the mkdir syscall until the safe reply is received.

Yeah, the client is waiting for the safe reply, which will be sent after the journal log is successfully flushed.

I think we both are wrong here.

Having checked the Client code and logs, "mkdir" doesn't wait for the safe reply; the unsafe one is enough, and that is why the mdlog->flush() isn't triggered right after early_reply(): it is unnecessary.

From the client logs:

...
2021-06-02T03:57:22.624+0000 7f96af7fe700 10 client.8439 _mkdir result is 0
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 trim_cache size 707 max 16384
2021-06-02T03:57:22.624+0000 7f96af7fe700  8 client.8439 _mkdir(#0x100000002a0/352, 040755) = 0
2021-06-02T03:57:22.624+0000 7f96af7fe700 10 client.8439 fill_stat on 0x10000000561 snap/devhead mode 040755 mtime 2021-06-02T03:57:22.623214+0000 ctime 2021-06-02T03:57:22.623214+0000
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 _ll_get 0x7f96b816a6e0 0x10000000561 -> 1
2021-06-02T03:57:22.624+0000 7f96af7fe700  3 client.8439 ll_mkdir 0x100000002a0.head 352 = 0 (10000000561)
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 put_inode on 0x10000000561.head(faked_ino=0 nref=7 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2021-06-02T03:57:22.623214+0000 mtime=2021-06-02T03:57:22.623214+0000 ctime=2021-06-02T03:57:22.623214+0000 caps=pAsxLsXsxFsx(0=pAsxLsXsxFsx) COMPLETE parents=0x100000002a0.head["352"] 0x7f96b816a6e0) n = 1
2021-06-02T03:57:22.624+0000 7f96af7fe700  8 client.8439 _ll_forget 0x100000002a0 1
2021-06-02T03:57:22.624+0000 7f96af7fe700 20 client.8439 _ll_put 0x7f96b8007fa0 0x100000002a0 1 -> 1060
2021-06-02T03:57:22.624+0000 7f96affff700  3 client.8439 ll_lookup 0x1.head tree
...
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 awaiting reply|forward|kick on 0x7f96afffcda0

...

2021-06-02T03:57:27.449+0000 7f96beffd700 20 client.8439 handle_client_reply signalling caller 0x7f96afffcda0
2021-06-02T03:57:27.449+0000 7f96beffd700 20 client.8439 handle_client_reply awaiting kickback on tid 1309 0x7f96beffb070
...
2021-06-02T03:57:27.449+0000 7f96affff700 20 client.8439 _ll_get 0x7f96b8007fa0 0x100000002a0 -> 1061
2021-06-02T03:57:27.449+0000 7f96affff700  3 client.8439 ll_lookup 0x1.head tree -> 0 (100000002a0)

We can see that mkdir("352/") succeeded, but then the client got stuck looking up the "tree" dir...

#6 Updated by Xiubo Li almost 3 years ago

The "tree/" directory is auth pinned in mds.0 daemon, while after the "tree/352/" directory was successfully created and when doing the:

2021-06-02T03:57:22.624 INFO:teuthology.orchestra.run.smithi005.stderr:+ echo file

which looks up "tree/352/file". But when looking up the "tree/" directory, the client sent the request to mds.2 instead. From the fuse client log:

2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 choose_target_mds from dirfragtree hash
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 mds is 2
2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 send_request rebuilding request 1309 for mds.2
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 encode_cap_releases enter (req: 0x7f96a00e7030, mds: 2)
2021-06-02T03:57:22.624+0000 7f96affff700 20 client.8439 send_request set sent_stamp to 2021-06-02T03:57:22.626116+0000
2021-06-02T03:57:22.624+0000 7f96affff700 10 client.8439 send_request client_request(unknown.0:1309 lookup #0x1/tree 2021-06-02T03:57:22.626101+0000 caller_uid=0, caller_gid=0{0,}) v5 to mds.2
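
(Side note: the distributed pin and the resulting spread of the "tree/" subtrees across ranks can be inspected from the command line; a sketch, assuming a local mount at /mnt/cephfs, admin access, and jq installed:)

getfattr -n ceph.dir.pin.distributed /mnt/cephfs/tree   # expect: ceph.dir.pin.distributed="1"
# List which rank is authoritative for each subtree/dirfrag (rank addressing may differ by release):
ceph tell mds.0 get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'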

But the lookup request was stuck waiting for the ifile lock to become STABLE:

2021-06-02T03:57:22.626+0000 7f0779622700  7 mds.2.locker rdlock_start  on (ifile mix->sync(2)) on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1630 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest mix dirty) (ifile mix->sync(2)) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=1 importing=0 dirfrag=11 caps=1 waiter=0 distepin 0x562c0fde0b00]
2021-06-02T03:57:22.626+0000 7f0779622700  7 mds.2.locker rdlock_start waiting on (ifile mix->sync(2)) on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1630 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.612022+0000 352=0+352) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest mix dirty) (ifile mix->sync(2)) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=1 importing=0 dirfrag=11 caps=1 waiter=0 distepin 0x562c0fde0b00]
2021-06-02T03:57:22.626+0000 7f0779622700 10 mds.2.cache.ino(0x100000002a0) add_waiter tag 2000000040000000 0x562c0fe300c0 !ambig 1 !frozen 1 !freezing 1

Until 5 seconds later, when the ifile lock became STABLE:

2021-06-02T03:57:27.447+0000 7f0779622700  1 -- [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] <== mds.0 v2:172.21.15.5:6835/1368589064 347 ==== lock(a=sync ifile 0x100000002a0.head) v1 ==== 13495+0+0 (unknown 0 0 0) 0x562c0efb2000 con 0x562c0fe08800

...

2021-06-02T03:57:27.448+0000 7f0779622700  7 mds.2.locker issue_caps allowed=pAsLsXsFscr, xlocker allowed=pAsLsXsFscr on [inode 0x100000002a0 [...2,head] /tree/ rep@0.4 fragtree_t(*^7) v1632 DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:12.547337+0000 b1730 696=347+349)/n(v101 rc2021-06-02T03:57:08.382951+0000 687=343+344) (isnap sync r=1) (inest lock flushing) (ifile sync r=1) (iversion lock) caps={8439=pAsLsXs/-@10} | dirtyscattered=1 request=1 lock=2 importing=0 dirfrag=11 caps=1 waiter=1 distepin 0x562c0fde0b00]
2021-06-02T03:57:27.448+0000 7f0779622700 20 mds.2.locker  client.8439 pending pAsLsXs allowed pAsLsXsFs wanted -
2021-06-02T03:57:27.448+0000 7f0779622700 10 MDSContext::complete: 18C_MDS_RetryRequest
2021-06-02T03:57:27.448+0000 7f0779622700  7 mds.2.server dispatch_client_request client_request(client.8439:1309 lookup #0x1/tree 2021-06-02T03:57:22.626101+0000 caller_uid=0, caller_gid=0{0,}) v5

Which is just after the "mkdir tree/352" journal entry was successfully flushed on mds.0:

2021-06-02T03:57:27.447+0000 7f41ccbec700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.8439:1308 mkdir #0x100000002a0/352 2021-06-02T03:57:22.623214+0000 caller_uid=0, caller_gid=0{0,}) v5
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x10000000561
2021-06-02T03:57:27.447+0000 7f41ccbec700 20 mds.0.sessionmap mark_dirty s=0x55eb8b79cf00 name=client.8439 v=880
2021-06-02T03:57:27.447+0000 7f41ccbec700  7 mds.0.locker wrlock_finish on (inest mix->lock(2) g=1 w=1 dirty) on [inode 0x1 [...2,head] / auth{1=1,2=1} v124 ap=1 snaprealm=0x55eb8b5a7600 DIRTYPARENT f(v0 m2021-06-02T03:57:06.305406+0000 1=0+1) n(v2 rc2021-06-02T03:57:22.453553+0000 b1730 699=348+351)/n(v0 rc2021-06-02T03:56:51.974090+0000 1=0+1) (isnap sync r=6) (inest mix->lock(2) g=1 w=1 dirty) (ipolicy sync r=5) (iversion lock) caps={8439=pAsLsXsFs/-@130} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55eb8b6d4000]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker eval_gather (inest mix->lock(2) g=1 dirty) on [inode 0x1 [...2,head] / auth{1=1,2=1} v124 ap=1 snaprealm=0x55eb8b5a7600 DIRTYPARENT f(v0 m2021-06-02T03:57:06.305406+0000 1=0+1) n(v2 rc2021-06-02T03:57:22.453553+0000 b1730 699=348+351)/n(v0 rc2021-06-02T03:56:51.974090+0000 1=0+1) (isnap sync r=6) (inest mix->lock(2) g=1 dirty) (ipolicy sync r=5) (iversion lock) caps={8439=pAsLsXsFs/-@130} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55eb8b6d4000]
2021-06-02T03:57:27.447+0000 7f41ccbec700  7 mds.0.locker local_wrlock_finish  on (iversion lock w=1 last_client=8439) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock w=1 last_client=8439) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=5 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700  7 mds.0.locker wrlock_finish on (ifile mix->sync w=1 scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync w=1 scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=4 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker eval_gather (ifile mix->sync scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 10 mds.0.locker  next state is sync issued/allows loner /scrl xlocker /scrl other /scrl
2021-06-02T03:57:27.447+0000 7f41ccbec700  7 mds.0.locker eval_gather finished gather on (ifile mix->sync scatter_wanted) on [inode 0x100000002a0 [...2,head] /tree/ auth{1=1,2=4} fragtree_t(*^7) v1632 ap=73 DIRTYPARENT DISTEPHEMERALPIN f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353) n(v102 rc2021-06-02T03:57:22.623214+0000 b1730 702=348+354) (isnap sync r=6) (inest mix->lock(2) g=1,2 w=1 dirty scatter_wanted) (ipolicy sync r=5) (ifile mix->sync scatter_wanted) (iversion lock) caps={8439=pAsLsXs/-@476} | dirtyscattered=1 request=0 lock=3 dirfrag=128 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 distepin 0x55eb8b951700]
2021-06-02T03:57:27.447+0000 7f41ccbec700 15 mds.0.cache.ino(0x100000002a0) encode_lock_ifile inode.dirstat is f(v1 m2021-06-02T03:57:22.623214+0000 353=0+353)

...

2021-06-02T03:57:27.448+0000 7f41ccbec700  1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] send_to--> mds [v2:172.21.15.5:6834/2757933374,v1:172.21.15.5:6836/2757933374] -- lock(a=sync ifile 0x100000002a0.head) v1 -- ?+0 0x55eb8b71cc40
2021-06-02T03:57:27.448+0000 7f41ccbec700  1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] --> [v2:172.21.15.5:6834/2757933374,v1:172.21.15.5:6836/2757933374] -- lock(a=sync ifile 0x100000002a0.head) v1 -- 0x55eb8b71cc40 con 0x55eb8b7b0000
2021-06-02T03:57:27.448+0000 7f41ccbec700  1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] send_to--> mds [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] -- lock(a=sync ifile 0x100000002a0.head) v1 -- ?+0 0x55eb8b71ce00
2021-06-02T03:57:27.448+0000 7f41ccbec700  1 -- [v2:172.21.15.5:6835/1368589064,v1:172.21.15.5:6837/1368589064] --> [v2:172.21.15.76:6834/126060394,v1:172.21.15.76:6835/126060394] -- lock(a=sync ifile 0x100000002a0.head) v1 -- 0x55eb8b71ce00 con 0x55eb8bb02800

So the lookup request on mds.2 was waiting until the journal was flushed by the mds.0 daemon.

So the mdlog->flush() was necessary even though the early reply was done. Or we could just ask mds.0 to do a nudge_log() or something similar to trigger the mdlog->flush().
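
(While debugging, the hypothesis can also be confirmed by forcing the flush by hand on the authoritative rank; a sketch, assuming shell access to the host running the mds.0 daemon, with "a" as a placeholder daemon name:)

# Force an immediate journal flush on the rank holding the dirty subtree;
# the blocked lookup on mds.2 should complete right after.
ceph daemon mds.a flush journal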

Patrick, any ideas?

#7 Updated by Xiubo Li almost 3 years ago

5 seconds is also the "mds_tick_interval" of the MDS daemon; that tick calls scatter_tick(), which finally calls the mdlog->flush().
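
(The current value can be checked with the config commands; a sketch, assuming admin access, with "a" as a placeholder daemon name:)

# Cluster-wide setting for the MDS tick (default 5 seconds):
ceph config get mds mds_tick_interval
# Or ask a running daemon directly via its admin socket (run on the MDS host):
ceph daemon mds.a config get mds_tick_interval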

#8 Updated by Xiubo Li almost 3 years ago

  • Pull request ID set to 41849

#9 Updated by Xiubo Li almost 3 years ago

  • Status changed from Triaged to Fix Under Review

#10 Updated by Patrick Donnelly over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific

#11 Updated by Backport Bot over 2 years ago

  • Copied to Backport #51412: pacific: mds: mkdir on ephemerally pinned directory sometimes blocked on journal flush added

#12 Updated by Loïc Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
