After investigating, we found that when .d_revalidate() is called on the dentry of the stale directory, cephfs issues a lookup op to the MDS. When the MDS reply arrives, ceph_fill_trace() is invoked to incorporate the fresh data into the local cache. The newly created directory with the same name carries a different inode, so the stale dentry is passed to d_invalidate(), which is expected to unhash it from the global dentry hashtable.
However, the dentry is referenced by at least the VFS lookup and the ceph lookup request, so its reference count is at least 2, and the __d_drop() in d_invalidate() is never reached for it. The stale dentry stays on the hashtable.
d_invalidate():
    if (dentry->d_lockref.count > 1 && dentry->d_inode) { // <=========
        if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
            spin_unlock(&dentry->d_lock);
            return -EBUSY;
        }
    }
    __d_drop(dentry);
ceph_d_revalidate():
    err = ceph_mdsc_do_request(mdsc, NULL, req);
    if (err == 0 || err == -ENOENT) {
        if (dentry == req->r_dentry) {
            valid = !d_unhashed(dentry); // <========
        } else {
            d_invalidate(req->r_dentry);
            err = -EAGAIN;
        }
    }
    ... ...
    if (valid) {
        ceph_dentry_lru_touch(dentry);
    } else {
        ceph_dir_clear_complete(dir);
        d_drop(dentry);
    }
    return valid;
When the lookup request finishes, the condition 'dentry == req->r_dentry' still holds, so 'valid' is computed from '!d_unhashed(dentry)', which evaluates to true. ceph_d_revalidate() therefore reports the stale dentry as valid, and the stale directory shows up on the console.
The problem exists only in kernel versions before 3.19. Starting with 3.19 (inclusive), d_invalidate() changed its logic: the dentry passed in is unhashed from the global dentry hashtable regardless of its reference count.
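To make the difference between the two d_invalidate() behaviours concrete, here is a minimal userspace sketch. It is not kernel code: struct toy_dentry and all toy_* functions are hypothetical stand-ins that model only the reference count, the directory check and the hashed state described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dentry; only the fields discussed here. */
struct toy_dentry {
    int  count;   /* models dentry->d_lockref.count */
    bool is_dir;  /* models S_ISDIR(dentry->d_inode->i_mode) */
    bool hashed;  /* true while the dentry is on the hashtable */
};

/* Models the pre-3.19 logic quoted above: a referenced directory
 * is left on the hashtable and -EBUSY is returned. */
int toy_d_invalidate_old(struct toy_dentry *d)
{
    if (d->count > 1 && d->is_dir)
        return -EBUSY;
    d->hashed = false;          /* models __d_drop() */
    return 0;
}

/* Models the 3.19+ logic as described: unhash regardless of refcount. */
void toy_d_invalidate_new(struct toy_dentry *d)
{
    d->hashed = false;
}

/* Models 'valid = !d_unhashed(dentry)' in ceph_d_revalidate(). */
bool toy_revalidate_result(const struct toy_dentry *d)
{
    return d->hashed;
}

/* Walks through the scenario: a stale directory dentry held by two users. */
void toy_demo(void)
{
    struct toy_dentry stale = { .count = 2, .is_dir = true, .hashed = true };

    /* Old behaviour: invalidation is refused, revalidate says "valid". */
    assert(toy_d_invalidate_old(&stale) == -EBUSY);
    assert(toy_revalidate_result(&stale));

    /* New behaviour: the dentry is unhashed, revalidate says "invalid". */
    toy_d_invalidate_new(&stale);
    assert(!toy_revalidate_result(&stale));
}
```

Calling toy_demo() from a main() runs through the stale-directory case with a reference count of 2, the minimum the report says the dentry has during the lookup.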
My fix is to insert d_drop() right after d_invalidate() in ceph_fill_trace(), where the problem occurs, to make sure the stale dentry is actually unhashed.
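The effect of the fix can be sketched in the same spirit. This is a userspace model under stated assumptions, not the actual patch: struct model_dentry and the model_* functions are hypothetical, and model_d_drop() assumes that d_drop() unhashes a dentry without looking at its reference count.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dentry. */
struct model_dentry {
    int  count;   /* models the reference count */
    bool is_dir;  /* models the S_ISDIR() check */
    bool hashed;  /* true while on the hashtable */
};

/* Models pre-3.19 d_invalidate(): refuses to unhash a busy directory. */
int model_d_invalidate(struct model_dentry *d)
{
    if (d->count > 1 && d->is_dir)
        return -EBUSY;
    d->hashed = false;
    return 0;
}

/* Models d_drop(): assumed to unhash unconditionally. */
void model_d_drop(struct model_dentry *d)
{
    d->hashed = false;
}

/* Models the proposed change at the stale-dentry site in
 * ceph_fill_trace(): invalidate, then drop explicitly. */
void model_handle_stale_dentry(struct model_dentry *d)
{
    model_d_invalidate(d);   /* may return -EBUSY on old kernels */
    model_d_drop(d);         /* guarantees the dentry is unhashed */
}

/* The stale directory dentry, still referenced twice, ends up unhashed. */
void model_demo(void)
{
    struct model_dentry stale = { .count = 2, .is_dir = true, .hashed = true };

    model_handle_stale_dentry(&stale);
    assert(!stale.hashed);   /* '!d_unhashed(dentry)' would now be false */
}
```

On a kernel where d_invalidate() already unhashes unconditionally, the extra drop in this model is a no-op, since the dentry is already off the hashtable by then.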
I am not sure where to commit the bugfix, since this is a problem on old Linux kernels only. Or can I just push it to the master branch?