Bug #1774: client: files become inaccessible in large directories (with snapshots?) - CephFS - Ceph

Bug #1774

closed

Added by Alexandre Oliva over 12 years ago. Updated almost 7 years ago.

Status: Resolved

Priority: Normal

Severity: 3 - minor

Description

Taking snapshots of certain directories within ceph that hold backups of root filesystems of my openmoko phone causes some files to disappear. After some experimentation, I found that the issue doesn't only affect files in the snapshots; sometimes I also fail to access files in the original directories. From the observed behavior, I'm guessing it has to do with some boundary condition in the mds: the information is there, but it's not retrieved when the file happens to fall at some specific offset within the directory, or something like that. The evidence is that adding or removing files (and letting the mds commit the changes from its log, then starting a fresh mds) changes which file is faulty, but once the directory holds exactly the contents of the originally backed-up image, the files that fail are always the same, though they differ across the 3 backup images, which have different sets of packages installed.

The faulty directory, in these 3 cases, has always been /var/lib/opkg/info, that holds multiple files per installed package, such as file lists, control scripts and more. File names are built out of the package name plus a suffix indicating the function, so we end up with long names, and lots of them. When I take a snapshot, we apparently cross a threshold, and then files that end up precisely at the border start to fail.

I attach level 20 debug dumps from the mds. It's surely not a coincidence that the 3 files that find says it can't stat (i.e., they appear as dir entries, but stat/read/write fails) are the ones that appear in the "snapid ... offset" messages in the mds logs:

for d in .link/{Om2008.8-orig,shr-testing-2010-03+,shr-testing2011.1-2011-03-17}/usr/lib/opkg/info; do ../../gen-list $d > /dev/null; done
find: `.link/Om2008.8-orig/usr/lib/opkg/info/qtopia-phone-x11-composer-genericcomposer.list': No such file or directory
find: `.link/shr-testing-2010-03+/usr/lib/opkg/info/update-modules.postinst': No such file or directory
find: `.link/shr-testing2011.1-2011-03-17/usr/lib/opkg/info/task-shr-minimal-apps.control': No such file or directory
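The symptom above (a name is returned by the directory listing, but stat on it fails) can be checked for directly, without the gen-list helper, whose contents are not shown in this report. The following is only a minimal sketch: the function names `report_unstatable` and `list_unstatable` are made up for illustration, and it assumes that `ls -A` obtains the names from readdir alone and that file names contain no newlines.

```bash
#!/bin/sh
# report_unstatable DIR: read names from stdin (one per line) and print
# DIR/NAME for every name that cannot be lstat'ed inside DIR.
# (Hypothetical helper, not part of ceph or of the attached scripts.)
report_unstatable() {
    dir=$1
    while IFS= read -r name; do
        # -e follows symlinks; -L also accepts dangling symlinks.
        # If both tests fail, the entry itself could not be stat'ed.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}

# list_unstatable DIR: take the names the directory listing returns
# (assumption: plain "ls -A" does not need to stat each entry) and
# report the ones that fail the check above.
list_unstatable() {
    ls -A -- "$1" | report_unstatable "$1"
}
```

Running `list_unstatable .link/Om2008.8-orig/usr/lib/opkg/info` on an affected mount would be expected to print the same paths that find complains about; on a healthy directory it prints nothing.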

grep "snapid 22 offset '[^']" ~/mds-baddir.log
2011-12-01 00:37:16.520212 7f2cde2e2700 mds.0.server snapid 22 offset 'qtopia-phone-x11-composer-genericcomposer.list'
2011-12-01 00:37:26.925145 7f2cde2e2700 mds.0.server snapid 22 offset 'update-modules.postinst'
2011-12-01 00:37:36.710803 7f2cde2e2700 mds.0.server snapid 22 offset 'task-shr-minimal-apps.control'

Neat, eh? I attach the compressed mds log.
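The comparison between the find errors and the mds log was done by eye above; it could also be automated roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the log lines are assumed to have exactly the "snapid N offset 'name'" shape quoted above, and the find messages are assumed to wrap the path in one quote character on each side, as in the output shown in this report.

```bash
#!/bin/sh
# offset_names LOG: print the non-empty names found in
# "snapid N offset 'name'" lines of an mds log.
offset_names() {
    sed -n "s/^.* snapid [0-9][0-9]* offset '\([^'][^']*\)'.*\$/\1/p" "$1"
}

# failed_basenames: read find's error output on stdin and print the
# base name of every path reported as "No such file or directory".
failed_basenames() {
    while IFS= read -r line; do
        case $line in
            "find: "*": No such file or directory")
                path=${line#find: }
                path=${path%: No such file or directory}
                path=${path#?}   # drop the opening quote character
                path=${path%?}   # drop the closing quote character
                printf '%s\n' "${path##*/}"
                ;;
        esac
    done
}

# matching_names LOG ERRS: print the failing file names that also show
# up as a readdir offset in the mds log.
matching_names() {
    offsets=$(mktemp) || return 1
    offset_names "$1" | LC_ALL=C sort -u > "$offsets"
    # -F fixed strings, -x whole-line match, -f patterns from file
    failed_basenames < "$2" | LC_ALL=C sort -u | grep -Fx -f "$offsets"
    status=$?
    rm -f "$offsets"
    return $status
}
```

For example, after saving find's stderr with `2> find-errors.txt` (a file name chosen here for illustration), `matching_names ~/mds-baddir.log find-errors.txt` lists the names present in both; a non-zero exit status means there was no overlap.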

Files

mds-baddir.log.xz (851 KB): mds log. Alexandre Oliva, 11/30/2011 07:21 PM
0001-Start-caching-readdir-results-after-readdir_start.patch (1.07 KB). Alexandre Oliva, 01/09/2012 07:59 PM
gen-1774.bz2 (8.03 KB): bash script that tests that the problem is fixed. Alexandre Oliva, 01/11/2012 04:24 PM

Updated by Alexandre Oliva over 12 years ago

Updated by Sage Weil over 12 years ago

Updated by Sage Weil over 12 years ago

Updated by Sage Weil over 12 years ago

Updated by Sage Weil over 12 years ago

Updated by Alexandre Oliva over 12 years ago

Updated by Sage Weil over 12 years ago

Updated by Alexandre Oliva over 12 years ago

Updated by Greg Farnum almost 7 years ago