Bug #1774

closed

client: files become inaccessible in large directories (with snapshots?)

Added by Alexandre Oliva over 12 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Taking snapshots of certain directories within ceph that hold backups of root filesystems of my openmoko phone causes some files to disappear. After some experimentation, I found that the issue doesn't only affect files in the snapshots; sometimes I also fail to access files in the original directories. From the observed behavior, I'm guessing it has to do with some boundary condition in the mds: the information is there, but it's not retrieved when the file happens to fall at some specific offset within the directory or some such. The evidence is that adding or removing files (and letting the mds commit the changes from its log, then starting a fresh mds) changes which file is faulty, but once the directory holds exactly the contents of the originally backed-up image, the files that fail are always the same. The failing files do differ, though, across 3 different backup images with different sets of packages installed.

The faulty directory, in these 3 cases, has always been /usr/lib/opkg/info, which holds multiple files per installed package, such as file lists, control scripts and more. File names are built out of the package name plus a suffix indicating the function, so we end up with long names, and lots of them. When I take a snapshot, we apparently cross a threshold, and then files that end up precisely at the border start to fail.

I attach level 20 debug dumps from the mds. It's surely not a coincidence that the 3 files that find says it can't stat (i.e., they appear as dir entries, but stat/read/write fails) are exactly the ones that appear in the "snapid ... offset" messages in the mds logs:

  # for d in .link/{Om2008.8-orig,shr-testing-2010-03+,shr-testing2011.1-2011-03-17}/usr/lib/opkg/info; do ../../gen-list $d > /dev/null; done
    find: `.link/Om2008.8-orig/usr/lib/opkg/info/qtopia-phone-x11-composer-genericcomposer.list': No such file or directory
    find: `.link/shr-testing-2010-03+/usr/lib/opkg/info/update-modules.postinst': No such file or directory
    find: `.link/shr-testing2011.1-2011-03-17/usr/lib/opkg/info/task-shr-minimal-apps.control': No such file or directory
  # grep "snapid 22 offset '[^']" ~/mds-baddir.log
    2011-12-01 00:37:16.520212 7f2cde2e2700 mds.0.server snapid 22 offset 'qtopia-phone-x11-composer-genericcomposer.list'
    2011-12-01 00:37:26.925145 7f2cde2e2700 mds.0.server snapid 22 offset 'update-modules.postinst'
    2011-12-01 00:37:36.710803 7f2cde2e2700 mds.0.server snapid 22 offset 'task-shr-minimal-apps.control'
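
The gen-list helper used in the first command is not shown in this report. As a rough stand-in, the symptom described above (a name shows up in the directory listing, but stat on it fails) could be checked with a shell function along these lines. The name list_unstatable is made up for illustration, and the sketch assumes file names without embedded newlines:

```shell
# Hypothetical helper (not part of the report): print every name that the
# directory listing returns but that cannot be stat'ed.
# Assumes file names contain no newline characters.
list_unstatable() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "list_unstatable: not a directory: $dir" >&2
        return 2
    fi
    # "ls -A" normally only needs the entry names; the tests below are
    # what actually stat each entry.
    ls -A "$dir" | while IFS= read -r name; do
        # -e follows symlinks (stat), -L does not (lstat). An entry that
        # fails both is listed but inaccessible.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}
```

On a healthy directory this should print nothing. Against one of the affected directories, e.g. list_unstatable .link/Om2008.8-orig/usr/lib/opkg/info, it would be expected to print the entry that find complains about.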

Neat, eh? I attach the compressed mds log.
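
To cross-check the attached log against the names that find complains about, the offset names can be pulled out of log lines like the ones quoted above. A minimal sketch, assuming the log line format is exactly as in the excerpt; the function name snapid_offsets is hypothetical:

```shell
# Hypothetical helper: read mds log lines on stdin and print the non-empty
# offset names logged for the given snapid, one per line.
snapid_offsets() {
    snap=$1
    # Keep only the text between the quotes after "snapid N offset";
    # lines with an empty offset ('') do not match and are dropped.
    sed -n "s/.* snapid $snap offset '\(..*\)'\$/\1/p"
}
```

Something like xz -dc mds-baddir.log.xz | snapid_offsets 22 | sort -u should then give a list that can be compared by eye, or with comm, against the basenames in the find errors.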


Files

mds-baddir.log.xz (851 KB), mds log, Alexandre Oliva, 11/30/2011 07:21 PM
0001-Start-caching-readdir-results-after-readdir_start.patch (1.07 KB), Alexandre Oliva, 01/09/2012 07:59 PM
gen-1774.bz2 (8.03 KB), bash script that tests that the problem is fixed, Alexandre Oliva, 01/11/2012 04:24 PM