Bug #3935

closed

kclient: Big directory access bugs (multiple), mixed 32- and 64-bit clients

Added by Ivan Kudryavtsev about 11 years ago. Updated almost 8 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have the following directory structure in CephFS:

somedir
       subdir1
              == 35K files, 20 MB each ==

I'm using Ceph Bobtail. This directory is mounted from fstab like this:

10.252.0.3:6789,10.252.0.2:6789,10.252.0.4:6789:/ /mnt/ceph ceph _netdev,snapdirname=.cc633faa563cbe671221758ad9c01de3,dirstat,norbytes,nocrc,name=admin,secret=SOMESECRET==,readdir_max_entries=8192,readdir_max_bytes=4194304 0 0
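
To make the mount options above easier to read, here is a minimal Python sketch that splits the fstab entry into its fields. The helper name `parse_fstab_entry` is hypothetical and only for illustration; it performs plain string splitting and says nothing about how the kernel client interprets each option.

```python
# Illustrative only: split the fstab entry quoted above into its parts.
FSTAB_LINE = (
    "10.252.0.3:6789,10.252.0.2:6789,10.252.0.4:6789:/ /mnt/ceph ceph "
    "_netdev,snapdirname=.cc633faa563cbe671221758ad9c01de3,dirstat,norbytes,"
    "nocrc,name=admin,secret=SOMESECRET==,readdir_max_entries=8192,"
    "readdir_max_bytes=4194304 0 0"
)


def parse_fstab_entry(line):
    """Return the monitors, path, mount point and options of one fstab line."""
    device, mountpoint, fstype, opts, dump, passno = line.split()
    # The device field is "<mon>[,<mon>...]:<path>"; split on the last colon.
    monitors, _, path = device.rpartition(":")
    options = {}
    for opt in opts.split(","):
        key, sep, value = opt.partition("=")
        # Flags such as "dirstat" carry no value; record them as True.
        options[key] = value if sep else True
    return {
        "monitors": monitors.split(","),
        "path": path,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


entry = parse_fstab_entry(FSTAB_LINE)
for key, value in entry["options"].items():
    print(key, "=", value)
```

Read this way, the entry asks for readdir replies of at most 8192 entries and 4194304 bytes (4 MiB) each.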

from two hosts:

1st is a 32-bit host running a 3.7.3 kernel with the ceph module
2nd is a 64-bit host running a 3.7.3 kernel with the ceph module

If I cd to that dir from the 64-bit host, I can see the contents and can copy files from it to the local fs.
If I try to create a subdir inside subdir1 and move a file from subdir1 to subdir1/subdir2, it hangs on every host accessing the directory; only a complete unmount from all hosts followed by an MDS restart fixes it.

If I cd to that dir from the 32-bit host, I see an empty directory, although the stat of subdir1 shows that it contains 35K files.

Sometimes, when I try to cd to the subdir1 directory, it hangs at random. A full umount and a restart of the MDSs fixes it.

When the problem occurs, other Ceph mounts (of other subdirs) keep working without any problem.

I have a 3-node installation with 15 OSDs (5 per node), each OSD on a separate drive, with journals in RAMFS and XFS as the OSD backing store. There are also 3 mons and 3 MDSs on 3 separate nodes. Every mon writes to an SSD. The hosts are connected by a mix of 1G and 10G links.

Actions #1

Updated by Ivan Kudryavtsev about 11 years ago

At #3936 I'm providing some benchmarks to show that IOPS/throughput is OK for my installation and that the problem is not performance related. My installation also has 1 active MDS and 2 standby.

Actions #2

Updated by Ian Colle about 11 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Ivan Kudryavtsev about 11 years ago

I made a mistake in the initial post: the number of files in the directory is 3.5K, not 35K. It's my netflow data for the last few years, so just arbitrary tar.gz files.

Actions #4

Updated by Zheng Yan about 11 years ago

Please set 'debug mds = 10' and upload the MDS log. To minimize the log size, please truncate the MDS log before executing any operation that causes the hang.

Actions #5

Updated by Ivan Kudryavtsev about 11 years ago

I will be able to reproduce this after Feb 8. I will do so if nobody else reproduces it before then.

Actions #6

Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
Actions #7

Updated by Ivan Kudryavtsev about 11 years ago

Tried again after updating to 0.56.2. Found no problems, but the environment has changed, since I removed the 32-bit kernel client altogether.

Actions #8

Updated by Sage Weil about 11 years ago

  • Subject changed from kclient: Big directory access bugs (multiple) to kclient: Big directory access bugs (multiple), mixed 32- and 64-bit clients
  • Status changed from Need More Info to 12
Actions #9

Updated by Greg Farnum about 11 years ago

The only way I can think of that a 32-bit client would be different is in the inode assignment; could it be running into a conflict and behaving badly?
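
To illustrate the kind of conflict being speculated about here, the following Python sketch folds 64-bit inode numbers into 32 bits. The XOR fold and the sample inode numbers are assumptions made up for this example; nothing in this thread confirms that it matches what the 32-bit client actually does.

```python
# Illustrative assumption: a 32-bit client squeezes a 64-bit inode number
# into 32 bits by XOR-ing the upper half into the lower half.
def fold_ino_to_32(ino64):
    """Fold a 64-bit inode number into 32 bits (hypothetical scheme)."""
    ino32 = (ino64 & 0xFFFFFFFF) ^ (ino64 >> 32)
    # Avoid returning 0, which is not usable as an inode number.
    return ino32 if ino32 != 0 else 2


# Two distinct, made-up 64-bit inode numbers.
ino_a = 0x0000010000000005
ino_b = 0x0000000000000105

print(hex(fold_ino_to_32(ino_a)), hex(fold_ino_to_32(ino_b)))
```

Both values fold to the same 32-bit number under this scheme, so any code keyed on the folded value alone could not tell the two inodes apart.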

Actions #10

Updated by Sage Weil about 11 years ago

I was thinking the offset for readdir might be 32 bits, but I may be wrong there.
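
A rough Python sketch of why a 32-bit offset could matter. It assumes, purely for illustration, that a directory position packs a fragment identifier into the upper 32 bits and an entry index into the lower 32 bits; the names `make_fpos` and `truncate_to_32` are hypothetical, and the layout is not confirmed anywhere in this thread.

```python
# Illustrative assumption: directory position = (fragment << 32) | entry index.
def make_fpos(frag, off):
    """Pack a fragment id and an entry index into one 64-bit position."""
    return (frag << 32) | off


def truncate_to_32(value):
    """Keep only the low 32 bits, as a 32-bit offset type would."""
    return value & 0xFFFFFFFF


pos = make_fpos(frag=1, off=2)
truncated = truncate_to_32(pos)

print(hex(pos), hex(truncated))
```

Under that assumption the truncated position is indistinguishable from entry 2 of fragment 0, so a listing resumed from it would continue in the wrong place.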

Actions #11

Updated by Greg Farnum about 11 years ago

  • Priority changed from High to Normal
Actions #12

Updated by Ian Colle about 11 years ago

  • Target version deleted (v0.56)
Actions #13

Updated by Greg Farnum about 10 years ago

The hangs sound like generic cap and request waitlisting issues to me. The empty directory is tickling something in my brain about possible listing issues that Zheng fixed several months ago; is that a possibility or am I making it up?

Actions #14

Updated by Zheng Yan about 9 years ago

  • Status changed from 12 to Can't reproduce
Actions #15

Updated by Greg Farnum almost 8 years ago

  • Component(FS) kceph added