Bug #3267: Multiple active MDSes stall when listing freshly created files - CephFS - Ceph

Added by Stan Schwertly over 11 years ago. Updated about 5 years ago.

Status: Closed
Priority: Low
Source: Community (user)
Component(FS): Common/Protocol, MDS
Labels (FS): multimds

Description

The output from ceph-debugpack is available at http://cumulonim.biz/mds.tar.gz. We were running the following version (and also tested with HEAD at commit ad97bbb0a1e985b91ab0ffe9ae5b15cfce465211):

ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)

We created a Ceph cluster on a single box with 3 monitors, 9 MDSes (6 active, 3 standby), and 5 OSDs. Note: we have seen the same issue with the daemons spread across several machines, but kept everything on one box for this report, for simplicity. After creating the cluster, we mounted CephFS from another server and ran the following command:

for D in {0..99}; do mkdir -p /mnt/ceph/$D; for F in {0..99}; do echo "Hello $F in $D" > /mnt/ceph/$D/$F; done; done
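For readability, the one-line loop above can be restated as a small function. This is a sketch, not the reporters' script: the helper name `create_tree` and its mount-point argument are hypothetical, and `seq` stands in for bash brace expansion so the function also runs under a POSIX `sh`. It creates the same 100 directories of 100 files each (10,000 files in total).

```sh
#!/bin/sh
# create_tree: hypothetical helper restating the reproduction loop from the report.
# $1 is the mount point (the report used /mnt/ceph).
create_tree() {
  local mnt="$1" d f
  # 100 directories named 0..99 ...
  for d in $(seq 0 99); do
    mkdir -p "$mnt/$d" || return 1
    # ... each holding 100 small files named 0..99.
    for f in $(seq 0 99); do
      echo "Hello $f in $d" > "$mnt/$d/$f" || return 1
    done
  done
}
```

Calling `create_tree /mnt/ceph` runs the workload against the mount point used in the report.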

In the run captured by the logs provided, this command completed, but the MDSes "stalled" immediately afterward when we ran `ls /mnt/ceph/*`: two directories were listed before the command hung. Note: in other runs of this test, the command that creates the files and directories stalled as well. We ran these tests repeatedly with similar outcomes.
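When rerunning the test, it can help to tell a stall apart from a merely slow listing and to see which directory hangs. The sketch below is an assumption on our part, not something used in the report: the hypothetical helper `list_with_timeout` lists each top-level directory separately under GNU coreutils `timeout`, which exits with status 124 when the time limit expires. A process blocked inside the filesystem client may not respond to the signal `timeout` sends, so treat this as a best-effort check.

```sh
#!/bin/sh
# list_with_timeout: hypothetical helper, not from the report.
# $1 = mount point, $2 = per-directory time limit in seconds (default 30).
list_with_timeout() {
  local mnt="$1" secs="${2:-30}" dir status
  for dir in "$mnt"/*/; do
    # Skip the unexpanded glob when the mount point has no subdirectories.
    [ -d "$dir" ] || continue
    timeout "$secs" ls "$dir" > /dev/null
    status=$?
    if [ "$status" -eq 124 ]; then
      # GNU timeout returns 124 when the command ran past the limit.
      echo "stalled listing $dir after ${secs}s" >&2
      return 124
    elif [ "$status" -ne 0 ]; then
      echo "ls failed on $dir (status $status)" >&2
      return "$status"
    fi
  done
  echo "listed all directories under $mnt"
}
```

Unlike the single `ls /mnt/ceph/*` in the report, listing one directory at a time shows which directory the hang occurs on, e.g. `list_with_timeout /mnt/ceph 60`.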

Updated by Greg Farnum over 11 years ago

I'll try and take a look at this, but multi-MDS setups are known to be pretty unstable at this point. Have you tried just using one active with some standbys?

Updated by Stan Schwertly over 11 years ago

Greg Farnum wrote:

I'll try and take a look at this, but multi-MDS setups are known to be pretty unstable at this point. Have you tried just using one active with some standbys?

We are aware that multi-MDS setups are unstable, hence the report :)! For good measure, we reissued the command against a cluster with a single active MDS and two standbys, and it worked like a charm.

Let us know if there's any more detail we can provide.
