Bug #64856

open

mds crashes when extracting from a tar is cancelled

Added by Rishabh Dave about 2 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a fresh vstart cluster, the following commands were run:

    ./bin/ceph config set mon mon_allow_pool_delete true
    ./bin/ceph config set mds mds_cache_memory_limit 1G
    ./bin/ceph config set mds mds_health_cache_threshold 1.000001
    ./bin/ceph fs set a allow_standby_replay true
    sleep 2
    ./bin/ceph status | grep "hot standby" 
    ./bin/ceph-fuse cephfs1
    tar -xv -f linux-6.7.9.tar.xz

The extraction takes some time. When it is cancelled with ctrl-c 2-5 seconds in, the tar command hangs and the prompt never returns. Hitting ctrl-c multiple times has no effect; the tar command remains stuck and the prompt doesn't return. Hitting enter a few times doesn't help either. As per ps output, the tar process is still running. Killing it with kill <tar-pid>, sudo kill <tar-pid> or sudo kill -9 <tar-pid> has no effect.
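A process that survives kill -9 is often blocked inside the kernel (uninterruptible sleep, state D), though that has not been confirmed for this bug. As a minimal sketch, the helper below reads the process state from /proc on Linux. The name proc_state is hypothetical and is not part of Ceph.

```shell
# Print the kernel's one-letter state for a PID (Linux only, reads /proc).
# R = running, S = interruptible sleep, D = uninterruptible sleep, Z = zombie.
proc_state() {
    pid=$1
    while read -r key value rest; do
        if [ "$key" = "State:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$pid/status"
    # No State: line found, or the PID does not exist.
    return 1
}
```

For example, proc_state <tar-pid> could be run from a second terminal while tar is stuck. A result of D would suggest that tar is waiting on the ceph-fuse mount rather than ignoring the signal.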

After unmounting CephFS using sudo umount -lf <cephfs-mntp>, the prompt returns, ps no longer reports tar, and the MDS crashes instantly. In some cases tar doesn't hang, but the MDS still crashes every time.

The issue was also reproduced successfully with the MDS cache size set to 50M, 100M, 500M, 1G, 2G and 4G.

Reproducibility: 10/10 times, slightly less often with 2G and less often still with 4G. I've reproduced this around 70 times with different cache sizes, on the main branch as well as on the feature branch on which this issue was discovered.
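The cache-size sweep could be scripted roughly as follows. This is only a sketch: sweep_cache_limits is a hypothetical helper, and the second argument is assumed to be a caller-supplied function or script that performs one mount, extract, cancel and unmount attempt. Only the config set command is taken from the steps in this report.

```shell
# Sketch only: run a caller-supplied reproduction step once per cache size.
# $1 = path to the ceph CLI (./bin/ceph on a vstart cluster)
# $2 = hypothetical function or script that performs one attempt;
#      it receives the cache size as its only argument
sweep_cache_limits() {
    ceph_cli=$1
    attempt=$2
    for size in 50M 100M 500M 1G 2G 4G; do
        # Same setting as in the reproduction steps, with a varying value.
        "$ceph_cli" config set mds mds_cache_memory_limit "$size" || return 1
        "$attempt" "$size"
    done
}
```

Passing echo as the first argument gives a dry run that prints the config commands instead of executing them.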

Actions #1

Updated by Milind Changire about 2 months ago

  • Assignee set to Dhairya Parmar
Actions #2

Updated by Greg Farnum about 1 month ago

  • Assignee changed from Dhairya Parmar to Rishabh Dave

Rishabh, this sounds a lot like you're just putting too much of a metadata workload in for the MDS to handle with constrained memory. Have you debugged what is going on at all beyond the apparent hang? Is the MDS swapping or spending all its time doing RADOS IO?
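One rough way to check the swapping question on Linux is to read the MDS process's resident and swapped memory from /proc. The helper below is a sketch under that assumption. The name mem_usage is hypothetical, and it says nothing about time spent in RADOS IO.

```shell
# Print resident (VmRSS) and swapped-out (VmSwap) memory for a PID.
# Linux only; values are reported by the kernel in kB.
mem_usage() {
    pid=$1
    while read -r key value unit; do
        case "$key" in
            VmRSS:|VmSwap:) echo "$key $value $unit" ;;
        esac
    done < "/proc/$pid/status"
}
```

With standby-replay enabled there is more than one ceph-mds process, so something like for pid in $(pgrep -x ceph-mds); do mem_usage "$pid"; done covers all of them. A VmSwap value that keeps growing during the extraction would point towards swapping.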
