Bug #19755

MDS became unresponsive when truncating a very large file

Added by Sangdi Xu 6 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
04/25/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Release:
jewel
Component(FS):
MDS
Needs Doc:
No

Description

We were copying a very large file (exactly 7 TB) between two directories through Samba/CephFS and cancelled the copy after about two minutes, before it finished. The MDS then became unresponsive, logging the error: heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy

The relevant portion of the logs follows:

2017-04-21 10:38:07.445221 7f48ddb5f700 10 mds.0.locker _do_cap_update dirty Fw issued pAsxLsXsxFsxcrwb wanted pAsxXsxFsxcrwb on [inode 100000003f1 [2,head] /data/CIFS/aa04.txt auth v60 dirtyparent s=7696581394432 n(v0 b7696581394432 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={19464=0-17592186044416@1} caps={19464=pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@4},l=19464 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=0 0x56275fc668a0]
2017-04-21 10:38:07.445233 7f48ddb5f700 20 mds.0.locker inode is file
2017-04-21 10:38:07.445235 7f48ddb5f700 20 mds.0.locker client has write caps; m->get_max_size=7696581394432; old_max=17592186044416

2017-04-21 10:38:21.639874 7f48d9255700 15 mds.0.cache.dir(10000000004) adjust_nested_auth_pins 1/1 on [dir 10000000004 /data/CIFS/ [2,head] auth v=69 cv=0/0 ap=1+3+3 state=1610612738|complete f(v0 m2017-04-21 10:37:16.324655 1=1+0) n(v2 rc2017-04-21 10:37:16.324655 1=1+0) hs=1+3,ss=0+0 dirty=2 | child=1 dirty=1 authpin=1 0x5627610c1500] by 0x56275fc668a0 count now 1 + 3
2017-04-21 10:38:21.639890 7f48d9255700 10 mds.0.cache realm snaprealm(1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56275fb24ac0)
2017-04-21 10:38:21.639893 7f48d9255700 10 mds.0.cache _truncate_inode snapc 0x56275fb24c48 on [inode 100000003f1 [2,head] /data/CIFS/aa04.txt auth v68 ap=3+0 dirtyparent truncating(7696581394432 to 0) s=0 n(v0 1=1+0) (iauth excl) (ifile xlockdone x=1) (ixattr excl) (iversion lock w=1 last_client=19464) cr={19464=0-17592186044416@1} caps={19464=pAsxLsXsxFscb/pAsxXsxFsxcrwb@7},l=19464 | ptrwaiter=0 request=1 lock=2 caps=1 truncating=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x56275fc668a0]
2017-04-21 10:38:21.695521 7f48daa58700 10 mds.beacon.mds1 _send up:active seq 56
2017-04-21 10:38:25.695604 7f48daa58700 10 mds.beacon.mds1 _send up:active seq 57
2017-04-21 10:38:29.695697 7f48daa58700 10 mds.beacon.mds1 _send up:active seq 58
2017-04-21 10:38:33.695758 7f48daa58700 10 mds.beacon.mds1 _send up:active seq 59
2017-04-21 10:38:37.695822 7f48daa58700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:37.695847 7f48daa58700 1 mds.beacon.mds1 _send skipping beacon, heartbeat map not healthy
2017-04-21 10:38:38.826118 7f48dfb63700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:41.695882 7f48daa58700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:41.695900 7f48daa58700 1 mds.beacon.mds1 _send skipping beacon, heartbeat map not healthy
2017-04-21 10:38:43.826195 7f48dfb63700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:45.695943 7f48daa58700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:45.695960 7f48daa58700 1 mds.beacon.mds1 _send skipping beacon, heartbeat map not healthy
2017-04-21 10:38:48.826273 7f48dfb63700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy
2017-04-21 10:38:49.696002 7f48daa58700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 stage:HeartbeatMap::is_healthy

With debug_ms enabled, we saw a large number of trimtrunc messages being sent to the OSDs. It looks like the MDS was busy spinning in this loop:

Filer.h::truncate:

// one synchronous _modify() call per object extent of the file
for (vector<ObjectExtent>::iterator p = extents.begin();
     p != extents.end();
     ++p) {
  // issue a TRIMTRUNC op against this object
  vector<OSDOp> ops(1);
  ops[0].op.op = CEPH_OSD_OP_TRIMTRUNC;
  ops[0].op.extent.truncate_size = p->offset;
  ops[0].op.extent.truncate_seq = truncate_seq;
  objecter->_modify(p->oid, p->oloc, ops, mtime, snapc, flags,
                    onack ? gack.new_sub() : 0,
                    oncommit ? gcom.new_sub() : 0);
}


Related issues

Copied to fs - Backport #20025: jewel: MDS became unresponsive when truncating a very large file Resolved
Copied to fs - Backport #20026: kraken: cephfs: MDS became unresponsive when truncating a very large file Resolved

History

#1 Updated by Zheng Yan 6 months ago

  • Status changed from New to Need Review

#2 Updated by John Spray 5 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel, kraken

#3 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #20025: jewel: MDS became unresponsive when truncating a very large file added

#4 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #20026: kraken: cephfs: MDS became unresponsive when truncating a very large file added

#5 Updated by Nathan Cutler 2 months ago

  • Status changed from Pending Backport to Resolved
