
Bug #23814

mds: newly active mds may abort in handle_file_lock

Added by Patrick Donnelly over 2 years ago. Updated over 2 years ago.

Status: Rejected
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Development
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature:

Description

2018-04-21 01:07:27.042 7fbcd68c2700 10 MDSInternalContextBase::complete: 15C_OFT_Committed
2018-04-21 01:07:27.042 7fbcd68c2700 10 mds.1.log _trim_expired_segments waiting for 1/784843672 to expire
2018-04-21 01:07:27.042 7fbcdf8d4700  1 -- 127.0.0.1:6853/3229053236 <== osd.1 127.0.0.1:6843/6701 16 ==== osd_op_reply(22 201.000000bb [zero 508824~3685480] v19'8804 uv8795 ondisk = 0) v8 ==== 156+0+0 (2537174225 0 0) 0x8b6ea26d80 con 0x8b6e80a300
2018-04-21 01:07:27.042 7fbcdf8d4700  1 -- 127.0.0.1:6853/3229053236 <== osd.1 127.0.0.1:6843/6701 17 ==== osd_op_reply(28 201.00000000 [writefull 0~90 [fadvise_dontneed]] v19'6841 uv6841 ondisk = 0) v8 ==== 156+0+0 (2094454382 0 0) 0x8b6ea26d80 con 0x8b6e80a300
2018-04-21 01:07:27.042 7fbcdf8d4700  1 -- 127.0.0.1:6853/3229053236 <== osd.1 127.0.0.1:6843/6701 18 ==== osd_op_reply(26 201.000000bf [delete] v19'5922 uv5921 ondisk = -2 ((2) No such file or directory)) v8 ==== 156+0+0 (793177486 0 0) 0x8b6ea26d80 con 0x8b6e80a300
2018-04-21 01:07:27.042 7fbcdf8d4700  1 -- 127.0.0.1:6853/3229053236 <== osd.1 127.0.0.1:6843/6701 19 ==== osd_op_reply(23 201.000000bc [delete] v19'8805 uv8803 ondisk = -2 ((2) No such file or directory)) v8 ==== 156+0+0 (3668938919 0 0) 0x8b6ea26d80 con 0x8b6e80a300
2018-04-21 01:07:27.042 7fbcd68c2700  1 -- 127.0.0.1:6853/3229053236 --> 127.0.0.1:6843/6701 -- osd_op(unknown.0.80:30 2.4 2:2aa2ac10:::201.000000bb:head [write 508824~540 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e19) v8 -- 0x8b6ea25d40 con 0
2018-04-21 01:07:27.078 7fbcdf8d4700  1 -- 127.0.0.1:6853/3229053236 <== osd.1 127.0.0.1:6843/6701 20 ==== osd_op_reply(30 201.000000bb [write 508824~540 [fadvise_dontneed]] v19'8806 uv8806 ondisk = 0) v8 ==== 156+0+0 (3469275214 0 0) 0x8b6ea25d40 con 0x8b6e80a300
2018-04-21 01:07:27.078 7fbcd68c2700 10 MDSIOContextBase::complete: 13C_MDL_Flushed
2018-04-21 01:07:27.366 7fbcd98c8700 10 mds.beacon.c _send up:active seq 7
2018-04-21 01:07:27.366 7fbcd98c8700 10 monclient: _send_mon_message to mon.a at 127.0.0.1:40984/0
2018-04-21 01:07:27.366 7fbcd98c8700  1 -- 127.0.0.1:6853/3229053236 --> 127.0.0.1:40984/0 -- mdsbeacon(6579/c up:active seq 7 v81) v7 -- 0x8b6e987f80 con 0
2018-04-21 01:07:27.366 7fbcdc8ce700  1 -- 127.0.0.1:6853/3229053236 <== mon.0 127.0.0.1:40984/0 23 ==== mdsbeacon(6579/c up:active seq 7 v81) v7 ==== 126+0+0 (2281587176 0 0) 0x8b6e987f80 con 0x8b6e808e00
2018-04-21 01:07:27.366 7fbcdc8ce700 10 mds.beacon.c handle_mds_beacon up:active seq 7 rtt 0.000802
2018-04-21 01:07:28.490 7fbcdc8ce700  1 -- 127.0.0.1:6853/3229053236 <== mds.0 127.0.0.1:6851/1476514047 5 ==== lock(a=lock inest 0x1.head) v1 ==== 69+0+0 (1800409819 0 0) 0x8b6e7c1e40 con 0x8b6ea60e00
2018-04-21 01:07:28.490 7fbcdc8ce700  7 mds.1.locker handle_file_lock a=lock on (inest lock) from mds.0 [inode 0x1 [...2,head] / rep@0.1 v747 snaprealm=0x8b6e933b80 f(v0 m2018-04-21 00:32:43.864847 1=0+1) n(v101 rc2018-04-21 01:06:06.782998 b296479986 41205=35424+5781)/n(v0 1=0+1) (inest lock) (iversion lock) 0x8b6e80b100]
2018-04-21 01:07:28.490 7fbcdc8ce700 -1 *** Caught signal (Aborted) **
 in thread 7fbcdc8ce700 thread_name:ms_dispatch

 ceph version 13.0.2-1600-g5aa5c66 (5aa5c6624c168ec02c305c6f687a3dd42177704c) mimic (dev)
 1: (()+0x3dc7b0) [0x8b6d66a7b0]
 2: (()+0x11390) [0x7fbce39d5390]
 3: (gsignal()+0x38) [0x7fbce2b83428]
 4: (abort()+0x16a) [0x7fbce2b8502a]
 5: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xb1c) [0x8b6d505ecc]
 6: (MDSRank::handle_deferrable_message(Message*)+0x545) [0x8b6d39c5b5]
 7: (MDSRank::_dispatch(Message*, bool)+0x62b) [0x8b6d3a81fb]
 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x8b6d3a88f5]
 9: (MDSDaemon::ms_dispatch(Message*)+0xd3) [0x8b6d3946b3]
 10: (DispatchQueue::entry()+0xb5a) [0x7fbce4163aba]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fbce42035cd]
 12: (()+0x76ba) [0x7fbce39cb6ba]
 13: (clone()+0x6d) [0x7fbce2c5541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

epoch 81:

2018-04-21 01:07:23.366 7f605cfcf700  7 mon.a@0(leader).mds e80 print_map
e81
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs_a' (1)
fs_name cephfs_a
epoch   81
flags   0
created 2018-04-21 00:22:38.214445
modified        2018-04-21 01:07:23.291338
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    0
last_failure_osd_epoch  19
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in      0,1
up      {0=5970,1=6579}
failed
damaged
stopped 2
data_pools      [1]
metadata_pool   2
inline_data     disabled
balancer
standby_count_wanted    1
5970:   127.0.0.1:6851/1476514047 'a' mds.0.45 up:active seq 10 export_targets=1,2
6579:   127.0.0.1:6853/3229053236 'c' mds.1.80 up:active seq 6

mds.0:

2018-04-21 01:07:28.486 7ffb0e1eb700 10 mds.0.locker scatter_tick
2018-04-21 01:07:28.486 7ffb0e1eb700 10 mds.0.locker scatter_nudge auth, scatter/unscattering (inest mix dirty) on [inode 0x1 [...2,head] / auth{1=1} v750 ap=1+0 snaprealm=0x5aaa26a080 f(v0 m2018-04-21 00:32:43.864847 1=0+1) n(v101 rc2018-04-21 01:06:06.782998 b296479986 41205=35424+5781)/n(v0 1=0+1) (isnap sync r=1) (inest mix dirty) (iversion lock) caps={5313=pAsLsXsFs/-@0} | dirtyscattered=1 lock=1 dirfrag=1 caps=1 openingsnapparents=0 replicated=1 dirty=1 authpin=1 0x5aaa3b8700]
2018-04-21 01:07:28.486 7ffb0e1eb700  7 mds.0.locker simple_lock on (inest mix dirty) on [inode 0x1 [...2,head] / auth{1=1} v750 ap=1+0 snaprealm=0x5aaa26a080 f(v0 m2018-04-21 00:32:43.864847 1=0+1) n(v101 rc2018-04-21 01:06:06.782998 b296479986 41205=35424+5781)/n(v0 1=0+1) (isnap sync r=1) (inest mix dirty) (iversion lock) caps={5313=pAsLsXsFs/-@0} | dirtyscattered=1 lock=1 dirfrag=1 caps=1 openingsnapparents=0 replicated=1 dirty=1 authpin=1 0x5aaa3b8700]
2018-04-21 01:07:28.486 7ffb0e1eb700  1 -- 127.0.0.1:6851/1476514047 _send_message--> mds.1 127.0.0.1:6853/3229053236 -- lock(a=lock inest 0x1.head) v1 -- ?+0 0x5ab0577080
2018-04-21 01:07:28.486 7ffb0e1eb700  1 -- 127.0.0.1:6851/1476514047 --> 127.0.0.1:6853/3229053236 -- lock(a=lock inest 0x1.head) v1 -- 0x5ab0577080 con 0

Related issues

Related to CephFS - Bug #23812: mds: may send LOCK_SYNC_MIX message to starting MDS Resolved 04/21/2018

History

#1 Updated by Zheng Yan over 2 years ago

  • Related to Bug #23812: mds: may send LOCK_SYNC_MIX message to starting MDS added

#2 Updated by Zheng Yan over 2 years ago

I think this is related to #23812. The patch for #23812 makes the mds skip sending lock messages to a 'starting' mds. The skipped lock message caused the lock state to go out of sync.

#3 Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to Rejected
