Project

General

Profile

Bug #41799

client: FAILED assert(cap == in->auth_cap)

Added by Xiaoxi Chen 7 months ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
crash, multimds, offline
Pull request ID:
Crash signature:

Description

below log explains the issue clearly, the auth_caps was set to NULL in previous remove_caps, and when add_update_cap get called from handle_import_cap, the seq == cap.seq holds.

2019-09-12 07:22:12.142637 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.1 10.166.123.131:6800/3395465192 10 ==== client_caps(export ino 0x200000511ba 2782052306 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000) v11 ==== 252+0+0 (557563979 0 0) 0x564b9c90f600 con 0x564b9cc13000
2019-09-12 07:22:12.142721 7f200bee1700 10 client.138394001  mds.1 seq now 1
2019-09-12 07:22:12.142734 7f200bee1700  5 client.138394001 handle_cap_export ino 0x200000511ba mseq 0 EXPORT from mds.1
2019-09-12 07:22:12.142742 7f200bee1700 10 client.138394001 _open_mds_session mds.0
2019-09-12 07:22:12.143001 7f200bee1700  1 -- 10.190.184.174:0/3460862904 --> 10.212.160.177:6800/169156538 -- client_session(request_open) v2 -- 0x564b9c95aac0 con 0
2019-09-12 07:22:12.143087 7f200bee1700 10 client.138394001 add_update_cap issued - -> pAsLsXsFs from mds.0 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.143131 7f200bee1700 10 client.138394001 remove_cap mds.1 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.143159 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.3 10.212.160.176:6800/3474105568 9 ==== client_caps(export ino 0x200000511ba 3772052911 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000) v11 ==== 252+0+0 (3141920983 0 0) 0x564b9c90f180 con 0x564b9cc10000
2019-09-12 07:22:12.143170 7f200bee1700 10 client.138394001  mds.3 seq now 3
2019-09-12 07:22:12.143172 7f200bee1700  5 client.138394001 handle_cap_export ino 0x200000511ba mseq 0 EXPORT from mds.3
2019-09-12 07:22:12.143174 7f200bee1700 10 client.138394001 remove_cap mds.3 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.173783 7f200bee1700 10 client.138394001 ms_handle_connect on 10.212.160.177:6800/169156538
2019-09-12 07:22:12.181461 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.0 10.212.160.177:6800/169156538 1 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x564b9c958fc0 con 0x564b9cd92000
2019-09-12 07:22:12.181485 7f200bee1700 10 client.138394001 handle_client_session client_session(open) v1 from mds.0
2019-09-12 07:22:12.181489 7f200bee1700 10 client.138394001 renew_caps mds.0
2019-09-12 07:22:12.181491 7f200bee1700  1 -- 10.190.184.174:0/3460862904 --> 10.212.160.177:6800/169156538 -- client_session(request_renewcaps seq 1) v2 -- 0x564b9c95b840 con 0
2019-09-12 07:22:12.181544 7f200bee1700 10 client.138394001 connect_mds_targets for mds.0
2019-09-12 07:22:12.181553 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.0 10.212.160.177:6800/169156538 2 ==== client_caps(import ino 0x200000511ba 5277779549 seq 1 caps=pAsLsXsFs dirty=- wanted=AsLsXsFs follows 0 size 0/0 ts 1/18446744073709551615 mtime 2019-09-12 07:22:04.840341 xattrs(v=1 l=4)) v11 ==== 321+4+0 (2210101217 0 0) 0x564b9c90ed00 con 0x564b9cd92000
2019-09-12 07:22:12.181571 7f200bee1700 10 client.138394001  mds.0 seq now 1
2019-09-12 07:22:12.181575 7f200bee1700  5 client.138394001 handle_cap_import ino 0x200000511ba mseq 0 IMPORT from mds.0
2019-09-12 07:22:12.181577 7f200bee1700 10 client.138394001 update_snap_trace len 48
2019-09-12 07:22:12.181580 7f200bee1700 20 client.138394001 get_snap_realm 0x1 0x564b9c934f70 14 -> 15
2019-09-12 07:22:12.181582 7f200bee1700 10 client.138394001 update_snap_trace snaprealm(0x1 nref=15 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[]) seq 1 <= 1 and same parent, SKIPPING
2019-09-12 07:22:12.182866 7f200bee1700 -1 /build/ceph-12.2.12/src/client/Client.cc: In function 'void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)' thread 7f200bee1700 time 2019-09-12 07:22:12.181592
/build/ceph-12.2.12/src/client/Client.cc: 3963: FAILED assert(cap == in->auth_cap)
(gdb) bt
#0  0x00007fbe58509269 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000056372c54477e in reraise_fatal (signum=6) at /build/ceph-12.2.12/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /build/ceph-12.2.12/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007fbe57295428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbe5729702a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x000056372c15099e in ceph::__ceph_assert_fail (assertion=assertion@entry=0x56372c5d0e08 "cap == in->auth_cap", file=file@entry=0x56372c5cc6f8 "/build/ceph-12.2.12/src/client/Client.cc", line=line@entry=3963,
    func=func@entry=0x56372c5d4860 <Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)::__PRETTY_FUNCTION__> "void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)") at /build/ceph-12.2.12/src/common/assert.cc:66
#7  0x000056372c0b570d in Client::add_update_cap (this=this@entry=0x563735d12000, in=in@entry=0x563735d54000, mds_session=mds_session@entry=0x563735b10d00, cap_id=5277779602, issued=341, wanted=340, seq=1, mseq=0, realm=..., flags=1,
    cap_perms=...) at /build/ceph-12.2.12/src/client/Client.cc:3963
#8  0x000056372c0b73c6 in Client::handle_cap_import (this=this@entry=0x563735d12000, session=session@entry=0x563735b10d00, in=in@entry=0x563735d54000, m=m@entry=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:4866
#9  0x000056372c0df721 in Client::handle_caps (this=this@entry=0x563735d12000, m=m@entry=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:4821
#10 0x000056372c0e33cb in Client::ms_dispatch (this=0x563735d12000, m=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:2571
#11 0x000056372c4d81ea in Messenger::ms_deliver_dispatch (m=0x563735a40d00, this=0x563735c6a000) at /build/ceph-12.2.12/src/msg/Messenger.h:668
#12 DispatchQueue::entry (this=0x563735c6a180) at /build/ceph-12.2.12/src/msg/DispatchQueue.cc:197
#13 0x000056372c1f37dd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-12.2.12/src/msg/DispatchQueue.h:101
#14 0x00007fbe584ff6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fbe5736741d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 7
#7  0x000056372c0b570d in Client::add_update_cap (this=this@entry=0x563735d12000, in=in@entry=0x563735d54000, mds_session=mds_session@entry=0x563735b10d00, cap_id=5277779602, issued=341, wanted=340, seq=1, mseq=0, realm=..., flags=1,
    cap_perms=...) at /build/ceph-12.2.12/src/client/Client.cc:3963
3963          assert(cap == in->auth_cap);
(gdb) p in->auth_cap
$1 = (Cap *) 0x0
(gdb) p seq
$2 = 1
(gdb) p cap.seq
$3 = 1

Related issues

Copied to fs - Backport #42631: nautilus: client: FAILED assert(cap == in->auth_cap) Resolved
Copied to fs - Backport #42632: mimic: client: FAILED assert(cap == in->auth_cap) New

History

#1 Updated by Xiaoxi Chen 7 months ago

The issue affect all releases including master

#2 Updated by Patrick Donnelly 7 months ago

  • Subject changed from FAILED assert(cap == in->auth_cap) to client: FAILED assert(cap == in->auth_cap)
  • Target version set to v15.0.0
  • Start date deleted (09/12/2019)
  • Source set to Community (dev)
  • Component(FS) Client added
  • Component(FS) deleted (ceph-fuse)

#3 Updated by Zheng Yan 7 months ago

  • Assignee set to Zheng Yan

#4 Updated by Patrick Donnelly 6 months ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Zheng Yan to Xiaoxi Chen
  • Pull request ID set to 30402
  • Labels (FS) offline added

#5 Updated by Zheng Yan 6 months ago

I'd like to know why the cap import message's seq is 1, mseq is 0. please use gdb to print cap import message's peer information. (f 8, p m->peer)

#6 Updated by Xiaoxi Chen 6 months ago

(gdb) p m->peer
$1 = {cap_id = {v = 2782052343}, seq = {v = 4}, mseq = {v = 0}, mds = {v = 1}, flags = 2 '\002'}

#7 Updated by Xiaoxi Chen 6 months ago

@Zheng,

any update or anything we can help?

seems like we can change from

if (ceph_seq_cmp(seq, cap.seq) <= 0) {

to
if (ceph_seq_cmp(seq, cap.seq) < 0) {

as ceph_seq_cmp(seq, cap.seq) ==0 indicating exactly the import_cap message.

#8 Updated by Xiaoxi Chen 6 months ago

One more victim on 13.2.5


    -4> 2019-09-28 09:01:50.156 7fe6799e3700  1 -- 10.206.140.28:0/1309002991 <== mds.0 10.218.2.224:6800/2026541741 2 ==== client_caps(import ino 0x1000598f6bb 1913515435 seq 1 caps=pAsLsXsFs dirty=- wanted=- follows 0 mseq 1 size 0/0 ts 1/18446744073709551615 mtime 2019-06-05 17:21:06.348736 xattrs(v=1 l=4)) v11 ==== 321+4+0 (3926379498 0 0) 0x564a110b9680 con 0x564a111d7e00
    -3> 2019-09-28 09:01:50.156 7fe6799e3700  5 client.197381312 handle_cap_import ino 0x1000598f6bb mseq 1 IMPORT from mds.0
    -2> 2019-09-28 09:01:50.156 7fe67c9e9700  5 -- 10.206.140.28:0/1309002991 >> 10.156.69.192:6789/0 conn(0x564a11373000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=16068008 cs=1 l=1). rx mon.3 seq 143 0x564a113da500 osd_map(4050054..4050054 src has 4050054..4072219) v3
    -1> 2019-09-28 09:01:50.156 7fe67c9e9700  5 -- 10.206.140.28:0/1309002991 >> 10.156.69.192:6789/0 conn(0x564a11373000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=16068008 cs=1 l=1). rx mon.3 seq 144 0x564a115a5b80 osd_map(4050055..4050059 src has 4050054..4072219) v3
     0> 2019-09-28 09:01:50.156 7fe6799e3700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: In function 'void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)' thread 7fe6799e3700 time 2019-09-28 09:01:50.158860
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: 3936: FAILED assert(&cap == in->auth_cap)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fe685611fbf]
 2: (()+0x26d187) [0x7fe685612187]
 3: (Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)+0x945) [0x564a101ccfd5]
 4: (Client::handle_cap_import(MetaSession*, Inode*, MClientCaps*)+0x26a) [0x564a101ce89a]
 5: (Client::handle_caps(MClientCaps*)+0x5d1) [0x564a102037f1]
 6: (Client::ms_dispatch(Message*)+0x41b) [0x564a1020c0cb]
 7: (DispatchQueue::entry()+0xb7a) [0x7fe6856ce39a]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe68576c2cd]
 9: (()+0x7dd5) [0x7fe683648dd5]
 10: (clone()+0x6d) [0x7fe682521ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#9 Updated by Zheng Yan 6 months ago

Sorry for the delay

I understand what happened. (two mds exported non-auth caps )

PR#30402 is a good workaround.

#10 Updated by Patrick Donnelly 5 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus,mimic

#11 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #42631: nautilus: client: FAILED assert(cap == in->auth_cap) added

#12 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #42632: mimic: client: FAILED assert(cap == in->auth_cap) added

Also available in: Atom PDF