Project

General

Profile

Bug #41799

client: FAILED assert(cap == in->auth_cap)

Added by Xiaoxi Chen 10 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
crash, multimds, offline
Pull request ID:
Crash signature:

Description

below log explains the issue clearly, the auth_caps was set to NULL in previous remove_caps, and when add_update_cap get called from handle_import_cap, the seq == cap.seq holds.

2019-09-12 07:22:12.142637 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.1 10.166.123.131:6800/3395465192 10 ==== client_caps(export ino 0x200000511ba 2782052306 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000) v11 ==== 252+0+0 (557563979 0 0) 0x564b9c90f600 con 0x564b9cc13000
2019-09-12 07:22:12.142721 7f200bee1700 10 client.138394001  mds.1 seq now 1
2019-09-12 07:22:12.142734 7f200bee1700  5 client.138394001 handle_cap_export ino 0x200000511ba mseq 0 EXPORT from mds.1
2019-09-12 07:22:12.142742 7f200bee1700 10 client.138394001 _open_mds_session mds.0
2019-09-12 07:22:12.143001 7f200bee1700  1 -- 10.190.184.174:0/3460862904 --> 10.212.160.177:6800/169156538 -- client_session(request_open) v2 -- 0x564b9c95aac0 con 0
2019-09-12 07:22:12.143087 7f200bee1700 10 client.138394001 add_update_cap issued - -> pAsLsXsFs from mds.0 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.143131 7f200bee1700 10 client.138394001 remove_cap mds.1 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.143159 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.3 10.212.160.176:6800/3474105568 9 ==== client_caps(export ino 0x200000511ba 3772052911 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000) v11 ==== 252+0+0 (3141920983 0 0) 0x564b9c90f180 con 0x564b9cc10000
2019-09-12 07:22:12.143170 7f200bee1700 10 client.138394001  mds.3 seq now 3
2019-09-12 07:22:12.143172 7f200bee1700  5 client.138394001 handle_cap_export ino 0x200000511ba mseq 0 EXPORT from mds.3
2019-09-12 07:22:12.143174 7f200bee1700 10 client.138394001 remove_cap mds.3 on 0x200000511ba.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=40755 size=0/0 nlink=1 mtime=2019-09-12 07:22:04.840341 caps=pAsLsXsFs(0=pAsLsXsFs,3=pAsLsXsFs) COMPLETE quota(max_bytes = 6597069766656 max_files = 0) 0x564b9cc2a000)
2019-09-12 07:22:12.173783 7f200bee1700 10 client.138394001 ms_handle_connect on 10.212.160.177:6800/169156538
2019-09-12 07:22:12.181461 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.0 10.212.160.177:6800/169156538 1 ==== client_session(open) v1 ==== 28+0+0 (2719320884 0 0) 0x564b9c958fc0 con 0x564b9cd92000
2019-09-12 07:22:12.181485 7f200bee1700 10 client.138394001 handle_client_session client_session(open) v1 from mds.0
2019-09-12 07:22:12.181489 7f200bee1700 10 client.138394001 renew_caps mds.0
2019-09-12 07:22:12.181491 7f200bee1700  1 -- 10.190.184.174:0/3460862904 --> 10.212.160.177:6800/169156538 -- client_session(request_renewcaps seq 1) v2 -- 0x564b9c95b840 con 0
2019-09-12 07:22:12.181544 7f200bee1700 10 client.138394001 connect_mds_targets for mds.0
2019-09-12 07:22:12.181553 7f200bee1700  1 -- 10.190.184.174:0/3460862904 <== mds.0 10.212.160.177:6800/169156538 2 ==== client_caps(import ino 0x200000511ba 5277779549 seq 1 caps=pAsLsXsFs dirty=- wanted=AsLsXsFs follows 0 size 0/0 ts 1/18446744073709551615 mtime 2019-09-12 07:22:04.840341 xattrs(v=1 l=4)) v11 ==== 321+4+0 (2210101217 0 0) 0x564b9c90ed00 con 0x564b9cd92000
2019-09-12 07:22:12.181571 7f200bee1700 10 client.138394001  mds.0 seq now 1
2019-09-12 07:22:12.181575 7f200bee1700  5 client.138394001 handle_cap_import ino 0x200000511ba mseq 0 IMPORT from mds.0
2019-09-12 07:22:12.181577 7f200bee1700 10 client.138394001 update_snap_trace len 48
2019-09-12 07:22:12.181580 7f200bee1700 20 client.138394001 get_snap_realm 0x1 0x564b9c934f70 14 -> 15
2019-09-12 07:22:12.181582 7f200bee1700 10 client.138394001 update_snap_trace snaprealm(0x1 nref=15 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[]) seq 1 <= 1 and same parent, SKIPPING
2019-09-12 07:22:12.182866 7f200bee1700 -1 /build/ceph-12.2.12/src/client/Client.cc: In function 'void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)' thread 7f200bee1700 time 2019-09-12 07:22:12.181592
/build/ceph-12.2.12/src/client/Client.cc: 3963: FAILED assert(cap == in->auth_cap)
(gdb) bt
#0  0x00007fbe58509269 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000056372c54477e in reraise_fatal (signum=6) at /build/ceph-12.2.12/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /build/ceph-12.2.12/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007fbe57295428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbe5729702a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x000056372c15099e in ceph::__ceph_assert_fail (assertion=assertion@entry=0x56372c5d0e08 "cap == in->auth_cap", file=file@entry=0x56372c5cc6f8 "/build/ceph-12.2.12/src/client/Client.cc", line=line@entry=3963,
    func=func@entry=0x56372c5d4860 <Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)::__PRETTY_FUNCTION__> "void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)") at /build/ceph-12.2.12/src/common/assert.cc:66
#7  0x000056372c0b570d in Client::add_update_cap (this=this@entry=0x563735d12000, in=in@entry=0x563735d54000, mds_session=mds_session@entry=0x563735b10d00, cap_id=5277779602, issued=341, wanted=340, seq=1, mseq=0, realm=..., flags=1,
    cap_perms=...) at /build/ceph-12.2.12/src/client/Client.cc:3963
#8  0x000056372c0b73c6 in Client::handle_cap_import (this=this@entry=0x563735d12000, session=session@entry=0x563735b10d00, in=in@entry=0x563735d54000, m=m@entry=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:4866
#9  0x000056372c0df721 in Client::handle_caps (this=this@entry=0x563735d12000, m=m@entry=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:4821
#10 0x000056372c0e33cb in Client::ms_dispatch (this=0x563735d12000, m=0x563735a40d00) at /build/ceph-12.2.12/src/client/Client.cc:2571
#11 0x000056372c4d81ea in Messenger::ms_deliver_dispatch (m=0x563735a40d00, this=0x563735c6a000) at /build/ceph-12.2.12/src/msg/Messenger.h:668
#12 DispatchQueue::entry (this=0x563735c6a180) at /build/ceph-12.2.12/src/msg/DispatchQueue.cc:197
#13 0x000056372c1f37dd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-12.2.12/src/msg/DispatchQueue.h:101
#14 0x00007fbe584ff6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fbe5736741d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 7
#7  0x000056372c0b570d in Client::add_update_cap (this=this@entry=0x563735d12000, in=in@entry=0x563735d54000, mds_session=mds_session@entry=0x563735b10d00, cap_id=5277779602, issued=341, wanted=340, seq=1, mseq=0, realm=..., flags=1,
    cap_perms=...) at /build/ceph-12.2.12/src/client/Client.cc:3963
3963          assert(cap == in->auth_cap);
(gdb) p in->auth_cap
$1 = (Cap *) 0x0
(gdb) p seq
$2 = 1
(gdb) p cap.seq
$3 = 1

Related issues

Copied to fs - Backport #42631: nautilus: client: FAILED assert(cap == in->auth_cap) Resolved
Copied to fs - Backport #42632: mimic: client: FAILED assert(cap == in->auth_cap) Rejected

History

#1 Updated by Xiaoxi Chen 10 months ago

The issue affect all releases including master

#2 Updated by Patrick Donnelly 10 months ago

  • Subject changed from FAILED assert(cap == in->auth_cap) to client: FAILED assert(cap == in->auth_cap)
  • Target version set to v15.0.0
  • Start date deleted (09/12/2019)
  • Source set to Community (dev)
  • Component(FS) Client added
  • Component(FS) deleted (ceph-fuse)

#3 Updated by Zheng Yan 10 months ago

  • Assignee set to Zheng Yan

#4 Updated by Patrick Donnelly 10 months ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Zheng Yan to Xiaoxi Chen
  • Pull request ID set to 30402
  • Labels (FS) offline added

#5 Updated by Zheng Yan 10 months ago

I'd like to know why the cap import message's seq is 1, mseq is 0. please use gdb to print cap import message's peer information. (f 8, p m->peer)

#6 Updated by Xiaoxi Chen 10 months ago

(gdb) p m->peer
$1 = {cap_id = {v = 2782052343}, seq = {v = 4}, mseq = {v = 0}, mds = {v = 1}, flags = 2 '\002'}

#7 Updated by Xiaoxi Chen 9 months ago

@Zheng,

any update or anything we can help?

seems like we can change from

if (ceph_seq_cmp(seq, cap.seq) <= 0) {

to
if (ceph_seq_cmp(seq, cap.seq) < 0) {

as ceph_seq_cmp(seq, cap.seq) ==0 indicating exactly the import_cap message.

#8 Updated by Xiaoxi Chen 9 months ago

One more victim on 13.2.5


    -4> 2019-09-28 09:01:50.156 7fe6799e3700  1 -- 10.206.140.28:0/1309002991 <== mds.0 10.218.2.224:6800/2026541741 2 ==== client_caps(import ino 0x1000598f6bb 1913515435 seq 1 caps=pAsLsXsFs dirty=- wanted=- follows 0 mseq 1 size 0/0 ts 1/18446744073709551615 mtime 2019-06-05 17:21:06.348736 xattrs(v=1 l=4)) v11 ==== 321+4+0 (3926379498 0 0) 0x564a110b9680 con 0x564a111d7e00
    -3> 2019-09-28 09:01:50.156 7fe6799e3700  5 client.197381312 handle_cap_import ino 0x1000598f6bb mseq 1 IMPORT from mds.0
    -2> 2019-09-28 09:01:50.156 7fe67c9e9700  5 -- 10.206.140.28:0/1309002991 >> 10.156.69.192:6789/0 conn(0x564a11373000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=16068008 cs=1 l=1). rx mon.3 seq 143 0x564a113da500 osd_map(4050054..4050054 src has 4050054..4072219) v3
    -1> 2019-09-28 09:01:50.156 7fe67c9e9700  5 -- 10.206.140.28:0/1309002991 >> 10.156.69.192:6789/0 conn(0x564a11373000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=16068008 cs=1 l=1). rx mon.3 seq 144 0x564a115a5b80 osd_map(4050055..4050059 src has 4050054..4072219) v3
     0> 2019-09-28 09:01:50.156 7fe6799e3700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: In function 'void Client::add_update_cap(Inode*, MetaSession*, uint64_t, unsigned int, unsigned int, unsigned int, inodeno_t, int, const UserPerm&)' thread 7fe6799e3700 time 2019-09-28 09:01:50.158860
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: 3936: FAILED assert(&cap == in->auth_cap)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fe685611fbf]
 2: (()+0x26d187) [0x7fe685612187]
 3: (Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)+0x945) [0x564a101ccfd5]
 4: (Client::handle_cap_import(MetaSession*, Inode*, MClientCaps*)+0x26a) [0x564a101ce89a]
 5: (Client::handle_caps(MClientCaps*)+0x5d1) [0x564a102037f1]
 6: (Client::ms_dispatch(Message*)+0x41b) [0x564a1020c0cb]
 7: (DispatchQueue::entry()+0xb7a) [0x7fe6856ce39a]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe68576c2cd]
 9: (()+0x7dd5) [0x7fe683648dd5]
 10: (clone()+0x6d) [0x7fe682521ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#9 Updated by Zheng Yan 9 months ago

Sorry for the delay

I understand what happened. (two mds exported non-auth caps )

PR#30402 is a good workaround.

#10 Updated by Patrick Donnelly 8 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus,mimic

#11 Updated by Nathan Cutler 8 months ago

  • Copied to Backport #42631: nautilus: client: FAILED assert(cap == in->auth_cap) added

#12 Updated by Nathan Cutler 8 months ago

  • Copied to Backport #42632: mimic: client: FAILED assert(cap == in->auth_cap) added

#13 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF