Bug #56698

client: FAILED ceph_assert(_size == 0)

Added by Patrick Donnelly over 1 year ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

50%

Source:
Q/A
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2022-07-22T20:36:19.033 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr:/build/ceph-17.0.0-13732-g89768db3/src/include/xlist.h: In function 'xlist<T>::~xlist() [with T = Inode*]' thread 7f91367fc700 time 2022-07-22T20:36:19.037781+0000
2022-07-22T20:36:19.033 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr:/build/ceph-17.0.0-13732-g89768db3/src/include/xlist.h: 81: FAILED ceph_assert(_size == 0)
2022-07-22T20:36:19.036 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: ceph version 17.0.0-13732-g89768db3 (89768db311950607682ea2bb29f56edc324f86ac) quincy (dev)
2022-07-22T20:36:19.037 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f9152d6fa4c]
2022-07-22T20:36:19.037 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 2: /usr/lib/ceph/libceph-common.so.2(+0x2b9c5e) [0x7f9152d6fc5e]
2022-07-22T20:36:19.037 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 3: (std::_Sp_counted_ptr<MetaSession*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1cd) [0x560c9acb1fad]
2022-07-22T20:36:19.037 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 4: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x48) [0x560c9acaa978]
2022-07-22T20:36:19.037 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 5: (Client::handle_client_session(boost::intrusive_ptr<MClientSession const> const&)+0xf8) [0x560c9ac92bf8]
2022-07-22T20:36:19.038 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 6: (Client::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x513) [0x560c9ac998b3]
2022-07-22T20:36:19.038 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 7: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f9153010ec0]
2022-07-22T20:36:19.038 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 8: (DispatchQueue::entry()+0x58f) [0x7f915300e75f]
2022-07-22T20:36:19.038 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f91530ddac1]
2022-07-22T20:36:19.039 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f9152a9b609]
2022-07-22T20:36:19.039 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi066.stderr: 11: clone()

From: /ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945819/teuthology.log


Subtasks

Bug #61913: client: crash the client more gracefully (Closed, Xiubo Li)

Bug #61914: client: improve the libcephfs when MDS is stopping (Fix Under Review, Xiubo Li)


Related issues

Related to CephFS - Bug #56003: client: src/include/xlist.h: 81: FAILED ceph_assert(_size == 0) Duplicate
Copied to CephFS - Backport #62519: quincy: client: FAILED ceph_assert(_size == 0) Resolved
Copied to CephFS - Backport #62520: pacific: client: FAILED ceph_assert(_size == 0) Resolved
Copied to CephFS - Backport #62521: reef: client: FAILED ceph_assert(_size == 0) Resolved

History

#1 Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Rishabh Dave

#2 Updated by Xiubo Li 8 months ago

https://pulpito.ceph.com/rishabh-2023-06-19_18:26:08-fs-wip-rishabh-2023June18-testing-default-smithi/7307845/

    -7> 2023-06-19T21:20:50.862+0000 7f66b37fe700 10 client.4746 remove_cap mds.3 on 0x10000002ec1.head(faked_ino=0 nref=5 ll_ref=401 cap_refs={} open={} mode=40775 size=0/0 nlink=1 btime=2023-06-19T21:15:40.143541+0000 mtime=2023-06-19T21:20:48.215040+0000 ctime=2023-06-19T21:20:48.215040+0000 change_attr=151 caps=pAsLsXsFsx(0=pAsLsXsFsx,3=pAsLsXs) parents=0x10000002c44.head["common"] 0x7f66ac40b700)
    -6> 2023-06-19T21:20:50.862+0000 7f66b37fe700 20 client.4746 put_inode on 0x10000002ec1.head(faked_ino=0 nref=5 ll_ref=401 cap_refs={} open={} mode=40775 size=0/0 nlink=1 btime=2023-06-19T21:15:40.143541+0000 mtime=2023-06-19T21:20:48.215040+0000 ctime=2023-06-19T21:20:48.215040+0000 change_attr=151 caps=pAsLsXsFsx(0=pAsLsXsFsx) parents=0x10000002c44.head["common"] 0x7f66ac40b700) n = 1
    -5> 2023-06-19T21:20:50.862+0000 7f66b37fe700 10 client.4746 remove_cap mds.3 on 0x10000002c44.head(faked_ino=0 nref=143 ll_ref=4888 cap_refs={} open={} mode=40775 size=0/0 nlink=1 btime=2023-06-19T21:15:37.179416+0000 mtime=2023-06-19T21:20:49.829995+0000 ctime=2023-06-19T21:20:49.829995+0000 change_attr=160 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXs,2=pAsLsXsFs,3=pAsLsXs) parents=0x1000000157b.head["src"] 0x7f66ac2f0900)
    -4> 2023-06-19T21:20:50.862+0000 7f66b37fe700 20 client.4746 put_inode on 0x10000002c44.head(faked_ino=0 nref=143 ll_ref=4888 cap_refs={} open={} mode=40775 size=0/0 nlink=1 btime=2023-06-19T21:15:37.179416+0000 mtime=2023-06-19T21:20:49.829995+0000 ctime=2023-06-19T21:20:49.829995+0000 change_attr=160 caps=pAsLsXsFs(0=pAsLsXsFs,1=pAsLsXs,2=pAsLsXsFs) parents=0x1000000157b.head["src"] 0x7f66ac2f0900) n = 1
    -3> 2023-06-19T21:20:50.862+0000 7f66b37fe700 10 client.4746 kick_requests_closed for mds.3
    -2> 2023-06-19T21:20:50.862+0000 7f66cc79c700  1 -- 192.168.0.1:0/1579759531 reap_dead start
    -1> 2023-06-19T21:20:50.863+0000 7f66b37fe700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4483-g2a42270d/rpm/el8/BUILD/ceph-18.0.0-4483-g2a42270d/src/include/xlist.h: In function 'xlist<T>::~xlist() [with T = Inode*]' thread 7f66b37fe700 time 2023-06-19T21:20:50.862434+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4483-g2a42270d/rpm/el8/BUILD/ceph-18.0.0-4483-g2a42270d/src/include/xlist.h: 81: FAILED ceph_assert(_size == 0)

 ceph version 18.0.0-4483-g2a42270d (2a42270dc66b4270a8c395e9316834a4a74c094e) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f66d8ea9dbf]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f66d8ea9f85]
 3: (MetaSession::~MetaSession()+0x1eb) [0x562df37ad37b]
 4: (std::_Sp_counted_ptr<MetaSession*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x16) [0x562df37ad3f6]
 5: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x3a) [0x562df378feba]
 6: (Client::handle_client_session(boost::intrusive_ptr<MClientSession const> const&)+0xf8) [0x562df3776ba8]
 7: (Client::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x555) [0x562df3779a25]
 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f66d9122808]
 9: (DispatchQueue::entry()+0x50f) [0x7f66d911f9af]
 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f66d91e5ca1]
 11: /lib64/libpthread.so.0(+0x81cf) [0x7f66d7a081cf]
 12: clone()

#3 Updated by Rishabh Dave 8 months ago

  • Assignee changed from Rishabh Dave to Xiubo Li

Xiubo spent half a day working on a failure caused by this issue. I have not spent any time on this ticket recently, and Xiubo wants to continue working on it, so I am reassigning the ticket to him.

#4 Updated by Xiubo Li 8 months ago

I have gone through all the xlist in the MetaSession:

  xlist<Cap*> caps;
  // dirty_list keeps all the dirty inodes before flushing in current session.
  xlist<Inode*> dirty_list;
  xlist<Inode*> flushing_caps;
  xlist<MetaRequest*> requests;
  xlist<MetaRequest*> unsafe_requests;

All of them are cleaned up in Client::_closed_mds_session(), and I am not sure what I have missed. I raised one subtask, https://tracker.ceph.com/issues/61913, to make the client crash more gracefully. With that we can know exactly which list was not cleaned.

#5 Updated by Xiubo Li 8 months ago

  • Status changed from Triaged to In Progress

#6 Updated by Venky Shankar 7 months ago

Xiubo, do we have the core for this crash? If you have the debug env, figuring out which xlist member in MetaSession was not empty should be straightforward.

#7 Updated by Xiubo Li 7 months ago

Venky Shankar wrote:

Xiubo, do we have the core for this crash? If you have the debug env, figuring out which xlist member in MetaSession was not empty should be straightforward.

I think I have found the root cause: when the auth cap changes, the client fails to move the inode to the new session.

#8 Updated by Xiubo Li 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52335

#9 Updated by Venky Shankar 7 months ago

  • Related to Bug #56003: client: src/include/xlist.h: 81: FAILED ceph_assert(_size == 0) added

#10 Updated by Venky Shankar 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from quincy,pacific to reef,quincy,pacific

#11 Updated by Backport Bot 6 months ago

  • Copied to Backport #62519: quincy: client: FAILED ceph_assert(_size == 0) added

#12 Updated by Backport Bot 6 months ago

  • Copied to Backport #62520: pacific: client: FAILED ceph_assert(_size == 0) added

#13 Updated by Backport Bot 6 months ago

  • Copied to Backport #62521: reef: client: FAILED ceph_assert(_size == 0) added

#14 Updated by Backport Bot 6 months ago

  • Tags set to backport_processed

#15 Updated by Konstantin Shalygin 2 months ago

  • Status changed from Pending Backport to Resolved
