Bug #53082

ceph-fuse: segmentation fault in Client::handle_mds_map

Added by Patrick Donnelly over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-10-26T22:49:17.843 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr:*** Caught signal (Segmentation fault) **
2021-10-26T22:49:17.843 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: in thread 7f20f77fe700 thread_name:ms_dispatch
2021-10-26T22:49:17.844 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: ceph version 17.0.0-8469-g3dd8f159 (3dd8f1596bf5dc8769506e3cff803328189f20b1) quincy (dev)
2021-10-26T22:49:17.844 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f2114ceeb20]
2021-10-26T22:49:17.845 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 2: (Client::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0x617) [0x55dfb7943287]
2021-10-26T22:49:17.845 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 3: (Client::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x7fd) [0x55dfb79448ad]
2021-10-26T22:49:17.845 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 4: (DispatchQueue::entry()+0x14fa) [0x7f2115b3c37a]
2021-10-26T22:49:17.845 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 5: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2115bf2741]
2021-10-26T22:49:17.846 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 6: /lib64/libpthread.so.0(+0x814a) [0x7f2114ce414a]
2021-10-26T22:49:17.846 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 7: clone()

From: /ceph/teuthology-archive/pdonnell-2021-10-26_22:12:03-fs-wip-pdonnell-testing-20211025.000447-distro-basic-smithi/6462553/teuthology.log

I tried digging into the core file, but getting the right packages set up with centos:stream has been difficult. So far I can't see what caused this. Hopefully we can reproduce it again with centos:8.

It's possible, but unlikely, that this is related to #52436.

History

#1 Updated by Xiubo Li over 1 year ago

There are some logs before the crash:

    -8> 2021-10-26T22:49:17.842+0000 7f20f77fe700  1 client.4692 handle_mds_map epoch 87
    -7> 2021-10-26T22:49:17.842+0000 7f20f77fe700  1 -- 192.168.0.1:0/3403711952 >> [v2:172.21.15.73:6838/301218788,v1:172.21.15.73:6839/301218788] conn(0x7f20b809bbf0 msgr2=0x7f20b80d1b20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
    -6> 2021-10-26T22:49:17.842+0000 7f20f77fe700  1 --2- 192.168.0.1:0/3403711952 >> [v2:172.21.15.73:6838/301218788,v1:172.21.15.73:6839/301218788] conn(0x7f20b809bbf0 0x7f20b80d1b20 crc :-1 s=READY pgs=172 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
    -5> 2021-10-26T22:49:17.842+0000 7f20f77fe700  5 client.4692 _closed_mds_session mds.2 seq 0
    -4> 2021-10-26T22:49:17.842+0000 7f20f77fe700  1 -- 192.168.0.1:0/3403711952 >> [v2:172.21.15.73:6838/301218788,v1:172.21.15.73:6839/301218788] conn(0x7f20b809bbf0 msgr2=0x7f20b80d1b20 crc :-1 s=STATE_CLOSED l=0).mark_down
    -3> 2021-10-26T22:49:17.842+0000 7f20f77fe700  1 --2- 192.168.0.1:0/3403711952 >> [v2:172.21.15.73:6838/301218788,v1:172.21.15.73:6839/301218788] conn(0x7f20b809bbf0 0x7f20b80d1b20 crc :-1 s=CLOSED pgs=172 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
    -2> 2021-10-26T22:49:17.842+0000 7f20f77fe700 10 client.4692 remove_session_caps mds.2
    -1> 2021-10-26T22:49:17.842+0000 7f20f77fe700 10 client.4692 kick_requests_closed for mds.2
     0> 2021-10-26T22:49:17.843+0000 7f20f77fe700 -1 *** Caught signal (Segmentation fault) **

It must have crashed in kick_requests_closed() or after it. Checking the code, since the crash was in libpthread.so:

2021-10-26T22:49:17.844 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi017.stderr: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f2114ceeb20]

Could it be caused by some pthread_* call? For example, `req->caller_cond->notify_all();` eventually calls `pthread_cond_broadcast()`.
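A minimal C++ sketch of the hazard suspected above, under stated assumptions: the names `Request`, `ClientSketch`, `wait_for_kick` and `kick_requests_closed` are hypothetical simplifications, not Ceph's actual `Client`/`MetaRequest` code, and this is not the change merged in PR 44038. It only illustrates the idea: if a request keeps a raw pointer to a waiter's stack-allocated condition variable, notifying through that pointer after the waiter has returned touches a destroyed object, and with libstdc++ on Linux the fault would surface inside `pthread_cond_broadcast()`. The sketch avoids this by clearing the pointer under the same lock before the condition variable goes out of scope, and by skipping null pointers on the kick path.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>

// Hypothetical, simplified request: NOT Ceph's MetaRequest.
struct Request {
  int mds = -1;                                    // MDS rank the request targets
  std::condition_variable* caller_cond = nullptr;  // waiter's condvar, or nullptr once it has left
  bool kicked = false;                             // set when the session is closed
};

// Hypothetical, simplified client: NOT Ceph's Client.
struct ClientSketch {
  std::mutex lock;                   // guards `requests` and every Request field
  std::map<int, Request*> requests;  // tid -> in-flight request

  // Waiter side: block on a stack-local condvar until the request is kicked.
  void wait_for_kick(int tid, Request& req) {
    std::unique_lock<std::mutex> l(lock);
    assert(req.caller_cond == nullptr);  // a request has at most one waiter
    std::condition_variable cond;
    req.caller_cond = &cond;
    requests[tid] = &req;
    cond.wait(l, [&req] { return req.kicked; });
    // Clear the pointer while still holding the lock, before `cond` is
    // destroyed, so no later notify can reach a dead object.
    req.caller_cond = nullptr;
    requests.erase(tid);
  }

  // Kick side: mark every request aimed at the closed MDS rank and wake its
  // waiter if one is still parked. Returns the number of waiters notified.
  int kick_requests_closed(int mds) {
    std::lock_guard<std::mutex> l(lock);
    int woken = 0;
    for (auto& entry : requests) {
      Request* req = entry.second;
      if (req->mds != mds)
        continue;
      req->kicked = true;
      if (req->caller_cond != nullptr) {
        // With libstdc++ on Linux this ends up in pthread_cond_broadcast().
        req->caller_cond->notify_all();
        ++woken;
      }
    }
    return woken;
  }
};
```

The invariant that matters is that `caller_cond` is only read or cleared while `lock` is held, so the kick path sees either a live condition variable or `nullptr`, never a dangling pointer.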

#2 Updated by Xiubo Li about 1 year ago

  • Assignee set to Xiubo Li

Venky, I will take it.

#3 Updated by Xiubo Li about 1 year ago

  • Status changed from New to In Progress

#4 Updated by Xiubo Li about 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 44038

#5 Updated by Venky Shankar about 1 year ago

  • Status changed from Fix Under Review to Resolved
