Bug #56288

crash: Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)

Added by Telemetry Bot almost 2 years ago. Updated 15 days ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Telemetry
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
915ad90d349e6a333f04e0504a80cafc8ba5777680640de0597c7b0275ea1dc4


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=aeaa2b6c5a82bba2b2f33885dde87b61d62ebb22b579a85f5fbf5c3eedab2c67

Sanitized backtrace:

    Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)
    Client::readdir_r_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, unsigned int, unsigned int, bool)

Crash dump sample:
{
    "backtrace": [
        "__kernel_rt_sigreturn()",
        "(Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)+0x37c) [0xaaaaca9854b4]",
        "(Client::readdir_r_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, unsigned int, unsigned int, bool)+0x9f0) [0xaaaaca9865d0]",
        "ceph-fuse(+0x9f37c) [0xaaaaca93737c]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x14068) [0xffffba485068]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x15064) [0xffffba486064]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x12158) [0xffffba483158]",
        "/lib/aarch64-linux-gnu/libpthread.so.0(+0x751c) [0xffffb997b51c]",
        "/lib/aarch64-linux-gnu/libc.so.6(+0xd122c) [0xffffb96b522c]" 
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-05-06T09:38:25.862101Z_96a4e7ca-adf6-4a56-acd2-0731016b50ca",
    "entity_name": "client.779720a56c617f4713d46a0389a5f0b5c78d2903",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.4 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-fuse",
    "stack_sig": "915ad90d349e6a333f04e0504a80cafc8ba5777680640de0597c7b0275ea1dc4",
    "timestamp": "2022-05-06T09:38:25.862101Z",
    "utsname_machine": "aarch64",
    "utsname_release": "5.4.0-1059-raspi",
    "utsname_sysname": "Linux",
    "utsname_version": "#67-Ubuntu SMP PREEMPT Mon Apr 11 14:16:01 UTC 2022" 
}


Files

log.gz (219 KB) - ceph client logs from upstream user - Milind Changire, 07/17/2023 03:41 PM
Actions #1

Updated by Telemetry Bot almost 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.0 added
Actions #2

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Venky Shankar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) Client added
  • Labels (FS) crash added
Actions #4

Updated by Milind Changire 9 months ago

  • File log.gz added

Similar crash report in ceph-users mailing list
Actions #5

Updated by Venky Shankar 9 months ago

Milind Changire wrote:

Similar crash report in ceph-users mailing list

Thanks for the logs. I'll have a look.

Actions #6

Updated by Venky Shankar 9 months ago

  • Backport changed from pacific,quincy to reef,quincy,pacific
Actions #7

Updated by Milind Changire 9 months ago

Venky,
The upstream user has also sent across debug (level 20) logs for ceph-fuse as well as mds.
Unfortunately, the size is more than 1000 KB and so they can't be attached to the tracker.
Let me know when you need them.

Actions #8

Updated by Venky Shankar 9 months ago

Milind Changire wrote:

Venky,
The upstream user has also sent across debug (level 20) logs for ceph-fuse as well as mds.
Unfortunately, the size is more than 1000 KB and so they can't be attached to the tracker.
Let me know when you need them.

That would help. Please upload them using ceph-post-file.
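
For reference, an invocation along these lines should do it (the description and file names below are just placeholders; check the ceph-post-file man page in your release for the exact options):

    ceph-post-file -d "tracker 56288: ceph-fuse and mds debug logs" ceph-fuse.log.gz mds.log.gz

It should print an upload id on success; posting that id here lets the files be located.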

Actions #9

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)
Actions #10

Updated by lei liu about 1 month ago

We recently encountered a similar issue, may I ask if there is a solution?

Actions #11

Updated by Venky Shankar about 1 month ago

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

Actions #12

Updated by Venky Shankar about 1 month ago

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

Is there a coredump that can be shared?

Actions #13

Updated by Venky Shankar about 1 month ago

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

For now, restart ceph-fuse and that should get you past this.

Actions #14

Updated by Venky Shankar about 1 month ago

I think this happens when there are concurrent lookups and deletes under a directory. _readdir_cache_cb() has code like

    int idx = pd - dir->readdir_cache.begin();
    if (dn->inode->is_dir() && cct->_conf->client_dirsize_rbytes) {
      mask |= CEPH_STAT_RSTAT;
    }
    int r = _getattr(dn->inode, mask, dirp->perms);
    if (r < 0)
      return r;

    // the content of readdir_cache may change after _getattr(), so pd may be invalid iterator
    pd = dir->readdir_cache.begin() + idx;
    if (pd >= dir->readdir_cache.end() || *pd != dn)
      return -CEPHFS_EAGAIN;

... to handle the cache getting modified when the client_lock gets dropped in _getattr() (via add_update_inode()). I'll reproduce this and see what's going on.
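
As a standalone illustration of that pattern (this is not Ceph code; the names and the simulated concurrent erase are made up, and readdir_cache is assumed to behave like a std::vector that another thread can mutate while the lock is dropped):

    // Sketch: save the index, let the container change underneath us,
    // re-derive the iterator, and validate it before trusting it.
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<int> cache = {10, 20, 30, 40};  // stands in for readdir_cache
      auto pd = cache.begin() + 2;                // iterator into the cache
      int dn = *pd;                               // remember the entry it points at
      long idx = pd - cache.begin();              // save the position, not the iterator

      // Simulate another thread shrinking the cache while client_lock is
      // dropped inside _getattr() (e.g. a concurrent unlink).
      cache.erase(cache.begin());                 // the old 'pd' is now invalid

      // Re-derive the iterator from the saved index and validate before use.
      pd = cache.begin() + idx;
      if (pd >= cache.end() || *pd != dn) {
        std::printf("cache changed underneath us; caller should retry (EAGAIN)\n");
        return 0;
      }
      std::printf("entry still in place: %d\n", *pd);
      return 0;
    }

Here the erase shifts the remaining entries, so the re-derived iterator no longer points at the remembered entry and the EAGAIN path is taken, which is exactly the situation the *pd != dn check above is meant to catch.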

If you have a core to share, it'll speed up the debugging. Thx!

Actions #15

Updated by lei liu about 1 month ago

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

For some reason, we did not save the core file. Currently, we only have some dmesg information: ganesha.nfsd24416: segfault at 90 ip 00007f79f8731227 sp 00007f78037f24d0 error 4 in libcephfs.so.2.0.0[7f79f86d3000+e2000]. This seems to point to https://github.com/ceph/ceph/blob/v15.2.17/src/client/Client.cc#L8171. In our environment, this issue has occurred three times in the past three weeks, but I have not yet found a way to reproduce it.

Actions #16

Updated by Venky Shankar about 1 month ago

lei liu wrote:

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

For some reason, we did not save the core file. Currently, we only have some dmesg information: ganesha.nfsd24416: segfault at 90 ip 00007f79f8731227 sp 00007f78037f24d0 error 4 in libcephfs.so.2.0.0[7f79f86d3000+e2000]. This seems to point to https://github.com/ceph/ceph/blob/v15.2.17/src/client/Client.cc#L8171. In our environment, this issue has occurred three times in the past three weeks, but I have not yet found a way to reproduce it.

Ok, yeah. I suspected that part (mentioned in note-14).

Actions #17

Updated by Venky Shankar 15 days ago

I haven't been able to reproduce this with the main branch. If possible, please collect a coredump and attach it to this tracker.

Actions #18

Updated by Venky Shankar 15 days ago

So, for some reason this part of the code

    if (pd >= dir->readdir_cache.end() || *pd != dn)
      return -CEPHFS_EAGAIN;

especially dereferencing pd, is causing the crash. I don't have a core to validate that, but none of the other checks look likely to be the cause.

Actions #19

Updated by lei liu 15 days ago

Venky Shankar wrote in #note-18:

So, for some reason this part of the code

[...]

especially dereferencing pd, is causing the crash. I don't have a core to validate that, but none of the other checks look likely to be the cause.

Thank you, I will send you the core file when it happens again.
