Bug #56288

crash: Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)

Added by Telemetry Bot almost 2 years ago. Updated 15 days ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Telemetry
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
915ad90d349e6a333f04e0504a80cafc8ba5777680640de0597c7b0275ea1dc4


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=aeaa2b6c5a82bba2b2f33885dde87b61d62ebb22b579a85f5fbf5c3eedab2c67

Sanitized backtrace:

    Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)
    Client::readdir_r_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, unsigned int, unsigned int, bool)

Crash dump sample:
{
    "backtrace": [
        "__kernel_rt_sigreturn()",
        "(Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool)+0x37c) [0xaaaaca9854b4]",
        "(Client::readdir_r_cb(dir_result_t*, int (*)(void*, dirent*, ceph_statx*, long, Inode*), void*, unsigned int, unsigned int, bool)+0x9f0) [0xaaaaca9865d0]",
        "ceph-fuse(+0x9f37c) [0xaaaaca93737c]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x14068) [0xffffba485068]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x15064) [0xffffba486064]",
        "/lib/aarch64-linux-gnu/libfuse.so.2(+0x12158) [0xffffba483158]",
        "/lib/aarch64-linux-gnu/libpthread.so.0(+0x751c) [0xffffb997b51c]",
        "/lib/aarch64-linux-gnu/libc.so.6(+0xd122c) [0xffffb96b522c]" 
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-05-06T09:38:25.862101Z_96a4e7ca-adf6-4a56-acd2-0731016b50ca",
    "entity_name": "client.779720a56c617f4713d46a0389a5f0b5c78d2903",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.4 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-fuse",
    "stack_sig": "915ad90d349e6a333f04e0504a80cafc8ba5777680640de0597c7b0275ea1dc4",
    "timestamp": "2022-05-06T09:38:25.862101Z",
    "utsname_machine": "aarch64",
    "utsname_release": "5.4.0-1059-raspi",
    "utsname_sysname": "Linux",
    "utsname_version": "#67-Ubuntu SMP PREEMPT Mon Apr 11 14:16:01 UTC 2022" 
}


Files

log.gz (219 KB) - ceph client logs from upstream user - Milind Changire, 07/17/2023 03:41 PM
Actions #1

Updated by Telemetry Bot almost 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.0 added
Actions #2

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Venky Shankar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) Client added
  • Labels (FS) crash added
Actions #4

Updated by Milind Changire 9 months ago

  • File log.gz added

Similar crash report in ceph-users mailing list
Actions #5

Updated by Venky Shankar 9 months ago

Milind Changire wrote:

Similar crash report in ceph-users mailing list

Thanks for the logs. I'll have a look.

Actions #6

Updated by Venky Shankar 9 months ago

  • Backport changed from pacific,quincy to reef,quincy,pacific
Actions #7

Updated by Milind Changire 9 months ago

Venky,
The upstream user has also sent across debug (level 20) logs for ceph-fuse as well as mds.
Unfortunately, the size is more than 1000 KB and so they can't be attached to the tracker.
Let me know when you need them.

Actions #8

Updated by Venky Shankar 9 months ago

Milind Changire wrote:

Venky,
The upstream user has also sent across debug (level 20) logs for ceph-fuse as well as mds.
Unfortunately, the size is more than 1000 KB and so they can't be attached to the tracker.
Let me know when you need them.

That would help. Please upload them using ceph-post-file.
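
For reference, an invocation along these lines should do it (the description and file names below are just placeholders; check the ceph-post-file man page in your release for the exact options):

    ceph-post-file -d "tracker 56288: ceph-fuse and mds debug logs" ceph-fuse.log.gz mds.log.gz

It should print an upload id on success; posting that id here lets the files be located.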

Actions #9

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)
Actions #10

Updated by lei liu about 1 month ago

We recently encountered a similar issue, may I ask if there is a solution?

Actions #11

Updated by Venky Shankar about 1 month ago

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

Actions #12

Updated by Venky Shankar about 1 month ago

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

Is there a coredump that can be shared?

Actions #13

Updated by Venky Shankar about 1 month ago

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

For now, restart ceph-fuse and that should get you past this.

Actions #14

Updated by Venky Shankar about 1 month ago

I think this happens when there are concurrent lookups and deletes under a directory. _readdir_cache_cb() has code like

    int idx = pd - dir->readdir_cache.begin();
    if (dn->inode->is_dir() && cct->_conf->client_dirsize_rbytes) {
      mask |= CEPH_STAT_RSTAT;
    }
    int r = _getattr(dn->inode, mask, dirp->perms);
    if (r < 0)
      return r;

    // the content of readdir_cache may change after _getattr(), so pd may be invalid iterator
    pd = dir->readdir_cache.begin() + idx;
    if (pd >= dir->readdir_cache.end() || *pd != dn)
      return -CEPHFS_EAGAIN;

... to handle the cache getting modified when the client_lock gets dropped in _getattr() (via add_update_inode()). I'll reproduce this and see what's going on.
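
As a standalone illustration of that pattern (this is not Ceph code; the names and the simulated concurrent erase are made up, and readdir_cache is assumed to behave like a std::vector that another thread can mutate while the lock is dropped):

    // Sketch: save the index, let the container change underneath us,
    // re-derive the iterator, and validate it before trusting it.
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<int> cache = {10, 20, 30, 40};  // stands in for readdir_cache
      auto pd = cache.begin() + 2;                // iterator into the cache
      int dn = *pd;                               // remember the entry it points at
      long idx = pd - cache.begin();              // save the position, not the iterator

      // Simulate another thread shrinking the cache while client_lock is
      // dropped inside _getattr() (e.g. a concurrent unlink).
      cache.erase(cache.begin());                 // the old 'pd' is now invalid

      // Re-derive the iterator from the saved index and validate before use.
      pd = cache.begin() + idx;
      if (pd >= cache.end() || *pd != dn) {
        std::printf("cache changed underneath us; caller should retry (EAGAIN)\n");
        return 0;
      }
      std::printf("entry still in place: %d\n", *pd);
      return 0;
    }

Here the erase shifts the remaining entries, so the re-derived iterator no longer points at the remembered entry and the EAGAIN path is taken, which is exactly the situation the *pd != dn check above is meant to catch.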

If you have a core to share, it'll speed up the debugging. Thx!

Actions #15

Updated by lei liu about 1 month ago

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

For some reason, we did not save the core file. Currently, we only have some dmesg information: ganesha.nfsd24416: segfault at 90 ip 00007f79f8731227 sp 00007f78037f24d0 error 4 in libcephfs.so.2.0.0[7f79f86d3000+e2000]. This seems to point to https://github.com/ceph/ceph/blob/v15.2.17/src/client/Client.cc#L8171. In our environment, this issue has occurred three times in the past three weeks, but I have not yet found a way to reproduce it.

Actions #16

Updated by Venky Shankar about 1 month ago

lei liu wrote:

Venky Shankar wrote:

lei liu wrote:

We recently encountered a similar issue, may I ask if there is a solution?

Which version was this hit in? 17.2.*?

version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

For some reason, we did not save the core file. Currently, we only have some dmesg information: ganesha.nfsd24416: segfault at 90 ip 00007f79f8731227 sp 00007f78037f24d0 error 4 in libcephfs.so.2.0.0[7f79f86d3000+e2000]. This seems to point to https://github.com/ceph/ceph/blob/v15.2.17/src/client/Client.cc#L8171. In our environment, this issue has occurred three times in the past three weeks, but I have not yet found a way to reproduce it.

Ok, yeah. I suspected that part (mentioned in note-14).

Actions #17

Updated by Venky Shankar 15 days ago

I haven't been able to reproduce this with the main branch. If possible, please collect a coredump and attach it to this tracker.

Actions #18

Updated by Venky Shankar 15 days ago

So, for some reason this part of the code

    if (pd >= dir->readdir_cache.end() || *pd != dn)
      return -CEPHFS_EAGAIN;

especially dereferencing pd, is causing the crash. I don't have a core to validate that, but none of the other checks look likely to be the cause.

Actions #19

Updated by lei liu 15 days ago

Venky Shankar wrote in #note-18:

So, for some reason this part of the code

[...]

especially dereferencing pd, is causing the crash. I don't have a core to validate that, but none of the other checks look likely to be the cause.

Thank you, I will send you the core file when it happens again.
