Bug #56249

crash: int Client::_do_remount(bool): abort

Added by Telemetry Bot 5 months ago. Updated 3 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source: Telemetry
Tags: backport_processed
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID:
Crash signature (v1): 97e13040c15aa2bae337c24391bb31795b267fb6425c39aea3bd1efddf065920


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=e023ce46f46b39b4a3c88a317c424d5472918b3a3be586dd320a767e159b2d8a

Assert condition: abort
Assert function: int Client::_do_remount(bool)

Sanitized backtrace:

    pthread_kill()
    raise()
    Client::_do_remount(bool)
    Context::complete(int)
    Finisher::finisher_thread_entry()

Crash dump sample:
{
    "assert_condition": "abort",
    "assert_file": "client/Client.cc",
    "assert_func": "int Client::_do_remount(bool)",
    "assert_line": 4428,
    "assert_msg": "client/Client.cc: In function 'int Client::_do_remount(bool)' thread 7fb2027fc640 time 2022-04-23T13:37:17.465777-0400\nclient/Client.cc: 4428: ceph_abort_msg(\"abort() called\")",
    "assert_thread_name": "fn_anonymous",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fb216cb0520]",
        "pthread_kill()",
        "raise()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x190) [0x7fb217422a6e]",
        "(Client::_do_remount(bool)+0x2fa) [0x555eec2a7b9a]",
        "(Context::complete(int)+0xd) [0x555eec311dbd]",
        "(Finisher::finisher_thread_entry()+0x175) [0x7fb2174bacb5]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fb216d02b43]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fb216d94a00]" 
    ],
    "ceph_version": "17.1.0",
    "crash_id": "2022-04-23T17:37:17.499177Z_94ef73b0-10d3-4ccc-8763-5785eb804805",
    "entity_name": "client.590b341930fefd001f0579e71f44665ac78a651f",
    "os_id": "22.04",
    "os_name": "Ubuntu Jammy Jellyfish (development branch)",
    "os_version": "22.04 (Jammy Jellyfish)",
    "os_version_id": "22.04",
    "process_name": "ceph-fuse",
    "stack_sig": "97e13040c15aa2bae337c24391bb31795b267fb6425c39aea3bd1efddf065920",
    "timestamp": "2022-04-23T17:37:17.499177Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-17-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#17-Ubuntu SMP Thu Jan 13 16:27:23 UTC 2022" 
}


Related issues

Copied to CephFS - Backport #57394: quincy: crash: int Client::_do_remount(bool): abort (In Progress)
Copied to CephFS - Backport #57395: pacific: crash: int Client::_do_remount(bool): abort (Resolved)

History

#1 Updated by Telemetry Bot 5 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.1.0 added

#2 Updated by Xiubo Li 5 months ago

#3 Updated by Venky Shankar 4 months ago

Xiubo Li wrote:

Should be fixed by https://tracker.ceph.com/issues/54049.

Looks the same. However, I'm not sure whether the client used to crash with the same backtrace as in this tracker.

#4 Updated by Xiubo Li 4 months ago

Venky,

Please check https://tracker.ceph.com/issues/56532. It should be the same bug as this one.

#5 Updated by Xiubo Li 4 months ago

This only exists in v17.1.0, and the logic has changed since [1][2][3] below. When the client tries to remount to invalidate the kernel dcache, the remount may fail because the mount is still busy. After v17.1.0 the client is allowed to retry at most mds_max_retries_on_remount_failure times instead of crashing in _do_remount() directly.

[1] https://tracker.ceph.com/issues/27657
[2] https://tracker.ceph.com/issues/54049
[3] https://tracker.ceph.com/issues/56532

This could be closed directly.

#6 Updated by Xiubo Li 4 months ago

Xiubo Li wrote:

This only exists in v17.1.0, and the logic has changed since [1][2][3] below. When the client tries to remount to invalidate the kernel dcache, the remount may fail because the mount is still busy. After v17.1.0 the client is allowed to retry at most mds_max_retries_on_remount_failure times instead of crashing in _do_remount() directly.

[1] https://tracker.ceph.com/issues/27657
[2] https://tracker.ceph.com/issues/54049
[3] https://tracker.ceph.com/issues/56532

This could be closed directly.

I might be wrong. Actually, [2] introduced a bug: when the remount has failed more than mds_max_retries_on_remount_failure times, we should abort the client. Otherwise the client cannot invalidate the dentry cache, the MDS cache cannot shrink, and that can cause the MDS to fail.
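The retry-then-abort behaviour argued for above can be sketched as a small counter. This is only an illustration, not the Ceph implementation: RemountOutcome and RemountRetryPolicy are hypothetical names, and resetting the counter after a successful remount is an assumption made for the sketch.

```cpp
// Hypothetical outcome type; not part of the Ceph code base.
enum class RemountOutcome { Succeeded, RetryLater, GiveUp };

// Hypothetical helper that tracks remount failures against a retry budget
// (the role played by mds_max_retries_on_remount_failure in the discussion).
class RemountRetryPolicy {
public:
  explicit RemountRetryPolicy(int max_retries) : max_retries_(max_retries) {}

  // rc is the result of one remount attempt; 0 means success.
  RemountOutcome record(int rc) {
    if (rc == 0) {
      failures_ = 0;  // assumption: a success resets the failure count
      return RemountOutcome::Succeeded;
    }
    ++failures_;
    // Give up only once the number of failures exceeds the budget.
    return failures_ > max_retries_ ? RemountOutcome::GiveUp
                                    : RemountOutcome::RetryLater;
  }

  int failures() const { return failures_; }

private:
  int max_retries_;
  int failures_ = 0;
};
```

With a budget of 2, the first two failures yield RetryLater and the third yields GiveUp, which is the point where the caller would abort the client.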

#7 Updated by Xiubo Li 4 months ago

I went through the kernel code and couldn't find any path that could cause the failure in our case.

And from https://tracker.ceph.com/issues/56532#note-1:

$ grep 'to trim kernel dentries' client.* -rn
client.0.774355.log:98:2022-07-14T14:24:03.984+0530 7fbbd5ffb640 -1 client.4495 failed to remount (to trim kernel dentries): return code = 1
client.0.792060.log:137:2022-07-14T14:34:49.030+0530 7f46017fa640 -1 client.10512 failed to remount (to trim kernel dentries): return code = 32
client.0.793575.log:137:2022-07-14T14:35:32.033+0530 7f779aff5640 -1 client.10990 failed to remount (to trim kernel dentries): return code = 32
client.0.795072.log:137:2022-07-14T14:36:15.771+0530 7f899e7fc640 -1 client.11499 failed to remount (to trim kernel dentries): return code = 32
client.1.774461.log:98:2022-07-14T14:24:05.570+0530 7f766bfff640 -1 client.4505 failed to remount (to trim kernel dentries): return code = 1

The return code 1 is:

1      incorrect invocation or permissions
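For reference, the two codes seen in the logs above (1 and 32) can be decoded with a small helper. The table follows the exit status list in the mount(8) man page, where the values are bits that may be ORed together; describe_mount_status is a hypothetical name used only for this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: maps a mount(8) exit status to its documented meanings.
// The status is a bit mask, so more than one description may be returned.
std::vector<std::string> describe_mount_status(int status) {
  struct Entry { int bit; const char* text; };
  static const Entry table[] = {
    {1,  "incorrect invocation or permissions"},
    {2,  "system error"},
    {4,  "internal mount bug"},
    {8,  "user interrupt"},
    {16, "problems writing or locking /etc/mtab"},
    {32, "mount failure"},
    {64, "some mount succeeded"},
  };

  std::vector<std::string> out;
  if (status == 0) {
    out.push_back("success");
    return out;
  }
  for (const Entry& e : table) {
    if (status & e.bit) {
      out.push_back(e.text);
    }
  }
  return out;
}
```

So the "return code = 32" lines in the log correspond to "mount failure", and "return code = 1" to "incorrect invocation or permissions".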

The return code 1 comes from util-linux (the mount command), so I supposed we may have passed invalid options. Going through the client/Client.cc and ceph_fuse.cc code more carefully, I found that a use-after-free of the mountpoint is possible:

static int remount_cb(void *handle)
{
  // used for trimming kernel dcache. when remounting a file system, linux kernel
  // trims all unused dentries in the file system
  char cmd[128+PATH_MAX];
  CephFuse::Handle *cfuse = (CephFuse::Handle *)handle;
  snprintf(cmd, sizeof(cmd), "LIBMOUNT_FSTAB=/dev/null mount -i -o remount %s",
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
                  cfuse->opts.mountpoint);
#else
                  cfuse->mountpoint);
#endif
  int r = system(cmd);
  if (r != 0 && r != -1) {
    r = WEXITSTATUS(r);
  }

  return r;
}

The use-after-free is possible because ceph_fuse.cc calls client->unmount() and then cfuse->finalize(), which frees the mountpoint, and only afterwards calls client->shutdown(), which stops the remount_finisher queue.

int main(int argc, const char **argv, const char *envp[]) {

...
  out_client_unmount:
    client->unmount();
    cfuse->finalize();
  out_shutdown:
    icp.stop();
    client->shutdown();
  out_init_failed:
...
}

The remount_finisher queue is the queue that runs remount_cb().

That means the remount_finisher queue may still be non-empty after the mountpoint has been freed.
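The ordering problem described above can be modelled with a small single-threaded sketch. It is not the change made in the pull request and does not use any Ceph types: FakeHandle, FakeFinisher and callback_saw_mountpoint are hypothetical stand-ins, and a null std::unique_ptr is used to model the freed mountpoint so that the unsafe order can be shown without real undefined behaviour.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the FUSE handle; a null pointer models "freed".
struct FakeHandle {
  std::unique_ptr<std::string> mountpoint;
  void finalize() { mountpoint.reset(); }
};

// Hypothetical single-threaded stand-in for the remount finisher queue.
class FakeFinisher {
public:
  // Queues a callback; rejected once the finisher has been stopped.
  bool queue(std::function<void()> fn) {
    if (stopped_) return false;
    pending_.push_back(std::move(fn));
    return true;
  }

  // Refuses new work, then runs everything that is still pending.
  void drain_and_stop() {
    stopped_ = true;
    while (!pending_.empty()) {
      std::function<void()> fn = std::move(pending_.front());
      pending_.pop_front();
      fn();
    }
  }

private:
  std::deque<std::function<void()>> pending_;
  bool stopped_ = false;
};

// Returns true if the queued remount callback still saw a live mountpoint.
bool callback_saw_mountpoint(bool stop_finisher_first) {
  FakeHandle handle;
  handle.mountpoint.reset(new std::string("/mnt/cephfs"));
  FakeFinisher finisher;
  bool saw = false;
  finisher.queue([&] { saw = (handle.mountpoint != nullptr); });

  if (stop_finisher_first) {
    finisher.drain_and_stop();  // safe: callbacks finish before the free
    handle.finalize();
  } else {
    handle.finalize();          // models the order seen in ceph_fuse.cc
    finisher.drain_and_stop();  // callback now runs against a freed mountpoint
  }
  return saw;
}
```

In this model, callback_saw_mountpoint(true) returns true and callback_saw_mountpoint(false) returns false. The real finisher runs on its own thread, so the actual window is a race rather than a deterministic failure.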

#8 Updated by Xiubo Li 4 months ago

  • Status changed from New to In Progress
  • Assignee set to Xiubo Li
  • Backport set to quincy,pacific
  • Component(FS) Client added

#9 Updated by Xiubo Li 4 months ago

  • Target version set to v18.0.0

#10 Updated by Xiubo Li 4 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47620

#11 Updated by Rishabh Dave 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Backport Bot 3 months ago

  • Copied to Backport #57394: quincy: crash: int Client::_do_remount(bool): abort added

#13 Updated by Backport Bot 3 months ago

  • Copied to Backport #57395: pacific: crash: int Client::_do_remount(bool): abort added

#14 Updated by Backport Bot 3 months ago

  • Tags set to backport_processed
