Bug #21275

closed

test hang after mds evicts kclient

Added by Zheng Yan over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/zyan-2017-09-07_03:18:23-kcephfs-master-testing-basic-mira/

http://qa-proxy.ceph.com/teuthology/zyan-2017-09-07_03:18:23-kcephfs-master-testing-basic-mira/1603494/teuthology.log

A python process is hung at:

[ 4862.107710] RIP: 0033:0x7f8d18b55e8c
[ 4862.107714] RSP: 002b:00007ffc96ef6a38 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[ 4862.107721] RAX: ffffffffffffffda RBX: 00007f8d18f18c30 RCX: 00007f8d18b55e8c
[ 4862.107724] RDX: 0000000000000000 RSI: 00007ffc96ef6a60 RDI: 0000000000001c6a
[ 4862.107728] RBP: 000000000091bc60 R08: 00000000005c2242 R09: 0000000000000000
[ 4862.107732] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8d18f0dd00
[ 4862.107736] R13: 000000000000006a R14: 00007f8d18f0dd00 R15: 00007f8d18f0bd32
[ 4862.107759] python          D    0  7274   7272 0x00000006
[ 4862.107766] Call Trace:
[ 4862.107778]  __schedule+0x41d/0xb60
[ 4862.107795]  schedule+0x3d/0x90
[ 4862.107801]  schedule_timeout+0x268/0x570
[ 4862.107811]  ? wait_for_completion_killable_timeout+0x110/0x1a0
[ 4862.107821]  ? trace_hardirqs_on_caller+0x11f/0x190
[ 4862.107831]  wait_for_completion_killable_timeout+0x118/0x1a0
[ 4862.107836]  ? wait_for_completion_killable_timeout+0x118/0x1a0
[ 4862.107844]  ? wake_up_q+0x70/0x70
[ 4862.107876]  ceph_mdsc_do_request+0x1da/0x2d0 [ceph]
[ 4862.107899]  ceph_lock_message+0x12f/0x2c0 [ceph]
[ 4862.107925]  ceph_lock+0x91/0x1d0 [ceph]
[ 4862.107937]  vfs_lock_file+0x30/0x50
[ 4862.107943]  locks_remove_posix+0xb8/0x210
[ 4862.107964]  ? rcu_read_lock_sched_held+0x89/0xa0
[ 4862.107970]  ? kmem_cache_free+0x2c4/0x2f0
[ 4862.107990]  filp_close+0x4e/0x70
[ 4862.107999]  put_files_struct+0x75/0xe0
[ 4862.108010]  exit_files+0x47/0x50
[ 4862.108019]  do_exit+0x2fd/0xc80
[ 4862.108027]  ? get_signal+0x317/0x8f0
[ 4862.108038]  do_group_exit+0x50/0xd0
[ 4862.108046]  get_signal+0x254/0x8f0
[ 4862.108066]  do_signal+0x28/0x720
[ 4862.108083]  ? _copy_to_user+0x5b/0x70
[ 4862.108092]  ? poll_select_copy_remaining+0xd9/0x120
[ 4862.108109]  exit_to_usermode_loop+0x80/0xc0
[ 4862.108119]  syscall_return_slowpath+0xc8/0xd0
[ 4862.108127]  entry_SYSCALL_64_fastpath+0xc0/0xc2


Related issues (2 total: 0 open, 2 closed)

Has duplicate: CephFS - Bug #21468: kcephfs: hang during umount (Duplicate) - Zheng Yan - 09/19/2017
Copied to: CephFS - Backport #21473: luminous: test hang after mds evicts kclient (Resolved) - Nathan Cutler
Actions #1

Updated by Zheng Yan over 6 years ago

static struct ceph_msg *create_session_open_msg(struct ceph_mds_client *mdsc, u64 seq)
{
        struct ceph_msg *msg;
        struct ceph_mds_session_head *h;
        int i = -1;
        int metadata_bytes = 0;
        int metadata_key_count = 0;
        struct ceph_options *opt = mdsc->fsc->client->options;
        struct ceph_mount_options *fsopt = mdsc->fsc->mount_options;
        void *p;

        const char* metadata[][2] = {
                {"hostname", utsname()->nodename},
                {"kernel_version", utsname()->release},
                {"entity_id", opt->name ? : ""},
                {"root", fsopt->server_path ? : "/"},
                {NULL, NULL}
        };

The panic is caused by utsname() returning NULL when the process exits.
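
A minimal sketch of the NULL guard this implies, assuming utsname() really can come back NULL while the task is exiting; this is only an illustration, not the patch that was posted to ceph-devel, and pick_session_metadata() is a made-up helper name:

#include <linux/utsname.h>

/* illustration only: choose safe strings for the session-open metadata
 * even when utsname() is unavailable because the task is exiting */
static void pick_session_metadata(const char **nodename, const char **release)
{
        struct new_utsname *uts = utsname();    /* may be NULL at process exit */

        *nodename = uts ? uts->nodename : "(unknown)";
        *release  = uts ? uts->release  : "(unknown)";
}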

Actions #2

Updated by Jeff Layton over 6 years ago

Got it. I think we've hit problems like that in NFS, and what we had to do was save copies of the fields from utsname() that we'd need later (see rpc_clnt_set_nodename()). In this case, I think you want copies of nodename and release; maybe put them in the mdsc?
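
A rough sketch of that caching approach, in the spirit of rpc_clnt_set_nodename(): copy the utsname() fields once while the mount is being set up, and build session-open messages from the cached copies. The struct and field names below are placeholders, not the actual ceph_mds_client layout:

#include <linux/utsname.h>
#include <linux/string.h>

/* placeholder for fields that could live in struct ceph_mds_client */
struct mdsc_uts_cache {
        char nodename[__NEW_UTS_LEN + 1];
        char release[__NEW_UTS_LEN + 1];
};

/* call while the mount is being set up, when utsname() is still valid,
 * so later session-open messages never touch the exiting task's namespace */
static void mdsc_cache_utsname(struct mdsc_uts_cache *cache)
{
        strscpy(cache->nodename, utsname()->nodename, sizeof(cache->nodename));
        strscpy(cache->release, utsname()->release, sizeof(cache->release));
}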

That said... once the MDS has evicted the client, we should just tear down any state the client holds (including locks). There really should be no reason to issue calls to the MDS to tear down state that we no longer hold, right?

In fact, note too that you have a signal pending here, so a call to wait_for_completion_killable_timeout will most likely return immediately.
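
For reference, a sketch of how the return value of wait_for_completion_killable_timeout() is usually interpreted (general completion-API behaviour, not the ceph request path verbatim): with a fatal signal already pending it returns -ERESTARTSYS rather than sleeping for the whole timeout.

#include <linux/completion.h>
#include <linux/errno.h>

/* "done" and "timeout" stand in for the request's completion and the
 * configured request timeout */
static int wait_for_reply(struct completion *done, unsigned long timeout)
{
        long t = wait_for_completion_killable_timeout(done, timeout);

        if (t == 0)
                return -ETIMEDOUT;      /* timer expired, no reply arrived */
        if (t < 0)
                return t;               /* -ERESTARTSYS: fatal signal pending */
        return 0;                       /* completed before the timeout */
}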

Actions #3

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Fix Under Review

Patch is on ceph-devel.

Actions #5

Updated by Patrick Donnelly over 6 years ago

  • Status changed from Fix Under Review to Resolved
Actions #6

Updated by Patrick Donnelly over 6 years ago

  • Has duplicate Bug #21468: kcephfs: hang during umount added
Actions #7

Updated by Patrick Donnelly over 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous
Actions #8

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21473: luminous: test hang after mds evicts kclient added
Actions #9

Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved