Bug #21275

closed

test hang after mds evicts kclient

Added by Zheng Yan over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/zyan-2017-09-07_03:18:23-kcephfs-master-testing-basic-mira/

http://qa-proxy.ceph.com/teuthology/zyan-2017-09-07_03:18:23-kcephfs-master-testing-basic-mira/1603494/teuthology.log

A python process is hung at:

[ 4862.107710] RIP: 0033:0x7f8d18b55e8c
[ 4862.107714] RSP: 002b:00007ffc96ef6a38 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[ 4862.107721] RAX: ffffffffffffffda RBX: 00007f8d18f18c30 RCX: 00007f8d18b55e8c
[ 4862.107724] RDX: 0000000000000000 RSI: 00007ffc96ef6a60 RDI: 0000000000001c6a
[ 4862.107728] RBP: 000000000091bc60 R08: 00000000005c2242 R09: 0000000000000000
[ 4862.107732] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8d18f0dd00
[ 4862.107736] R13: 000000000000006a R14: 00007f8d18f0dd00 R15: 00007f8d18f0bd32
[ 4862.107759] python          D    0  7274   7272 0x00000006
[ 4862.107766] Call Trace:
[ 4862.107778]  __schedule+0x41d/0xb60
[ 4862.107795]  schedule+0x3d/0x90
[ 4862.107801]  schedule_timeout+0x268/0x570
[ 4862.107811]  ? wait_for_completion_killable_timeout+0x110/0x1a0
[ 4862.107821]  ? trace_hardirqs_on_caller+0x11f/0x190
[ 4862.107831]  wait_for_completion_killable_timeout+0x118/0x1a0
[ 4862.107836]  ? wait_for_completion_killable_timeout+0x118/0x1a0
[ 4862.107844]  ? wake_up_q+0x70/0x70
[ 4862.107876]  ceph_mdsc_do_request+0x1da/0x2d0 [ceph]
[ 4862.107899]  ceph_lock_message+0x12f/0x2c0 [ceph]
[ 4862.107925]  ceph_lock+0x91/0x1d0 [ceph]
[ 4862.107937]  vfs_lock_file+0x30/0x50
[ 4862.107943]  locks_remove_posix+0xb8/0x210
[ 4862.107964]  ? rcu_read_lock_sched_held+0x89/0xa0
[ 4862.107970]  ? kmem_cache_free+0x2c4/0x2f0
[ 4862.107990]  filp_close+0x4e/0x70
[ 4862.107999]  put_files_struct+0x75/0xe0
[ 4862.108010]  exit_files+0x47/0x50
[ 4862.108019]  do_exit+0x2fd/0xc80
[ 4862.108027]  ? get_signal+0x317/0x8f0
[ 4862.108038]  do_group_exit+0x50/0xd0
[ 4862.108046]  get_signal+0x254/0x8f0
[ 4862.108066]  do_signal+0x28/0x720
[ 4862.108083]  ? _copy_to_user+0x5b/0x70
[ 4862.108092]  ? poll_select_copy_remaining+0xd9/0x120
[ 4862.108109]  exit_to_usermode_loop+0x80/0xc0
[ 4862.108119]  syscall_return_slowpath+0xc8/0xd0
[ 4862.108127]  entry_SYSCALL_64_fastpath+0xc0/0xc2


Related issues (2 total: 0 open, 2 closed)

Has duplicate: CephFS - Bug #21468: kcephfs: hang during umount (Duplicate) - Zheng Yan - 09/19/2017
Copied to: CephFS - Backport #21473: luminous: test hang after mds evicts kclient (Resolved) - Nathan Cutler
Actions #1

Updated by Zheng Yan over 6 years ago

static struct ceph_msg *create_session_open_msg(struct ceph_mds_client *mdsc, u64 seq)
{
        struct ceph_msg *msg;
        struct ceph_mds_session_head *h;
        int i = -1;
        int metadata_bytes = 0;
        int metadata_key_count = 0;
        struct ceph_options *opt = mdsc->fsc->client->options;
        struct ceph_mount_options *fsopt = mdsc->fsc->mount_options;
        void *p;

        const char* metadata[][2] = {
                {"hostname", utsname()->nodename},
                {"kernel_version", utsname()->release},
                {"entity_id", opt->name ? : ""},
                {"root", fsopt->server_path ? : "/"},
                {NULL, NULL}
        };

The panic is caused by utsname() returning NULL when the process exits.
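
A minimal sketch of the NULL guard this implies, assuming utsname() really can come back NULL while the task is exiting; this is only an illustration, not the patch that was posted to ceph-devel, and pick_session_metadata() is a made-up helper name:

#include <linux/utsname.h>

/* illustration only: choose safe strings for the session-open metadata
 * even when utsname() is unavailable because the task is exiting */
static void pick_session_metadata(const char **nodename, const char **release)
{
        struct new_utsname *uts = utsname();    /* may be NULL at process exit */

        *nodename = uts ? uts->nodename : "(unknown)";
        *release  = uts ? uts->release  : "(unknown)";
}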

Actions #2

Updated by Jeff Layton over 6 years ago

Got it. I think we've hit problems like that in NFS, and what we had to do was save copies of the fields from utsname() that we'd need later (see rpc_clnt_set_nodename()). In this case, I think you want copies of nodename and release; maybe put them in the mdsc?
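
A rough sketch of that caching approach, in the spirit of rpc_clnt_set_nodename(): copy the utsname() fields once while the mount is being set up, and build session-open messages from the cached copies. The struct and field names below are placeholders, not the actual ceph_mds_client layout:

#include <linux/utsname.h>
#include <linux/string.h>

/* placeholder for fields that could live in struct ceph_mds_client */
struct mdsc_uts_cache {
        char nodename[__NEW_UTS_LEN + 1];
        char release[__NEW_UTS_LEN + 1];
};

/* call while the mount is being set up, when utsname() is still valid,
 * so later session-open messages never touch the exiting task's namespace */
static void mdsc_cache_utsname(struct mdsc_uts_cache *cache)
{
        strscpy(cache->nodename, utsname()->nodename, sizeof(cache->nodename));
        strscpy(cache->release, utsname()->release, sizeof(cache->release));
}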

That said... once the MDS has evicted the client, we should just tear down any state the client holds (including locks). There really should be no reason to issue calls to the MDS to tear down state that we no longer hold, right?

In fact, note too that you have a signal pending here, so a call to wait_for_completion_killable_timeout will most likely return immediately.
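
For reference, a sketch of how the return value of wait_for_completion_killable_timeout() is usually interpreted (general completion-API behaviour, not the ceph request path verbatim): with a fatal signal already pending it returns -ERESTARTSYS rather than sleeping for the whole timeout.

#include <linux/completion.h>
#include <linux/errno.h>

/* "done" and "timeout" stand in for the request's completion and the
 * configured request timeout */
static int wait_for_reply(struct completion *done, unsigned long timeout)
{
        long t = wait_for_completion_killable_timeout(done, timeout);

        if (t == 0)
                return -ETIMEDOUT;      /* timer expired, no reply arrived */
        if (t < 0)
                return t;               /* -ERESTARTSYS: fatal signal pending */
        return 0;                       /* completed before the timeout */
}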

Actions #3

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Fix Under Review

Patch is on ceph-devel.

Actions #5

Updated by Patrick Donnelly over 6 years ago

  • Status changed from Fix Under Review to Resolved
Actions #6

Updated by Patrick Donnelly over 6 years ago

  • Has duplicate Bug #21468: kcephfs: hang during umount added
Actions #7

Updated by Patrick Donnelly over 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous
Actions #8

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21473: luminous: test hang after mds evicts kclient added
Actions #9

Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved