Bug #4706: kclient: Oops when two clients concurrently write a file - CephFS - Ceph

Actions

Copy link

Bug #4706

closed

kclient: Oops when two clients concurrently write a file

Added by Zheng Yan about 11 years ago. Updated almost 8 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Alex Elder

Category:

Target version:

v0.62a

% Done:

Source:

Development

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

kceph

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

[  229.868015] Modules linked in: netconsole ceph libceph libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle bridge lockd sunrpc bnep bluetooth stp llc rfkill be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_net pcspkr microcode virtio_balloon uinput virtio_blk
[  229.868015] CPU 1 
[  229.868015] Pid: 50, comm: kworker/1:2 Tainted: G      D      3.8.0+ #1 Bochs Bochs
[  229.868015] RIP: 0010:[<ffffffff81084540>]  [<ffffffff81084540>] kthread_data+0x10/0x20
[  229.868015] RSP: 0018:ffff88003711b528  EFLAGS: 00010092
[  229.868015] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000000e
[  229.868015] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880037120000
[  229.868015] RBP: ffff88003711b528 R08: ffff880037120070 R09: 0000000000000000
[  229.868015] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88003fd14800
[  229.868015] R13: 0000000000000001 R14: ffff88003711fff0 R15: ffff880037120000
[  229.868015] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[  229.868015] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  229.868015] CR2: ffffffffffffffa8 CR3: 000000003ce33000 CR4: 00000000000006e0
[  229.868015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  229.868015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  229.868015] Process kworker/1:2 (pid: 50, threadinfo ffff88003711a000, task ffff880037120000)
[  229.868015] Stack:
[  229.868015]  ffff88003711b548 ffffffff8107f375 ffff88003711b548 ffff8800371203d0
[  229.868015]  ffff88003711b5b8 ffffffff81677802 ffff880037120000 ffff88003711bfd8
[  229.868015]  ffff88003711bfd8 ffff88003711bfd8 ffff88003711b340 ffff880037120000
[  229.868015] Call Trace:
[  229.868015]  [<ffffffff8107f375>] wq_worker_sleeping+0x15/0xc0
[  229.868015]  [<ffffffff81677802>] __schedule+0x5e2/0x800
[  229.868015]  [<ffffffff81677d49>] schedule+0x29/0x70
[  229.868015]  [<ffffffff810649f2>] do_exit+0x6a2/0x9f0
[  229.868015]  [<ffffffff8167a8ed>] oops_end+0x9d/0xe0
[  229.868015]  [<ffffffff8166d4e6>] no_context+0x253/0x27e
[  229.868015]  [<ffffffff81312962>] ? put_dec+0x72/0x90
[  229.868015]  [<ffffffff8166d6dc>] __bad_area_nosemaphore+0x1cb/0x1ea
[  229.868015]  [<ffffffff8166d70e>] bad_area_nosemaphore+0x13/0x15
[  229.868015]  [<ffffffff8167d70e>] __do_page_fault+0x36e/0x500
[  229.868015]  [<ffffffff81314c94>] ? vsnprintf+0x354/0x640
[  229.868015]  [<ffffffff81314fc0>] ? sprintf+0x40/0x50
[  229.868015]  [<ffffffff8167d8ae>] do_page_fault+0xe/0x10
[  229.868015]  [<ffffffff8167d025>] do_async_page_fault+0x35/0x90
[  229.868015]  [<ffffffff81679c78>] async_page_fault+0x28/0x30
[  229.868015]  [<ffffffff810c1bd1>] ? __lock_acquire+0x61/0x1dc0
[  229.868015]  [<ffffffff8166dc0b>] ? printk+0x61/0x63
[  229.868015]  [<ffffffff810c3ef1>] lock_acquire+0xa1/0x120
[  229.868015]  [<ffffffffa03260ff>] ? sync_write_commit+0x4f/0xb0 [ceph]
[  229.868015]  [<ffffffff81678c81>] _raw_spin_lock+0x31/0x40
[  229.868015]  [<ffffffffa03260ff>] ? sync_write_commit+0x4f/0xb0 [ceph]
[  229.868015]  [<ffffffffa03260ff>] sync_write_commit+0x4f/0xb0 [ceph]
[  229.868015]  [<ffffffffa02e0a81>] complete_request+0x21/0x40 [libceph]
[  229.868015]  [<ffffffffa02e5364>] dispatch+0x6b4/0x920 [libceph]
[  229.868015]  [<ffffffff81676b1b>] ? __mutex_unlock_slowpath+0xdb/0x170
[  229.868015]  [<ffffffffa02dbbb8>] con_work+0x1428/0x2e00 [libceph]
[  229.868015]  [<ffffffff810c1f8a>] ? __lock_acquire+0x41a/0x1dc0
[  229.868015]  [<ffffffff8107c25b>] ? process_one_work+0x13b/0x550
[  229.868015]  [<ffffffff8107c2c1>] process_one_work+0x1a1/0x550
[  229.868015]  [<ffffffff8107c25b>] ? process_one_work+0x13b/0x550
[  229.868015]  [<ffffffffa02da790>] ? ceph_con_close+0xd0/0xd0 [libceph]
[  229.868015]  [<ffffffff8107ea9e>] worker_thread+0x15e/0x440
[  229.868015]  [<ffffffff8107e940>] ? busy_worker_rebind_fn+0x100/0x100
[  229.868015]  [<ffffffff810843fa>] kthread+0xea/0xf0
[  229.868015]  [<ffffffff81084310>] ? flush_kthread_work+0x1b0/0x1b0
[  229.868015]  [<ffffffff81681f6c>] ret_from_fork+0x7c/0xb0
[  229.868015]  [<ffffffff81084310>] ? flush_kthread_work+0x1b0/0x1b0

Got above Oops when doing concurrent write with current "testing" branch.
When two clients write data to a file at the same time, they do sync write
even the file is not opened in sync mode. I think the issue is new, it can
be reproduced by running following command on two kclients.

dd if=/dev/zero bs=4k conv=notrunc of=test1

Files

patch (3.12 KB) patch

Zheng Yan, 04/11/2013 09:23 AM

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Greg Farnum about 11 years ago

Them doing a sync write is probably correct as their concurrency is being managed by the MDS now, and they aren't going to be allowed to buffer anything.
Although the actual trace is quite different, this is setting off comparisons to #4679 in my brain; did you check if it's a cap revoke issue of some kind?

Actions

Copy link

Updated by Zheng Yan about 11 years ago

It is a new issue in the sync write path, nothing to do with cap revoke. Alex has made quite a lot of changes in that area. The Oops can be reproduced in seconds with current test branch.

Actions

Copy link

Updated by Zheng Yan about 11 years ago

    ret = ceph_osdc_start_request(&fsc->client->osdc, req, false);
    if (!ret) {
        if (req->r_safe_callback) {
            /*
             * Add to inode unsafe list only after we
             * start_request so that a tid has been assigned.
             */
            spin_lock(&ci->i_unsafe_lock);
            list_add_tail(&req->r_unsafe_item,
                      &ci->i_unsafe_writes);
            spin_unlock(&ci->i_unsafe_lock);
            ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR);
        }

        ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
        if (ret < 0 && req->r_safe_callback) {
            spin_lock(&ci->i_unsafe_lock);
            list_del_init(&req->r_unsafe_item);
            spin_unlock(&ci->i_unsafe_lock);
            ceph_put_cap_refs(ci, CEPH_CAP_FILE_WR);
        }
    }

Found a potential cause. the request may complete before adding it to the unsafe list.

Actions

Copy link

Updated by Alex Elder about 11 years ago

the request may complete before adding it to the unsafe list.

That looks like a reasonable explanation to me. And if so it's
an old bug but maybe my recent changes made it trigger.

If I could reproduce it I could narrow down exactly where it's
happening using the debugger to verify your suspicion.

I'll see what I can do.

Your description of how to reproduce it is fine (two clients
writing concurrently to the same file) but can you offer some
more specifics?

Actions

Copy link

Updated by Alex Elder about 11 years ago

Well that's interesting.

I haven't been working with the ceph file system much so
I'm not sure what to expect. But I have both my clients
hung. (I have seen something similar with the ceph "next"
branch recently so I guess I better go back to "bobtail".)

Can you confirm what version of ceph you're working with?
I have been assuming the "testing" branch referred to the
ceph-client code.

In any case, I'll document what I did here.

I created a file in a file system mounted by 2 ceph fs
clients. I wrote to it from one host and was able to read
the result back.

I then wrote to it from the second host, and that
was successful. Reading it back from that same (second)
host confirmed it. However now the "cat" process that
was reading it is hung, stuck here:
[<ffffffffa046801d>] ceph_mdsc_do_request+0x12d/0x2a0 [ceph]
[<ffffffffa044f591>] ceph_do_getattr+0xf1/0x1c0 [ceph]
[<ffffffffa0452b43>] ceph_aio_read+0x413/0x890 [ceph]
[<ffffffff81183923>] do_sync_read+0xa3/0xe0
[<ffffffff81184070>] vfs_read+0xb0/0x180
[<ffffffff81184192>] sys_read+0x52/0xa0
[<ffffffff81666e59>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Meanwhile, back on the first host, I attempted to read
the shared file with a cat command, and it to is hung,
stuck with exactly the same stack trace.

No obvious indication of a problem on the mds or
osd machines.

Actions

Copy link

Updated by Alex Elder about 11 years ago

Well, I unfortunately got the same problem using
the "bobtail" branch.

Specifically what I'm doing:

machine 1
---------
hostname > /mnt/ceph/foo
cat /mnt/ceph/foo
(shows hostname)
                         machine 2
                         ---------
                         cat /mnt/ceph/foo
                         (shows machine 1 hostname)
                         hostname > /mnt/ceph/foo
                         cat /mnt/ceph/foo
                         (shows machine 2 hostname,
                          then hangs)
cat /mnt/ceph/foo
(no output, hangs)

I'll try once more, creating foo in a subdirectory
rather than in the root of the mounted file system,
in case it makes any difference at all...

Actions

Copy link

Updated by Zheng Yan about 11 years ago

File patch patch added

Yes, test branch of ceph-client. The hint to trigger the Oops is multiple clients write date to a file at the same time. Your test case is writing data to a file one client by one client. I tried your test case locally, it didn't cause hang.
(my MDS has been modified, patch is attached)

Actions

Copy link

Updated by Alex Elder about 11 years ago

OK, well I believe I have reproduced the problem.
I did this on two nodes simultaneously:

dd if=/dev/zero of=/mnt/ceph/shared_file bs=4096

Nothing seemed to happen at first, so I interrupted one
of them. The other one I couldn't interrupt, and it
was at that point I noticed one of the machines had
crashed. And it involved sync_write_commit(), _raw_spin_lock(),
and lock_acquire() at approximate the same offsets.

I'm going to try to narrow down precisely where it is in
the code to see if your theory is correct.

Actions

Copy link

Updated by Alex Elder about 11 years ago

This crash looks a little bit familiar to me, and I think
I created a bug for it, but at the moment I can't find it.
I had seen this or something like it with rbd.

It may not be the same as what you are seeing Zheng, because
what I'm talking about has to do with interrupting while
waiting for an osd request to complete. I did that; I'm
not aware that you did.

The problem--from memory--is that if an interrupt occurs
during the wait_for_completion_interruptible() call
in ceph_osdc_wait_request(), a block of code is entered
that drops a reference before being done with a data
structure. (I'll look a little more closely to get more
concrete with this, but I wanted to just document what
I think I see right at the moment.)

The function is small, so I'm just going to reproduce
it here for reference.

int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
                           struct ceph_osd_request *req)
{
        int rc;

        rc = wait_for_completion_interruptible(&req->r_completion);
        if (rc < 0) {
                mutex_lock(&osdc->request_mutex);
                __cancel_request(req);
                __unregister_request(osdc, req);
                mutex_unlock(&osdc->request_mutex);
                complete_request(req);
                dout("wait_request tid %llu canceled/timed out\n", req->r_tid);
                return rc;
        }

        dout("wait_request tid %llu result %d\n", req->r_tid, req->r_result);
        return req->r_result;
}

Actions

Copy link

#10

Updated by Zheng Yan about 11 years ago

Status changed from New to 7

The Oops is caused by uninitialized req->r_inode

Actions

Copy link

#11

Updated by Alex Elder about 11 years ago

Status changed from 7 to New

Found a potential cause. the request may complete before adding it
to the unsafe list.

I think that not being on the safe list is probably fine (at
the point where the crash occurred), because sync_write_commit()
simply deletes the request from whatever list it's on, and
doing that on an initialized list entry not on a list is harmless.
However if the capability references get dropped before they
are taken (i.e., ceph_put_cap_refs() gets called before the
matching ceph_get_cap_refs()) that would cause a problem, and
maybe this problem.

To address that (for starters) it should be fine to call
ceph_get_cap_refs() before starting the osd request. We
still wouldn't want to put it on the unsafe item list if
it had already completed though.

Still looking.

Actions

Copy link

#12

Updated by Alex Elder about 11 years ago

Zheng I think I have a fix. I'm going to test it first,
but then I'd like to supply it to you to see if it resolves
the problem you're seeing. Is that OK?

Actions

Copy link

#13

Updated by Zheng Yan about 11 years ago

I also proposed a fix: [PATCH 1/4] ceph: add osd request to inode unsafe list in advance

Actions

Copy link

#14

Updated by Alex Elder about 11 years ago

FYI I just reproduced the problem without interrupt
and it matches what I saw before. (So I don't believe
the interrupt was a factor when I reproduced it before.)

Actions

Copy link

#15