Bug #3187

closed

ceph fs: crash/hang on 32-bit architecture

Added by Alex Elder over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I was hitting this while attempting to write files on a 32-bit system
running inside a VM, trying to reproduce bug 3112.

Separately, Bryan Wright <> reported seeing this on
the mailing list, and I didn't see any other reports, so I thought it
was time to file a bug.

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/9308

I'll add a stack trace taken from gdb on my VM shortly.

Actions #1

Updated by Alex Elder over 11 years ago

Below is the stack dump taken from gdb that I've been seeing
on the 32-bit system running inside a VM. Note this line:

#26 kunmap_high (page=0xf704b700)
at /home/elder/ceph/ceph-client/mm/highmem.c:290

And that corresponds to this BUG call in kunmap_high():

        /*
         * A count must never go down to zero
         * without a TLB flush!
         */
        need_wakeup = 0;
        switch (--pkmap_count[nr]) {
        case 0:
                BUG();
        case 1:
                /*
                 * Avoid an unnecessary wake_up() function call.

OK, here's the stack trace.

^C
Program received signal SIGINT, Interrupt.
0xc12b2e6f in delay_tsc (__loops=2837462058)
    at /home/elder/ceph/ceph-client/arch/x86/lib/delay.c:59
59        rdtscl(bclock);
(gdb) bt
#0  0xc12b2e6f in delay_tsc (__loops=2837462058)
    at /home/elder/ceph/ceph-client/arch/x86/lib/delay.c:59
#1  0xc12b2dce in __delay (loops=<optimized out>)
    at /home/elder/ceph/ceph-client/arch/x86/lib/delay.c:112
#2  0xc12b9bd8 in __spin_lock_debug (lock=0xf6fd8960)
    at /home/elder/ceph/ceph-client/lib/spinlock_debug.c:116
#3  do_raw_spin_lock (lock=0xf6fd8960)
    at /home/elder/ceph/ceph-client/lib/spinlock_debug.c:133
#4  0xc1571bfa in __raw_spin_lock_irq (lock=0xf6fd8960)
    at /home/elder/ceph/ceph-client/include/linux/spinlock_api_smp.h:129
#5  _raw_spin_lock_irq (lock=0xf6fd8960)
    at /home/elder/ceph/ceph-client/kernel/spinlock.c:153
#6  0xc156ff19 in __schedule ()
    at /home/elder/ceph/ceph-client/kernel/sched/core.c:3390
#7  0xc1570753 in schedule ()
    at /home/elder/ceph/ceph-client/kernel/sched/core.c:3467
#8  0xc103debd in do_exit (code=9)
    at /home/elder/ceph/ceph-client/kernel/exit.c:950
#9  0xc1572bd5 in oops_end (flags=70, regs=0xf1629c28, signr=9)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/dumpstack.c:249
#10 0xc155d92b in no_context (regs=0xf1629c28, error_code=0, 
    address=4294967292, signal=11, si_code=196609)
    at /home/elder/ceph/ceph-client/arch/x86/mm/fault.c:689
#11 0xc155da6b in __bad_area_nosemaphore (regs=0xf1629c28, error_code=0, 
    address=<optimized out>, si_code=196609)
    at /home/elder/ceph/ceph-client/arch/x86/mm/fault.c:767
#12 0xc155da8a in bad_area_nosemaphore (regs=<optimized out>, 
    error_code=<optimized out>, address=<optimized out>)
    at /home/elder/ceph/ceph-client/arch/x86/mm/fault.c:774
#13 0xc1575265 in do_page_fault (regs=0xf1629c28, error_code=0)
    at /home/elder/ceph/ceph-client/arch/x86/mm/fault.c:1121
#14 0xc1574d5d in do_async_page_fault (regs=0xf1629c28, error_code=0)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/kvm.c:246
#15 <signal handler called>
#16 kthread_data (task=<optimized out>)
    at /home/elder/ceph/ceph-client/kernel/kthread.c:96
#17 0xc10549cb in wq_worker_sleeping (task=<optimized out>, cpu=0)
    at /home/elder/ceph/ceph-client/kernel/workqueue.c:729
#18 0xc1570327 in __schedule ()
    at /home/elder/ceph/ceph-client/kernel/sched/core.c:3408
#19 0xc1570753 in schedule ()
    at /home/elder/ceph/ceph-client/kernel/sched/core.c:3467
#20 0xc103dd68 in do_exit (code=<optimized out>)
    at /home/elder/ceph/ceph-client/kernel/exit.c:1074
#21 0xc1572bd5 in oops_end (flags=582, regs=0xf1629e20, signr=11)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/dumpstack.c:249
#22 0xc1005484 in die (str=0xc16e824b "invalid opcode", regs=0xf1629e20, err=0)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/dumpstack.c:310
#23 0xc1572636 in do_trap (trapnr=6, signr=4, str=0xc16e824b "invalid opcode", 
    regs=0xf1629e20, error_code=0, info=0xf1629d90)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/traps.c:167
#24 0xc1002f9b in do_invalid_op (regs=0xf1629e20, error_code=0)
    at /home/elder/ceph/ceph-client/arch/x86/kernel/traps.c:209
#25 <signal handler called>
#26 kunmap_high (page=0xf704b700)
    at /home/elder/ceph/ceph-client/mm/highmem.c:290
#27 0xc10325ed in kunmap (page=<optimized out>)
    at /home/elder/ceph/ceph-client/arch/x86/mm/highmem_32.c:20
#28 0xf82cd9d3 in ?? ()
#29 0xf82cf5c8 in ?? ()
#30 0xf82d1224 in ?? ()
#31 0xc1052916 in process_one_work (worker=<optimized out>, work=0xe8e4d01c)
    at /home/elder/ceph/ceph-client/kernel/workqueue.c:2004
#32 0xc105432e in worker_thread (__worker=0xf69fa300)
    at /home/elder/ceph/ceph-client/kernel/workqueue.c:2125
#33 0xc1058fed in kthread (_create=0xf6887eb8)
    at /home/elder/ceph/ceph-client/kernel/kthread.c:121
#34 0xc15792a2 in ?? ()
    at /home/elder/ceph/ceph-client/arch/x86/kernel/entry_32.S:1001
#35 0x00000000 in ?? ()
(gdb) 

Actions #2

Updated by Alex Elder over 11 years ago

After some digging, I'm pretty sure the workqueue belongs to the
ceph messenger: I have pretty good confidence that the work
function being called is con_work().

And a quick look through the messenger code indicates that
maybe the problem lies in write_partial_msg_pages(), where
we see this:

    if (do_datacrc && !con->out_msg_pos.did_page_crc) {
        ...
        kaddr = kmap(page);
    }
    ...

followed by
    if (do_datacrc)
        kunmap(page);

It seems to me that if the CRC had already been computed for this
page, we would skip the kmap() call; but because that condition
isn't checked later, we would still get the kunmap() call.

That could be it...

Actions #3

Updated by Alex Elder over 11 years ago

OK, mounting with the "nocrc" mount option makes the hang while
writing go away. The fix is pretty easy, and I'll start testing
it ASAP.

Actions #4

Updated by Alex Elder over 11 years ago

  • Status changed from New to Resolved

This has been resolved. The problem was exactly the thing I pointed out.

The fix is upstream now:
5ce765a5 libceph: only kunmap kmapped pages
