Bug #3572

Updated by Sage Weil over 11 years ago


I dropped a ceph kernel client node into the debugger and left it there a while. After resuming execution, the machine loops heavily in the kernel. If I set breakpoints on ceph_mdsmap_get_random_mds() and __do_request(), they get hit repeatedly. It appears that a long chain of requests is being processed in __wake_requests().
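
A minimal userspace sketch of that mechanism (names are modeled on fs/ceph/mds_client.c, but this is a simplified model, not the kernel code): when no cached hint points at a usable MDS, every request drained by __wake_requests() falls through __choose_mds() to a random pick, so a long backlog means one random draw per queued request.

    /* Simplified userspace model of the wakeup path described above.
     * "entropy_draws" stands in for calls to get_random_bytes(). */
    #include <stdio.h>
    #include <stdlib.h>

    struct request {
        int tid;
        struct request *next;
    };

    static int entropy_draws;

    static int choose_mds_random(int num_mds)
    {
        entropy_draws++;          /* each pick costs one random draw */
        return rand() % num_mds;
    }

    static void do_request(struct request *req, int num_mds)
    {
        /* no usable hint: fall back to a random MDS, as __choose_mds()
         * does via ceph_mdsmap_get_random_mds() */
        printf("tid %d -> mds%d\n", req->tid, choose_mds_random(num_mds));
    }

    static void wake_requests(struct request *head, int num_mds)
    {
        /* drain the whole backlog in one pass, like __wake_requests() */
        while (head) {
            struct request *next = head->next;
            do_request(head, num_mds);
            free(head);
            head = next;
        }
    }

    int main(void)
    {
        struct request *head = NULL;

        /* queue a backlog, as builds up while the client sits in the
         * debugger */
        for (int tid = 100; tid > 0; tid--) {
            struct request *r = malloc(sizeof(*r));
            r->tid = tid;
            r->next = head;
            head = r;
        }
        wake_requests(head, 3);
        printf("entropy draws: %d\n", entropy_draws);
        return 0;
    }

With 100 queued requests the sketch reports 100 random draws in one burst, which is the kind of storm the trace below lands in the middle of.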

Another issue is that getting entropy can be difficult in VMs, so calling get_random_bytes(), the exported kernel interface for getting random numbers, drains the entropy pool. This means it will block waiting for more entropy. I wish we could get at the non-blocking variant, although it would be best to only call __choose_mds() once after a possible loss.
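
As a rough userspace analogue of that blocking/non-blocking distinction (an analogy only, not the in-kernel fix, which would presumably mean drawing from the kernel's non-cryptographic PRNG, e.g. random32()/prandom_u32(), instead of get_random_bytes()): /dev/random is fed from the same blocking entropy pool, while /dev/urandom is the non-blocking variant. The probe below opens each with O_NONBLOCK, so an exhausted pool shows up as EAGAIN instead of a hang.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void probe(const char *path)
    {
        unsigned char buf[64];
        ssize_t n;
        int fd = open(path, O_RDONLY | O_NONBLOCK);

        if (fd < 0) {
            perror(path);
            return;
        }
        n = read(fd, buf, sizeof(buf));
        if (n < 0)
            printf("%s: read failed: %s (pool exhausted?)\n",
                   path, strerror(errno));
        else
            printf("%s: got %zd bytes without blocking\n", path, n);
        close(fd);
    }

    int main(void)
    {
        /* On older kernels (pre-5.6) /dev/random draws from the
         * blocking entropy pool this bug describes, so a burst of
         * reads can make it fail with EAGAIN; /dev/urandom always
         * succeeds. */
        for (int i = 0; i < 16; i++)
            probe("/dev/random");
        probe("/dev/urandom");
        return 0;
    }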

This isn't a real-world scenario, but I'm concerned that the same circumstances could arise on a heavily loaded machine.

Below is a typical stack trace from interrupting into the debugger at a random point.

 #0    0xffffffff813393f8 in sha_transform (digest=0xffff88001c5239b0, data=<optimized out>, array=0xffff88001c523970) at lib/sha1.c:155 
 #1    0xffffffff8140cf1a in extract_buf (r=0xffffffff81c83b00, out=0xffff88001c523a8e "") at drivers/char/random.c:926 
 #2    0xffffffff8140dadf in extract_entropy (r=0xffffffff81c83b00, buf=0xffff88001c523aff, nbytes=1, min=0, reserved=0) at drivers/char/random.c:965 
 #3    0xffffffff8140dbc0 in get_random_bytes (buf=<optimized out>, nbytes=<optimized out>) at drivers/char/random.c:1035 
 #4    0xffffffff812a5178 in ceph_mdsmap_get_random_mds (m=0xffff880030c90960) at fs/ceph/mdsmap.c:33 
 #5    0xffffffff812a21bb in __choose_mds (req=0xffff88002fc2ec00, mdsc=0xffff8800241c4800) at fs/ceph/mds_client.c:755 
 #6    __do_request (mdsc=0xffff8800241c4800, req=0xffff88002fc2ec00) at fs/ceph/mds_client.c:1820 
 #7    0xffffffff812a24c2 in __wake_requests (mdsc=0xffff8800241c4800, head=0xffff88003dbbe570) at fs/ceph/mds_client.c:1883 
 #8    0xffffffff812a4698 in handle_session (msg=0xffff88002fce5900, session=0xffff88003dbbe000) at fs/ceph/mds_client.c:2338 
 #9    dispatch (con=<optimized out>, msg=0xffff88002fce5900) at fs/ceph/mds_client.c:3359 
 #10 0xffffffff81654fc9 in process_message (con=0xffff88003dbbe040) at net/ceph/messenger.c:2006 
 #11 try_read (con=<optimized out>) at net/ceph/messenger.c:2221 
 #12 con_work (work=0xffff88003dbbe438) at net/ceph/messenger.c:2332 
 #13 0xffffffff810740e6 in process_one_work (worker=<optimized out>, work=0xffff88003dbbe438) at kernel/workqueue.c:2080 
 #14 0xffffffff81075380 in worker_thread (__worker=0xffff880030d5f780) at kernel/workqueue.c:2201 
 #15 0xffffffff8107a423 in kthread (_create=0xffff88003e371cb8) at kernel/kthread.c:121 
 #16 0xffffffff8169ea84 in ?? () at arch/x86/kernel/entry_64.S:1216 
 #17 0x0000000000000000 in ?? ()
