Bug #2224

Oops in __cfh_to_dentry

Added by Henry Chang almost 12 years ago. Updated about 10 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I set up an HA pair of NFS servers that re-export Ceph to NFS clients.
The HA pair runs in active/standby mode, using Heartbeat with a virtual IP.
Only the active node acquires the virtual IP, mounts Ceph, and runs the NFS server.

I got the following oops when I tested the failover scenario:

1. Have one NFS client mount the NFS export on /mnt/nfs. Run "ls -l /mnt/nfs" to list the existing files.
2. Reboot the active NFS server.
3. On the NFS client, run "ls -l /mnt/nfs" again.

The oops happens when the standby node becomes active and runs the NFS server. It is always reproducible.

Mar 23 11:02:16 ecctsrv11 kernel: [37542.584997] BUG: unable to handle kernel NULL pointer dereference at           (null)
Mar 23 11:02:16 ecctsrv11 kernel: [37542.587453] IP: [<ffffffff812dfb72>] strlen+0x2/0x20
Mar 23 11:02:16 ecctsrv11 kernel: [37542.588997] PGD 0 
Mar 23 11:02:16 ecctsrv11 kernel: [37542.589617] Oops: 0000 [#1] SMP 
Mar 23 11:02:16 ecctsrv11 kernel: [37542.590641] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Mar 23 11:02:16 ecctsrv11 kernel: [37542.593043] CPU 2 
Mar 23 11:02:16 ecctsrv11 kernel: [37542.593635] Modules linked in: ceph libceph nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc vesafb dcdbas psmouse serio_raw ghes hed joydev i7core_edac edac_core lp parport mptsas mptscsih igb mptbase e1000e usbhid hid dca btrfs scsi_transport_sas zlib_deflate libcrc32c
Mar 23 11:02:16 ecctsrv11 kernel: [37542.602240] 
Mar 23 11:02:16 ecctsrv11 kernel: [37542.602673] Pid: 27635, comm: nfsd Not tainted 2.6.38-13-server #52 Dell                   DCS CS24-SC           /345678                
Mar 23 11:02:16 ecctsrv11 kernel: [37542.606578] RIP: 0010:[<ffffffff812dfb72>]  [<ffffffff812dfb72>] strlen+0x2/0x20
Mar 23 11:02:16 ecctsrv11 kernel: [37542.608869] RSP: 0018:ffff880bd7937958  EFLAGS: 00010246
Mar 23 11:02:16 ecctsrv11 kernel: [37542.610487] RAX: 0000000000000000 RBX: ffff880bd7937a04 RCX: 000001000000001e
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612673] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612676] RBP: ffff880bd7937990 R08: ffff880bd79379f8 R09: ffff880bd7937a0c
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612678] R10: 00000000ea46b424 R11: ffff880bd5ba89c0 R12: ffff880bd7d62400
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612680] R13: ffff880bd79379e8 R14: 0000000000000000 R15: ffff880bd7d62428
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612683] FS:  0000000000000000(0000) GS:ffff8800bf440000(0000) knlGS:0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612686] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612688] CR2: 0000000000000000 CR3: 0000000001a03000 CR4: 00000000000006e0
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612690] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612692] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612695] Process nfsd (pid: 27635, threadinfo ffff880bd7936000, task ffff880bd78b16e0)
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612697] Stack:
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612699]  ffffffffa0347e28 ffff880bd7937a0c ffffffff813a7664 0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612702]  ffff880bd63fe000 ffff880bd7d62400 0000000000000000 ffff880bd7937a40
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612705]  ffffffffa0347ed9 ffff880bd79379e8 ffff880bd7937a04 0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612708] Call Trace:
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612719]  [<ffffffffa0347e28>] ? set_request_path_attr+0x148/0x170 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612725]  [<ffffffff813a7664>] ? extract_entropy+0x94/0xd0
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612732]  [<ffffffffa0347ed9>] create_request_message.clone.25+0x89/0x4f0 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612736]  [<ffffffff813a7790>] ? get_random_bytes+0x20/0x30
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612742]  [<ffffffffa034ae96>] ? ceph_mdsmap_get_random_mds+0x76/0xc0 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612749]  [<ffffffffa0348432>] __prepare_send_request+0xf2/0x1a0 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612755]  [<ffffffffa034869b>] __do_request+0x1bb/0x260 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612761]  [<ffffffffa034669c>] ? __register_request+0xac/0x170 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612768]  [<ffffffffa0349e32>] ceph_mdsc_do_request+0xa2/0x1b0 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612774]  [<ffffffffa033d1fe>] __cfh_to_dentry+0x13e/0x1c0 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612780]  [<ffffffffa033d293>] ceph_fh_to_dentry+0x13/0x30 [ceph]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612784]  [<ffffffffa013163c>] exportfs_decode_fh+0x5c/0x2b0 [exportfs]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612793]  [<ffffffffa02e5750>] ? nfsd_acceptable+0x0/0x120 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612810]  [<ffffffffa026aaec>] ? cache_check+0x6c/0x250 [sunrpc]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612818]  [<ffffffffa02eb353>] ? exp_find_key+0x63/0xb0 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612823]  [<ffffffff81153dcf>] ? kmem_cache_alloc_trace+0xef/0x110
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612828]  [<ffffffff812a624d>] ? aa_dup_task_context+0x3d/0x70
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612832]  [<ffffffff812ab6e0>] ? apparmor_cred_prepare+0x40/0x60
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612839]  [<ffffffffa02e5aa2>] nfsd_set_fh_dentry+0x232/0x3a0 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612844]  [<ffffffff81091d90>] ? getboottime+0x30/0x40
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612850]  [<ffffffffa02e63e3>] fh_verify+0x1b3/0x270 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612858]  [<ffffffffa02f166c>] nfsd3_proc_getattr+0x6c/0xe0 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612864]  [<ffffffffa02e29ee>] nfsd_dispatch+0xfe/0x240 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612876]  [<ffffffffa025ffe5>] svc_process_common+0x345/0x690 [sunrpc]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612881]  [<ffffffff8105f3d0>] ? default_wake_function+0x0/0x20
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612892]  [<ffffffffa0260436>] svc_process+0x106/0x160 [sunrpc]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612897]  [<ffffffffa02e2102>] nfsd+0xc2/0x160 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612902]  [<ffffffffa02e2040>] ? nfsd+0x0/0x160 [nfsd]
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612907]  [<ffffffff81086f86>] kthread+0x96/0xa0
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612913]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612916]  [<ffffffff81086ef0>] ? kthread+0x0/0xa0
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612919]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612921] Code: 11 eb 1b 66 0f 1f 44 00 00 48 83 ea 01 48 39 d0 77 0c 0f b6 0a f6 81 00 73 63 81 20 75 eb c6 42 01 00 c9 c3 0f 1f 44 00 00 31 c0 <80> 3f 00 55 48 89 e5 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612942] RIP  [<ffffffff812dfb72>] strlen+0x2/0x20
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612945]  RSP <ffff880bd7937958>
Mar 23 11:02:16 ecctsrv11 kernel: [37542.612946] CR2: 0000000000000000
Mar 23 11:02:16 ecctsrv11 kernel: [37542.742497] ---[ end trace 437e9018f75dd144 ]---

I tried to fix it as follows. However, with this patch the kernel hits a page fault in handle_reply when it receives the reply from the MDS (see the attached video).

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f8ba653..88aecd6 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1601,11 +1601,13 @@ static int set_request_path_attr(struct inode *rinode, struct dentry *rdentry,
                r = build_dentry_path(rdentry, ppath, pathlen, ino, freepath);
                dout(" dentry %p %llx/%.*s\n", rdentry, *ino, *pathlen,
                     *ppath);
-       } else if (rpath || rino) {
+       } else {
                *ino = rino;
-               *ppath = rpath;
-               *pathlen = strlen(rpath);
-               dout(" path %.*s\n", *pathlen, rpath);
+               if (rpath) {
+                       *ppath = rpath;
+                       *pathlen = strlen(rpath);
+                       dout(" path %.*s\n", *pathlen, rpath);
+               }
        }

        return r;
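
For readers who want to try the guarded branch outside the kernel, here is a minimal userspace sketch of the logic the patch above introduces. It is an illustration under stated assumptions, not the kernel code: the function name set_path_attr_sketch and its reduced parameter list are hypothetical, and the inode and dentry branches of set_request_path_attr are omitted. It models the case seen in the oops, where the request carries an inode number but no path string, so strlen() must not be called on the NULL path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace model of the last branch of
 * set_request_path_attr() after the patch. The caller is assumed to
 * have initialized *ppath to NULL and *pathlen to 0 beforehand.
 */
static int set_path_attr_sketch(const char *rpath, uint64_t rino,
                                const char **ppath, int *pathlen,
                                uint64_t *ino)
{
    /* Always record the inode number, even when it is the only key. */
    *ino = rino;

    /* Only touch the path outputs when a path string exists. */
    if (rpath) {
        *ppath = rpath;
        *pathlen = (int)strlen(rpath);
    }
    return 0;
}
```

Calling the sketch with rpath == NULL and a nonzero rino leaves the path outputs at their initial values instead of dereferencing NULL. As noted above, this guard alone did not resolve the problem in testing: the kernel still faulted later in handle_reply.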

page_fault.avi (5.69 MB) Henry Chang, 03/29/2012 09:27 AM

History

#1 Updated by Sage Weil about 10 years ago

  • Status changed from New to Rejected
