Bug #19127

NULL pointer dereference in ceph_readdir

Added by Zheng Yan about 7 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Regression: No
Severity: 3 - minor

Description

[41775.962636] Oops: 0000 [#1] SMP
[41775.965783] Modules linked in: ceph libceph libcrc32c fscache binfmt_misc kvm_intel intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd joydev lpc_ich wmi ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler mei_me ioatdma mei acpi_pad shpchp ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp nfsd auth_rpcgss libiscsi nfs_acl scsi_transport_iscsi lockd grace lp sunrpc parport autofs4 btrfs xor raid6_pq hid_generic usbhid hid igb i2c_algo_bit ixgbe dca ahci ptp libahci pps_core nvme mdio nvme_core [last unloaded: kvm_intel]
[41776.027424] CPU: 5 PID: 6807 Comm: rm Not tainted 4.10.0-ceph-gbbcd1b20a189 #1
[41776.034694] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[41776.042140] task: ffff99e6563ecc80 task.stack: ffffb6ddc4fe0000
[41776.048102] RIP: 0010:ceph_readdir+0xe8e/0x12c0 [ceph]
[41776.053273] RSP: 0018:ffffb6ddc4fe3db8 EFLAGS: 00010296
[41776.058528] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000766ec7e6
[41776.065699] RDX: 0000000000000001 RSI: 00000000e736fbf5 RDI: ffff99e604ca01a0
[41776.072866] RBP: ffffb6ddc4fe3ea0 R08: 0000000000000000 R09: 0000000000000000
[41776.080034] R10: ffff99e6563ed4c8 R11: ffff99e6563ed4a0 R12: ffff99e604ca01a0
[41776.087201] R13: ffff99e604ca0120 R14: 0000000000000007 R15: ffffb6ddc4fe3ef0
[41776.094371] FS:  00007f1092840700(0000) GS:ffff99e67fd40000(0000) knlGS:0000000000000000
[41776.102515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41776.108297] CR2: 000000000000000c CR3: 00000008576f8000 CR4: 00000000003406e0
[41776.115468] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[41776.122633] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[41776.129804] Call Trace:
[41776.132285]  ? __might_fault+0x8c/0xa0
[41776.136063]  ? __might_fault+0x43/0xa0
[41776.139845]  iterate_dir+0xd3/0x1b0
[41776.143368]  SyS_getdents+0xa7/0x150
[41776.146977]  ? filldir64+0x150/0x150
[41776.150585]  entry_SYSCALL_64_fastpath+0x23/0xc6
[41776.155240] RIP: 0033:0x7f109232ed3b
[41776.158848] RSP: 002b:00007ffd4b373c30 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[41776.166470] RAX: ffffffffffffffda RBX: 0000000001077040 RCX: 00007f109232ed3b
[41776.173636] RDX: 0000000000010000 RSI: 00000000010792f0 RDI: 0000000000000003
[41776.180803] RBP: 0000000000000004 R08: 00007f109262b2f8 R09: 00007ffd4b373da4
[41776.187973] R10: 00000000010792b0 R11: 0000000000000206 R12: 0000000000000004
[41776.195139] R13: 0000000001077040 R14: 0000000000000000 R15: 00000000010899c0
[41776.202306] Code: 0f 84 75 02 00 00 48 3d 00 f0 ff ff 0f 87 8b 01 00 00 48 8b 98 d8 00 00 00 4c 8d a0 80 00 00 00 4c 89 e7 e8 05 0e ef c7 8b 45 80 <3b> 43 0c 75 a6 49 83 7d 58 00 74 9f 48 8b 4b 48 4d 8b 47 08 48 

Got this when running rm on the fsstress test directory, on a 7-MDS cluster.

(gdb) l * ceph_readdir+0xe8e
0x952e is in ceph_readdir (fs/ceph/dir.c:235).
233            di = ceph_dentry(dentry);
234            spin_lock(&dentry->d_lock);
235            if (di->lease_shared_gen == shared_gen &&
236                d_really_is_positive(dentry) &&
237                fpos_cmp(ctx->pos, di->offset) <= 0) {
238                emit_dentry = true;
239            }
240            spin_unlock(&dentry->d_lock);
Actions #1

Updated by Jeff Layton about 7 years ago

Ouch...the ceph readdir code seems to keep pointers to dentries in the ceph_readdir_cache_control, but doesn't hold references to them. Am I missing something on how that is expected to work? We probably need to ensure that we hold references to the child dentries in that code.

Actions #2

Updated by Jeff Layton about 7 years ago

  • Assignee set to Jeff Layton
Actions #3

Updated by Zheng Yan about 7 years ago

The idea is borrowed from ncpfs. When receiving a readdir reply from the server, the readdir code fills dentry pointers into the page cache for later readdir calls. The code knows a dentry pointer is valid when the following checks pass: (1) no dentry has been trimmed since the cache was filled (for cephfs the check is ceph_dir_is_complete_ordered(); the VFS callback for trimming a dentry is ceph_d_prune()); (2) lockref_get_not_dead() on the dentry pointer succeeds. Besides, the code needs to do both checks under the RCU read lock.
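
As an illustrative sketch only (ci and cached_dentry are placeholder names, not the actual variables in fs/ceph/dir.c), that validation pattern looks roughly like:

       struct dentry *dentry = NULL;

       rcu_read_lock();
       /* check (1): no dentry has been trimmed since the cache was filled */
       if (__ceph_dir_is_complete_ordered(ci) &&
           /* check (2): the dentry is not already dead; on success this
            * also takes a reference */
           lockref_get_not_dead(&cached_dentry->d_lockref))
               dentry = cached_dentry;
       rcu_read_unlock();

       if (dentry) {
               /* safe to use the dentry here; drop the reference when done */
               dput(dentry);
       }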

Actions #4

Updated by Zheng Yan about 7 years ago

I suspect the first check does not properly handle the d_splice_alias() case.

Actions #5

Updated by Jeff Layton about 7 years ago

We clear the parent's complete bit whenever we call splice_dentry, AFAICT. The only exception is in the readdir code itself, and that's precisely the place that we want to set that bit, not clear it.

It does sound like there may be a missing ceph_dir_clear_complete call; I'm just not sure where it is. I will note that I don't see it being cleared in either the unlink or the create codepath (though it is cleared on rename).

I guess the trace handling is supposed to do that?

Actions #6

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

We clear the parent's complete bit whenever we call splice_dentry, AFAICT. The only exception is in the readdir code itself, and that's precisely the place that we want to set that bit, not clear it.

We call ceph_dir_clear_ordered() before splice_dentry(), but only for the dentry's current parent. If the dentry was spliced in from another directory, we don't call ceph_dir_clear_ordered() for the old directory.

It does sound like there may be a missing ceph_dir_clear_complete call; I'm just not sure where it is. I will note that I don't see it being cleared in either the unlink or the create codepath (though it is cleared on rename).

I guess the trace handling is supposed to do that?

Yes, trace handling does the job (ceph_dir_clear_ordered() in ceph_fill_trace()). Rename is not special either: the ceph_dir_clear_complete() calls in ceph_rename() are for traceless replies, and recent MDS versions do not send traceless replies.
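
If the cross-directory splice is indeed the hole, a hypothetical fix would be to invalidate the old parent at splice time as well. A rough sketch, with illustrative names (alias is the dentry being moved in, dir the new parent; this is not existing code):

       /* Hypothetical: when splicing in an alias that previously lived in
        * a different directory, the old parent's readdir cache is stale too. */
       struct inode *old_dir = d_inode(alias->d_parent);

       if (old_dir != dir)
               ceph_dir_clear_ordered(old_dir);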

Actions #7

Updated by Jeff Layton about 7 years ago

Zheng Yan wrote:

We call ceph_dir_clear_ordered() before splice_dentry(), but only for the dentry's current parent. If the dentry was spliced in from another directory, we don't call ceph_dir_clear_ordered() for the old directory.

For my own notes... there are two cases here, fill_trace and ceph_readdir_prepopulate:

The main concern, I think, is in fill_trace. We use req->r_dentry and splice the inode we find in the trace into it. This case does call ceph_dir_clear_ordered prior to any call to splice_dentry. There's also this assertion at the top of the first conditional if block:

BUG_ON(d_inode(dn->d_parent) != dir);

...but, I suppose it's possible that we could end up with a remotely renamed directory (the __d_unalias case in d_splice_alias). What happens to the complete bit in the original parent when that occurs? Probably nothing. Still...

I don't suppose you happened to have gotten a vmcore here? It would be interesting to look at the dentry here and verify that di was actually NULL, or whether the dentry itself was just toast.

One thing that looks suspicious to me is that we check __ceph_dir_is_complete_ordered in ceph_readdir, and that is what governs whether the cached dentry pointers are valid. What guarantees that it remains complete/ordered once the spinlock is dropped in there? Could the real fix be that we should be holding a reference to CEPH_CAP_FILE_SHARED on the directory when we call __dcache_readdir?

Actions #8

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

Another question too... why are we using CEPH_CAP_FILE_SHARED in this code? Shouldn't we require CEPH_CAP_FILE_CACHE? Directories are different from files though, so perhaps that explains the discrepancy there?

We happen to use the page cache to implement the readdir cache; it has nothing to do with real file data caching. The reason we use CEPH_CAP_FILE_SHARED here is that the readdir cache is valid as long as the client holds CEPH_CAP_FILE_SHARED, and it becomes stale once the client loses it. (If the MDS needs to modify the directory, it revokes CEPH_CAP_FILE_SHARED first.)

Actions #9

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

Zheng Yan wrote:

We call ceph_dir_clear_ordered() before splice_dentry(), but only for the dentry's current parent. If the dentry was spliced in from another directory, we don't call ceph_dir_clear_ordered() for the old directory.

For my own notes... there are two cases here, fill_trace and ceph_readdir_prepopulate:

The main concern, I think, is in fill_trace. We use req->r_dentry and splice the inode we find in the trace into it. This case does call ceph_dir_clear_ordered prior to any call to splice_dentry. There's also this assertion at the top of the first conditional if block:

BUG_ON(d_inode(dn->d_parent) != dir);

...but, I suppose it's possible that we could end up with a remotely renamed directory (the __d_unalias case in d_splice_alias). What happens to the complete bit in the original parent when that occurs? Probably nothing. Still...

That's what I suspect.

I don't suppose you happened to have gotten a vmcore here? It would be interesting to look at the dentry here and verify that di was actually NULL, or whether the dentry itself was just toast.

no vmcore

One thing that looks suspicious to me is that we check __ceph_dir_is_complete_ordered in ceph_readdir, and that is what governs whether the cached dentry pointers are valid. What guarantees that it remains complete/ordered once the spinlock is dropped in there? Could the real fix be that we should be holding a reference to CEPH_CAP_FILE_SHARED on the directory when we call __dcache_readdir?

rcu_read_lock() and lockref_get_not_dead() guarantee that the dentry pointer is valid even if the directory is no longer complete/ordered. (rcu_read_lock() prevents the dentry's memory from being freed and reused; lockref_get_not_dead() makes sure the dentry is not being freed and takes a reference.)

Actions #10

Updated by Jeff Layton about 7 years ago

Zheng Yan wrote:

no vmcore

Bummer. Do you have the text in the log from before the Oops line? It'd be good to know whether it was a GPF, NULL pointer exception or what.

If it was a NULL pointer, then I'd be inclined to think that ceph_dentry returned NULL, such that the dentry was valid but on its way to being torn down. If it was a GPF, then it sounds more like maybe we have a cached array of dentry pointers that point into memory that has been recycled for other purposes.

Either way, knowing what happened to cause the crash might help us rule out some potential avenues. If it happens again, and you can get a core, then so much the better.

rcu_read_lock() and lockref_get_not_dead() guarantee that the dentry pointer is valid even if the directory is no longer complete/ordered. (rcu_read_lock() prevents the dentry's memory from being freed and reused; lockref_get_not_dead() makes sure the dentry is not being freed and takes a reference.)

Right, so I think we have to consider how we would have gotten a bad pointer here. I think there are two likely possibilities:

1) we have a dentry that is not yet freed but somehow ended up with a ceph_dentry of NULL (race in lockref_get_not_dead handling?)

2) we have a dentry pointer in the cache that is being considered valid by the ceph readdir code, but is actually gone and has been recycled. lockref_get_not_dead just happened to work because of the way the memory was reused.

I think 2 is the most likely scenario, but it's hard to tell with just what we have so far here.

Actions #11

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

Zheng Yan wrote:

no vmcore

Bummer. Do you have the text in the log from before the Oops line? It'd be good to know whether it was a GPF, NULL pointer exception or what.

If it was a NULL pointer, then I'd be inclined to think that ceph_dentry returned NULL, such that the dentry was valid but on its way to being torn down. If it was a GPF, then it sounds more like maybe we have a cached array of dentry pointers that point into memory that has been recycled for other purposes.

Either way, knowing what happened to cause the crash might help us rule out some potential avenues. If it happens again, and you can get a core, then so much the better.

If I remember right, it was a NULL pointer dereference. The dentry is likely valid, because the code had succeeded in taking d_lock.

rcu_read_lock() and lockref_get_not_dead() guarantee that the dentry pointer is valid even if the directory is no longer complete/ordered. (rcu_read_lock() prevents the dentry's memory from being freed and reused; lockref_get_not_dead() makes sure the dentry is not being freed and takes a reference.)

Right, so I think we have to consider how we would have gotten a bad pointer here. I think there are two likely possibilities:

1) we have a dentry that is not yet freed but somehow ended up with a ceph_dentry of NULL (race in lockref_get_not_dead handling?)

2) we have a dentry pointer in the cache that is being considered valid by the ceph readdir code, but is actually gone and has been recycled. lockref_get_not_dead just happened to work because of the way the memory was reused.

I think 2 is the most likely scenario, but it's hard to tell with just what we have so far here.

Agreed. The oops happened in a multi-MDS setup. I guess CEPH_CAP_FILE_SHARED was not properly revoked.

Actions #12

Updated by Jeff Layton about 7 years ago

Zheng Yan wrote:

Jeff Layton wrote:

Zheng Yan wrote:

no vmcore

Bummer. Do you have the text in the log from before the Oops line? It'd be good to know whether it was a GPF, NULL pointer exception or what.

If it was a NULL pointer, then I'd be inclined to think that ceph_dentry returned NULL, such that the dentry was valid but on its way to being torn down. If it was a GPF, then it sounds more like maybe we have a cached array of dentry pointers that point into memory that has been recycled for other purposes.

Either way, knowing what happened to cause the crash might help us rule out some potential avenues. If it happens again, and you can get a core, then so much the better.

If I remember right, it was a NULL pointer dereference. The dentry is likely valid, because the code had succeeded in taking d_lock.

Maybe. It really depends on what was in the memory. Dentries come out of a slabcache, so it's possible it got recycled as another dentry that didn't set d_fsdata.

rcu_read_lock() and lockref_get_not_dead() guarantee that the dentry pointer is valid even if the directory is no longer complete/ordered. (rcu_read_lock() prevents the dentry's memory from being freed and reused; lockref_get_not_dead() makes sure the dentry is not being freed and takes a reference.)

Right, so I think we have to consider how we would have gotten a bad pointer here. I think there are two likely possibilities:

1) we have a dentry that is not yet freed but somehow ended up with a ceph_dentry of NULL (race in lockref_get_not_dead handling?)

2) we have a dentry pointer in the cache that is being considered valid by the ceph readdir code, but is actually gone and has been recycled. lockref_get_not_dead just happened to work because of the way the memory was reused.

I think 2 is the most likely scenario, but it's hard to tell with just what we have so far here.

Agreed. The oops happened in a multi-MDS setup. I guess CEPH_CAP_FILE_SHARED was not properly revoked.

Yes, we also don't hold a reference to CEPH_CAP_FILE_SHARED in this code. I suppose that's the first thing to fix, and then we can see if the problem is still reproducible afterward.

Actions #13

Updated by Jeff Layton about 7 years ago

So, my thinking was to call ceph_try_get_caps in ceph_readdir to grab CEPH_CAP_FILE_RD and CEPH_CAP_FILE_SHARED. ceph_try_get_caps has these BUG_ONs in there:

       BUG_ON(need & ~CEPH_CAP_FILE_RD);
       BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO));

Zheng, those were added pretty recently in 2b1ac852eb67a6e95595e576371d23519105559f. Why the strict rules on need and want there?
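
For reference, the proposed call would look roughly like the sketch below. This is illustrative only: the ceph_try_get_caps() signature is assumed from the code of that era, and passing CEPH_CAP_FILE_SHARED in want is exactly what the second BUG_ON above would currently reject.

       /* Hypothetical call from ceph_readdir(); as the code stands, the
        * BUG_ON on want would fire because of CEPH_CAP_FILE_SHARED. */
       int got = 0;

       if (ceph_try_get_caps(ci, CEPH_CAP_FILE_RD, CEPH_CAP_FILE_SHARED, &got) <= 0)
               goto out; /* no caps; skip the cached readdir path */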

Actions #14

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

So, my thinking was to call ceph_try_get_caps in ceph_readdir to grab CEPH_CAP_FILE_RD and CEPH_CAP_FILE_SHARED. ceph_try_get_caps has these BUG_ONs in there:

[...]

Zheng, those were added pretty recently in 2b1ac852eb67a6e95595e576371d23519105559f. Why the strict rules on need and want there?

See __take_cap_refs(); it only supports these caps.

Actions #15

Updated by Zheng Yan about 7 years ago

Jeff Layton wrote:

Yes, we also don't hold a reference to CEPH_CAP_FILE_SHARED in this code. I suppose that's the first thing to fix, and then we can see if the problem is still reproducible afterward.

I think we can just call __ceph_dir_clear_complete() when the client releases CEPH_CAP_FILE_SHARED back to the MDS.
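
A rough sketch of that idea, at a hypothetical hook point wherever the client drops FILE_SHARED on a directory (dropping is assumed to be the mask of caps being released; names are illustrative):

       /* Hypothetical: once CEPH_CAP_FILE_SHARED is released back to the
        * MDS, the cached dentry pointers for this directory can't be
        * trusted any more. */
       if (S_ISDIR(inode->i_mode) && (dropping & CEPH_CAP_FILE_SHARED))
               __ceph_dir_clear_complete(ci);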

Actions #16

Updated by Jeff Layton about 5 years ago

  • Status changed from New to Resolved
  • Assignee changed from Jeff Layton to Zheng Yan

I haven't heard of this cropping up in quite some time, so I'm going to assume that the original oops here has been fixed, likely by commit 5495c2d04f. Reassigning to Zheng (since he did that patch) and closing.
