Bug #12088: cephfs client crash after enable readahead mechanism through setting conf option 'client_readahead_max_bytes' - CephFS - Ceph

Actions

Copy link

Bug #12088

closed

cephfs client crash after enable readahead mechanism through setting conf option 'client_readahead_max_bytes'

Added by Yan Shen almost 9 years ago. Updated over 8 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Zheng Yan

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

hammer

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v0.94.3

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

I run fio tool to test the randread performance of cephfs. Ceph client will crash, when I enable readahead on ceph fuse client.
The bug happened when I run randread tests with 4 threads.

Our ceph.conf:
client_readahead_max_bytes = 1048576

The core dump
(gdb) bt
#0  0x00007fc65540236b in raise () from /lib64/libpthread.so.0
#1  0x0000000000632606 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007fc6544e9885 in raise () from /lib64/libc.so.6
#5  0x00007fc6544eb065 in abort () from /lib64/libc.so.6
#6  0x00007fc654d9fa7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7  0x00007fc654d9dc06 in ?? () from /usr/lib64/libstdc++.so.6
#8  0x00007fc654d9dc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9  0x00007fc654d9dd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x000000000064bcca in ceph::__ceph_assert_fail (assertion=0x7fc5d40a30b0 "&\254\203U>\314\211#", file=<value optimized out>, 
    line=68661056, func=0x89a5d0 "void Mutex::Lock(bool)") at common/assert.cc:77
#11 0x0000000000635d33 in Mutex::Lock (this=0x7fc57098bab0, no_lockdep=<value optimized out>) at common/Mutex.cc:95
#12 0x0000000000687f2e in Readahead::dec_pending (this=0x7fc57098b9f0, count=1) at common/Readahead.cc:122
#13 0x00000000005723c8 in Client::C_Readahead::finish (this=0x7fc570977d40, r=<value optimized out>) at client/Client.cc:7044
#14 0x00000000005afc89 in Context::complete (this=0x7fc570977d40, r=<value optimized out>) at include/Context.h:65
#15 0x00000000005afc89 in Context::complete (this=0x7fc570008e60, r=<value optimized out>) at include/Context.h:65
#16 0x00000000006291bd in finish_contexts<Context> (cct=0x417af40, finished=<value optimized out>, result=0)
    at include/Context.h:121
#17 0x00000000006187df in ObjectCacher::bh_read_finish (this=0x419eb40, poolid=<value optimized out>, oid=<value optimized out>, 
    tid=9290, start=0, length=131072, bl=..., r=0, trust_enoent=true) at osdc/ObjectCacher.cc:824
#18 0x0000000000629427 in ObjectCacher::C_ReadFinish::finish (this=0x7fc570173370, r=0) at osdc/ObjectCacher.h:512
#19 0x00000000005afc89 in Context::complete (this=0x7fc570173370, r=<value optimized out>) at include/Context.h:65
#20 0x00000000005b12f1 in C_Lock::finish (this=0x7fc570014620, r=0) at include/Context.h:162
#21 0x00000000005afc89 in Context::complete (this=0x7fc570014620, r=<value optimized out>) at include/Context.h:65
#22 0x000000000064a648 in Finisher::finisher_thread_entry (this=0x419be40) at common/Finisher.cc:59
#23 0x00007fc6553fa7f1 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fc65459cccd in clone () from /lib64/libc.so.6
(gdb) frame 11
#11 0x0000000000635d33 in Mutex::Lock (this=0x7fc57098bab0, no_lockdep=<value optimized out>) at common/Mutex.cc:95
95        assert(r == 0);
(gdb) l
90          start = ceph_clock_now(cct);
91        r = pthread_mutex_lock(&_m);
92        if (logger && cct && cct->_conf->mutex_perf_counter)
93          logger->tinc(l_mutex_wait,
94                       ceph_clock_now(cct) - start);
95        assert(r == 0);
96        if (lockdep && g_lockdep) _locked();
97        _post_lock();
98
99      out:
(gdb) print _m
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {__prev = 0x0, 
      __next = 0x0}}, __size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>, __align = 0}

The coredump shows : the mutex lock is broken(size contains '\377').

The Ceph log


2015-06-19 12:36:10.360312 7fca117fa700 20 client.97564 readahead 19922944~131072 (caller wants 19906560~16384)
2015-06-19 12:36:10.360318 7fca117fa700 10 striper file_to_extents 19922944~131072 format 30000009691.%08llx
2015-06-19 12:36:10.360320 7fca117fa700 20 striper  su 4194304 sc 1 os 4194304 stripes_per_object 1
2015-06-19 12:36:10.360324 7fca117fa700 20 striper  off 19922944 blockno 4 stripeno 4 stripepos 0 objectsetno 4 objectno 4 block_start 0 block_off 3145728 3145728~131072
2015-06-19 12:36:10.360330 7fca117fa700 20 striper object_truncate_size 4 0->0
2015-06-19 12:36:10.360332 7fca117fa700 20 striper  added new extent(30000009691.00000004 (4) in @3 3145728~131072 -> [])
2015-06-19 12:36:10.360335 7fca117fa700 15 striper file_to_extents  extent(30000009691.00000004 (4) in @3 3145728~131072 -> [0,131072]) in @3

2015-06-19 12:36:10.360340 7fca117fa700 10 objectcacher readx extent(30000009691.00000004 (4) in @3 3145728~131072 -> [0,131072])
2015-06-19 12:36:10.360345 7fca117fa700 10 objectcacher.object(30000009691.00000004/head) map_read 30000009691.00000004 3145728~131072
2015-06-19 12:36:10.360349 7fca117fa700 20 objectcacher.object(30000009691.00000004/head) map_read gap bh[ 0x7fc9d8002900 3145728~131072 0x7fc9d966bf10 (0) v 0 missing] waiters = {}

2015-06-19 12:36:10.360407 7fca117fa700  1 -- 10.****:0/65371 --> 10.****.95:6800/31915 -- osd_op(client.97564.0:79410 30000009691.00000004 [read 3145728~131072] 3.c9946208 ack+read+known_if_redirected e12226) v5 -- ?+0 0x7fc9d810d790 con 0x7fccc008a280
2015-06-19 12:36:10.360421 7fca117fa700 20 -- 10.****:0/65371 submit_message osd_op(client.97564.0:79410 30000009691.00000004 [read 3145728~131072] 3.c9946208 ack+read+known_if_redirected e12226) v5 remote, 10.****.95:6800/31915, have pipe.

2015-06-19 12:36:10.361266 7fc9cdfeb700  3 client.97564 ll_release (fh)0x7fc9d9655660 30000009691
2015-06-19 12:36:10.361270 7fc9cdfeb700  5 client.97564 *_release_fh* 0x7fc9d9655660 mode 1 on 30000009691.head(ref=4 ll_ref=7 cap_refs={1024=1,2048=1,4096=0,8192=0} open={1=1,2=0,3=0} mode=100644 size=20971520/0 mtime=2015-06-19 12:35:06.668295 caps=pAsLsXsFscr(2=pAsLsXsFscr) objectset[30000009691 ts 0/0 objects 5 dirty_or_tx 0] parents=0x7fca0cc56b80 0x7fccd002c6b0)

2015-06-19 12:36:10.369798 7fcabb8f8700  1 -- 10.****:0/65371 <== osd.44 10.****.95:6800/31915 1941 ==== osd_op_reply(79410 30000009691.00000004 [read 3145728~131072] v0'0 uv18386 ondisk = 0) v6 ==== 187+0+131072 (2385944524 0 3796294508) 0x7fca90030a70 con 0x7fccc008a280

2015-06-19 12:36:10.398019 7fccf0ce3700  7 objectcacher bh_read_finish 30000009691.00000004/head tid 38326 3145728~131072 (bl is 131072) returned 0 *outstanding reads 31*

2015-06-19 12:36:10.398058 7fccf0ce3700 10 objectcacher readx extent(30000009691.00000004 (4) in @3 3145728~131072 -> [0,131072])
2015-06-19 12:36:10.398063 7fccf0ce3700 10 objectcacher.object(30000009691.00000004/head) map_read 30000009691.00000004 3145728~131072
2015-06-19 12:36:10.398066 7fccf0ce3700 20 objectcacher.object(30000009691.00000004/head) map_read hit bh[ 0x7fc9d957ac90 3112960~163840 0x7fc9d966bf10 (163840) v 0 clean firstbyte=0] waiters = {}

2015-06-19 12:36:10.398206 7fccf0ce3700 20 client.97564 C_Readahead on 0x159ffa4d55839c3a

As shown in ceph log, I think the reason is the file handler is released, outstanding reads don't finish yet. So when the outstanding read finish, it will access the broken Fh structure(ie. broken Mutex).
We use ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) in our I/O tests.

Files

client.patch (2.94 KB) client.patch

Zheng Yan, 07/02/2015 08:47 AM

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Wenjun Huang almost 9 years ago

While diving into the source code, we found the code path which will cause the crash. Described as the following steps:
In the Client::_read_async method:
1. The Fh is read for some data, which will trigger an read Op to be sent.
2. If the client_readahead_max_bytes option is set, it will trigger an additional readahead Op to be sent, and the Fh is handled by the Op finisher C_Readahead.

3. For some reason, the client close the fh quickly, but the response of the readahead Op has not reached to the client at that time.
4. In the above condition, the Client::_release_fh method did not have an additional check before deleting the Fh object. But the pointer to the Fh object is still handled by the finisher C_Readahead.
5. At last, the response of the readahead Op reached, it will execute its C_Readahead::finish method, but the pointer to Fh it handled points to an undefined memory area. So when it operates on the pointer, the program will crash.

A way I can come up with to fix this bug is to set a ref count for the Fh object. Then we can have a check before deleting the Fh object, if the ref count is larger than zero, we can postpone the deleting operation.

Actions

Copy link

Updated by Samuel Just almost 9 years ago

Project changed from Ceph to CephFS

Actions

Copy link

Updated by Zheng Yan almost 9 years ago

File client.patch client.patch added
Status changed from New to 7

please try the attached patch

Actions

Copy link

Updated by Zhi Zhang almost 9 years ago

hi Yan,

Thanks for the patch. We also have a similar patch tested internally, which uses readahead's pending counter as a ref, but your fix looks more concise. One question is, do you think a lock is needed when modifying _ref in Fh?

Actions

Copy link

Updated by Zheng Yan almost 9 years ago

no extra lock is needed. the async readahead context is called while client_lock is locked.

Actions

Copy link

Updated by John Spray almost 9 years ago

Status changed from 7 to Pending Backport

I imagine we probably want backports to firefly + hammer?

Actions

Copy link

Updated by John Spray almost 9 years ago

commit 34b939a81d38173b882c429b28dedce778504ba8
Author: Yan, Zheng <zyan@redhat.com>
Date:   Wed Jul 8 10:11:43 2015 +0800

    client: reference counting 'struct Fh'

    The async readahead finisher needs to reference 'struct Fh'. But
    it's possible user closes FD and free the corresponding 'struct Fh'
    before async readahead finishes.

    Fixes: #12088
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

Actions

Copy link

Updated by Zheng Yan almost 9 years ago

https://github.com/ceph/ceph/pull/5222

Actions

Copy link

Updated by Kefu Chai almost 9 years ago

Assignee set to Zheng Yan
Backport set to hammer

Actions

Copy link

#10

Updated by Greg Farnum almost 9 years ago

Backports team: Zheng created a backport PR at https://github.com/ceph/ceph/pull/5222, but let's let it bake a bit in master.

Actions

Copy link

#11

Updated by Yan Shen almost 9 years ago

Hi, Yan. I think it is also unsafe in the _read_async code when this patch applied. When thread1 wait on read, thread2 may release this fh (close fh immediately). If I enable client_readahead_max_bytes, f->readahead.update(off, len, in->size) may access freed fh probably.

if (r == 0) {
    get_cap_ref(in, CEPH_CAP_FILE_CACHE);
    client_lock.Unlock();
    flock.Lock();
    while (!done)
      cond.Wait(flock);<---------unsafe wait?
    flock.Unlock();
    client_lock.Lock();
    put_cap_ref(in, CEPH_CAP_FILE_CACHE);
    r = rvalue;
  } else {
    // it was cached.
    delete onfinish;
  }

  if(conf->client_readahead_max_bytes > 0) {
    pair<uint64_t, uint64_t> readahead_extent = f->readahead.update(off, len, in->size);
    if (readahead_extent.second > 0) {

When I apply this patch in ceph-dokan, the following bug happend at some times. It is really hard to reproduce. Do you think we should reproduce this patch to handle it?

 18  Id: 1eb0.1a6c Suspend: 0 Teb: 7ef70000 Unfrozen
ChildEBP RetAddr  Args to Child              
04e4f1dc 75640bdd 00000002 04e4f22c 00000001 ntdll_776c0000!ZwWaitForMultipleObjects+0x15
04e4f278 76bb1a2c 04e4f22c 04e4f2a0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100
04e4f2c0 76bb4208 00000002 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0
04e4f2dc 76bd80a4 00000002 04e4f310 00000000 kernel32!WaitForMultipleObjects+0x18
04e4f348 76bd7f63 04e4f428 00000001 00000001 kernel32!WerpReportFaultInternal+0x186
04e4f35c 76bd7858 04e4f428 00000001 04e4f3f8 kernel32!WerpReportFault+0x70
04e4f36c 76bd77d7 04e4f428 00000001 e294106f kernel32!BasepReportFault+0x20
04e4f3f8 777374df 00000000 777373bc 00000000 kernel32!UnhandledExceptionFilter+0x1af
04e4f400 777373bc 00000000 04e5ffd4 776ec530 ntdll_776c0000!__RtlUserThreadStart+0x62
04e4f414 77737261 00000000 00000000 00000000 ntdll_776c0000!_EH4_CallFilterFunc+0x12
04e4f43c 7771b459 fffffffe 04e5ffc4 04e4f578 ntdll_776c0000!_except_handler4+0x8e
04e4f460 7771b42b 04e4f528 04e5ffc4 04e4f578 ntdll_776c0000!ExecuteHandler2+0x26
04e4f484 7771b3ce 04e4f528 04e5ffc4 04e4f578 ntdll_776c0000!ExecuteHandler+0x24
04e4f510 776d0133 01e4f528 04e4f578 04e4f528 ntdll_776c0000!RtlDispatchException+0x127
04e4f51c 04e4f528 04e4f578 c0000005 00000000 ntdll_776c0000!KiUserExceptionDispatcher+0xf
WARNING: Frame IP not in any known module. Following frames may be wrong.
04e4f8d8 64e9fa26 00000000 00000000 00000008 0x4e4f528
04e4f928 64e88ea7 00000000 04e4f9cc 6fcc3a00 libcephfs!ZN5Mutex4LockEb+0x4c
04e4f968 64fdaf3f 101140a0 02c32000 00000000 libcephfs!ZN9Readahead6updateEyyy+0x41
04e4fc18 64fda419 10114078 02c32000 00000000 libcephfs!ZN6Client11_read_asyncEP2FhyyPN4ceph6buffer4listE+0x5ff
04e4fce8 64fd9d22 10114078 02c32000 00000000 libcephfs!ZN6Client5_readEP2FhxyPN4ceph6buffer4listE+0x497

Actions

Copy link

#12

Updated by Zhi Zhang almost 9 years ago

Zheng Yan wrote:

https://github.com/ceph/ceph/pull/5222

Hi Yan,

We found a serious problem with this patch when doing massive testing recently. Fh's ref will be increased when initializing C_Readahead, but if readahead does not need to do read from osd, readahead finisher will not be called and onfinish2 will be deleted. So Fh's ref can not be decreased. Then closing this file later, its inode's ref will miss the window to be decreased as well.

When this happened, the bad thing we met was that if we tried to delete those files, they could not be deleted physically from OSD because their inode's ref was not zero. The cluster is telling it is nearfull, while only some files are in cephfs.

How about we put _put_fh in C_Readahead's deconstructor, then either case will decrease Fh's ref?

Please check this PR: https://github.com/ceph/ceph/pull/5311

Thanks.

Actions

Copy link

#13

Updated by Loïc Dachary over 8 years ago

Status changed from Pending Backport to Resolved

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #12088

cephfs client crash after enable readahead mechanism through setting conf option 'client_readahead_max_bytes'

Updated by Wenjun Huang almost 9 years ago

Updated by Samuel Just almost 9 years ago

Updated by Zheng Yan almost 9 years ago

Updated by Zhi Zhang almost 9 years ago

Updated by Zheng Yan almost 9 years ago

Updated by John Spray almost 9 years ago

Updated by John Spray almost 9 years ago

Updated by Zheng Yan almost 9 years ago

Updated by Kefu Chai almost 9 years ago

Updated by Greg Farnum almost 9 years ago

Updated by Yan Shen almost 9 years ago

Updated by Zhi Zhang almost 9 years ago

Updated by Loïc Dachary over 8 years ago