Bug #1110

closed

mds: ls -l hangs on concurrent writer

Added by Sage Weil almost 13 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description

On Wed, 25 May 2011, Andre Noll wrote:
> On Wed, May 25, 09:16, Sage Weil wrote:
> > > It's trivial to reproduce: Executing "ls -l" on one client while
> > > another client is writing causes the ls command to hang indefinitely.
> > 
> > This was recently fixed in the master branch, commit
> > ee7fa813ef29890557f0b03bd3950d422484215d.  It hasn't made it into any
> > releases yet.  I'll cherry-pick it into stable for 0.28.2.
> > 
> > Please let me know if you continue to see this problem with that fix in
> > place.
> 
> Nope. I'm now running the ceph master branch (fe955881) on all nodes
> but the problem remains. Even after stopping the writer, the "ls -l" 
> hangs. Linux kernel version is plain 2.6.38.7 everywhere and the ceph
> setup is fairly simple: 6 cosds, one cmon and one cmds.

Files

mds.0.log (7.72 MB) Andre Noll, 05/25/2011 11:39 AM
mds.0.log (169 KB) Andre Noll, 06/14/2011 01:41 AM
mds.log.crash (31.2 KB) Andre Noll, 06/16/2011 01:42 AM
Actions #1

Updated by Andre Noll almost 13 years ago

mds log created with

ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'

The problem occurs at 2011-05-25 20:05 (around line 18400 in the file).

Actions #2

Updated by Sage Weil almost 13 years ago

The kclient isn't responding to a cap revocation message. I'm not seeing anything since 2.6.38 that would have fixed that sort of issue, but I'm not able to reproduce it here. Does this happen every time?

We may need to look at debug out on the kernel side. That's enabled with something like

echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control

There are a few lines that are hit for every write(2), so you may need to specifically filter out lines like

[ 75.680000] ceph: caps.c:2017 : get_cap_refs 00000000a11b9ce8 need Fw want Fb
[ 75.680000] ceph: caps.c:677 : __ceph_caps_issued 00000000a11b9ce8 cap 000000009cba5ab8 issued pAsxLsXsxFsxcrwb
[ 75.680000] ceph: caps.c:1997 : __take_cap_refs 00000000a11b9ce8 wb 0 -> 1 (?)
[ 75.680000] ceph: caps.c:752 : __ceph_caps_issued_mask 00000000a11b9ce8 cap 000000009cba5ab8 issued pAsxLsXsxFsxcrwb (mask Xs)
[ 75.680000] ceph: caps.c:716 : __touch_cap 00000000a11b9ce8 cap 000000009cba5ab8 mds0

to limit the noise from the writing process.
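
A minimal sketch of that capture, assuming debugfs is mounted at /sys/kernel/debug and that the noisy per-write functions are the ones shown in the sample lines above:

# mount debugfs if it is not mounted yet (the error is harmless if it already is)
mount -t debugfs none /sys/kernel/debug 2>/dev/null
echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control
# read the kernel log and drop the lines printed on every write(2)
dmesg | grep 'ceph:' | grep -v -e get_cap_refs -e __take_cap_refs -e __touch_cap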

Actions #3

Updated by Andre Noll almost 13 years ago

Hm that does not seem to work. I had to compile a kernel with dynamic debug enabled,
but nothing makes it to the logs after executing the above echo command.

Some more information:

  • The ls -l also hangs on the node where cmds and cmon are running.
  • A simple ls (without -l) does not hang.
  • stracing the ls -l shows that lstat() blocks for one particular file while
    the same system call returned successfully for many other files. Indeed,
    ls -l <filename> for that file hangs. (A sketch of such an strace invocation
    follows this list.)
  • There are quite a few "kernel time sync status change 2001" messages in the log.
    Sometimes the number is 6001 rather than 2001.
  • I see many of these in the logs:
    ------------[ cut here ]------------
    WARNING: at /fml/ag-raetsch/home/maan/tmp/scm/OTHER/linux-cmt/fs/btrfs/inode.c:2164 btrfs_orphan_commit_root+0x87/0xa5()
    Hardware name: To Be Filled By O.E.M.
    Pid: 2711, comm: kworker/1:1 Not tainted 2.6.38.7 #120
    Call Trace:
    [<ffffffff8102ce12>] ? warn_slowpath_common+0x7b/0x93
    [<ffffffff8102ce3f>] ? warn_slowpath_null+0x15/0x17
    [<ffffffff811d8917>] ? btrfs_orphan_commit_root+0x87/0xa5
    [<ffffffff811cecc8>] ? commit_fs_roots+0x8b/0xf1
    [<ffffffff811cf9b0>] ? btrfs_commit_transaction+0x353/0x625
    [<ffffffff81041fab>] ? autoremove_wake_function+0x0/0x34
    [<ffffffff8103e294>] ? queue_delayed_work_on+0xda/0xe8
    [<ffffffff811cfc82>] ? do_async_commit+0x0/0x26
    [<ffffffff811cfc9c>] ? do_async_commit+0x1a/0x26
    [<ffffffff8103ebc4>] ? process_one_work+0x21b/0x350
    [<ffffffff8103f0f8>] ? worker_thread+0x1d3/0x34b
    [<ffffffff8103ef25>] ? worker_thread+0x0/0x34b
    [<ffffffff81041bd0>] ? kthread+0x7d/0x85
    [<ffffffff81002af4>] ? kernel_thread_helper+0x4/0x10
    [<ffffffff81041b53>] ? kthread+0x0/0x85
    [<ffffffff81002af0>] ? kernel_thread_helper+0x0/0x10
    ---[ end trace d0dcc47e46d584cb ]---
    ------------[ cut here ]--------------
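
A sketch of the strace invocation referred to in the list above; the mount path is hypothetical, and -T only prints the time spent in each system call so that the hanging lstat() shows up as the last, unfinished line:

# ls output is discarded; strace prints its trace to stderr
strace -T -e trace=lstat ls -l /mnt/ceph/testdir > /dev/null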
Actions #4

Updated by Sage Weil almost 13 years ago

Andre Noll wrote:

Hm that does not seem to work. I had to compile a kernel with dynamic debug enabled,
but nothing makes it to the logs after executing the above echo command.

Do you see them in 'dmesg' output? By default they are at KERN_DEBUG level, which may not make it to the console or log. You can 'echo 9 > /proc/sysrq-trigger' to ensure it all hits the console; I'm not sure what it takes to make sure klogd picks it up (check your syslog.conf).
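
A sketch of making those KERN_DEBUG lines visible; the syslog.conf line is an assumption about a classic sysklogd setup and may not apply to other syslog daemons:

# raise the console log level so KERN_DEBUG messages are not filtered
dmesg -n 8
# equivalently: echo 8 > /proc/sys/kernel/printk
# and ask syslogd/klogd to keep kernel debug messages, e.g. in syslog.conf:
#   kern.*    /var/log/kern.log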

Some more information:
  • The ls -l also hangs on the node where cmds and cmon are running.
  • A simple ls (without -l) does not hang
  • stracing the ls -l shows that lstat() blocks for one particular file while
    the same system call returned successfully for many other files. Indeed,
    ls -l <filename> for that file hangs.

Can you reproduce the problem with something like dd if=/dev/zero of=mnt/foo and ls -al on foo from another node?
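
Spelled out with a hypothetical mount point, that two-node reproduction might look like:

# on the writing node
dd if=/dev/zero of=/mnt/ceph/foo bs=1M count=100000 &
# on another node, while the write is in flight
time ls -al /mnt/ceph/foo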

  • There are quite a few "kernel time sync status change 2001" messages in the log.
    Sometimes the number is 6001 rather than 2001.
  • I see many of these in the logs:
    ------------[ cut here ]------------
    WARNING: at /fml/ag-raetsch/home/maan/tmp/scm/OTHER/linux-cmt/fs/btrfs/inode.c:2164

btrfs_orphan_commit_root+0x87/0xa5()

This is a btrfs thing that shouldn't be related. It has been fixed in the latest code, although I'm not sure whether the patch made it into 2.6.39 or not.

Actions #5

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

Andre Noll wrote:

Hm that does not seem to work. I had to compile a kernel with dynamic debug enabled,
but nothing makes it to the logs after executing the above echo command.

Do you see them in 'dmesg' output?

No

By default they are at KERN_DEBUG level, which may not make it to the console or log. You can 'echo 9 > /proc/sysrq-trigger' to ensure it all hits the console; I'm not sure what it takes to make sure klogd picks it up (check your syslog.conf).

No luck. The dmesg contains only

device fsid 9040acc0c6c9c2fa-dccb4662427a46be devid 1 transid 26246 /dev/mapper/ceph-meta
device fsid e947f00ceeb2a50c-a0062f9743a0dfab devid 1 transid 3212 /dev/mapper/ceph-data
libceph: client4301 fsid f0a68877-f727-d3d5-61d6-ccfc70f54707
libceph: mon0 192.168.3.25:6789 session established
sshd (1178): /proc/1178/oom_adj is deprecated, please use /proc/1178/oom_score_adj instead.
SysRq : Changing Loglevel
Loglevel set to 9

We are using busybox syslog to log from all nodes to a remote server. Works fine, otherwise
I would not see the "kernel time sync status" messages from ntpd either.

  • stracing the ls -l shows that lstat() blocks for one particular file while
    the same system call returned successfully for many other files. Indeed,
    ls -l <filename> for that file hangs.

Can you reproduce the problem with something like dd if=/dev/zero of=mnt/foo and ls -al on foo from another node?

Yes, something similar (stress -d 10) revealed the problem originally. However, the writer does not seem to matter
any more once one "bad" file exists. Then the ls -l hangs everywhere even if no writer is present. Restarting the
ceph daemons or even rebooting the client did not help either. I haven't rebooted the node on which cmds and cmon
are running though.

  • There are quite a few "kernel time sync status change 2001" messages in the log.
    Sometimes the number is 6001 rather than 2001.
  • I see many of these in the logs:
    ------------[ cut here ]------------
    WARNING: at /fml/ag-raetsch/home/maan/tmp/scm/OTHER/linux-cmt/fs/btrfs/inode.c:2164

btrfs_orphan_commit_root+0x87/0xa5()

This is a btrfs thing that shouldn't be related. It has been fixed in the latest code, although I'm not sure whether the patch made it into 2.6.39 or not.

OK, thanks. I'll try out 2.6.39 tomorrow. Will keep you informed.

Actions #6

Updated by Andre Noll almost 13 years ago

OK, thanks. I'll try out 2.6.39 tomorrow. Will keep you informed.

Now running 2.6.39 everywhere on freshly created underlying btrfs and cephfs. My usual stress tests reproduced
the problem, but now the ls -l eventually returns. Running stress -d 10 --hdd-noclean on one node made the ls -l
on another node take more and more time to complete, apparently depending on the number of files. With ~600 files
the ls -l needs ~30 minutes to complete.

The good news is that everything is back to normal when I stop the writer.

Actions #7

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.29 to v0.30
Actions #8

Updated by Sage Weil almost 13 years ago

  • Position set to 6
Actions #9

Updated by Sage Weil almost 13 years ago

  • Story points set to 3
  • Position deleted (10)
  • Position set to 10
Actions #10

Updated by Sage Weil almost 13 years ago

  • Position deleted (14)
  • Position set to 12
Actions #11

Updated by Sage Weil almost 13 years ago

Okay, there was an issue with the behavior of the MDS locks (they didn't do what I thought they did). I added a new lock state to make sync more efficient (it does not require a flush of buffers on the clients); it's in the mds_xsyn branch. Can you try that out and see if there's an improvement?

It's faster for me, but there are still stalls. Currently that's because:

- mds will receive stat, revoke WR capability
- a kclient write starts to write into the buffer cache (generic_file_aio_write). the vm hits its dirty page limit and it blocks, awaiting completion of some writeback
- kclient will receive revocation, but a WR reference is held by the writing thread
- we wait a while to get an ack on the previous writeback
- the blocked generic_file_aio_write process finally gets past balance_dirty_pages and finishes
- we release the WR cap back to the mds

I'm not sure how to fix it. We need to take a WR reference before the write to make sure we're allowed, but we don't dirty/update i_size and mtime until/unless generic_file_aio_write completes (and that means we need to [still] be holding the WR cap reference).

I wonder what Lustre does here...

Actions #12

Updated by Sage Weil almost 13 years ago

  • Status changed from 4 to Resolved

I'm going to open a separate kclient issue to deal with the balance_dirty_pages issue.

Actions #13

Updated by Andre Noll almost 13 years ago

First of all, sorry for the delay; I have been distracted by other work.

Sage Weil wrote:

Okay, there was an issue with the behavior of the MDS locks (they didn't do what I thought they did). I added a new lock state to make sync more efficient (it does not require a flush of buffers on the clients); it's in the mds_xsyn branch. Can you try that out and see if there's an improvement?

I'm now running 2.6.39.1 and the mds_xsyn branch (350e6503) but the problem remains.

It's faster for me, but there are still stalls.

Same thing here. ls -l now takes "only" about 2 minutes to complete. However, cmds now crashes on my system.
Here's the gdb output and backtrace:

(gdb) run -f -c /etc/ceph/ceph.conf -i 0
Starting program: /usr/local/bin/cmds -f -c /etc/ceph/ceph.conf -i 0
[Thread debugging using libthread_db enabled]
** WARNING: Ceph is still under development. Any feedback can be directed at or http://ceph.newdream.net/. **
starting mds.0 at 0.0.0.0:6800/2230
[New Thread 0x7ffff69d2700 (LWP 2231)]
[New Thread 0x7ffff61d1700 (LWP 2232)]
[New Thread 0x7ffff59d0700 (LWP 2233)]
[New Thread 0x7ffff51cf700 (LWP 2234)]
[New Thread 0x7ffff49ce700 (LWP 2235)]
[New Thread 0x7ffff7fd8700 (LWP 2236)]
[New Thread 0x7ffff41cd700 (LWP 2239)]
[New Thread 0x7ffff40cc700 (LWP 2240)]
[New Thread 0x7ffff38cb700 (LWP 2241)]
[New Thread 0x7ffff30ca700 (LWP 2242)]
[New Thread 0x7ffff2fc9700 (LWP 2243)]
[New Thread 0x7ffff2ec8700 (LWP 2244)]
[New Thread 0x7ffff2dc7700 (LWP 2245)]
[New Thread 0x7ffff2cc6700 (LWP 2246)]
[New Thread 0x7ffff2bc5700 (LWP 2247)]
[New Thread 0x7ffff2ac4700 (LWP 2248)]
[New Thread 0x7ffff29c3700 (LWP 2249)]
[New Thread 0x7ffff28c2700 (LWP 2250)]
[New Thread 0x7ffff27c1700 (LWP 2251)]
[New Thread 0x7ffff26c0700 (LWP 2252)]
[Thread 0x7ffff26c0700 (LWP 2252) exited]
[New Thread 0x7ffff1ebf700 (LWP 2253)]
[New Thread 0x7ffff1dbe700 (LWP 2254)]
[New Thread 0x7ffff1cbd700 (LWP 2255)]
[New Thread 0x7ffff1abb700 (LWP 2257)]
[New Thread 0x7ffff1bbc700 (LWP 2256)]
[New Thread 0x7ffff19ba700 (LWP 2258)]
[New Thread 0x7ffff18b9700 (LWP 2259)]
[New Thread 0x7ffff13b4700 (LWP 2264)]
[New Thread 0x7ffff14b5700 (LWP 2263)]
[New Thread 0x7ffff15b6700 (LWP 2262)]
[New Thread 0x7ffff17b8700 (LWP 2261)]
[New Thread 0x7ffff16b7700 (LWP 2260)]
mds/Locker.cc: In function 'void Locker::file_xsyn(SimpleLock*, bool*)', in thread '0x7ffff51cf700'
mds/Locker.cc: 3894: FAILED assert(in->get_loner() >= 0 && in->mds_caps_wanted.empty())
ceph version 0.28.2-200-g350e650 (commit:350e65035b4a7d485fe62a48cc11b782ddbe5127)
1: (Locker::file_xsyn(SimpleLock*, bool*)+0x767) [0x5b5c87]
2: (Locker::_rdlock_kick(SimpleLock*)+0x16f) [0x5befcf]
3: (Locker::rdlock_start(SimpleLock*, MDRequest*, bool)+0x91) [0x5cb991]
4: (Locker::acquire_locks(MDRequest*, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&)+0x28f3) [0x5d1113]
5: (Server::handle_client_stat(MDRequest*)+0x254) [0x506d24]
6: (Server::handle_client_request(MClientRequest*)+0x524) [0x516694]
7: (MDS::handle_deferrable_message(Message*)+0x9df) [0x497d7f]
8: (MDS::_dispatch(Message*)+0x1766) [0x4ac2c6]
9: (MDS::ms_dispatch(Message*)+0x59) [0x4ac819]
10: (SimpleMessenger::dispatch_entry()+0x803) [0x6d79d3]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4893fc]
12: (()+0x69ca) [0x7ffff7bc69ca]
13: (clone()+0x6d) [0x7ffff6ab970d]
ceph version 0.28.2-200-g350e650 (commit:350e65035b4a7d485fe62a48cc11b782ddbe5127)
1: (Locker::file_xsyn(SimpleLock*, bool*)+0x767) [0x5b5c87]
2: (Locker::_rdlock_kick(SimpleLock*)+0x16f) [0x5befcf]
3: (Locker::rdlock_start(SimpleLock*, MDRequest*, bool)+0x91) [0x5cb991]
4: (Locker::acquire_locks(MDRequest*, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&)+0x28f3) [0x5d1113]
5: (Server::handle_client_stat(MDRequest*)+0x254) [0x506d24]
6: (Server::handle_client_request(MClientRequest*)+0x524) [0x516694]
7: (MDS::handle_deferrable_message(Message*)+0x9df) [0x497d7f]
8: (MDS::_dispatch(Message*)+0x1766) [0x4ac2c6]
9: (MDS::ms_dispatch(Message*)+0x59) [0x4ac819]
10: (SimpleMessenger::dispatch_entry()+0x803) [0x6d79d3]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4893fc]
12: (()+0x69ca) [0x7ffff7bc69ca]
13: (clone()+0x6d) [0x7ffff6ab970d]
terminate called after throwing an instance of 'ceph::FailedAssertion'

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff51cf700 (LWP 2234)]
0x00007ffff6a06a75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007ffff6a06a75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff6a0a5c0 in *__GI_abort () at abort.c:92
#2 0x00007ffff72bc8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007ffff72bad16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007ffff72bad43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007ffff72bae3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000006cb77a in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=0x758400 "void Locker::file_xsyn(SimpleLock*, bool*)") at common/assert.cc:86
#7 0x00000000005b5c87 in Locker::file_xsyn (this=<value optimized out>, lock=0xa60d00, need_issue=<value optimized out>) at mds/Locker.cc:3894
#8 0x00000000005befcf in Locker::_rdlock_kick (this=0xa06850, lock=0xa60d00) at mds/Locker.cc:789
#9 0x00000000005cb991 in Locker::rdlock_start (this=0xa06850, lock=0xa60d00, mut=0xa94660, as_anon=255) at mds/Locker.cc:855
#10 0x00000000005d1113 in Locker::acquire_locks (this=<value optimized out>, mdr=0xa94660, rdlocks=<value optimized out>, wrlocks=<value optimized out>, xlocks=...) at mds/Locker.cc:378
#11 0x0000000000506d24 in Server::handle_client_stat (this=0xa0d030, mdr=0xa94660) at mds/Server.cc:2032
#12 0x0000000000516694 in Server::handle_client_request (this=0xa0d030, req=0x7fffec03e350) at mds/Server.cc:1107
#13 0x0000000000497d7f in MDS::handle_deferrable_message (this=0xa09320, m=0x7fffec03e350) at mds/MDS.cc:1720
#14 0x00000000004ac2c6 in MDS::_dispatch (this=0xa09320, m=0x7fffec03e350) at mds/MDS.cc:1796
#15 0x00000000004ac819 in MDS::ms_dispatch (this=0xa09320, m=0x7fffec03e350) at mds/MDS.cc:1612
#16 0x00000000006d79d3 in Messenger::ms_deliver_dispatch (this=0xa0ca40) at msg/Messenger.h:98
#17 SimpleMessenger::dispatch_entry (this=0xa0ca40) at msg/SimpleMessenger.cc:353
#18 0x00000000004893fc in SimpleMessenger::DispatchThread::entry (this=0xa0cec8) at ./msg/SimpleMessenger.h:544
#19 0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#20 0x00007ffff6ab970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#21 0x0000000000000000 in ?? ()

Currently that's because:

- mds will receive stat, revoke WR capability
- a kclient write starts to write into the buffer cache (generic_file_aio_write). the vm hits its dirty page limit and it blocks, awaiting completion of some writeback
- kclient will receive revocation, but a WR reference is held by the writing thread
- we wait a while to get an ack on the previous writeback
- the blocked generic_file_aio_write process finally gets past balance_dirty_pages and finishes
- we release the WR cap back to the mds

I'm not sure how to fix it. We need to take a WR reference before the write to make sure we're allowed, but we don't dirty/update i_size and mtime until/unless generic_file_aio_write completes (and that means we need to [still] be holding the WR cap reference).

I wonder what Lustre does here...

We could ask adilger. He certainly knows...

Actions #14

Updated by Sage Weil almost 13 years ago

I pushed commit:5922de29e7414be3aaa6eb04be8bb0ae5c943ccf, which should fix that crash. Can you retest?

Still working out a fix for the kclient side of the cap revocation, #1140.

Actions #15

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

I pushed commit:5922de29e7414be3aaa6eb04be8bb0ae5c943ccf, which should fix that crash. Can you retest?

Thanks. I've pulled and installed the new mds_xsyn branch and restarted all ceph daemons. I'll let the tests
run over night and will report back tomorrow.

Still working out a fix for the kclient side of the cap revocation, #1140.

OK. If you want me to test anything, don't hesitate to ask. The six cluster nodes for ceph testing are not
used otherwise, so I can try patches and reboot at will.

Actions #16

Updated by Andre Noll almost 13 years ago

Andre Noll wrote:

Sage Weil wrote:

I pushed commit:5922de29e7414be3aaa6eb04be8bb0ae5c943ccf, which should fix that crash. Can you retest?

Thanks. I've pulled and installed the new mds_xsyn branch and restarted all ceph daemons. I'll let the tests
run over night and will report back tomorrow.

No more cmds crash, but the file system now hangs completely. Strace shows that cmds blocks on futex(0xa0cbec,
FUTEX_WAIT_PRIVATE, 1, NULL), while the ls command blocks on stat("/tmp/global").

Actions #17

Updated by Sage Weil almost 13 years ago

  • Status changed from Resolved to In Progress

Andre Noll wrote:

Andre Noll wrote:

Sage Weil wrote:

I pushed commit:5922de29e7414be3aaa6eb04be8bb0ae5c943ccf, which should fix that crash. Can you retest?

Thanks. I've pulled and installed the new mds_xsyn branch and restarted all ceph daemons. I'll let the tests
run over night and will report back tomorrow.

No more cmds crash, but the file system now hangs completely. Strace shows that cmds blocks on futex(0xa0cbec,
FUTEX_WAIT_PRIVATE, 1, NULL), while the ls command blocks on stat("/tmp/global").

Do you have mds logs by any chance? Not reproducing this one locally. (Not easily at least :)

Actions #18

Updated by Sage Weil almost 13 years ago

Andre, any chance you can pass along more information about the hang you were seeing? Log, or the specific workload?

Actions #19

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

Andre, any chance you can pass along more information about the hang you were seeing? Log, or the specific workload?

Sure, attached is the mds log. And just in case there's something wrong with my config, here's my ceph.conf.

[global]
; enable secure authentication
;auth supported = cephx
;osd journal size = 100 ; measured in MB

[client] ; userspace client
debug ms = 1
debug client = 10

; You need at least one monitor. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
mon data = /var/ceph/mon$id
; some minimal logging (just message traffic) to aid debugging
; debug ms = 1
debug auth = 20 ;authentication code

[mon.0]
host = node325
mon addr = 192.168.3.25:6789

; You need at least one mds. Define two to get a standby.
[mds]
; where the mds keeps it's secret encryption keys
keyring = /var/ceph/keyring.$name
[mds.0]
host = node325

; osd
; You need at least one. Two if you want data to be replicated.
; Define as many as you like.
[osd]
; This is where the btrfs volume will be mounted.
osd data = /var/ceph/osd$id

keyring = /etc/ceph/keyring.$name
; Ideally, make this a separate disk or partition.  A few GB
; is usually enough; more if you have fast disks. You can use
; a file under the osd data dir if need be
; (e.g. /data/osd$id/journal), but it will be slower than a
; separate disk or partition.
osd journal = /var/ceph/osd$id/journal
; If the OSD journal is a file, you need to specify the size. This is specified in MB.
osd journal size = 512

[osd.0]
host = node326
btrfs devs = /dev/ceph/data
[osd.1]
host = node327
btrfs devs = /dev/ceph/data
[osd.2]
host = node328
btrfs devs = /dev/ceph/data
[osd.3]
host = node329
btrfs devs = /dev/ceph/data
[osd.4]
host = node330
btrfs devs = /dev/ceph/data

Actions #20

Updated by Sage Weil almost 13 years ago

Oh, that log is from the 8th, and doesn't include the fix I pushed on the 9th fixing that particular bug. Did you attach the wrong log? Or do you mind retesting with the latest mds_xsyn branch?

Thanks!

Actions #21

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

Oh, that log is from the 8th, and doesn't include the fix I pushed on the 9th fixing that particular bug. Did you attach the wrong log? Or do you mind retesting with the latest mds_xsyn branch?

Hm, it's the only mds log I have, and it contains entries from 2011-06-08 up to today. Also, the mds installed
on the nodes says

ceph version 0.28.2-201-g5922de2 (commit:5922de29e7414be3aaa6eb04be8bb0ae5c943ccf)

which matches your recent mds_xsyn branch. In fact this git version is logged in the attached file, just
after the backtrace at 2011-06-09 18:40:28.

The mds is still running, btw., and the file system and the ls command have been hanging since last Friday.

Actions #22

Updated by Sage Weil almost 13 years ago

Oh! Sorry, I didn't scroll down. :)

I do see this:
2011-06-10 11:41:04.325925 7f931b987700 mds0.objecter FULL, paused modify 0x7f931402cee0 tid 25

which indicates the object store hit 90% (or whatever the default is, I forget) and declared itself full. Is it still marked that way? Can you attach your 'ceph -s' and 'ceph osd dump -o -' output? That may be the issue.

Otherwise, there's nothing there. If the situation is reproducible, the best would be an mds log generated with 'debug mds = 20' and 'debug ms = 1' in it.

Thanks!
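
For reference, a sketch of how those two options would look in the ceph.conf posted earlier in this thread (a running mds still needs a restart or injectargs for them to take effect):

[mds]
debug mds = 20
debug ms = 1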

Actions #23

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

I do see this:
2011-06-10 11:41:04.325925 7f931b987700 mds0.objecter FULL, paused modify 0x7f931402cee0 tid 25

which indicates the object store hit 90% (or whatever the default is, I forget) and declared itself full. Is it still marked that way?

Yes, it's full, which is kind of expected as my stress tests just wrote files in an endless loop. A full file
system should not cause this kind of hang, though.

Can you attach your 'ceph -s' and 'ceph osd dump -o -' output? That may be the issue?

Here we go. The -s option is not documented in the man page btw.

root@node325:~# ceph -s
2011-06-15 11:20:15.650818 7f162db6e720 -- :/0 messenger.start
2011-06-15 11:20:15.651776 7f162db6e720 -- :/2969 --> 192.168.3.25:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x776a40 con 0x775840
2011-06-15 11:20:15.652039 7f162db6d700 -- 192.168.3.25:0/2969 learned my addr 192.168.3.25:0/2969
2011-06-15 11:20:15.652644 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 1 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (3120868696 0 0) 0x776a40 con 0x775840
2011-06-15 11:20:15.652712 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- mon_subscribe({monmap=0+}) v1 -- ?+0 0x776c20 con 0x775840
2011-06-15 11:20:15.652743 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(0 v0) v1 -- ?+0 0x7770e0 con 0x775840
2011-06-15 11:20:15.652770 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(1 v0) v1 -- ?+0 0x777290 con 0x775840
2011-06-15 11:20:15.652787 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(2 v0) v1 -- ?+0 0x777440 con 0x775840
2011-06-15 11:20:15.652809 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(3 v0) v1 -- ?+0 0x777610 con 0x775840
2011-06-15 11:20:15.652831 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(4 v0) v1 -- ?+0 0x778120 con 0x775840
2011-06-15 11:20:15.652847 7f162aced700 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(5 v0) v1 -- ?+0 0x7782f0 con 0x775840
2011-06-15 11:20:15.653169 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 2 ==== mon_map v1 ==== 187+0+0 (1016302864 0 0) 0x778120 con 0x775840
2011-06-15 11:20:15.653272 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(0 v0) v1 -- ?+0 0x778120 con 0x775840
2011-06-15 11:20:15.653317 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4247668211 0 0) 0x7782d0 con 0x775840
2011-06-15 11:20:15.653347 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(1 v0) v1 -- ?+0 0x776a40 con 0x775840
2011-06-15 11:20:15.653376 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(2 v0) v1 -- ?+0 0x776bf0 con 0x775840
2011-06-15 11:20:15.653399 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(3 v0) v1 -- ?+0 0x776da0 con 0x775840
2011-06-15 11:20:15.653420 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(4 v0) v1 -- ?+0 0x7770e0 con 0x775840
2011-06-15 11:20:15.653445 7f162db6e720 -- 192.168.3.25:0/2969 --> 192.168.3.25:6789/0 -- observe(5 v0) v1 -- ?+0 0x7772b0 con 0x775840
2011-06-15 11:20:15.654455 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 4 ==== mon_observe_notify(v24631 231088 bytes latest v24631) v1 ==== 231139+0+0 (47911947 0 0) 0x778120 con 0x775840
2011-06-15 11:20:15.654477 pg v24631: 990 pgs: 987 active+clean, 3 active+clean+inconsistent; 679 GB data, 1365 GB used, 100 GB / 1500 GB avail
2011-06-15 11:20:15.656803 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 5 ==== mon_observe_notify(v5711 499 bytes latest v5711) v1 ==== 550+0+0 (3220045252 0 0) 0x776ca0 con 0x775840
2011-06-15 11:20:15.656822 mds e5711: 1/1/1 up {0=0=up:active}
2011-06-15 11:20:15.656882 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 6 ==== mon_observe_notify(v51 2872 bytes latest v51) v1 ==== 2923+0+0 (1089237266 0 0) 0x77cbd0 con 0x775840
2011-06-15 11:20:15.656910 osd e51: 5 osds: 5 up, 5 in full
2011-06-15 11:20:15.657002 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 7 ==== mon_observe_notify(v2632 9233 bytes latest v2632) v1 ==== 9284+0+0 (3354963846 0 0) 0x7770e0 con 0x775840
2011-06-15 11:20:15.657019 log 2011-06-14 22:01:34.249448 osd1 192.168.3.27:6800/1135 1108 : [INF] 0.6c scrub ok
2011-06-15 11:20:15.657112 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 8 ==== mon_observe_notify(v1 183 bytes latest v1) v1 ==== 234+0+0 (1461902107 0 0) 0x777460 con 0x775840
2011-06-15 11:20:15.657130 mon e1: 1 mons at {0=192.168.3.25:6789/0}
2011-06-15 11:20:15.657167 7f162aced700 -- 192.168.3.25:0/2969 <== mon0 192.168.3.25:6789/0 9 ==== mon_observe_notify(v173 1405 bytes latest v173) v1 ==== 1456+0+0 (1213620868 0 0) 0x77aef0 con 0x775840
2011-06-15 11:20:15.657495 7f162db6e720 -- 192.168.3.25:0/2969 shutdown complete.
root@node325:~#
root@node325:~# ceph osd dump -o -
2011-06-15 11:20:18.350958 7feefa23b720 -- :/0 messenger.start
2011-06-15 11:20:18.351506 7feefa23b720 -- :/2979 --> 192.168.3.25:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x776a40 con 0x775840
2011-06-15 11:20:18.351778 7feefa23a700 -- 192.168.3.25:0/2979 learned my addr 192.168.3.25:0/2979
2011-06-15 11:20:18.352376 7feef73ba700 -- 192.168.3.25:0/2979 <== mon0 192.168.3.25:6789/0 1 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (3860680295 0 0) 0x776a40 con 0x775840
2011-06-15 11:20:18.352444 7feef73ba700 -- 192.168.3.25:0/2979 --> 192.168.3.25:6789/0 -- mon_subscribe({monmap=0+}) v1 -- ?+0 0x776c20 con 0x775840
2011-06-15 11:20:18.352742 7feef73ba700 -- 192.168.3.25:0/2979 <== mon0 192.168.3.25:6789/0 2 ==== mon_map v1 ==== 187+0+0 (1016302864 0 0) 0x776a40 con 0x775840
2011-06-15 11:20:18.352810 7feef73ba700 -- 192.168.3.25:0/2979 <== mon0 192.168.3.25:6789/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4247668211 0 0) 0x776c30 con 0x775840
2011-06-15 11:20:18.352821 mon <- [osd,dump]
2011-06-15 11:20:18.352888 7feefa23b720 -- 192.168.3.25:0/2979 --> 192.168.3.25:6789/0 -- mon_command(osd dump v 0) v1 -- ?+0 0x776a40 con 0x775840
2011-06-15 11:20:18.353344 7feef73ba700 -- 192.168.3.25:0/2979 <== mon0 192.168.3.25:6789/0 4 ==== mon_command_ack([osd,dump]=0 dumped osdmap epoch 51 v51) v1 ==== 67+0+1331 (295438958 0 1350419461) 0x776a40 con 0x775840
2011-06-15 11:20:18.353377 mon0 -> 'dumped osdmap epoch 51' (0)
epoch 51
fsid 09e0bcaf-e87f-45f9-911b-e325d418699d
created 2011-06-08 17:54:56.837951
modifed 2011-06-10 12:04:51.439498
flags full

pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 lpg_num 2 lpgp_num 2 last_change 1 owner 0)
pg_pool 1 'metadata' pg_pool(rep pg_size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 lpg_num 2 lpgp_num 2 last_change 1 owner 0)
pg_pool 2 'rbd' pg_pool(rep pg_size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 lpg_num 2 lpgp_num 2 last_change 1 owner 0)

max_osd 5
osd0 up in weight 1 up_from 30 up_thru 46 down_at 29 last_clean_interval 27-28 192.168.3.26:6800/1172 192.168.3.26:6801/1172 192.168.3.26:6802/1172
osd1 up in weight 1 up_from 46 up_thru 46 down_at 41 last_clean_interval 33-43 192.168.3.27:6800/1135 192.168.3.27:6803/1135 192.168.3.27:6804/1135
osd2 up in weight 1 up_from 45 up_thru 46 down_at 44 last_clean_interval 35-44 192.168.3.28:6800/1136 192.168.3.28:6801/1136 192.168.3.28:6802/1136
osd3 up in weight 1 up_from 33 up_thru 46 down_at 32 last_clean_interval 3-31 192.168.3.29:6800/1144 192.168.3.29:6801/1144 192.168.3.29:6802/1144
osd4 up in weight 1 up_from 33 up_thru 46 down_at 32 last_clean_interval 3-31 192.168.3.30:6800/1135 192.168.3.30:6801/1135 192.168.3.30:6802/1135

2011-06-15 11:20:18.353646 7feefa23b720 -- 192.168.3.25:0/2979 shutdown complete.

Otherwise, there's nothing there. If the situation is reproducible, the best would be an mds log generated with 'debug mds = 20' and 'debug ms = 1' in it.

I've recreated the file system, added the above debug options and will let you know what happens.

Actions #24

Updated by Andre Noll almost 13 years ago

Otherwise, there's nothing there. If the situation is reproducible, the best would be an mds log generated with 'debug mds = 20' and 'debug ms = 1' in it.

I've recreated the file system, added the above debug options and will let you know what happens.

Seems to be reproducible. Tonight the same hang occurred, presumably when the file system was full
again. df now reports 93% disk usage and the

while : ; do time ls -l > /dev/null; sleep 10m; done

loop on the client stopped producing output at 22:45 yesterday evening. The full mds log file is 712M, so it is
too large to include here, but I can upload it to a web server if you like to see it. I think I've captured
and attached the interesting part of it. Please have a look.

Actions #25

Updated by Sage Weil almost 13 years ago

Can you confirm that it hung when the fs was full? Because that's normal (and unrelated to this issue); if the FS is full not much will work.

You can look at the osdmap to see when the "full" flag showed up... 'ceph osd dump -o - | grep full' should show the flag set. You can work backwards by specifying an epoch number ('ceph osd dump -o - 123 | grep full') to find the first epoch with the flag, and then look at the modified timestamp on that osdmap epoch to see when it was set.
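
A sketch of that walk; the epoch range is only an example, so substitute whatever epochs your cluster actually has:

# does the current map carry the flag?
ceph osd dump -o - | grep full
# walk backwards through earlier epochs and look at their modified timestamps
for e in $(seq 20 -1 1); do
    echo "--- epoch $e ---"
    ceph osd dump -o - $e | egrep 'modif|full'
done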

There was something a little odd in that log, but I suspect it's related to the ENOSPC.

Actions #26

Updated by Andre Noll almost 13 years ago

Sage Weil wrote:

Can you confirm that it hung when the fs was full? Because that's normal (and unrelated to this issue); if the FS is full not much will work.

Not even an ls command will succeed if it's full?

You can look at the osdmap to see when the "full" flag showed up... 'ceph osd dump -o - | grep full' should show the flag set. You can work backwards by specifying an epoch number ('ceph osd dump -o - 123 | grep full') to find the first epoch with the flag, and then look at the modified timestamp on that osdmap epoch to see when it was set.

Indeed: It was at epoch 5 which has a "modified" timestamp of 2011-06-15 22:37:48. This matches exactly the time when
the ls command started to hang.

I will now start other tests which do not fill up the file system. Let's see if I can break it with those tests
as well. Will follow up with results next Monday.

Thanks again for your help.

Actions #27

Updated by Andre Noll almost 13 years ago

Indeed: It was at epoch 5 which has a "modified" timestamp of 2011-06-15 22:37:48. This matches exactly the time when
the ls command started to hang.

I will now start other tests which do not fill up the file system. Let's see if I can break it with those tests
as well. Will follow up with results next Monday.

Looks good, no problems at all so far.

To summarize, I think we have seen a combination of two different issues: the stalled readers in the
presence of concurrent writers, and the complete hangs in the case of a full disk.

The patch in the mds_xsyn branch fixes the first issue, but there is still room for improvement as
an ls -l takes ~2 minutes even with the patch applied. It would be interesting to know what other
distributed file systems are doing in this situation.

If write access for users is desired, the second issue is a real showstopper though. At least in our
institute the write-data-in-an-endless-loop-without-looking-at-free-disk-space test is a good
approximation of what our users are going to do ;) Are there plans to address this problem? Could I
help somehow?

Thanks
Andre

Actions #28

Updated by Sage Weil almost 13 years ago

Okay, I found one other issue while testing this patch. There are two reasons this is slow:

- when we revoke the WR cap from the client, the kernel may be blocked in generic_file_aio_write waiting for writeback.  Looking at what, say, Lustre does may be helpful.  It is probably also possible to speculatively do the metadata update prior to writing into the page cache here.  See #1140.
- when the client sends caps back to the MDS, some of them are dirty.  The MDS, as a rule, will always make sure metadata is stable before exposing it to other clients, lest clients get inconsistent behavior on MDS restart.  That means something like

mds->client revoke
client->mds revoke ack, Fw is dirty
mds->osd journal dirty Fw (file size, mtime)
osd->mds ack
mds->otherclient here is the current file size
mds->client you can continue writing

If the system is under heavy IO load, the mds->osd->mds write can be slow. The best-performing solution is probably to segregate your metadata OSDs from data OSDs so that those writes are always fast. A best-effort, marginal-improvement solution would be to prioritize mds writes on the OSD so they jump ahead of client writes in the op queue; I'm not sure how much that will help.

Do you want to try segregating metadata onto a different set of OSDs?
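
One possible way to wire that up with the CRUSH tools, sketched from memory; the command names follow later Ceph documentation, so the exact syntax in this 0.28-era release may differ:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: add a rule whose 'take' step covers only the hosts meant for metadata,
# then point the metadata pool's crush_ruleset at that new rule
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new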

Actions #29

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.30 to v0.31
Actions #30

Updated by Sage Weil almost 13 years ago

  • Status changed from In Progress to Resolved

Created a separate task for the request prioritization on the osd, #1203

Actions #31

Updated by Andre Noll almost 13 years ago

- when the client sends caps back to the MDS, some of them are dirty. The MDS, as a rule, will always make sure metadata is stable before exposing it to other clients, lest clients get inconsistent behavior on MDS restart.

Maybe this rule is a bit too restrictive, given there is no guarantee anyway for the exposed metadata to be up to date
at the time the client receives it. How does NFS handle this case?

That means something like

mds->client revoke
client->mds revoke ack, Fw is dirty
mds->osd journal dirty Fw (file size, mtime)
osd->mds ack
mds->otherclient here is the current file size
mds->client you can continue writing

If the system is under heavy IO load, the mds->osd->mds write can be slow. The best performing solution is probably to segregate your metadata OSDs from data OSDs so that those writes are always fast. The best effort marginal improvement solution would be to prioritize mds writes on the OSD so they jump ahead of client writes in the op queue; I'm not sure how much that will help.

Do you want to try segregating metadata onto a different set of OSDs?

I'm not sure what you mean by segregating. In our test setup one node (node325) is dedicated to the mds and the monitor
while on each of the other five nodes only the osd is running; see the config file contained in an earlier message of
this thread.

Actions #32

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.31)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
