
Bug #6301

ceph-osd hung by XFS using linux 3.10

Added by Loic Dachary about 6 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
09/13/2013
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

The kernel configuration (3.10.11 sources) is here: config.gz
The general ceph setup is described here: http://dachary.org/?p=2087
The following happened today on a machine heavily loaded with recovering OSDs and a lot of I/O.

root@miltonduff:~# uptime
 11:59:47 up 1 day, 19:37,  2 users,  load average: 546,05, 546,08, 534,52

The system is otherwise stable, and the problem never shows unless the I/O load increases significantly.
root@miltonduff:~# uname -a
Linux miltonduff 3.10.11-dsiun-130711 #44 SMP Mon Sep 9 16:59:12 CEST 2013 x86_64 GNU/Linux
INFO: task ceph-osd:17047 blocked for more than 120 seconds.
[153972.073476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153972.076322] ceph-osd        D ffff880869a28a60     0 17047   5423 0x00000000
[153972.076324]  ffff880869a28750 0000000000000002 ffff880867788ee8 ffff8807e4e47500
[153972.079259]  ffff880517addfd8 ffff880517addfd8 ffff880517addfd8 ffff880869a28750
[153972.082244]  0000000000000004 ffff880517addd48 ffff880517addd50 7fffffffffffffff
[153972.085278] Call Trace:
[153972.088310]  [<ffffffff81410f4a>] ? schedule_timeout+0x1ca/0x270
[153972.091371]  [<ffffffff8100abf1>] ? native_sched_clock+0x11/0x70
[153972.094386]  [<ffffffff81070cda>] ? try_to_wake_up+0x1ea/0x270
[153972.097358]  [<ffffffff81412623>] ? wait_for_completion+0xa3/0x120
[153972.100322]  [<ffffffff81070d60>] ? try_to_wake_up+0x270/0x270
[153972.103292]  [<ffffffff811a3702>] ? do_coredump+0x1b2/0xee0
[153972.106212]  [<ffffffff811a3899>] ? do_coredump+0x349/0xee0
[153972.109085]  [<ffffffff8134e0c4>] ? skb_queue_tail+0x24/0x60
[153972.111954]  [<ffffffff813eb73a>] ? unix_dgram_sendmsg+0x5aa/0x640
[153972.114839]  [<ffffffff81053049>] ? get_signal_to_deliver+0x199/0x5a0
[153972.117743]  [<ffffffff81002353>] ? do_signal+0x63/0x8c0
[153972.120632]  [<ffffffff81052030>] ? do_send_sig_info+0x60/0x90
[153972.123521]  [<ffffffff81002c38>] ? do_notify_resume+0x88/0xa0
[153972.126400]  [<ffffffff81414d6a>] ? int_signal+0x12/0x17
[153972.299643] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[153972.868782] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[153973.038189] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[153974.309978] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[153974.763236] INPUT miltonduff IN=eth2.2092 OUT= MAC=01:00:5e:00:00:01:00:1b:21:b9:99:3c:08:00 SRC=0.0.0.0 DST=224.0.0.1 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF PROTO=2 
[153974.870473] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[153975.039848] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

mount points
root@miltonduff:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=4116954,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,mode=600,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=3301508k,mode=755 0 0
/dev/disk/by-uuid/37dd603c-168c-49de-830d-ef1b5c6982f8 / xfs rw,relatime,attr2,inode64,noquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=6709040k 0 0
/dev/sdd1 /boot ext2 rw,relatime,errors=continue 0 0
none /var/local/cgroup cgroup rw,relatime,hugetlb,perf_event,blkio,net_cls,freezer,devices,memory,cpuacct,cpu,debug,cpuset,clone_children 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--A--3--1 /var/lib/lxc/01-ceph-osd-loi-A-3-1 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--1-A24 /var/lib/lxc/01-ceph-osd-loi-A-3-1/rootfs/CEPH/A.24 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--A--3--1 /var/lib/lxc/01-ceph-osd-loi-A-3-1/rootfs/var/lib/ceph/osd/ceph-24 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--A--3--2 /var/lib/lxc/02-ceph-osd-loi-A-3-2 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--2-A25 /var/lib/lxc/02-ceph-osd-loi-A-3-2/rootfs/CEPH/A.25 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--A--3--2 /var/lib/lxc/02-ceph-osd-loi-A-3-2/rootfs/var/lib/ceph/osd/ceph-25 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--A--3--3 /var/lib/lxc/03-ceph-osd-loi-A-3-3 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--3-A26 /var/lib/lxc/03-ceph-osd-loi-A-3-3/rootfs/CEPH/A.26 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--A--3--3 /var/lib/lxc/03-ceph-osd-loi-A-3-3/rootfs/var/lib/ceph/osd/ceph-26 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--B--3--1 /var/lib/lxc/04-ceph-osd-loi-B-3-1 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--1-B24 /var/lib/lxc/04-ceph-osd-loi-B-3-1/rootfs/CEPH/B.24 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--B--3--1 /var/lib/lxc/04-ceph-osd-loi-B-3-1/rootfs/var/lib/ceph/osd/ceph-24 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--B--3--2 /var/lib/lxc/05-ceph-osd-loi-B-3-2 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--2-B25 /var/lib/lxc/05-ceph-osd-loi-B-3-2/rootfs/CEPH/B.25 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--B--3--2 /var/lib/lxc/05-ceph-osd-loi-B-3-2/rootfs/var/lib/ceph/osd/ceph-25 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-root--ceph--osd--loi--B--3--3 /var/lib/lxc/06-ceph-osd-loi-B-3-3 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/osd--loi--3--3-B26 /var/lib/lxc/06-ceph-osd-loi-B-3-3/rootfs/CEPH/B.26 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
/dev/mapper/LocalDisk-journal--ceph--osd--loi--B--3--3 /var/lib/lxc/06-ceph-osd-loi-B-3-3/rootfs/var/lib/ceph/osd/ceph-26 xfs rw,noatime,attr2,nobarrier,inode64,noquota 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0 

config.gz - kernel 3.10.11 config (28.6 KB) Loic Dachary, 09/13/2013 05:32 AM

History

#1 Updated by Loic Dachary about 6 years ago

  • Description updated (diff)

#2 Updated by Loic Dachary about 6 years ago

  • Description updated (diff)

#3 Updated by Loic Dachary about 6 years ago

  • Description updated (diff)

#4 Updated by Loic Dachary about 6 years ago

  • Description updated (diff)
  • Status changed from New to Need More Info

#5 Updated by Loic Dachary about 6 years ago

#6 Updated by Sage Weil about 6 years ago

It would probably be best to send the xfs info to the list.

Also--you probably don't want to use 'nobarrier' unless you are very sure your raid card firmware is well behaved and nvram is reliable!
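As a sketch of acting on that advice (the device and mount point below are taken from the listing in this report; 'barrier' was the XFS default at the time, so simply dropping 'nobarrier' restores it), the corresponding fstab entry would look like:

```shell
# Example fstab entry with 'nobarrier' removed so the default write
# barriers are back in effect; device and mount point are from the
# /proc/mounts listing above.
/dev/mapper/LocalDisk-root--ceph--osd--loi--A--3--1 /var/lib/lxc/01-ceph-osd-loi-A-3-1 xfs rw,noatime,attr2,inode64,noquota 0 0
```

An already-mounted filesystem can be switched on the fly with `mount -o remount,barrier <mountpoint>`.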

#7 Updated by Sage Weil about 6 years ago

  • Status changed from Need More Info to Closed

#8 Updated by Loic Dachary almost 6 years ago

This really is an XFS problem that should be reported as described here: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

#9 Updated by Loic Dachary almost 6 years ago

Emmanuel Lacour ran into the same problem today and sent a bug report with a detailed description.

#10 Updated by Emmanuel Lacour almost 6 years ago

I confirm: we have had a similar problem here several times since this cluster was set up. It seems related to some memory corruption on the linux/xfs side which hangs OSD access to the xfs filesystem. I just reported the details on the xfs mailing list here:

http://oss.sgi.com/pipermail/xfs/2013-December/032346.html

#11 Updated by Yann Dupont almost 6 years ago

Loïc kindly created this issue for me (lack of time, no OpenID at the moment of the report, etc.)

So it's obviously the same bug as the one reported by Emmanuel on the XFS list (thanks, Emmanuel, for taking the time to do that), and it seems the problem is caused by 64k directories on XFS.

Dave Chinner proposed a patch, under test on the XFS folks' side, but it only covers one aspect of the problem.

See http://oss.sgi.com/archives/xfs/2013-12/msg00087.html

#12 Updated by Yann Dupont almost 6 years ago

Maybe the status should be changed from Closed to something more appropriate, even if ceph isn't the culprit here...

#13 Updated by Yann Dupont over 5 years ago

One fix has been integrated in kernel commit b3f03bac8132207a20286d5602eda64500c19724 (3.14-rc1).

No backport AFAICS.
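For anyone checking their own machines, a rough shell sketch follows. It only compares version numbers against the 3.14-rc1 mainline release that carried the fix; as the later comments in this thread show, distro kernels (e.g. RHEL/CentOS) may carry the commit as a backport, which a version check alone cannot detect.

```shell
# Rough check: the mainline fix shipped in 3.14-rc1, so a running kernel
# older than 3.14 may lack it (unless the distro backported the commit,
# which version numbers alone cannot tell you).
kver=$(uname -r | cut -d- -f1)
major=${kver%%.*}
rest=${kver#*.}
minor=${rest%%.*}
if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }; then
    echo "kernel >= 3.14: mainline fix should be present"
else
    echo "kernel < 3.14: fix may be missing unless backported"
fi
```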

#14 Updated by Mark Nelson about 5 years ago

FWIW, after upgrading the performance test nodes from Ubuntu 13.10 to Fedora Core 20, I appear to be hitting this under heavy load as well. Confirmed that neither the Fedora 20 nor the RHEL7 kernel source contains b3f03ba.

#15 Updated by wei li almost 5 years ago

We are facing this problem on CentOS 7, kernel 3.10.0-123.el7.x86_64.
We downloaded the kernel source from http://vault.centos.org/7.0.1406/os/Source/SPackages/kernel-3.10.0-123.el7.src.rpm
and tried to apply kernel commit b3f03bac8132207a20286d5602eda64500c19724 manually, but found that this commit is already in kernel-3.10.0-123.el7.
It seems Red Hat has already patched this; we filed an issue with CentOS: http://bugs.centos.org/view.php?id=8053
We also found a similar issue in the kernel bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=73831

In our test environment, the rbd order is set to 27, so the block size is 128M. Could this be related, or is it another bug in xfs?
When xfs hits the memory allocation deadlock, the osd hangs, and client IO on the front side hangs too.
Restarting the osd does not fix the problem; the machine has to be restarted.
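For reference, the rbd "order" is the base-2 log of the object size, so order 27 does correspond to 128 MiB objects (against a 4 MiB default at order 22). A quick shell check of that arithmetic:

```shell
# rbd "order" is log2 of the object size: order 22 is the 4 MiB default,
# order 27 gives 2^27 bytes.
order=27
object_size=$((1 << order))
echo "order $order -> $object_size bytes ($((object_size >> 20)) MiB)"
```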

#16 Updated by wei li almost 5 years ago

Some ceph error log output from when the xfs memory allocation deadlock occurred:

2015-01-05 16:29:14.011438 7fc7a4274700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fc7a4274700 time 2015-01-05 16:29:13.913484
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

 ceph version  ()
 1: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0xc0a) [0x891b0a]
 2: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5e0) [0x7d4c90]
 3: (ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp&, PushOp*)+0x266) [0x817456]
 4: (ReplicatedBackend::do_push_reply(std::tr1::shared_ptr<OpRequest>)+0x13d) [0x817a5d]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x266) [0x922556]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7afe73]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a5) [0x6003f5]
 8: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x61b6a3]
 9: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x65f70c]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa77cb0]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xa78ba0]
 12: (()+0x7df3) [0x7fc7bd6e8df3]
 13: (clone()+0x6d) [0x7fc7bc3d33dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


ceph version ()
1: /usr/bin/ceph-osd() [0x99d296]
2: (()+0xf130) [0x7fc7bd6f0130]
3: (gsignal()+0x39) [0x7fc7bc312989]
4: (abort()+0x148) [0x7fc7bc314098]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc7bcc169d5]
6: (()+0x5e946) [0x7fc7bcc14946]
7: (()+0x5e973) [0x7fc7bcc14973]
8: (()+0x5eb9f) [0x7fc7bcc14b9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa8725f]
10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0xc0a) [0x891b0a]
11: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5e0) [0x7d4c90]
12: (ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp&, PushOp*)+0x266) [0x817456]
13: (ReplicatedBackend::do_push_reply(std::tr1::shared_ptr<OpRequest>)+0x13d) [0x817a5d]
14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x266) [0x922556]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7afe73]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a5) [0x6003f5]
17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x61b6a3]
18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x65f70c]
19: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa77cb0]
20: (ThreadPool::WorkThread::entry()+0x10) [0xa78ba0]
21: (()+0x7df3) [0x7fc7bd6e8df3]
22: (clone()+0x6d) [0x7fc7bc3d33dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2015-01-05 17:00:03.024481 7f4a6ed837c0 0 ceph version (), process ceph-osd, pid 5117
2015-01-05 17:00:03.043106 7f4a6ed837c0 0 filestore(/var/lib/ceph/osd/ceph-9) mount detected xfs (libxfs)
2015-01-05 17:00:03.043115 7f4a6ed837c0 1 filestore(/var/lib/ceph/osd/ceph-9) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-01-05 17:00:03.044863 7f4a6ed837c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP ioctl is supported and appears to work
2015-01-05 17:00:03.044873 7f4a6ed837c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-01-05 17:00:03.045124 7f4a6ed837c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-01-05 17:00:03.045170 7f4a6ed837c0 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: extsize is disabled by conf
2015-01-05 17:00:03.150770 7f4a6ed837c0 0 filestore(/var/lib/ceph/osd/ceph-9) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-01-05 17:00:40.302339 7f4a6ed837c0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-01-05 17:00:40.302376 7f4a6ed837c0 1 journal _open /var/lib/ceph/osd/ceph-9/journal fd 20: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-01-05 17:00:41.104992 7f4a6ed837c0 1 journal _open /var/lib/ceph/osd/ceph-9/journal fd 20: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-01-05 17:00:41.123568 7f4a6ed837c0 1 journal close /var/lib/ceph/osd/ceph-9/journal

#17 Updated by Yann Dupont almost 5 years ago

Is this with 64k directories?

We no longer have the problem - all our OSDs now use the standard xfs format (that is, no 64k directories).
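To check whether an existing filesystem was made with 64k directories, one option (a sketch; the mount point in the usage comment is just an example) is to look at the "naming" line of `xfs_info` output, where the mkfs default block size is 4096 and 65536 indicates `-n size=64k`:

```shell
# Extract the directory block size from `xfs_info` output: the "naming"
# line reports bsize=4096 for the mkfs.xfs default and bsize=65536 for
# filesystems created with -n size=64k (the configuration implicated here).
dir_bsize() {
    sed -n 's/^naming.*bsize=\([0-9]*\).*/\1/p'
}
# Usage (mount point is an example):
#   xfs_info /var/lib/ceph/osd/ceph-24 | dir_bsize
```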

#18 Updated by wei li over 4 years ago

@Yann Dupont
Yes, my mkfs args are:

mkfs.xfs -f -d agcount=32 -l size=1024m -n size=64k /dev/sdb1

What value would you recommend instead?

#19 Updated by Yann Dupont over 4 years ago

Well, in fact we were plagued by this issue for a long period last year. It was very sporadic. I'm not sure -n size=64k was a good idea anyway.

So, after a while, we decided to reformat all our nodes, one by one, and stick with the XFS default parameters (that is, plain mkfs.xfs without options). That was cumbersome (we had to convert all our OSDs), but in the end it was a wise move: no problems since.

I hope you don't have too many OSDs to convert.

#20 Updated by wei li over 4 years ago

@Yann Dupont
Thank you very much! I will do so.

#21 Updated by Warren Wang over 4 years ago

I know this isn't strictly a Ceph problem, but hopefully I can save some folks some grief. I just spent 3 weeks looking at this problem. For Ubuntu users, kernel 3.13.0-40 and above include the fix for the xfs contiguous allocation bug.
