Bug #13662
closed
ext4 xattr linux 3.16 / 4.2.3 panic
Added by Loïc Dachary over 8 years ago. Updated about 7 years ago.
Description
problem description
Machines hosting OSDs with Debian jessie and running linux 3.16 crashed one after the other with the attached backtraces (photos of the console). All of them are related to ext4 xattr, but no two traces are the same.
Running hammer 0.94.5. It looks like something related to xattr propagated to all OSDs and triggered an ext4 kernel bug that crashes the machine.
workaround
Rebuilt the Ubuntu trusty 3.19 kernel with the following patch: https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6 The patch simply removes the mbcache code from ext4 xattr handling.
Files
s1.png (18.1 KB) | kernel trace 1 | Loïc Dachary, 10/31/2015 03:34 AM
s2.png (15.4 KB) | kernel trace 2 | Loïc Dachary, 10/31/2015 03:34 AM
s3.png (14.9 KB) | kernel trace 3 | Loïc Dachary, 10/31/2015 03:34 AM
bt_g1 (14.7 KB) | Mehdi Abaakouk, 10/31/2015 03:38 AM
bt_n7 (16.1 KB) | Mehdi Abaakouk, 10/31/2015 03:47 AM
g6.dmesg (159 KB) | Mehdi Abaakouk, 11/05/2015 01:55 PM
Updated by Loïc Dachary over 8 years ago
Mehdi & Laurent : could you add as much detail as possible about the configuration of the machines where this crash happens ?
Updated by Loïc Dachary over 8 years ago
- Subject changed from ext4 xattr linux 3.16 backtrace to ext4 xattr linux 3.16 panic
Updated by Mehdi Abaakouk over 8 years ago
Just got the same kind of backtrace on a 4.2.3 kernel
Updated by Mehdi Abaakouk over 8 years ago
Backtrace on a 3.16.7 kernel
Updated by Loïc Dachary over 8 years ago
s1
mb_cache_entry_get [mbcache]
ext4_xattr_block_set [ext4]
ext4_xattr_find_entry [ext4]
ext4_xattr_set_entry [ext4]
ext4_xattr_set_handle [ext4]
ext4_xattr_set [ext4]
generic_setxattr
__vfs_setxattr_noperm
vfs_setxattr
setxattr
ext4_xattr_user_list
ext4_xattr_list_entries
ext4_listattr
__sb_start_write
s2
__mb_cache_entry_release+0x9d/0x120 [mbcache]
ext4_xattr_get+XXXX [ext4]
generic_getxattr
getxattr
ext4_discard_preallocation
s3
ext4_xattr_cache_insert [ext4]
ext4_xattr_get [ext4]
generic_getxattr
vfs_getxattr
ext4_discard_preallocations
do_flip_open
Updated by Loïc Dachary over 8 years ago
- Subject changed from ext4 xattr linux 3.16 panic to ext4 xattr linux 3.16 / 4.2.3 panic
Updated by Loïc Dachary over 8 years ago
- http://tracker.ceph.com/issues/11581#note-2 has an ext4-related panic, but the problem does not look the same
Updated by Laurent GUERBY over 8 years ago
On some machines, we get soft lockup messages before the freeze:
root@g2:~#
Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.122564] BUG: soft lockup - CPU#3 stuck for 23s! [ceph-osd:5530]
Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.146542] BUG: soft lockup - CPU#5 stuck for 23s! [ceph-osd:5790]
Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.158531] BUG: soft lockup - CPU#6 stuck for 23s! [ceph-osd:10587]
Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.170521] BUG: soft lockup - CPU#7 stuck for 23s! [ceph-osd:5920]
Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.097887] BUG: soft lockup - CPU#3 stuck for 22s! [ceph-osd:5530]
Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.121866] BUG: soft lockup - CPU#5 stuck for 22s! [ceph-osd:5790]
Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.133855] BUG: soft lockup - CPU#6 stuck for 22s! [ceph-osd:10587]
Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.145844] BUG: soft lockup - CPU#7 stuck for 22s! [ceph-osd:5920]
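To see at a glance which CPUs and ceph-osd processes keep reappearing, messages like the ones above can be tallied with a short script. This is only a minimal sketch and was not used in this ticket: the name `summarize_soft_lockups` is made up for illustration, and the regular expression assumes exactly the `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` format shown above.

```python
import re
from collections import Counter

# Assumed line format, taken from the console output quoted in this ticket:
#   kernel:[76005.122564] BUG: soft lockup - CPU#3 stuck for 23s! [ceph-osd:5530]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft lockup reports per (cpu, command, pid) tuple."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            key = (int(match["cpu"]), match["comm"], int(match["pid"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Message from syslogd@g2 at Oct 30 21:01:37 ...",
        "kernel:[76005.122564] BUG: soft lockup - CPU#3 stuck for 23s! [ceph-osd:5530]",
        "Message from syslogd@g2 at Oct 30 21:02:05 ...",
        "kernel:[76033.097887] BUG: soft lockup - CPU#3 stuck for 22s! [ceph-osd:5530]",
    ]
    for (cpu, comm, pid), n in sorted(summarize_soft_lockups(sample).items()):
        print(f"CPU#{cpu} {comm}[{pid}]: {n} report(s)")
```

In the output above, the same four (CPU, pid) pairs show up in both bursts, which is the kind of pattern such a tally makes visible.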
Updated by Samuel Just over 8 years ago
We need to send a description of this to the ext4 list asap.
Updated by Loïc Dachary over 8 years ago
- Status changed from 12 to Need More Info
It would be useful to have a way to reproduce the problem. Or maybe a list of the xattrs of all objects on the faulty OSD?
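One possible way to gather such a list is to walk the OSD data directory and record the name and size of every extended attribute. The following is a rough sketch under assumptions, not a tool from this ticket: `collect_xattrs` is a hypothetical helper, it relies on Python's Linux-only `os.listxattr` and `os.getxattr`, and it only records value sizes rather than contents.

```python
import os


def collect_xattrs(root):
    """Map every file and directory under root to {xattr name: value size}.

    Paths whose xattrs cannot be read (for example because the filesystem
    does not support them) map to an error string instead of a dict.
    """
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                names = os.listxattr(path, follow_symlinks=False)
                result[path] = {
                    attr: len(os.getxattr(path, attr, follow_symlinks=False))
                    for attr in names
                }
            except OSError as exc:
                # Keep going: one unreadable object should not stop the scan.
                result[path] = f"error: {exc}"
    return result
```

It would be called on the data directory of the faulty OSD, which is deployment-specific. Reading xattrs exercises the same ext4 code path as the getxattr traces above, so running it on an affected kernel may itself trigger the crash.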
Updated by Mehdi Abaakouk over 8 years ago
I have opened an issue on the kernel side: https://bugzilla.kernel.org/show_bug.cgi?id=107301
And got a complete backtrace via netconsole
Updated by Mehdi Abaakouk over 8 years ago
The problem seems to be caused by a concurrency issue in the kernel's ext4 mbcache/xattr code.
In that case, could 'filestore op threads = 1' work around the issue?
Updated by Loïc Dachary over 8 years ago
@laurent @mehdi : could you please summarize what you did to work around the problem ?
Updated by Laurent GUERBY over 8 years ago
Since no user-level workaround was effective, we rebuilt the Ubuntu trusty 3.19 kernel with the following patch:
https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6
The patch simply removes the mbcache code from ext4 xattr handling.
The kernel developers are discussing what to do in the bugzilla; Sage and the Lustre developers stepped in as users.
No issue so far in two days of production on our 11 machines with the new kernel.
Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
- Status changed from Need More Info to 15
- Priority changed from High to Normal
Updated by Yuri Weinstein over 8 years ago
- Related to Bug #14505: "kernel: BUG: soft lockup - CPU#1 stuck" in upgrade:infernalis-infernalis-distro-basic-openstack added
Updated by Anonymous about 8 years ago
I just updated the korg bugzilla and am updating this ticket accordingly.
The 4.6.x series of the kernel will feature a patch from Jan Kara that should solve this issue upstream in the kernel.
If Laurent & Mehdi can confirm that 4.6.x is safe, it would be ideal to report that here, so that Ceph users considering such a kernel can avoid the issue you had.