Bug #13662

ext4 xattr linux 3.16 / 4.2.3 panic

Added by Loïc Dachary almost 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

problem description

Machines hosting OSDs on Debian jessie with Linux 3.16 crashed one after the other with the attached backtraces (photos of the console). All of the traces involve ext4 xattr code, but no two are identical.
The cluster is running hammer 0.94.5. It looks like something related to xattrs propagated to all OSDs and triggered an ext4 kernel bug that crashes the machine.

workaround

Rebuilt the Ubuntu trusty 3.19 kernel with the following patch: https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6 The patch simply removes the mbcache code from ext4 xattr handling.

s1.png - kernel trace 1 (18.1 KB) Loïc Dachary, 10/31/2015 03:34 AM

s2.png - kernel trace 2 (15.4 KB) Loïc Dachary, 10/31/2015 03:34 AM

s3.png - kernel trace 3 (14.9 KB) Loïc Dachary, 10/31/2015 03:34 AM

bt_g1 (14.7 KB) Mehdi Abaakouk, 10/31/2015 03:38 AM

bt_n7 (16.1 KB) Mehdi Abaakouk, 10/31/2015 03:47 AM

g6.dmesg (159 KB) Mehdi Abaakouk, 11/05/2015 01:55 PM


Related issues

Related to Ceph - Bug #14505: "kernel: BUG: soft lockup - CPU#1 stuck" in upgrade:infernalis-infernalis-distro-basic-openstack Rejected 01/25/2016

History

#1 Updated by Loïc Dachary almost 7 years ago

Mehdi & Laurent: could you add as much detail as possible about the configuration of the machines where this crash happens?

#2 Updated by Loïc Dachary almost 7 years ago

  • Subject changed from ext4 xattr linux 3.16 backtrace to ext4 xattr linux 3.16 panic

#3 Updated by Mehdi Abaakouk almost 7 years ago

Just got the same kind of backtrace on a 4.2.3 kernel

#4 Updated by Mehdi Abaakouk almost 7 years ago

Backtrace on a 3.16.7 kernel

#5 Updated by Loïc Dachary almost 7 years ago

s1

mb_cache_entry_get [mbcache]
ext4_xattr_block_set [ext4]
ext4_xattr_find_entry [ext4]
ext4_xattr_set_entry [ext4]
ext4_xattr_set_handle [ext4]
ext4_xattr_set [ext4]
generic_setxattr
__vfs_setxattr_noperm
vfs_setxattr
setxattr
ext4_xattr_user_list
ext4_xattr_list_entries
ext4_listxattr
__sb_start_write

s2

__mb_cache_entry_release+0x9d/0x120 [mbcache]
ext4_xattr_get+XXXX [ext4]
generic_getxattr
getxattr
ext4_discard_preallocations

s3

ext4_xattr_cache_insert [ext4]
ext4_xattr_get [ext4]
generic_getxattr
vfs_getxattr
ext4_discard_preallocations
do_filp_open

#6 Updated by Loïc Dachary almost 7 years ago

  • Subject changed from ext4 xattr linux 3.16 panic to ext4 xattr linux 3.16 / 4.2.3 panic

#7 Updated by Loïc Dachary almost 7 years ago

#8 Updated by Laurent GUERBY almost 7 years ago

On some machines we see soft lockup messages before the freeze:

root@g2:~#
Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.122564] BUG: soft lockup - CPU#3 stuck for 23s! [ceph-osd:5530]

Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.146542] BUG: soft lockup - CPU#5 stuck for 23s! [ceph-osd:5790]

Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.158531] BUG: soft lockup - CPU#6 stuck for 23s! [ceph-osd:10587]

Message from syslogd@g2 at Oct 30 21:01:37 ...
kernel:[76005.170521] BUG: soft lockup - CPU#7 stuck for 23s! [ceph-osd:5920]

Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.097887] BUG: soft lockup - CPU#3 stuck for 22s! [ceph-osd:5530]

Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.121866] BUG: soft lockup - CPU#5 stuck for 22s! [ceph-osd:5790]

Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.133855] BUG: soft lockup - CPU#6 stuck for 22s! [ceph-osd:10587]

Message from syslogd@g2 at Oct 30 21:02:05 ...
kernel:[76033.145844] BUG: soft lockup - CPU#7 stuck for 22s! [ceph-osd:5920]
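
To see at a glance which CPUs and ceph-osd processes are stuck, messages like the ones above can be tallied with a short script. This is only a sketch: the regular expression is written against the message format quoted in this comment, and the function name count_soft_lockups is made up for illustration.

```python
import re
from collections import Counter

# Pattern written against the syslog lines quoted above; other kernels may
# format the message differently.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(text):
    """Count soft lockup reports per (cpu, command, pid)."""
    counts = Counter()
    for match in LOCKUP_RE.finditer(text):
        key = (
            int(match.group("cpu")),
            match.group("comm"),
            int(match.group("pid")),
        )
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Three of the lines reported on g2, copied from the comment above.
    sample = """\
kernel:[76005.122564] BUG: soft lockup - CPU#3 stuck for 23s! [ceph-osd:5530]
kernel:[76005.146542] BUG: soft lockup - CPU#5 stuck for 23s! [ceph-osd:5790]
kernel:[76033.097887] BUG: soft lockup - CPU#3 stuck for 22s! [ceph-osd:5530]
"""
    for (cpu, comm, pid), n in sorted(count_soft_lockups(sample).items()):
        print(f"CPU#{cpu} {comm}:{pid} reported {n} time(s)")
```

In the output quoted above, the same four CPU and PID pairs appear in both bursts, which suggests the same ceph-osd threads stayed stuck rather than new ones piling up.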

#9 Updated by Samuel Just almost 7 years ago

We need to send a description of this to the ext4 list asap.

#10 Updated by Samuel Just almost 7 years ago

  • Priority changed from Urgent to High

#11 Updated by Loïc Dachary almost 7 years ago

  • Status changed from 12 to Need More Info

It would be useful to have a way to reproduce the problem. Or maybe a list of the xattrs of all objects on the faulty OSD?
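
One possible way to collect such a list is sketched below, assuming Python 3 on Linux (os.listxattr and os.getxattr are Linux-only). The function name list_xattrs is hypothetical, and the directory to scan is whatever OSD data directory applies on the machine; no path is assumed here. Note that traces s2 and s3 go through the getxattr path, so reading xattrs on an affected kernel may itself hit the bug.

```python
import os
import sys


def list_xattrs(root):
    """Map each path under root to {xattr name: value size in bytes}."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                names = os.listxattr(path, follow_symlinks=False)
                sizes = {
                    n: len(os.getxattr(path, n, follow_symlinks=False))
                    for n in names
                }
            except OSError:
                # Unreadable entry, or xattrs not supported here: skip it.
                continue
            if sizes:
                result[path] = sizes
    return result


if __name__ == "__main__":
    # Directory to scan is given on the command line; falls back to the
    # current directory so the script is harmless when run without arguments.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, sizes in sorted(list_xattrs(root).items()):
        for name, size in sorted(sizes.items()):
            print(f"{path}\t{name}\t{size}")
```

Only names and value sizes are printed, which keeps the output small enough to attach to the ticket.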

#12 Updated by Mehdi Abaakouk almost 7 years ago

I have opened an issue on the kernel side: https://bugzilla.kernel.org/show_bug.cgi?id=107301

I also got a complete backtrace via netconsole.

#13 Updated by Mehdi Abaakouk almost 7 years ago

The problem seems to be a concurrency bug in the kernel's ext4 mbcache/xattr code.

In that case, could setting 'filestore op threads = 1' work around the issue?

#14 Updated by Loïc Dachary almost 7 years ago

@laurent @mehdi: could you please summarize what you did to work around the problem?

#15 Updated by Laurent GUERBY almost 7 years ago

Since no user-level workaround was effective, we rebuilt the Ubuntu trusty 3.19 kernel with the following patch:

https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6

The patch simply removes the mbcache code from ext4 xattr handling.

The kernel developers are discussing what to do in the bugzilla; Sage and the Lustre developers stepped in as users.

No issue so far in two days of production on our 11 machines with the new kernel.

#16 Updated by Loïc Dachary almost 7 years ago

  • Description updated (diff)
  • Status changed from Need More Info to 15
  • Priority changed from High to Normal

#17 Updated by Yuri Weinstein over 6 years ago

  • Related to Bug #14505: "kernel: BUG: soft lockup - CPU#1 stuck" in upgrade:infernalis-infernalis-distro-basic-openstack added

#18 Updated by Anonymous over 6 years ago

I just updated the kernel.org bugzilla and am updating this ticket accordingly.

The 4.6.x kernel series will include a patch from Jan Kara that should fix this issue upstream in the kernel.
If Laurent and Mehdi can confirm that 4.6.x is safe, that would be worth reporting here so that Ceph users considering such a kernel can avoid the issue you had.
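
As a rough first check, the running kernel's version number can be compared against the 4.6 series mentioned above. This is a sketch only: the helper names are hypothetical, and a version number proves nothing by itself, since distribution kernels may carry or lack the fix regardless of their version. The ticket also still asks for confirmation that 4.6.x is safe.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor) from a release string such as '3.16.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_upstream_fix(release, first_fixed=(4, 6)):
    """True if the version is at least the series named in this ticket."""
    return parse_kernel_release(release) >= first_fixed


if __name__ == "__main__":
    release = platform.release()
    try:
        verdict = "4.6 or later" if has_upstream_fix(release) else "older than 4.6"
    except ValueError:
        verdict = "version not recognized"
    print(f"{release}: {verdict}")
```

Both kernels named in the subject of this ticket, 3.16 and 4.2.3, compare as older than 4.6, as does the rebuilt 3.19 kernel, which relies on the patch instead.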

#19 Updated by Sage Weil over 5 years ago

  • Status changed from 15 to Resolved
