Bug #14708

closed

Semi-crashed OSDs cause cluster to hang

Added by Christian Theune about 8 years ago. Updated about 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We triggered a weird kernel/ext4 bug on a server this weekend which caused the OSD processes to keep running but become unresponsive. The host system was in a bad state as well and needed a hard (IPMI) reset.

The error looked like this (for each OSD on that machine):

Feb  7 22:51:47 cartman05 kernel: [878225.044206] INFO: rcu_sched self-detected stall on CPU
Feb  7 22:51:47 cartman05 kernel: [878225.044217]       10: (20999 ticks this GP) idle=9c1/140000000000001/0 softirq=25919109/25919109 fqs=6995 
Feb  7 22:51:47 cartman05 kernel: [878225.044221]        (t=21000 jiffies g=23795691 c=23795690 q=1101)
Feb  7 22:51:47 cartman05 kernel: [878225.044226] Task dump for CPU 10:
Feb  7 22:51:47 cartman05 kernel: [878225.044229] ceph-osd        R  running task    12312  6395      1 0x00000008
Feb  7 22:51:47 cartman05 kernel: [878225.044239]  ffffffff81e49580 ffff88107fc83d88 ffffffff81079c5c 000000000000000a
Feb  7 22:51:47 cartman05 kernel: [878225.044242]  ffffffff81e49580 ffff88107fc83da8 ffffffff8107cead ffff88107fc83de8
Feb  7 22:51:47 cartman05 kernel: [878225.044246]  000000000000000a ffff88107fc83dd8 ffffffff810a5f45 ffff88107fc94780
Feb  7 22:51:47 cartman05 kernel: [878225.044250] Call Trace:
Feb  7 22:51:47 cartman05 kernel: [878225.044253]  <IRQ>  [<ffffffff81079c5c>] sched_show_task+0xcc/0x140
Feb  7 22:51:47 cartman05 kernel: [878225.044272]  [<ffffffff8107cead>] dump_cpu_task+0x3d/0x50
Feb  7 22:51:47 cartman05 kernel: [878225.044279]  [<ffffffff810a5f45>] rcu_dump_cpu_stacks+0x95/0xd0
Feb  7 22:51:47 cartman05 kernel: [878225.044283]  [<ffffffff810a9292>] rcu_check_callbacks+0x442/0x760
Feb  7 22:51:47 cartman05 kernel: [878225.044291]  [<ffffffff810ecdbc>] ? acct_account_cputime+0x1c/0x20
Feb  7 22:51:47 cartman05 kernel: [878225.044295]  [<ffffffff8107dc43>] ? account_system_time+0xc3/0x180
Feb  7 22:51:47 cartman05 kernel: [878225.044300]  [<ffffffff810ae4a8>] update_process_times+0x38/0x70
Feb  7 22:51:47 cartman05 kernel: [878225.044305]  [<ffffffff810bd2e6>] tick_sched_handle.isra.12+0x36/0x50
Feb  7 22:51:47 cartman05 kernel: [878225.044308]  [<ffffffff810bd48b>] tick_sched_timer+0x4b/0x80
Feb  7 22:51:47 cartman05 kernel: [878225.044312]  [<ffffffff810aedc2>] __run_hrtimer+0x82/0x1c0
Feb  7 22:51:47 cartman05 kernel: [878225.044316]  [<ffffffff810bd440>] ? tick_sched_do_timer+0x40/0x40
Feb  7 22:51:47 cartman05 kernel: [878225.044320]  [<ffffffff810af57b>] hrtimer_interrupt+0xfb/0x210
Feb  7 22:51:47 cartman05 kernel: [878225.044328]  [<ffffffff81036149>] local_apic_timer_interrupt+0x39/0x60
Feb  7 22:51:47 cartman05 kernel: [878225.044335]  [<ffffffff8181a845>] smp_apic_timer_interrupt+0x45/0x60
Feb  7 22:51:47 cartman05 kernel: [878225.044342]  [<ffffffff81818dea>] apic_timer_interrupt+0x6a/0x70
Feb  7 22:51:47 cartman05 kernel: [878225.044344]  <EOI>  [<ffffffff811d9bbf>] ? mb_cache_entry_alloc+0x5f/0x1e0
Feb  7 22:51:47 cartman05 kernel: [878225.044354]  [<ffffffff811d9b8c>] ? mb_cache_entry_alloc+0x2c/0x1e0
Feb  7 22:51:47 cartman05 kernel: [878225.044363]  [<ffffffff8124853e>] ext4_xattr_cache_insert+0x2e/0x80
Feb  7 22:51:47 cartman05 kernel: [878225.044367]  [<ffffffff812498a6>] ext4_listxattr+0x176/0x280
Feb  7 22:51:47 cartman05 kernel: [878225.044374]  [<ffffffff811a66ed>] vfs_listxattr+0x4d/0x80
Feb  7 22:51:47 cartman05 kernel: [878225.044378]  [<ffffffff811a678b>] listxattr+0x6b/0x130
Feb  7 22:51:47 cartman05 kernel: [878225.044383]  [<ffffffff811a768f>] SyS_flistxattr+0x5f/0xb0
Feb  7 22:51:47 cartman05 kernel: [878225.044389]  [<ffffffff81817ff2>] system_call_fastpath+0x12/0x17

This kept happening until I hard-reset the machine. See the full log attached.

From a Ceph perspective, the remaining OSDs still tried to contact the ones on this host, but I wasn't even able to kill those OSDs cleanly. They seem to have gotten into some half-baked state which kept the cluster trying to peer the affected PGs (the mon on that host was still up, but the osd map showed the OSDs as down).
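
For reference, the manual intervention I would normally reach for with half-dead OSDs looks roughly like the sketch below (OSD ids taken from the log further down; this is illustrative only, not what I actually managed to run before the IPMI reset):

# Mark the hung OSDs down so their peers stop waiting for subops from them
# (osd.14 and osd.17 are the two later flagged as failed in the log below).
ceph osd down 14 17

# Optionally mark them out so the data is re-replicated elsewhere while
# the host is being reset.
ceph osd out 14 17

# The processes would not terminate cleanly; SIGKILL was the only option
# short of the IPMI reset.
kill -9 $(pidof ceph-osd)

# Watch peering and recovery progress afterwards.
ceph -s
ceph osd tree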

Here's a bit of output from the ceph log from another host:

2016-02-07 22:51:41.518976 osd.16 172.24.4.4:6812/6427 8591 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.331451 secs
2016-02-07 22:51:41.518991 osd.16 172.24.4.4:6812/6427 8592 : cluster [WRN] slow request 15.331451 seconds old, received at 2016-02-07 22:51:26.187423: osd_op(client.5812507.0:36979780 rbd_data.2dd02ae8944a.0000000000000822 [set-alloc-hint object_size 4194304 write_size 4194304,write 36864~446464] 7.20551b34 ack+ondisk+write+known_if_redirected e31777) currently commit_sent
2016-02-07 22:51:41.519003 osd.16 172.24.4.4:6812/6427 8593 : cluster [WRN] slow request 15.326734 seconds old, received at 2016-02-07 22:51:26.192139: osd_op(client.5812507.0:36979781 rbd_data.2dd02ae8944a.0000000000000d41 [set-alloc-hint object_size 4194304 write_size 4194304,write 4128768~4096] 7.63b6768d ack+ondisk+write+known_if_redirected e31777) currently commit_sent
2016-02-07 22:51:42.269884 mon.0 172.24.4.3:6789/0 76740 : cluster [INF] pgmap v60984278: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 20302 B/s wr, 3 op/s
2016-02-07 22:51:42.519169 osd.16 172.24.4.4:6812/6427 8594 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 16.331696 secs
2016-02-07 22:51:42.519180 osd.16 172.24.4.4:6812/6427 8595 : cluster [WRN] slow request 15.380566 seconds old, received at 2016-02-07 22:51:27.138552: osd_repop(client.5822106.0:2137508 15.1 7e80f5d1/rbd_data.54bd032ae8944a.00000000000004c3/head//15 v 31777'1665863) currently commit_sent
2016-02-07 22:51:42.467283 osd.0 172.24.4.3:6800/5433 15852 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.353941 secs
2016-02-07 22:51:42.467295 osd.0 172.24.4.3:6800/5433 15853 : cluster [WRN] slow request 15.353941 seconds old, received at 2016-02-07 22:51:27.102833: osd_op(client.5832500.0:148982071 rbd_data.35dcf74b0dc51.0000000000002201 [set-alloc-hint object_size 4194304 write_size 4194304,write 151552~4096] 13.7dd929da ack+ondisk+write+known_if_redirected e31777) currently waiting for subops from 17
2016-02-07 22:51:42.467305 osd.0 172.24.4.3:6800/5433 15854 : cluster [WRN] slow request 15.353563 seconds old, received at 2016-02-07 22:51:27.103211: osd_op(client.5832500.0:148982076 rbd_data.35dcf74b0dc51.0000000000002201 [set-alloc-hint object_size 4194304 write_size 4194304,write 204800~4096] 13.7dd929da ack+ondisk+write+known_if_redirected e31777) currently waiting for subops from 17
2016-02-07 22:51:43.273276 mon.0 172.24.4.3:6789/0 76743 : cluster [INF] pgmap v60984279: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 8636 B/s rd, 10160 B/s wr, 14 op/s
2016-02-07 22:51:44.286289 mon.0 172.24.4.3:6789/0 76746 : cluster [INF] pgmap v60984280: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 8634 B/s rd, 10157 B/s wr, 14 op/s
2016-02-07 22:51:45.289260 mon.0 172.24.4.3:6789/0 76747 : cluster [INF] pgmap v60984281: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 14218 B/s wr, 2 op/s
2016-02-07 22:51:46.274113 mon.0 172.24.4.3:6789/0 76753 : cluster [INF] osd.14 172.24.4.4:6808/5693 failed (3 reports from 3 peers after 21.000555 >= grace 20.243914)
2016-02-07 22:51:46.307121 mon.0 172.24.4.3:6789/0 76757 : cluster [INF] pgmap v60984282: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 28466 B/s wr, 4 op/s
2016-02-07 22:51:46.313403 mon.0 172.24.4.3:6789/0 76759 : cluster [INF] osd.17 172.24.4.4:6814/6973 failed (3 reports from 3 peers after 21.039530 >= grace 20.000000)
2016-02-07 22:51:46.325869 mon.0 172.24.4.3:6789/0 76760 : cluster [INF] osdmap e31778: 18 osds: 17 up, 18 in
2016-02-07 22:51:46.345860 mon.0 172.24.4.3:6789/0 76761 : cluster [INF] pgmap v60984283: 7741 pgs: 449 stale+active+clean, 7292 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 31656 B/s wr, 5 op/s
2016-02-07 22:51:47.159465 osd.2 172.24.4.3:6804/5797 14825 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.674588 secs

A while later you can see that many PGs are stuck peering:

2016-02-07 22:57:15.550556 mon.0 172.24.4.3:6789/0 77020 : cluster [INF] pgmap v60984440: 7741 pgs: 102 active+degraded+remapped, 105 active+recovery_wait+degraded+remapped, 612 active+undersized+degraded+remapped, 65 activating+degraded+remapped, 240 activating+undersized+degraded+remapped, 4 active+undersized+degraded+remapped+backfilling, 127 remapped+peering, 438 active+remapped, 4636 active+undersized+degraded, 125 activating+remapped, 409 down+peering, 878 active+undersized+degraded+remapped+wait_backfill; 6917 GB data, 7778 GB used, 19442 GB / 27500 GB avail; 110 GB/s rd, 85920 MB/s wr, 13610 kop/s; 1635919/3563414 objects degraded (45.909%); 1073847/3563414 objects misplaced (30.135%); 11196 MB/s, 2824 objects/s recovering
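
If it helps with triage, the stuck PGs can also be listed directly; something along these lines (illustrative commands, not output I captured during the incident):

# Summarize which PGs are unhealthy and why.
ceph health detail

# List PGs stuck in inactive/peering or stale states.
ceph pg dump_stuck inactive
ceph pg dump_stuck stale

# Query a single stuck PG for its peering state (<pgid> is a placeholder).
ceph pg <pgid> query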

There's a lot of log output, so if I can pass you something more specific - let me know!

The original bug is likely a kernel/ext4 problem, but I think Ceph should be able to continue operating in this situation.
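
For context, how quickly the mon marks hung OSDs down (the "failed (3 reports from 3 peers after 21.000555 >= grace 20.243914)" lines above) is governed by the heartbeat and failure-report settings. A rough sketch of how they could be inspected and tightened, with placeholder ids and example values only:

# Inspect the current values on a mon (option names as of hammer/infernalis;
# <id> is a placeholder for the mon id).
ceph daemon mon.<id> config get osd_heartbeat_grace
ceph daemon mon.<id> config get mon_osd_min_down_reporters

# mon_osd_adjust_heartbeat_grace is why one grace above reads 20.243914
# rather than the configured 20 seconds.
ceph daemon mon.<id> config get mon_osd_adjust_heartbeat_grace

# Lower the grace at runtime so hung OSDs are flagged down sooner
# (example value only).
ceph tell 'mon.*' injectargs '--osd_heartbeat_grace 15'
ceph tell 'osd.*' injectargs '--osd_heartbeat_grace 15'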


Files

messages-20160208 (162 KB) - Christian Theune, 02/09/2016 03:18 PM
#2

Updated by Josh Durgin about 8 years ago

  • Project changed from rbd to Ceph
#3

Updated by Christian Theune about 8 years ago

I just had this again on a different server, same traceback. This time Ceph behaved more gracefully, marking the affected OSDs out.

I've got to say, diagnosing this is really hard - Google doesn't help much with the specific error message, so this appears to be fairly arcane. I found this thread discussing locking (contention, or locks not being released) as a potential issue:
https://lkml.org/lkml/2015/5/8/480

For the record: we'll migrate the affected servers to XFS. Our clusters using XFS do not seem to have the same issue, so that supports the assumption that this is ext4-specific.

#4

Updated by Samuel Just about 8 years ago

We do have a watchdog thread that should have killed any OSD with hung threads (see the .*timeout.* and .*suicide_timeout.* settings). Did that not happen quickly enough?
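
For reference, the settings meant here are along the following lines; a sketch assuming hammer-era option names, with the usual defaults noted in comments rather than recommended values:

# OSD internal heartbeat / suicide timeouts; if a worker thread stays hung
# past the suicide timeout, the OSD is supposed to assert and die.
# Run against the admin socket on the OSD's host; <id> is a placeholder.
ceph daemon osd.<id> config get osd_op_thread_timeout              # default ~15s
ceph daemon osd.<id> config get osd_op_thread_suicide_timeout      # default ~150s
ceph daemon osd.<id> config get filestore_op_thread_timeout        # default ~60s
ceph daemon osd.<id> config get filestore_op_thread_suicide_timeout  # default ~180s

# They can be lowered at runtime if the watchdog fires too slowly, e.g.:
ceph tell 'osd.*' injectargs '--osd_op_thread_suicide_timeout 60'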

#5

Updated by Samuel Just about 8 years ago

  • Status changed from New to Rejected

There are already timeouts for this that can be adjusted.
