Bug #14708
Semi-crashed OSDs cause cluster to hang
Status: Closed
Description
This weekend we triggered a weird kernel/ext4 bug on one of our servers that left the OSD processes running but unresponsive. The host itself was in bad shape as well and needed a hard (IPMI) reset.
The error looked like this (for each OSD on that machine):
Feb 7 22:51:47 cartman05 kernel: [878225.044206] INFO: rcu_sched self-detected stall on CPU
Feb 7 22:51:47 cartman05 kernel: [878225.044217] 10: (20999 ticks this GP) idle=9c1/140000000000001/0 softirq=25919109/25919109 fqs=6995
Feb 7 22:51:47 cartman05 kernel: [878225.044221] (t=21000 jiffies g=23795691 c=23795690 q=1101)
Feb 7 22:51:47 cartman05 kernel: [878225.044226] Task dump for CPU 10:
Feb 7 22:51:47 cartman05 kernel: [878225.044229] ceph-osd R running task 12312 6395 1 0x00000008
Feb 7 22:51:47 cartman05 kernel: [878225.044239] ffffffff81e49580 ffff88107fc83d88 ffffffff81079c5c 000000000000000a
Feb 7 22:51:47 cartman05 kernel: [878225.044242] ffffffff81e49580 ffff88107fc83da8 ffffffff8107cead ffff88107fc83de8
Feb 7 22:51:47 cartman05 kernel: [878225.044246] 000000000000000a ffff88107fc83dd8 ffffffff810a5f45 ffff88107fc94780
Feb 7 22:51:47 cartman05 kernel: [878225.044250] Call Trace:
Feb 7 22:51:47 cartman05 kernel: [878225.044253] <IRQ> [<ffffffff81079c5c>] sched_show_task+0xcc/0x140
Feb 7 22:51:47 cartman05 kernel: [878225.044272] [<ffffffff8107cead>] dump_cpu_task+0x3d/0x50
Feb 7 22:51:47 cartman05 kernel: [878225.044279] [<ffffffff810a5f45>] rcu_dump_cpu_stacks+0x95/0xd0
Feb 7 22:51:47 cartman05 kernel: [878225.044283] [<ffffffff810a9292>] rcu_check_callbacks+0x442/0x760
Feb 7 22:51:47 cartman05 kernel: [878225.044291] [<ffffffff810ecdbc>] ? acct_account_cputime+0x1c/0x20
Feb 7 22:51:47 cartman05 kernel: [878225.044295] [<ffffffff8107dc43>] ? account_system_time+0xc3/0x180
Feb 7 22:51:47 cartman05 kernel: [878225.044300] [<ffffffff810ae4a8>] update_process_times+0x38/0x70
Feb 7 22:51:47 cartman05 kernel: [878225.044305] [<ffffffff810bd2e6>] tick_sched_handle.isra.12+0x36/0x50
Feb 7 22:51:47 cartman05 kernel: [878225.044308] [<ffffffff810bd48b>] tick_sched_timer+0x4b/0x80
Feb 7 22:51:47 cartman05 kernel: [878225.044312] [<ffffffff810aedc2>] __run_hrtimer+0x82/0x1c0
Feb 7 22:51:47 cartman05 kernel: [878225.044316] [<ffffffff810bd440>] ? tick_sched_do_timer+0x40/0x40
Feb 7 22:51:47 cartman05 kernel: [878225.044320] [<ffffffff810af57b>] hrtimer_interrupt+0xfb/0x210
Feb 7 22:51:47 cartman05 kernel: [878225.044328] [<ffffffff81036149>] local_apic_timer_interrupt+0x39/0x60
Feb 7 22:51:47 cartman05 kernel: [878225.044335] [<ffffffff8181a845>] smp_apic_timer_interrupt+0x45/0x60
Feb 7 22:51:47 cartman05 kernel: [878225.044342] [<ffffffff81818dea>] apic_timer_interrupt+0x6a/0x70
Feb 7 22:51:47 cartman05 kernel: [878225.044344] <EOI> [<ffffffff811d9bbf>] ? mb_cache_entry_alloc+0x5f/0x1e0
Feb 7 22:51:47 cartman05 kernel: [878225.044354] [<ffffffff811d9b8c>] ? mb_cache_entry_alloc+0x2c/0x1e0
Feb 7 22:51:47 cartman05 kernel: [878225.044363] [<ffffffff8124853e>] ext4_xattr_cache_insert+0x2e/0x80
Feb 7 22:51:47 cartman05 kernel: [878225.044367] [<ffffffff812498a6>] ext4_listxattr+0x176/0x280
Feb 7 22:51:47 cartman05 kernel: [878225.044374] [<ffffffff811a66ed>] vfs_listxattr+0x4d/0x80
Feb 7 22:51:47 cartman05 kernel: [878225.044378] [<ffffffff811a678b>] listxattr+0x6b/0x130
Feb 7 22:51:47 cartman05 kernel: [878225.044383] [<ffffffff811a768f>] SyS_flistxattr+0x5f/0xb0
Feb 7 22:51:47 cartman05 kernel: [878225.044389] [<ffffffff81817ff2>] system_call_fastpath+0x12/0x17
This kept happening for a while until I hard-reset the machine. See the full log attached.
From a Ceph perspective the remaining OSDs kept trying to contact the ones on this host, and I wasn't even able to kill those OSD processes cleanly. They seem to have ended up in a half-dead state that left the cluster stuck trying to peer the affected PGs ... (the mon on that host was still up, but the osd map showed its OSDs as down).
Here's a bit of output from the ceph log from another host:
2016-02-07 22:51:41.518976 osd.16 172.24.4.4:6812/6427 8591 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.331451 secs
2016-02-07 22:51:41.518991 osd.16 172.24.4.4:6812/6427 8592 : cluster [WRN] slow request 15.331451 seconds old, received at 2016-02-07 22:51:26.187423: osd_op(client.5812507.0:36979780 rbd_data.2dd02ae8944a.0000000000000822 [set-alloc-hint object_size 4194304 write_size 4194304,write 36864~446464] 7.20551b34 ack+ondisk+write+known_if_redirected e31777) currently commit_sent
2016-02-07 22:51:41.519003 osd.16 172.24.4.4:6812/6427 8593 : cluster [WRN] slow request 15.326734 seconds old, received at 2016-02-07 22:51:26.192139: osd_op(client.5812507.0:36979781 rbd_data.2dd02ae8944a.0000000000000d41 [set-alloc-hint object_size 4194304 write_size 4194304,write 4128768~4096] 7.63b6768d ack+ondisk+write+known_if_redirected e31777) currently commit_sent
2016-02-07 22:51:42.269884 mon.0 172.24.4.3:6789/0 76740 : cluster [INF] pgmap v60984278: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 20302 B/s wr, 3 op/s
2016-02-07 22:51:42.519169 osd.16 172.24.4.4:6812/6427 8594 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 16.331696 secs
2016-02-07 22:51:42.519180 osd.16 172.24.4.4:6812/6427 8595 : cluster [WRN] slow request 15.380566 seconds old, received at 2016-02-07 22:51:27.138552: osd_repop(client.5822106.0:2137508 15.1 7e80f5d1/rbd_data.54bd032ae8944a.00000000000004c3/head//15 v 31777'1665863) currently commit_sent
2016-02-07 22:51:42.467283 osd.0 172.24.4.3:6800/5433 15852 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.353941 secs
2016-02-07 22:51:42.467295 osd.0 172.24.4.3:6800/5433 15853 : cluster [WRN] slow request 15.353941 seconds old, received at 2016-02-07 22:51:27.102833: osd_op(client.5832500.0:148982071 rbd_data.35dcf74b0dc51.0000000000002201 [set-alloc-hint object_size 4194304 write_size 4194304,write 151552~4096] 13.7dd929da ack+ondisk+write+known_if_redirected e31777) currently waiting for subops from 17
2016-02-07 22:51:42.467305 osd.0 172.24.4.3:6800/5433 15854 : cluster [WRN] slow request 15.353563 seconds old, received at 2016-02-07 22:51:27.103211: osd_op(client.5832500.0:148982076 rbd_data.35dcf74b0dc51.0000000000002201 [set-alloc-hint object_size 4194304 write_size 4194304,write 204800~4096] 13.7dd929da ack+ondisk+write+known_if_redirected e31777) currently waiting for subops from 17
2016-02-07 22:51:43.273276 mon.0 172.24.4.3:6789/0 76743 : cluster [INF] pgmap v60984279: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 8636 B/s rd, 10160 B/s wr, 14 op/s
2016-02-07 22:51:44.286289 mon.0 172.24.4.3:6789/0 76746 : cluster [INF] pgmap v60984280: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 8634 B/s rd, 10157 B/s wr, 14 op/s
2016-02-07 22:51:45.289260 mon.0 172.24.4.3:6789/0 76747 : cluster [INF] pgmap v60984281: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 14218 B/s wr, 2 op/s
2016-02-07 22:51:46.274113 mon.0 172.24.4.3:6789/0 76753 : cluster [INF] osd.14 172.24.4.4:6808/5693 failed (3 reports from 3 peers after 21.000555 >= grace 20.243914)
2016-02-07 22:51:46.307121 mon.0 172.24.4.3:6789/0 76757 : cluster [INF] pgmap v60984282: 7741 pgs: 7741 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 28466 B/s wr, 4 op/s
2016-02-07 22:51:46.313403 mon.0 172.24.4.3:6789/0 76759 : cluster [INF] osd.17 172.24.4.4:6814/6973 failed (3 reports from 3 peers after 21.039530 >= grace 20.000000)
2016-02-07 22:51:46.325869 mon.0 172.24.4.3:6789/0 76760 : cluster [INF] osdmap e31778: 18 osds: 17 up, 18 in
2016-02-07 22:51:46.345860 mon.0 172.24.4.3:6789/0 76761 : cluster [INF] pgmap v60984283: 7741 pgs: 449 stale+active+clean, 7292 active+clean; 7019 GB data, 13967 GB used, 35029 GB / 49500 GB avail; 31656 B/s wr, 5 op/s
2016-02-07 22:51:47.159465 osd.2 172.24.4.3:6804/5797 14825 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 15.674588 secs
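The "failed (3 reports from 3 peers after 21.000555 >= grace 20.243914)" lines above show the monitor's failure-report logic finally kicking in. As a rough illustration of that rule (the real OSDMonitor code is more involved, with reporter weighting and laggy adjustment of the grace, so treat this as a simplified model, not Ceph's actual implementation):

```python
# Simplified model of how the mon decides to mark an OSD down, based on the
# "failed (N reports from N peers after X >= grace Y)" log lines above.
# This is an illustration only; the real logic also weights reporters and
# adjusts the grace period for laggy OSDs.

def should_mark_down(reports, elapsed, grace=20.0, min_reporters=3):
    """reports: number of distinct peer OSDs reporting the failure,
    elapsed: seconds since the first failure report arrived."""
    return reports >= min_reporters and elapsed >= grace

# osd.17 from the log: 3 reports from 3 peers after 21.039530s >= grace 20.0
print(should_mark_down(3, 21.039530))  # True -> marked down
print(should_mark_down(2, 21.039530))  # False -> too few reporters
print(should_mark_down(3, 15.0))       # False -> still within grace
```

The ~21 s it took to mark the OSDs down matches what you'd expect from a 20 s grace period; the problem described here is everything that happened around that window.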
A while later you can see that many PGs are stuck peering:
2016-02-07 22:57:15.550556 mon.0 172.24.4.3:6789/0 77020 : cluster [INF] pgmap v60984440: 7741 pgs: 102 active+degraded+remapped, 105 active+recovery_wait+degraded+remapped, 612 active+undersized+degraded+remapped, 65 activating+degraded+remapped, 240 activating+undersized+degraded+remapped, 4 active+undersized+degraded+remapped+backfilling, 127 remapped+peering, 438 active+remapped, 4636 active+undersized+degraded, 125 activating+remapped, 409 down+peering, 878 active+undersized+degraded+remapped+wait_backfill; 6917 GB data, 7778 GB used, 19442 GB / 27500 GB avail; 110 GB/s rd, 85920 MB/s wr, 13610 kop/s; 1635919/3563414 objects degraded (45.909%); 1073847/3563414 objects misplaced (30.135%); 11196 MB/s, 2824 objects/s recovering
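For tallying up lines like that, I've been using a small throwaway parser along these lines (the state list is pasted from the pgmap line above; it's just string splitting, nothing Ceph-specific):

```python
import re

# The "pgs:" section of the pgmap summary line above, pasted verbatim.
pg_states = ("102 active+degraded+remapped, 105 active+recovery_wait+degraded+remapped, "
             "612 active+undersized+degraded+remapped, 65 activating+degraded+remapped, "
             "240 activating+undersized+degraded+remapped, "
             "4 active+undersized+degraded+remapped+backfilling, 127 remapped+peering, "
             "438 active+remapped, 4636 active+undersized+degraded, 125 activating+remapped, "
             "409 down+peering, 878 active+undersized+degraded+remapped+wait_backfill")

# Collect "<count> <state>" pairs into a dict.
counts = {}
for count, state in re.findall(r"(\d+) ([a-z+_]+)", pg_states):
    counts[state] = counts.get(state, 0) + int(count)

peering = sum(n for s, n in counts.items() if "peering" in s)
total = sum(counts.values())
print(f"{peering} of {total} PGs stuck peering")  # 536 of 7741 PGs stuck peering
```

The counts add up to the 7741 PGs the mon reports, with 536 of them (127 remapped+peering plus 409 down+peering) stuck peering.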
There's a lot of log output, so let me know if I can pass you something more specific!
The underlying bug is likely a kernel/ext4 problem, but I'd expect Ceph to be able to keep operating in this situation.
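For reference, the settings that govern how quickly the mons give up on unresponsive OSDs are the heartbeat/failure-report options. The fragment below only illustrates which knobs are involved; the values shown are my understanding of the defaults at the time (the log's "grace 20.000000" and the adjusted "grace 20.243914" are consistent with them), not something we have tuned:

```ini
# Illustrative only -- knobs involved in marking unresponsive OSDs down.
[mon]
mon osd min down reporters = 3       ; peers that must report a failure
mon osd adjust heartbeat grace = true ; scale grace for historically laggy OSDs

[osd]
osd heartbeat grace = 20             ; seconds without heartbeat before reporting
```

None of these help with the core issue here, though: the OSD processes were alive enough to hold TCP sessions but too wedged to make progress.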