Bug #17002
pg marked as incomplete during cluster rebalance
Status: Closed
Description
Introduction
Cluster with 320 OSDs. 1152 TB raw data.
One of our storage servers hung. We rebooted it after several hours, which triggered rebalancing. The rebalancing ended with two incomplete PGs, and all writes stopped on two Ceph pools.
20160728 21:01: Machine hangs "out of the blue"
On 20160728 around 21:00, server `zhdk0051` hung. The exact time of the hang
is unclear. No log entries were found that pointed to upcoming
problems.
In `/var/log/syslog`, the last normal entry is from 21:00:01:
Jul 28 20:56:01 zhdk0051 CRON[66763]: (root) CMD ( test -x /usr/sbin/aadm-cron && /usr/sbin/aadm-cron >> /var/log/aadm/log 2>&1)
Jul 28 21:00:01 zhdk0051 CRON[69640]: (root) CMD ( test -x /usr/sbin/aadm-cron && /usr/sbin/aadm-cron >> /var/log/aadm/log 2>&1)
Jul 29 14:59:13 zhdk0051 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1694" x-info="http://www.rsyslog.com"] start
Jul 29 14:59:13 zhdk0051 rsyslogd: rsyslogd's groupid changed to 104
Jul 29 14:59:13 zhdk0051 rsyslogd: rsyslogd's userid changed to 101
One of the Ceph OSD log files has (harmless) messages from 21:00:11:
leinen@zhdk0051:/var/log/ceph$ zgrep '2016-07-28 21' *.log.4.gz
ceph-osd.191.log.4.gz:2016-07-28 21:00:11.801985 7f0c3be72700 0 -- [2001:620:5ca1:101::151]:6817/4976 >> [2001:620:5ca1:100::1029]:6826/67083 pipe(0x138df000 sd=963 :6817 s=0 pgs=0 cs=0 l=0 c=0x35393180).accept connect_seq 1807 vs existing 1807 state standby
ceph-osd.191.log.4.gz:2016-07-28 21:00:11.802451 7f0c3be72700 0 -- [2001:620:5ca1:101::151]:6817/4976 >> [2001:620:5ca1:100::1029]:6826/67083 pipe(0x138df000 sd=963 :6817 s=0 pgs=0 cs=0 l=0 c=0x35393180).accept connect_seq 1808 vs existing 1807 state standby
Nagios sent a `DOWN` notification about the host on `Thu Jul 28
21:05:43 CEST 2016`.
20160728 21:01:50—23:15:32 Successful Ceph recovery
The Ceph mons noticed issues (reported by other OSDs) starting at
21:01:41.782083:
2016-07-28 21:01:41.782083 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.177 [2001:620:5ca1:101::151]:6800/4721 from osd.75 [2001:620:5ca1:100::1021]:6824/84357 is reporting failure:1
2016-07-28 21:01:41.782108 7f925f504700 0 log_channel(cluster) log [DBG] : osd.177 [2001:620:5ca1:101::151]:6800/4721 reported failed by osd.75 [2001:620:5ca1:100::1021]:6824/84357
2016-07-28 21:01:41.872661 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.187 [2001:620:5ca1:101::151]:6824/5036 from osd.118 [2001:620:5ca1:100::1006]:6850/33782 is reporting failure:1
2016-07-28 21:01:41.872717 7f925f504700 0 log_channel(cluster) log [DBG] : osd.187 [2001:620:5ca1:101::151]:6824/5036 reported failed by osd.118 [2001:620:5ca1:100::1006]:6850/33782
2016-07-28 21:01:42.239309 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.197 [2001:620:5ca1:101::151]:6812/4942 from osd.277 [2001:620:5ca1:101::155]:6812/5171 is reporting failure:1
2016-07-28 21:01:42.239357 7f925f504700 0 log_channel(cluster) log [DBG] : osd.197 [2001:620:5ca1:101::151]:6812/4942 reported failed by osd.277 [2001:620:5ca1:101::155]:6812/5171
2016-07-28 21:01:42.707880 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.187 [2001:620:5ca1:101::151]:6824/5036 from osd.8 [2001:620:5ca1:100::1020]:6851/32767 is reporting failure:1
2016-07-28 21:01:42.707921 7f925f504700 0 log_channel(cluster) log [DBG] : osd.187 [2001:620:5ca1:101::151]:6824/5036 reported failed by osd.8 [2001:620:5ca1:100::1020]:6851/32767
2016-07-28 21:01:42.775416 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.177 [2001:620:5ca1:101::151]:6800/4721 from osd.74 [2001:620:5ca1:100::1022]:6801/78054 is reporting failure:1
2016-07-28 21:01:42.775431 7f925f504700 0 log_channel(cluster) log [DBG] : osd.177 [2001:620:5ca1:101::151]:6800/4721 reported failed by osd.74 [2001:620:5ca1:100::1022]:6801/78054
[...]
Between 21:01:50 and 21:01:52, the mons declared all ten OSDs on `zhdk0051` down:
2016-07-28 21:01:50.326891 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.187 down
2016-07-28 21:01:50.340161 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.191 down
2016-07-28 21:01:50.660919 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.179 down
2016-07-28 21:01:50.781420 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.193 down
2016-07-28 21:01:50.851441 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.182 down
2016-07-28 21:01:50.980782 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.189 down
2016-07-28 21:01:50.986454 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.195 down
2016-07-28 21:01:51.283271 7f925f504700 1 mon.zhdk0009@0(leader).osd e150017 we have enough reports/reporters to mark osd.177 down
2016-07-28 21:01:51.490620 7f92629e9700 1 mon.zhdk0009@0(leader).osd e150018 we have enough reports/reporters to mark osd.185 down
2016-07-28 21:01:51.620752 7f925f504700 1 mon.zhdk0009@0(leader).osd e150018 we have enough reports/reporters to mark osd.197 down
Recovery started, and five minutes later the OSDs were declared out:
2016-07-28 21:06:57.750169 7f925fd05700 0 log_channel(cluster) log [INF] : osd.177 out (down for 306.280513)
2016-07-28 21:06:57.750311 7f925fd05700 0 log_channel(cluster) log [INF] : osd.179 out (down for 306.280513)
2016-07-28 21:06:57.750355 7f925fd05700 0 log_channel(cluster) log [INF] : osd.182 out (down for 306.280512)
2016-07-28 21:06:57.750413 7f925fd05700 0 log_channel(cluster) log [INF] : osd.185 out (down for 305.210924)
2016-07-28 21:06:57.750460 7f925fd05700 0 log_channel(cluster) log [INF] : osd.187 out (down for 306.280512)
2016-07-28 21:06:57.750519 7f925fd05700 0 log_channel(cluster) log [INF] : osd.189 out (down for 306.280512)
2016-07-28 21:06:57.750607 7f925fd05700 0 log_channel(cluster) log [INF] : osd.191 out (down for 306.280512)
2016-07-28 21:06:57.750754 7f925fd05700 0 log_channel(cluster) log [INF] : osd.193 out (down for 306.280512)
2016-07-28 21:06:57.750807 7f925fd05700 0 log_channel(cluster) log [INF] : osd.195 out (down for 306.280511)
2016-07-28 21:06:57.750872 7f925fd05700 0 log_channel(cluster) log [INF] : osd.197 out (down for 305.210923)
Recovery finished at 23:15:32, roughly 134 minutes after the outage.
2016-07-28 23:15:20.327404 7f92629e9700 0 log_channel(cluster) log [INF] : pgmap v53000803: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 22070 kB/s rd, 17638 kB/s wr, 4185 op/s; 4075/115877199 objects degraded (0.004%); 4172/115877199 objects misplaced (0.004%); 134 MB/s, 33 objects/s recovering
2016-07-28 23:15:23.754936 7f92629e9700 0 log_channel(cluster) log [INF] : pgmap v53000806: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 22889 kB/s rd, 25897 kB/s wr, 4197 op/s; 4075/115877277 objects degraded (0.004%); 4106/115877277 objects misplaced (0.004%); 128 MB/s, 30 objects/s recovering
2016-07-28 23:15:24.942932 7f92629e9700 0 log_channel(cluster) log [INF] : pgmap v53000807: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 32095 kB/s rd, 23545 kB/s wr, 2779 op/s; 4075/115877112 objects degraded (0.004%); 4106/115877112 objects misplaced (0.004%); 127 MB/s, 30 objects/s recovering
2016-07-28 23:15:30.836000 7f92629e9700 0 log_channel(cluster) log [INF] : pgmap v53000814: 10704 pgs: 10704 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 60888 kB/s rd, 19547 kB/s wr, 5251 op/s; 39494 kB/s, 9 objects/s recovering
2016-07-28 23:15:31.923399 7f92629e9700 0 log_channel(cluster) log [INF] : pgmap v53000815: 10704 pgs: 10704 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 34487 kB/s rd, 25203 kB/s wr, 4139 op/s; 39913 kB/s, 9 objects/s recovering
2016-07-29 14:59: Server reboot
It took a while for the team to notice the outage and bring the machine back up.
The server was rebooted with a "power reset" command issued through ipmitool.
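For reference, such a reset is issued with `ipmitool`'s `chassis power reset` subcommand; a dry-run sketch (the BMC hostname and user below are made-up placeholders, and the `echo` prints the command instead of executing it):

```shell
# Dry run: print the IPMI reset command instead of executing it.
# BMC host and user name are hypothetical placeholders.
cmd="ipmitool -I lanplus -H zhdk0051-bmc -U ADMIN chassis power reset"
echo "$cmd"
```

On a real server you would also supply the BMC password, e.g. with `-P` or `-E`.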
In the syslog and kern.log there is no hint of why the server
could have stopped responding.
After the reboot, Ceph tried to recover again, but 2 PGs were
left in a bad state and needed manual intervention. This led to a long
blockage of many RBD clients.
Troubleshooting from 2016-07-29 to 2016-07-30 6:50 pm
The recovery caused slow requests on two OSDs on the server zhdk0051, and 2 PGs got marked as incomplete.
We think the incomplete state was caused by a race condition in the writes while the acting primary OSDs
on zhdk0051 were slow and unresponsive. Probably, for a short time, there were two OSDs acting as primary for these PGs, and this left Ceph in an inconsistent state.
The recovery-related settings in our `ceph.conf` were the following:
osd max backfills = 3
osd recovery max active = 5
osd recovery op priority = 2
The PGs were `5.37c` and `6.106`. They both had their first acting OSD on zhdk0051: OSDs 179 and 187.
We were able to query both PGs to get troubleshooting information with the following commands:
ceph pg 5.37c query > 5.37c.txt
ceph pg 6.106 query > 6.106.txt
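From dumps like these, the acting primary can be pulled out with a quick grep; a minimal sketch, assuming the dump contains an `acting_primary` field (the JSON fragment below is a made-up sample for illustration, not real query output):

```shell
# Illustrative only: a made-up fragment of a saved `ceph pg query` dump,
# and a grep to extract the acting primary OSD id from it.
cat > /tmp/pg-query-sample.txt <<'EOF'
{ "state": "incomplete",
  "up": [ 179, 64, 201 ],
  "acting": [ 179, 64, 201 ],
  "acting_primary": 179 }
EOF
primary=$(grep -o '"acting_primary": *[0-9]*' /tmp/pg-query-sample.txt | grep -o '[0-9]*$')
echo "acting primary: osd.$primary"
```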
To reduce the load on these OSDs,
we tried reweighting them to 0 so that no user writes would land on them,
but this did not make the PGs complete again.
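The reweighting step can be sketched as follows; a dry run, assuming the two OSD ids named above (the leading `echo` prints the commands instead of executing them):

```shell
# Dry-run sketch: reweight OSDs 179 and 187 to 0 so user traffic stops
# writing to them. Remove `echo` to actually run this on a cluster.
planned=0
for id in 179 187; do
  echo ceph osd reweight $id 0
  planned=$((planned+1))
done
```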
We tried several standard things, like setting the OSDs down and out, and then up and in again. All our attempts just
caused a lot of rebalancing.
At some point we could no longer query the PG `5.37c`. To understand where the query was hanging, we ran it with messenger debugging enabled:
ceph --debug_ms 1 pg 5.37c query
and we could see that it was not possible to talk to the primary OSD.
We used `ss` to make sure the OSD was correctly listening on its TCP socket:
ss -l -p | grep <tcpport>
Finally, we just set all OSDs up and in. We had 0 remappings, 0 lost objects, 0 degraded. But the two PGs were still incomplete.
We checked whether, for instance, the OSD process had run into kernel limits, but the following values looked OK:
/proc/sys/fs/file-max /proc/`pidof theosd`/limits
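A minimal sketch of that check, comparing open file descriptors against the per-process and system-wide limits (on a real node you would substitute `$(pidof ceph-osd)` for `$$`; this sketch inspects the current shell instead):

```shell
# Hypothetical fd-limit check; $$ stands in for the OSD's pid.
pid=$$
open_fds=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
soft_limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits 2>/dev/null)
sys_max=$(cat /proc/sys/fs/file-max 2>/dev/null)
echo "open=${open_fds} per-process-limit=${soft_limit:-n/a} system-max=${sys_max:-n/a}"
```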
Settings to reduce load on the cluster
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set norecover
ceph osd set nobackfill
ceph tell osd.* injectargs -- --osd_max_backfills 1
ceph tell osd.* injectargs -- --osd_recovery_max_active 1
With these settings we tried restarting all the OSDs on zhdk0051 one by one.
To restart a single OSD (like a soft reboot), just do:
ceph osd down osd.<id>
This did not bring the PGs back up.
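The one-by-one restart can be sketched as a loop over the ten OSD ids from the mon log above; a dry run (the `echo` prints each command rather than running it):

```shell
# Dry run of the one-by-one OSD restart on zhdk0051.
# Drop the `echo` to actually mark each OSD down.
restarted=0
for id in 177 179 182 185 187 189 191 193 195 197; do
  echo ceph osd down osd.$id
  # sleep 60   # give each OSD time to rejoin before restarting the next one
  restarted=$((restarted+1))
done
```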
How we finally fixed the problem
Enable recovery and backfill (if disabled):
ceph osd unset norecover
ceph osd unset nobackfill
Make the rebalancing as slow as possible:
ceph tell osd.* injectargs -- --osd_recovery_max_active 1
ceph tell osd.* injectargs -- --osd_max_backfills 1
Find the current acting primary for the incomplete PG:
ceph pg 6.106 query | grep acting_primary
In this case it is osd.179. Then:
ceph tell osd.179 injectargs -- --osd_find_best_info_ignore_history_les=1
ceph osd down osd.179
If the PG doesn't go active directly, it is because another OSD knows about the history of that PG and won't let it activate yet. To find that other, _new_ primary OSD, run again:
ceph pg <pgid> query | grep acting_primary
Then repeat the step above (`osd_find_best_info_ignore_history_les=1`) on that new OSD. Eventually the PG will activate, and after a bit of backfilling it will be active+clean.
Then reset `osd_find_best_info_ignore_history_les` to 0 on the OSDs touched above.
Repeat this for all incomplete PGs -- each of them may need `osd_find_best_info_ignore_history_les=1` on one or more OSDs.
At the end do:
ceph tell osd.* injectargs -- --osd_find_best_info_ignore_history_les=0
to make sure it's disabled everywhere.
Updated by Sage Weil almost 7 years ago
- Status changed from New to Won't Fix
Thanks for your detailed account! This is a known bug on hammer (known in that it happens sometimes; the cause is not completely clear). We haven't seen it on jewel and believe it has been fixed there. Please reopen if you see any need for the ignore_history_les option on jewel!