Bug #17002 (closed)

pg marked as incomplete during cluster rebalance

Added by Saverio Proto over 7 years ago. Updated almost 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Introduction

Cluster with 320 OSDs. 1152 TB raw data.

One of our storage servers hung. We rebooted this server after several hours and this caused rebalancing. The rebalancing ended with two incomplete pgs, and all writes stopped on two ceph pools.

20160728 21:01: Machine hangs "out of the blue"

On 20160728 around 21:00, server `zhdk0051` hung. The exact time of the hang
is unclear. No log entries were found that pointed to upcoming
problems.

In `/var/log/syslog`, the last normal entry is from 21:00:01:

Jul 28 20:56:01 zhdk0051 CRON[66763]: (root) CMD ( test -x /usr/sbin/aadm-cron && /usr/sbin/aadm-cron >> /var/log/aadm/log 2>&1)
Jul 28 21:00:01 zhdk0051 CRON[69640]: (root) CMD ( test -x /usr/sbin/aadm-cron && /usr/sbin/aadm-cron >> /var/log/aadm/log 2>&1)
Jul 29 14:59:13 zhdk0051 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1694" x-info="http://www.rsyslog.com"] start
Jul 29 14:59:13 zhdk0051 rsyslogd: rsyslogd's groupid changed to 104
Jul 29 14:59:13 zhdk0051 rsyslogd: rsyslogd's userid changed to 101

One of the Ceph OSD log files has (harmless) messages from 21:00:11:

leinen@zhdk0051:/var/log/ceph$ zgrep '2016-07-28 21' *.log.4.gz
ceph-osd.191.log.4.gz:2016-07-28 21:00:11.801985 7f0c3be72700  0 -- [2001:620:5ca1:101::151]:6817/4976 >> [2001:620:5ca1:100::1029]:6826/67083 pipe(0x138df000 sd=963 :6817 s=0 pgs=0 cs=0 l=0 c=0x35393180).accept connect_seq 1807 vs existing 1807 state standby
ceph-osd.191.log.4.gz:2016-07-28 21:00:11.802451 7f0c3be72700  0 -- [2001:620:5ca1:101::151]:6817/4976 >> [2001:620:5ca1:100::1029]:6826/67083 pipe(0x138df000 sd=963 :6817 s=0 pgs=0 cs=0 l=0 c=0x35393180).accept connect_seq 1808 vs existing 1807 state standby

Nagios sent a `DOWN` notification about the host on `Thu Jul 28
21:05:43 CEST 2016`.

20160728 21:01:50—23:15:32 Successful Ceph recovery

The Ceph mons noticed issues (reported by other OSDs) starting at
21:01:41.782083:

2016-07-28 21:01:41.782083 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.177 [2001:620:5ca1:101::151]:6800/4721 from osd.75 [2001:620:5ca1:100::1021]:6824/84357 is reporting failure:1
2016-07-28 21:01:41.782108 7f925f504700  0 log_channel(cluster) log [DBG] : osd.177 [2001:620:5ca1:101::151]:6800/4721 reported failed by osd.75 [2001:620:5ca1:100::1021]:6824/84357
2016-07-28 21:01:41.872661 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.187 [2001:620:5ca1:101::151]:6824/5036 from osd.118 [2001:620:5ca1:100::1006]:6850/33782 is reporting failure:1
2016-07-28 21:01:41.872717 7f925f504700  0 log_channel(cluster) log [DBG] : osd.187 [2001:620:5ca1:101::151]:6824/5036 reported failed by osd.118 [2001:620:5ca1:100::1006]:6850/33782
2016-07-28 21:01:42.239309 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.197 [2001:620:5ca1:101::151]:6812/4942 from osd.277 [2001:620:5ca1:101::155]:6812/5171 is reporting failure:1
2016-07-28 21:01:42.239357 7f925f504700  0 log_channel(cluster) log [DBG] : osd.197 [2001:620:5ca1:101::151]:6812/4942 reported failed by osd.277 [2001:620:5ca1:101::155]:6812/5171
2016-07-28 21:01:42.707880 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.187 [2001:620:5ca1:101::151]:6824/5036 from osd.8 [2001:620:5ca1:100::1020]:6851/32767 is reporting failure:1
2016-07-28 21:01:42.707921 7f925f504700  0 log_channel(cluster) log [DBG] : osd.187 [2001:620:5ca1:101::151]:6824/5036 reported failed by osd.8 [2001:620:5ca1:100::1020]:6851/32767
2016-07-28 21:01:42.775416 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017 prepare_failure osd.177 [2001:620:5ca1:101::151]:6800/4721 from osd.74 [2001:620:5ca1:100::1022]:6801/78054 is reporting failure:1
2016-07-28 21:01:42.775431 7f925f504700  0 log_channel(cluster) log [DBG] : osd.177 [2001:620:5ca1:101::151]:6800/4721 reported failed by osd.74 [2001:620:5ca1:100::1022]:6801/78054
[...]

Between 21:01:50 and 21:01:52, the mons declared all ten OSDs on `zhdk0051` down:

2016-07-28 21:01:50.326891 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.187 down
2016-07-28 21:01:50.340161 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.191 down
2016-07-28 21:01:50.660919 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.179 down
2016-07-28 21:01:50.781420 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.193 down
2016-07-28 21:01:50.851441 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.182 down
2016-07-28 21:01:50.980782 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.189 down
2016-07-28 21:01:50.986454 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.195 down
2016-07-28 21:01:51.283271 7f925f504700  1 mon.zhdk0009@0(leader).osd e150017  we have enough reports/reporters to mark osd.177 down
2016-07-28 21:01:51.490620 7f92629e9700  1 mon.zhdk0009@0(leader).osd e150018  we have enough reports/reporters to mark osd.185 down
2016-07-28 21:01:51.620752 7f925f504700  1 mon.zhdk0009@0(leader).osd e150018  we have enough reports/reporters to mark osd.197 down

Recovery started, and five minutes later the OSDs were declared out:

2016-07-28 21:06:57.750169 7f925fd05700  0 log_channel(cluster) log [INF] : osd.177 out (down for 306.280513)
2016-07-28 21:06:57.750311 7f925fd05700  0 log_channel(cluster) log [INF] : osd.179 out (down for 306.280513)
2016-07-28 21:06:57.750355 7f925fd05700  0 log_channel(cluster) log [INF] : osd.182 out (down for 306.280512)
2016-07-28 21:06:57.750413 7f925fd05700  0 log_channel(cluster) log [INF] : osd.185 out (down for 305.210924)
2016-07-28 21:06:57.750460 7f925fd05700  0 log_channel(cluster) log [INF] : osd.187 out (down for 306.280512)
2016-07-28 21:06:57.750519 7f925fd05700  0 log_channel(cluster) log [INF] : osd.189 out (down for 306.280512)
2016-07-28 21:06:57.750607 7f925fd05700  0 log_channel(cluster) log [INF] : osd.191 out (down for 306.280512)
2016-07-28 21:06:57.750754 7f925fd05700  0 log_channel(cluster) log [INF] : osd.193 out (down for 306.280512)
2016-07-28 21:06:57.750807 7f925fd05700  0 log_channel(cluster) log [INF] : osd.195 out (down for 306.280511)
2016-07-28 21:06:57.750872 7f925fd05700  0 log_channel(cluster) log [INF] : osd.197 out (down for 305.210923)

Recovery finished at 23:15:32, roughly 134 minutes after the outage.

2016-07-28 23:15:20.327404 7f92629e9700  0 log_channel(cluster) log [INF] : pgmap v53000803: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 22070 kB/s rd, 17638 kB/s wr, 4185 op/s; 4075/115877199 objects degraded (0.004%); 4172/115877199 objects misplaced (0.004%); 134 MB/s, 33 objects/s recovering
2016-07-28 23:15:23.754936 7f92629e9700  0 log_channel(cluster) log [INF] : pgmap v53000806: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 22889 kB/s rd, 25897 kB/s wr, 4197 op/s; 4075/115877277 objects degraded (0.004%); 4106/115877277 objects misplaced (0.004%); 128 MB/s, 30 objects/s recovering
2016-07-28 23:15:24.942932 7f92629e9700  0 log_channel(cluster) log [INF] : pgmap v53000807: 10704 pgs: 1 active+undersized+degraded+remapped+backfilling, 10703 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 32095 kB/s rd, 23545 kB/s wr, 2779 op/s; 4075/115877112 objects degraded (0.004%); 4106/115877112 objects misplaced (0.004%); 127 MB/s, 30 objects/s recovering
2016-07-28 23:15:30.836000 7f92629e9700  0 log_channel(cluster) log [INF] : pgmap v53000814: 10704 pgs: 10704 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 60888 kB/s rd, 19547 kB/s wr, 5251 op/s; 39494 kB/s, 9 objects/s recovering
2016-07-28 23:15:31.923399 7f92629e9700  0 log_channel(cluster) log [INF] : pgmap v53000815: 10704 pgs: 10704 active+clean; 127 TB data, 436 TB used, 680 TB / 1116 TB avail; 34487 kB/s rd, 25203 kB/s wr, 4139 op/s; 39913 kB/s, 9 objects/s recovering

2016-07-29 14:59: Server reboot

It took a while for the team to notice the outage and bring the machine back up.
The server was brought back with a "power reset" command issued through ipmitool.

Neither syslog nor kern.log contains any hint as to why the server could have stopped responding.

After the reboot, Ceph tried to recover again, but 2 PGs were left in a bad state and needed manual intervention. This led to many RBD clients being blocked for a long time.

Troubleshooting from 2016-07-29 to 2016-07-30 6:50 pm

The recovery caused slow requests on two OSDs on the server zhdk0051, and 2 PGs got marked as incomplete.
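
For reference, incomplete PGs can be listed from the cluster health output; the following is a sketch rather than a command taken from the original report:

# Output formats may vary between releases.
ceph health detail | grep incomplete
ceph pg dump_stuck inactive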

We think the incomplete state was caused by a race condition in the writes while the acting primary OSDs on zhdk0051 were slow and unresponsive. Probably, for a short time, there were two OSDs acting primary for these PGs, and this left Ceph in an inconsistent state.

The recovery-related settings in our `ceph.conf` were the following:

osd max backfills = 3
osd recovery max active = 5
osd recovery op priority = 2
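
For completeness, the values actually in effect on a running OSD can be read through the admin socket; this is a sketch, with osd.179 used only as an example and run on that OSD's host:

# Reads the live value via the OSD's admin socket.
ceph daemon osd.179 config get osd_max_backfills
ceph daemon osd.179 config get osd_recovery_max_active
ceph daemon osd.179 config get osd_recovery_op_priority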

The PGs were `5.37c` and `6.106`. Both had their first acting OSD on zhdk0051; the two OSDs were osd.179 and osd.187.

We were able to query both PGs to get troubleshooting information with the following commands:

ceph pg 5.37c query > 5.37c.txt
ceph pg 6.106 query > 6.106.txt
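
The interesting parts of an incomplete PG's query output are the peering details; a quick way to pull them out of the saved files is sketched below (the field names are what we would expect from a hammer-era `pg query` and may differ in other releases):

# Pull peering-related fields out of the saved JSON output.
grep -nE '"state"|recovery_state|blocked_by|down_osds_we_would_probe' 6.106.txt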

To lower the load on these OSDs, we tried to reweight them to 0 so that no user-traffic writes would land on them, but this did not make the PGs complete again.
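
The reweight step looked roughly like this (a sketch, using the two OSDs named above):

# Setting the override reweight to 0 makes CRUSH treat the OSD as if it were out,
# so no new user writes are mapped to it.
ceph osd reweight 179 0
ceph osd reweight 187 0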

We tried several standard things, like setting the OSDs down and out, and then up and in again. All our attempts just caused a lot of rebalancing.
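
These "standard things" amount to commands along these lines (a sketch, with osd.179 as the example):

# Mark the OSD out (its data gets remapped away), then back in again.
ceph osd out 179
ceph osd in 179
# Mark the OSD down; it is reported up again once it re-registers with the monitors.
ceph osd down 179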

At some point we were no longer able to query pg `5.37c`. To understand where the query was hanging, we used this command:

ceph pg --debug_ms 5.37c query

and we could see that it was not possible to talk to the primary OSD.

We used `ss` to make sure the OSD was correctly listening on its TCP socket:

ss -l -p | grep <tcpport>
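
The port itself can be looked up in the OSD map; the following is a sketch (osd.179 again used as the example):

# The osd dump line for an OSD includes its public and cluster addresses with ports.
ceph osd dump | grep '^osd.179 '
# With -p, ss also shows the owning process, so filtering on the daemon name works too.
ss -l -p -t | grep ceph-osd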

Finally, we just set all OSDs up and in. We had 0 remappings, 0 lost objects, 0 degraded. But the two PGs were still incomplete.

We checked whether the OSD process had, for instance, reached some kernel limits, but the following values looked OK:

/proc/sys/fs/file-max
/proc/`pidof theosd`/limits
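
Concretely, checks along these lines (a sketch; `pidof` may return several PIDs on a host with many OSDs, so pick the one belonging to the OSD being inspected):

# System-wide open-file limit.
cat /proc/sys/fs/file-max
# Per-process limits of one ceph-osd daemon.
cat /proc/$(pidof ceph-osd | awk '{print $1}')/limits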

Settings used to reduce the load on the cluster:

ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set norecover
ceph osd set nobackfill
ceph tell osd.* injectargs -- --osd_max_backfills 1
ceph tell osd.* injectargs -- --osd_recovery_max_active 1
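
To confirm the flags were actually applied, the OSD map header can be checked; a sketch:

# The "flags" line of the OSD map should now list noout,noscrub,nodeep-scrub,norecover,nobackfill.
ceph osd dump | grep ^flags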

With these settings we tried to restart all the OSDs on zhdk0051 one by one.

To restart a single OSD (similar to a soft reboot), just do:

ceph osd down osd.<id>

This did not bring the PGs up again.

How we finally fixed the problem

Enable recovery and backfill (if disabled):

ceph osd unset norecover
ceph osd unset nobackfill

Make the rebalancing as slow as possible:

ceph tell osd.* injectargs -- --osd_recovery_max_active 1
ceph tell osd.* injectargs -- --osd_max_backfills 1

Find the current acting primary for the incomplete PG:

ceph pg 6.106 query | grep acting_primary

In this case it is osd.179. Then:


ceph tell osd.179 injectargs -- --osd_find_best_info_ignore_history_les=1
ceph osd down osd.179
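
To see whether the PG then goes active, its state can be watched; a sketch (pg 6.106 as above):

# Follow the cluster log while the PG re-peers (Ctrl-C to stop) ...
ceph -w
# ... or re-query the PG until its state is no longer "incomplete".
ceph pg 6.106 query | grep -m1 '"state"'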

If the PG doesn't go directly to active, this is because another OSD knows about the history of that PG and won't let it activate yet. To find that other, new primary OSD, run the query again:

ceph pg <pgid> query | grep acting_primary

Then repeat the steps above (osd_find_best_info_ignore_history_les=1, followed by marking the OSD down) on that new OSD. Eventually the PG will activate, and after a bit of backfilling it will be active+clean.
Then reset osd_find_best_info_ignore_history_les to 0 on the OSDs touched above.

Repeat this for all incomplete PGs -- each of them may need `osd_find_best_info_ignore_history_les=1` on one or more OSDs.
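
If there are more incomplete PGs, the affected PGs and their acting primaries can be collected with something like the sketch below; this is not taken from the original report, and the health output format may differ between releases:

# For every PG reported incomplete, print its acting primary so the steps above
# can be repeated on that OSD.
ceph health detail | awk '/^pg .* incomplete/ {print $2}' | sort -u | while read pgid; do
    printf '%s -> ' "$pgid"
    ceph pg "$pgid" query | grep -m1 acting_primary
done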

At the end do:

ceph tell osd.* injectargs -- --osd_find_best_info_ignore_history_les=0

to make sure it's disabled everywhere.

#1

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

Thanks for your detailed account! This is a known bug on hammer (known in that it happens sometimes; the cause is not completely clear). We haven't seen it in jewel and believe it has been fixed there. Please reopen if you see any need for the ignore_history_les option on jewel!
