Bug #18960

PG stuck peering after host reboot

Added by George Vasilakakos about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Category: OSD
Target version: -
% Done: 0%
Tags: ec, peering, crush, osd, msgr
Regression: No
Severity: 2 - major

Description

On a cluster running Jewel 10.2.5, rebooting a host resulted in a PG getting stuck in the peering state.

pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]

Restarting OSDs or hosts does not help, and sometimes results in things like this:

pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]

In fact, we have since upgraded to Kraken 11.2.0, and every OSD that was restarted was replaced by CRUSH_ITEM_NONE (2147483647) in the acting set. This is only rectified by restarting OSDs 595 and 1391.
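
For reference, this is roughly how I have been checking the mapping; a minimal sketch using only the standard ceph CLI:

# Show the up and acting sets CRUSH currently computes for the PG
$ ceph pg map 1.323
# List all PGs stuck inactive (the peering PG shows up here)
$ ceph pg dump_stuck inactive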

The host that was rebooted originally is home to osd.7 (rank 8 in the acting set). When I would go onto it to look at the logs for osd.7, this is what I would see:

$ tail -f /var/log/ceph/ceph-osd.7.log
2017-02-08 15:41:00.445247 7f5fcc2bd700 0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect

I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates the direction of communication. I've traced these to osd.7 (rank 8 in the stuck PG) reaching out to osd.595 (the primary in the stuck PG).
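
(A minimal sketch of how the addresses in those log lines can be mapped back to OSD IDs; the grep patterns are just the masked example addresses from above.)

# Report the location and address registered for a given OSD
$ ceph osd find 595
$ ceph osd find 7
# Or go the other way: find which OSD owns an address seen in the logs
$ ceph osd dump | grep 'XXX.XXX.XXX.192:6921'
$ ceph osd dump | grep 'XXX.XXX.XXX.172:6905'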

Meanwhile, looking at the logs of osd.595 I would see this:

$ tail -f /var/log/ceph/ceph-osd.595.log
2017-02-08 15:41:15.760708 7f1765673700 0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384 != exp 3786596716

which again shows osd.595 reaching out to osd.7 and from what I could gather the CRC problem is about messaging.

The CRC messages no longer seem to be logged by osd.595; osd.7 is still logging reconnects.
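
(In case it matters for the CRC errors, a quick sketch of how the messenger CRC settings can be checked on the OSDs involved, assuming admin socket access on the OSD hosts.)

# On the host running osd.595, dump the messenger CRC options via the admin socket
$ ceph daemon osd.595 config show | grep ms_crc
# ms_crc_data and ms_crc_header default to true, i.e. CRCs are computed and verified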

ceph pg 1.323 query seems to hang forever, but it completed once early on and I noticed this:

"peering_blocked_by_detail": [
    {
"detail": "peering_blocked_by_history_les_bound"
}
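
(A small sketch of how that detail can be pulled out without sitting on a hung query; timeout and jq are my additions, not part of the original commands.)

# Give the query 30 seconds, then extract any peering_blocked_by detail from the JSON
$ timeout 30 ceph pg 1.323 query > /tmp/pg-1.323-query.json
$ jq '.. | .peering_blocked_by_detail? // empty' /tmp/pg-1.323-query.json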

We have seen this before; it was cleared by setting osd_find_best_info_ignore_history_les to true for the first two OSDs of the stuck PGs (that was on a 3-replica pool). It hasn't worked in this case.
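
(For completeness, this is roughly how the option was applied here, injected at runtime into the first two OSDs of the PG and reverted afterwards; a sketch, not a recommendation.)

# Inject the option into the first two OSDs of the stuck PG
$ ceph tell osd.595 injectargs '--osd_find_best_info_ignore_history_les=true'
$ ceph tell osd.1391 injectargs '--osd_find_best_info_ignore_history_les=true'
# Revert once peering has been retried
$ ceph tell osd.595 injectargs '--osd_find_best_info_ignore_history_les=false'
$ ceph tell osd.1391 injectargs '--osd_find_best_info_ignore_history_les=false'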

PG 1.323 is the only PG with both 595 and 7 in its set.
There are 217 other PGs with OSDs from both of these hosts in their sets, which makes this seem less like a networking issue.
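
(A rough sketch of how that overlap can be listed; the output format of ceph pg ls-by-osd differs between releases, so the awk column is an assumption.)

# PGs that currently include osd.595
$ ceph pg ls-by-osd 595 | awk 'NR>1 {print $1}' | sort > /tmp/pgs-595
# PGs that currently include osd.7
$ ceph pg ls-by-osd 7 | awk 'NR>1 {print $1}' | sort > /tmp/pgs-7
# PGs common to both
$ comm -12 /tmp/pgs-595 /tmp/pgs-7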


Files

ceph.conf (1.05 KB), George Vasilakakos, 02/16/2017 01:47 PM
crushmap.txt (72 KB), George Vasilakakos, 02/16/2017 01:47 PM
health-detail-pg-1.323.txt (2.42 KB), George Vasilakakos, 02/16/2017 01:48 PM