Support #18508

PGs of EC pool stuck in peering state

Added by George Vasilakakos over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags: ec, peering, crush, osd
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

We have a 30 host, 1080 OSD cluster with a mix of replicated and EC 8+3 pools, running Jewel on SL7.

ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

The problem was first seen when creating a large pool (16384 PGs), where a number of the PGs would stay in peering states. Restarting the last primary would get a stuck PG to peer.

PGs in other pools would also go into peering states.

It was also seen when creating a smaller pool and expanding its number of PGs (increasing both pg_num and pgp_num). The problem could be mitigated by increasing pg_num (or pgp_num) in very small steps; any single step above 128 PGs would result in some of them getting stuck.
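
For illustration, a single small step looked roughly like the following (a sketch only; the pool name and target count are placeholders, and each step was only taken after the previous PGs had gone active+clean):

ceph osd pool set <pool> pg_num 2176     # step pg_num up by no more than 128
ceph osd pool set <pool> pgp_num 2176    # then bring pgp_num up to match
ceph -s                                  # wait for the new PGs to settle before the next step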

After gradually increasing the PGs in the pool and getting everything active+clean, we started a rados bench to run overnight.

In the morning, all the OSDs on one host were down (36 in total) and a number of PGs in the aforementioned pool were stuck in peering states.

Sample ceph health detail output:

pg 19.4a4 is down+remapped+peering, acting [86,812,2147483647,209,622,306,420,1029,394,266,204]
pg 19.380 is peering, acting [514,301,97,347,366,206,438,738,947,431,982]

Throughout the whole process the stuck PGs would peer after restarting their primary OSD. It was also observed that their acting sets contained CRUSH_ITEM_NONE (2147483647) entries rather than actual OSDs.
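
For reference, the workaround for a single stuck PG looked roughly like this (a sketch using pg 19.380 from the examples here; it assumes systemd-managed OSDs, as on SL7, and that the restart is run on the host carrying the primary):

ceph pg 19.380 query                     # inspect the acting set and past intervals
systemctl restart ceph-osd@514           # restart the (up_)primary OSD; the PG then peers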

Sample ceph pg query past_interval output for pg 19.380:

                {
                    "first": 2035,
                    "last": 2036,
                    "maybe_went_rw": 1,
                    "up": [
                        514,
                        301,
                        97,
                        347,
                        366,
                        206,
                        438,
                        738,
                        947,
                        431,
                        982
                    ],
                    "acting": [
                        2147483647,
                        301,
                        97,
                        347,
                        366,
                        206,
                        438,
                        738,
                        947,
                        431,
                        982
                    ],
                    "primary": 301,
                    "up_primary": 514
                }

Some PGs in the EC 8+3 pools would report up to 8 members of their acting set as CRUSH_ITEM_NONE, often with the primary among them.
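
Since the placeholder shows up as the literal value 2147483647 (2^31 - 1) in the PG dumps, the affected PGs can be listed with something like the following (a sketch; it simply greps the brief PG dump for that value):

ceph pg dump pgs_brief | grep 2147483647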


Files

echo-pg-peering-ceph-status-2017-01-12 10.42.40.txt (920 Bytes) - ceph status output with a node down - George Vasilakakos, 01/12/2017 11:28 AM
echo-pg-peering-ceph-health-detail-2017-01-12 10.42.40.txt (3.92 KB) - ceph health detail output with a node down - George Vasilakakos, 01/12/2017 11:28 AM
cm.txt (53.2 KB) - CRUSH map currently on cluster - George Vasilakakos, 01/12/2017 11:28 AM
#1

Updated by Wido den Hollander over 7 years ago

While looking at this with George I noticed that the async messenger was being used. We set it back to SimpleMessenger and that seemed to resolve it.

Looks a lot like: #16051
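
For reference, the messenger implementation is selected via the ms_type option in ceph.conf; the switch back to SimpleMessenger would have looked roughly like this (a sketch only, and the daemons have to be restarted to pick the change up):

[global]
    ms type = simple

# check what a running daemon is actually using (run on the host with that OSD's admin socket):
ceph daemon osd.0 config get ms_type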

#2

Updated by Nathan Cutler about 7 years ago

  • Target version deleted (v10.2.6)
#3

Updated by Greg Farnum almost 7 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

There was a lot going on here and none of it was clear. If switching to SimpleMessenger fixed it, I presume there were some bugs with AsyncMessenger in that Jewel release that led to it behaving badly under network contention or similar conditions.

#4

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (10)