Support #18508

PGs of EC pool stuck in peering state

Added by George Vasilakakos over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags: ec, peering, crush, osd
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

We have a 30 host, 1080 OSD cluster with a mix of replicated and EC 8+3 pools, running Jewel on SL7.

ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

The problem was first seen when creating a large pool (16384 PGs), where a number of the PGs would stay in peering states. Restarting the last primary would get a stuck PG to peer.

PGs in other pools would also go into peering states.

It was also seen when creating a smaller pool and expanding its number of PGs (increasing both pg_num and pgp_num). The problem could be mitigated by increasing pg_num (or pgp_num) in very small steps; any single step above 128 PGs would result in some of them getting stuck.
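
For illustration, a single small step looked roughly like the following (a sketch only; the pool name and target count are placeholders, and each step was only taken after the previous PGs had gone active+clean):

ceph osd pool set <pool> pg_num 2176     # step pg_num up by no more than 128
ceph osd pool set <pool> pgp_num 2176    # then bring pgp_num up to match
ceph -s                                  # wait for the new PGs to settle before the next step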

After gradually increasing the PGs in the pool and getting everything active+clean, we started a rados bench to run overnight.

In the morning, all the OSDs on one host were down (36 in total) and a number of PGs in the aforementioned pool were stuck in peering states.

Sample ceph health detail output:

pg 19.4a4 is down+remapped+peering, acting [86,812,2147483647,209,622,306,420,1029,394,266,204]
pg 19.380 is peering, acting [514,301,97,347,366,206,438,738,947,431,982]

Throughout the whole process the stuck PGs would peer after restarting their primary OSD. It was also observed that their acting sets contained CRUSH_ITEM_NONE (2147483647) entries rather than actual OSDs.
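
For reference, the workaround for a single stuck PG looked roughly like this (a sketch using pg 19.380 from the examples here; it assumes systemd-managed OSDs, as on SL7, and that the restart is run on the host carrying the primary):

ceph pg 19.380 query                     # inspect the acting set and past intervals
systemctl restart ceph-osd@514           # restart the (up_)primary OSD; the PG then peers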

Sample ceph pg query past_interval output for pg 19.380:

                {
                    "first": 2035,
                    "last": 2036,
                    "maybe_went_rw": 1,
                    "up": [
                        514,
                        301,
                        97,
                        347,
                        366,
                        206,
                        438,
                        738,
                        947,
                        431,
                        982
                    ],
                    "acting": [
                        2147483647,
                        301,
                        97,
                        347,
                        366,
                        206,
                        438,
                        738,
                        947,
                        431,
                        982
                    ],
                    "primary": 301,
                    "up_primary": 514
                }

Some PGs in the EC 8+3 pools would report up to 8 members of their acting set as CRUSH_ITEM_NONE, often with the primary among them.
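
Since the placeholder shows up as the literal value 2147483647 (2^31 - 1) in the PG dumps, the affected PGs can be listed with something like the following (a sketch; it simply greps the brief PG dump for that value):

ceph pg dump pgs_brief | grep 2147483647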


Files

echo-pg-peering-ceph-status-2017-01-12 10.42.40.txt (920 Bytes) - ceph status output with a node down - George Vasilakakos, 01/12/2017 11:28 AM
echo-pg-peering-ceph-health-detail-2017-01-12 10.42.40.txt (3.92 KB) - ceph health detail output with a node down - George Vasilakakos, 01/12/2017 11:28 AM
cm.txt (53.2 KB) - CRUSH map currently on cluster - George Vasilakakos, 01/12/2017 11:28 AM
#1

Updated by Wido den Hollander over 7 years ago

While looking at this with George I noticed that the async messenger was being used. We set it back to SimpleMessenger and that seemed to resolve it.

Looks a lot like: #16051
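
For reference, the messenger implementation is selected via the ms_type option in ceph.conf; the switch back to SimpleMessenger would have looked roughly like this (a sketch only, and the daemons have to be restarted to pick the change up):

[global]
    ms type = simple

# check what a running daemon is actually using (run on the host with that OSD's admin socket):
ceph daemon osd.0 config get ms_type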

#2

Updated by Nathan Cutler about 7 years ago

  • Target version deleted (v10.2.6)
#3

Updated by Greg Farnum almost 7 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

There was a lot going on here and none of it was clear. If switching to SimpleMessenger fixed it, I presume there were some bugs with AsyncMessenger in that Jewel release that led to it behaving badly under network contention or similar conditions.

#4

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (10)