Bug #3617

Ceph doesn't support > 65536 PGs(?) and fails silently

Added by Faidon Liambotis over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%
Source:
Development

Description

Hi,

While playing with a test cluster and trying to size it according to production needs & future growth, we decided to create a pool with 65536 placement groups (there were some other pools as well). Since there is no PG splitting yet, we started with a very small cluster: two boxes with 4 OSDs each.

We were seeing very weird behavior, including PGs that never managed to peer and lots of unfound PGs:
2012-12-11 02:18:16.576998 mon.0 [INF] pgmap v11750: 66728 pgs: 9581 active, 16659 active+clean, 2 active+remapped+wait_backfill, 28382 active+recovery_wait, 12085 peering, 4 active+remapped, 3 active+recovery_wait+remapped, 7 remapped+peering, 5 active+recovering; 79015 MB data, 166 GB used, 18433 GB / 18600 GB avail; 100586/230461 degraded (43.646%); 11716/115185 unfound (10.171%)

After inquiring about it on IRC, I was told that the maximum number of PGs is 65536 and was pointed at struct ceph_pg, presumably because of the 16-bit value used for the placement seed.
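
For reference, this is roughly what that struct looks like in src/include/rados.h around this release (paraphrased, so it may not match the tree exactly; the typedefs are just stand-ins so the sketch compiles on its own):

    /* Rough paraphrase of struct ceph_pg; __le16/__le32 are the little-endian
     * wire types used in the real header. */
    #include <stdint.h>

    typedef uint16_t __le16;   /* stand-in for the kernel-style wire type */
    typedef uint32_t __le32;

    struct ceph_pg {
        __le16 preferred;  /* preferred primary for the PG (-1 for none) */
        __le16 ps;         /* placement seed: only 65536 distinct values fit here */
        __le32 pool;       /* object pool */
    } __attribute__ ((packed));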

If that's the case, this probably means that the seed overflowed and the cluster failed silently in many other ways. It'd be nice if Ceph didn't let you shoot yourself in the foot like this and instead refused to set a pool to a size that would push the PG count over that limit.
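
To illustrate what I mean, a guard along these lines (purely hypothetical, not actual Ceph code) could reject such a request up front at pool creation time, instead of letting the 16-bit seed silently wrap:

    /* Hypothetical sanity check, not actual Ceph code: refuse pg_num values
     * that cannot be represented by the 16-bit placement seed. */
    #include <stdint.h>
    #include <stdio.h>

    #define PG_SEED_LIMIT 65536u   /* 2^16 possible values for a 16-bit ps */

    static int validate_pg_num(uint64_t requested_pg_num)
    {
        if (requested_pg_num > PG_SEED_LIMIT) {
            fprintf(stderr, "pg_num %llu exceeds the %u PGs representable by "
                    "the 16-bit placement seed; refusing\n",
                    (unsigned long long)requested_pg_num, PG_SEED_LIMIT);
            return -1;   /* the monitor would return an error to the client here */
        }
        return 0;
    }

    int main(void)
    {
        validate_pg_num(65536);   /* fits exactly: seeds 0..65535 */
        validate_pg_num(70000);   /* would overflow the seed: rejected */
        return 0;
    }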

Additionally, at the time this weird behavior happened, we were seeing a lot of OSDs get a SIGABRT, asserting in:
osd/ReplicatedPG.cc: In function 'int ReplicatedPG::pull(const hobject_t&, eversion_t, int)' thread 7f496b723700 time 2012-12-10 16:42:40.295124
osd/ReplicatedPG.cc: 4890: FAILED assert(peer_missing.count(fromosd))

The full backtrace is attached. I'm unsure whether it's related or not, but due to the lack of more info/debugging logs and the unusual nature of the setup, I'm not filing a separate bug.

This was with Ceph 0.55 and a stock configuration with nothing unusual but the number of PGs.


Files

ceph-osd-64kpg-crash.txt (2.19 KB), Faidon Liambotis, 12/13/2012 09:40 AM