Bug #20416

"FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster

Added by Hey Pas 6 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 06/26/2017
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No
Component(RADOS):

Description

Hello,

I've upgraded a Jewel cluster to Luminous 12.1.0 (RC) and restarted the monitors, and mgr is active, but I can't restart the first OSD: it fails with a very nasty stack trace and this assertion error:

2017-06-26 11:49:56.786744 7f83963d7700 -1 /build/ceph-12.1.0/src/osd/PG.cc: In function 'void PG::on_new_interval()' thread 7f83963d7700 time 2017-06-26 11:49:56.780364
/build/ceph-12.1.0/src/osd/PG.cc: 5412: FAILED assert(osdmap->test_flag((1<<15)))

I guess this is the check for sortbitwise ( https://github.com/ceph/ceph/blob/288f623878284c7025f0197d24b2689a6cbb3af6/src/osd/PG.cc#L5412 )

The problem is, I can't set sortbitwise.

After executing ceph osd set sortbitwise (the mon logs confirm that the command finished), ceph status doesn't show the osdmap/flags section.

Is there any way to set the flag via lower-level commands or manually? Or is it easier just to trash the OSD and somehow create a new one? (This is a 3-OSD mini cluster, but size is 3 [with min_size = 2] for important pools.)

This is on Ubuntu Trusty.

History

#1 Updated by Josh Durgin 5 months ago

Since this flag is set all the time now, it (like the require_x_osds flags) isn't shown by default. Does it appear in 'ceph osd dump --format json-pretty | grep flags'?

#2 Updated by Greg Farnum 5 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#3 Updated by Josh Durgin 5 months ago

  • Status changed from New to Need More Info

#4 Updated by Hey Pas 4 months ago

Hello,

Sorry for the delay.

Yes, it appears under flags.

{
    "epoch": 542,
    "fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "created": "2015-09-01 18:27:53.253076",
    "modified": "2017-07-17 19:10:14.234962",
    "flags": "sortbitwise,require_jewel_osds",
    "crush_version": 1,
    "full_ratio": 0.000000,
    "backfillfull_ratio": 0.000000,
    "nearfull_ratio": 0.000000,
    "cluster_snapshot": "",
    "pool_max": 18,
    "max_osd": 4,
    "require_min_compat_client": "unknown",
    "min_compat_client": "jewel",
    "require_osd_release": "jewel",
    "pools": [ "omitted" : "omitted" ]
}
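The same check can be scripted rather than grepped. The sketch below assumes the dump has been saved as valid JSON (the "pools" line above was abbreviated by hand and would not parse as posted), and it uses a cut-down sample containing only fields shown in this comment. The helper name has_flag is hypothetical.

```python
import json


def has_flag(dump_text: str, flag: str) -> bool:
    """Check whether flag appears in the comma-separated "flags" field."""
    dump = json.loads(dump_text)
    return flag in dump.get("flags", "").split(",")


# Cut-down sample using values from the dump above.
sample = '{"epoch": 542, "flags": "sortbitwise,require_jewel_osds"}'

print(has_flag(sample, "sortbitwise"))
print(has_flag(sample, "require_luminous_osds"))
```

Splitting on commas avoids a false match on a flag whose name merely contains the one being searched for.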

#5 Updated by Greg Farnum 3 months ago

  • Subject changed from Cannot set sortbitwise flag to "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster
  • Status changed from Need More Info to Verified
  • Assignee set to Greg Farnum

Got a report of this happening in downstream Red Hat packages at https://bugzilla.redhat.com/show_bug.cgi?id=1494238

I went through the code a bit and there is a bit of an issue:
1) run without sortbitwise set
2) set sortbitwise
3) upgrade to Luminous before all OSDs have processed the OSDMap which sets sortbitwise
4) assert horribly because the PG gets set up with a pre-sortbitwise map but still hits the assert

But the bugzilla report there has apparently been running for months with sortbitwise so it doesn't seem likely to be the case on its own. I'm wondering if maybe there are "dead" PGs that haven't been updated in a while or something.
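The four-step sequence above can be sketched as a toy model. This is only an illustration of the ordering problem, not Ceph's code: the epoch numbers, the osdmaps dictionary and pg_check_passes are all made up for the example.

```python
# Toy model of steps 1-4: an OSD that stopped before processing the map
# which set sortbitwise still checks the flag against its older map.
SORTBITWISE = 1 << 15


def pg_check_passes(map_flags: int) -> bool:
    """Stand-in for the per-PG check: the map must carry sortbitwise."""
    return bool(map_flags & SORTBITWISE)


# Hypothetical history: epoch 1 predates the flag, epoch 2 sets it.
osdmaps = {1: 0, 2: SORTBITWISE}

newest_epoch_on_osd = 1       # the OSD went down before seeing epoch 2
cluster_epoch = max(osdmaps)  # the monitors are already at epoch 2

# The cluster-wide map looks fine...
print(pg_check_passes(osdmaps[cluster_epoch]))
# ...but the upgraded OSD sets up its PGs from its own older map.
print(pg_check_passes(osdmaps[newest_epoch_on_osd]))
```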

#6 Updated by Greg Farnum 3 months ago

Okay, the one I'm looking at is crashing on pg 126.b7, at epoch 5350. Pool 126 does not presently exist; epoch 5350 (modified 2017-09-19 15:48:16.743313) really does not have sortbitwise set (nor does 5447 (modified 2017-09-19 15:55:28.991958) which is the newest map the OSD has on disk); the cluster is currently at 6019 (modified 2017-09-23 22:28:16.690407) and that map does have sortbitwise set.

Looks like sortbitwise was set in epoch 5784 (modified 2017-09-19 16:02:56.161941); I didn't bother to track down when the pool was deleted. (It was much later.)

Still pondering how to let this situation resolve itself in code...

#7 Updated by Greg Farnum 3 months ago

Okay. Assuming sortbitwise is just a messaging scheme (I think it is), we should be safe to change the assert to require sortbitwise or that we (the OSD) are down during this map.

I also kind of want to remove that assert from the per-PG per-map processing anyway; will look and see if there's a better place to put it.
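As a rough illustration of the relaxed condition described in this comment (sortbitwise is set, or the OSD was down during that map), here is a small Python sketch. It is not the change made in the pull request; relaxed_check and its parameters are hypothetical names.

```python
SORTBITWISE = 1 << 15


def relaxed_check(map_flags: int, osd_up_in_map: bool) -> bool:
    """Pass if the map has sortbitwise, or if this OSD was down in it."""
    return bool(map_flags & SORTBITWISE) or not osd_up_in_map


# Old map without the flag, processed while catching up after being down:
print(relaxed_check(0, osd_up_in_map=False))
# Old map without the flag while the OSD was up: still a violation.
print(relaxed_check(0, osd_up_in_map=True))
```

The idea is that maps replayed from a period when the OSD was not participating should not trip the check, while a live OSD on a non-sortbitwise map still would.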

#8 Updated by Greg Farnum 3 months ago

  • Status changed from Verified to Feedback

https://github.com/ceph/ceph/pull/18047 for the fix. I'll backport it to Luminous if that looks good.

#9 Updated by Greg Farnum 2 months ago

  • Status changed from Feedback to Testing
  • Backport set to luminous

Yuri's testing it (it will pass), so I went ahead and created a backport PR: https://github.com/ceph/ceph/pull/18132

#10 Updated by Sage Weil 2 months ago

  • Status changed from Testing to Resolved

#11 Updated by Yuri Weinstein 2 months ago

Greg Farnum wrote:

https://github.com/ceph/ceph/pull/18047 for the fix. I'll backport it to Luminous if that looks good.

merged

#12 Updated by Nathan Cutler 2 months ago

  • Backport deleted (luminous)

fast-tracking the backport, since it's already open
