Bug #20985

PG which marks divergent_priors causes crash on startup

Added by Greg Farnum over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 08/11/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

This was noticed in the course of somebody upgrading from 12.1.1 to 12.1.2:

2017-08-11 23:01:53.109922 7fd4268ffcc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-08-11 23:01:53.109926 7fd4268ffcc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is disabled via 'filestore splice' config option
2017-08-11 23:01:53.111939 7fd4268ffcc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-08-11 23:01:53.112060 7fd4268ffcc0  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-08-11 23:01:53.113102 7fd4268ffcc0  0 filestore(/var/lib/ceph/osd/ceph-0) start omap initiation
2017-08-11 23:01:53.114429 7fd4268ffcc0  1 leveldb: Recovering log #181623
2017-08-11 23:01:53.122344 7fd4268ffcc0  1 leveldb: Delete type=0 #181623

2017-08-11 23:01:53.122450 7fd4268ffcc0  1 leveldb: Delete type=3 #181622

2017-08-11 23:02:41.757352 7fd4268ffcc0  0 filestore(/var/lib/ceph/osd/ceph-0) mount(1758): enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-08-11 23:02:41.788193 7fd4268ffcc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-08-11 23:02:41.788202 7fd4268ffcc0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 28: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-08-11 23:02:41.823216 7fd4268ffcc0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 28: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-08-11 23:02:41.830592 7fd4268ffcc0  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade(1365)
2017-08-11 23:02:41.831343 7fd4268ffcc0  0 _get_class not permitted to load lua
2017-08-11 23:02:41.833438 7fd4268ffcc0  0 _get_class not permitted to load sdk
2017-08-11 23:02:41.842946 7fd4268ffcc0  0 <cls> /build/ceph-12.1.3/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2017-08-11 23:02:41.843280 7fd4268ffcc0  0 <cls> /build/ceph-12.1.3/src/cls/hello/cls_hello.cc:296: loading cls_hello
2017-08-11 23:02:41.843606 7fd4268ffcc0  0 _get_class not permitted to load kvs
2017-08-11 23:02:41.843662 7fd4268ffcc0  1 osd.0 0 warning: got an error loading one or more classes: (1) Operation not permitted
2017-08-11 23:02:41.844083 7fd4268ffcc0  0 osd.0 6793 crush map has features 288232576282525696, adjusting msgr requires for clients
2017-08-11 23:02:41.844124 7fd4268ffcc0  0 osd.0 6793 crush map has features 288232576282525696 was 8705, adjusting msgr requires for mons
2017-08-11 23:02:41.844160 7fd4268ffcc0  0 osd.0 6793 crush map has features 1008808516661821440, adjusting msgr requires for osds
2017-08-11 23:02:44.634391 7fd4268ffcc0  0 osd.0 6793 load_pgs
2017-08-11 23:02:44.749661 7fd4268ffcc0 -1 /build/ceph-12.1.3/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]' thread 7fd4268ffcc0 time 2017-08-11 23:02:44.746102
/build/ceph-12.1.3/src/osd/PGLog.h: 1301: FAILED assert(force_rebuild_missing)

But it's actually much worse than that: PG::read_state only sets force_rebuild_missing if the info_structv is the Jewel version, and PGLog::read_log_and_missing asserts force_rebuild_missing whenever it sees divergent_priors written down. That means an all-Luminous system whose PGs have recorded divergent_priors will crash on reboot, not just one in the middle of an upgrade.
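To make the interaction concrete, here is a toy Python model of the two checks described above. It is only a sketch of the logic as stated in this report, not the Ceph code: the version constants, the `StoredPG` class, and the `crashes_on_startup` helper are all hypothetical names invented for illustration, and the real values live in the OSD source.

```python
from dataclasses import dataclass, field

# Placeholder version numbers (assumptions, not the real on-disk values).
JEWEL_STRUCT_V = 1
LUMINOUS_STRUCT_V = 2


@dataclass
class StoredPG:
    """Hypothetical stand-in for what an OSD finds on disk for one PG."""
    info_struct_v: int
    divergent_priors: dict = field(default_factory=dict)


def read_log_and_missing(pg: StoredPG, force_rebuild_missing: bool) -> None:
    # Models the failing check: divergent_priors on disk are only
    # tolerated when the caller asked for a rebuild of the missing set.
    if pg.divergent_priors and not force_rebuild_missing:
        raise AssertionError("FAILED assert(force_rebuild_missing)")


def read_state(pg: StoredPG) -> None:
    # Models the caller: the flag is set only for the Jewel-era format.
    force_rebuild_missing = pg.info_struct_v == JEWEL_STRUCT_V
    read_log_and_missing(pg, force_rebuild_missing)


def crashes_on_startup(pg: StoredPG) -> bool:
    """Return True if loading this PG would trip the assert."""
    try:
        read_state(pg)
    except AssertionError:
        return True
    return False


priors = {"10'5": "some_object"}  # made-up entry, purely illustrative
print(crashes_on_startup(StoredPG(JEWEL_STRUCT_V, priors)))     # Jewel format
print(crashes_on_startup(StoredPG(LUMINOUS_STRUCT_V, priors)))  # Luminous format
print(crashes_on_startup(StoredPG(LUMINOUS_STRUCT_V)))          # no priors
```

In this model only the middle case fails: a PG already in the Luminous format that still has divergent_priors written down, which is the situation an OSD restart on an all-Luminous cluster can hit.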

ceph-osd.log (947 KB) Stephan Hohn, 08/12/2017 09:19 AM

History

#1 Updated by Greg Farnum over 1 year ago

  • Backport set to luminous

https://github.com/ceph/ceph/pull/17000

Still compiling, testing, etc

#3 Updated by Stephan Hohn over 1 year ago

Facing the same issue upgrading from jewel 10.2.9 -> luminous 12.1.3 (RC)

#4 Updated by Greg Farnum over 1 year ago

  • Status changed from In Progress to Testing

If anyone wants to validate that the fix packages at https://shaman.ceph.com/repos/ceph/wip-20985-divergent-handling-luminous/ed1c1ecc6f3bf1edf55b49e5625d0c3bf3508d4a/ actually solve this problem before stuff gets merged, that would be a helpful data point and would let you get stuff running again more quickly. :)

#5 Updated by Stephan Hohn over 1 year ago

I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs are up.

#6 Updated by Stephan Hohn over 1 year ago

Stephan Hohn wrote:

I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs are up.

#7 Updated by Greg Farnum over 1 year ago

  • Status changed from Testing to Resolved

Several other confirmations and a healthy test run later, all merged!
