Bug #20985
PG which marks divergent_priors causes crash on startup
Description
This was noticed in the course of somebody upgrading from 12.1.1 to 12.1.2:
2017-08-11 23:01:53.109922 7fd4268ffcc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-08-11 23:01:53.109926 7fd4268ffcc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is disabled via 'filestore splice' config option
2017-08-11 23:01:53.111939 7fd4268ffcc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-08-11 23:01:53.112060 7fd4268ffcc0 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-08-11 23:01:53.113102 7fd4268ffcc0 0 filestore(/var/lib/ceph/osd/ceph-0) start omap initiation
2017-08-11 23:01:53.114429 7fd4268ffcc0 1 leveldb: Recovering log #181623
2017-08-11 23:01:53.122344 7fd4268ffcc0 1 leveldb: Delete type=0 #181623
2017-08-11 23:01:53.122450 7fd4268ffcc0 1 leveldb: Delete type=3 #181622
2017-08-11 23:02:41.757352 7fd4268ffcc0 0 filestore(/var/lib/ceph/osd/ceph-0) mount(1758): enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-08-11 23:02:41.788193 7fd4268ffcc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2017-08-11 23:02:41.788202 7fd4268ffcc0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 28: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-08-11 23:02:41.823216 7fd4268ffcc0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 28: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-08-11 23:02:41.830592 7fd4268ffcc0 1 filestore(/var/lib/ceph/osd/ceph-0) upgrade(1365)
2017-08-11 23:02:41.831343 7fd4268ffcc0 0 _get_class not permitted to load lua
2017-08-11 23:02:41.833438 7fd4268ffcc0 0 _get_class not permitted to load sdk
2017-08-11 23:02:41.842946 7fd4268ffcc0 0 <cls> /build/ceph-12.1.3/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2017-08-11 23:02:41.843280 7fd4268ffcc0 0 <cls> /build/ceph-12.1.3/src/cls/hello/cls_hello.cc:296: loading cls_hello
2017-08-11 23:02:41.843606 7fd4268ffcc0 0 _get_class not permitted to load kvs
2017-08-11 23:02:41.843662 7fd4268ffcc0 1 osd.0 0 warning: got an error loading one or more classes: (1) Operation not permitted
2017-08-11 23:02:41.844083 7fd4268ffcc0 0 osd.0 6793 crush map has features 288232576282525696, adjusting msgr requires for clients
2017-08-11 23:02:41.844124 7fd4268ffcc0 0 osd.0 6793 crush map has features 288232576282525696 was 8705, adjusting msgr requires for mons
2017-08-11 23:02:41.844160 7fd4268ffcc0 0 osd.0 6793 crush map has features 1008808516661821440, adjusting msgr requires for osds
2017-08-11 23:02:44.634391 7fd4268ffcc0 0 osd.0 6793 load_pgs
2017-08-11 23:02:44.749661 7fd4268ffcc0 -1 /build/ceph-12.1.3/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]' thread 7fd4268ffcc0 time 2017-08-11 23:02:44.746102
/build/ceph-12.1.3/src/osd/PGLog.h: 1301: FAILED assert(force_rebuild_missing)
But it's actually much worse than that: PG::read_state only sets force_rebuild_missing if the info_struct_v is the Jewel version, and the log-reading code hits assert(force_rebuild_missing) whenever it sees divergent_priors written down. That means an all-Luminous system with divergent_priors on disk will crash on reboot, not just one that is mid-upgrade.
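To make the failure condition concrete, here is a toy Python model of the check described above. It is only a sketch of the logic as stated in this report, not the actual Ceph C++ code: the function bodies, the string version tags "jewel" and "luminous", and the `StartupAssertion` exception are all assumptions made for illustration.

```python
class StartupAssertion(Exception):
    """Stands in for the OSD aborting on FAILED assert(force_rebuild_missing)."""


def read_log_and_missing(has_divergent_priors, force_rebuild_missing):
    # Hypothetical simplification: on-disk divergent_priors are only
    # tolerated when the caller asked for the missing set to be rebuilt.
    if has_divergent_priors:
        if not force_rebuild_missing:
            raise StartupAssertion("FAILED assert(force_rebuild_missing)")
        return "missing set rebuilt"
    return "log loaded"


def read_state(info_struct_v, has_divergent_priors):
    # Per the description: the flag is set only for Jewel-format PG info.
    # The version tags are illustrative strings, not real struct versions.
    force_rebuild_missing = (info_struct_v == "jewel")
    return read_log_and_missing(has_divergent_priors, force_rebuild_missing)


def crashes(info_struct_v, has_divergent_priors):
    """Return True if startup would hit the assert in this toy model."""
    try:
        read_state(info_struct_v, has_divergent_priors)
    except StartupAssertion:
        return True
    return False
```

In this model, `crashes("jewel", True)` is false because the flag is set, while `crashes("luminous", True)` is true: a PG already in the Luminous format that still has divergent_priors written down trips the assert on every start.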
Files
Updated by Greg Farnum over 6 years ago
- Backport set to luminous
https://github.com/ceph/ceph/pull/17000
Still compiling, testing, etc
Updated by Greg Farnum over 6 years ago
Luminous at https://github.com/ceph/ceph/pull/17001
Updated by Stephan Hohn over 6 years ago
- File ceph-osd.log ceph-osd.log added
Facing the same issue upgrading from jewel 10.2.9 -> luminous 12.1.3 (RC)
Updated by Greg Farnum over 6 years ago
- Status changed from In Progress to 7
If anyone wants to validate that the fix packages at https://shaman.ceph.com/repos/ceph/wip-20985-divergent-handling-luminous/ed1c1ecc6f3bf1edf55b49e5625d0c3bf3508d4a/ actually solve this problem before stuff gets merged, that would be a helpful data point and let you get stuff running again more quickly. :)
Updated by Stephan Hohn over 6 years ago
I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs are up.
Updated by Greg Farnum over 6 years ago
- Status changed from 7 to Resolved
Several other confirmations and a healthy test run later, all merged!