Bug #24597

FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename()

Added by Neha Ojha about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
06/20/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:

Description

2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: In function 'int FileStore::_collection_move_rename(const coll_t&, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&, bool)' thread 7f6034592700 time 2018-06-20 18:58:36.961023
2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: ceph version 14.0.0-663-g111c515 (111c515ab0294ffe409fcd8555bb98d3e7290a61) nautilus (dev)
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f604ddb4cdf]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 2: (()+0x28aec7) [0x7f604ddb4ec7]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 3: (FileStore::_collection_move_rename(coll_t const&, ghobject_t const&, coll_t, ghobject_t const&, SequencerPosition const&, bool)+0xa7c) [0x55e2f5a1b26c]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*, char const*)+0xe7b) [0x55e2f5a1d40b]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 5: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*, char const*)+0x48) [0x55e2f5a23368]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x13f) [0x55e2f5a234df]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7c7) [0x7f604ddba047]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f604ddbb6a0]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 9: (()+0x7e25) [0x7f604a908e25]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 10: (clone()+0x6d) [0x7f60499f8bad]
2018-06-20T18:58:36.954 INFO:tasks.ceph.osd.6.smithi143.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://pulpito.ceph.com/nojha-2018-06-20_18:20:55-rados:thrash-master-distro-basic-smithi/2684831/


Related issues

Related to RADOS - Bug #23145: OSD crashes during recovery of EC pg Duplicate 02/27/2018
Duplicated by RADOS - Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info Duplicate 05/19/2018
Copied to RADOS - Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() Resolved
Copied to RADOS - Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() Resolved

History

#1 Updated by Josh Durgin about 1 year ago

  • Priority changed from Normal to Urgent

#2 Updated by Josh Durgin about 1 year ago

  • Category set to Correctness/Safety
  • Component(RADOS) FileStore added

#3 Updated by Sage Weil 12 months ago

I believe this is caused by b50186bfe6c8981700e33c8a62850e21779d67d5, which does

  if (roll_forward_to) {
    pg_log.roll_forward(&rollbacker);
  }

i.e., it rolls forward to log.head instead of to *roll_forward_to.

In 12.2.5 this code is a backported fix for http://tracker.ceph.com/issues/22050, which is much less severe :)

#4 Updated by Sage Weil 12 months ago

  • Status changed from New to Verified
  • Priority changed from Urgent to Immediate

#5 Updated by Josh Durgin 12 months ago

Aha, in that case wip-24192 should fix it. Running it through testing again...

#6 Updated by Sage Weil 12 months ago

  • Status changed from Verified to In Progress
  • Backport set to mimic,luminous

#7 Updated by Sage Weil 12 months ago

  • Related to Bug #23145: OSD crashes during recovery of EC pg added

#8 Updated by Josh Durgin 12 months ago

  • Duplicated by Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info added

#9 Updated by Sage Weil 12 months ago

Factors leading to this:

- ec pool (e.g., rgw workload)
- rados ops that result in pg log 'error' entries (e.g., deleting a non-existent object, due to rgw gc)
- peering (due to osd restarts etc)

A workaround that should work:

- quiesce IO to the EC pool (ceph osd pause/unpause, or pause radosgw processes) prior to restarting/upgrading osds

That will ensure that the last_update for all shards of each PG match, so no rollback will be needed. (If the PG has already incorrectly rolled forward too far, a rollback would not be possible.)
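The workaround above can be sketched as the following runbook fragment. The `ceph osd pause`/`unpause` commands exist in stock Ceph; the restart step in the middle is a placeholder for whatever restart/upgrade procedure applies.

```shell
# Quiesce client IO so every shard's last_update matches before any
# OSD restarts (prevents the bad roll-forward from being triggered).
ceph osd pause          # sets the pauserd/pausewr flags, stopping client IO

# ... restart or upgrade the OSDs here ...

ceph osd unpause        # clears the flags, resuming client IO
```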

#11 Updated by Josh Durgin 12 months ago

  • Assignee set to Sage Weil

#12 Updated by Sage Weil 12 months ago

  • Status changed from In Progress to Pending Backport

#13 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added

#14 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added

#15 Updated by Dan van der Ster 12 months ago

Could cephfs trigger this issue? There have been two reports of cephfs_metadata pool crc errors on the users ML this week.

#16 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
