Bug #24597

FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename()

Added by Neha Ojha about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
06/20/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:

Description

2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: In function 'int FileStore::_collection_move_rename(const coll_t&, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&, bool)' thread 7f6034592700 time 2018-06-20 18:58:36.961023
2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: ceph version 14.0.0-663-g111c515 (111c515ab0294ffe409fcd8555bb98d3e7290a61) nautilus (dev)
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f604ddb4cdf]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 2: (()+0x28aec7) [0x7f604ddb4ec7]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 3: (FileStore::_collection_move_rename(coll_t const&, ghobject_t const&, coll_t, ghobject_t const&, SequencerPosition const&, bool)+0xa7c) [0x55e2f5a1b26c]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*, char const*)+0xe7b) [0x55e2f5a1d40b]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 5: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*, char const*)+0x48) [0x55e2f5a23368]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x13f) [0x55e2f5a234df]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7c7) [0x7f604ddba047]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f604ddbb6a0]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 9: (()+0x7e25) [0x7f604a908e25]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 10: (clone()+0x6d) [0x7f60499f8bad]
2018-06-20T18:58:36.954 INFO:tasks.ceph.osd.6.smithi143.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://pulpito.ceph.com/nojha-2018-06-20_18:20:55-rados:thrash-master-distro-basic-smithi/2684831/


Related issues

Related to RADOS - Bug #23145: OSD crashes during recovery of EC pg Duplicate 02/27/2018
Duplicated by RADOS - Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info Duplicate 05/19/2018
Copied to RADOS - Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() Resolved
Copied to RADOS - Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() Resolved

History

#1 Updated by Josh Durgin about 1 year ago

  • Priority changed from Normal to Urgent

#2 Updated by Josh Durgin about 1 year ago

  • Category set to Correctness/Safety
  • Component(RADOS) FileStore added

#3 Updated by Sage Weil 12 months ago

I believe this is caused by b50186bfe6c8981700e33c8a62850e21779d67d5, which does

  if (roll_forward_to) {
    pg_log.roll_forward(&rollbacker);
  }

i.e., it rolls forward to log.head instead of to *roll_forward_to.

In 12.2.5 this code is a backported fix for http://tracker.ceph.com/issues/22050, which is much less severe :)

#4 Updated by Sage Weil 12 months ago

  • Status changed from New to Verified
  • Priority changed from Urgent to Immediate

#5 Updated by Josh Durgin 12 months ago

Aha, in that case wip-24192 should fix it. Running it through testing again...

#6 Updated by Sage Weil 12 months ago

  • Status changed from Verified to In Progress
  • Backport set to mimic,luminous

#7 Updated by Sage Weil 12 months ago

  • Related to Bug #23145: OSD crashes during recovery of EC pg added

#8 Updated by Josh Durgin 12 months ago

  • Duplicated by Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info added

#9 Updated by Sage Weil 12 months ago

Factors leading to this:

- ec pool (e.g., rgw workload)
- rados ops that result in pg log 'error' entries (e.g., deleting a non-existent object, due to rgw gc)
- peering (due to osd restarts etc)

A workaround that should work:

- quiesce IO to the EC pool (ceph osd pause/unpause, or pause radosgw processes) prior to restarting/upgrading osds

That will ensure that the last_update for all shards of each PG match, so no rollback will be needed. (If the PG has already incorrectly rolled forward too far, a rollback would not be possible.)
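The workaround above can be sketched as the following runbook fragment. The `ceph osd pause`/`unpause` commands exist in stock Ceph; the restart step in the middle is a placeholder for whatever restart/upgrade procedure applies.

```shell
# Quiesce client IO so every shard's last_update matches before any
# OSD restarts (prevents the bad roll-forward from being triggered).
ceph osd pause          # sets the pauserd/pausewr flags, stopping client IO

# ... restart or upgrade the OSDs here ...

ceph osd unpause        # clears the flags, resuming client IO
```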

#11 Updated by Josh Durgin 12 months ago

  • Assignee set to Sage Weil

#12 Updated by Sage Weil 12 months ago

  • Status changed from In Progress to Pending Backport

#13 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added

#14 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added

#15 Updated by Dan van der Ster 12 months ago

Could cephfs trigger this issue? There have been two reports of cephfs_metadata pool crc errors on the users ML this week.

#16 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
