Bug #24597 (closed)

FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename()

Added by Neha Ojha almost 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: In function 'int FileStore::_collection_move_rename(const coll_t&, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&, bool)' thread 7f6034592700 time 2018-06-20 18:58:36.961023
2018-06-20T18:58:36.950 INFO:tasks.ceph.osd.6.smithi143.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-663-g111c515/rpm/el7/BUILD/ceph-14.0.0-663-g111c515/src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: ceph version 14.0.0-663-g111c515 (111c515ab0294ffe409fcd8555bb98d3e7290a61) nautilus (dev)
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f604ddb4cdf]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 2: (()+0x28aec7) [0x7f604ddb4ec7]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 3: (FileStore::_collection_move_rename(coll_t const&, ghobject_t const&, coll_t, ghobject_t const&, SequencerPosition const&, bool)+0xa7c) [0x55e2f5a1b26c]
2018-06-20T18:58:36.952 INFO:tasks.ceph.osd.6.smithi143.stderr: 4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*, char const*)+0xe7b) [0x55e2f5a1d40b]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 5: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*, char const*)+0x48) [0x55e2f5a23368]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x13f) [0x55e2f5a234df]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7c7) [0x7f604ddba047]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f604ddbb6a0]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 9: (()+0x7e25) [0x7f604a908e25]
2018-06-20T18:58:36.953 INFO:tasks.ceph.osd.6.smithi143.stderr: 10: (clone()+0x6d) [0x7f60499f8bad]
2018-06-20T18:58:36.954 INFO:tasks.ceph.osd.6.smithi143.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://pulpito.ceph.com/nojha-2018-06-20_18:20:55-rados:thrash-master-distro-basic-smithi/2684831/


Related issues 4 (0 open, 4 closed)

Related to RADOS - Bug #23145: OSD crashes during recovery of EC pg (Duplicate, Sage Weil, 02/27/2018)
Has duplicate RADOS - Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info (Duplicate, Josh Durgin, 05/19/2018)
Copied to RADOS - Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() (Resolved, Sage Weil)
Copied to RADOS - Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() (Resolved, Sage Weil)

Actions #1

Updated by Josh Durgin almost 6 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Josh Durgin almost 6 years ago

  • Category set to Correctness/Safety
  • Component(RADOS) FileStore added
Actions #3

Updated by Sage Weil almost 6 years ago

I believe this is caused by b50186bfe6c8981700e33c8a62850e21779d67d5, which does

  if (roll_forward_to) {
    pg_log.roll_forward(&rollbacker);
  }

i.e., rolls forward to log.head instead of *roll_forward_to.
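
For illustration, a minimal sketch of the change this analysis implies (assuming PGLog exposes a roll_forward_to(eversion_t, LogEntryHandler*) method alongside roll_forward(); the actual fix may differ):

  if (roll_forward_to) {
    // hypothetical corrected call: roll forward only to the version the
    // caller requested, rather than all the way to log.head
    pg_log.roll_forward_to(*roll_forward_to, &rollbacker);
  }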

In 12.2.5 this is a backported fix for http://tracker.ceph.com/issues/22050, which is much less severe :)

Actions #4

Updated by Sage Weil almost 6 years ago

  • Status changed from New to 12
  • Priority changed from Urgent to Immediate
Actions #5

Updated by Josh Durgin almost 6 years ago

Aha, in that case wip-24192 should fix it. Running it through testing again...

Actions #6

Updated by Sage Weil almost 6 years ago

  • Status changed from 12 to In Progress
  • Backport set to mimic,luminous
Actions #7

Updated by Sage Weil almost 6 years ago

  • Related to Bug #23145: OSD crashes during recovery of EC pg added
Actions #8

Updated by Josh Durgin almost 6 years ago

  • Has duplicate Bug #24192: cluster [ERR] Corruption detected: object 2:f59d1934:::smithi14913526-5822:head is missing hash_info added
Actions #9

Updated by Sage Weil almost 6 years ago

Factors leading to this:

- ec pool (e.g., rgw workload)
- rados ops that result in pg log 'error' entries (e.g., deleting a non-existent object, due to rgw gc)
- peering (due to osd restarts etc)

A workaround that should work:

- quiesce IO to the EC pool (ceph osd pause/unpause, or pause radosgw processes) prior to restarting/upgrading OSDs; see the example command sequence below

That ensures the last_update for all shards of each PG matches, so no rollback will be needed (if the PG has incorrectly rolled forward too far, the rollback won't be possible).
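
For illustration, a hedged sketch of that workaround as a command sequence (assuming systemd-managed OSDs; adapt the restart step to your deployment):

  ceph osd pause                     # set pauserd/pausewr to block client IO cluster-wide
  systemctl restart ceph-osd@<id>    # restart/upgrade the OSD(s)
  ceph osd unpause                   # resume client IO once the OSDs are back up and peered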

Actions #11

Updated by Josh Durgin almost 6 years ago

  • Assignee set to Sage Weil
Actions #12

Updated by Sage Weil almost 6 years ago

  • Status changed from In Progress to Pending Backport
Actions #13

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24890: luminous: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added
Actions #14

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24891: mimic: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() added
Actions #15

Updated by Dan van der Ster almost 6 years ago

Could cephfs trigger this issue? There have been two reports of cephfs_metadata pool crc errors on the users ML this week.

Actions #16

Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved