Bug #21174 (closed)

OSD crash: 903: FAILED assert(objiter->second->version > last_divergent_update)

Added by Martin Millnert over 6 years ago. Updated almost 5 years ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I've set up a CephFS erasure-coded pool on a small cluster consisting of 5 BlueStore OSDs.
The pools were created as follows:

ceph osd pool create cephfs_metadata 160 160 replicated
ceph osd pool create cephfs_data 160 160 erasure ecpool ec
ceph osd pool set cephfs_data allow_ec_overwrites true
ceph fs new cephfs cephfs_metadata cephfs_data
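
The ecpool erasure code profile referenced in the second command was created beforehand. For reference, a profile along these lines would be plausible on a 5-OSD cluster (the k/m values and failure domain shown here are illustrative assumptions, not necessarily the ones I used):

ceph osd erasure-code-profile set ecpool k=3 m=2 crush-failure-domain=osd
ceph osd erasure-code-profile get ecpool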

I started copying files onto the CephFS, and the OSDs have now started crashing in an endless loop. The cluster is unavailable (which is not critical for me, but would be for a "live cluster").
The log, available at https://martin.millnert.se/files/cephfs_ec/ceph-osd.1.log.all.gz, shows the crash occurring in the vicinity of a lot of output about "omap" operations.

In the documentation at http://docs.ceph.com/docs/master/rados/operations/erasure-code/ it is stated that erasure coded pools do not support omap operations, which is why special care has to be taken when using them with RBD.
For CephFS, it simply states: "For Cephfs, using an erasure coded pool means setting that pool in a file layout." with a link to the section on CephFS file layouts: http://docs.ceph.com/docs/master/cephfs/file-layouts/
The file layouts section does not reciprocate this link, i.e. the logic/context of using erasure coded pools with CephFS is not explained any further there.
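
As far as I can tell, "setting that pool in a file layout" means the EC pool should not be the filesystem's default data pool but an additional data pool, with a directory's layout then pointed at it, e.g. (the mount point and directory name here are just examples):

setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/cephfs/ecdata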

So, provided I've made a user error here and there is no other bug causing my OSDs to crash, I think it would be wise to make the documentation on how to use EC pools for CephFS more explicit.

Furthermore, if it is indeed illegal to create a cephfs using the command I did, i.e. "ceph fs new cephfs <replicated_metadata_pool> <erasure_coded_data_pool>", the code should probably test for and reject that, to avoid cluster-down states further down the road.
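
If that is the case, the working alternative would presumably be to create the filesystem with a replicated default data pool and then attach the EC pool as an additional data pool, along these lines (the cephfs_data_rep name is just an example):

ceph osd pool create cephfs_data_rep 160 160 replicated
ceph fs new cephfs cephfs_metadata cephfs_data_rep
ceph fs add_data_pool cephfs cephfs_data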


Related issues (2: 0 open, 2 closed)

Related to RADOS - Bug #39023: osd/PGLog: preserve original_crt to check rollbackability (Resolved, Neha Ojha, 03/28/2019)

Related to RADOS - Bug #16279: assert(objiter->second->version > last_divergent_update) failed (Closed, 06/14/2016)
