Bug #14372

closed

ENOTSUPP on trimtrunc (EC with cache pool on top)

Added by Jérôme Poulin over 8 years ago. Updated about 8 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After copying a new file onto CephFS and issuing an ls on the folder, I/O froze on CephFS, and our monitoring reported that all MDS daemons were down.

Our cluster is on 0.80.10; I also tried 0.80.11 for ceph-mds, without success.

After running the debugger, it seems to fail at:

Breakpoint 1, C_MDC_TruncateFinish::finish (this=0xa6fe820, r=0) at mds/MDCache.cc:6125
6125        assert(r == 0 || r == -ENOENT);
(gdb) p r
$1 = 0

I will attach the 2.7MB log file to the ticket.

Right now, the filesystem is completely offline and I don't want to try and skip the assert without prior suggestion.
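For reference, the debugger session above was captured with something along these lines (the binary path and MDS id are what I believe we used and may differ on other setups):

gdb --args /usr/bin/ceph-mds -i 1 -f
(gdb) break MDCache.cc:6125
(gdb) run
(gdb) print r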

Actions #1

Updated by John Spray over 8 years ago

  • Project changed from Linux kernel client to CephFS
  • Category changed from fs/ceph to 47
Actions #2

Updated by Jérôme Poulin over 8 years ago

I'll be available as TiCPU on IRC if you need me to test any code. It is not urgent but still a major inconvenience right now.

Actions #3

Updated by Greg Farnum over 8 years ago

Please provide the output of "ceph -s", and the log file[*]. This is one of the many asserts which check that operations against RADOS return the expected values (in this case, a truncate), so probably there's a problem with the data on disk or with your cluster state.

[*]: You may want to set "debug mds = 20", "debug ms = 1", "debug objecter = 10", and "debug filer = 20" as well. If the log's too large to upload you can use ceph-post-file.
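For reference, those debug settings go in the [mds] section of ceph.conf on the MDS host, and the resulting log can then be uploaded with ceph-post-file (the log path below assumes the default location):

[mds]
  debug mds = 20
  debug ms = 1
  debug objecter = 10
  debug filer = 20

ceph-post-file /var/log/ceph/ceph-mds.1.log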

Actions #4

Updated by Jérôme Poulin over 8 years ago

  1. ceph -s
    cluster 98f5178a-6c39-4fcf-8ebe-4250c09a8b69
    health HEALTH_ERR 3 pgs degraded; 1 pgs inconsistent; 11 pgs stuck unclean; recovery 1038/6030094 objects degraded (0.017%); 1 scrub errors; mds 1 is laggy; noout flag(s) set
    monmap e2: 3 mons at {1=10.10.252.1:6789/0,2=10.10.252.2:6789/0,3=10.10.252.3:6789/0}, election epoch 3088, quorum 0,1,2 1,2,3
    mdsmap e45880: 1/1/1 up {0=1=up:active(laggy or crashed)}
    osdmap e27029: 13 osds: 13 up, 13 in
    flags noout
    pgmap v72835883: 2700 pgs, 8 pools, 2644 GB data, 1568 kobjects
    7391 GB used, 3173 GB / 11085 GB avail
    1038/6030094 objects degraded (0.017%)
    2685 active+clean
    1 active+clean+inconsistent
    11 active+remapped
    3 active+clean+degraded
    client io 0 B/s rd, 1400 kB/s wr, 357 op/s

I'm aware of the inconsistent PG. I posted something on the mailing list about ceph pg repair not working; I only just noticed that bug #12577 existed. My post was also about the deep-scrub command not getting executed. Ref.: http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/26122

I'll get a log file to find out what's wrong, but I'm pretty sure this inconsistent PG is to blame. As for the degraded/remapped PGs, this is an EC pool, and we have 4 machines with 4 OSDs to add soon to fix this.
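For the record, what I have been running against that PG is along these lines (the PG id here is a placeholder; the real one comes from ceph health detail):

ceph health detail | grep inconsistent
ceph pg deep-scrub 2.3f
ceph pg repair 2.3f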

Actions #5

Updated by Greg Farnum over 8 years ago

Are you running CephFS against an EC pool? (Perhaps with a cache pool in front?)

If so, yes, that's probably the issue. Your filesystem isn't going to work right until the "disk" underneath it is behaving properly!

Actions #6

Updated by Jérôme Poulin over 8 years ago

I confirm that CephFS is on an EC pool, without a cache tier. The disk's bad sector was repaired and everything now shows as clean; however, ceph pg commands refuse to work against this specific PG, and it won't try to repair anything. SMART confirms that no further read/write errors have occurred since that one bad block. Is it better to replace the disk anyway and out this OSD?
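If replacing it is the way to go, I assume the usual drain-and-remove sequence applies (the OSD id here is a placeholder, and recovery should finish after the "out" before the later steps):

ceph osd out 7
ceph osd crush remove osd.7
ceph auth del osd.7
ceph osd rm 7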

Actions #7

Updated by Greg Farnum over 8 years ago

  • Status changed from New to Rejected

You'll need to track that with the RADOS guys. You can open a support ticket in the overall Ceph project or send an email to the ceph-users list, if you can't find documentation on resolving it.

Anyway, not proceeding with broken data from RADOS is expected behavior, so closing the ticket.

Actions #8

Updated by Jérôme Poulin over 8 years ago

After fixing the inconsistent PG, I noticed that the single defective object was from an RBD device and that, after deleting it, Ceph fully repaired the object.

Here is a log captured with the recommended options.
ceph-post-file: 147fc917-2a4e-4f22-9a72-f1295f72a081

Actions #9

Updated by Greg Farnum over 8 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from ceph-mds 0.80.11 assert on startup to ENOTSUPP on trimtrunc (EC with cache pool on top)
  • Category deleted (47)
  • Status changed from Rejected to New
  • Priority changed from High to Normal

Looking at this log, we're getting EOPNOTSUPP (Operation not supported) on a trimtrunc command. That is apparently not supported on EC pools, but you say it has a cache pool in front? Tossing this back to the RADOS team.

Actions #10

Updated by Jérôme Poulin over 8 years ago

Sorry for the misunderstanding but we do not use caching on any pool at the moment.

Actions #11

Updated by Greg Farnum over 8 years ago

  • Status changed from New to Rejected

Oh, I misread. CephFS doesn't function on top of uncached EC pools at all and won't for a long time.
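For context, the supported way to run CephFS over EC at this point is to put a replicated cache tier in front of the EC data pool, along these lines (pool names are placeholders, and the replicated cache pool must already exist):

ceph osd tier add data cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay data cachepool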

Actions #12

Updated by Nathan Cutler over 8 years ago

  • Subject changed from ENOTSUPP on trimtrunc (EC with cache pool on top) to ENOTSUPP on trimtrunc (EC without cache pool on top)
Actions #13

Updated by Jérôme Poulin about 8 years ago

I'm not sure how to update the fields on top but they can now be changed to:

Title: ENOTSUPP on trimtrunc (EC with cache pool on top)
Affected Versions: v0.94.5

We upgraded a snapshot of this cluster to Hammer and added a cache tier on data, then on both data and metadata. After starting ceph-mds, it still asserts at what seems to be the same place. I posted log files from both the old and the new MDS version with the cache tier enabled.

ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
ceph-post-file: 84a3be40-687e-47be-ae49-07994cfa63d2

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
ceph-post-file: d13b4a54-f7bc-4516-aa29-460df637a761

Actions #14

Updated by Greg Farnum about 8 years ago

  • Subject changed from ENOTSUPP on trimtrunc (EC without cache pool on top) to ENOTSUPP on trimtrunc (EC with cache pool on top)
  • Assignee set to David Zafman

David, does this sequence of events seem like an issue you're aware of? Do you have any theories? I haven't checked the new logs but if there's a cache pool I don't think the MDS can be doing anything special to break stuff.

Actions #15

Updated by Jérôme Poulin about 8 years ago

Is it possible that, since I had no cache tier when I encountered the problem, corrupt data was written to the journal?

Is the journal stored in the metadata pool or the data pool? The bulk of our files are on EC, metadata is replicated, and the last operation we made was on a folder in a replicated data pool.
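If it helps, I can check where the journal objects actually live with something like this (assuming the default naming, where rank 0's journal objects are prefixed with inode 200 in hex):

rados -p metadata ls | grep '^200\.'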

Actions #16

Updated by Greg Farnum about 8 years ago

  • Status changed from Rejected to New
Actions #17

Updated by Sage Weil about 8 years ago

  • Priority changed from Normal to Urgent
Actions #18

Updated by Jérôme Poulin about 8 years ago

Since we have now upgraded the real cluster to Hammer, I tried running cephfs-journal-tool event recover_dentries summary on the CephFS, as we are starting to have demand for some of the objects that are stuck in CephFS only (not yet backed up).

Here is the output:
root@Ceph1:~# cephfs-journal-tool event recover_dentries summary
2016-02-09 18:28:01.478805 7fde2982e840 1 scavenge_dentries: frag 100000043a3.00000000 is corrupt, overwriting
Events by type:
OPEN: 6112
SESSION: 33
SUBTREEMAP: 31
TABLECLIENT: 110
TABLESERVER: 220
UPDATE: 34956
Errors: 0

On the real cluster, I did not add the cache tier, since I have found it to be unstable with RBD and, in the case of CephFS, impossible to remove again on this version of Hammer. We will have to do more testing before adding a cache tier.
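For reference, the removal sequence that would normally be used (and which, as far as I could tell, cannot be applied to the CephFS pools on this version) is roughly this, with pool names as placeholders:

ceph osd tier cache-mode cachepool forward
rados -p cachepool cache-flush-evict-all
ceph osd tier remove-overlay data
ceph osd tier remove data cachepool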

I have also dumped a binary copy of the journal if you want me to attach it to the ticket. After resetting the journal and restarting the MDS, everything is back to normal.
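Concretely, the steps were along these lines (the export file name is arbitrary):

cephfs-journal-tool journal export backup.bin
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset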

Actions #19

Updated by David Zafman about 8 years ago

  • Status changed from New to Rejected

I don't think there should be anything in a journal for an unsupported operation.

Everything is now working as expected.
