Bug #14372
Closed
ENOTSUPP on trimtrunc (EC with cache pool on top)
Added by Jérôme Poulin over 8 years ago. Updated over 8 years ago.
Description
After copying a new file onto CephFS and issuing an ls on the folder, I/O on CephFS froze, and our monitoring reported that all MDS were down.
Our cluster is on 0.80.10; I tried using 0.80.11 for ceph-mds, without success either.
After running the debugger, it seems to fail at:
Breakpoint 1, C_MDC_TruncateFinish::finish (this=0xa6fe820, r=0) at mds/MDCache.cc:6125
6125        assert(r == 0 || r == -ENOENT);
(gdb) p r
$1 = 0
I will attach the 2.7MB log file to the ticket.
Right now, the filesystem is completely offline and I don't want to try and skip the assert without prior suggestion.
Updated by John Spray over 8 years ago
- Project changed from Linux kernel client to CephFS
- Category changed from fs/ceph to 47
Updated by Jérôme Poulin over 8 years ago
I'll be available as TiCPU on IRC if you need me to test any code. It is not urgent but still a major inconvenience right now.
Updated by Greg Farnum over 8 years ago
Please provide the output of "ceph -s", and the log file[*]. This is one of the many asserts which check that operations against RADOS return the expected values (in this case, a truncate), so probably there's a problem with the data on disk or with your cluster state.
[*]: You may want to set "debug mds = 20", "debug ms = 1", "debug objecter = 10", and "debug filer = 20" as well. If the log's too large to upload you can use ceph-post-file.
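In ceph.conf terms, those debug settings would sit in the MDS section; a minimal fragment (section placement assumed, values as suggested above):

```ini
[mds]
    debug mds = 20
    debug ms = 1
    debug objecter = 10
    debug filer = 20
```

Restart ceph-mds (or inject the settings at runtime) for them to take effect.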
Updated by Jérôme Poulin over 8 years ago
- ceph -s
cluster 98f5178a-6c39-4fcf-8ebe-4250c09a8b69
health HEALTH_ERR 3 pgs degraded; 1 pgs inconsistent; 11 pgs stuck unclean; recovery 1038/6030094 objects degraded (0.017%); 1 scrub errors; mds 1 is laggy; noout flag(s) set
monmap e2: 3 mons at {1=10.10.252.1:6789/0,2=10.10.252.2:6789/0,3=10.10.252.3:6789/0}, election epoch 3088, quorum 0,1,2 1,2,3
mdsmap e45880: 1/1/1 up {0=1=up:active(laggy or crashed)}
osdmap e27029: 13 osds: 13 up, 13 in
flags noout
pgmap v72835883: 2700 pgs, 8 pools, 2644 GB data, 1568 kobjects
7391 GB used, 3173 GB / 11085 GB avail
1038/6030094 objects degraded (0.017%)
2685 active+clean
1 active+clean+inconsistent
11 active+remapped
3 active+clean+degraded
client io 0 B/s rd, 1400 kB/s wr, 357 op/s
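The 0.017% figure in that status output can be cross-checked directly from the reported counts (1038 degraded object instances out of 6030094):

```shell
# Recompute the degraded ratio shown by `ceph -s`.
awk 'BEGIN { printf "%.3f%%\n", 1038 / 6030094 * 100 }'
# prints 0.017%
```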
I'm aware of the inconsistent PG. I posted something on the mailing list about ceph pg repair not working; I just noticed bug #12577 existed. My post was also about the deep-scrub command not getting executed either. Ref.: http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/26122
I'll get a log file to find out what's wrong, but I'm pretty sure this inconsistent PG is to blame. As for the degraded/remapped PGs, this is an EC pool, and we have 4 machines with 4 OSDs to add, which should fix this soon.
Updated by Greg Farnum over 8 years ago
Are you running CephFS against an EC pool? (Perhaps with a cache pool in front?)
If so, yes, that's probably the issue. Your filesystem isn't going to work right until the "disk" underneath it is behaving properly!
Updated by Jérôme Poulin over 8 years ago
I confirm CephFS is on an EC pool, without cache. The disk's bad sector was repaired and everything now shows clean; however, ceph pg commands refuse to work against this specific PG, so it won't try to repair anything. I confirmed via SMART that no further read/write errors have occurred since that one bad block. Is it better to replace the disk anyway and mark this OSD out?
Updated by Greg Farnum over 8 years ago
- Status changed from New to Rejected
You'll need to track that with the RADOS guys. You can open a support ticket in the overall Ceph project or send an email to the ceph-users list, if you can't find documentation on resolving it.
Anyway, not proceeding with broken data from RADOS is expected behavior, so closing the ticket.
Updated by Jérôme Poulin over 8 years ago
After fixing the inconsistent PG, I noticed that the single defective object was one from an RBD device and that after deleting it, Ceph fully repaired the object.
Here is a log captured with the recommended options.
ceph-post-file: 147fc917-2a4e-4f22-9a72-f1295f72a081
Updated by Greg Farnum over 8 years ago
- Project changed from CephFS to Ceph
- Subject changed from ceph-mds 0.80.11 assert on startup to ENOTSUPP on trimtrunc (EC with cache pool on top)
- Category deleted (47)
- Status changed from Rejected to New
- Priority changed from High to Normal
Looking at this log, we're getting EOPNOTSUPP (Operation not supported) on a trimtrunc command. That is apparently not supported on EC pools, but you said it has a cache pool in front? Tossing this back to the RADOS team.
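This is exactly what trips the assert quoted in the backtrace: assert(r == 0 || r == -ENOENT) in mds/MDCache.cc only tolerates success or "no such object", so a -EOPNOTSUPP return aborts the MDS. A minimal shell sketch of that check (errno values assumed Linux: ENOENT=2, EOPNOTSUPP=95):

```shell
# Mirrors assert(r == 0 || r == -ENOENT) from mds/MDCache.cc:
# any other return code, such as -EOPNOTSUPP (-95), fires the assert.
check() {
  r=$1
  if [ "$r" -eq 0 ] || [ "$r" -eq -2 ]; then
    echo "r=$r: ok"
  else
    echo "r=$r: assert fires"
  fi
}
check 0      # success
check -2     # -ENOENT, tolerated
check -95    # -EOPNOTSUPP, aborts
```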
Updated by Jérôme Poulin over 8 years ago
Sorry for the misunderstanding but we do not use caching on any pool at the moment.
Updated by Greg Farnum over 8 years ago
- Status changed from New to Rejected
Oh, I misread. CephFS doesn't function on top of uncached EC pools at all and won't for a long time.
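For reference, fronting an EC data pool with a replicated cache tier on this era of Ceph looks roughly like the following sketch (pool names cachepool and cephfs_data are placeholders, and these commands need a running cluster):

```shell
# Create a replicated pool to act as the cache tier (PG counts illustrative).
ceph osd pool create cachepool 128 128 replicated
# Attach it in front of the EC data pool in writeback mode.
ceph osd tier add cephfs_data cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay cephfs_data cachepool
```

With the overlay set, client I/O (including truncates) lands on the replicated cache tier, which does support those operations.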
Updated by Nathan Cutler over 8 years ago
- Subject changed from ENOTSUPP on trimtrunc (EC with cache pool on top) to ENOTSUPP on trimtrunc (EC without cache pool on top)
Updated by Jérôme Poulin over 8 years ago
I'm not sure how to update the fields on top but they can now be changed to:
Title: ENOTSUPP on trimtrunc (EC with cache pool on top)
Affected Versions: v0.94.5
We did upgrade a snapshot of this cluster to Hammer and added a cache tier on data, then on data and metadata. After starting ceph-mds, it still asserts at what seems to be the same place. I posted log files from both the old and new MDS versions with the cache tier enabled.
ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
ceph-post-file: 84a3be40-687e-47be-ae49-07994cfa63d2
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
ceph-post-file: d13b4a54-f7bc-4516-aa29-460df637a761
Updated by Greg Farnum over 8 years ago
- Subject changed from ENOTSUPP on trimtrunc (EC without cache pool on top) to ENOTSUPP on trimtrunc (EC with cache pool on top)
- Assignee set to David Zafman
David, does this sequence of events seem like an issue you're aware of? Do you have any theories? I haven't checked the new logs but if there's a cache pool I don't think the MDS can be doing anything special to break stuff.
Updated by Jérôme Poulin over 8 years ago
Is it possible that, since I had no cache when I encountered the problem, corrupt data was written to the journal?
Is the journal stored in metadata or data? The bulk of our files is on EC, metadata is replicated, and the last operation we made was on a folder in a replicated data pool.
Updated by Jérôme Poulin over 8 years ago
Since we have now updated the real cluster to Hammer, I tried running cephfs-journal-tool event recover_dentries summary on the CephFS, since we are starting to get demand for some of the files that are stuck in CephFS only (not yet backed up).
Here is the output:
root@Ceph1:~# cephfs-journal-tool event recover_dentries summary
2016-02-09 18:28:01.478805 7fde2982e840 1 scavenge_dentries: frag 100000043a3.00000000 is corrupt, overwriting
Events by type:
OPEN: 6112
SESSION: 33
SUBTREEMAP: 31
TABLECLIENT: 110
TABLESERVER: 220
UPDATE: 34956
Errors: 0
On the real cluster, I did not add the cache tier, since I have found it to be unstable with RBD, and on this version of Hammer a cache tier added for CephFS cannot be removed afterwards. We will have to do more testing before adding a cache tier.
I have also dumped a binary of the journal if you want me to join it to the ticket. After resetting the journal and restarting the MDS, everything is back to normal.
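The recovery sequence described above, sketched end to end (run against a live cluster with the MDS stopped; the backup filename is a placeholder):

```shell
# Keep a binary copy of the journal before touching it.
cephfs-journal-tool journal export journal-backup.bin
# Replay recoverable dentries from the journal into the metadata store.
cephfs-journal-tool event recover_dentries summary
# Discard the (now-redundant, partly corrupt) journal, then restart the MDS.
cephfs-journal-tool journal reset
```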
Updated by David Zafman over 8 years ago
- Status changed from New to Rejected
I don't think there should be anything in a journal for an unsupported operation.
Everything is working now as expected.