Feature #8230

mds: new requests are not throttled by backend rados progress

Added by Alexandre Oliva almost 10 years ago. Updated over 7 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Introspection/Control
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

The use of the ceph.parent attribute as actual cephfs metadata, including its use for recovery and hard links, has turned what was just a pain point into a reason for real concern.

The pain point is that the batchy use of ceph.parent by the mds, both when creating lots of files and when recovering the filesystem, slows the cluster down significantly. It is very common for me, when exploding a tarball with lots of small files or after an mds start, to see my data osds reporting long lists of slow requests. When file creation is sustained, it is very common for such osds to end up hitting a suicide timeout and die, slowing things down further. At first I thought this had to do with inefficiencies in btrfs's xattrs, but stopping ceph from using xattrs, so that ceph.parent was stored in leveldb, didn't fare any better.

The reason for concern is that, when such failures occur, the replication count of the ceph.parent metadata attributes stored in data pools goes below the replication level I set for metadata. I keep metadata on smaller, faster disks, and at higher replication counts, but holding ceph.parent in data pools defeats both intents, of higher reliability and of faster access.

Have you considered storing ceph.parent attributes of data objects in the metadata pool, perhaps under the same object id used for the data, either as an attribute or as the data proper?

Considering that xattrs are metadata, it might make sense for all other xattrs to be held there, too.
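
To make the suggestion concrete, here is a minimal python-rados sketch of the idea, assuming the backpointer were mirrored as an xattr on a same-named object in the metadata pool; the pool names and object id are just examples, and this is not how the mds stores ceph.parent today:

import rados

# Hypothetical sketch of the proposal above: keep the backpointer in the
# metadata pool, under the same object id used for the data.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    data_pool = cluster.open_ioctx('data')        # example pool names
    meta_pool = cluster.open_ioctx('metadata')
    oid = '10000000001.00000000'                  # example object id

    backptr = data_pool.get_xattr(oid, 'parent')  # where it lives today
    meta_pool.write_full(oid, b'')                # make sure the object exists
    meta_pool.set_xattr(oid, 'parent', backptr)   # where it would live instead
finally:
    cluster.shutdown()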

History

#1 Updated by Alexandre Oliva almost 10 years ago

This change might turn out to have another unexpected benefit, although the reason for that might turn out to be a bug.

While one OSD was down, I created a number of hard links to files that hadn't been touched for a while. The MDS updated the parent xattr in them.

When the OSD came back up, it started log-based recovery. What surprised me was that strace showed each file that had had the parent updated was being newly-written in full to the recovering osd, even though the data had not been modified in any way.

I vaguely recall seeing, a while ago, code in the osd implementation that tried to avoid that, by keeping separate versions for file data and xattrs. My guess is that this may be failing because the file is regarded as missing during osd startup; the init code that goes backwards through the log accessing each touched file (besides taking a very long time to run on this osd, because the number of files to recover is so large) doesn't seem to be concerned with multiple version numbers. Isn't that a bug? (running 0.79 now, will upgrade to 0.80-rc shortly)
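
For what it's worth, here is roughly how one could confirm that only the backpointer changed on such an object, before and after creating the hard link; a python-rados sketch with example pool and object names:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')            # example pool name
try:
    oid = '10000000001.00000000'              # example object id
    size, mtime = ioctx.stat(oid)             # object data size and mtime
    parent = ioctx.get_xattr(oid, 'parent')   # the internal backpointer blob
    print(oid, size, mtime, len(parent))      # data unchanged, xattr updated
finally:
    ioctx.close()
    cluster.shutdown()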

#2 Updated by Greg Farnum almost 10 years ago

Hmm, there are several things going on here that you're conflating. Some questions and notes:
1) Exploding a tarball can kill OSDs. That's a little disturbing, although possibly just a result of trying to push too many ops through. Have you taken a peek at what requests are pending within the OSD's pipeline when that happens? What version are you running?
2) Flushing out parent backpointers is batched up. That's a weakness with our current implementation, but you're the first to report it as a problem.
3) We keep the backpointers on the objects because it (a) reduces the total number of objects in the system, and (b) makes cleanup of the backpointers a lot simpler. (They live with the object they are about, and get deleted together.)
4) CephFS does store user xattrs in the metadata pool, as part of the directory the files belong to. The parent xattr is internal metadata and should not be confused with external user metadata (see the sketch after this list).
5) We didn't consider that recovering a changed backpointer might require recovering the whole object, but I don't think we've ever had any OSD code that would avoid that. There's certainly never been separate versioning within the OSD.
6) Setting up hard links shouldn't have touched the original file, unless it hadn't yet had a parent xattr written.
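
To illustrate the distinction in (4), a small sketch; the mount point, pool name and object id below are made up:

import os
import rados

# User xattr: set through the filesystem, handled by the MDS and stored with
# the directory/inode metadata in the metadata pool, not on data objects.
os.setxattr('/mnt/cephfs/some/file', 'user.note', b'hello')    # example path

# Internal backtrace: the 'parent' xattr on the object in the data pool.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')                             # example pool name
backtrace = ioctx.get_xattr('10000000001.00000000', 'parent')  # example object id
ioctx.close()
cluster.shutdown()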

#3 Updated by Alexandre Oliva almost 10 years ago

1) I've had batches of ceph.parent ops pile up ever since ceph.parent was introduced (early 0.7x IIRC); I'm running 0.79 now. Requests pile up and are reported as delayed particularly on disks that are bigger and hold more data PGs, but I get them even on the smaller and faster disks containing metadata when creating a tree with lots of directories. The metadata disks delay requests but eventually catch up; the larger disks often fail to keep up, and give up when they hit the suicide timeout. When they come back up, others have most often made good progress and now they have to recover the objects that only had ceph.parent modified, but they can't keep up if activity continues, so they fail again and again, taking longer and longer to come back up as the log grows and more objects have to be checked for “missing” before the osd can come back up, and longer and longer to recover as they have to recover the full objects rather than just the ceph.parent attribute.

2) I wish it wasn't a problem; I suppose it might be more efficient to create the xattr along with the osd file, when the dir containing it is still hot, but btrfs isn't known for stellar xattr performance. Sadly, switching to leveldb-stored attributes didn't make that any better.

3) I understand the motivations, but reducing the replication count of filesystem metadata for this sort of convenience doesn't sound right. Like, btrfs insists that all metadata be kept in replicated chunks, presumably because that makes recovery of the filesystem possible in case of localized corruption. Any piece of metadata stored in non-replicated data chunks could make a filesystem unrecoverable. Now, when ceph.parent was introduced, it was only meant to aid in mds recovery IIRC, and the worst effect was the slowdown from hammering the slower data disks with metadata requests, but now that it's going to be used as the backpointer for hardlinks, it had better be handled with as much care as other pieces of metadata.

4) Sorry, I was confused by other xattrs that cephfs stores in data objects.

5) I guess it is the mds that keeps separate versions for data and metadata, then.

6) I suppose the hard-linked files were updated because their ceph.parent xattr was of an older version, so the code added a few releases ago to schedule a ceph.parent update at such times was triggered.

#4 Updated by Greg Farnum almost 10 years ago

Can you get me the backtraces from the suiciding OSDs? That shouldn't be happening regardless of how many ops we dump at it.

#5 Updated by Alexandre Oliva almost 10 years ago

There were plenty of other timeouts before the suicide, and this sort of situation is not specific to ceph.parent batches; in my experience, any sustained workload that exceeds the throughput of the slowest osd will eventually cause this slowest osd to suicide because it can't keep up: I/O on the underlying btrfs presumably gets slower and slower until some operation reaches the suicide timeout.

Here is the latest sample, in which the osd was recovering from earlier failures while other uses of the cephfs were underway:

-398> 2014-04-28 12:18:47.798145 7f617de64700  5 - op tracker -- , seq: 96045, time: 2014-04-28 12:18:47.798144, event: waiting for subops from (1,255),(10,255), request: osd_op(client.610213.0:731234 1000082579d.00000019 [write 3505880~458752] 0.9515a9ad snapc 1=[] ondisk+write e71456) v4
-397> 2014-04-28 12:18:47.798203 7f617de64700 1 - 172.31.160.7:6802/16998 --> osd.1 172.31.160.6:6813/2659 -- osd_sub_op(client.610213.0:731234 0.2d 9515a9ad/1000082579d.00000019/head//0 [] v 71456'222312 snapset=0=[]:[] snapc=0=[]) v9 -- ?+459363 0x16195000
-396> 2014-04-28 12:18:47.798279 7f617de64700 1 - 172.31.160.7:6802/16998 --> osd.10 172.31.160.7:6811/13035 -- osd_sub_op(client.610213.0:731234 0.2d 9515a9ad/1000082579d.00000019/head//0 [] v 71456'222312 snapset=0=[]:[] snapc=0=[]) v9 -- ?+459363 0x16191400
-395> 2014-04-28 12:18:47.798467 7f617de64700 5 - op tracker -- , seq: 96045, time: 2014-04-28 12:18:47.798467, event: commit_queued_for_journal_write, request: osd_op(client.610213.0:731234 1000082579d.00000019 [write 3505880~458752] 0.9515a9ad snapc 1=[] ondisk+write e71456) v4
-394> 2014-04-28 12:18:47.798536 7f617de64700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f617de64700' had timed out after 30
-393> 2014-04-28 12:18:47.798554 7f617de64700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f617de64700' had suicide timed out after 300
[...]
0> 2014-04-28 12:18:50.190783 7f617de64700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f617de64700 time 2014-04-28 12:18:47.798628
common/HeartbeatMap.cc: 79: FAILED assert(0 "hit suicide timeout")
ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0x9b52c9]
2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x89) [0x9b55e9]
3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5ed) [0xa636fd]
4: (ThreadPool::WorkThread::entry()+0x10) [0xa64b10]
5: /lib64/libpthread.so.0() [0x3801607f33]
6: (clone()+0x6d) [0x3800ef4ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[...]
--- end dump of recent events ---
2014-04-28 12:18:50.190784 7f61ffb30700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f61ffb30700 time 2014-04-28 12:18:47.805406
common/HeartbeatMap.cc: 79: FAILED assert(0 "hit suicide timeout")
ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0x9b52c9]
2: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x9b5ba6]
3: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x9b61ec]
4: (CephContextServiceThread::entry()+0x15b) [0xa86e6b]
5: /lib64/libpthread.so.0() [0x3801607f33]
6: (clone()+0x6d) [0x3800ef4ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
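
To spell out what I mean by the osd grinding to a halt until it trips the suicide timeout, here is a rough sketch of the watchdog behaviour visible in the trace above; this is not the actual HeartbeatMap code, just the idea, with the 30s/300s thresholds taken from the log:

import time

GRACE = 30     # seconds before a worker thread is reported as timed out
SUICIDE = 300  # seconds before the daemon asserts out ("hit suicide timeout")

last_progress = time.monotonic()   # bumped each time the worker finishes an op

def check_heartbeat(now):
    # Sketch of the idea behind ceph::HeartbeatMap::_check: compare the time
    # since the worker last made progress against two thresholds; the first
    # only logs a warning, the second aborts the whole daemon, which is what
    # the backtrace above shows.
    stalled = now - last_progress
    if stalled > SUICIDE:
        raise AssertionError("hit suicide timeout")
    if stalled > GRACE:
        print("worker thread had timed out after", GRACE)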

#6 Updated by Sage Weil almost 10 years ago

  • Project changed from Ceph to CephFS
  • Priority changed from Normal to High

#7 Updated by Sage Weil almost 10 years ago

  • Subject changed from move ceph.parent back to metadata to mds: new requests are not throttled by backend rados progress
  • Source changed from other to Community (dev)

This sounds like a general problem of throttling new work based on the MDS's ability to make progress against the rados backend.
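
A rough sketch of the sort of throttle meant here, assuming a simple cap on in-flight backend operations; the class and helper names are made up and this is not MDS code:

import threading

class BackendThrottle:
    """Toy admission control: block new work once too many rados ops are in flight."""

    def __init__(self, max_in_flight=128):
        self._slots = threading.Semaphore(max_in_flight)

    def start_op(self):
        # Called before submitting e.g. a journal or backtrace update; blocks
        # (throttling the client request) until the backend has acknowledged
        # enough earlier operations.
        self._slots.acquire()

    def finish_op(self):
        # Called from the rados completion callback.
        self._slots.release()

# Usage sketch (submit_backtrace_update is hypothetical):
# throttle = BackendThrottle(max_in_flight=128)
# throttle.start_op()
# submit_backtrace_update(..., on_complete=throttle.finish_op)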

#8 Updated by Sage Weil over 9 years ago

  • Tracker changed from Bug to Feature

#9 Updated by Greg Farnum almost 8 years ago

  • Category set to Introspection/Control
