<p>Ceph CephFS - Bug #58640: ceph-fuse in infinite loop reading objects without client requests<br />https://tracker.ceph.com/issues/58640</p>
<p>Updated by Andras Pataki (apataki@simonsfoundation.org), 2023-02-03T20:25:10Z: https://tracker.ceph.com/issues/58640?journal_id=230999</p>
<ul></ul><p>I've uploaded the full 1 minute ceph-fuse trace as:<br />ceph-post-file: d56ebc47-4ef7-4f01-952f-8569c8c92982</p>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-06T13:56:03Z: https://tracker.ceph.com/issues/58640?journal_id=231017</p>
<ul><li><strong>Category</strong> set to <i>Correctness/Safety</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>Triaged</i></li><li><strong>Assignee</strong> set to <i>Venky Shankar</i></li><li><strong>Target version</strong> set to <i>v18.0.0</i></li><li><strong>Backport</strong> set to <i>pacific,quincy</i></li><li><strong>Severity</strong> changed from <i>3 - minor</i> to <i>2 - major</i></li><li><strong>Component(FS)</strong> <i>ceph-fuse</i> added</li></ul>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-07T04:10:45Z: https://tracker.ceph.com/issues/58640?journal_id=231063</p>
<ul></ul><p>Andras Pataki wrote:</p>
<blockquote>
<p>I've uploaded the full 1 minute ceph-fuse trace as:<br />ceph-post-file: d56ebc47-4ef7-4f01-952f-8569c8c92982</p>
</blockquote>
<p>Thanks for the logs. I'll have a look.</p>
<p>Quick question, Andras - any configuration changes applied to the cluster (esp. around object default sizes and the like)?</p>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-07T10:33:15Z: https://tracker.ceph.com/issues/58640?journal_id=231074</p>
<ul></ul><p>There seem to be a lot of cache misses for objects in the ObjectCacher. The retry comes from:</p>
<pre>
for (map<loff_t, BufferHead*>::iterator bh_it = missing.begin();
     bh_it != missing.end();
     ++bh_it) {
  uint64_t rx_bytes = static_cast<uint64_t>(
    stat_rx + bh_it->second->length());
  bytes_not_in_cache += bh_it->second->length();
  if (!waitfor_read.empty() || (stat_rx > 0 && rx_bytes > max_size)) {
    // cache is full with concurrent reads -- wait for rx's to complete
    // to constrain memory growth (especially during copy-ups)
    if (success) {
      ldout(cct, 10) << "readx missed, waiting on cache to complete "
                     << waitfor_read.size() << " blocked reads, "
                     << (std::max(rx_bytes, max_size) - max_size)
                     << " read bytes" << dendl;
      waitfor_read.push_back(new C_RetryRead(this, rd, oset, onfinish,
                                             *trace));
    }
    bh_remove(o, bh_it->second);
    delete bh_it->second;
  } else {
    bh_it->second->set_nocache(nocache);
    bh_read(bh_it->second, rd->fadvise_flags, *trace);
    if ((success && onfinish) || last != missing.end())
      last = bh_it;
  }
  success = false;
}
</pre>
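<p>The effect of that loop can be modeled roughly as follows (a simplified sketch with illustrative names, not Ceph source): once the cache is saturated by in-flight reads, every missing extent takes the "wait" branch, so no new OSD read is issued and the whole readx is queued for retry.</p>

```python
# Simplified model (illustrative, NOT Ceph source) of how _readx handles
# missing extents. When the cache is saturated by concurrent reads, each
# missing extent is dropped and the entire readx is queued for retry;
# either way, success is cleared so the caller knows the read is incomplete.
def readx_once(missing_extents, waitfor_read, stat_rx, max_size):
    success = True
    issued = []  # extents for which an OSD read was actually started
    for length in missing_extents:
        rx_bytes = stat_rx + length
        if waitfor_read or (stat_rx > 0 and rx_bytes > max_size):
            if success:
                # queue a retry of the whole readx (C_RetryRead)
                waitfor_read.append("C_RetryRead")
            # the missing buffer head is removed; nothing is fetched
        else:
            issued.append(length)  # corresponds to bh_read()
        success = False
    return success, issued

# With the cache already at its limit, no reads are issued at all:
ok, issued = readx_once([25165824] * 4, [], 200 * 2**20, 200 * 2**20)
print(ok, issued)  # False []
```

Under saturation the retried readx finds the same extents missing again, which matches the repeated `map_read miss` lines below.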
<p>If I'm reading this (and other bits in `ObjectCacher::_readx`) right, for a set of extents there are <em>a lot</em> of cache misses over and over again, causing the ObjectCacher to retry. E.g., for object prefix 100dbad3fce.*, the only osd_op_reply is for object 100dbad3fce.000000b3:</p>
<pre>
2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter ms_dispatch 0x55555e305810 osd_op_reply(42776829 100dbad3fce.000000b3 [read 0~25165824 out=25165824b] v0'0 uv471599 ondisk = 0) v8
2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter in handle_osd_op_reply
2023-02-03T14:42:41.153-0500 7fffeb7fe700 7 client.151672236.objecter handle_osd_op_reply 42776829 ondisk uv 471599 in 9.6c40 attempt 0
2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter op 0 rval 0 len 25165824
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter handle_osd_op_reply completed tid 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter _finish_op 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 4
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter _session_op_remove 975 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700 5 client.151672236.objecter 169 in flight
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 finisher(0x55555e43bf90) finisher_thread doing [0x7fffb0abe4f0,0]
2023-02-03T14:42:41.153-0500 7fffdbfff700 7 objectcacher bh_read_finish 100dbad3fce.000000b3/head tid 42769552 0~25165824 (bl is 25165824) returned 0 outstanding reads 170
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher checking bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (0) v 0 rx] waiters = { 0->[0x7fffb1bd9c60, ]}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher bh_read_finish read bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) try_merge_bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher finishing waiters 0x7fffb1bd9c60
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 1 contexts to finish with 0
</pre>
<p>And when `_readx` iterates over the extents, the only hit it gets is for 100dbad3fce.000000b3:</p>
<pre>
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b1 (177) in @9 0~25165824 -> [0,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b1/head) map_read 100dbad3fce.000000b1 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b1/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb0db2b40 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx missed, waiting on cache to complete 147 blocked reads, 0 read bytes
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b2 (178) in @9 0~25165824 -> [25165824,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b2/head) map_read 100dbad3fce.000000b2 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b2/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb1229ae0 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read hit bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx hit bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx rmap opos 0: bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {} +0 frag 50331648~25165824 +0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b4 (180) in @9 0~25165824 -> [75497472,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b4/head) map_read 100dbad3fce.000000b4 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b4/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb1168830 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b5 (181) in @9 0~25165824 -> [100663296,25165824])
... (further map_read misses for the remaining extents elided) ...
</pre>
<p>Eventually, for the cache misses, an osd_op is sent which does get back an actual osd_op_reply and a subsequent cache hit (verified this for some objects). So the ObjectCacher doesn't look to be in a truly infinite loop, but the same set of objects misses the cache over and over. E.g., cache hits for object prefix 100dbad3fce are just a handful:</p>
<pre>
grep "map_read hit" ceph-fuse.log | grep 100dbad3fce.000000 | awk -F'(' '{ print $2 }' | awk -F')' '{ print $1 }' | sort -u
100dbad3fce.0000002b/head
100dbad3fce.0000002c/head
100dbad3fce.0000002d/head
100dbad3fce.00000056/head
100dbad3fce.00000057/head
100dbad3fce.00000058/head
100dbad3fce.00000059/head
100dbad3fce.00000096/head
100dbad3fce.00000097/head
100dbad3fce.00000098/head
100dbad3fce.000000b1/head
100dbad3fce.000000b2/head
100dbad3fce.000000b3/head
100dbad3fce.000000dc/head
100dbad3fce.000000dd/head
100dbad3fce.000000de/head
100dbad3fce.000000df/head
100dbad3fce.000000f2/head
100dbad3fce.000000f3/head
100dbad3fce.000000f4/head
100dbad3fce.000000f5/head
</pre>
<p>Updated by Greg Farnum (gfarnum@redhat.com), 2023-02-07T14:08:53Z: https://tracker.ceph.com/issues/58640?journal_id=231082</p>
<ul></ul><p>It’s also interesting that these appear to all be full-object reads, and the objects are larger than normal — 24 MiB, it looks like.</p>
<p>Is the ceph-fuse cache size adjusted at all from normal?</p>
<p>Maybe there’s some flushing action happening from the kernel that (correctly or not) ends up requiring the entire object, but with simultaneous cleanup it can’t keep them in cache long enough to finish?</p>
<p>Updated by Andras Pataki (apataki@simonsfoundation.org), 2023-02-07T14:33:18Z: https://tracker.ceph.com/issues/58640?journal_id=231083</p>
<ul></ul><p>There definitely are config changes that are different from the defaults.<br />All these objects are in a 6+3 erasure coded pool that has 24MiB object sizes (each OSD stores 4MiB chunks as a result). So the 24MiB reads are full object reads.<br />In addition, the ceph-fuse cache sizes are significantly larger than the defaults (4GiB cache, and more read ahead to help with large files on a fast network):</p>
<pre>
# Default: 16384
client_cache_size = 262144
# Default: 104857600 (100MB)
client_oc_max_dirty = 536870912
# Default: 1000
client_oc_max_objects = 8192
# Default: 209715200 (200 MB)
client_oc_size = 4294967296
# Default: 4
client_readahead_max_periods = 64
# Default: 131072 (128KB)
client_readahead_min = 4194304
# Default: true on Luminous, false on Mimic
fuse_big_writes = true
</pre>
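<p>For scale, a back-of-the-envelope estimate from these settings (assuming, as a simplification, that the readahead window per file is client_readahead_max_periods times the 24MiB object size; this is not Ceph's exact readahead logic):</p>

```python
# Rough cache-demand estimate from the config above (a simplification,
# not Ceph's exact readahead accounting).
MiB = 1 << 20
client_oc_size = 4294967296        # 4 GiB, from the config above
readahead_periods = 64             # client_readahead_max_periods
object_size = 24 * MiB             # EC pool object size noted above

window = readahead_periods * object_size   # bytes prefetched per file
files_that_fit = client_oc_size // window  # full windows resident at once
print(window // MiB, files_that_fit)       # 1536 2
```

So even the enlarged 4GiB cache holds only about two full readahead windows of these large-object files at a time.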
<p>I have tried dropping some of these settings back to lower values: client_readahead_max_periods and client_oc_size. Neither of those helped.<br />When I dropped client_oc_size, the number of objecter requests simply decreased proportionately (ceph daemon &lt;xyz.asok&gt; objecter_requests), but the read loop kept going.</p>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-08T10:06:47Z: https://tracker.ceph.com/issues/58640?journal_id=231100</p>
<ul></ul><p>I see trim messages from the object cacher:</p>
<pre>
2023-02-03T14:42:41.793-0500 7fffaeffd700 10 objectcacher trim start: bytes: max 4294967296 clean 7449083904, objects: max 8192 current 13754
2023-02-03T14:42:42.485-0500 7fffacff9700 10 objectcacher trim start: bytes: max 4294967296 clean 5486149632, objects: max 8192 current 10434
2023-02-03T14:42:43.557-0500 7fff88ff9700 10 objectcacher trim start: bytes: max 4294967296 clean 7071596544, objects: max 8192 current 12659
2023-02-03T14:42:44.521-0500 7fff8bfff700 10 objectcacher trim start: bytes: max 4294967296 clean 6366953472, objects: max 8192 current 12031
2023-02-03T14:42:45.573-0500 7fffaf7fe700 10 objectcacher trim start: bytes: max 4294967296 clean 6945767424, objects: max 8192 current 12733
2023-02-03T14:42:46.177-0500 7fff68ffb700 10 objectcacher trim start: bytes: max 4294967296 clean 5360320512, objects: max 8192 current 10506
2023-02-03T14:42:47.257-0500 7fffacff9700 10 objectcacher trim start: bytes: max 4294967296 clean 7096762368, objects: max 8192 current 13197
2023-02-03T14:42:48.841-0500 7fffc5ffb700 10 objectcacher trim start: bytes: max 4294967296 clean 7902068736, objects: max 8192 current 13763
2023-02-03T14:42:50.385-0500 7fff8a7fc700 10 objectcacher trim start: bytes: max 4294967296 clean 8128561152, objects: max 8192 current 14288
2023-02-03T14:42:51.249-0500 7fff3cffb700 10 objectcacher trim start: bytes: max 4294967296 clean 5813305344, objects: max 8192 current 10583
2023-02-03T14:42:52.469-0500 7fff68ffb700 10 objectcacher trim start: bytes: max 4294967296 clean 7423918080, objects: max 8192 current 13632
2023-02-03T14:42:52.813-0500 7fffaffff700 10 objectcacher trim start: bytes: max 4294967296 clean 4303355904, objects: max 8192 current 8317
2023-02-03T14:42:54.469-0500 7fffacff9700 10 objectcacher trim start: bytes: max 4294967296 clean 9386852352, objects: max 8192 current 14965
2023-02-03T14:42:55.177-0500 7fffc5ffb700 10 objectcacher trim start: bytes: max 4294967296 clean 5460983808, objects: max 8192 current 10983
2023-02-03T14:42:56.321-0500 7fff8a7fc700 10 objectcacher trim start: bytes: max 4294967296 clean 7197425664, objects: max 8192 current 13688
...
</pre>
<p>ObjectCacher::trim() is invoked via ObjectCacher::_readx()/_writex(), and buffer heads and objects are trimmed on each invocation, from the bottom/top of their respective LRUs. The condition for that is:</p>
<pre>
while (get_stat_clean() > 0 &&
       ((uint64_t)get_stat_clean() > max_size ||
        nr_clean_bh > max_clean_bh)) {
  BufferHead *bh = static_cast<BufferHead*>(bh_lru_rest.lru_expire());
  if (!bh)
    break;
  ldout(cct, 10) << "trim trimming " << *bh << dendl;
  ceph_assert(bh->is_clean() || bh->is_zero() || bh->is_error());
  Object *ob = bh->ob;
  bh_remove(ob, bh);
  delete bh;
  --nr_clean_bh;
  if (ob->complete) {
    ldout(cct, 10) << "trim clearing complete on " << *ob << dendl;
    ob->complete = false;
  }
}
while (ob_lru.lru_get_size() > max_objects) {
  Object *ob = static_cast<Object*>(ob_lru.lru_expire());
  if (!ob)
    break;
  ldout(cct, 10) << "trim trimming " << *ob << dendl;
  close_object(ob);
}
</pre>
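<p>The first loop amounts to expiring clean buffer heads from the cold end of an LRU until the clean-byte total drops back under the limit. A rough model (illustrative names, not Ceph code) shows why freshly prefetched clean data is the first thing to go when the cache overshoots:</p>

```python
from collections import OrderedDict

def trim(clean_bhs, max_size):
    """Rough model of the first loop in ObjectCacher::trim(): pop clean
    buffer heads from the LRU's cold end while the clean total exceeds
    max_size. clean_bhs maps bh id -> length, oldest entry first."""
    clean = sum(clean_bhs.values())
    evicted = []
    while clean > max_size and clean_bhs:
        bh, length = clean_bhs.popitem(last=False)  # like lru_expire()
        evicted.append(bh)
        clean -= length
    return clean, evicted

# 3 x 24 MiB of clean data against a 48 MiB budget: the oldest goes.
MiB = 1 << 20
bhs = OrderedDict([("bh0", 24 * MiB), ("bh1", 24 * MiB), ("bh2", 24 * MiB)])
clean, evicted = trim(bhs, 48 * MiB)
print(evicted, clean // MiB)  # ['bh0'] 48
```

If the overshoot recurs on every _readx/_writex, prefetched buffers can be trimmed before the retried read ever observes them, matching the repeated trim messages above.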
<p>Is the read call filling up the cache too frequently? (The number of objects and the accumulated buffer-head length <em>always</em> exceed the configured values, though not by a huge margin, IMO.)</p>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-09T04:13:28Z: https://tracker.ceph.com/issues/58640?journal_id=231206</p>
<ul></ul><p>Hi Andras,</p>
<p>Andras Pataki wrote:</p>
<blockquote>
<p>There definitely are config changes that are different from the defaults.<br />All these objects are in a 6+3 erasure coded pool that has 24MiB object sizes (each OSD stores 4MiB chunks as a result). So the 24MiB reads are full object reads.<br />In addition, the ceph-fuse cache sizes are significantly larger than the defaults (4GiB cache, and more read ahead to help with large files on a fast network):</p>
<p>[...]</p>
<p>I have tried dropping some of these settings back to lower values, client_readahead_max_periods and client_oc_size. Neither of those helped.</p>
</blockquote>
<p>I think scaling back client_oc_size is going to make things worse. From what I'm seeing in the 1 minute trace you shared, the accumulated buffer-head bytes exceed the configured client_oc_size. Maybe it's due to the 24MiB object reads - I'm not sure. Also, the number of objects in the cache exceeds client_oc_max_objects on each trim invocation. Is it possible to increase client_oc_max_objects and client_oc_size (which is already 4 GiB) to see if the situation improves?</p>
<p>We do not tinker with objectcacher defaults in teuthology tests for the fs suite. It's possible that such observations get overlooked in tests.</p>
<p>Updated by Andras Pataki (apataki@simonsfoundation.org), 2023-02-09T17:04:11Z: https://tracker.ceph.com/issues/58640?journal_id=231239</p>
<ul></ul><p>I'm experimenting on reproducing the problem on demand. Once I have a way to make this bad looping ceph-fuse behavior happen, I'll try increasing the cache sizes. The nodes this happens on are quite beefy - 128 AMD Rome cores, 1TB of RAM, 25Gbps Ethernet that ceph traffic goes over. So I can definitely increase caches quite a bit to see if it helps.</p>
<p>One other observation from the previous "bad" node: it looks like the whole situation did resolve itself after a couple of days of spinning. So it doesn't look like a completely infinite loop, FWIW.</p>
<p>Updated by Venky Shankar (vshankar@redhat.com), 2023-02-28T13:04:27Z: https://tracker.ceph.com/issues/58640?journal_id=232168</p>
<ul></ul><p>Hey Andras,</p>
<p>Andras Pataki wrote:</p>
<blockquote>
<p>I'm experimenting on reproducing the problem on demand. Once I have a way to make this bad looping ceph-fuse behavior happen, I'll try increasing the cache sizes. The nodes this happens on are quite beefy - 128 AMD Rome cores, 1TB of RAM, 25Gbps Ethernet that ceph traffic goes over. So I can definitely increase caches quite a bit to see if it helps.</p>
</blockquote>
<p>Checking back on whether increasing the cache sizes improved the situation. You did mention that things resolved themselves, but I just wanted to check whether you ran into it again even with the cache sizes increased.</p>
<p>Updated by Andras Pataki (apataki@simonsfoundation.org), 2023-06-27T16:19:09Z: https://tracker.ceph.com/issues/58640?journal_id=241379</p>
<ul></ul><p>Hi Venky,</p>
<p>I'm getting back to investigating the ceph-fuse forever looping problems again. I now have a way to reproduce this problem, even with the default client configs. All I need to do is to do sequential reads of many files in parallel like: dd if=<file-n> of=/dev/null bs=1M.<br />With the default configs (client_oc_size=200M, client_readahead_max_periods=4, file block size 4MB), if I do more than 24 sequential reads of distinct large files in parallel, I can reproduce the problem. The node I'm on is a 128 core Rome server with 25Gbps network. The network gets saturated to above 21Gbps, but the dd clients only see a few MB/s data read, and ceph-fuse is in a repeating cycle reading objects from the OSDs.</p>
<p>What seems to be happening is that there isn't enough cache to hold the objects that ceph-fuse is deciding to prefetch. For 24 streams with readahead of 4 objects, 4MB each, ceph-fuse would need 384MB of cache, but it only has 200MB. So it keeps thrashing its cache, removing prefetched objects and repeatedly trying to prefetch the same objects over and over again. If I increase client_oc_size live, ceph-fuse immediately starts behaving again.</p>
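<p>The arithmetic above can be checked directly (assuming, as a simplification, that each stream wants one full readahead window resident at once):</p>

```python
# Cache demand under parallel sequential reads (simplified model):
# each stream wants client_readahead_max_periods objects resident.
MiB = 1 << 20

def cache_demand(streams, readahead_periods, object_size):
    return streams * readahead_periods * object_size

demand = cache_demand(streams=24, readahead_periods=4, object_size=4 * MiB)
client_oc_size = 200 * MiB
print(demand // MiB, demand > client_oc_size)  # 384 True
```

With demand nearly double the cache size, trim evicts one stream's prefetched objects to make room for another's, and every stream keeps re-fetching.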
<p>In our case, we run with a larger cache than the defaults, but we also use larger prefetch values, and with 128-core servers many more files get read in parallel as well. So while increasing the cache to pretty large values does alleviate the problem, there is no guarantee: if enough files get read in parallel, any size cache can get saturated. I.e., some mechanism is needed to detect this situation and scale back prefetching/evicting objects.</p>
<p>Updated by Patrick Donnelly (pdonnell@redhat.com), 2023-09-20T14:50:24Z: https://tracker.ceph.com/issues/58640?journal_id=246669</p>
<ul><li><strong>Target version</strong> deleted (<del><i>v18.0.0</i></del>)</li></ul>