Bug #58640

ceph-fuse in infinite loop reading objects without client requests

Added by Andras Pataki about 1 year ago. Updated 6 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): ceph-fuse
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've been running into a strange issue with ceph-fuse on some nodes lately. After some job runs on the node (and finishes or gets killed), ceph-fuse gets stuck busily requesting objects from the OSDs without any process on the node using cephfs. When this happens, ceph-fuse uses 2-3 cores, spinning in what seems like an infinite loop making objecter requests to various OSDs for files that were perhaps requested by some process that is long gone. The same object gets requested every few seconds, cycling through a list of objects. This comes close to saturating the node's 25Gbps NIC. I have a one minute log file with debug_objecter/objectcacher/client/context/finisher set to 20 that traces this, but I don't see exactly why this is happening. Can someone with a better understanding of the client->cache->OSD flow share some insights into what is going wrong here?

Here is an example:

2023-02-03T14:42:47.593-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42777727 osd.4349
2023-02-03T14:43:00.056-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42778921 osd.4349
2023-02-03T14:43:02.048-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779104 osd.4349
2023-02-03T14:43:05.392-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779408 osd.4349
2023-02-03T14:43:10.076-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779847 osd.4349
2023-02-03T14:43:13.288-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42780156 osd.4349
2023-02-03T14:43:18.908-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42780697 osd.4349
2023-02-03T14:43:29.056-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42781660 osd.4349
2023-02-03T14:43:33.707-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42782079 osd.4349
2023-02-03T14:42:41.609-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42777194 osd.3251
2023-02-03T14:42:49.809-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42777954 osd.3251
2023-02-03T14:43:07.884-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42779646 osd.3251
2023-02-03T14:43:16.736-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42780500 osd.3251
2023-02-03T14:43:22.160-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42781009 osd.3251
2023-02-03T14:43:31.603-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42781892 osd.3251
2023-02-03T14:43:35.503-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42782258 osd.3251

Taking a specific object as an example: 100dbad3fce.000000b3

2023-02-03T14:42:46.293-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [729808896,25165824])
2023-02-03T14:42:46.293-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:46.293-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb0f461a0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {}
... a few times the above ...
... then an OSD read ...
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {}
2023-02-03T14:42:48.557-0500 7fffdbfff700  7 objectcacher bh_read on bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {} outstanding reads 170
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _op_submit op 0x7fffb12de170
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _calc_target epoch 2609365 base 100dbad3fce.000000b3 @9 precalc_pgid 0 pgid 0.0 is_read
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _calc_target target 100dbad3fce.000000b3 @9 -> pgid 9.502eec40
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _calc_target  raw pgid 9.502eec40 -> actual 9.6c40s0 acting [975,3712,536,3462,2276,1365,2726,1648,4053] primary 975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _get_session s=0x7fff9824d360 osd=975 3
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 100dbad3fce.000000b3 '@9' '@9' [read 0~25165824] tid 42777830 osd.975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter get_session s=0x7fff9824d360 osd=975 3
2023-02-03T14:42:48.557-0500 7fffdbfff700 15 client.151672236.objecter _session_op_assign 975 42777830
2023-02-03T14:42:48.557-0500 7fffdbfff700 15 client.151672236.objecter _send_op 42777830 to 9.6c40s0 on osd.975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 4
2023-02-03T14:42:48.557-0500 7fffdbfff700  5 client.151672236.objecter 170 in flight
...
2023-02-03T14:42:50.045-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [729808896,25165824])
2023-02-03T14:42:50.045-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:50.045-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read rx bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 rx] waiters = { 0->[0x7fffb0798310, ]}
... read reply ...
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter ms_dispatch 0x55555e305810 osd_op_reply(42777830 100dbad3fce.000000b3 [read 0~25165824 out=25165824b] v0'0 uv471599 ondisk = 0) v8
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter in handle_osd_op_reply
2023-02-03T14:42:51.653-0500 7fffeb7fe700  7 client.151672236.objecter handle_osd_op_reply 42777830 ondisk uv 471599 in 9.6c40 attempt 0
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter  op 0 rval 0 len 25165824
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter handle_osd_op_reply completed tid 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter _finish_op 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 5
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter _session_op_remove 975 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700  5 client.151672236.objecter 169 in flight
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 finisher(0x55555e43bf90) finisher_thread doing [0x7fffb0f2ad60,0]
2023-02-03T14:42:51.657-0500 7fffdbfff700  7 objectcacher bh_read_finish 100dbad3fce.000000b3/head tid 42770553 0~25165824 (bl is 25165824) returned 0 outstanding reads 170
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher checking bh bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 rx] waiters = { 0->[0x7fffb0798310, ]}
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher bh_read_finish read bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) try_merge_bh bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher finishing waiters 0x7fffb0798310
...
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read hit bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
... a few times the above ...
2023-02-03T14:42:55.085-0500 7fffacff9700 10 objectcacher trim trimming object[100dbad3fce.000000b3/head oset 0x7fffd2895870 wr 0/0]
2023-02-03T14:42:55.085-0500 7fffacff9700 10 objectcacher close_object object[100dbad3fce.000000b3/head oset 0x7fffd2895870 wr 0/0]
...
2023-02-03T14:42:55.293-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:55.293-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:55.293-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb1c36480 0~25165824 0x7fffb0dfefa0 (0) v 0 missing] waiters = {}
... some more times ...
... then eventually another OSD read for the same object - and the cycle repeats ...

The client is ceph-fuse Pacific 16.2.10; the backend cluster is Octopus 15.2.7. The cluster is healthy during this time: no down OSDs, recoveries, etc.
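
For reference, the debug levels used for the trace above can be raised at runtime through the ceph-fuse admin socket, and the mount can be checked for lingering users; a minimal sketch, assuming a hypothetical asok path and mount point:

# hypothetical admin socket path and mount point - adjust to the actual client instance
ASOK=/var/run/ceph/ceph-client.admin.asok
fuser -vm /mnt/cephfs                                # confirm no process still has the mount open
for sub in objecter objectcacher client context finisher; do
    ceph daemon "$ASOK" config set debug_$sub 20     # same subsystems/level as the trace above
done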

History

#1 Updated by Andras Pataki about 1 year ago

I've uploaded the full 1 minute ceph-fuse trace as:
ceph-post-file: d56ebc47-4ef7-4f01-952f-8569c8c92982

#2 Updated by Venky Shankar about 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Venky Shankar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) ceph-fuse added

#3 Updated by Venky Shankar about 1 year ago

Andras Pataki wrote:

I've uploaded the full 1 minute ceph-fuse trace as:
ceph-post-file: d56ebc47-4ef7-4f01-952f-8569c8c92982

Thanks for the logs. I'll have a look.

Quick question, Andras - any configuration changes applied to the cluster (esp. around object default sizes and the likes)?

#4 Updated by Venky Shankar about 1 year ago

There seem to be a lot of cache misses for objects in the ObjectCacher. The retry is coming from:

      for (map<loff_t, BufferHead*>::iterator bh_it = missing.begin();
           bh_it != missing.end();
           ++bh_it) {
        uint64_t rx_bytes = static_cast<uint64_t>(
          stat_rx + bh_it->second->length());
        bytes_not_in_cache += bh_it->second->length();
        if (!waitfor_read.empty() || (stat_rx > 0 && rx_bytes > max_size)) {
          // cache is full with concurrent reads -- wait for rx's to complete
          // to constrain memory growth (especially during copy-ups)
          if (success) {
            ldout(cct, 10) << "readx missed, waiting on cache to complete " 
                           << waitfor_read.size() << " blocked reads, " 
                           << (std::max(rx_bytes, max_size) - max_size)
                           << " read bytes" << dendl;
            waitfor_read.push_back(new C_RetryRead(this, rd, oset, onfinish,
                                                   *trace));
          }

          bh_remove(o, bh_it->second);
          delete bh_it->second;
        } else {
          bh_it->second->set_nocache(nocache);
          bh_read(bh_it->second, rd->fadvise_flags, *trace);
          if ((success && onfinish) || last != missing.end())
            last = bh_it;
        }
        success = false;
      }

If I'm reading this (and other bits in `ObjectCacher::_readx`) right, for a set of extents there are a lot of cache misses over and over again, causing the ObjectCacher to retry. E.g. for objects 100dbad3fce.*, the only osd_op_reply is for object 100dbad3fce.000000b3:

2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter ms_dispatch 0x55555e305810 osd_op_reply(42776829 100dbad3fce.000000b3 [read 0~25165824 out=25165824b] v0'0 uv471599 ondisk = 0) v8
2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter in handle_osd_op_reply
2023-02-03T14:42:41.153-0500 7fffeb7fe700  7 client.151672236.objecter handle_osd_op_reply 42776829 ondisk uv 471599 in 9.6c40 attempt 0
2023-02-03T14:42:41.153-0500 7fffeb7fe700 10 client.151672236.objecter  op 0 rval 0 len 25165824
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter handle_osd_op_reply completed tid 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter _finish_op 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 4
2023-02-03T14:42:41.153-0500 7fffeb7fe700 15 client.151672236.objecter _session_op_remove 975 42776829
2023-02-03T14:42:41.153-0500 7fffeb7fe700  5 client.151672236.objecter 169 in flight
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 finisher(0x55555e43bf90) finisher_thread doing [0x7fffb0abe4f0,0]
2023-02-03T14:42:41.153-0500 7fffdbfff700  7 objectcacher bh_read_finish 100dbad3fce.000000b3/head tid 42769552 0~25165824 (bl is 25165824) returned 0 outstanding reads 170
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher checking bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (0) v 0 rx] waiters = { 0->[0x7fffb1bd9c60, ]}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher bh_read_finish read bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) try_merge_bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher finishing waiters 0x7fffb1bd9c60
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 1 contexts to finish with 0

And when `_readx` iterates over the extents, the only hit it gets is for 100dbad3fce.000000b3

2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b1 (177) in @9 0~25165824 -> [0,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b1/head) map_read 100dbad3fce.000000b1 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b1/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb0db2b40 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx missed, waiting on cache to complete 147 blocked reads, 0 read bytes
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b2 (178) in @9 0~25165824 -> [25165824,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b2/head) map_read 100dbad3fce.000000b2 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b2/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb1229ae0 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read hit bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx hit bh bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx rmap opos 0: bh[ 0x7fffb0419390 0~25165824 0x7fffb0ab14f0 (25165824) v 0 clean firstbyte=5] waiters = {} +0 frag 50331648~25165824 +0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b4 (180) in @9 0~25165824 -> [75497472,25165824])
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b4/head) map_read 100dbad3fce.000000b4 0~25165824
2023-02-03T14:42:41.153-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b4/head) map_read miss 25165824 left, bh[ 0x7fffb0b84e20 0~25165824 0x7fffb1168830 (0) v 0 missing] waiters = {}
2023-02-03T14:42:41.153-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b5 (181) in @9 0~25165824 -> [100663296,25165824])
<misses>
<misses>
<misses>

Eventually, for the cache misses, an osd_op is sent which does get back an actual osd_op_reply and a subsequent cache hit (I verified this for some objects). It doesn't look like the ObjectCacher is running into an infinite loop, but there seem to be cache misses for the same set of objects over and over. E.g. the cache hits for object prefix 100dbad3fce are just a handful:

grep "map_read hit" ceph-fuse.log | grep 100dbad3fce.000000 | awk -F'(' '{ print $2 }' | awk -F')' '{ print $1 }' | sort -u
100dbad3fce.0000002b/head
100dbad3fce.0000002c/head
100dbad3fce.0000002d/head
100dbad3fce.00000056/head
100dbad3fce.00000057/head
100dbad3fce.00000058/head
100dbad3fce.00000059/head
100dbad3fce.00000096/head
100dbad3fce.00000097/head
100dbad3fce.00000098/head
100dbad3fce.000000b1/head
100dbad3fce.000000b2/head
100dbad3fce.000000b3/head
100dbad3fce.000000dc/head
100dbad3fce.000000dd/head
100dbad3fce.000000de/head
100dbad3fce.000000df/head
100dbad3fce.000000f2/head
100dbad3fce.000000f3/head
100dbad3fce.000000f4/head
100dbad3fce.000000f5/head
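
A similar pipeline (a sketch, assuming the same log format as above) counts how often each of these objects missed, for contrast with the handful of hits:

grep "map_read miss" ceph-fuse.log | grep 100dbad3fce.000000 | awk -F'(' '{ print $2 }' | awk -F')' '{ print $1 }' | sort | uniq -c | sort -rn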

#5 Updated by Greg Farnum about 1 year ago

It’s also interesting that these appear to all be full-object reads, and the objects are larger than normal — 24 MiB, it looks like.

Is the ceph-fuse cache size adjusted at all from normal?

Maybe there’s some flushing action happening from the kernel that (correctly or not) ends up requiring the entire object, but with simultaneous cleanup it can’t keep them in cache long enough to finish?
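
One way to confirm the object size and the client cache settings in question (a sketch, with a hypothetical mount point, file path, and asok path):

# ceph.file.layout is a virtual xattr exposed by CephFS clients
getfattr -n ceph.file.layout /mnt/cephfs/path/to/large-file
# cache-related client settings currently in effect on the fuse client
ceph daemon /var/run/ceph/ceph-client.admin.asok config show | grep -E 'client_oc|client_readahead'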

#6 Updated by Andras Pataki about 1 year ago

There definitely are config changes that are different from the defaults.
All these objects are in a 6+3 erasure-coded pool that has a 24MiB object size (each OSD stores 4MiB chunks as a result), so the 24MiB reads are full-object reads.
In addition, the ceph-fuse cache sizes are significantly larger than the defaults (a 4GiB cache, and more readahead to help with large files on a fast network):

# Default:      16384
client_cache_size = 262144

# Default:      104857600 (100MB)
client_oc_max_dirty = 536870912

# Default:      1000
client_oc_max_objects = 8192

# Default:      209715200 (200 MB)
client_oc_size = 4294967296

# Default:      4
client_readahead_max_periods = 64

# Default:      131072 (128KB)
client_readahead_min = 4194304

# Default:      true on Luminous, false on Mimic
fuse_big_writes = true

I have tried dropping some of these settings back to lower values (client_readahead_max_periods and client_oc_size); neither helped.
When I dropped client_oc_size, the number of in-flight objecter requests simply decreased proportionately (as seen via ceph daemon <xyz.asok> objecter_requests), but the read loop kept going.
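
For watching that request count while the loop is running, something like the following works (a sketch; the asok path is hypothetical and jq is assumed to be available):

ASOK=/var/run/ceph/ceph-client.admin.asok   # hypothetical path
# show the number of in-flight objecter ops every 2 seconds
watch -n 2 "ceph daemon $ASOK objecter_requests | jq '.ops | length'"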

#7 Updated by Venky Shankar about 1 year ago

I see trim messages from the object cacher:

2023-02-03T14:42:41.793-0500 7fffaeffd700 10 objectcacher trim  start: bytes: max 4294967296  clean 7449083904, objects: max 8192 current 13754
2023-02-03T14:42:42.485-0500 7fffacff9700 10 objectcacher trim  start: bytes: max 4294967296  clean 5486149632, objects: max 8192 current 10434
2023-02-03T14:42:43.557-0500 7fff88ff9700 10 objectcacher trim  start: bytes: max 4294967296  clean 7071596544, objects: max 8192 current 12659
2023-02-03T14:42:44.521-0500 7fff8bfff700 10 objectcacher trim  start: bytes: max 4294967296  clean 6366953472, objects: max 8192 current 12031
2023-02-03T14:42:45.573-0500 7fffaf7fe700 10 objectcacher trim  start: bytes: max 4294967296  clean 6945767424, objects: max 8192 current 12733
2023-02-03T14:42:46.177-0500 7fff68ffb700 10 objectcacher trim  start: bytes: max 4294967296  clean 5360320512, objects: max 8192 current 10506
2023-02-03T14:42:47.257-0500 7fffacff9700 10 objectcacher trim  start: bytes: max 4294967296  clean 7096762368, objects: max 8192 current 13197
2023-02-03T14:42:48.841-0500 7fffc5ffb700 10 objectcacher trim  start: bytes: max 4294967296  clean 7902068736, objects: max 8192 current 13763
2023-02-03T14:42:50.385-0500 7fff8a7fc700 10 objectcacher trim  start: bytes: max 4294967296  clean 8128561152, objects: max 8192 current 14288
2023-02-03T14:42:51.249-0500 7fff3cffb700 10 objectcacher trim  start: bytes: max 4294967296  clean 5813305344, objects: max 8192 current 10583
2023-02-03T14:42:52.469-0500 7fff68ffb700 10 objectcacher trim  start: bytes: max 4294967296  clean 7423918080, objects: max 8192 current 13632
2023-02-03T14:42:52.813-0500 7fffaffff700 10 objectcacher trim  start: bytes: max 4294967296  clean 4303355904, objects: max 8192 current 8317
2023-02-03T14:42:54.469-0500 7fffacff9700 10 objectcacher trim  start: bytes: max 4294967296  clean 9386852352, objects: max 8192 current 14965
2023-02-03T14:42:55.177-0500 7fffc5ffb700 10 objectcacher trim  start: bytes: max 4294967296  clean 5460983808, objects: max 8192 current 10983
2023-02-03T14:42:56.321-0500 7fff8a7fc700 10 objectcacher trim  start: bytes: max 4294967296  clean 7197425664, objects: max 8192 current 13688
...
...
...

ObjectCacher::trim() is invoked via ObjectCacher::_readx()/_writex(), and the buffer heads and objects are trimmed on each invocation, from the bottom/top of their respective LRUs. The condition for that is:

  while (get_stat_clean() > 0 &&
         ((uint64_t)get_stat_clean() > max_size ||
          nr_clean_bh > max_clean_bh)) {
    BufferHead *bh = static_cast<BufferHead*>(bh_lru_rest.lru_expire());
    if (!bh)
      break;

    ldout(cct, 10) << "trim trimming " << *bh << dendl;
    ceph_assert(bh->is_clean() || bh->is_zero() || bh->is_error());

    Object *ob = bh->ob;
    bh_remove(ob, bh);
    delete bh;

    --nr_clean_bh;

    if (ob->complete) {
      ldout(cct, 10) << "trim clearing complete on " << *ob << dendl;
      ob->complete = false;
    }
  }

  while (ob_lru.lru_get_size() > max_objects) {
    Object *ob = static_cast<Object*>(ob_lru.lru_expire());
    if (!ob)
      break;

    ldout(cct, 10) << "trim trimming " << *ob << dendl;
    close_object(ob);
  }

Is the read call filling up the cache too frequently? (The number of objects and the accumulated buffer head bytes always exceed the configured values, though not by a huge margin IMO.)
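
To quantify that overshoot from the log, something along these lines can be used (a sketch, assuming the trim line layout shown above):

awk '$4 == "objectcacher" && $5 == "trim" && $6 == "start:" {
    gsub(",", "", $11)
    printf "%s clean %.2f GiB (max %.2f GiB), objects %s (max %s)\n",
           $1, $11 / 2^30, $9 / 2^30, $16, $14
}' ceph-fuse.log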

#8 Updated by Venky Shankar about 1 year ago

Hi Andras,

Andras Pataki wrote:

There definitely are config changes that are different from the defaults.
All these objects are in a 6+3 erasure coded pool that has 24MiB object sizes (each OSD stores 4MiB chunks as a result). So the 24MiB reads are full object reads.
In addition, the ceph-fuse cache sizes are significantly larger than the defaults (4GiB cache, and more read ahead to help with large files on a fast network):

[...]

I have tried dropping some of these settings back to lower values, client_readahead_max_periods and client_oc_size. Neither of those helped.

I think scaling back client_oc_size is going to make things worse. From what I'm seeing in the 1 minute trace you shared, the accumulated buffer head bytes exceed the configured client_oc_size. Maybe it's due to the 24MiB object reads - I'm not sure. Also, the number of objects in the cache exceeds client_oc_max_objects on each trim invocation. Is it possible to increase client_oc_max_objects and client_oc_size (which is already 4 GiB) to see if the situation improves?

We do not tinker with ObjectCacher defaults in teuthology tests for the fs suite. It's possible that such observations get overlooked in tests.
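
If it helps, both options can be bumped at runtime via the client's admin socket rather than remounting; a sketch with a hypothetical asok path and example values:

ASOK=/var/run/ceph/ceph-client.admin.asok   # hypothetical path
ceph daemon "$ASOK" config set client_oc_size 8589934592        # example: 8 GiB
ceph daemon "$ASOK" config set client_oc_max_objects 16384      # example value
ceph daemon "$ASOK" config get client_oc_size                   # verify it took effect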

#9 Updated by Andras Pataki about 1 year ago

I'm experimenting with reproducing the problem on demand. Once I have a way to make this bad looping ceph-fuse behavior happen, I'll try increasing the cache sizes. The nodes this happens on are quite beefy: 128 AMD Rome cores, 1TB of RAM, and 25Gbps Ethernet that the Ceph traffic goes over. So I can definitely increase the caches quite a bit to see if it helps.

One other observation from the previous "bad" node: it looks like the whole situation did resolve itself after a couple of days of spinning. So it doesn't look like a completely infinite loop, FWIW.

#10 Updated by Venky Shankar about 1 year ago

Hey Andras,

Andras Pataki wrote:

I'm experimenting on reproducing the problem on demand. Once I have a way to make this bad looping ceph-fuse behavior happen, I'll try increasing the cache sizes. The nodes this happens on are quite beefy - 128 AMD Rome cores, 1TB of RAM, 25Gbps Ethernet that ceph traffic goes over. So I can definitely increase caches quite a bit to see if it helps.

Checking back on whether increasing the cache sizes improved the situation. You did mention that things resolved themselves, but I just wanted to check whether you ran into it again even after the cache sizes were increased.

#11 Updated by Andras Pataki 9 months ago

Hi Venky,

I'm getting back to investigating the ceph-fuse forever-looping problem. I now have a way to reproduce it, even with the default client configs. All I need to do is run sequential reads of many files in parallel, like: dd if=<file-n> of=/dev/null bs=1M.
With the default configs (client_oc_size=200M, client_readahead_max_periods=4, file block size 4MB), if I do more than 24 sequential reads of distinct large files in parallel, I can reproduce the problem. The node I'm on is a 128-core Rome server with a 25Gbps network. The network gets saturated to above 21Gbps, but the dd clients only see a few MB/s of data read, and ceph-fuse is stuck in a repeating cycle reading objects from the OSDs.
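
A sketch of that reproduction pattern (hypothetical mount point and file names; assumes more than 24 large files are available):

# start 32 parallel sequential readers, as described above
for i in $(seq 1 32); do
    dd if=/mnt/cephfs/testdir/file-$i of=/dev/null bs=1M &
done
wait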

What seems to be happening is that there isn't enough cache to hold the objects that ceph-fuse decides to prefetch. For 24 streams with a readahead of 4 objects of 4MB each, ceph-fuse would need 384MB of cache, but it only has 200MB. So it keeps thrashing its cache, removing prefetched objects and trying to prefetch the same objects over and over again. If I increase client_oc_size live, ceph-fuse immediately starts behaving again.
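
The same arithmetic as a quick check (a sketch reproducing the numbers above):

awk 'BEGIN {
    streams = 24; readahead = 4; object_mb = 4; cache_mb = 200
    need_mb = streams * readahead * object_mb          # 384 MB of prefetch demand
    printf "prefetch demand %d MB vs cache %d MB -> %s\n",
           need_mb, cache_mb, (need_mb > cache_mb ? "cache thrashes" : "fits")
}'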

In our case, we run with a larger cache than the defaults, but we also use larger prefetch values, and with 128-core servers many more files get read in parallel as well. So while increasing the cache to pretty large values does alleviate the problem, there is no guarantee: if enough files get read in parallel, a cache of any size can get saturated. I.e. some mechanism is needed to detect this situation and scale back prefetching/eviction of objects.

#12 Updated by Patrick Donnelly 6 months ago

  • Target version deleted (v18.0.0)
