Bug #58640

ceph-fuse in infinite loop reading objects without client requests

Added by Andras Pataki over 1 year ago. Updated 8 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): ceph-fuse
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've been running into a strange issue with ceph-fuse on some nodes lately. After some job runs on the node (and finishes or gets killed), ceph-fuse gets stuck busily requesting objects from the OSDs even though no process on the node is using cephfs. When this happens, ceph-fuse uses 2-3 cores, spinning in what looks like an infinite loop making objecter requests to various OSDs for objects of files that were perhaps opened by some process that is long gone. The same objects get requested every few seconds, cycling through a list of them. This comes close to saturating the node's 25 Gbps NIC. I have a one-minute log file with debug_objecter/objectcacher/client/context/finisher set to 20 that traces this, but I don't see exactly why it is happening. Can someone with a better understanding of the client->cache->OSD flow share some insight into what is going wrong here?

Here is an example:

2023-02-03T14:42:47.593-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42777727 osd.4349
2023-02-03T14:43:00.056-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42778921 osd.4349
2023-02-03T14:43:02.048-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779104 osd.4349
2023-02-03T14:43:05.392-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779408 osd.4349
2023-02-03T14:43:10.076-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42779847 osd.4349
2023-02-03T14:43:13.288-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42780156 osd.4349
2023-02-03T14:43:18.908-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42780697 osd.4349
2023-02-03T14:43:29.056-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42781660 osd.4349
2023-02-03T14:43:33.707-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002b '@9' '@9' [read 0~25165824] tid 42782079 osd.4349
2023-02-03T14:42:41.609-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42777194 osd.3251
2023-02-03T14:42:49.809-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42777954 osd.3251
2023-02-03T14:43:07.884-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42779646 osd.3251
2023-02-03T14:43:16.736-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42780500 osd.3251
2023-02-03T14:43:22.160-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42781009 osd.3251
2023-02-03T14:43:31.603-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42781892 osd.3251
2023-02-03T14:43:35.503-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 10097d5e07f.0000002c '@9' '@9' [read 0~25165824] tid 42782258 osd.3251
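
As a rough back-of-the-envelope check on the traffic (assuming most tids in this window are these full 0~25165824 reads):

tid 42777194 (14:42:41.609) -> tid 42782258 (14:43:35.503): ~5064 ops in ~54 s, i.e. ~94 ops/s
94 ops/s * 25165824 bytes ~= 2.4 GB/s ~= 19 Gbit/s

which is consistent with the 25 Gbps NIC running close to saturation.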

Taking a specific object as an example: 100dbad3fce.000000b3

2023-02-03T14:42:46.293-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [729808896,25165824])
2023-02-03T14:42:46.293-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:46.293-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb0f461a0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {}
... a few times the above ...
... then an OSD read ...
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {}
2023-02-03T14:42:48.557-0500 7fffdbfff700  7 objectcacher bh_read on bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 missing] waiters = {} outstanding reads 170
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _op_submit op 0x7fffb12de170
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _calc_target epoch 2609365 base 100dbad3fce.000000b3 @9 precalc_pgid 0 pgid 0.0 is_read
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _calc_target target 100dbad3fce.000000b3 @9 -> pgid 9.502eec40
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _calc_target  raw pgid 9.502eec40 -> actual 9.6c40s0 acting [975,3712,536,3462,2276,1365,2726,1648,4053] primary 975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter _get_session s=0x7fff9824d360 osd=975 3
2023-02-03T14:42:48.557-0500 7fffdbfff700 10 client.151672236.objecter _op_submit oid 100dbad3fce.000000b3 '@9' '@9' [read 0~25165824] tid 42777830 osd.975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter get_session s=0x7fff9824d360 osd=975 3
2023-02-03T14:42:48.557-0500 7fffdbfff700 15 client.151672236.objecter _session_op_assign 975 42777830
2023-02-03T14:42:48.557-0500 7fffdbfff700 15 client.151672236.objecter _send_op 42777830 to 9.6c40s0 on osd.975
2023-02-03T14:42:48.557-0500 7fffdbfff700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 4
2023-02-03T14:42:48.557-0500 7fffdbfff700  5 client.151672236.objecter 170 in flight
...
2023-02-03T14:42:50.045-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [729808896,25165824])
2023-02-03T14:42:50.045-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:50.045-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read rx bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 rx] waiters = { 0->[0x7fffb0798310, ]}
... read reply ...
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter ms_dispatch 0x55555e305810 osd_op_reply(42777830 100dbad3fce.000000b3 [read 0~25165824 out=25165824b] v0'0 uv471599 ondisk = 0) v8
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter in handle_osd_op_reply
2023-02-03T14:42:51.653-0500 7fffeb7fe700  7 client.151672236.objecter handle_osd_op_reply 42777830 ondisk uv 471599 in 9.6c40 attempt 0
2023-02-03T14:42:51.653-0500 7fffeb7fe700 10 client.151672236.objecter  op 0 rval 0 len 25165824
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter handle_osd_op_reply completed tid 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter _finish_op 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700 20 client.151672236.objecter put_session s=0x7fff9824d360 osd=975 5
2023-02-03T14:42:51.653-0500 7fffeb7fe700 15 client.151672236.objecter _session_op_remove 975 42777830
2023-02-03T14:42:51.653-0500 7fffeb7fe700  5 client.151672236.objecter 169 in flight
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 finisher(0x55555e43bf90) finisher_thread doing [0x7fffb0f2ad60,0]
2023-02-03T14:42:51.657-0500 7fffdbfff700  7 objectcacher bh_read_finish 100dbad3fce.000000b3/head tid 42770553 0~25165824 (bl is 25165824) returned 0 outstanding reads 170
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher checking bh bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (0) v 0 rx] waiters = { 0->[0x7fffb0798310, ]}
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher bh_read_finish read bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) try_merge_bh bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher finishing waiters 0x7fffb0798310
...
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:51.657-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:51.657-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read hit bh[ 0x7fffb123acd0 0~25165824 0x7fffb0e4b720 (25165824) v 0 clean firstbyte=5] waiters = {}
... a few times the above ...
2023-02-03T14:42:55.085-0500 7fffacff9700 10 objectcacher trim trimming object[100dbad3fce.000000b3/head oset 0x7fffd2895870 wr 0/0]
2023-02-03T14:42:55.085-0500 7fffacff9700 10 objectcacher close_object object[100dbad3fce.000000b3/head oset 0x7fffd2895870 wr 0/0]
...
2023-02-03T14:42:55.293-0500 7fffdbfff700 10 objectcacher readx extent(100dbad3fce.000000b3 (179) in @9 0~25165824 -> [50331648,25165824])
2023-02-03T14:42:55.293-0500 7fffdbfff700 10 objectcacher.object(100dbad3fce.000000b3/head) map_read 100dbad3fce.000000b3 0~25165824
2023-02-03T14:42:55.293-0500 7fffdbfff700 20 objectcacher.object(100dbad3fce.000000b3/head) map_read miss 25165824 left, bh[ 0x7fffb1c36480 0~25165824 0x7fffb0dfefa0 (0) v 0 missing] waiters = {}
... some more times ...
... then eventually another OSD read for the same object - and the cycle repeats ...
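
To illustrate the shape of the cycle I think I'm seeing above (readx miss -> bh_read -> OSD reply -> bh_read_finish -> trim/close_object -> miss again), here is a small stand-alone model. It is my own sketch under the assumption that the cache cannot keep the whole working set of the retried read resident; it is not the actual ObjectCacher code. The 24 MiB request size comes from the log; the object names and the cache limit are made up:

// Stand-alone model of the retry loop (my own sketch, not the actual
// ObjectCacher code).  Each "readx" maps a full-object extent; a miss
// triggers a bh_read (the 24 MiB OSD read), the reply fills the buffer,
// trim/close_object may evict it again, and a C_RetryRead-style waiter
// re-runs readx.  If the cache cannot keep the whole working set
// resident, the retries keep missing and re-reading forever.

#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

constexpr uint64_t READ_SIZE = 25165824;        // [read 0~25165824], as in the log
constexpr uint64_t CACHE_MAX = 2 * READ_SIZE;   // hypothetical: room for only two buffers

struct Cache {
  std::deque<std::string> cached;               // resident buffers, oldest first

  bool hit(const std::string& oid) const {
    return std::find(cached.begin(), cached.end(), oid) != cached.end();
  }

  void fill(const std::string& oid) {           // bh_read_finish: reply populates the buffer
    cached.push_back(oid);
    while (cached.size() * READ_SIZE > CACHE_MAX) {
      std::cout << "  trim: close_object " << cached.front() << "\n";
      cached.pop_front();                       // trim/close_object evicts the oldest buffer
    }
  }
};

int main() {
  // A read whose extents span more objects than the cache can hold at once.
  std::deque<std::string> pending = {"obj.00", "obj.01", "obj.02", "obj.03"};
  Cache cache;

  for (int round = 0; round < 20 && !pending.empty(); ++round) {
    std::string oid = pending.front();
    pending.pop_front();

    if (cache.hit(oid)) {                       // map_read hit: this extent is done
      std::cout << "readx: map_read hit  " << oid << "\n";
      continue;
    }

    std::cout << "readx: map_read miss " << oid << " -> bh_read 0~" << READ_SIZE << "\n";
    cache.fill(oid);                            // OSD reply arrives, may be trimmed right away
    pending.push_back(oid);                     // waiter retries readx for this extent
  }

  if (pending.empty())
    std::cout << "request completed\n";
  else
    std::cout << "still looping: every retry re-reads 24 MiB that was already fetched and trimmed\n";
}

In a model like this the retries never converge: each pass re-fetches buffers that trim already evicted, which would look very much like the per-object pattern above.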

The client is ceph-fuse Pacific 16.2.10; the back-end Ceph cluster is Octopus 15.2.7. The cluster is healthy during this time: no down OSDs, no recoveries, etc.
