Bug #64730

fs/misc/multiple_rsync.sh workunit times out

Added by Venky Shankar about 2 months ago. Updated 30 days ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: quincy,reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, ceph-fuse
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2024-03-04_08:26:39-fs-wip-vshankar-testing-20240304.042522-testing-default-smithi/7580882

Description: fs/workload/{0-centos_9.stream begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-comp omap_limit/10 overrides/{cephsqlite-timeout frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/multi/{balancer/automatic export-check n/5 replication/default} standby-replay tasks/{0-subvolume/{with-quota} 1-check-counter 2-scrub/yes 3-snaps/yes 4-flush/no 5-workunit/fs/misc}}

The ceph-fuse client log file is huge (26G uncompressed) - there are lots of entries like the following in and around the time the rsync transfer appears to hang:

2024-03-04T13:19:04.817+0000 7f9f70c00640 10 client.15360 add_update_cap calling signal_caps_inode
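
To get a feel for how dominant these entries are, one option is to count them and then narrow the log to the minutes around the perceived hang. A minimal sketch, assuming the client log has been copied locally as ceph-client.fuse.log (the file name is a placeholder; the timestamps come from the entry above):

# count the cap-signalling entries across the whole 26G log
grep -c 'add_update_cap calling signal_caps_inode' ceph-client.fuse.log

# look only at the minutes around 13:19, when the rsync transfer appears to hang
grep '^2024-03-04T13:1[89]' ceph-client.fuse.log | grep 'signal_caps_inode' | less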


Related issues 2 (2 open, 0 closed)

Related to CephFS - Bug #51197: qa: [WRN] Scrub error on inode 0x10000001520 (/client.0/tmp/t/linux-5.4/Documentation/driver-api) see mds.f log and `damage ls` output for details (Triaged, Venky Shankar)

Related to CephFS - Bug #57676: qa: error during scrub thrashing: rank damage found: {'backtrace'} (Triaged, Kotresh Hiremath Ravishankar)

Actions #1

Updated by Xiubo Li about 2 months ago

It seems the metadata is damaged:

2024-03-04T13:10:00.386 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] :     mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] :     mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] :     mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] :     mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]:     mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]:     mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]:     mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]:     mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]:     mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]:     mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]:     mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]:     mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]:     mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]:     mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.958 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]:     mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.958 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]:     mds.h(mds.1): Metadata damage detected
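
A minimal sketch of how the damage entries could be pulled from each affected rank, using the `damage ls` admin command that the scrub warnings point to (the mds names are taken from the health detail above; this is a suggested way to inspect, not something run against this job):

for m in c i k h; do
    echo "== mds.$m =="
    ceph tell mds.$m damage ls
done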
Actions #2

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

It seems the metadata is damaged:

Right. I saw that in the MDS log but left it out while creating this tracker since there were a lot of "signal_caps_inode" traces.

Actions #3

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo Li wrote:

It seems the metadata is damaged:

Right. I saw that in the MDS log but left it out while creating this tracker since there were a lot of "signal_caps_inode" traces.

Venky,

BTW, why can't I see the client log files? Where are they?

Actions #4

Updated by Xiubo Li about 2 months ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

It seems the metadata is damaged:

Right. I saw that in the MDS log but left it out while creating this tracker since there were a lot of "signal_caps_inode" traces.

Venky,

BTW, why can't I see the client log files? Where are they?

I found it.

The client log file is full of export/import logs.

Actions #5

Updated by Xiubo Li about 2 months ago

Venky,

Have you ever seen the "On-disk backtrace is divergent or newer" error?

2024-03-04T13:05:36.662+0000 7fdd42000700 10 MDSContext::complete: 16C_InodeValidated
2024-03-04T13:05:36.662+0000 7fdd42000700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x5000000210a (/volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/net/fjes/fjes.ko.xz) see mds.c log and `damage ls` output for details
2024-03-04T13:05:36.662+0000 7fdd42000700 -1 mds.4.scrubstack _validate_inode_done scrub error on inode [inode 0x5000000210a [42,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/net/fjes/fjes.ko.xz auth v8 ap=1 s=37796 n(v0 rc2024-03-04T10:43:47.766563+0000 b37796 1=1+0) (ifile excl) (iversion lock) cr={15360=0-4194304@41} caps={15360=pAsLsXsFsxcrwb/-@1},l=15360(-1) | importingcaps=0 caps=1 replicated=0 authpin=1 scrubqueue=0 0x55ac8a21d180]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value":"(3)0x5000000210a:[<0x10000004859/fjes.ko.xz v8>,<0x10000004792/fjes v1416>,<0x10000004757/net v5530>,<0x1000000474f/drivers v1576>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v1033>,<0x1000000365f/modules v7307>,<0x1000000365e/multiple_rsync_payload.196784 v3136>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5746>,<0x10000000003/client.0 v5534>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5454>,<0x10000000001/sv_0 v5393>,<0x10000000000/qa v5396>,<0x1/volumes v5291>]//[]","memoryvalue":"(3)0x5000000210a:[<0x10000004859/fjes.ko.xz v8>,<0x10000004792/fjes v1416>,<0x10000004757/net v5530>,<0x1000000474f/drivers v1574>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v648>,<0x1000000365f/modules v5655>,<0x1000000365e/multiple_rsync_payload.196784 v3103>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5742>,<0x10000000003/client.0 v5526>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5452>,<0x10000000001/sv_0 v5386>,<0x10000000000/qa v5396>,<0x1/volumes v5287>]//[]","error_str":"On-disk backtrace is divergent or newer"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":0}
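
If the on-disk backtrace is merely stale, one hedged way to re-check it (and attempt a repair) is to re-run the scrub with the repair flag from rank 0 over the affected subtree. This is a sketch only, not something verified against this run; <fsname> is a placeholder and the path is taken from the scrub warning above:

# start a recursive scrub with repair from rank 0
ceph tell mds.<fsname>:0 scrub start /volumes/qa/sv_0 recursive,repair
# check progress and whether the damage entries clear
ceph tell mds.<fsname>:0 scrub status
ceph tell mds.<fsname>:0 damage ls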
Actions #6

Updated by Xiubo Li about 2 months ago

Xiubo Li wrote:

Venky,

Have you ever seen the "On-disk backtrace is divergent or newer" error?

[...]

This should be a known issue, the same as https://tracker.ceph.com/issues/51197.

Actions #7

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Venky,

Have you ever seen the "On-disk backtrace is divergent or newer" error?

[...]

Oh, we never RCA'd that :/

Actions #8

Updated by Venky Shankar about 2 months ago

  • Related to Bug #51197: qa: [WRN] Scrub error on inode 0x10000001520 (/client.0/tmp/t/linux-5.4/Documentation/driver-api) see mds.f log and `damage ls` output for details added
Actions #9

Updated by Venky Shankar about 2 months ago

  • Status changed from New to Triaged
Actions #10

Updated by Milind Changire about 2 months ago

  • Assignee set to Xiubo Li
Actions #11

Updated by Venky Shankar about 1 month ago

  • Assignee changed from Xiubo Li to Venky Shankar

I'm already working on this.

Actions #12

Updated by Venky Shankar 30 days ago

  • Related to Bug #57676: qa: error during scrub thrashing: rank damage found: {'backtrace'} added
Actions #13

Updated by Venky Shankar 30 days ago

"balancer/automatic" with multimds and scrub is likely what's causes the divergent backtraces.

That's also the case in https://tracker.ceph.com/issues/57676 (see linked run in https://tracker.ceph.com/issues/57676#note-21)
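
One hedged way to poke at that hypothesis would be to rerun the workload with subtree migration effectively disabled, e.g. by export-pinning the test subvolume to a single rank and/or turning the automatic balancer off. A sketch only; the mount point is a placeholder and I'm assuming the balance_automate fs setting is what the balancer/automatic facet flips on:

# pin the subvolume tree to rank 0 so the balancer stops exporting it mid-scrub
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/volumes/qa/sv_0

# and/or disable the automatic balancer for the filesystem (placeholder fs name)
ceph fs set cephfs balance_automate false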
