Bug #64730
fs/misc/multiple_rsync.sh workunit times out
% Done: 0%
Description
/a/vshankar-2024-03-04_08:26:39-fs-wip-vshankar-testing-20240304.042522-testing-default-smithi/7580882
Description: fs/workload/{0-centos_9.stream begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-comp omap_limit/10 overrides/{cephsqlite-timeout frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/multi/{balancer/automatic export-check n/5 replication/default} standby-replay tasks/{0-subvolume/{with-quota} 1-check-counter 2-scrub/yes 3-snaps/yes 4-flush/no 5-workunit/fs/misc}}
The ceph-fuse client log file is huge (26G uncompressed) - there are many occurrences of the following log entry in and around the time the rsync transfer hang is perceived
2024-03-04T13:19:04.817+0000 7f9f70c00640 10 client.15360 add_update_cap calling signal_caps_inode
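A log that large is easier to characterize before shipping it around. Below is a hypothetical sketch for counting how often the flooding trace appears; the sample log is illustrative (in practice `$log` would point at the real ceph-fuse client log from the teuthology run):

```shell
#!/bin/sh
# Sketch: count occurrences of the flooding client trace in a log file.
# The sample data below is made up for illustration; substitute the real log path.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-03-04T13:19:04.817+0000 7f9f70c00640 10 client.15360 add_update_cap calling signal_caps_inode
2024-03-04T13:19:04.818+0000 7f9f70c00640 10 client.15360 add_update_cap calling signal_caps_inode
2024-03-04T13:19:04.819+0000 7f9f70c00640 10 client.15360 handle_cap_grant on in 0x1
EOF
# grep -c prints the number of matching lines
count=$(grep -c 'add_update_cap calling signal_caps_inode' "$log")
echo "flood entries: $count"
rm -f "$log"
```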
Updated by Xiubo Li about 2 months ago
It seems the metadata is damaged:
2024-03-04T13:10:00.386 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.387 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-0860966a-da0d-11ee-95c6-87774f69a715-mon-a[31807]: 2024-03-04T13:09:59.999+0000 7f4255600700 -1 log_channel(cluster) log [ERR] : mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.886 INFO:journalctl@ceph.mon.a.smithi047.stdout:Mar 04 13:10:00 smithi047 ceph-mon[31811]: mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.940 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.941 INFO:journalctl@ceph.mon.b.smithi088.stdout:Mar 04 13:10:00 smithi088 ceph-mon[35303]: mds.h(mds.1): Metadata damage detected
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: Health detail: HEALTH_ERR 4 MDSs report damaged metadata
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: [ERR] MDS_DAMAGE: 4 MDSs report damaged metadata
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: mds.c(mds.4): Metadata damage detected
2024-03-04T13:10:00.957 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: mds.i(mds.3): Metadata damage detected
2024-03-04T13:10:00.958 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: mds.k(mds.2): Metadata damage detected
2024-03-04T13:10:00.958 INFO:journalctl@ceph.mon.c.smithi138.stdout:Mar 04 13:10:00 smithi138 ceph-mon[36284]: mds.h(mds.1): Metadata damage detected
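For reference, the damage entries behind a HEALTH_ERR like the one above can be inspected per daemon with the standard MDS admin commands. This is only a sketch against a live cluster; the daemon name `mds.c` is taken from the log above, and the exact output format varies by release:

```shell
# Cluster-wide health summary that produced the HEALTH_ERR above
ceph health detail

# Dump the damage table of one of the reporting daemons (mds.c here);
# each entry carries an id, a damage_type (e.g. "backtrace") and the affected inode/path
ceph tell mds.c damage ls
```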
Updated by Venky Shankar about 2 months ago
Xiubo Li wrote:
It seems the metadata is damaged:
Right. I saw that in the mds log but left that out while creating this tracker since the "signal_caps_inode" traces were a lot.
Updated by Xiubo Li about 2 months ago
Venky Shankar wrote:
Xiubo Li wrote:
It seems the metadata is damaged:
Right. I saw that in the mds log but left that out while creating this tracker since the "signal_caps_inode" traces were a lot.
Venky,
BTW, why can't I see the client log files? Where are they?
Updated by Xiubo Li about 2 months ago
Xiubo Li wrote:
Venky Shankar wrote:
Xiubo Li wrote:
It seems the metadata is damaged:
Right. I saw that in the mds log but left that out while creating this tracker since the "signal_caps_inode" traces were a lot.
Venky,
BTW, why can't I see the client log files? Where are they?
I found it.
The client log file is full of export/import logs.
Updated by Xiubo Li about 2 months ago
Venky,
Have you ever seen the "On-disk backtrace is divergent or newer" error?
2024-03-04T13:05:36.662+0000 7fdd42000700 10 MDSContext::complete: 16C_InodeValidated
2024-03-04T13:05:36.662+0000 7fdd42000700 0 log_channel(cluster) log [WRN] : Scrub error on inode 0x5000000210a (/volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/net/fjes/fjes.ko.xz) see mds.c log and `damage ls` output for details
2024-03-04T13:05:36.662+0000 7fdd42000700 -1 mds.4.scrubstack _validate_inode_done scrub error on inode [inode 0x5000000210a [42,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/net/fjes/fjes.ko.xz auth v8 ap=1 s=37796 n(v0 rc2024-03-04T10:43:47.766563+0000 b37796 1=1+0) (ifile excl) (iversion lock) cr={15360=0-4194304@41} caps={15360=pAsLsXsFsxcrwb/-@1},l=15360(-1) | importingcaps=0 caps=1 replicated=0 authpin=1 scrubqueue=0 0x55ac8a21d180]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value":"(3)0x5000000210a:[<0x10000004859/fjes.ko.xz v8>,<0x10000004792/fjes v1416>,<0x10000004757/net v5530>,<0x1000000474f/drivers v1576>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v1033>,<0x1000000365f/modules v7307>,<0x1000000365e/multiple_rsync_payload.196784 v3136>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5746>,<0x10000000003/client.0 v5534>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5454>,<0x10000000001/sv_0 v5393>,<0x10000000000/qa v5396>,<0x1/volumes v5291>]//[]","memoryvalue":"(3)0x5000000210a:[<0x10000004859/fjes.ko.xz v8>,<0x10000004792/fjes v1416>,<0x10000004757/net v5530>,<0x1000000474f/drivers v1574>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v648>,<0x1000000365f/modules v5655>,<0x1000000365e/multiple_rsync_payload.196784 v3103>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5742>,<0x10000000003/client.0 v5526>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5452>,<0x10000000001/sv_0 v5386>,<0x10000000000/qa v5396>,<0x1/volumes v5287>]//[]","error_str":"On-disk backtrace is divergent or newer"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":0}
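For what it's worth, a backtrace scrub error like this can usually be re-checked, and repaired if the on-disk backtrace really is stale, with a recursive repair scrub. This is only a sketch against a live cluster; the filesystem name `cephfs` and rank 0 are assumptions, while `/volumes`, `mds.c` and `damage ls` come from the log above:

```shell
# Re-run a recursive scrub with repair from rank 0 over the affected tree
ceph tell mds.cephfs:0 scrub start /volumes recursive,repair

# Watch scrub progress, then re-check the damage table once it finishes
ceph tell mds.cephfs:0 scrub status
ceph tell mds.c damage ls
```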
Updated by Xiubo Li about 2 months ago
Xiubo Li wrote:
Venky,
Have you ever seen the "On-disk backtrace is divergent or newer" error?
[...]
This should be a known issue, the same as https://tracker.ceph.com/issues/51197.
Updated by Venky Shankar about 2 months ago
Xiubo Li wrote:
Venky,
Have you ever seen the "On-disk backtrace is divergent or newer" error?
[...]
Oh, we never RCA'd that :/
Updated by Venky Shankar about 2 months ago
- Related to Bug #51197: qa: [WRN] Scrub error on inode 0x10000001520 (/client.0/tmp/t/linux-5.4/Documentation/driver-api) see mds.f log and `damage ls` output for details added
Updated by Venky Shankar about 1 month ago
- Assignee changed from Xiubo Li to Venky Shankar
I'm already working on this.
Updated by Venky Shankar 30 days ago
- Related to Bug #57676: qa: error during scrub thrashing: rank damage found: {'backtrace'} added
Updated by Venky Shankar 30 days ago
"balancer/automatic" with multimds and scrub is likely what causes the divergent backtraces.
That's also the case in https://tracker.ceph.com/issues/57676 (see linked run in https://tracker.ceph.com/issues/57676#note-21)