Bug #64607
open
ceph: fstest generic/580 test failure with infinite loop
Added by Xiubo Li 3 months ago.
Updated 2 months ago.
Description
This was reported by Luís; please see https://patchwork.kernel.org/project/ceph-devel/patch/20240125023920.1287555-4-xiubli@redhat.com/.
I'm seeing an issue with fstest generic/580, which seems to enter an
infinite loop, effectively rendering the testing VM unusable. It's pretty
easy to reproduce: just run the test while making sure msgv2 is in use (I'm
mounting the filesystem with 'ms_mode=crc'), and you should see the
following in the logs:
[...]
libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
libceph: osd1 (2)192.168.155.1:6810 read processing error
libceph: mon0 (2)192.168.155.1:40608 session established
libceph: bad late_status 0x1
libceph: osd1 (2)192.168.155.1:6810 protocol error, bad epilogue
libceph: mon0 (2)192.168.155.1:40608 session established
libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
libceph: osd1 (2)192.168.155.1:6810 read processing error
libceph: mon0 (2)192.168.155.1:40608 session established
libceph: bad late_status 0x1
[...]
Reverting this patch (commit 8e46a2d068c9 ("libceph: just wait for more
data to be available on the socket")) seems to fix it. I haven't
investigated it further, but since it'll take me some time to refresh my
memory, I thought I should report it immediately. Maybe someone has an
idea.
Cheers,
--
Luís
- Copied to Bug #64654: fscrypt: add mount-syntax/v2 test for fscrypt added
[Edit] https://pulpito.ceph.com/vshankar-2024-02-27_04:05:06-fs-wip-vshankar-testing-20240226.124304-testing-default-smithi/7574246/
[22559.788266] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.796022] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.797179] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.809937] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.817265] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.824822] libceph: bad late_status 0x1
[22559.828856] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.836819] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.844404] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.852212] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.859695] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.860766] libceph: bad late_status 0x1
[22559.869820] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.877715] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.878964] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.891578] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.898896] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.900353] libceph: bad late_status 0x1
[22559.909087] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.917071] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.924877] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.932645] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.939957] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.941506] libceph: bad late_status 0x1
[22559.950120] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.958116] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.959112] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.971974] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.979145] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.980394] libceph: bad late_status 0x1
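For anyone triaging other runs, a rough way to check whether a saved console or kernel log shows the same reconnect loop is to count how often the sparse-read signature line repeats. This is only a sketch: the function name count_sparse_read_loops and the idea of using this single line as the loop signature are assumptions made for illustration, not something taken from the kernel or the teuthology tooling.

```shell
# Hypothetical helper: print how many times the sparse-read signature
# seen in the logs above appears in the given log file.
count_sparse_read_loops() {
    # -c prints the number of matching lines, -F matches the pattern as a
    # fixed string. grep exits non-zero when there are no matches, so
    # "|| true" keeps the helper usable under "set -e".
    grep -cF 'prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0' "$1" || true
}
```

A count that keeps growing between two snapshots of the same log (for example, two dumps of dmesg taken a few seconds apart) would suggest the client is stuck in the loop rather than having hit a one-off read error.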
- Status changed from In Progress to Fix Under Review
Patrick Donnelly wrote:
Here is one job where we would have likely caught this bug:
https://pulpito.ceph.com/vshankar-2024-02-27_04:05:06-fs-wip-vshankar-testing-20240226.124304-testing-default-smithi/7574294/
which uses the kclient (stock kernel) and ms_mode/crc
This one wouldn't have caught it because it just wanted to run fsx -- 5-workunit/suites/fsx boils down to:
OPTIONS="-z" # don't use zero range calls; not supported by cephfs
./fsx $OPTIONS 1MB -N 50000 -p 10000 -l 1048576
./fsx $OPTIONS 10MB -N 50000 -p 10000 -l 10485760
./fsx $OPTIONS 100MB -N 50000 -p 10000 -l 104857600
But if programs in the xfstests repo are failing to build, xfstests themselves wouldn't run either, so I think we have a root cause!
For completeness, it would be nice to link a job that wanted to run xfstests themselves here, and double check that the workunit in question wouldn't have excluded generic/580 for whatever reason.