Bug #64607
openceph: fstest generic/580 test failure with infinite loop
Description
This is reported by Luis, please see https://patchwork.kernel.org/project/ceph-devel/patch/20240125023920.1287555-4-xiubli@redhat.com/.
I'm seeing an issue with fstest generic/580, which seems to enter an infinite loop effectively rendering the testing VM unusable. It's pretty easy to reproduce, just run the test ensuring to be using msgv2 (I'm mounting the filesystem with 'ms_mode=crc'), and you should see the following on the logs: [...] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0 libceph: osd1 (2)192.168.155.1:6810 read processing error libceph: mon0 (2)192.168.155.1:40608 session established libceph: bad late_status 0x1 libceph: osd1 (2)192.168.155.1:6810 protocol error, bad epilogue libceph: mon0 (2)192.168.155.1:40608 session established libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0 libceph: osd1 (2)192.168.155.1:6810 read processing error libceph: mon0 (2)192.168.155.1:40608 session established libceph: bad late_status 0x1 [...] Reverting this patch (commit 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket")) seems to fix. I haven't investigated it further, but since it'll take me some time to refresh my memory, I thought I should report it immediately. Maybe someone has any idea. Cheers, -- Luís
Updated by Xiubo Li 2 months ago
- Copied to Bug #64654: fscrypt: add mount-syntax/v2 test for fscrypt added
Updated by Xiubo Li 2 months ago
[22559.788266] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.796022] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.797179] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.809937] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.817265] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.824822] libceph: bad late_status 0x1
[22559.828856] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.836819] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.844404] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.852212] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.859695] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.860766] libceph: bad late_status 0x1
[22559.869820] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.877715] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.878964] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.891578] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.898896] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.900353] libceph: bad late_status 0x1
[22559.909087] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.917071] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.924877] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.932645] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.939957] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.941506] libceph: bad late_status 0x1
[22559.950120] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.958116] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.959112] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.971974] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.979145] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.980394] libceph: bad late_status 0x1
Updated by Xiubo Li 2 months ago
For the messenger v2 test for fscrypt we need to backport https://tracker.ceph.com/issues/59195.
Updated by Patrick Donnelly 2 months ago
I want to answer a question from Ilya about whether we should be catching this in upstream QA. The answer is: we should, except there are currently build failures with xfstests caused by the switch to CentOS 9. So we are unlucky. Here is one job where we would likely have caught this bug:
which uses the kclient (stock kernel) and ms_mode/crc
Updated by Ilya Dryomov 2 months ago
Patrick Donnelly wrote:
Here is one job where we would have likely caught this bug:
which uses the kclient (stock kernel) and ms_mode/crc
This one wouldn't have caught it because it just wanted to run fsx -- 5-workunit/suites/fsx boils down to:
OPTIONS="-z" # don't use zero range calls; not supported by cephfs
./fsx $OPTIONS 1MB -N 50000 -p 10000 -l 1048576
./fsx $OPTIONS 10MB -N 50000 -p 10000 -l 10485760
./fsx $OPTIONS 100MB -N 50000 -p 10000 -l 104857600

But if programs in the xfstests repo are failing to build, xfstests themselves wouldn't run either, so I think we have a root cause!
For completeness, it would be nice to link here a job that wanted to run xfstests themselves, and double-check that the workunit in question wouldn't have excluded generic/580 for whatever reason.