Bug #64607
openceph: fstest generic/580 test failure with infinite loop
Description
This is reported by Luis, please see https://patchwork.kernel.org/project/ceph-devel/patch/20240125023920.1287555-4-xiubli@redhat.com/.
I'm seeing an issue with fstest generic/580, which seems to enter an infinite loop effectively rendering the testing VM unusable. It's pretty easy to reproduce, just run the test ensuring to be using msgv2 (I'm mounting the filesystem with 'ms_mode=crc'), and you should see the following on the logs: [...] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0 libceph: osd1 (2)192.168.155.1:6810 read processing error libceph: mon0 (2)192.168.155.1:40608 session established libceph: bad late_status 0x1 libceph: osd1 (2)192.168.155.1:6810 protocol error, bad epilogue libceph: mon0 (2)192.168.155.1:40608 session established libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0 libceph: osd1 (2)192.168.155.1:6810 read processing error libceph: mon0 (2)192.168.155.1:40608 session established libceph: bad late_status 0x1 [...] Reverting this patch (commit 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket")) seems to fix. I haven't investigated it further, but since it'll take me some time to refresh my memory, I thought I should report it immediately. Maybe someone has any idea. Cheers, -- Luís
Updated by Xiubo Li 2 months ago
- Copied to Bug #64654: fscrypt: add mount-syntax/v2 test for fscrypt added
Updated by Xiubo Li 2 months ago
[22559.788266] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.796022] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.797179] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.809937] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.817265] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.824822] libceph: bad late_status 0x1
[22559.828856] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.836819] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.844404] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.852212] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.859695] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.860766] libceph: bad late_status 0x1
[22559.869820] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.877715] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.878964] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.891578] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.898896] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.900353] libceph: bad late_status 0x1
[22559.909087] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.917071] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.924877] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.932645] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.939957] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.941506] libceph: bad late_status 0x1
[22559.950120] libceph: osd5 (2)172.21.15.123:6816 protocol error, bad epilogue
[22559.958116] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.959112] libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
[22559.971974] libceph: osd5 (2)172.21.15.123:6816 read processing error
[22559.979145] libceph: mon0 (2)172.21.15.3:3300 session established
[22559.980394] libceph: bad late_status 0x1
Updated by Xiubo Li 2 months ago
For the messenger v2 test for fscrypt we need to backport https://tracker.ceph.com/issues/59195.
Updated by Patrick Donnelly 2 months ago
I want to answer a question from Ilya about whether we should be catching this in upstream QA. The answer is: we should, except there are currently build failures with xfstests caused by the switch to CentOS 9. So we are unlucky. Here is one job where we would likely have caught this bug:
which uses the kclient (stock kernel) and ms_mode/crc
Updated by Ilya Dryomov 2 months ago
Patrick Donnelly wrote:
Here is one job where we would have likely caught this bug:
which uses the kclient (stock kernel) and ms_mode/crc
This one wouldn't have caught it because it just wanted to run fsx -- 5-workunit/suites/fsx boils down to:
OPTIONS="-z" # don't use zero range calls; not supported by cephfs
./fsx $OPTIONS 1MB -N 50000 -p 10000 -l 1048576
./fsx $OPTIONS 10MB -N 50000 -p 10000 -l 10485760
./fsx $OPTIONS 100MB -N 50000 -p 10000 -l 104857600

But if programs in the xfstests repo are failing to build, xfstests themselves wouldn't run either, so I think we have a root cause!
For completeness, it would be nice to link here a job that wanted to run xfstests themselves, and double-check that the workunit in question wouldn't have excluded generic/580 for whatever reason.