Bug #55840: windows clients unable to perform IO to clusters with over 200+ OSDs - Ceph - Ceph

Bug #55840

open

windows clients unable to perform IO to clusters with over 200+ OSDs

Added by Rafael Lopez almost 2 years ago. Updated 11 months ago.

Status:

Pending Backport

Priority:

Normal

Assignee:

Rafael Lopez

Category:

msgr

Target version:

v17.0.0

% Done:

Source:

Tags:

windows, messenger, msg, async, backport_processed

Backport:

pacific quincy

Regression:

Severity:

3 - minor

Reviewed:

06/03/2022

Affected Versions:

ceph-qa-suite:

Pull request ID:

46525

Crash signature (v1):

Crash signature (v2):

Description

When a cluster has a large number of OSDs (around 200 or more), Windows clients can perform IO only for a short period, after which it stalls. For example, a rados bench test looks like this:

PS C:\Users\Administrator\Downloads\ceph\ceph> rados.exe -p rbdec bench 60 write -b 1
hints = 1
Maintaining 16 concurrent writes of 1 bytes to objects of size 1 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_WIN-TEST_7452
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       1         1         0         0         0           -           0
    1      16       991       975 0.000921874 0.000929832   0.0170945    0.016306
    2      16      1853      1837 0.000865209 0.000822067   0.0094437   0.0163601
    3      16      2266      2250 0.000705895 0.000393867   0.0115127   0.0159883
    4      16      2518      2502 0.000588372 0.000240326   0.0126331   0.0158619
    5      16      2733      2717 0.000510963 0.00020504    0.012853   0.0157324
    6      16      2826      2810 0.000440277 8.86917e-05   0.0143737   0.0157153
    7      16      2884      2868 0.000385106 5.53131e-05   0.0137714   0.0157497
    8      16      2941      2925 0.000343621 5.43594e-05   0.0139032   0.0157918
    9      16      2941      2925 0.000305395         0           -   0.0157918
   10      16      2941      2925 0.000274849         0           -   0.0157918
   11      16      2941      2925 0.000249846         0           -   0.0157918
   12      16      2941      2925 0.00022901         0           -   0.0157918

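The cause is that FD_SETSIZE, which bounds the number of sockets select() can handle on Windows, is limited to 64. This means the client can manage only ms_async_op_threads * 64 FDs (3 * 64 = 192 by default) for IO to the cluster, so once connections to more OSDs than that are needed, IO stops.
The workaround is to increase ms_async_op_threads in the client's ceph.conf so that ms_async_op_threads * 64 covers the number of OSDs in the cluster.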
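
The arithmetic behind the workaround can be sketched as follows. This is a rough estimate, not Ceph code. It assumes one FD per OSD connection and treats any extra FDs (monitors, managers, and so on) as a caller-supplied headroom value. The helper names (fd_capacity, min_op_threads, render_client_conf) are hypothetical, and placing the option under a [client] section is an assumption based on the "client ceph.conf" wording above.

```python
import math

# Per-thread select() limit on Windows, as reported in this issue.
FD_SETSIZE_WINDOWS = 64


def fd_capacity(op_threads: int, fd_setsize: int = FD_SETSIZE_WINDOWS) -> int:
    """Total FDs the client can manage: one select() set per op thread."""
    return op_threads * fd_setsize


def min_op_threads(num_osds: int, headroom_fds: int = 0,
                   fd_setsize: int = FD_SETSIZE_WINDOWS) -> int:
    """Smallest ms_async_op_threads whose capacity covers the cluster.

    Assumption: one FD per OSD, plus optional headroom for other connections.
    """
    needed = num_osds + headroom_fds
    return max(1, math.ceil(needed / fd_setsize))


def render_client_conf(op_threads: int) -> str:
    """Render a ceph.conf fragment (the [client] section is an assumption)."""
    return f"[client]\nms_async_op_threads = {op_threads}\n"


if __name__ == "__main__":
    # Default of 3 threads gives 3 * 64 = 192 FDs, below a 200-OSD cluster.
    print("default capacity:", fd_capacity(3))
    threads = min_op_threads(200)
    print("threads needed for 200 OSDs:", threads)
    print(render_client_conf(threads))
```

For a 200-OSD cluster this estimate gives 4 threads (256 FDs). Treat the result as a lower bound and round up if the client holds additional connections.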

Related issues 2 (1 open — 1 closed)