Bug #59534

qa/workunits/suites/dbench.sh failed with "write failed on handle 9938 (Input/output error)"

Added by Xiubo Li about 1 year ago. Updated 12 days ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
100%
Source:
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2023-04-20_04:50:51-fs-wip-vshankar-testing-20230412.053558-testing-default-smithi/7245942/

2023-04-23T00:13:25.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 894 sec  latency 894023.068 ms
2023-04-23T00:13:26.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 895 sec  latency 895023.186 ms
2023-04-23T00:13:27.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 896 sec  latency 896023.282 ms
2023-04-23T00:13:28.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 897 sec  latency 897023.398 ms
2023-04-23T00:13:29.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 898 sec  latency 898023.508 ms
2023-04-23T00:13:30.693 INFO:tasks.workunit.client.0.smithi179.stdout:   0        12     0.00 MB/sec  warmup 899 sec  latency 899023.569 ms
2023-04-23T00:13:31.671 INFO:tasks.workunit.client.0.smithi179.stdout:[12] write failed on handle 9938 (Input/output error)
2023-04-23T00:13:31.671 INFO:tasks.workunit.client.0.smithi179.stdout:Child failed with status 1
2023-04-23T00:13:31.672 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-04-23T00:13:31.673 INFO:tasks.workunit:Stopping ['suites/dbench.sh'] on client.0...
2023-04-23T00:13:31.673 DEBUG:teuthology.orchestra.run.smithi179:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-04-23T00:13:32.570 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

And from the fuse client logs:

remote/smithi179/log/ceph-client.0.134417.log.gz

2023-04-22T23:58:31.670+0000 7fb2bbfff700  1 -- 172.21.15.179:0/1855002755 --> v1:172.21.15.90:6812/134863 -- osd_op(unknown.0.0:3 3.9 3:9258b67d:::10000000005.00000000:head [stat] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e17) v9 -- 0x7fb2b40320a0 con 0x7fb2b402e390
2023-04-22T23:58:31.670+0000 7fb2bbfff700  1 -- 172.21.15.179:0/1855002755 --> v1:172.21.15.90:6812/134863 -- osd_op(unknown.0.0:4 3.9 3:9258b67d:::10000000005.00000000:head [create] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e17) v9 -- 0x7fb2b4032ae0 con 0x7fb2b402e390
2023-04-22T23:58:31.671+0000 7fb2d37fe700 10 client.14477 ms_handle_connect on v1:172.21.15.90:6812/134863
2023-04-22T23:58:31.671+0000 7fb2d37fe700  1 -- 172.21.15.179:0/1855002755 <== osd.3 v1:172.21.15.90:6812/134863 1 ==== osd_map(17..17 src has 1..17) v4 ==== 3194+0+0 (unknown 3046238503 0 0) 0x7fb2d400bd20 con 0x7fb2b402e390
2023-04-22T23:58:32.326+0000 7fb2d1ffb700 20 client.14477 tick
...
2023-04-23T00:13:31.670+0000 7fb2bbfff700 10 client.14477 check_pool_perm on pool 3 ns  rd_err = -110 wr_err = -110
2023-04-23T00:13:31.670+0000 7fb2bbfff700 10 client.14477 check_pool_perm on pool 3 ns  rd_err = -110 wr_err = -110
2023-04-23T00:13:31.670+0000 7fb2bbfff700  3 client.14477 ll_write 0x7fb2ac002cc0 65534~1 = -5
2023-04-23T00:13:31.670+0000 7fb2d0ff9700  3 client.14477 ll_flush 0x7fb2ac002cc0 0x10000000005

The write failed while checking the pool's permissions: both the read and write probes returned errno -110, which is ETIMEDOUT, and the write was then failed with -EIO (Input/output error).
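
As a side note, the pattern in the client log can be sketched in a few lines: a read probe ([stat]) and a write probe ([create]) are sent to the data pool and the caller waits for the replies; when no reply ever arrives, the wait gives up with -ETIMEDOUT (-110) and the write path surfaces -EIO (-5), which dbench reports as "write failed ... (Input/output error)". The struct, the short timeout and the helper names below are invented for illustration only; this is not the real Client::check_pool_perm() implementation.

// Simplified, hypothetical stand-in for the pool permission probe seen in
// the client log.  The stat/create ops were dropped by the OSD, so the
// replies never arrive, the wait times out with -ETIMEDOUT (-110), and the
// write is failed with -EIO (-5), matching "ll_write ... = -5".
#include <cerrno>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

struct PoolPermProbe {
  std::mutex lock;
  std::condition_variable cond;
  bool done = false;
  int rd_err = 0, wr_err = 0;

  // Would be called by the messenger thread when the OSD replies; in the
  // failure analysed here the ops are dropped, so this never runs.
  void handle_replies(int rd, int wr) {
    std::lock_guard<std::mutex> l(lock);
    rd_err = rd;
    wr_err = wr;
    done = true;
    cond.notify_all();
  }

  // Wait for the probe replies; give up after `timeout` and record
  // -ETIMEDOUT for both probes ("rd_err = -110 wr_err = -110" in the log).
  int wait(std::chrono::seconds timeout) {
    std::unique_lock<std::mutex> l(lock);
    if (!cond.wait_for(l, timeout, [this] { return done; })) {
      rd_err = -ETIMEDOUT;
      wr_err = -ETIMEDOUT;
    }
    // A write that cannot verify 'w' permission on the pool is failed with
    // -EIO, which is what dbench sees as "Input/output error".
    return (wr_err < 0) ? -EIO : 0;
  }
};

int main() {
  PoolPermProbe probe;
  // No reply ever arrives, so the wait times out (2 s here just for demo).
  int r = probe.wait(std::chrono::seconds(2));
  std::cout << "rd_err=" << probe.rd_err << " wr_err=" << probe.wr_err
            << " write result=" << r << "\n";
  return 0;
}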

From the osd.3 logs:

remote/smithi090/log/ceph-osd.3.log.gz

90:6812/134863 target_addr v1:172.21.15.179:38462/0
2023-04-22 23:58:31.670 7efdec7fe700 10 osd.3 17 ms_handle_authentication new session 0x562db8c1bb80 con 0x562db8ba0400 entity client.0 addr v1:172.21.15.179:0/1855002755
2023-04-22 23:58:31.670 7efdec7fe700 10 osd.3 17 ms_handle_authentication session 0x562db8c1bb80 client.0 has caps osdcap[grant(rwx)] 'allow rwx'
2023-04-22 23:58:31.671 7efdec7fe700  1 -- v1:172.21.15.90:6812/134863 <== client.14477 v1:172.21.15.179:0/1855002755 1 ==== osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9 ==== 236+0+0 (unknown 1119608350 0 0) 0x562dbb778600 con 0x562db8ba0400
2023-04-22 23:58:31.671 7efdec7fe700 15 osd.3 17 enqueue_op 0x562db8bfef20 prio 63 cost 0 latency 0.000021 epoch 0 osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9
2023-04-22 23:58:31.671 7efdec7fe700 20 osd.3 op_wq(0) _enqueue OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)
2023-04-22 23:58:31.671 7efdec7fe700  1 -- v1:172.21.15.90:6812/134863 <== client.14477 v1:172.21.15.179:0/1855002755 2 ==== osd_op(client.14477.0:4 0.0 0.0 (undecoded) - e0) v9 ==== 236+0+0 (unknown 3820315456 0 0) 0x562dbb778900 con 0x562db8ba0400
2023-04-22 23:58:31.671 7efdec7fe700 15 osd.3 17 enqueue_op 0x562db8f3e580 prio 63 cost 0 latency 0.000007 epoch 0 osd_op(client.14477.0:4 0.0 0.0 (undecoded) - e0) v9
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 (new) to_process <> waiting <> waiting_peering {}
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0) queued
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 to_process <OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)> waiting <> waiting_peering {}
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0) pg 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 no pg, shouldn't exist e17, dropping OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 should_share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 client session last_sent_epoch: 0 versus osdmap epoch 17
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 client.14477 has old map 0 < 17
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 send_incremental_map 0 -> 17 to 0x562db8ba0400 v1:172.21.15.179:0/1855002755
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 add_map_bl 17 3150 bytes

The OSD requests were simply dropped by osd.3 because the osdmap epochs did not match.
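
A rough, illustrative sketch of the behaviour visible in those osd.3 log lines (helper names are invented, this is not the actual OSD code): the op arrives carrying map epoch 0, the OSD finds that the referenced PG should not exist at its own epoch 17, drops the op without replying, and only shares an incremental osdmap with the client. Because no reply is ever sent, the client keeps waiting until check_pool_perm times out with -110.

// Invented helpers mirroring the osd.3 log: "no pg, shouldn't exist e17,
// dropping ..." followed by "send_incremental_map 0 -> 17".
#include <cstdint>
#include <iostream>
#include <optional>

struct Op {
  uint32_t map_epoch;  // epoch the op carries after decode (e0 in the log)
  uint32_t pgid;       // placement group the op was queued under (0.0)
};

// Hypothetical lookup: does this PG exist on the OSD at its current epoch?
std::optional<int> lookup_pg(uint32_t /*pgid*/, uint32_t /*osd_epoch*/) {
  return std::nullopt;  // "no pg, shouldn't exist e17"
}

void handle_op(const Op& op, uint32_t osd_epoch, uint32_t client_last_epoch) {
  if (!lookup_pg(op.pgid, osd_epoch)) {
    std::cout << "no pg, shouldn't exist e" << osd_epoch << ", dropping op\n";
    // share_map / should_share_map: the client's known epoch (0) is older
    // than the OSD's (17), so send an incremental map -- but the dropped op
    // itself is never answered.
    if (client_last_epoch < osd_epoch)
      std::cout << "send_incremental_map " << client_last_epoch << " -> "
                << osd_epoch << "\n";
    return;
  }
  // ... normal op processing would happen here ...
}

int main() {
  handle_op(Op{0, 0}, 17, 0);
  return 0;
}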

This is an upgrade test case from nautilus: the client sent two OSD requests with e17, while the OSD got an incorrect epoch number, e0.

This looks like an incompatibility between the latest client code and the nautilus OSD code.
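
If that diagnosis is right, the failure would be a classic encode/decode version skew. The sketch below is an invented, minimal illustration of that pattern and does not reflect the real MOSDOp wire format: the newer client writes the epoch into a new slot of a v9 layout, the older decoder still reads the old slot, ends up with epoch 0, and the op is dropped as stale.

// Hypothetical illustration of version skew between a newer encoder and an
// older decoder.  Layouts, versions and field positions are made up.
#include <cstdint>
#include <iostream>
#include <vector>

struct Encoded {
  uint8_t version;               // wire version written by the sender
  std::vector<uint32_t> fields;  // flattened payload
};

struct OsdOp {
  uint32_t map_epoch = 0;  // stays 0 if the decoder misses the epoch
  uint32_t pg = 0;
};

// Newer client (invented v9 layout): the epoch moved to a new slot and the
// old slot is now written as 0.
Encoded encode_v9(uint32_t pg, uint32_t epoch) {
  return Encoded{9, {pg, /*old epoch slot, now unused*/ 0, /*new slot*/ epoch}};
}

// Older OSD decoder: still reads the epoch from its old position, so a v9
// message yields map_epoch == 0.
OsdOp decode_old(const Encoded& in) {
  OsdOp op;
  op.pg = in.fields.at(0);
  op.map_epoch = in.fields.at(1);
  return op;
}

int main() {
  const uint32_t osd_epoch = 17;
  OsdOp op = decode_old(encode_v9(/*pg=*/0, /*epoch=*/17));
  if (op.map_epoch < osd_epoch) {
    // Matches the observed "(undecoded) - e0" op being dropped at e17.
    std::cout << "op carries e" << op.map_epoch << " < osd epoch e"
              << osd_epoch << ": op dropped as stale\n";
  }
  return 0;
}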


Subtasks 1 (0 open, 1 closed)

Bug #64265: Command failed (workunit test suites/dbench.sh) - Duplicate
