Bug #59534
openqa/workunits/suites/dbench.sh failed with "write failed on handle 9938 (Input/output error)"
Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 100%
Source:
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2023-04-23T00:13:25.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 894 sec latency 894023.068 ms
2023-04-23T00:13:26.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 895 sec latency 895023.186 ms
2023-04-23T00:13:27.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 896 sec latency 896023.282 ms
2023-04-23T00:13:28.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 897 sec latency 897023.398 ms
2023-04-23T00:13:29.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 898 sec latency 898023.508 ms
2023-04-23T00:13:30.693 INFO:tasks.workunit.client.0.smithi179.stdout: 0 12 0.00 MB/sec warmup 899 sec latency 899023.569 ms
2023-04-23T00:13:31.671 INFO:tasks.workunit.client.0.smithi179.stdout:[12] write failed on handle 9938 (Input/output error)
2023-04-23T00:13:31.671 INFO:tasks.workunit.client.0.smithi179.stdout:Child failed with status 1
2023-04-23T00:13:31.672 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-04-23T00:13:31.673 INFO:tasks.workunit:Stopping ['suites/dbench.sh'] on client.0...
2023-04-23T00:13:31.673 DEBUG:teuthology.orchestra.run.smithi179:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-04-23T00:13:32.570 ERROR:teuthology.run_tasks:Saw exception from tasks. Traceback (most recent call last):
And from the fuse client logs:
remote/smithi179/log/ceph-client.0.134417.log.gz
2023-04-22T23:58:31.670+0000 7fb2bbfff700 1 -- 172.21.15.179:0/1855002755 --> v1:172.21.15.90:6812/134863 -- osd_op(unknown.0.0:3 3.9 3:9258b67d:::10000000005.00000000:head [stat] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e17) v9 -- 0x7fb2b40320a0 con 0x7fb2b402e390
2023-04-22T23:58:31.670+0000 7fb2bbfff700 1 -- 172.21.15.179:0/1855002755 --> v1:172.21.15.90:6812/134863 -- osd_op(unknown.0.0:4 3.9 3:9258b67d:::10000000005.00000000:head [create] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e17) v9 -- 0x7fb2b4032ae0 con 0x7fb2b402e390
2023-04-22T23:58:31.671+0000 7fb2d37fe700 10 client.14477 ms_handle_connect on v1:172.21.15.90:6812/134863
2023-04-22T23:58:31.671+0000 7fb2d37fe700 1 -- 172.21.15.179:0/1855002755 <== osd.3 v1:172.21.15.90:6812/134863 1 ==== osd_map(17..17 src has 1..17) v4 ==== 3194+0+0 (unknown 3046238503 0 0) 0x7fb2d400bd20 con 0x7fb2b402e390
2023-04-22T23:58:32.326+0000 7fb2d1ffb700 20 client.14477 tick
...
2023-04-23T00:13:31.670+0000 7fb2bbfff700 10 client.14477 check_pool_perm on pool 3 ns rd_err = -110 wr_err = -110
2023-04-23T00:13:31.670+0000 7fb2bbfff700 10 client.14477 check_pool_perm on pool 3 ns rd_err = -110 wr_err = -110
2023-04-23T00:13:31.670+0000 7fb2bbfff700 3 client.14477 ll_write 0x7fb2ac002cc0 65534~1 = -5
2023-04-23T00:13:31.670+0000 7fb2d0ff9700 3 client.14477 ll_flush 0x7fb2ac002cc0 0x10000000005
The write failed because the pool-permission check (check_pool_perm) returned errno -110 (ETIMEDOUT), which the client then surfaced to the application as an I/O error.
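For reference, the two error numbers in the client log line up with the dbench failure message: -110 is ETIMEDOUT and the -5 returned by ll_write is EIO ("Input/output error"). A quick illustrative check using Python's standard errno module (values shown are Linux's):

```python
import errno
import os

# check_pool_perm failed with -110 in the client log:
print(errno.errorcode[110], "->", os.strerror(110))  # ETIMEDOUT -> Connection timed out

# ll_write then returned -5 to the application, which is exactly
# the "Input/output error" dbench printed before exiting:
print(errno.errorcode[5], "->", os.strerror(5))      # EIO -> Input/output error
```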
From the osd.3 logs:
remote/smithi090/log/ceph-osd.3.log.gz
90:6812/134863 target_addr v1:172.21.15.179:38462/0
2023-04-22 23:58:31.670 7efdec7fe700 10 osd.3 17 ms_handle_authentication new session 0x562db8c1bb80 con 0x562db8ba0400 entity client.0 addr v1:172.21.15.179:0/1855002755
2023-04-22 23:58:31.670 7efdec7fe700 10 osd.3 17 ms_handle_authentication session 0x562db8c1bb80 client.0 has caps osdcap[grant(rwx)] 'allow rwx'
2023-04-22 23:58:31.671 7efdec7fe700 1 -- v1:172.21.15.90:6812/134863 <== client.14477 v1:172.21.15.179:0/1855002755 1 ==== osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9 ==== 236+0+0 (unknown 1119608350 0 0) 0x562dbb778600 con 0x562db8ba0400
2023-04-22 23:58:31.671 7efdec7fe700 15 osd.3 17 enqueue_op 0x562db8bfef20 prio 63 cost 0 latency 0.000021 epoch 0 osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9
2023-04-22 23:58:31.671 7efdec7fe700 20 osd.3 op_wq(0) _enqueue OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)
2023-04-22 23:58:31.671 7efdec7fe700 1 -- v1:172.21.15.90:6812/134863 <== client.14477 v1:172.21.15.179:0/1855002755 2 ==== osd_op(client.14477.0:4 0.0 0.0 (undecoded) - e0) v9 ==== 236+0+0 (unknown 3820315456 0 0) 0x562dbb778900 con 0x562db8ba0400
2023-04-22 23:58:31.671 7efdec7fe700 15 osd.3 17 enqueue_op 0x562db8f3e580 prio 63 cost 0 latency 0.000007 epoch 0 osd_op(client.14477.0:4 0.0 0.0 (undecoded) - e0) v9
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 (new) to_process <> waiting <> waiting_peering {}
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0) queued
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 to_process <OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)> waiting <> waiting_peering {}
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0) pg 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 no pg, shouldn't exist e17, dropping OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 should_share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 client session last_sent_epoch: 0 versus osdmap epoch 17
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 client.14477 has old map 0 < 17
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 send_incremental_map 0 -> 17 to 0x562db8ba0400 v1:172.21.15.179:0/1855002755
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 add_map_bl 17 3150 bytes
The osd requests were simply dropped by osd.3 because the osdmap epochs were mismatched.
This is an upgrade test case with a client from Nautilus: the client sent two osd requests tagged with epoch e17, while the osd decoded them with an incorrect epoch of e0 ("(undecoded) - e0" in the log).
This is likely an incompatibility between the latest clients and the Nautilus osd code.
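The osd.3 log above shows two separate reactions to the stale/undecodable epoch. A simplified model of that decision, as illustrative Python (the function and parameter names are hypothetical, not Ceph's actual C++ OSD code): the op arrives carrying epoch 0, the target PG 0.0 does not exist at the OSD's current epoch 17, so the op is silently dropped, and because the session's last_sent_epoch (0) is behind the OSD's epoch (17), an incremental map is sent back to the client instead of any op reply.

```python
def handle_client_op(osd_epoch, pg_exists, session_last_sent_epoch):
    """Illustrative model of the two log events: drop the op when its
    PG doesn't exist at the current epoch, and share the newer osdmap
    with a client whose session epoch is stale."""
    actions = []
    if not pg_exists:
        # matches: "_process 0.0 no pg, shouldn't exist e17, dropping ..."
        actions.append(f"drop op (no pg at e{osd_epoch})")
    if session_last_sent_epoch < osd_epoch:
        # matches: "client.14477 has old map 0 < 17"
        #          "send_incremental_map 0 -> 17 ..."
        actions.append(f"send_incremental_map {session_last_sent_epoch} -> {osd_epoch}")
    return actions

# The situation from the log: OSD at e17, op decoded as e0 targeting
# nonexistent pg 0.0, session last_sent_epoch still 0.
print(handle_client_op(osd_epoch=17, pg_exists=False, session_last_sent_epoch=0))
# -> ['drop op (no pg at e17)', 'send_incremental_map 0 -> 17']
```

Note that in this failure the client never gets an op reply at all, which is why check_pool_perm eventually times out with ETIMEDOUT on the client side.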