Bug #58130
LibRadosAio.SimpleWrite hang and pkill
Description
A rados API test run failed even though the last test suite had passed and the global test environment tear-down had run.
/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/teuthology.log
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsECPP.TrySetDedupTier
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (6148 ms)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (299040 ms total)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp:
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [----------] Global test environment tear-down
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [==========] 72 tests from 4 test suites ran. (1828588 ms total)
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ PASSED ] 72 tests.
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ exit 1
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ cleanup
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ pkill -P 120721
2022-11-30T02:58:42.317 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-30T02:58:42.317 INFO:tasks.workunit.client.0.smithi018.stderr:+ true
2022-11-30T02:58:42.318 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
...
2022-11-30T02:58:42.841 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.18:3301/0
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 1 -- 172.21.15.18:0/1567830916 --> [v2:172.21.15.18:3301/0,v1:172.21.15.18:6790/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fb4b40af4c0 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.498+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 4 ==== mgrmap(e 4) v1 ==== 75039+0+0 (secure 0 0 0) 0x7fb4a002c070 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 5 ==== osd_map(1390..1390 src has 691..1390) v4 ==== 6026+0+0 (secure 0 0 0) 0x7fb4a007b610 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_dispatch 0x7fb4b4001550 osd_map(1390..1390 src has 691..1390) v4
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 3 client.23765.objecter handle_osd_map got epochs [1390,1390] > 0
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 3 client.23765.objecter handle_osd_map decoding full epoch 1390
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 20 client.23765.objecter dump_active .. 0 homeless
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 10 monclient: get_auth_request con 0x7fb49c060380 auth_method 0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_handle_connect 0x7fb49c060380
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 secure :-1 s=READY pgs=1646 cs=0 l=1 rev1=1 rx=0x7fb4b4069ac0 tx=0x7fb4a4009060).ready entity=mgr.4103 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-11-30T02:58:42.845 DEBUG:teuthology.orchestra.run.smithi018:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0 v0) v1 ==== 72+0+162146 (secure 0 0 0) 0x7fb4a004cdb0 con 0x7fb4b4068c10
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-11-30T02:58:42.847 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: _finish_command 1 = system:0
2022-11-30T02:58:42.848 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 103, in run_tasks
manager = run_one_task(taskname, ctx=ctx, config=config)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 82, in run_one_task
return task(**kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 134, in task
coverage_and_limits=not config.get('no_coverage_and_limits', None))
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 426, in _run_tests
label="workunit test {workunit}".format(workunit=workunit)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/remote.py", line 525, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 183, in _raise_for_status
node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi018 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bebb9581d1d6dc759a2591366d38060cec2b4275 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
Valgrind detected no memory leaks, but I did find this in the mon log around the time of the traceback:
/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/remote/smithi018/log/ceph-mon.a.log.gz
2022-11-30T02:58:40.999+0000 1d45e700 20 mon.a@0(leader).mgrstat health checks:
{
"PG_DEGRADED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded",
"count": 1
},
"detail": [
{
"message": "pg 71.3 is active+undersized+degraded, acting [5,2147483647,3]"
}
]
},
Related issues
Related to Bug #57751: LibRadosAio.SimpleWritePP hang and pkill
History
#1 Updated by Laura Flores about 1 year ago
Ran into another instance of this here:
/a/yuriw-2022-11-30_23:13:27-rados-wip-yuri2-testing-2022-11-30-0724-pacific-distro-default-smithi/7099593/
Same symptom, where the pid is killed without any other obvious clues. I don't think my observation in the mon log above means anything; what we really need to understand is what the pid is doing. However, grepping for the pid does not reveal any sort of trail. There is also no coredump or crash recorded.
#2 Updated by Brad Hubbard about 1 year ago
The issue appears to be in the api_aio test, since it gets started but never completes.
$ grep "Global test environment set-up" teuthology.log |gawk '{print $3}'|sort api_aio: api_aio_pp: api_asio: api_cmd: api_cmd_pp: api_c_read_operations: api_c_write_operations: api_io: api_io_pp: api_list: api_lock: api_lock_pp: api_misc: api_misc_pp: api_pool: api_service: api_service_pp: api_snapshots: api_snapshots_pp: api_stat: api_stat_pp: api_tier_pp: api_watch_notify: api_watch_notify_pp:
$ grep "Global test environment tear-down" teuthology.log|gawk '{print $3}'|sort api_aio_pp: api_asio: api_cmd: api_cmd_pp: api_c_read_operations: api_c_write_operations: api_io: api_io_pp: api_list: api_lock: api_lock_pp: api_misc: api_misc_pp: api_pool: api_service: api_service_pp: api_snapshots: api_snapshots_pp: api_stat: api_stat_pp: api_tier_pp: api_watch_notify: api_watch_notify_pp:
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [==========] Running 40 tests from 2 test suites.
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [----------] Global test environment set-up.
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [----------] 24 tests from LibRadosAio
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [ RUN ] LibRadosAio.TooBig
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [ OK ] LibRadosAio.TooBig (7105 ms)
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stdout: api_aio: [ RUN ] LibRadosAio.SimpleWrite
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stderr:bash: line 1: 119798 Alarm clock ceph_test_rados_api_aio 2>&1
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stderr: 119799 Done | tee ceph_test_rados_api_aio.log
2022-12-01T05:48:21.849 INFO:tasks.workunit.client.0.smithi090.stderr: 119800 Done | sed "s/^/ api_aio: /"
Looks like the problem may be in the LibRadosAio.SimpleWrite test.
We could try a special test run that executes api_aio only; see the sketch below.
If we can capture the ceph_test_rados_api_aio.log file, it may tell us more.
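One hedged way to do both, assuming the ceph_test_rados_api_aio gtest binary is present on the test node and a test cluster is reachable (the binary name and the tee pattern come from the workunit output above; the gtest flags are standard gtest options):
$ ceph_test_rados_api_aio --gtest_filter='LibRadosAio.SimpleWrite' 2>&1 | tee ceph_test_rados_api_aio.log
$ # or loop the whole suite to try to catch the hang:
$ ceph_test_rados_api_aio --gtest_filter='LibRadosAio.*' --gtest_repeat=10 2>&1 | tee ceph_test_rados_api_aio.log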
In the case of 7097464 above, the issue is with api_watch_notify, so it is probably a different (and possibly known) issue:
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout: api_watch_notify: [----------] 10 tests from LibRadosWatchNotify (136446 ms total)
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout: api_watch_notify:
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout: api_watch_notify: [----------] 1 test from LibRadosWatchNotifyEC
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotifyEC.WatchNotify
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr:bash: line 1: 121010 Alarm clock ceph_test_rados_api_watch_notify 2>&1
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr: 121011 Done | tee ceph_test_rados_api_watch_notify.log
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr: 121012 Done | sed "s/^/ api_watch_notify: /"
2022-11-30T02:54:05.555 INFO:tasks.workunit.client.0.smithi018.stdout:error in api_watch_notify (121007)
2022-11-30T02:54:05.559 INFO:tasks.workunit.client.0.smithi018.stderr:+ echo 'error in api_watch_notify (121007)'
#3 Updated by Laura Flores about 1 year ago
Thanks for your observations, Brad! I'm going to dedicate this Tracker to `LibRadosAio.SimpleWrite` and mark it as related to https://tracker.ceph.com/issues/57751.
#4 Updated by Laura Flores about 1 year ago
- Subject changed from rados_api_tests: HEALTH_WARN: Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded to LibRadosAio.SimpleWrite hang and pkill
#5 Updated by Laura Flores about 1 year ago
- Related to Bug #57751: LibRadosAio.SimpleWritePP hang and pkill added
#6 Updated by Nitzan Mordechai 12 months ago
Laura, I think it is different from that bug (57751); in that case, all the OSDs are still up.
We can see that watch_notify_test_cb was never called for LibRadosWatchNotifyEC.WatchNotify, and after a while the alarm was raised.
We know the PID of api_watch_notify (121010), so we can find the pool name of that test: test-rados-api-smithi018-121010-13.
The test writes the object foo and watches it; watch_notify_test_cb is supposed to post the semaphore (&sem) and release the alarm, but we never got the callback in time.
PR https://github.com/ceph/ceph/pull/49109 will fix that: it sets the alarm to 2400 seconds instead of 1200 seconds, which should give us enough time to complete the write op and get the callback in time.
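For anyone retracing this, a rough way to follow that test's objects through the archived logs using the PID-derived pool name above (the paths are illustrative, following the remote/<host>/log layout shown earlier in this ticket):
$ zgrep 'test-rados-api-smithi018-121010' remote/smithi018/log/ceph-mon.a.log.gz | head
$ zgrep 'test-rados-api-smithi018-121010' remote/*/log/ceph-osd.*.log.gz | grep foo | head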
#7 Updated by Radoslaw Zarzynski 12 months ago
Based on comment #14, we have a "fix candidate" that might also help with this issue.
If that's correct, we can wait for it to merge and then observe whether the symptoms described here persist.
#8 Updated by Kamoltat (Junior) Sirivadhna 11 months ago
/a/ksirivad-2022-12-22_17:58:01-rados-wip-ksirivad-testing-pacific-distro-default-smithi/7125137/
#9 Updated by Nitzan Mordechai 11 months ago
Kamoltat (Junior) Sirivadhna wrote:
/a/ksirivad-2022-12-22_17:58:01-rados-wip-ksirivad-testing-pacific-distro-default-smithi/7125137/
It looks like it is more related to: https://tracker.ceph.com/issues/21592
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: [ OK ] LibRadosCWriteOps.WriteSame (3018 ms)
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: [ RUN ] LibRadosCWriteOps.CmpExt
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-1548-g9523d2fc/rpm/el8/BUILD/ceph-16.2.10-1548-g9523d2fc/src/test/librados/c_write_operations.cc:269: Failure
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: Expected equality of these values:
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: -4095 - 1
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: Which is: -4096
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: val
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout: api_c_write_operations: Which is: 0
#10 Updated by Laura Flores 8 months ago
- Backport changed from pacific to pacific,quincy,reef
/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207033
2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stdout: api_aio: [ RUN ] LibRadosAio.SimpleWrite
2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stderr:bash: line 1: 119745 Alarm clock ceph_test_rados_api_aio 2>&1
2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stderr: 119746 Done | tee ceph_test_rados_api_aio.log
2023-03-16T16:08:22.636 INFO:tasks.workunit.client.0.smithi039.stderr: 119747 Done | sed "s/^/ api_aio: /"
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:04.727+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:05.755+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:06.721+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:07.676+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:08.637+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:09.594+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:10.645+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:11.684+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120
...
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stdout: api_aio_pp: [ RUN ] LibRadosAio.IsSafePP
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr:bash: line 1: 119753 Alarm clock ceph_test_rados_api_aio_pp 2>&1
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr: 119754 Done | tee ceph_test_rados_api_aio_pp.log
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr: 119755 Done | sed "s/^/ api_aio_pp: /"
2023-03-16T16:25:18.469 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:24:23.803+0000 7f3d84078700 -1 osd.5 798 handle_osd_map injecting map crc failure
2023-03-16T16:25:18.972 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:24:33.777+0000 7f3d84078700 -1 osd.5 801 handle_osd_map injecting map crc failure
2023-03-16T16:25:18.973 INFO:tasks.ceph.osd.6.smithi146.stderr:2023-03-16T16:24:34.068+0000 7f40625cb700 -1 osd.6 801 handle_osd_map injecting map crc failure
2023-03-16T16:25:20.428 INFO:tasks.ceph.osd.6.smithi146.stderr:2023-03-16T16:24:50.037+0000 7f4066b24700 -1 osd.6 802 handle_osd_map injecting map crc failure
2023-03-16T16:25:20.429 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:25:06.863+0000 7f3d84078700 -1 osd.5 804 handle_osd_map injecting map crc failure
#11 Updated by Nitzan Mordechai 6 months ago
Laura Flores wrote:
/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207033
[...]
I think we need to separate this issue from the others; at first look it seems like a real issue with AIO and not a test issue.
We get an osd_backoff, then the connection continues with only ping traffic ("ping magic"), and we never return to the AIO:
ceph-osd.7.log.gz:2023-03-16T15:24:10.122+0000 7f1d0eb5f700 1 --2- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] >> 172.21.15.39:0/2511184891 conn(0x5584756c5000 0x558476a24580 crc :-1 s=READY pgs=4 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=client.4559 client_cookie=0 server_
ceph-osd.7.log.gz:2023-03-16T15:24:10.123+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 1 ==== osd_op(client.4559.0:2 33.6 33.7fc1f406 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e162) v8 ==== 212+0+128 (crc 0 0
ceph-osd.7.log.gz:2023-03-16T15:24:10.125+0000 7f1cebf04700 10 osd.7 162 maybe_share_map con 0x5584756c5000 172.21.15.39:0/2511184891 map epoch 0 -> 162 (as per caller)
ceph-osd.7.log.gz:2023-03-16T15:24:10.130+0000 7f1ceff0c700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] --> 172.21.15.39:0/2511184891 -- osd_backoff(33.6 block id 1 [33:60000000::::0,33:68000000::::head) e162) v1 -- 0x558476f66000 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:10.132+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 2 ==== osd_backoff(33.6 ack-block id 1 [33:60000000::::0,33:68000000::::head) e162) v1 ==== 115+0+0 (crc 0 0 0) 0x558476f66000 con 0x5584756c50
ceph-osd.7.log.gz:2023-03-16T15:24:23.211+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 3 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:28.211+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 4 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:33.213+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 5 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:38.221+0000 7f1d0eb5f700 1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 6 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
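A hedged way to pull just the backoff trail for this PG out of that log (file name taken from the excerpt above; the match strings are approximate, not the exact dout prefixes):
$ zgrep -i backoff ceph-osd.7.log.gz | grep '33\.6' | less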
#12 Updated by Radoslaw Zarzynski 6 months ago
- Assignee set to Nitzan Mordechai
Would you mind continuing?
#13 Updated by Nitzan Mordechai 6 months ago
Still digging through the logs, but after we send the aio_write, osd.7 shows a backoff being added on PG 33.6:
2023-03-16T15:24:10.130+0000 7f1ceff0c700 10 osd.7 pg_epoch: 162 pg[33.6( empty local-lis/les=0/0 n=0 ec=161/161 lis/c=0/0 les/c/f=0/0/0 sis=161) [7] r=0 lpr=161 crt=0'0 mlcod 0'0 undersized+peered mbc={}] add_backoff session 0x5584769ec1e0 added Backoff(0x558476fcd680 33.6 1 new [33:60000000::::0,33:68000000::::head) session 0x5584769ec1e0 pg 0x558476aec000)
removed it at 15:26:38.243
re-added it at 15:26:38.243
removed it at 15:38:03.365
re-added it at 15:38:03.374
removed it at 16:04:10.098
released it at 17:22:29.487
Since the aio_write started at 15:24:10.093 and the alarm is set for 2400 seconds, that is probably not enough with the backoff in place; still checking why it took so long to release.
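Checking the arithmetic on that (timestamps from the timeline above): if the 2400-second alarm runs from the start of the aio_write at 15:24:10.093, it expires at
$ date -u -d '2023-03-16 15:24:10 UTC + 2400 seconds' '+%H:%M:%S'
16:04:10
which is right around the 16:04:10.098 removal above and well before the 17:22:29.487 release, so the op cannot complete inside the alarm window.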
#14 Updated by Nitzan Mordechai 6 months ago
It looks like something is wrong with backoff during peering in some cases.
From osd.7: we added the backoff after peering started (2023-03-16T15:38:03.375),
removed that backoff at 2023-03-16T16:04:10.098,
and released that backoff only at 2023-03-16T17:22:29.487 (when recovery completed).
It sounds like we are missing the release for peering. Continuing to check.
#15 Updated by Nitzan Mordechai 6 months ago
After checking the logs, it looks like pg 33.6 on osd.7 was in the peering state and we did not get a chance to complete the aio_write before the alarm popped after 2400 seconds.
rados/test.sh with OSD thrashing may generate a lot of peering activity, and it looks like we could not complete the write during that time.
#16 Updated by Nitzan Mordechai 6 months ago
- Status changed from New to In Progress
#17 Updated by Radoslaw Zarzynski 5 months ago
Looks like we should wait for the PGs to come out of peering. Bumping the priority up.
#18 Updated by Laura Flores about 1 month ago
/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582
#19 Updated by Nitzan Mordechai 29 days ago
Laura Flores wrote:
/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582
looks more related to https://tracker.ceph.com/issues/52657
#20 Updated by Laura Flores 29 days ago
Nitzan Mordechai wrote:
Laura Flores wrote:
/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582
looks more related to https://tracker.ceph.com/issues/52657
You're right, I missed this bit:
2023-11-02T15:13:15.170 INFO:tasks.ceph.osd.1.smithi151.stderr:*** Caught signal (Aborted) **
2023-11-02T15:13:15.171 INFO:tasks.ceph.osd.1.smithi151.stderr: in thread 29299640 thread_name:tp_osd_tp
2023-11-02T15:13:15.276 INFO:tasks.ceph.osd.2.smithi151.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.0-1159-gb5a9486b/rpm/el9/BUILD/ceph-18.2.0-1159-gb5a9486b/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.
2023-11-02T15:13:15.385 INFO:tasks.ceph.osd.2.smithi151.stderr:*** Caught signal (Aborted) **
2023-11-02T15:13:15.386 INFO:tasks.ceph.osd.2.smithi151.stderr: in thread 25a92640 thread_name:tp_osd_tp
2023-11-02T15:13:15.485 INFO:tasks.ceph.osd.2.smithi151.stderr: ceph version 18.2.0-1159-gb5a9486b (b5a9486b4a6f144f83c51b7ec33111a2c86dc081) reef (stable)
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 1: /lib64/libc.so.6(+0x54db0) [0x549fdb0]
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 2: /lib64/libc.so.6(+0xa154c) [0x54ec54c]
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 3: raise()
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 4: abort()
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 5: /lib64/libc.so.6(+0x2871b) [0x547371b]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 6: /lib64/libc.so.6(+0x4dca6) [0x5498ca6]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 7: ceph-osd(+0x579611) [0x681611]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 8: (Message::encode(unsigned long, int, bool)+0x2e) [0xcc19ee]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 9: (ProtocolV2::send_message(Message*)+0x241) [0xe593e1]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 10: (AsyncConnection::send_message(Message*)+0x266) [0xe3ceb6]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 11: (PG::send_cluster_message(int, boost::intrusive_ptr<Message>, unsigned int, bool)+0xa8) [0x7357b8]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 12: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x1543) [0x90e423]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 13: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x248) [0x924998]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 14: ceph-osd(+0x859c47) [0x961c47]
#21 Updated by Radoslaw Zarzynski 29 days ago
Shouldn't we update 52657 then?