Bug #58130


LibRadosAio.SimpleWrite hang and pkill

Added by Laura Flores over 1 year ago. Updated 2 months ago.

Status: In Progress
Priority: Normal
Category: -
Target version: -
% Done: 0%
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor

Description

A rados API test run failed even though the last test binary's global tear-down reported all of its tests passing.

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/teuthology.log

2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TrySetDedupTier
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (6148 ms)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (299040 ms total)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp:
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [==========] 72 tests from 4 test suites ran. (1828588 ms total)
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [  PASSED  ] 72 tests.
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ exit 1
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ cleanup
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ pkill -P 120721
2022-11-30T02:58:42.317 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-30T02:58:42.317 INFO:tasks.workunit.client.0.smithi018.stderr:+ true
2022-11-30T02:58:42.318 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

...
2022-11-30T02:58:42.841 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.18:3301/0
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700  1 -- 172.21.15.18:0/1567830916 --> [v2:172.21.15.18:3301/0,v1:172.21.15.18:6790/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fb4b40af4c0 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.498+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 4 ==== mgrmap(e 4) v1 ==== 75039+0+0 (secure 0 0 0) 0x7fb4a002c070 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 5 ==== osd_map(1390..1390 src has 691..1390) v4 ==== 6026+0+0 (secure 0 0 0) 0x7fb4a007b610 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_dispatch 0x7fb4b4001550 osd_map(1390..1390 src has 691..1390) v4
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map got epochs [1390,1390] > 0
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map decoding full epoch 1390
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 20 client.23765.objecter dump_active .. 0 homeless
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 10 monclient: get_auth_request con 0x7fb49c060380 auth_method 0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_handle_connect 0x7fb49c060380
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 secure :-1 s=READY pgs=1646 cs=0 l=1 rev1=1 rx=0x7fb4b4069ac0 tx=0x7fb4a4009060).ready entity=mgr.4103 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-11-30T02:58:42.845 DEBUG:teuthology.orchestra.run.smithi018:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+162146 (secure 0 0 0) 0x7fb4a004cdb0 con 0x7fb4b4068c10
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-11-30T02:58:42.847 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: _finish_command 1 = system:0
2022-11-30T02:58:42.848 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 426, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi018 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bebb9581d1d6dc759a2591366d38060cec2b4275 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'

Valgrind detected no memory leaks, but I did find this in the mon log around the time of the traceback:

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/remote/smithi018/log/ceph-mon.a.log.gz

2022-11-30T02:58:40.999+0000 1d45e700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 71.3 is active+undersized+degraded, acting [5,2147483647,3]" 
            }
        ]
    },


Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #57751: LibRadosAio.SimpleWritePP hang and pkill (Resolved; Nitzan Mordechai)

Related to RADOS - Bug #64504: aio ops queued but never executed (New)

Related to RADOS - Bug #64670: LibRadosAioEC.RoundTrip2 hang and pkill (New)

Actions #1

Updated by Laura Flores over 1 year ago

Ran into another instance of this here:

/a/yuriw-2022-11-30_23:13:27-rados-wip-yuri2-testing-2022-11-30-0724-pacific-distro-default-smithi/7099593/

Same symptom, where the pid is killed without any other obvious clues. I don't think my observation in the mon log above means anything; what we really need to understand is what the pid is doing. However, grepping for the pid does not reveal any sort of trail. There is also no coredump or crash recorded.

Actions #2

Updated by Brad Hubbard over 1 year ago

The issue appears to be in the api_aio test: it gets started but never completes.


$ grep "Global test environment set-up" teuthology.log |gawk '{print $3}'|sort
api_aio:
api_aio_pp:
api_asio:
api_cmd:
api_cmd_pp:
api_c_read_operations:
api_c_write_operations:
api_io:
api_io_pp:
api_list:
api_lock:
api_lock_pp:
api_misc:
api_misc_pp:
api_pool:
api_service:
api_service_pp:
api_snapshots:
api_snapshots_pp:
api_stat:
api_stat_pp:
api_tier_pp:
api_watch_notify:
api_watch_notify_pp:
$ grep "Global test environment tear-down" teuthology.log|gawk '{print $3}'|sort
api_aio_pp:
api_asio:
api_cmd:
api_cmd_pp:
api_c_read_operations:
api_c_write_operations:
api_io:
api_io_pp:
api_list:
api_lock:
api_lock_pp:
api_misc:
api_misc_pp:
api_pool:
api_service:
api_service_pp:
api_snapshots:
api_snapshots_pp:
api_stat:
api_stat_pp:
api_tier_pp:
api_watch_notify:
api_watch_notify_pp:


2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [==========] Running 40 tests from 2 test suites.
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [----------] Global test environment set-up.
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [----------] 24 tests from LibRadosAio
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [ RUN      ] LibRadosAio.TooBig
2022-12-01T05:48:21.847 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [       OK ] LibRadosAio.TooBig (7105 ms)
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stdout:                  api_aio: [ RUN      ] LibRadosAio.SimpleWrite
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stderr:bash: line 1: 119798 Alarm clock             ceph_test_rados_api_aio 2>&1
2022-12-01T05:48:21.848 INFO:tasks.workunit.client.0.smithi090.stderr:     119799 Done                    | tee ceph_test_rados_api_aio.log
2022-12-01T05:48:21.849 INFO:tasks.workunit.client.0.smithi090.stderr:     119800 Done                    | sed "s/^/                  api_aio: /" 

Looks like the problem may be in the LibRadosAio.SimpleWrite test.
We could try a special run that executes api_aio only.
If we can capture the ceph_test_rados_api_aio.log file, it may tell us more.
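
For example, something along these lines would isolate the suspect test on a reproduction cluster and keep its log; this is only a sketch (--gtest_filter is the standard googletest flag, and CEPH_ARGS mirrors what the workunit exports):

$ # run only the hanging test and capture its output, as the workunit does
$ CEPH_ARGS="--cluster ceph" ceph_test_rados_api_aio \
      --gtest_filter='LibRadosAio.SimpleWrite' 2>&1 | tee ceph_test_rados_api_aio.log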

In the case of 7097464 above, the issue is with api_watch_notify, so it is probably a different (and possibly known) issue.

2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout:         api_watch_notify: [----------] 10 tests from LibRadosWatchNotify (136446 ms total)
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout:         api_watch_notify:
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout:         api_watch_notify: [----------] 1 test from LibRadosWatchNotifyEC
2022-11-30T02:54:05.553 INFO:tasks.workunit.client.0.smithi018.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotifyEC.WatchNotify
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr:bash: line 1: 121010 Alarm clock             ceph_test_rados_api_watch_notify 2>&1
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr:     121011 Done                    | tee ceph_test_rados_api_watch_notify.log
2022-11-30T02:54:05.554 INFO:tasks.workunit.client.0.smithi018.stderr:     121012 Done                    | sed "s/^/         api_watch_notify: /" 
2022-11-30T02:54:05.555 INFO:tasks.workunit.client.0.smithi018.stdout:error in api_watch_notify (121007)
2022-11-30T02:54:05.559 INFO:tasks.workunit.client.0.smithi018.stderr:+ echo 'error in api_watch_notify (121007)'
Actions #3

Updated by Laura Flores over 1 year ago

Thanks for your observations, Brad! I'm going to dedicate this Tracker to `LibRadosAio.SimpleWrite` and mark it as related to https://tracker.ceph.com/issues/57751.

Actions #4

Updated by Laura Flores over 1 year ago

  • Subject changed from rados_api_tests: HEALTH_WARN: Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded to LibRadosAio.SimpleWrite hang and pkill
Actions #5

Updated by Laura Flores over 1 year ago

  • Related to Bug #57751: LibRadosAio.SimpleWritePP hang and pkill added
Actions #6

Updated by Nitzan Mordechai over 1 year ago

Laura, I think this is different from that bug (57751); in that case all the OSDs were still up.
We can see that watch_notify_test_cb was never called for LibRadosWatchNotifyEC.WatchNotify, and after a while the alarm was raised.

We know the PID of api_watch_notify (121010), so we can find the pool name of that test: test-rados-api-smithi018-121010-13
The test writes object foo and watches it; watch_notify_test_cb is supposed to post the semaphore &sem and release the alarm, but we never got the callback in time.
PR https://github.com/ceph/ceph/pull/49109 will fix that: it sets the alarm to 2400 sec instead of 1200 sec, which should give enough time to complete the write op and get the callback.
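
For the record, one way to follow that op on the OSD side is to grep the run's archived logs for the per-test pool name; a sketch only (the remote/<host>/log layout is assumed from the archive paths quoted above):

$ cd /a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464
$ # trace the per-test pool through the OSD logs
$ zgrep 'test-rados-api-smithi018-121010' remote/*/log/ceph-osd.*.log.gz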

Actions #7

Updated by Radoslaw Zarzynski over 1 year ago

Based on comment #14, we have a "fix candidate" that might also help with this issue.
If that's correct, we can wait for it to merge and then observe whether the symptoms described here persist.

Actions #8

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/ksirivad-2022-12-22_17:58:01-rados-wip-ksirivad-testing-pacific-distro-default-smithi/7125137/

Actions #9

Updated by Nitzan Mordechai over 1 year ago

Kamoltat (Junior) Sirivadhna wrote:

/a/ksirivad-2022-12-22_17:58:01-rados-wip-ksirivad-testing-pacific-distro-default-smithi/7125137/

It looks like it is more related to: https://tracker.ceph.com/issues/21592

2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations: [       OK ] LibRadosCWriteOps.WriteSame (3018 ms)
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations: [ RUN      ] LibRadosCWriteOps.CmpExt
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-1548-g9523d2fc/rpm/el8/BUILD/ceph-16.2.10-1548-g9523d2fc/src/test/librados/c_write_operations.cc:269: Failure
2022-12-22T20:55:04.563 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations: Expected equality of these values:
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations:   -4095 - 1
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations:     Which is: -4096
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations:   val
2022-12-22T20:55:04.564 INFO:tasks.workunit.client.0.smithi178.stdout:   api_c_write_operations:     Which is: 0

Actions #10

Updated by Laura Flores about 1 year ago

  • Backport changed from pacific to pacific,quincy,reef

/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207033

2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stdout:                  api_aio: [ RUN      ] LibRadosAio.SimpleWrite
2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stderr:bash: line 1: 119745 Alarm clock             ceph_test_rados_api_aio 2>&1
2023-03-16T16:08:22.635 INFO:tasks.workunit.client.0.smithi039.stderr:     119746 Done                    | tee ceph_test_rados_api_aio.log
2023-03-16T16:08:22.636 INFO:tasks.workunit.client.0.smithi039.stderr:     119747 Done                    | sed "s/^/                  api_aio: /" 
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:04.727+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:05.755+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:06.721+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.636 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:07.676+0000 7f5166efd700 -1 osd.4 623 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:08.637+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:09.594+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:10.645+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120106-8::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e171)
2023-03-16T16:08:22.637 INFO:tasks.ceph.osd.4.smithi146.stderr:2023-03-16T16:04:11.684+0000 7f5166efd700 -1 osd.4 624 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4824.0:14 41.0 41:2fd8f2fa:test-rados-api-smithi039-120

...

2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stdout:               api_aio_pp: [ RUN      ] LibRadosAio.IsSafePP
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr:bash: line 1: 119753 Alarm clock             ceph_test_rados_api_aio_pp 2>&1
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr:     119754 Done                    | tee ceph_test_rados_api_aio_pp.log
2023-03-16T16:25:18.467 INFO:tasks.workunit.client.0.smithi039.stderr:     119755 Done                    | sed "s/^/               api_aio_pp: /" 
2023-03-16T16:25:18.469 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:24:23.803+0000 7f3d84078700 -1 osd.5 798 handle_osd_map injecting map crc failure
2023-03-16T16:25:18.972 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:24:33.777+0000 7f3d84078700 -1 osd.5 801 handle_osd_map injecting map crc failure
2023-03-16T16:25:18.973 INFO:tasks.ceph.osd.6.smithi146.stderr:2023-03-16T16:24:34.068+0000 7f40625cb700 -1 osd.6 801 handle_osd_map injecting map crc failure
2023-03-16T16:25:20.428 INFO:tasks.ceph.osd.6.smithi146.stderr:2023-03-16T16:24:50.037+0000 7f4066b24700 -1 osd.6 802 handle_osd_map injecting map crc failure
2023-03-16T16:25:20.429 INFO:tasks.ceph.osd.5.smithi146.stderr:2023-03-16T16:25:06.863+0000 7f3d84078700 -1 osd.5 804 handle_osd_map injecting map crc failure

Actions #11

Updated by Nitzan Mordechai 11 months ago

Laura Flores wrote:

/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207033
[...]

I think we need to separate this issue from the others; at first look it seems like a real issue with AIO rather than a test issue.
We get an osd_backoff, then the connection only carries "ping magic" messages, and we never return to the AIO op.

ceph-osd.7.log.gz:2023-03-16T15:24:10.122+0000 7f1d0eb5f700  1 --2- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] >> 172.21.15.39:0/2511184891 conn(0x5584756c5000 0x558476a24580 crc :-1 s=READY pgs=4 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=client.4559 client_cookie=0 server_
ceph-osd.7.log.gz:2023-03-16T15:24:10.123+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 1 ==== osd_op(client.4559.0:2 33.6 33.7fc1f406 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e162) v8 ==== 212+0+128 (crc 0 0 
ceph-osd.7.log.gz:2023-03-16T15:24:10.125+0000 7f1cebf04700 10 osd.7 162 maybe_share_map con 0x5584756c5000 172.21.15.39:0/2511184891 map epoch 0 -> 162 (as per caller)
ceph-osd.7.log.gz:2023-03-16T15:24:10.130+0000 7f1ceff0c700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] --> 172.21.15.39:0/2511184891 -- osd_backoff(33.6 block id 1 [33:60000000::::0,33:68000000::::head) e162) v1 -- 0x558476f66000 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:10.132+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 2 ==== osd_backoff(33.6 ack-block id 1 [33:60000000::::0,33:68000000::::head) e162) v1 ==== 115+0+0 (crc 0 0 0) 0x558476f66000 con 0x5584756c50
ceph-osd.7.log.gz:2023-03-16T15:24:23.211+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 3 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:28.211+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 4 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:33.213+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 5 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000
ceph-osd.7.log.gz:2023-03-16T15:24:38.221+0000 7f1d0eb5f700  1 -- [v2:172.21.15.146:6826/112882,v1:172.21.15.146:6827/112882] <== client.4559 172.21.15.39:0/2511184891 6 ==== ping magic: 0 v1 ==== 0+0+0 (crc 0 0 0) 0x55847750eb40 con 0x5584756c5000

Actions #12

Updated by Radoslaw Zarzynski 11 months ago

  • Assignee set to Nitzan Mordechai

Would you mind continuing?

Actions #13

Updated by Nitzan Mordechai 11 months ago

Still digging through the logs, but after we send the aio_write, osd.7 shows a backoff added on PG 33.6:

2023-03-16T15:24:10.130+0000 7f1ceff0c700 10 osd.7 pg_epoch: 162 pg[33.6( empty local-lis/les=0/0 n=0 ec=161/161 lis/c=0/0 les/c/f=0/0/0 sis=161) [7] r=0 lpr=161 crt=0'0 mlcod 0'0 undersized+peered mbc={}] add_backoff session 0x5584769ec1e0 added Backoff(0x558476fcd680 33.6 1 new [33:60000000::::0,33:68000000::::head)  session 0x5584769ec1e0 pg 0x558476aec000)

removed at 15:26:38.243
re-added at 15:26:38.243
removed at 15:38:03.365
re-added at 15:38:03.374
removed at 16:04:10.098
released at 17:22:29.487

Since the aio_write started at 15:24:10.093 and the alarm is set for 2400 seconds, that is probably not enough time with the backoff in place; still checking why it took so long to release.
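
A minimal sketch of how this timeline can be pulled back out of the archived OSD log (assuming the same ceph-osd.7.log.gz quoted in comment #11):

$ # all backoff-related lines for pg 33.6 on osd.7, in time order
$ zgrep -i 'backoff' ceph-osd.7.log.gz | grep '33\.6'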

Actions #14

Updated by Nitzan Mordechai 11 months ago

It looks like something is wrong with backoff during peering in some cases.
From osd.7: we added the backoff after peering started (2023-03-16T15:38:03.375),
removed that backoff at 2023-03-16T16:04:10.098,
and released it only at 2023-03-16T17:22:29.487 (when recovery completed).
It sounds like we are missing the release for peering. Continuing to check.

Actions #15

Updated by Nitzan Mordechai 11 months ago

After checking the logs, it looks like pg 33.6 on osd.7 was in the peering state, and the aio_write had no chance to complete before the alarm popped after 2400 seconds.
rados/test.sh with the OSD thrasher can generate a lot of peering activity, and it looks like the write could not complete during that window.

Actions #16

Updated by Nitzan Mordechai 11 months ago

  • Status changed from New to In Progress
Actions #17

Updated by Radoslaw Zarzynski 10 months ago

Looks like we should watch for PGs coming out of peering. Bumping this up.

Actions #18

Updated by Laura Flores 6 months ago

/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582

Actions #19

Updated by Nitzan Mordechai 6 months ago

Laura Flores wrote:

/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582

looks more related to https://tracker.ceph.com/issues/52657

Actions #20

Updated by Laura Flores 6 months ago

Nitzan Mordechai wrote:

Laura Flores wrote:

/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444582

looks more related to https://tracker.ceph.com/issues/52657

You're right, I missed this bit:

2023-11-02T15:13:15.170 INFO:tasks.ceph.osd.1.smithi151.stderr:*** Caught signal (Aborted) **
2023-11-02T15:13:15.171 INFO:tasks.ceph.osd.1.smithi151.stderr: in thread 29299640 thread_name:tp_osd_tp
2023-11-02T15:13:15.276 INFO:tasks.ceph.osd.2.smithi151.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.0-1159-gb5a9486b/rpm/el9/BUILD/ceph-18.2.0-1159-gb5a9486b/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.
2023-11-02T15:13:15.385 INFO:tasks.ceph.osd.2.smithi151.stderr:*** Caught signal (Aborted) **
2023-11-02T15:13:15.386 INFO:tasks.ceph.osd.2.smithi151.stderr: in thread 25a92640 thread_name:tp_osd_tp
2023-11-02T15:13:15.485 INFO:tasks.ceph.osd.2.smithi151.stderr: ceph version 18.2.0-1159-gb5a9486b (b5a9486b4a6f144f83c51b7ec33111a2c86dc081) reef (stable)
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 1: /lib64/libc.so.6(+0x54db0) [0x549fdb0]
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 2: /lib64/libc.so.6(+0xa154c) [0x54ec54c]
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 3: raise()
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 4: abort()
2023-11-02T15:13:15.486 INFO:tasks.ceph.osd.2.smithi151.stderr: 5: /lib64/libc.so.6(+0x2871b) [0x547371b]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 6: /lib64/libc.so.6(+0x4dca6) [0x5498ca6]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 7: ceph-osd(+0x579611) [0x681611]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 8: (Message::encode(unsigned long, int, bool)+0x2e) [0xcc19ee]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 9: (ProtocolV2::send_message(Message*)+0x241) [0xe593e1]
2023-11-02T15:13:15.487 INFO:tasks.ceph.osd.2.smithi151.stderr: 10: (AsyncConnection::send_message(Message*)+0x266) [0xe3ceb6]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 11: (PG::send_cluster_message(int, boost::intrusive_ptr<Message>, unsigned int, bool)+0xa8) [0x7357b8]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 12: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x1543) [0x90e423]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 13: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x248) [0x924998]
2023-11-02T15:13:15.488 INFO:tasks.ceph.osd.2.smithi151.stderr: 14: ceph-osd(+0x859c47) [0x961c47]

Actions #21

Updated by Radoslaw Zarzynski 6 months ago

Shouldn't we update 52657 then?

Actions #22

Updated by Laura Flores 3 months ago

The file was too long to search through with vim or less, but I found this using grep:

/a/lflores-2024-01-22_04:12:37-rados-wip-lflores-testing-2024-01-18-2129-distro-default-smithi/7525368

lflores@teuthology:/a/lflores-2024-01-22_04:12:37-rados-wip-lflores-testing-2024-01-18-2129-distro-default-smithi/7525368$ grep "Alarm clock" teuthology.log
2024-01-23T11:06:46.058 INFO:tasks.workunit.client.0.smithi008.stderr:bash: line 1: 49614 Alarm clock             ceph_test_rados_api_aio_pp --gtest_output=xml:/home/ubuntu/cephtest/archive/gtest_xml_report-api_aio_pp.xml 2>&1
lflores@teuthology:/a/lflores-2024-01-22_04:12:37-rados-wip-lflores-testing-2024-01-18-2129-distro-default-smithi/7525368$ grep "FAILED" teuthology.log
2024-01-23T11:35:00.099 INFO:tasks.workunit.client.0.smithi008.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1, where GetParam() = "cache" (192456 ms)

Actions #23

Updated by Nitzan Mordechai 3 months ago

Laura Flores wrote:

The file was too long to search through with vim or less, but I found this using grep:

/a/lflores-2024-01-22_04:12:37-rados-wip-lflores-testing-2024-01-18-2129-distro-default-smithi/7525368
[...]

Wow, did something happen? 8.7 GB for teuthology.log is pretty high, no?
Anyway, it looks like something slowed the test down; the alarm was set to 2400 seconds:
TestAlarm() {
  alarm(2400);
}

We can increase it, but maybe I'll first look for any changes that slowed things down.

Actions #24

Updated by Laura Flores 2 months ago

/a/lflores-2024-02-13_16:18:32-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7558384

2024-02-13T18:05:17.612 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: Running main() from gmock_main.cc
2024-02-13T18:05:17.612 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [==========] Running 42 tests from 2 test suites.
2024-02-13T18:05:17.613 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [----------] Global test environment set-up.
2024-02-13T18:05:17.613 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [----------] 26 tests from LibRadosAio
2024-02-13T18:05:17.613 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [ RUN      ] LibRadosAio.TooBig
2024-02-13T18:05:17.613 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [       OK ] LibRadosAio.TooBig (2459 ms)
2024-02-13T18:05:17.613 INFO:tasks.workunit.client.0.smithi084.stdout:                  api_aio: [ RUN      ] LibRadosAio.SimpleWrite
2024-02-13T18:05:17.614 INFO:teuthology.orchestra.run.smithi084.stderr:2024-02-13T18:05:17.600+0000 7f71197fa640  1 -- 172.21.15.84:0/278559395 <== mon.0 v2:172.21.15.84:3300/0 4 ==== mgrmap(e 5) v1 ==== 89788+0+0 (secure 0 0 0) 0x7f711004d3f0 con 0x7f711c06ba50
2024-02-13T18:05:17.614 INFO:tasks.workunit.client.0.smithi084.stderr:bash: line 1: 51563 Alarm clock             ceph_test_rados_api_aio --gtest_output=xml:/home/ubuntu/cephtest/archive/gtest_xml_report-api_aio.xml 2>&1
2024-02-13T18:05:17.615 INFO:tasks.workunit.client.0.smithi084.stderr:     51566 Done                    | tee ceph_test_rados_api_aio.log
2024-02-13T18:05:17.615 INFO:tasks.workunit.client.0.smithi084.stderr:     51568 Done                    | sed "s/^/                  api_aio: /" 
2024-02-13T18:05:17.619 INFO:teuthology.orchestra.run.smithi161.stdout:{

Actions #25

Updated by Laura Flores 2 months ago

This one came with a few other symptoms:

/a/yuriw-2024-02-13_23:41:23-rados-wip-yuri6-testing-2024-02-13-0904-distro-default-smithi/7558996

2024-02-14T19:56:30.998 INFO:tasks.workunit.client.0.smithi088.stdout:                  api_aio: [ RUN      ] LibRadosAio.SimpleWrite
2024-02-14T19:56:30.999 INFO:tasks.workunit.client.0.smithi088.stderr:bash: line 1: 33891 Alarm clock             ceph_test_rados_api_aio --gtest_output=xml:/home/ubuntu/cephtest/archive/gtest_xml_report-api_aio.xml 2>&1
2024-02-14T19:56:30.999 INFO:tasks.workunit.client.0.smithi088.stderr:     33892 Done                    | tee ceph_test_rados_api_aio.log
2024-02-14T19:56:30.999 INFO:tasks.workunit.client.0.smithi088.stderr:     33893 Done                    | sed "s/^/                  api_aio: /" 
2024-02-14T19:56:31.003 INFO:tasks.ceph.osd.4.smithi117.stderr:2024-02-14T19:56:31.000+0000 7f1fcdab6640 -1 received  signal: Hangup from /usr/bin/python3 /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4  (PID: 25210) UID: 0

2024-02-14T21:07:04.827 INFO:tasks.workunit.client.0.smithi088.stdout:               api_aio_pp: [ RUN      ] LibRadosAio.ExecuteClassPP
2024-02-14T21:07:04.828 INFO:tasks.workunit.client.0.smithi088.stderr:bash: line 1: 33899 Alarm clock             ceph_test_rados_api_aio_pp --gtest_output=xml:/home/ubuntu/cephtest/archive/gtest_xml_report-api_aio_pp.xml 2>&1
2024-02-14T21:07:04.828 INFO:tasks.workunit.client.0.smithi088.stderr:     33900 Done                    | tee ceph_test_rados_api_aio_pp.log
2024-02-14T21:07:04.828 INFO:tasks.workunit.client.0.smithi088.stderr:     33901 Done                    | sed "s/^/               api_aio_pp: /" 
2024-02-14T21:07:04.829 INFO:teuthology.orchestra.run.smithi088.stdout:ERROR: (22) Invalid argument
2024-02-14T19:50:32.283 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify2Timeout
2024-02-14T19:50:32.283 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: watch_notify2_test_cb from 4914 notify_id 4690104287233 cookie 94085572009248
2024-02-14T19:50:32.283 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: watch_notify2_test_errcb cookie 94085572009248 err -107
2024-02-14T19:50:32.283 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: ./src/test/librados/watch_notify.cc:504: Failure
2024-02-14T19:50:32.283 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: Expected equality of these values:
2024-02-14T19:50:32.284 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify:   1u
2024-02-14T19:50:32.284 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify:     Which is: 1
2024-02-14T19:50:32.284 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify:   notify_cookies.size()
2024-02-14T19:50:32.284 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify:     Which is: 0
2024-02-14T19:50:32.284 INFO:tasks.workunit.client.0.smithi088.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify2Timeout (615927 ms)
Actions #26

Updated by Nitzan Mordechai 2 months ago

Looks like something new. There is a big gap in the thrasher's start/stop of OSDs between around 19:30:00 and 21:10:00.
Then we can see:

2024-02-14T21:09:29.182 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 1425, in _do_thrash
    self.ceph_manager.wait_for_recovery(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 2873, in wait_for_recovery
    assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired

2024-02-14T21:09:29.182 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 1301, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 1425, in _do_thrash
    self.ceph_manager.wait_for_recovery(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_8420a053c0164096266bfcf03c3e3e81ef20d093/qa/tasks/ceph_manager.py", line 2873, in wait_for_recovery
    assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired

wait_for_recovery started at:

2024-02-14T19:33:58.563 INFO:tasks.ceph.ceph_manager.ceph:waiting for recovery to complete

Looks like these 2 PGs are causing the delay:

"detail": [
    {
        "message": "pg 24.13 is active+recovery_wait+degraded, acting [7,6,1]"
    },
    {
        "message": "pg 28.a is active+recovery_wait+degraded, acting [5,6,1]"
    }
]

Checking that.
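
If this reproduces on a live cluster, querying the stuck PGs directly should show what recovery is waiting on; a sketch using the standard CLI (not run against this job):

$ # inspect why recovery is waiting on these PGs
$ ceph pg 24.13 query
$ ceph pg 28.a query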

Actions #27

Updated by Nitzan Mordechai 2 months ago

Thanks to Aishwarya, who also looked at the queued ops that were never executed; I opened a new bug for it: https://tracker.ceph.com/issues/64504

Actions #28

Updated by Nitzan Mordechai 2 months ago

  • Related to Bug #64504: aio ops queued but never executed added
Actions #29

Updated by Laura Flores about 2 months ago

  • Related to Bug #64670: LibRadosAioEC.RoundTrip2 hang and pkill added
Actions #30

Updated by Laura Flores about 2 months ago

  • Related to Bug #64726: LibRadosAioEC.MultiWritePP hang and pkill added
Actions #31

Updated by Nitzan Mordechai about 2 months ago

  • Related to deleted (Bug #64726: LibRadosAioEC.MultiWritePP hang and pkill)
