Bug #57740
crimson: op hang while running ./bin/ceph_test_rados_api_aio_pp and ./bin/ceph_test_rados_api_aio concurrently
Description
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --crimson --redirect-output --debug -n --no-restart --crimson-smp 3
./bin/ceph osd pool create rbd 32 32 replicated replicated_rule 2 2 2 # likely irrelevant, but part of my script
while ( true ); do CRIMSON_COMPAT=1 ./bin/ceph_test_rados_api_aio_pp ; done &
while ( true ); do CRIMSON_COMPAT=1 ./bin/ceph_test_rados_api_aio ; done &
Eventually, both will hang (not at a consistent point).
Testing done on a testing branch with a bunch of extra debugging and introspection: https://github.com/athanatos/ceph/tree/sjust/testing/2022-09-30-170258
History
#1 Updated by Samuel Just over 1 year ago
osd.2 has a bunch of ops blocked on:
"blocker": { "op_type": "ConnectionPipeline::get_pg", "detail": { "waiting": 50, "held_by_operation_id": 833804 } }
op id 833804 is
"type": "peering_event",
"id": 833804,
"detail": {
  "PeeringEvent": {
    "from": "?",
    "pgid": "258.1b",
    "sent": 554,
    "requested": 554,
    "evt": "epoch_sent: 554 epoch_requested: 554 NullEvt +create_info",
    "events": [
      { "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::StartEvent", "initiated_at": "0.000000" },
      { "name": "crimson::osd::ConnectionPipeline::AwaitActive", "initiated_at": "2022-09-30T23:31:29.335391+0000" },
      { "name": "crimson::osd::ConnectionPipeline::AwaitMap", "initiated_at": "2022-09-30T23:31:29.335563+0000" },
      { "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)0>::OSDMapBlocker", "initiated_at": "0.000000" },
      { "name": "crimson::osd::ConnectionPipeline::GetPG", "initiated_at": "2022-09-30T23:31:31.405028+0000" },
      { "name": "crimson::osd::PGMap::PGCreationState", "initiated_at": "2022-09-30T23:31:31.737318+0000",
        "blocker": { "op_type": "PGCreation", "detail": { "pgid": "258.1b", "creating": true } } },
      { "name": "crimson::osd::PGPeeringPipeline::AwaitMap", "initiated_at": "0.000000" },
      { "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)1>::OSDMapBlocker", "initiated_at": "0.000000" },
      { "name": "crimson::osd::PGPeeringPipeline::Process", "initiated_at": "0.000000" },
      { "name": "crimson::osd::BackfillRecovery::BackfillRecoveryPipeline::Process", "initiated_at": "0.000000" },
      { "name": "crimson::osd::RemotePeeringEvent::OSDPipeline::AwaitActive", "initiated_at": "0.000000" },
      { "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::CompletionEvent", "initiated_at": "0.000000" }
    ]
  }
}
So osd.2 at least is blocked by
"blocker": { "op_type": "PGCreation", "detail": { "pgid": "258.1b", "creating": true } }
osd.1 has ops that are blocked like:
"blocker": { "op_type": "ConnectionPipeline::get_pg", "detail": { "waiting": 288, "held_by_operation_id": 816509 } }
op 816509 is
"id": 816509,
"detail": {
  "PeeringEvent": {
    "from": "?",
    "pgid": "258.c",
    "sent": 554,
    "requested": 554,
    "evt": "epoch_sent: 554 epoch_requested: 554 NullEvt +create_info",
    "events": [
      { "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::StartEvent", "initiated_at": "0.000000" },
      { "name": "crimson::osd::ConnectionPipeline::AwaitActive", "initiated_at": "2022-09-30T23:31:29.340153+0000" },
      { "name": "crimson::osd::ConnectionPipeline::AwaitMap", "initiated_at": "2022-09-30T23:31:29.341310+0000" },
      { "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)0>::OSDMapBlocker", "initiated_at": "0.000000" },
      { "name": "crimson::osd::ConnectionPipeline::GetPG", "initiated_at": "2022-09-30T23:31:30.783338+0000" },
      { "name": "crimson::osd::PGMap::PGCreationState", "initiated_at": "2022-09-30T23:31:32.641184+0000",
        "blocker": { "op_type": "PGCreation", "detail": { "pgid": "258.c", "creating": true } } },
      { "name": "crimson::osd::PGPeeringPipeline::AwaitMap", "initiated_at": "0.000000" },
      { "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)1>::OSDMapBlocker", "initiated_at": "0.000000" },
      { "name": "crimson::osd::PGPeeringPipeline::Process", "initiated_at": "0.000000" },
      { "name": "crimson::osd::BackfillRecovery::BackfillRecoveryPipeline::Process", "initiated_at": "0.000000" },
      { "name": "crimson::osd::RemotePeeringEvent::OSDPipeline::AwaitActive", "initiated_at": "0.000000" },
      { "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::CompletionEvent", "initiated_at": "0.000000" }
    ]
  }
}
So osd.1 is likewise stuck on
"blocker": { "op_type": "PGCreation", "detail": { "pgid": "258.c", "creating": true } }
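To illustrate the hang mechanism: many ops queue up behind a single PG-creation blocker and only make progress once its future is resolved. The sketch below is a hypothetical std-only analogue (crimson actually uses seastar futures, and `CreationState`/`wait_for_pg` are made-up names), showing that if the creation path exits without settling the promise, every waiter stays blocked indefinitely.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical stand-in for the state behind "PGCreation ... creating: true":
// all ops targeting the pgid share one future that only the creation path
// can resolve.
struct CreationState {
  std::promise<bool> done;
  std::shared_future<bool> fut = done.get_future().share();
};

// A waiter "blocks" on the shared future; returns false on timeout, which is
// the std::future analogue of the ops stuck in ConnectionPipeline::GetPG.
bool wait_for_pg(const std::shared_future<bool>& f,
                 std::chrono::milliseconds budget) {
  return f.wait_for(budget) == std::future_status::ready;
}
```

With this model, a creation path that returns without calling `done.set_value()` (or setting an exception) leaves `wait_for_pg` timing out forever, which matches the dumps above.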
#2 Updated by Samuel Just over 1 year ago
- Related to Bug #57547: Hang with seastore at wait_for_active stage added
#3 Updated by Samuel Just over 1 year ago
osd.1:
DEBUG 2022-09-30 23:31:32,641 [shard 2] osd - Creating 258.c
...
DEBUG 2022-09-30 23:31:32,641 [shard 2] osd - operator() ignoring pgid 258.c, pool dne
#4 Updated by Samuel Just over 1 year ago
The bug is that we aren't erroring out the blocking future in get_or_create_pg when the pool does not exist: the "pool dne" path ignores the pgid without settling the future the waiting ops are blocked on.
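The fix pattern can be sketched as follows. This is a minimal std-only analogy, not crimson's actual seastar-based code; `PGCreationState` here is a made-up stand-in and `handle_pg_create` a hypothetical name for the get_or_create_pg control flow. The point is that every exit path, including the "pool dne" one, must settle the promise.

```cpp
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the per-pgid creation state waiters block on.
struct PGCreationState {
  std::promise<std::string> pg;                     // settled by the creation path
  std::future<std::string> fut = pg.get_future();   // what blocked ops wait on
};

// Sketch of the corrected control flow. Before the fix, the !pool_exists
// branch just logged "ignoring pgid ..., pool dne" and returned, leaving
// every waiter on st.fut blocked forever.
void handle_pg_create(PGCreationState& st, const std::string& pgid,
                      bool pool_exists) {
  if (!pool_exists) {
    // Error out the blocking future so waiters fail instead of hanging.
    st.pg.set_exception(std::make_exception_ptr(
        std::runtime_error("pool dne for pgid " + pgid)));
    return;
  }
  st.pg.set_value(pgid);  // creation succeeded; unblock waiters
}
```

Propagating an error wakes the queued ops with a failure they can handle, rather than leaving them parked on the PGCreation blocker as seen in the op dumps above.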
#5 Updated by Samuel Just over 1 year ago
- Status changed from New to Resolved