Project

General

Profile

Bug #57740

crimson: op hang while running ./bin/ceph_test_rados_api_aio_pp and ./bin/ceph_test_rados_api_aio concurrently

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --crimson --redirect-output --debug -n --no-restart --crimson-smp 3
./bin/ceph osd pool create rbd 32 32 replicated replicated_rule 2 2 2 # likely irrelevant, but part of my script
while ( true ); do CRIMSON_COMPAT=1 ./bin/ceph_test_rados_api_aio_pp ; done &
while ( true ); do CRIMSON_COMPAT=1 ./bin/ceph_test_rados_api_aio ; done &

Eventually, both will hang (not at a consistent point).

Testing done on a testing branch with a bunch of extra debugging and introspection: https://github.com/athanatos/ceph/tree/sjust/testing/2022-09-30-170258

osd.0.ops.out (28 Bytes) Samuel Just, 10/01/2022 12:04 AM

osd.2.ops.out (172 KB) Samuel Just, 10/01/2022 12:04 AM

osd.1.ops.out (936 KB) Samuel Just, 10/01/2022 12:04 AM


Related issues

Related to crimson - Bug #57547: Hang with seastore at wait_for_active stage New

History

#1 Updated by Samuel Just over 1 year ago

osd.2 has a bunch of ops blocked on:

                            "blocker": {
                                "op_type": "ConnectionPipeline::get_pg",
                                "detail": {
                                    "waiting": 50,
                                    "held_by_operation_id": 833804
                                }
                            }

op id 833804 is

            "type": "peering_event",
            "id": 833804,
            "detail": {
                "PeeringEvent": {
                    "from": "?",
                    "pgid": "258.1b",
                    "sent": 554,
                    "requested": 554,
                    "evt": "epoch_sent: 554 epoch_requested: 554 NullEvt +create_info",
                    "events": [
                        {
                            "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::StartEvent",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::AwaitActive",
                            "initiated_at": "2022-09-30T23:31:29.335391+0000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::AwaitMap",
                            "initiated_at": "2022-09-30T23:31:29.335563+0000" 
                        },
                        {
                            "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)0>::OSDMapBlocker",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::GetPG",
                            "initiated_at": "2022-09-30T23:31:31.405028+0000" 
                        },
                        {
                            "name": "crimson::osd::PGMap::PGCreationState",
                            "initiated_at": "2022-09-30T23:31:31.737318+0000",
                            "blocker": {
                                "op_type": "PGCreation",
                                "detail": {
                                    "pgid": "258.1b",
                                    "creating": true
                                }
                            }
                        },
                        {
                            "name": "crimson::osd::PGPeeringPipeline::AwaitMap",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)1>::OSDMapBlocker",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::PGPeeringPipeline::Process",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::BackfillRecovery::BackfillRecoveryPipeline::Process",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::RemotePeeringEvent::OSDPipeline::AwaitActive",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::CompletionEvent",
                            "initiated_at": "0.000000" 
                        }
                    ]
                }
            }
        },

So osd.2 at least is blocked by

                            "blocker": {
                                "op_type": "PGCreation",
                                "detail": {
                                    "pgid": "258.1b",
                                    "creating": true
                                }
                            }

osd.1 has ops that are blocked like:

                            "blocker": {
                                "op_type": "ConnectionPipeline::get_pg",
                                "detail": {
                                    "waiting": 288,
                                    "held_by_operation_id": 816509
                                }
                            }

op 816509 is

            "id": 816509,
            "detail": {
                "PeeringEvent": {
                    "from": "?",
                    "pgid": "258.c",
                    "sent": 554,
                    "requested": 554,
                    "evt": "epoch_sent: 554 epoch_requested: 554 NullEvt +create_info",
                    "events": [
                        {
                            "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::StartEvent",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::AwaitActive",
                            "initiated_at": "2022-09-30T23:31:29.340153+0000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::AwaitMap",
                            "initiated_at": "2022-09-30T23:31:29.341310+0000" 
                        },
                        {
                            "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)0>::OSDMapBlocker",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::ConnectionPipeline::GetPG",
                            "initiated_at": "2022-09-30T23:31:30.783338+0000" 
                        },
                        {
                            "name": "crimson::osd::PGMap::PGCreationState",
                            "initiated_at": "2022-09-30T23:31:32.641184+0000",
                            "blocker": {
                                "op_type": "PGCreation",
                                "detail": {
                                    "pgid": "258.c",
                                    "creating": true
                                }
                            }
                        },
                        {
                            "name": "crimson::osd::PGPeeringPipeline::AwaitMap",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::OSDMapGate<(crimson::osd::OSDMapGateType)1>::OSDMapBlocker",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::PGPeeringPipeline::Process",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::BackfillRecovery::BackfillRecoveryPipeline::Process",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::RemotePeeringEvent::OSDPipeline::AwaitActive",
                            "initiated_at": "0.000000" 
                        },
                        {
                            "name": "crimson::osd::TrackableOperationT<crimson::osd::RemotePeeringEvent>::CompletionEvent",
                            "initiated_at": "0.000000" 
                        }
                    ]
                }
            }
        }

so osd.1 is basically stuck on

                            "blocker": {
                                "op_type": "PGCreation",
                                "detail": {
                                    "pgid": "258.c",
                                    "creating": true
                                }
                            }

#2 Updated by Samuel Just over 1 year ago

  • Related to Bug #57547: Hang with seastore at wait_for_active stage added

#3 Updated by Samuel Just over 1 year ago

osd.1:

DEBUG 2022-09-30 23:31:32,641 [shard 2] osd - Creating 258.c
...
DEBUG 2022-09-30 23:31:32,641 [shard 2] osd - operator() ignoring pgid 258.c, pool dne

#4 Updated by Samuel Just over 1 year ago

Bug is that we aren't erroring out the blocking future in get_or_create_pg if the pool does not exist.

#5 Updated by Samuel Just over 1 year ago

  • Status changed from New to Resolved

Also available in: Atom PDF