Bug #57547

closed

Hang with seastore at wait_for_active stage

Added by chunmei liu over 1 year ago. Updated 11 days ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running the ceph_test_rados_api_aio_pp test, if a PG is not yet active, the client request is put into PGActivationBlocker::wait. However, PGActivationBlocker::unblock never happens, so the client_request is stuck there waiting to be unblocked.
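
To make the failure mode concrete, below is a minimal sketch of the wait/unblock pattern described above, written with a plain std::condition_variable rather than crimson's actual seastar-based blocker; the class and function names are hypothetical. The hang corresponds to unblock() never being called, so the waiter in wait_for_active() never resumes.

// Hypothetical names -- this is not crimson's PGActivationBlocker, only an
// illustration of the same wait/unblock pattern using a condition variable.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

class ActivationBlockerSketch {
  std::mutex m;
  std::condition_variable cv;
  bool pg_active = false;

public:
  // The client_request path parks here while the PG is still activating.
  void wait_for_active() {
    std::unique_lock lk{m};
    cv.wait(lk, [this] { return pg_active; });  // never returns if unblock() is never called
  }

  // Meant to be called once the PG reaches the active state.
  void unblock() {
    {
      std::lock_guard lk{m};
      pg_active = true;
    }
    cv.notify_all();
  }
};

int main() {
  ActivationBlockerSketch blocker;
  std::thread client([&] {
    std::cout << "client_request: waiting at wait_for_active\n";
    blocker.wait_for_active();
    std::cout << "client_request: unblocked, continuing\n";
  });

  // The bug described above corresponds to this call never happening,
  // which would leave the client thread stuck forever.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  blocker.unblock();
  client.join();
  return 0;
}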


Related issues 1 (0 open, 1 closed)

Related to crimson - Bug #57740: crimson: op hang while running ./bin/ceph_test_rados_api_aio_pp and ./bin/ceph_test_rados_api_aio concurrently (Resolved)

Actions #2

Updated by chunmei liu over 1 year ago

  • Description updated (diff)
Actions #3

Updated by chunmei liu over 1 year ago

The hung op is "append"; see the output of (./bin/ceph --admin-daemon asok/client.admin.<pid>.asok objecter_requests | less) as follows:

{
    "ops": [
        {
            "tid": 2,
            "pg": "37.7fc1f406",
            "osd": 2,
            "object_id": "foo",
            "object_locator": "@37",
            "target_object_id": "foo",
            "target_object_locator": "@37",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "14002408.872428s",
            "age": 180.60399249700001,
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2022-09-14T15:41:44.024560-0700",
            "osd_ops": [
                "append 0~128 in=128b"
            ]
        }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
Actions #4

Updated by Samuel Just over 1 year ago

Can you add to the description:
- exact sha1 tested (with github link if a branch not yet merged)
- vstart.sh command used to start the cluster
- command used to start the client (ceph_test_rados_api_aio_pp in this case, along with any env variables that might matter)
- observations about whether any osds crashed

If you can, please also add output from:

ceph -s
ceph pg dump

And if possible:
- backtraces from any crashes

The above are sort of a baseline for any real attempt to reproduce or fix a bug; it's best to add them to any bug to start with.

Actions #5

Updated by Samuel Just over 1 year ago

The target raw pg for the op is 37.7fc1f406, but that log snippet does not seem to contain any pgs from that pool (37). What pgs exist for pool 37? What state were they in? Can you filter out log lines for that pg from the log? Can you locate the client_request lines for that request in the rest of the log?

----updated a new log at: https://gist.github.com/liu-chunmei/1370d7925833ab5c9b11217e9d2c9e51
----the PGActivationBlocker is stuck at pg 111.6

----client_request(id=5838) is stuck.

Actions #6

Updated by chunmei liu over 1 year ago

Samuel Just wrote:

Can you add to the description:
- exact sha1 tested (with github link if a branch not yet merged)

---https://github.com/liu-chunmei/ceph/tree/rados-api-test-3
---commit fc15d6ce6fb48e476a4544b0b2666dde2cf000c7

- vstart.sh command used to start the cluster

----MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh -n --without-dashboard --seastore -X --crimson-foreground -d

- command used to start the client (ceph_test_rados_api_aio_pp in this case, along with any env variables that might matter)

---need to set the environment variable CRIMSON_COMPAT: add it in /etc/env or export CRIMSON_COMPAT=on
----run the test: ./bin/ceph_test_rados_api_aio_pp (if the hang does not happen on one run, run it more times)

- observations about whether any osds crashed

----no crash happened.

If you can, please also add output from:

----ceph -s
cluster:
id: 589abb96-6c0b-41b2-a2d7-355ce4a10a4e
health: HEALTH_WARN
Reduced data availability: 5 pgs inactive

services:
mon: 1 daemons, quorum a (age 59m)
mgr: x(active, since 59m)
osd: 3 osds: 3 up (since 59m), 3 in (since 59m)
data:
pools: 2 pools, 33 pgs
objects: 2 objects, 577 KiB
usage: 106 MiB used, 150 GiB / 150 GiB avail
pgs: 15.152% pgs not active
28 active+clean
5 creating+activating

----ceph pg dump

version 2203
stamp 2022-09-15T21:10:35.097864-0700
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
111.1f 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.958693-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1e 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.129485-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1d 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.961669-0700 0'0 367:11 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1c 0 0 0 0 0 0 0 0 0 0 0 creating+activating 2022-09-15T20:24:26.886633-0700 0'0 366:4 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1b 0 0 0 0 0 0 0 0 0 0 0 creating+activating 2022-09-15T20:24:26.886061-0700 0'0 366:4 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1a 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.990147-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.19 0 0 0 0 0 0 0 0 0 0 0 creating+activating 2022-09-15T20:24:26.885477-0700 0'0 366:4 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.18 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.122303-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.17 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.964716-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.16 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.966761-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.15 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.970677-0700 0'0 367:11 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.14 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.974625-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.13 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.023223-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.12 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.024823-0700 0'0 367:11 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.11 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.025977-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.10 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.026739-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.f 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.113367-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.e 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.027520-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.d 0 0 0 0 0 0 0 0 0 0 0 creating+activating 2022-09-15T20:24:26.882823-0700 0'0 366:4 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.c 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.105160-0700 0'0 367:11 [1,2,0] 1 [1,2,0] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.b 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.097632-0700 0'0 367:11 [1,2,0] 1 [1,2,0] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.a 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.090081-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.9 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.027697-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.8 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.028444-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.7 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.081418-0700 0'0 367:11 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.6 0 0 0 0 0 0 0 0 0 0 0 creating+activating 2022-09-15T20:24:26.882260-0700 0'0 366:4 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.5 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.029356-0700 0'0 367:11 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.4 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.030744-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.3 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:26.998421-0700 0'0 367:11 [1,2,0] 1 [1,2,0] 1 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.2 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.083131-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.1 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.082462-0700 0'0 367:11 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
111.0 0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-09-15T20:24:27.031590-0700 0'0 367:11 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-15T20:24:14.398444-0700 0'0 2022-09-15T20:24:14.398444-0700 0 0 -- 0 0
1.0 2 0 0 0 0 590368 0 0 97 0 97 active+clean 2022-09-15T20:09:09.092992-0700 11'97 367:395 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-15T20:09:08.060074-0700 0'0 2022-09-15T20:09:08.060074-0700 0 0 -- 0 0

111 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 590368 0 0 97 97

sum 2 0 0 0 0 590368 0 0 97 97
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 35 MiB 50 GiB 35 MiB 50 GiB [0,1] 33 5
1 35 MiB 50 GiB 35 MiB 50 GiB [0,2] 33 10
0 35 MiB 50 GiB 35 MiB 50 GiB [1,2] 33 18
sum 106 MiB 150 GiB 106 MiB 150 GiB

[...]

And if possible:
- backtraces from any crashes

----no crash backtrace.

The above are sort of a baseline for any real attempt to reproduce or fix a bug; it's best to add them to any bug to start with.

Actions #7

Updated by Samuel Just over 1 year ago

From your ceph pg dump, there appear to be no pgs in pool 37. Can you find in the osd logs any evidence of pgs from that pool? Can you find the client_request lines from where that hung op was received/handled?

Alternately, can you find a way to upload the full logs from all three osds?

-----it is a different test; the old log was overwritten by my new test, so in this test the pool is not 37, I think it is 111. You can check the updated new log on gist: https://gist.github.com/liu-chunmei/1370d7925833ab5c9b11217e9d2c9e51 which matches that pg dump.

-----you can search for the last "after wait_for_active stage"; that shows where the last client_request was handled. It is on line 127 of the gist log.

----I will find a way to put the large osd.*.log files somewhere.

Actions #8

Updated by chunmei liu over 1 year ago

I used git lfs to put the big log files on GitHub.

https://github.com/liu-chunmei/ceph/tree/rados-api-test-3, check out the branch to see the logs.

there are logs from two test runs under /ceph: the old one is osd.*.log, the new one is 1-osd.*.log; the following dump matches the 1-osd.*.log.

the pg stuck is 34.6; the last client_request is at line 2141344 in 1-osd.2.log.

./bin/ceph --admin-daemon asok/client.admin.925240.asok objecter_requests | less

{
    "ops": [
        {
            "tid": 2,
            "pg": "34.7fc1f406",
            "osd": 2,
            "object_id": "foo",
            "object_locator": "@34",
            "target_object_id": "foo",
            "target_object_locator": "@34",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "14180996.813010s",
            "age": 323.39598656499999,
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2022-09-16T17:18:11.970514-0700",
            "osd_ops": [
                "append 0~128 in=128b"
            ]
        }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}

ceph -s:

2022-09-16T17:24:28.848-0700 7ffb50e02700 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-16T17:24:28.848-0700 7ffb4b59e700 -1 WARNING: all dangerous and experimental features are enabled.
cluster:
id: 8807a344-24f2-4e9a-b8eb-e8306aa5be36
health: HEALTH_WARN
Reduced data availability: 32 pgs inactive, 25 pgs peering

services:
mon: 1 daemons, quorum a (age 9m)
mgr: x(active, since 9m)
osd: 3 osds: 3 up (since 9m), 3 in (since 9m)
data:
pools: 2 pools, 33 pgs
objects: 2 objects, 577 KiB
usage: 34 MiB used, 150 GiB / 150 GiB avail
pgs: 21.212% pgs unknown
75.758% pgs not active
25 creating+peering
7 unknown
1 active+clean

ceph pg dump:

2022-09-16T17:25:03.272-0700 7f3b419c4700 1 WARNING: all dangerous and experimental features are enabled.
2022-09-16T17:25:03.276-0700 7f3b419c4700 -1 WARNING: all dangerous and experimental features are enabled.
version 397
stamp 2022-09-16T17:25:01.945332-0700
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
34.1f 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.082627-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.1e 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.660993-0700 0'0 113:2 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.1d 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.660629-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.1c 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.660252-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.1b 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.659904-0700 0'0 113:2 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.1a 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.082302-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.19 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.081973-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.18 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.081636-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.17 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.081269-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.16 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.15 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.659540-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.14 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.13 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.12 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.11 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.10 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.f 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.659154-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.e 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.658787-0700 0'0 113:2 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.d 0 0 0 0 0 0 0 0 0 0 0 unknown 2022-09-16T17:18:10.531019-0700 0'0 0:0 [] 1 [] -1 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 - 0 0
34.c 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.658415-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.b 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.080919-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.a 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.080534-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.9 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.080170-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.8 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.079706-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.7 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.079253-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.6 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.658015-0700 0'0 113:2 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.5 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.078760-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.4 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.657564-0700 0'0 113:2 [2,0,1] 2 [2,0,1] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.3 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:10.656393-0700 0'0 113:2 [2,1,0] 2 [2,1,0] 2 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.2 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.078282-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.1 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.077772-0700 0'0 113:2 [0,2,1] 0 [0,2,1] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
34.0 0 0 0 0 0 0 0 0 0 0 0 creating+peering 2022-09-16T17:18:11.076639-0700 0'0 113:2 [0,1,2] 0 [0,1,2] 0 0'0 2022-09-16T17:18:10.531019-0700 0'0 2022-09-16T17:18:10.531019-0700 0 0 -- 0 0
1.0 2 0 0 0 0 590368 0 0 97 0 97 active+clean 2022-09-16T17:15:27.272920-0700 11'97 110:186 [1,0,2] 1 [1,0,2] 1 0'0 2022-09-16T17:15:26.239575-0700 0'0 2022-09-16T17:15:26.239575-0700 0 0 -- 0 0

34 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 590368 0 0 97 97

sum 2 0 0 0 0 590368 0 0 97 97
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 11 MiB 50 GiB 11 MiB 50 GiB [0,1] 26 11
1 11 MiB 50 GiB 11 MiB 50 GiB [0,2] 26 1
0 11 MiB 50 GiB 11 MiB 50 GiB [1,2] 26 14
sum 34 MiB 150 GiB 34 MiB 150 GiB

Actions #9

Updated by Samuel Just over 1 year ago

Unfortunately, git-lfs doesn't seem to work:

fetch: Fetching reference refs/heads/rados-api-test-3
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/athanatos/ceph.git/info/lfs'

If the pgs in question are in creating+peering, that's the only thing you need to worry about. The op status is irrelevant. If pg 34.6 is stuck creating, can you do
grep '34\.6' <log file for primary> | gzip > pg34.6.log.gz and post that in the bug (should be small enough to be added directly)?

Actions #10

Updated by Samuel Just over 1 year ago

The first log from this bug just doesn't seem to be long enough; it only extends for about 0.5s after 111.6 is created, so it's not really long enough to see query/notify message events.

Actions #11

Updated by Samuel Just over 1 year ago

DEBUG 2022-09-16 21:11:31,921 [shard 0] osd - do_peering_event
ignoring epoch_sent: 110 epoch_requested: 110 MLease epoch 110 from
osd.1 pg_lease(ru 924.194763184s ub 996.334594727s int 16.000000000s)
-- pg has reset

What epoch is pg 33.1b on when it processes that message?

Actions #12

Updated by chunmei liu over 1 year ago

uploaded the output of grep '34\.6' <log file for primary> | gzip > pg34.6.log.gz to the rados-api-test-3 branch, see the last commit.

Actions #13

Updated by chunmei liu over 1 year ago

Sam, you can just reproduce it with main + PR 48057 + PR 48059 (I think main alone should be OK since the PRs don't fix it):
1. set the environment variable
2. MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh -n --without-dashboard --seastore -X --crimson-foreground -d
3. run ./bin/ceph_test_rados_api_aio_pp 1 to 3 times, then you should hit the bug and can get all osd.*.log files.

thanks!

Actions #14

Updated by Samuel Just over 1 year ago

Actions #15

Updated by Samuel Just over 1 year ago

  • Related to Bug #57740: crimson: op hang while running ./bin/ceph_test_rados_api_aio_pp and ./bin/ceph_test_rados_api_aio concurrently added
Actions #16

Updated by Matan Breizman 11 days ago

  • Status changed from New to Closed

Please re-open if still relevant.
