Project

General

Profile

Actions

Bug #57494

closed

crimson: IO hang with vstart immediately after cluster start

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Reproduces about 1/2 of the time with (edit, more like 1/8 times now)

MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --crimson --redirect-output --debug -n --no-restart
./bin/ceph osd pool create rbd 32 32 replicated replicated_rule 2 2 2
./bin/rados bench 1000 write --admin-socket asok/bench.asok -p rbd -b 4096 --debug-ms=1 2>out/bench.stderr

observed on main branch commit c49b81c7d619cea23e9707d1f5bcc7de3049c4fd

Actions #1

Updated by Samuel Just over 1 year ago

./bin/ceph --admin-daemon asok/bench.asok objecter_requests
... {
"ops": [ {
"tid": 6,
"pg": "1.82f7afa3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object5",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object5",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55112.228165s",
"age": 205.94835578999999,
"attempts": 3,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:15:53.153335+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 10,
"pg": "1.cc75a623",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object9",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object9",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55112.228165s",
"age": 205.94835578999999,
"attempts": 3,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:15:53.153374+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 12,
"pg": "1.d4ac5e63",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object11",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object11",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55112.228165s",
"age": 205.94835578999999,
"attempts": 3,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:15:53.153391+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 70,
"pg": "1.efa21e83",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object69",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object69",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55114.879221s",
"age": 203.29729981400001,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:27.650254+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 131,
"pg": "1.4b315923",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object130",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object130",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55115.943244s",
"age": 202.233277348,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:28.715161+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 137,
"pg": "1.278a5b63",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object136",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object136",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55116.097247s",
"age": 202.07927409600001,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:28.868362+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 245,
"pg": "1.481d2583",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object244",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object244",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55117.866284s",
"age": 200.310236745,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:30.638030+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 251,
"pg": "1.68a2f9a3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object250",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object250",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55117.979287s",
"age": 200.19723435899999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:30.750582+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 261,
"pg": "1.8ea6d663",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object260",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object260",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55118.143290s",
"age": 200.03323089599999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:30.915145+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 280,
"pg": "1.1d6f3883",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object279",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object279",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55118.522298s",
"age": 199.654222893,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:31.293505+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 296,
"pg": "1.b1c9a8e3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object295",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object295",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55118.744303s",
"age": 199.43221820599999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:31.515991+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 360,
"pg": "1.e5a043e3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object359",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object359",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55120.064331s",
"age": 198.11219033399999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:32.836059+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 367,
"pg": "1.ef960683",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object366",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object366",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55120.259335s",
"age": 197.91718621699999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:33.030340+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 433,
"pg": "1.df0136e3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object432",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object432",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55121.555362s",
"age": 196.62115885099999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:34.326628+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 444,
"pg": "1.878a7bc3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object443",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object443",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55121.758367s",
"age": 196.41815456500001,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:34.530213+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}, {
"tid": 566,
"pg": "1.a78f78a3",
"osd": 2,
"object_id": "benchmark_data_1b46f464e111_390547_object565",
"object_locator": "@1",
"target_object_id": "benchmark_data_1b46f464e111_390547_object565",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "55125.396443s",
"age": 192.78007774899999,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2022-09-10T21:16:38.168052+0000",
"osd_ops": [
"set-alloc-hint object_size 4096 write_size 4096",
"write 0~4096 in=4096b"
]
}
],
...

Hung on apparently the same pg?

Actions #2

Updated by Samuel Just over 1 year ago

Most recent stuck on await_map:

DEBUG 2022-09-10 21:16:38,168 [shard 0] osd - client_request(id=350, detail=m=[osd_op(client.4133.0:566 1.3 1:c51ef1e5:::benchmark_data_1b46f464e111_390547_object565:head {set-alloc-hint object_size 4096 write_size 4096, write 0~4096 in=4096b} snapc 0={} ondisk+write+known_if_redirected+supports_pool_eio e17) v8]).0: after wait_for_map

Actions #3

Updated by Samuel Just over 1 year ago

Least recent went through an actingset change

DEBUG 2022-09-10 21:16:26,115 [shard 0] osd - client_request(id=10, detail=m=[osd_op(client.4133.0:6 1.3 1:c5f5ef41:::benchmark_data_1b46f464e111_390547_object5:head {set-alloc-hint object_size 4096 write_size 4096, write 0~4096 in=4096b}
snapc 0={} RETRY=2 ondisk+retry+write+known_if_redirected+supports_pool_eio e14) v8]).0: interrupted crimson::common::actingset_changed (acting set changed)

Least recent last message:

DEBUG 2022-09-10 21:16:26,120 [shard 0] osd - client_request(id=10, detail=m=[osd_op(client.4133.0:6 1.3 1:c5f5ef41:::benchmark_data_1b46f464e111_390547_object5:head {set-alloc-hint object_size 4096 write_size 4096, write 0~4096 in=4096b} snapc 0={} RETRY=2 ondisk+retry+write+known_if_redirected+supports_pool_eio e14) v8]).1: after wait_for_map

Actions #4

Updated by Samuel Just over 1 year ago

We don't actually release the pipeline handle when we are interrupted. This can result in a circular dependency if prior operations in the pipeline are from the more recent interval?

Actions #5

Updated by Samuel Just over 1 year ago

Hmm, in fact ClientRequest::Orderer::requeue does cancel the outstanding handles.

Actions #6

Updated by Samuel Just over 1 year ago

  • Description updated (diff)
Actions #7

Updated by Samuel Just over 1 year ago

Ok, the problem is reusing the same PipeHandle on requeue. Patch that fixes that seems to have solved the problem, will post PR today or tomorrow.

Actions #9

Updated by Samuel Just over 1 year ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF