Bug #47299
Assertion in pg_missing_set: p->second.need <= v || p->second.is_delete()
0%
97fd8dff92c029abb7aa00e77a6af85ca1b3d963876c082a4f5a579418498427
ea5d5202cbfd479dad466304faaf11b74609c65a810668ae789b64ec19b8be0d
f582692869a94580abf07e6695f97d0a75b4174a2d1727abee7acfb06a234e32
ffcf6fed9c89d0cc0b574d879c8d8d999500d074a062d6e1de03b68fd9eedfbe
Description
Some of our OSDs will sometimes crash with the following message:
{ "crash_id": "2020-09-04T04:16:06.718363Z_78670637-4fad-494f-abea-2afd6e64a970", "timestamp": "2020-09-04T04:16:06.718363Z", "process_name": "ceph-osd", "entity_name": "osd.166", "ceph_version": "15.2.4", "utsname_hostname": "prod-nvme1-c-rma1", "utsname_sysname": "Linux", "utsname_release": "4.15.0-72-generic", "utsname_version": "#81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019", "utsname_machine": "x86_64", "os_name": "Ubuntu", "os_id": "ubuntu", "os_version_id": "18.04", "os_version": "18.04.3 LTS (Bionic Beaver)", "assert_condition": "p->second.need <= v || p->second.is_delete()", "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]", "assert_file": "/build/ceph-15.2.4/src/osd/osd_types.h", "assert_line": 4774, "assert_thread_name": "tp_osd_tp", "assert_msg": "/build/ceph-15.2.4/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fd8cfcd8700 time 2020-09-04T06:16:06.699081+0200\n/build/ceph-15.2.4/src/osd/osd_types.h: 4774: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n", "backtrace": [ "(()+0x12890) [0x7fd8f7255890]", "(gsignal()+0xc7) [0x7fd8f5f07e97]", "(abort()+0x141) [0x7fd8f5f09801]", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x560dc2d9a7b5]", "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560dc2d9a93f]", "(()+0xc3d9c8) [0x560dc307b9c8]", "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x591) [0x560dc3159f31]", "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xfa) [0x560dc315a67a]", "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x177) [0x560dc315c737]", "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dc2fe2867]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
ThreadPool::TPHandle&)+0x6fd) [0x560dc2f853fd]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x560dc2e09aab]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x560dc3066d07]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x560dc2e2751c]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560dc34713ec]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dc3474640]", "(()+0x76db) [0x7fd8f724a6db]", "(clone()+0x3f) [0x7fd8f5fea88f]" ] }
Before and after the crash, we see a number of errors like this spammed into the logs:
1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd8d6ce6700' had timed out after 15
At this point we recreate the OSD, which solves the problem.
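For context, the failing assert sits in `pg_missing_set<TrackChanges>::got()` in `src/osd/osd_types.h`: a missing-set entry may only be dropped once recovery has reached at least the needed version, unless the object was deleted. The following is an illustrative sketch of that invariant with simplified stand-in types, not the actual Ceph code:

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-ins for Ceph's eversion_t and pg_missing_item; the real
// definitions live in src/osd/osd_types.h.
struct eversion_t {
    unsigned epoch = 0;
    unsigned version = 0;
};

bool operator<=(const eversion_t &a, const eversion_t &b) {
    return a.epoch < b.epoch || (a.epoch == b.epoch && a.version <= b.version);
}

struct missing_item {
    eversion_t need;        // version the object must be recovered to
    bool deleted = false;
    bool is_delete() const { return deleted; }
};

// Sketch of pg_missing_set::got(): when a push for `oid` completes at
// version `v`, the missing entry may only be erased if recovery reached at
// least the needed version (or the object was deleted). The crash in this
// ticket means a push reply arrived for an older version than `need`.
struct pg_missing_set_sketch {
    std::map<std::string, missing_item> missing;

    void got(const std::string &oid, eversion_t v) {
        auto p = missing.find(oid);
        assert(p != missing.end());
        // The failing assert from osd_types.h:
        assert(p->second.need <= v || p->second.is_delete());
        missing.erase(p);
    }
};
```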
Files
Updated by Neha Ojha over 3 years ago
- Status changed from New to Need More Info
Is it possible for you to capture osd logs with debug_osd=30? We'll also try to reproduce this at our end.
Updated by Denis Krienbühl over 3 years ago
Neha Ojha wrote:
Is it possible for you to capture osd logs with debug_osd=30? We'll also try to reproduce this at our end.
Unfortunately we already reset the OSD, as we've been suffering from a number of problems with our cluster and didn't have a chance to keep this one in a broken state for investigation. We expect that we may run into further issues like this though, and if time permits we'll happily provide more output.
Updated by Tobias Urdin almost 3 years ago
- File crash2.txt crash2.txt added
I had multiple OSDs die during an upgrade from Nautilus to Octopus with this trace. See the attached crash2.txt
Updated by Tobias Urdin almost 3 years ago
Neha, can you draw any conclusions from the above debug_osd=30 log with this issue?
Updated by Neha Ojha over 2 years ago
- Has duplicate Bug #52180: crash: void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]: assert(p->second.need <= v || p->second.is_delete()) added
Updated by Tobias Urdin about 2 years ago
Just got this again: during recovery after maintenance on another node, this OSD crashed.
1> 2022-03-06T21:08:01.635+0100 7fc59a875700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fc59a875700 time 2022-03-06T21:08:01.523875+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/osd_types.h: 4774: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x558cce356419]
2: (()+0x4df5e1) [0x558cce3565e1]
3: (PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1a1) [0x558cce68a261]
4: (ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x56d) [0x558cce754acd]
5: (ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xf2) [0x558cce759142]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x227) [0x558cce7594d7]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x558cce5f532a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x558cce59b47b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x558cce43ab69]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x558cce676609]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x558cce45644a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558ccea42206]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558ccea44d50]
14: (()+0x7ea5) [0x7fc5bda74ea5]
15: (clone()+0x6d) [0x7fc5bc9379fd]
0> 2022-03-06T21:08:01.673+0100 7fc59a875700 -1 *** Caught signal (Aborted) **
in thread 7fc59a875700 thread_name:tp_osd_tp
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (()+0xf630) [0x7fc5bda7c630]
2: (gsignal()+0x37) [0x7fc5bc86f3d7]
3: (abort()+0x148) [0x7fc5bc870ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x558cce356468]
5: (()+0x4df5e1) [0x558cce3565e1]
6: (PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1a1) [0x558cce68a261]
7: (ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x56d) [0x558cce754acd]
8: (ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xf2) [0x558cce759142]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x227) [0x558cce7594d7]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x558cce5f532a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x558cce59b47b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x558cce43ab69]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x558cce676609]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x558cce45644a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558ccea42206]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558ccea44d50]
17: (()+0x7ea5) [0x7fc5bda74ea5]
18: (clone()+0x6d) [0x7fc5bc9379fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 0 journaler
0/ 0 objectcacher
0/ 5 immutable_obj_cache
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 compressor
0/ 0 bluestore
0/ 0 bluefs
0/ 0 bdev
0/ 0 kstore
0/ 0 rocksdb
0/ 0 leveldb
0/ 0 memdb
0/ 0 fuse
0/ 0 mgr
0/ 0 mgrc
0/ 0 dpdk
0/ 0 eventtrace
1/ 5 prioritycache
0/ 5 test
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7fc59a875700 / tp_osd_tp
7fc5ad09a700 / bstore_mempool
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.67.log
--- end dump of recent events ---
Updated by Radoslaw Zarzynski about 2 years ago
If this is easily reproducible, could you please provide us with logs of the replicas for the failing PG? They can be figured out from `ceph osd pg dump`.
Alternatively, a log from the primary with increased debug levels (`debug_ms = 1`, `debug_osd = 30`) would also be helpful.
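Assuming a reasonably recent Ceph CLI, the requested information could be gathered along these lines (`<pgid>` and `<id>` are placeholders for the affected PG and its primary OSD; these are live-cluster commands, shown as a sketch):

```shell
# Find the acting set (primary and replicas) of the affected PG:
ceph pg map <pgid>

# Raise debug levels on the primary OSD before reproducing
# (debug_osd=30 is very verbose; revert it afterwards):
ceph tell osd.<id> config set debug_osd 30
ceph tell osd.<id> config set debug_ms 1
```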
Updated by Telemetry Bot about 2 years ago
- Crash signature (v1) updated (diff)
- Crash signature (v2) updated (diff)
- Affected Versions v15.2.13, v15.2.6, v15.2.8 added
Assert condition: p->second.need <= v || p->second.is_delete()
Assert function: void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]
Sanitized backtrace:
ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*) ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>) ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ShardedThreadPool::shardedthreadpool_worker(unsigned int) ShardedThreadPool::WorkThreadSharded::entry() clone()
Crash dump sample:
{ "archived": "2021-08-23 17:18:32.303571", "assert_condition": "p->second.need <= v || p->second.is_delete()", "assert_file": "osd/osd_types.h", "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]", "assert_line": 4774, "assert_msg": "osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7f72e348b700 time 2021-08-23T09:50:07.539096-0700\nosd/osd_types.h: 4774: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())", "assert_thread_name": "tp_osd_tp", "backtrace": [ "(()+0x12730) [0x7f72ff25a730]", "(gsignal()+0x10b) [0x7f72fed1f7bb]", "(abort()+0x121) [0x7f72fed0a535]", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559eb6f49b09]", "(()+0x9c5c90) [0x559eb6f49c90]", "(()+0xcc0c95) [0x559eb7244c95]", "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x575) [0x559eb7319a95]", "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xfa) [0x559eb731dc9a]", "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x559eb731e037]", "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x57) [0x559eb71b0e97]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x62f) [0x559eb7154a5f]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x325) [0x559eb6fed5c5]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x64) [0x559eb7231414]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12fa) [0x559eb700a00a]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x559eb76108b4]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559eb7613330]", "(()+0x7fa3) [0x7f72ff24ffa3]", "(clone()+0x3f) [0x7f72fede14cf]" ], "ceph_version": 
"15.2.13", "crash_id": "2021-08-23T16:50:07.609190Z_40d40626-04c7-4f75-b10d-404b107288e8", "entity_name": "osd.f44e26114dd19dd3c396a7933df86e932f20de67", "os_id": "10", "os_name": "Debian GNU/Linux 10 (buster)", "os_version": "10 (buster)", "os_version_id": "10", "process_name": "ceph-osd", "stack_sig": "ea5d5202cbfd479dad466304faaf11b74609c65a810668ae789b64ec19b8be0d", "timestamp": "2021-08-23T16:50:07.609190Z", "utsname_machine": "x86_64", "utsname_release": "5.11.21-1-pve", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PVE 5.11.21-1~bpo10 (Wed, 02 Jun 2021 11:34:45 +0200)" }
Updated by Telemetry Bot about 2 years ago
Updated by Radoslaw Zarzynski about 2 years ago
- Status changed from New to Need More Info
- Crash signature (v1) updated (diff)
Still need more info. See comment #8.
Updated by Tobias Urdin almost 2 years ago
"assert_condition": "p->second.need <= v || p->second.is_delete()", "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/osd_types.h", "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]", "assert_line": 4910, "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7f41e33ba700 time 2022-05-12T11:21:23.175952+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/osd_types.h: 4910: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n", "assert_thread_name": "tp_osd_tp", "backtrace": [ "/lib64/libpthread.so.0(+0x12ce0) [0x7f4207b5dce0]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5637d1a4dbcf]", "/usr/bin/ceph-osd(+0x56ad98) [0x5637d1a4dd98]", "(PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1b4) [0x5637d1dcc164]", "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x585) [0x5637d1eef615]", "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0x101) [0x5637d1ef22a1]", "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x5637d1ef6e17]", "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5637d1d297d2]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5637d1ccc95e]", 
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5637d1b558e9]", "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5637d1db3248]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5637d1b75d48]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5637d21e75b4]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5637d21ea254]", "/lib64/libpthread.so.0(+0x81cf) [0x7f4207b531cf]", "clone()" ],
One OSD crashed with this trace while upgrading from Octopus to Pacific.
Updated by Radoslaw Zarzynski almost 2 years ago
Hello! A note from a bug scrub:
This issue looks like it is caused by particular data stored in the OSD, which makes it very hard to replicate locally, so logs (or a coredump) would be really helpful.
To improve debugging in the future we could also split the assert's condition into dedicated asserts. However, this won't help with this particular case.
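The suggested split of the combined condition could look like this (a hypothetical illustration with simplified stand-in types, not an actual Ceph patch):

```cpp
#include <cassert>

// Simplified stand-ins; the real types are in src/osd/osd_types.h.
struct eversion_t {
    unsigned epoch = 0;
    unsigned version = 0;
};

bool operator<=(const eversion_t &a, const eversion_t &b) {
    return a.epoch < b.epoch || (a.epoch == b.epoch && a.version <= b.version);
}

struct missing_item {
    eversion_t need;
    bool deleted = false;
    bool is_delete() const { return deleted; }
};

// Original combined check:
//   ceph_assert(p->second.need <= v || p->second.is_delete());
// Split form: the delete case returns early, so a failure of the remaining
// assert unambiguously means "push reply carried a version older than need".
void check_got(const missing_item &item, eversion_t v) {
    if (item.is_delete())
        return;                 // deleted objects may be dropped at any version
    assert(item.need <= v);     // dedicated assert for the recovery-version case
}
```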