Bug #59491


OSD segfault when OSD primary-temp active and the CRUSH map changes

Added by Stefan Kooman about 1 year ago. Updated about 1 month ago.

Status: Fix Under Review
Priority: Normal
Category: OSD
Target version: -
% Done: 0%
Backport: quincy reef squid
Regression: No
Severity: 2 - major

Description

We used the new osdmaptool from the Reef / main branch, with the recently merged "read-balancer" code, to compute an optimal balance of primaries over the OSDs. We dumped an osdmap from a 16.2.9 / 16.2.11 cluster and had the new osdmaptool generate the suggested changes. Since Pacific does not support the "pg-upmap-primary" command, we used the "primary-temp" command to move each PG's primary onto another OSD. This worked as expected. However, whenever the CRUSH map changed (removing an OSD, adding an OSD, moving a host to another bucket), OSDs started crashing with a segmentation fault.
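Roughly, the workflow looked like the following sketch. The pool name, output file, and PG/OSD ids are placeholders, not the actual values we used:

# Dump the current osdmap from the Pacific (16.2.x) cluster:
ceph osd getmap -o ./om

# Run the Reef / main osdmaptool with the read-balancer code; it writes
# the suggested "pg-upmap-primary" commands for the pool to out.txt:
osdmaptool ./om --read out.txt --read-pool <poolname>

# Pacific has no "pg-upmap-primary", so instead of running the generated
# script we applied each suggested mapping with primary-temp:
ceph osd primary-temp <pgid> <osd-id>

# A later CRUSH map change, e.g. removing an OSD, then triggers the crash:
ceph osd crush remove osd.<id>

After one of these CRUSH changes, the OSDs crashed like this: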

-4> 2023-04-19T12:48:06.468+0200 7f84032df700 20 osd.12 41366 advance_pg new pool opts  old pool opts 
-3> 2023-04-19T12:48:06.468+0200 7f84032df700 20 osd.12 41366 get_map 41317 - loading and decoding 0x555f4fa7f400
-2> 2023-04-19T12:48:06.468+0200 7f8410576700 20 osd.12 41366 got_full_map 41370, nothing requested
-1> 2023-04-19T12:48:06.468+0200 7f84032df700 10 osd.12 41366 add_map_bl 41317 18118 bytes
0> 2023-04-19T12:48:06.476+0200 7f84032df700 -1 *** Caught signal (Segmentation fault) **
in thread 7f84032df700 thread_name:tp_osd_tp
ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f841e4f9420]
2: /lib/x86_64-linux-gnu/libc.so.6(+0xbc04e) [0x7f841dff404e]
3: tc_calloc()
4: (CrushWrapper::decode_crush_bucket(crush_bucket**, ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x104) [0x555f41bba6c4]
5: (CrushWrapper::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x2ff) [0x555f41bca5df]
6: (OSDMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x719) [0x555f41b3ef59]
7: (OSDMap::decode(ceph::buffer::v15_2_0::list&)+0x36) [0x555f41b42266]
8: (OSDService::try_get_map(unsigned int)+0x6a4) [0x555f41110214]
9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x1fc) [0x555f4117b5ac]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd3) [0x555f4117d9d3]
11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x555f413eef86]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8aa) [0x555f411684ba]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x555f41898ec3]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x555f4189bf04]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f841e4ed609]
16: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
20/20 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140204923791104 / osd_srv_heartbt
140204932183808 / tp_osd_tp
140204940576512 / tp_osd_tp
140204948969216 / tp_osd_tp
140204957361920 / tp_osd_tp
140204965754624 / tp_osd_tp
140204974147328 / osd_srv_agent
140205186574080 / ms_dispatch
140205337741056 / io_context_pool
140205363058432 / io_context_pool
140205379843840 / msgr-worker-2
140205388236544 / msgr-worker-1
140205396629248 / msgr-worker-0
140205413961856 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.12.log
--- end dump of recent events ---

PGs that had a primary-temp set got into trouble. The OSD seems to (wrongly) preserve the temporary primary, OSD 12, as acting_primary even when it is no longer in the acting set:

-1> 2023-04-19T11:13:55.987+0200 7f49dd7a1700  1 osd.12 pg_epoch: 45415 pg[1.5( v 3745'134093 (3710'131093,3745'134093] local-lis/les=45151/45152 n=81 ec=54/54 lis/c=45151/45151 les/c/f=45152/45152/39588 sis=45415) [4,16,11] r=-1 lpr=45415 pi=[45151,45415)/1 crt=3745'134093 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [10,16,12] -> [4,16,11], acting [10,16,12] -> [4,16,11], acting_primary 12 -> 12, up_primary 10 -> 4, role 2 -> -1, features acting 4540138297136906239 upacting 4540138297136906239

Restarting the OSD does not help; it keeps crashing on this PG (and possibly other PGs on this OSD). Exporting and then removing the problematic PG from the OSD works, after which the OSD starts again. Tests showed that the PG can be successfully imported on another OSD, as in the sketch below.
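For reference, a rough sketch of the export / remove / import workaround with ceph-objectstore-tool; the OSD ids, the PG id 1.5 (taken from the log above), and the paths are examples, not prescriptive:

# Stop the crashing OSD, export the problematic PG, then remove it:
systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 1.5 --op export --file /tmp/pg1.5.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 1.5 --op remove --force
systemctl start ceph-osd@12

# The export can then be imported on another (stopped) OSD:
systemctl stop ceph-osd@16
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
    --op import --file /tmp/pg1.5.export
systemctl start ceph-osd@16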

Due to the nature of these balancer optimizations, primary-temps are set across all failure domains. A large CRUSH map change involving many PGs can therefore take down OSDs in all failure domains at the same time, resulting in lots of inactive / unknown PGs. We experienced this on a production cluster.
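For operators in this situation, a hedged sketch for finding the affected PGs and clearing their primary-temp mappings before any further CRUSH changes. Clearing a mapping by setting it to -1 is our reading of this developers-only command and worth verifying first:

# primary_temp mappings show up in the osdmap dump:
ceph osd dump | grep primary_temp

# Clear a mapping by setting it back to -1:
ceph osd primary-temp <pgid> -1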


Files

osdmap_no_pg_temp (15.7 KB) - Samuel Just, 06/07/2023 08:06 PM
osdmap_after_osd14_remove (16.7 KB) - Samuel Just, 06/07/2023 08:06 PM
osdmap_pgtemp (16.3 KB) - Samuel Just, 06/07/2023 08:06 PM