Bug #59491

OSD segfault when OSD primary-temp active and the CRUSH map changes

Added by Stefan Kooman about 1 year ago. Updated 22 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy reef squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We used the new osdmaptool from Reef / main branch with the recently merged "read-balancer" code to create an optimal balance of primaries over the OSDs. We dumped an osdmap on a 16.2.9 / 16.2.11 cluster and have the new osdmaptool create the suggested changes. Instead of using the "pg-upmap-primary" command, we used the "primary-temp" command to move the primary PG onto another OSD. This worked as expected. However, when the CRUSH map changes (removing an OSD, adding an OSD, moving a host to a bucket) it would result in OSDs crashing (Segmentation Fault) like this:

-4> 2023-04-19T12:48:06.468+0200 7f84032df700 20 osd.12 41366 advance_pg new pool opts  old pool opts 
-3> 2023-04-19T12:48:06.468+0200 7f84032df700 20 osd.12 41366 get_map 41317 - loading and decoding 0x555f4fa7f400
-2> 2023-04-19T12:48:06.468+0200 7f8410576700 20 osd.12 41366 got_full_map 41370, nothing requested
-1> 2023-04-19T12:48:06.468+0200 7f84032df700 10 osd.12 41366 add_map_bl 41317 18118 bytes
0> 2023-04-19T12:48:06.476+0200 7f84032df700 -1 *** Caught signal (Segmentation fault) **
in thread 7f84032df700 thread_name:tp_osd_tp
ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f841e4f9420]
2: /lib/x86_64-linux-gnu/libc.so.6(+0xbc04e) [0x7f841dff404e]
3: tc_calloc()
4: (CrushWrapper::decode_crush_bucket(crush_bucket**, ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x104) [0x555f41bba6c4]
5: (CrushWrapper::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x2ff) [0x555f41bca5df]
6: (OSDMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x719) [0x555f41b3ef59]
7: (OSDMap::decode(ceph::buffer::v15_2_0::list&)+0x36) [0x555f41b42266]
8: (OSDService::try_get_map(unsigned int)+0x6a4) [0x555f41110214]
9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x1fc) [0x555f4117b5ac]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd3) [0x555f4117d9d3]
11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x555f413eef86]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8aa) [0x555f411684ba]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x555f41898ec3]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x555f4189bf04]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f841e4ed609]
16: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
20/20 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140204923791104 / osd_srv_heartbt
140204932183808 / tp_osd_tp
140204940576512 / tp_osd_tp
140204948969216 / tp_osd_tp
140204957361920 / tp_osd_tp
140204965754624 / tp_osd_tp
140204974147328 / osd_srv_agent
140205186574080 / ms_dispatch
140205337741056 / io_context_pool
140205363058432 / io_context_pool
140205379843840 / msgr-worker-2
140205388236544 / msgr-worker-1
140205396629248 / msgr-worker-0
140205413961856 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.12.log
--- end dump of recent events ---

PGs that had a primary-temp set got into trouble. It seems that the OSD (wrongly) tries to preserve the temporary primary, OSD 12, when it is no longer available:

-1> 2023-04-19T11:13:55.987+0200 7f49dd7a1700  1 osd.12 pg_epoch: 45415 pg[1.5( v 3745'134093 (3710'131093,3745'134093] local-lis/les=45151/45152 n=81 ec=54/54 lis/c=45151/45151 les/c/f=45152/45152/39588 sis=45415) [4,16,11] r=-1 lpr=45415 pi=[45151,45415)/1 crt=3745'134093 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [10,16,12] -> [4,16,11], acting [10,16,12] -> [4,16,11], acting_primary 12 -> 12, up_primary 10 -> 4, role 2 -> -1, features acting 4540138297136906239 upacting 4540138297136906239

A restart of the OSD does not help; it will keep crashing on this (and possibly other) PGs on this OSD. Exporting (and removing) the problematic PG from this OSD works, and the OSD can start again. Tests showed that the PG can be successfully imported on another OSD.

Due to the nature of these balancer optimizations, primary-temps are set across all failure domains. A large CRUSH map change involving many PGs can therefore take down OSDs in all failure domains at the same time, resulting in many inactive/unknown PGs. We experienced this on a production cluster.


Files

osdmap_no_pg_temp (15.7 KB) osdmap_no_pg_temp Samuel Just, 06/07/2023 08:06 PM
osdmap_after_osd14_remove (16.7 KB) osdmap_after_osd14_remove Samuel Just, 06/07/2023 08:06 PM
osdmap_pgtemp (16.3 KB) osdmap_pgtemp Samuel Just, 06/07/2023 08:06 PM
Actions #1

Updated by Stefan Kooman about 1 year ago

ceph crash log id: 2023-04-19T10:47:39.263101Z_86a9e956-5e51-4491-8dd8-2f3cf671b18c

Actions #2

Updated by Stefan Kooman about 1 year ago

This bug has been reproduced on Ceph 16.2.9 and 16.2.11, but might be present in older and newer releases.

Actions #3

Updated by Igor Fedotov about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 51160
Actions #4

Updated by Igor Fedotov about 1 year ago

  • Backport set to reef, quincy, pacific
Actions #5

Updated by Samuel Just 11 months ago

From your description, it seems like the steps to reproduce would be something like:

create cluster with osds 0,...,3
set a pg with a replica on 0 to use 0 as the primary via primary-temp
stop osd 0 and let it be marked down

The result would be that the pg above has an acting set not containing 0, but get_primary().osd still evaluates to 0?

That's a larger problem than this segfault -- it means that the primary is no longer a member of the acting set. Can you attach an osdmap with this property to the bug? I'd like to take a look.

Actions #7

Updated by Samuel Just 9 months ago

crimson-scrub/build [sjust/wip-crimson-scrub●] » for i in ~/Downloads/osdmap*; do ./bin/osdmaptool $i --dump | grep epoch; done
./bin/osdmaptool: osdmap file '/home/sam/Downloads/osdmap_after_osd14_remove'
epoch 42385
./bin/osdmaptool: osdmap file '/home/sam/Downloads/osdmap_no_pg_temp'
epoch 42325
./bin/osdmaptool: osdmap file '/home/sam/Downloads/osdmap_pgtemp'
epoch 42357
crimson-scrub/build [sjust/wip-crimson-scrub●] » ./bin/osdmaptool ~/Downloads/osdmap_after_osd14_remove --dump | grep primary_temp
./bin/osdmaptool: osdmap file '/home/sam/Downloads/osdmap_after_osd14_remove'
primary_temp 1.4 1
primary_temp 1.6 11
primary_temp 1.7 5
primary_temp 1.d 6
primary_temp 1.18 3
primary_temp 1.1f 10
primary_temp 1.23 4
primary_temp 1.28 2
primary_temp 1.30 5
primary_temp 3.0 4
primary_temp 3.4 16
primary_temp 3.7 9
primary_temp 3.e 11
primary_temp 3.f 0
primary_temp 3.10 6
primary_temp 3.13 11
primary_temp 3.17 16
primary_temp 3.1c 5
primary_temp 3.1d 6
primary_temp 3.3d 4

I don't see any primary_temp entries in osdmap_after_osd14_remove for a down or missing OSD -- all of the OSDs listed appear to exist and be up. The crash above appears to be from epoch 41317; probably that's the epoch I need?

Actions #8

Updated by Stefan Kooman 9 months ago

It's been a couple of months since I did this test (the system is still in the same state, though). The read-balancer code produced the following mappings (I replaced them with primary_temp):

set 3.b primary_temp mapping to 14
set 1.26 primary_temp mapping to 14

And those are indeed there in the "osdmap_pgtemp" file.

But after the removal, the temporary mappings are gone (the same would happen if you reboot the OSD); they are not persistent the way the new pg_upmap_primaries are. The bug, however, still makes the affected OSDs that were in the up set before the change segfault.

Can you tell me what you are looking for?

Actions #9

Updated by Samuel Just 9 months ago

Ok, I was able to reproduce it. My steps above are slightly wrong -- OSDMap::clean_temps will notice if the primary_temp is marked down. The problem is that it won't notice if the primary_temp is up, but no longer in the acting set. OSDMap::clean_temps needs to be adjusted to check whether the primary_temp is still a member of the acting set.

create cluster with osds 0,...,3
set a pg with a replica on 0 to use 0 as the primary via primary-temp
set osd 0 out
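
The adjustment described above could be sketched roughly like this. This is a standalone model, not Ceph's actual OSDMap code; the container types, function name, and signature are illustrative only. The point it demonstrates is the missing check: a primary_temp entry must be dropped not only when its OSD is down, but also when the OSD is up yet no longer a member of the PG's acting set.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

using pg_id = int;

// Illustrative sketch of the clean_temps check -- not Ceph's real API.
// Removes primary_temp entries whose OSD is down *or* no longer in the
// PG's acting set.
void clean_primary_temps(
    const std::map<pg_id, std::vector<int>>& acting_sets,  // pg -> acting set
    const std::vector<bool>& osd_up,                       // indexed by osd id
    std::map<pg_id, int>& primary_temp)                    // pg -> forced primary
{
  for (auto it = primary_temp.begin(); it != primary_temp.end();) {
    const int osd = it->second;
    const auto acting_it = acting_sets.find(it->first);
    const bool in_acting =
        acting_it != acting_sets.end() &&
        std::find(acting_it->second.begin(), acting_it->second.end(), osd) !=
            acting_it->second.end();
    if (osd >= (int)osd_up.size() || !osd_up[osd] || !in_acting)
      it = primary_temp.erase(it);  // stale mapping: drop it
    else
      ++it;
  }
}
```

With the acting set [4,16,11] from the log above and a primary_temp of 12, the entry would be erased even though osd.12 is up, letting the PG fall back to its calculated primary instead of keeping a stale acting_primary.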
Actions #10

Updated by Samuel Just 7 months ago

  • Assignee set to Laura Flores
Actions #11

Updated by Laura Flores 7 months ago

Hi Stefan,

Thanks for raising this issue. I will take a look into it. In the meantime, I would advise against using any read balancer functionality on clusters earlier than Reef, except in a strictly experimental sense. There was a mailing list thread a while back [1] where people were interested in using primary-temp (in lieu of "pg-upmap-primary", which is only available in Reef) on older clusters, and they experienced negative side effects.

It looks like a fix was raised for primary-temp though, so I will take a look at this and review it.

@Samuel Hassine there were some discussions amongst developers about retiring primary-temp as well, since it is no longer maintained.

1. Mailing list thread: https://www.mail-archive.com/ceph-users@ceph.io/msg19713.html

Actions #12

Updated by Samuel Just 7 months ago

Ah, retiring it would be fine by me as well.

Actions #13

Updated by Konstantin Shalygin 22 days ago

  • Backport changed from reef, quincy, pacific to quincy reef squid