Bug #43766
OSD crash after change of osd_memory_target
Status: Need More Info
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Description
I'm having trouble changing osd_memory_target on my test cluster. I've upgraded the whole cluster from luminous to nautilus, and all OSDs are running bluestore. Because this testlab is short on RAM, I wanted to lower osd_memory_target to save some memory.
# ceph version
ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
# ceph config set osd osd_memory_target 2147483648
# ceph config dump
WHO  MASK  LEVEL     OPTION                     VALUE         RO
mon        advanced  auth_client_required       cephx         *
mon        advanced  auth_cluster_required      cephx         *
mon        advanced  auth_service_required      cephx         *
mon        advanced  mon_allow_pool_delete      true
mon        advanced  mon_max_pg_per_osd         500
mgr        advanced  mgr/balancer/active        true
mgr        advanced  mgr/balancer/mode          crush-compat
osd        advanced  osd_crush_update_on_start  true
osd        advanced  osd_max_backfills          4
osd        basic     osd_memory_target          2147483648
Now no OSD is able to start or restart:
# /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
LOG /var/log/ceph/ceph-osd.0.log:
min_mon_release 14 (nautilus)
0: [v2:10.0.92.69:3300/0,v1:10.0.92.69:6789/0] mon.testlab-ceph-03
1: [v2:10.0.92.72:3300/0,v1:10.0.92.72:6789/0] mon.testlab-ceph-04
2: [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0] mon.testlab-ceph-01
3: [v2:10.0.92.68:3300/0,v1:10.0.92.68:6789/0] mon.testlab-ceph-02
-54> 2020-01-21 11:45:19.289 7f6aa5d78700 1 monclient: mon.2 has (v2) addrs [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0] but i'm connected to v1:10.0.92.67:6789/0, reconnecting
-53> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: _reopen_session rank -1
-52> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): picked mon.testlab-ceph-01 con 0x563319682880 addr [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0]
-51> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): picked mon.testlab-ceph-04 con 0x563319682d00 addr [v2:10.0.92.72:3300/0,v1:10.0.92.72:6789/0]
-50> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): picked mon.testlab-ceph-02 con 0x563319683180 addr [v2:10.0.92.68:3300/0,v1:10.0.92.68:6789/0]
-49> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): start opening mon connection
-48> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): start opening mon connection
-47> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): start opening mon connection
-46> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting): _renew_subs
-45> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): get_auth_request con 0x563319682880 auth_method 0
-44> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-43> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): _init_auth method 2
-42> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): handle_auth_reply_more payload 9
-41> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): handle_auth_reply_more payload_len 9
-40> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
-39> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting): get_auth_request con 0x563319682d00 auth_method 0
-38> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-37> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting): _init_auth method 2
-36> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting): get_auth_request con 0x563319683180 auth_method 0
-35> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-34> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting): _init_auth method 2
-33> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting): handle_auth_done global_id 5638238 payload 386
-32> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _finish_hunting 0
-31> 2020-01-21 11:45:19.289 7f6aa6d7a700 1 monclient: found mon.testlab-ceph-01
-30> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _send_mon_message to mon.testlab-ceph-01 at v2:10.0.92.67:3300/0
-29> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _finish_auth 0
-28> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2020-01-21 11:44:49.293059)
-27> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _send_mon_message to mon.testlab-ceph-01 at v2:10.0.92.67:3300/0
-26> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: handle_monmap mon_map magic: 0 v1
-25> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: got monmap 17 from mon.testlab-ceph-01 (according to old e17)
-24> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: dump:
epoch 17
fsid f42082cc-c35a-44fe-b7ef-c2eb2ff1fe43
last_changed 2020-01-20 10:35:23.579081
created 2018-04-25 17:07:31.881451
min_mon_release 14 (nautilus)
0: [v2:10.0.92.69:3300/0,v1:10.0.92.69:6789/0] mon.testlab-ceph-03
1: [v2:10.0.92.72:3300/0,v1:10.0.92.72:6789/0] mon.testlab-ceph-04
2: [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0] mon.testlab-ceph-01
3: [v2:10.0.92.68:3300/0,v1:10.0.92.68:6789/0] mon.testlab-ceph-02
-23> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: handle_config config(3 keys) v1
-22> 2020-01-21 11:45:19.289 7f6aa7781c80 10 monclient: get_monmap_and_config success
-21> 2020-01-21 11:45:19.289 7f6aa7781c80 10 monclient: shutdown
-20> 2020-01-21 11:45:19.289 7f6aa4575700 4 set_mon_vals no callback set
-19> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: discarding stray monitor message mon_map magic: 0 v1
-18> 2020-01-21 11:45:19.289 7f6aa4575700 10 set_mon_vals osd_crush_update_on_start = true
-17> 2020-01-21 11:45:19.289 7f6aa4575700 10 set_mon_vals osd_max_backfills = 4
-16> 2020-01-21 11:45:19.289 7f6aa4575700 10 set_mon_vals osd_memory_target = 2147483648
-15> 2020-01-21 11:45:19.297 7f6aa7781c80 0 set uid:gid to 64045:64045 (ceph:ceph)
-14> 2020-01-21 11:45:19.297 7f6aa7781c80 0 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process ceph-osd, pid 728019
-13> 2020-01-21 11:45:20.681 7f6aa7781c80 0 pidfile_write: ignore empty --pid-file
-12> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) init /var/run/ceph/ceph-osd.0.asok
-11> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) bind_and_listen /var/run/ceph/ceph-osd.0.asok
-10> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) register_command 0 hook 0x5633196003f0
-9> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) register_command version hook 0x5633196003f0
-8> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) register_command git_version hook 0x5633196003f0
-7> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) register_command help hook 0x563319602220
-6> 2020-01-21 11:45:20.685 7f6aa7781c80 5 asok(0x563319688000) register_command get_command_descriptions hook 0x563319602260
-5> 2020-01-21 11:45:20.685 7f6aa4d76700 5 asok(0x563319688000) entry start
-4> 2020-01-21 11:45:20.685 7f6aa7781c80 5 object store type is bluestore
-3> 2020-01-21 11:45:20.689 7f6aa7781c80 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
-2> 2020-01-21 11:45:20.689 7f6aa7781c80 1 bdev(0x56331a2d8000 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
-1> 2020-01-21 11:45:20.689 7f6aa7781c80 1 bdev(0x56331a2d8000 /var/lib/ceph/osd/ceph-0/block) open size 2000381018112 (0x1d1c0000000, 1.8 TiB) block_size 4096 (4 KiB) rotational discard not supported
0> 2020-01-21 11:45:20.693 7f6aa7781c80 -1 *** Caught signal (Aborted) **
in thread 7f6aa7781c80 thread_name:ceph-osd
ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
1: (()+0x12730) [0x7f6aa8229730]
2: (gsignal()+0x10b) [0x7f6aa7d0d7bb]
3: (abort()+0x121) [0x7f6aa7cf8535]
4: (()+0x8c983) [0x7f6aa80c0983]
5: (()+0x928c6) [0x7f6aa80c68c6]
6: (()+0x92901) [0x7f6aa80c6901]
7: (()+0x92b34) [0x7f6aa80c6b34]
8: (()+0x5a3f53) [0x56330f0a0f53]
9: (Option::size_t const md_config_t::get_val<Option::size_t>(ConfigValues const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x81) [0x56330f0a6c91]
10: (BlueStore::_set_cache_sizes()+0x15a) [0x56330f521d8a]
11: (BlueStore::_open_bdev(bool)+0x173) [0x56330f524b23]
12: (BlueStore::get_devices(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xef) [0x56330f58b7ef]
13: (BlueStore::get_numa_node(int*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x7b) [0x56330f53371b]
14: (main()+0x2870) [0x56330f06e440]
15: (__libc_start_main()+0xeb) [0x7f6aa7cfa09b]
16: (_start()+0x2a) [0x56330f0a0c6a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---
When I remove this option:
# ceph config rm osd osd_memory_target
the OSD starts without any trouble. I've seen the same behaviour when I wrote this parameter into /etc/ceph/ceph.conf.
I've been able to compile ceph-osd with debug symbols and step through it in gdb. The log and backtrace from the debug build:
-24> 2020-01-22 13:12:53.614 7f83ed064700 4 set_mon_vals no callback set
-23> 2020-01-22 13:12:53.614 7f83ee867700 10 monclient: discarding stray monitor message auth_reply(proto 2 0 (0) Success) v1
-22> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_crush_update_on_start = true
-21> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_max_backfills = 64
-20> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_memory_target = 2147483648
-19> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_recovery_max_active = 40
-18> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_recovery_max_single_start = 1000
-17> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_recovery_sleep_hdd = 0.000000
-16> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals osd_recovery_sleep_hybrid = 0.000000
-15> 2020-01-22 13:12:53.627 7f83f0276c40 0 set uid:gid to 64045:64045 (ceph:ceph)
-14> 2020-01-22 13:12:53.627 7f83f0276c40 0 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process ceph-osd, pid 1111622
-13> 2020-01-22 13:12:53.649 7f83f0276c40 0 pidfile_write: ignore empty --pid-file
-12> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) init /var/run/ceph/ceph-osd.6.asok
-11> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) bind_and_listen /var/run/ceph/ceph-osd.6.asok
-10> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) register_command 0 hook 0x558051872fc0
-9> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) register_command version hook 0x558051872fc0
-8> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) register_command git_version hook 0x558051872fc0
-7> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) register_command help hook 0x558051874220
-6> 2020-01-22 13:12:53.657 7f83f0276c40 5 asok(0x5580518fa000) register_command get_command_descriptions hook 0x558051874260
-5> 2020-01-22 13:12:53.657 7f83ed865700 5 asok(0x5580518fa000) entry start
-4> 2020-01-22 13:12:53.670 7f83f0276c40 5 object store type is bluestore
-3> 2020-01-22 13:12:53.675 7f83f0276c40 1 bdev create path /var/lib/ceph/osd/ceph-6/block type kernel
-2> 2020-01-22 13:12:53.675 7f83f0276c40 1 bdev(0x5580518f3f80 /var/lib/ceph/osd/ceph-6/block) open path /var/lib/ceph/osd/ceph-6/block
-1> 2020-01-22 13:12:53.675 7f83f0276c40 1 bdev(0x5580518f3f80 /var/lib/ceph/osd/ceph-6/block) open size 3000588304384 (0x2baa1000000, 2.7 TiB) block_size 4096 (4 KiB) rotational discard not supported
0> 2020-01-22 13:12:53.714 7f83f0276c40 -1 *** Caught signal (Aborted) **
in thread 7f83f0276c40 thread_name:ceph-osd
ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
1: (()+0x2c19654) [0x558045ec6654]
2: (()+0x12730) [0x7f83f0d1f730]
3: (gsignal()+0x10b) [0x7f83f08027bb]
4: (abort()+0x121) [0x7f83f07ed535]
5: (()+0x8c983) [0x7f83f0bb5983]
6: (()+0x928c6) [0x7f83f0bbb8c6]
7: (()+0x92901) [0x7f83f0bbb901]
8: (()+0x92b34) [0x7f83f0bbbb34]
9: (void boost::throw_exception<boost::bad_get>(boost::bad_get const&)+0x7b) [0x5580454d5430]
10: (Option::size_t&& boost::relaxed_get<Option::size_t, boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>&&)+0x5b) [0x5580454d6223]
11: (Option::size_t&& boost::strict_get<Option::size_t, boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>&&)+0x20) [0x5580454d4a39]
12: (Option::size_t&& boost::get<Option::size_t, boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>(boost::variant<boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d>&&)+0x20) [0x5580454d1ed7]
13: (Option::size_t const md_config_t::get_val<Option::size_t>(ConfigValues const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x48) [0x5580454ce882]
14: (Option::size_t const ConfigProxy::get_val<Option::size_t>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x58) [0x5580454cb9b8]
15: (BlueStore::_set_cache_sizes()+0x159) [0x558045ce2213]
16: (BlueStore::_open_bdev(bool)+0x301) [0x558045ce6be3]
17: (BlueStore::get_devices(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xf9) [0x558045d0f16d]
18: (BlueStore::get_numa_node(int*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x79) [0x558045d0eb55]
19: (main()+0x3aae) [0x5580454c2460]
20: (__libc_start_main()+0xeb) [0x7f83f07ef09b]
21: (_start()+0x2a) [0x5580454bda2a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
In int BlueStore::_set_cache_sizes():
(gdb) n
4116 cache_autotune_interval =
(gdb) n
4117 cct->_conf.get_val<double>("bluestore_cache_autotune_interval");
(gdb) p cache_autotune_interval
$3 = 5
(gdb) n
4118 osd_memory_target = cct->_conf.get_val<Option::size_t>("osd_memory_target");
(gdb) s
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> > (this=0x7fffffffc140, __s=0x555558d26c2f "osd_memory_target", __a=...) at /usr/include/c++/8/bits/basic_string.h:515
515 : _M_dataplus(_M_local_data(), __a)
(gdb) n
516 { _M_construct(__s, __s ? __s + traits_type::length(__s) : __s+npos); }
(gdb)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_get> >'
what(): boost::bad_get: failed value get using boost::get
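For readers unfamiliar with boost::bad_get: the exception means that a typed get was attempted on a variant that currently holds a different alternative than the one requested. The debug backtrace shows this happening in boost::get<Option::size_t> under md_config_t::get_val<Option::size_t>. What alternative the stored osd_memory_target value actually had is not visible in the trace. The sketch below only illustrates the failure mode. It is not Ceph code: it uses std::variant and std::bad_variant_access (C++17) as stand-ins for boost::variant and boost::bad_get, and SizeOption, OptionValue, get_val and get_throws are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-in for Option::size_t: a distinct wrapper type, so it is
// not interchangeable with a plain 64-bit integer inside a variant.
struct SizeOption {
  std::uint64_t value;
};

// Hypothetical, much reduced stand-in for the option value variant that
// appears in the backtrace (the real one has many more alternatives).
using OptionValue =
    std::variant<std::monostate, std::string, std::uint64_t, SizeOption>;

// Typed getter: succeeds only if the variant currently holds exactly T.
// std::get throws std::bad_variant_access otherwise, which plays the role
// of boost::bad_get in this sketch.
template <typename T>
T get_val(const OptionValue& v) {
  return std::get<T>(v);
}

// Returns true if reading `v` as T throws because another alternative is stored.
template <typename T>
bool get_throws(const OptionValue& v) {
  try {
    (void)get_val<T>(v);
    return false;
  } catch (const std::bad_variant_access&) {
    return true;
  }
}
```

With these definitions, get_throws<SizeOption>(OptionValue{std::uint64_t{2147483648}}) is true even though the number itself is a perfectly valid size, while get_throws<SizeOption>(OptionValue{SizeOption{2147483648}}) is false. If the exception is not caught, as in the OSD, the C++ runtime calls terminate and the process aborts, which matches the abort() frames in the backtraces.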