Project

General

Profile

Bug #13990

Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?)

Added by Tom Christensen about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
Start date:
12/05/2015
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
hammer,jewel
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

We recently upgraded from firefly 0.80.9 to Hammer 0.94.3. Since our upgrade 3 weeks ago we have accumulated over 250GB of old OSD maps in the meta directory of each OSD. The OSD does not appear to be deleting old OSD maps. We generate substantially more OSD maps than most clusters because we delete RBD snapshots regularly. We have looked at the code around OSD map deletion in OSD.cc and it doesn't look like anything in the OSD code ever moves the lower_bound of the map_cache forward to expire maps.

We have attempted lowering the map_cache_size on the OSDs but this did not result in getting maps to expire and get deleted from disk. We have attempted restarting OSDs, restarting entire OSD nodes, and even marked osds out. Nothing seems to get the OSD to reset its map cache lower bound. Pretty soon we'll have to start just taking OSDs completely out of the cluster, zapping the disk, and then readd back to the cluster. We have over 100,000 OSD maps stored on each OSD and we're using about 10% of our raw disk space to store these maps so its quickly becoming a serious issue for us space-wise.

I am attaching an OSD debug 20 log that shows an iteration of handle_osd_map where the OSD clearly has way more than 500 epochs (oldest_map = 1106541, newest map = 1248379) according to the superblock write command. But it doesn't enter the loop to remove old osdmap epochs. We've set our map_cache_size back to default (500) since lowering it to 250 didn't seem to kick start any sort of cleanup.

If you need any other logs or have ideas of how we could get the OSDs to start trimming osd maps it would be very appreciated.

grep.py View (1.89 KB) Kefu Chai, 01/05/2016 11:11 AM

ceph.log.bz2 - Log after 'ceph-deploy mon create-initial' (10.6 KB) Steve Taylor, 01/08/2016 05:46 PM

ceph-osd.28.log.bz2 - Log from osd.28 showing new crash (124 KB) Steve Taylor, 01/13/2016 10:56 PM

ceph-osd.28.log.bz2 (196 KB) Steve Taylor, 01/14/2016 07:18 PM

ceph-mon.mon-eng-05-03.log.bz2 - Mon log showing osd map trimming (435 KB) Steve Taylor, 01/19/2016 05:42 PM

ceph-osd.0.log.bz2 - Osd log showing osd map trimming (818 KB) Steve Taylor, 01/19/2016 05:42 PM

ceph-osd.0.log.1.gz - osd.0 log (91.9 KB) Steve Taylor, 02/10/2016 03:11 PM

newest-minus-oldest.png View (14 KB) Kefu Chai, 03/21/2016 08:21 AM

oldest.png View (31.7 KB) Kefu Chai, 03/21/2016 08:21 AM


Related issues

Related to Ceph - Backport #14592: osd crashes when handling a stale pg-create message (hammer) Closed
Related to Ceph - Bug #14213: osd/OSD.cc: 2940: FAILED assert(0 == "Missing map in load_pgs") Resolved 01/02/2016
Related to Ceph - Bug #13748: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade Resolved 11/10/2015
Related to Ceph - Bug #15517: msg/simple: PipeConnection leak Can't reproduce 04/15/2016
Copied to Ceph - Backport #15193: hammer: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) Resolved
Copied to Ceph - Backport #15856: jewel: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) Resolved
Copied to Ceph - Bug #15879: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) Duplicate 12/05/2015

Associated revisions

Revision 9789c29f (diff)
Added by Kefu Chai almost 3 years ago

osd: populate the trim_thru epoch using MOSDMap.oldest_map

instead of filling MOSDMap with the local oldest_map, we share
the maximum MOSDMap.oldest_map received so far with peers. That
way one OSD's failure to trim won't prevent it from sharing with
others that they are allowed to trim.

Fixes: #13990
Signed-off-by: Kefu Chai <>

Revision ac0340ad (diff)
Added by Kefu Chai over 2 years ago

osd: populate the trim_thru epoch using MOSDMap.oldest_map

instead of filling MOSDMap with the local oldest_map, we share
the maximum MOSDMap.oldest_map received so far with peers. That
way one OSD's failure to trim won't prevent it from sharing with
others that they are allowed to trim.

Fixes: #13990
Signed-off-by: Kefu Chai <>
(cherry picked from commit 9789c29fe0f82b236703f7ae4d356251fde3a46f)

Conflict: use atomic_t instead of std::atomic<>

History

#1 Updated by Tom Christensen about 3 years ago

Request entity too large... here's a snippet that I believe shows what I said in the bug:
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.865411 7f6472338700 20 osd.1199 1248378 share_map_peer 0x139902c0 already has epoch 1248378
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.865520 7f6472338700 20 osd.1199 1248378 share_map_peer 0x139902c0 already has epoch 1248378
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.866343 7f6472338700 10 osd.1199 1248378 note_peer_epoch osd.916 has 1248379
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.866348 7f6472338700 20 osd.1199 1248378 share_map_peer 0x446cf020 already has epoch 1248379
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.867206 7f6472338700 20 osd.1199 1248378 share_map_peer 0x1fd7bde0 already has epoch 1248378
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.873899 7f6473b3b700 3 osd.1199 1248378 handle_osd_map epochs [1248379,1248379], i have 1248378, src has [1040337,1248379]
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.873921 7f6473b3b700 10 osd.1199 1248378 handle_osd_map got inc map for epoch 1248379
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.876915 7f6472338700 20 osd.1199 1248378 share_map_peer 0x446cf020 already has epoch 1248379
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.878785 7f6472338700 20 osd.1199 1248378 share_map_peer 0x450ab080 already has epoch 1248378
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.886623 7f6472338700 20 osd.1199 1248378 share_map_peer 0x450ab080 already has epoch 1248378
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.910007 7f6473b3b700 10 osd.1199 1248378 advance to epoch 1248379 (<= newest 1248379)
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.910177 7f6473b3b700 7 osd.1199 1248379 advance_map epoch 1248379
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.910211 7f6473b3b700 10 osd.1199 1248379 write_superblock sb(f3b7f409-e061-4e39-b4d0-ae380e29ae7e osd.1199 0946dfb1-444d-4210-b99d-17512206b09d e1248379 [1106541,1248379] lci=[1243375,1248379])
Dec 5 09:23:32 str-slc-06-10 ceph-osd: 2015-12-05 09:23:32.910745 7f6473b3b700 7 osd.1199 1248379 consume_map version 1248379

The OSD Map removal is between the "We got an incremental osd map" and the superblock write... so it seems clear we didn't even enter the loop to remove old osd maps.

#2 Updated by Tom Christensen about 3 years ago

Ok.. Sorry for the spam, the last log snippet didn't quite show what I indicated, as it appears 1199 did finally start trimming its OSD maps overnight. Here we can see 385 which has not trimmed an OSD map since Nov 17th, and has 277GB of osd maps sitting in the meta directory. We would like our OSDs to trim their OSD maps very regularly because we create so many due to snapshot deletion.

Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.654775 7f0a1ae7c700 10 osd.385 1248399 do_waiters -- finish
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.743692 7f0a0e0bf700 10 osd.385 1248399 do_waiters -- start
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.743697 7f0a0e0bf700 10 osd.385 1248399 do_waiters -- finish
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.743699 7f0a0e0bf700 20 osd.385 1248399 _dispatch 0x1ea3b440 osd_map(1248400..1248400 src has 1145073..1248400) v3
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.755159 7f0a0e0bf700 3 osd.385 1248399 handle_osd_map epochs [1248400,1248400], i have 1248399, src has [1145073,1248400]
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.755183 7f0a0e0bf700 10 osd.385 1248399 handle_osd_map got inc map for epoch 1248400
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.782900 7f0a0e0bf700 10 osd.385 1248399 advance to epoch 1248400 (<= newest 1248400)
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.783113 7f0a0e0bf700 7 osd.385 1248400 advance_map epoch 1248400
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.783248 7f0a0e0bf700 10 osd.385 1248400 write_superblock sb(f3b7f409-e061-4e39-b4d0-ae380e29ae7e osd.385 21b29058-c135-4be7-8c95-5ed6c4c245ff e1248400 [1045648,1248400] lci=[1042331,1248400])
Dec 5 10:19:35 str-slc-02-07 ceph-osd: 2015-12-05 10:19:35.783697 7f0a0e0bf700 7 osd.385 1248400 consume_map version 1248400

#3 Updated by Tom Christensen about 3 years ago

One last comment for now. I have found this morning that restarting an OSD does trigger the cleanup. It just takes a long time to clean up as it only removes a max of 30 epochs per new epoch (osd_target_transaction_size limits the number that are culled in each iteration of handle_osd_map).. and because our OSDs are behind by 200k+ epochs on most of our OSDs, it takes 2-3 days to clean up, so I failed to recognize the pattern.

However, having to restart 1500 OSDs every week to keep the OSD maps on disk small is obviously problematic. This feels like a pretty major regression.

#4 Updated by Greg Farnum about 3 years ago

What's the history of this cluster during the upgrade? It sounds like you had dirty state for a very long time (preventing it from trimming maps) and are now just very slowly cleaning up. You could try increasing the osd_target_transaction_size, although if it gets too large that can cause timeouts IIRC.

#5 Updated by Tom Christensen about 3 years ago

It was HEALTH_OK prior to upgrade for months. We've grown the cluster various times, and it has backfilled successfully. We only started seeing this growth of OSD maps after the upgrade to hammer though. We did grow the cluster again, adding 35 OSDs 3 days after the upgrade, but most of the OSDs have OSD maps going back to when they were restarted for the upgrade (which was on Nov 17th)

I can now confirm that a restart of the OSD process does get it to start trimming maps, but once it catches up to when it was restarted, it stops trimming again. All of our OSDs keep all of their maps from the time they were last restarted.

#6 Updated by Samuel Just about 3 years ago

  • Priority changed from Normal to Urgent

The next step would be to post an osd log from an osd as it fails to trim maps. I suggest 1 hour of logs at

debug osd = 20
debug ms = 1
debug filestore = 20.

#7 Updated by Samuel Just about 3 years ago

If the log is too large, you can use ceph-post-file.

#8 Updated by Tom Christensen about 3 years ago

Well since restarting got the OSDs to start trimming, we've now restarted all OSDs in our cluster as we were above 80% used on a good portion of our disks and needed to remedy that situation. It'll take 2-3 days before they stop trimming again as they have so many maps to trim. I will update this ticket as soon as the OSDs finish trimming and start collecting maps again.

#9 Updated by Tom Christensen about 3 years ago

Having never used ceph-post-file, are there any special args to get it to work? I get a "connect to host drop.ceph.com port 22: connection timed out". I have verified on my firewall that this traffic should be allowed and I don't see it getting blocked on my firewall. Ideas?

#10 Updated by Tom Christensen about 3 years ago

I have a log file from an OSD in this state, its 270MB, how would you recommend I get this file into the hands of someone who can look at the problem since ceph-post-file is down? I hope we can get a fix for this into 0.94.6 because we definitely are in a situation where we have to restart all of our OSDs once a week to keep them from filling up with osdmaps. Since our mass restart on Tuesday we've again accumulated 30-40GB of osdmaps on all of our osds. I'm willing and able to attempt a code fix myself, but it would be my first contribution, and I'm not 100% clear on the code in question. It really looks to me like nothing ever moves the cache window forward (by setting the map_cache lower bound), but I don't know if just sticking a "move forward here" call in the handle osd map method would do it? Or if that would be the best solution?

#11 Updated by Kefu Chai about 3 years ago

  • Assignee set to Kefu Chai

#12 Updated by Kefu Chai about 3 years ago

seems we are leaking an osdmap somewhere:

dump_weak_refs 0x4ee8bc8 weak_refs: 1737 = 0x5668ac0 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1738 = 0x5668d00 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1739 = 0x5668f40 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1740 = 0x5669180 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1741 = 0x56693c0 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1742 = 0x5669600 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1743 = 0x5669840 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1744 = 0x524ad80 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1745 = 0x524b440 with 9 refs <<<<<
dump_weak_refs 0x4ee8bc8 weak_refs: 1746 = 0x524b680 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1747 = 0x4fe7200 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1748 = 0x4fe7b00 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1749 = 0x5667440 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1750 = 0x524b200 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1751 = 0x5667680 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1752 = 0x5667d40 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1753 = 0x5667b00 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1754 = 0x4fe81c0 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1755 = 0x5668400 with 1 refs
dump_weak_refs 0x4ee8bc8 weak_refs: 1756 = 0x5669cc0 with 12 refs

i dumped the osdmap cache in OSD::handle_osd_map(), in theory, the old osdmap is not likely to be referenced, and the oldest should be evicted from the LRU cache when new maps are added. but somehow, osdmap#1745 stayed with us.

#13 Updated by Kefu Chai about 3 years ago

  • Backport set to jewel, hammer

tested on master, this also applies to master.

#14 Updated by Nathan Cutler about 3 years ago

Kefu, shouldn't the backport be "infernalis, hammer" ? I don't think we're backporting to jewel yet.

#15 Updated by Kefu Chai about 3 years ago

  • Backport changed from jewel, hammer to infernalis, hammer

#16 Updated by Kefu Chai about 3 years ago

1
====
 1: (PG::get_osdmap() const 0x0) [0x00000000]
 2: (PG::prepare_write_info(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cx\
x11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, s\
td::allocator<char> > const, ceph::buffer::list> > >*)+0x174) [0x7f8b7794a624]
 3: (PG::write_if_dirty(ObjectStore::Transaction&)+0x123) [0x7f8b7794b333]
 4: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x317) [0x7f8b778c1d57]
 5: (OSD::PeeringWQ::_process(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x26) [0x7f8b77906436]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xda1) [0x7f8b77dac131]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f8b77dad240]
 8: (()+0x7284) [0x7f8b7677d284]
 9: (clone()+0x6d) [0x7f8b74a2074d]

14
====
 1: (PG::PG(OSDService*, OSDMapRef, PGPool const&, spg_t) 0x0) [0x00000000]
 2: (ReplicatedPG::ReplicatedPG(OSDService*, OSDMapRef, PGPool const&, spg_t)+0x69) [0x7f8b779e1b39]
 3: (OSD::_make_pg(OSDMapRef, spg_t)+0x264) [0x7f8b778a1804]
 4: (OSD::_open_lock_pg(OSDMapRef, spg_t, bool)+0x5b) [0x7f8b778a1adb]
 5: (OSD::_create_lock_pg(OSDMapRef, spg_t, bool, bool, int, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, pg_h\
istory_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, ObjectStore::Trans\
action&)+0x1a2) [0x7f8b778a4ff2]
 6: (OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int\
 const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::shared_ptr<PG::CephPeeringEvt>)+0x11e6) [0x7f8b778bd436]
 7: (OSD::handle_pg_log(std::shared_ptr<OpRequest>)+0x548) [0x7f8b778be748]
 8: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0xd3) [0x7f8b778c03b3]
 9: (OSD::_dispatch(Message*)+0x27e) [0x7f8b778c110e]
 10: (OSD::ms_dispatch(Message*)+0x2e7) [0x7f8b778c18a7]
 11: (DispatchQueue::entry()+0xe0a) [0x7f8b77ecc14a]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f8b77d9af0d]
 13: (()+0x7284) [0x7f8b7677d284]
 14: (clone()+0x6d) [0x7f8b74a2074d]

-7
====
  1: (PG::handle_advance_map(OSDMapRef, OSDMapRef, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCt\
x*) 0x0) [0x00000000]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std\
::allocator<boost::intrusive_ptr<PG> > >*)+0x2c7) [0x7f8b778a8657]
 3: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1e3) [0x7f8b778c1c23]
 4: (OSD::PeeringWQ::_process(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x26) [0x7f8b77906436]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xda1) [0x7f8b77dac131]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f8b77dad240]
 7: (()+0x7284) [0x7f8b7677d284]
 8: (clone()+0x6d) [0x7f8b74a2074d]

i think the PG::last_persisted_osdmap_ref is holding 8 references of osdmap, and hence prevents the OSD::map_cache from freeing the old osdmap, and OSD will not remove them from the objectstore if somebody is still holding them.

#17 Updated by Kefu Chai about 3 years ago

by repeating adding a pool and deleting it, we can see 8 PGs in the pool:

2015-12-21 20:43:22.170173 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.7
2015-12-21 20:43:22.170176 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.6
2015-12-21 20:43:22.170177 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.5
2015-12-21 20:43:22.170178 7f17b9e70700 10 osd.0 198 _open_lock_pg===92.0
2015-12-21 20:43:22.170179 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.4
2015-12-21 20:43:22.170180 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.0
2015-12-21 20:43:22.170181 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.1
2015-12-21 20:43:22.170182 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.2
2015-12-21 20:43:22.170184 7f17b9e70700 10 osd.0 198 _open_lock_pg===0.3

8 of them stay with us after being added, they are not replaced/removed any more. the last one (pg 92.0 at that moment) gets replaced when the pool is created/deleted. the 0.[0-7] pg belong to the rbd pool. and since the rbd pool is not touched, these pg is not changed. i guess that's why the last_persisted_osdmap_ref is not updated.

printing out the PG in OSD::pg_map, their last_persisted_osdmap_ref are:

0x7f17d1288800 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}
0x7f17d1286000 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}

0x7f17d10c3000 std::shared_ptr (count 1, weak 1) 0x7f17d145b200, <No data fields>}

0x7f17d10c0800 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}
0x7f17d0ecc000 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}
0x7f17d0ece800 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}
0x7f17d0ed1000 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}
0x7f17d10be000 std::shared_ptr (count 7, weak 1) 0x7f17d1443d40, <No data fields>}

(gdb) p ((OSDMap*)0x7f17d1443d40)->epoch
$35 = 198

and osdmap e198 is the first one in the osdmap cache

2015-12-21 20:46:29.117824 7f17b866d700 -1 leaked refs:
dump_weak_refs 0x7f17d0de90a0 weak_refs: 198 = 0x7f17d1443d40 with 7 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 200 = 0x7f17d145b200 with 2 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 208 = 0x7f17d0ca78c0 with 4 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 209 = 0x7f17d24d7d40 with 9 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 227 = 0x7f17d24d8400 with 1 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 246 = 0x7f17d0e54640 with 4 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 248 = 0x7f17d1014240 with 4 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 249 = 0x7f17d15ecf40 with 1 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 250 = 0x7f17d0e53440 with 1 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 251 = 0x7f17d24d78c0 with 1 refs
dump_weak_refs 0x7f17d0de90a0 weak_refs: 252 = 0x7f17d1443440 with 1 refs

#18 Updated by Kefu Chai about 3 years ago

  • Status changed from New to Need More Info

hi Tom,

It really looks to me like nothing ever moves the cache window forward (by setting the map_cache lower bound), but I don't know if just sticking a "move forward here" call in the handle osd map method would do it? Or if that would be the best solution?

the cache is self maintained, if the last reference of the cached osdmap is released, the osdmap is freed. and once the smallest epoch of cached entry in the cache increases, we will remove [oldest_map..min(msg.oldest_map, lower_bound)] in the cache.

#19 Updated by Kefu Chai about 3 years ago

we send NullEvt when a new osdmap is consumed, which also updates all PGs in the pg_map, if a PG's persisted osd map is too stale (osd_pg_epoch_persisted_max_stale), the PG::write_if_dirty() updates the last_persisted_osdmap_ref with the latest osdmap set by PG::handle_advance_map. so changing the last_persisted_osdmap_ref from OSDMapRef to epoch_t does not fix this issue. it just helps with it.

#20 Updated by Kefu Chai about 3 years ago

https://github.com/ceph/ceph/pull/7007 is posted to address the potential issue caused by the reference (last_persisted_osdmap_ref) held by PG. but the root cause of this issue is still in the mist.

#21 Updated by Kefu Chai about 3 years ago

  • Backport changed from infernalis, hammer to hammer

#22 Updated by Kefu Chai about 3 years ago

  • Status changed from Need More Info to New

#23 Updated by Tom Christensen about 3 years ago

Thanks for the updates Kefu. Is there something you would like me to attempt to gather additional info?

We create lots of "non-change" osd maps by deleting lots of RBD snapshots. So my guess is none of our osd maps contain changes to individual PGs, they are mostly updates to the removed_snaps list in the OSD map.

We have test environments where we can easily duplicate this issue, and we can run custom builds, debug builds, whatever you need there to try to track this down. If I understand correctly, you're saying this cleanup fixes a potential osd map reference leak, but there is at least one more we're hitting in addition because this one (the PG last_persisted_osdmap_ref) should only leak up to osd_pg_epoch_persisted_max_stale epochs?

#24 Updated by Kefu Chai about 3 years ago

hey Tom,

but there is at least one more we're hitting

the issue i am addressing is that the last_persisted_osdmap_ref hold hold the osd map for too long, but even without my patch the last_persisted_osdmap_ref will be refreshed once the osd_pg_epoch_persisted_max_stale is reached. so strictly speaking, this fix and the one we are after might not the same kind of fix.

Is there something you would like me to attempt to gather additional info?
We have test environments where we can easily duplicate this issue, and we can run custom builds, debug builds, whatever you need there to try to track this down.

much appreciated, Tom! i will prepare a custom build for you this evening.

#25 Updated by Tom Christensen about 3 years ago

I would say this probably will not fix the issue we're seeing as osd_pg_epoch_persisted_max_stale defaults to 200 (we haven't changed this setting) and we are seeing our OSDs keep around hundreds of thousands of osd maps, so either its a different reference leak, or its a completely different issue, though I do think a reference leak is still a likely culprit based on the behavior of the LRU.

Once we have the custom build we will run it for a couple days, we'll generate 2-3000 osd maps in a day or 2 and we can get logs/debug/etc for you.

#26 Updated by Kefu Chai about 3 years ago

from the log from ceph-osd@str-slc-04-08, mon.0 sent osdmap[1303506,1304111] to the osd. so seems the monitor did not trim the osdmap in half an hour: the oldest_map was always 1303506

$ grep '<== mon' ceph-osd.log | grep osd_map
Dec 16 11:09:21 str-slc-04-08 ceph-osd: 2015-12-16 11:09:21.959067 7f4c6148d700  1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10515 ==== osd_map(1304111..1304111 src has 1303506..1304111) v3 ==== 285396+0+0 (1443583887 0 0) 0x6d53e880 con 0x1d0b0c60
Dec 16 11:21:54 str-slc-04-08 ceph-osd: 2015-12-16 11:21:54.000470 7f4c6148d700  1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10550 ==== osd_map(1304114..1304115 src has 1303506..1304115) v3 ==== 570760+0+0 (817761143 0 0) 0x6d53f840 con 0x1d0b0c60
Dec 16 11:32:02 str-slc-04-08 ceph-osd: 2015-12-16 11:32:02.415693 7f4c6148d700  1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10567 ==== osd_map(1304120..1304120 src has 1303506..1304120) v3 ==== 285396+0+0 (322435256 0 0) 0x499d5b00 con 0x1d0b0c60
Dec 16 11:37:18 str-slc-04-08 ceph-osd: 2015-12-16 11:37:18.722655 7f4c6148d700  1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10584 ==== osd_map(1304125..1304125 src has 1303506..1304125) v3 ==== 285396+0+0 (3651012295 0 0) 0x499d7cc0 con 0x1d0b0c60

but the peer OSDs kept even older osdmap: range from 1262819 to 1263051.

$ grep '<== osd' ceph-osd.log | grep osd_map
...
Dec 16 11:37:17 str-slc-04-08 ceph-osd: 2015-12-16 11:37:17.258647 7f4c5fc8a700  1 -- 10.200.21.67:6826/5913 <== osd.785 10.200.21.64:6838/7223 24474 ==== osd_map(1304124..1304124 src has 1262666..1304124) v3 ==== 285396+0+0 (3345522726 0 0) 0x1ea1ad00 con 0x13ed0160
Dec 16 11:37:17 str-slc-04-08 ceph-osd: 2015-12-16 11:37:17.265035 7f4c5fc8a700  1 -- 10.200.21.67:6826/5913 <== osd.1473 10.200.21.95:6808/6168 24433 ==== osd_map(1304124..1304124 src has 1262920..1304124) v3 ==== 285396+0+0 (1024685728 0 0) 0x1ea19f80 con 0x11dda260
Dec 16 11:37:18 str-slc-04-08 ceph-osd: 2015-12-16 11:37:18.739196 7f4c5fc8a700  1 -- 10.200.21.67:6826/5913 <== osd.443 10.200.21.76:6802/6552 133533 ==== osd_map(1304125..1304125 src has 1263043..1304125) v3 ==== 285396+0+0 (2835833692 0 0) 0x7ece6ac0 con 0xdd8c580
Dec 16 11:37:18 str-slc-04-08 ceph-osd: 2015-12-16 11:37:18.766409 7f4c5fc8a700  1 -- 10.200.21.67:6826/5913 <== osd.1438 10.200.21.98:6834/6656 28262 ==== osd_map(1304125..1304125 src has 1279359..1304125) v3 ==== 285396+0+0 (4170126892 0 0) 0x367ce1c0 con 0x3002b180
Dec 16 11:37:19 str-slc-04-08 ceph-osd: 2015-12-16 11:37:19.714696 7f4c5fc8a700  1 -- 10.200.21.67:6826/5913 <== osd.1164 10.200.21.88:6816/4864 24893 ==== osd_map(1304124..1304125 src has 1262512..1304125) v3 ==== 570760+0+0 (1400774595 0 0) 0x46499680 con 0x35ce9340
...

so i'd like to be double check the monitor first, with following steps:

  1. the default setting for the cluster tends to keep the osdmap longer. if possible, could you put following settings into your ceph.conf?
    osd_pg_epoch_persisted_max_stale=10
    osd_map_cache_size=20
    osd_map_max_advance=10
    paxos_service_trim_min=10
    mon_min_osdmap_epochs=20
    

    if not, no worries. but the osdmap trim interval is expected to be longer.
  2. to verify that the monitor trims its own osdmap, increase log level of the leader monitor (use "ceph -s" to figure out the leader monitor)
     $ ceph tell mon.a injectargs '--debug-paxos 10'
     $ tailf /var/log/ceph/mon.a.log | grep trim
    

    we will see output like:
    2016-01-05 16:57:29.845970 7f5f3cbca700 10 mon.a@0(leader).paxosservice(osdmap 46..75) maybe_trim trim_to 55 would only trim 9 < paxos_service_trim_min 10
    2016-01-05 16:57:34.846232 7f5f3cbca700 10 mon.a@0(leader).paxosservice(osdmap 46..76) maybe_trim trimming to 56, 10 states
    2016-01-05 16:57:34.846243 7f5f3cbca700 10 mon.a@0(leader).paxosservice(osdmap 46..76) trim from 46 to 56
    2016-01-05 16:57:34.846439 7f5f3cbca700 10 mon.a@0(leader).osd e76 encode_trim_extra including full map for e 56
    2016-01-05 16:57:54.859806 7f5f3cbca700 10 mon.a@0(leader).paxosservice(osdmap 56..77) maybe_trim trim_to 57 would only trim 1 < paxos_service_trim_min 10
    

    where the monitor trims osdmap [46,56]. once the monitor trims its osdmap, the "oldest_map" epoch in the osdmap message sent to OSD will be increased, upon receiving such an osdmap message from monitor, the OSD will remove the cached osdmaps older than the "oldest_map" if none of the maps to be removed is still referenced. so if we see something like "trim from 46 to 56" in the monitor log. we should move our magnifier to the OSD. if the monitor does not trim the osdmap, could you upload the log?
  3. the OSD removes osdmaps from its cache when an osdmap message is received from its peer whose "oldest_map" is greater than the one in the receiver's superblock. if any of of the osdmaps to be moved is still referenced by the LRU cache, all the maps newer than the referenced map are not removed. the "peer" here could be another OSD or a monitor. so, whenever an OSD removes its osdmap, we can always track down to an osdmap message from a monitor who just trimmed its osdmaps.
    1. replace an OSD with the test build (http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-hammer-13990/pool/main/c/ceph/)
    2. make sure it receives osdmaps from monitor instead of its peer OSDs
      $ ceph tell osd.1 injectargs '--debug-ms 1'
      $ tailf /var/log/osd.1.log | grep osd_map
      
    3. watch the output of monitor until it trims the osdmaps. wait for 10 minutes, and collect the log of the OSD with test build.
    4. upload the log file,
    5. if the log file is too large, we can preprocess it with the attached python script
        python grep.py /var/log/ceph/osd.1.log 14 > /tmp/bt.log
      

      in which, the 13505 is the epoch of the oldest osdmap in objectstore. to figure out its epoch number:
      $ find current/meta/ . -name 'osdmap*' -exec basename {} \; | sort | uniq | head -n1
      osdmap.14__0_6417091C__none
      

      here "14" is the epoch number of the oldest osdmap in cache.

#27 Updated by Kefu Chai about 3 years ago

Tom, are you reachable over the IRC? ping me in the #ceph-devel or #ceph if anything. thanks =)

#29 Updated by Kefu Chai about 3 years ago

Tom,

how's the test going? and i found that the osdmap related settings mentioned in https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf str the same with those mentioned in comment #26, probably they perform better in large cluster.

#30 Updated by Tom Christensen about 3 years ago

Kefu,
Unfortunately we have been unable to get the build to run in any cluster configuration.
We had a cluster running 0.94.5, we installed the build on one of the OSD nodes and restarted a couple OSDs and within about 20-30 minutes all 3 mons segfaulted simultaneously and after restarting them, they segfault again every few minutes. We believe this is related to issue 13748.

We then attempted to deploy an entire cluster using the test build, and this failed to generate admin key file when deploying mons. We are attempting to use ceph-deploy to create the mons.

We are now attempting to get a new clean deployment of stock 0.94.5 to attempt upgrading the entire cluster to the test build.
Any pointers you might have as to how to successfully deploy the test build would be greatly appreciated.

#31 Updated by Kefu Chai about 3 years ago

Tom,

Vikhyat and I suspect that

this failed to generate admin key file when deploying mons

could be easier to solve.

could you

- attach the ceph.log (it's in ceph-config or ceph-deploy directory or current directory)
- and follow the instructions at https://access.redhat.com/documentation/en/red-hat-ceph-storage/version-1.3/installation-guide-for-ubuntu-x86-64/#storage-cluster-quick-start , to deploy the cluster?

#32 Updated by Steve Taylor about 3 years ago

Kefu,

I work with Tom. I followed the installation guide you referenced with the exception of manually installing the 0.94.5-180-g8642def packages. I still performed the 'ceph-deploy install ...' steps after installing the packages manually to make sure nothing was missed.

Ceph-deploy generated the attached ceph.log file from a test environment with 3 mons. At the point where this log ends all three mons have ceph.conf, rbdmap, tmpXXXXXX and nothing else in /etc/ceph. The ceph.conf and rbdmap files look good, and the tmp files are 0 bytes in all three cases. If I install 0.94.5 from the ceph.com repos via ceph-deploy, 'ceph-deploy mon create-initial' is able to obtain the admin keys from all three mons and form a quorum as expected.

#33 Updated by Kefu Chai about 3 years ago

hey Steve,

just discussed with Vikhyat on this issue,

we suggest you to

  1. check if you have ceph.client.admin.keyring and ceph.conf in the ceph-deploy admin node
    • if not, there must have been something wrong when you were deploying the cluster
  2. if yes, run
    ceph-deploy purgedata mon-eng-05-01 mon-eng-05-02 mon-eng-05-03
    ceph-deploy forgetkeys
    ceph-deploy admin mon-eng-05-01 mon-eng-05-02 mon-eng-05-03
    ceph-deploy mon create-initial
    

    in which, ceph-deploy admin will copy $PWD/ceph.client.admin.keyring to the monitors' hosts.

#34 Updated by Steve Taylor about 3 years ago

I do have a ceph.conf on the ceph-deploy admin node, but there isn't any ceph.client.admin.keyring file. Running "ceph-deploy admin <any server>" gives me this:

[ceph_deploy][ERROR ] RuntimeError: ceph.client.admin.keyring not found

I've tried the mon deployment using "ceph-deploy install --release hammer mon-eng-05-01 mon-eng-05-02 mon-eng-05-03" and the create-initial succeeds in that case for 0.94.5-1 with a resulting ceph.client.admin.keyring getting saved to the ceph-deploy admin node. I also tried installing 0.94.5-1 to the ceph-deploy admin node and 0.94.5-180-g8642def to the mons. I get the same failure with the missing client admin key.

I tried one more test where I deployed the admin node and all of the mons with 0.94.5-1 and then upgraded all of the mons simultaneously to 0.94.5-180-g8642def. The mons seemed to be running okay after the upgrade, but "ceph status" gave me the following:

Traceback (most recent call last):
File "/usr/bin/ceph", line 914, in <module>
retval = main()
File "/usr/bin/ceph", line 663, in main
conffile=conffile)
File "/usr/lib/python2.7/dist-packages/rados.py", line 215, in init
self.librados = CDLL
File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in init
self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/librados.so.2: undefined symbol: g_ceph_context

Other ceph commands on the admin host either returned the same error or hung until I killed them. This is so far the closest I've gotten to a working cluster with 0.94.5-180-g8642def. Is g_ceph_context() missing from librados in this build or something? I'm not sure if this is related to the missing client admin key, but it certainly seems possible.

#35 Updated by Kefu Chai about 3 years ago

Steve,

i apologise for this problem. it is actually expected. as i am using the ceph's logging facility in osdmap to print the backtrace, and hence introduced the dependency to the g_ceph_context symbol to librados.so which is the workhorse of the "ceph" CLI command. so that's why the ceph command failed to work for you when you re-deployed the whole cluster using the test build.

so instead of deploying the whole cluster using the test build, i'd suggest you to

  1. install the latest hammer from http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer/pool/main/c/ceph/, which is quite the same as the test build. the only difference is https://github.com/ceph/ceph/commit/9708e59a1fc2ae52cfae848ce585751bc9fbe572. and i believe it should be fine as it is a trivial rgw fix.
  2. and replace one of the OSDs using the testbuild.

also i'm interested in the test result on the monitor side, as put by step 1 and step 2 in http://tracker.ceph.com/issues/13990#note-26 .

#36 Updated by Kefu Chai about 3 years ago

might related to #7995.

#37 Updated by Steve Taylor about 3 years ago

I got the build you referenced deployed today, but any osds that I deploy with that build run for a short time (maybe 1-2 minutes) and crash. From the logs it appears that it's crashing in the osd map code. A log from one of the osds is attached. I've restarted and watched it crashed maybe three times so far.

#38 Updated by Kefu Chai about 3 years ago

Steve, could you enable the "debug ms = 1" and "debug osd = 20", and post the log again? many thanks!

#39 Updated by Steve Taylor about 3 years ago

Here's a new log from the same osd. Same behavior.

#40 Updated by Kefu Chai about 3 years ago

the osd crashed when it was handling a pg_create message, it walked through the prior epochs looking for the acting set from the "create" epoch (1) to the latest one (353). but the new OSD only have osdmap[327..336]...

#41 Updated by Kefu Chai about 3 years ago

Steve,

seems the monitor is sending erroneous pg-create messages to osd after the PGs in question have already been created.

could you

  1. set "debug-mon=20" on the monitor side
  2. post the output of "ceph pg dump --format-json-pretty"
  3. and start osd.28 again.
  4. and post the log on all monitor and osd.28?

thanks,

#42 Updated by Steve Taylor almost 3 years ago

Kefu,

This took a little more time because some of the hardware I was using for this test needed to be repurposed a few days ago. I redeployed 0.94.5-183-ge1b9208 on 1 mon and 24 osds (8 on each of 3 hosts) and didn't see any crashes this time. I'm not sure what's different, but it's running fine.

I followed the steps above and obtained the debug logs requested. I saw the mon trim 65..95 at 10:20 this morning and immediately saw osd.0 log the same trim. I waited 10 minutes and grabbed both the mon and osd logs, which are attached. There are other trims in there as well, but 65..95 is the one I noticed.

Let me know if you need additional info.

#43 Updated by Kefu Chai almost 3 years ago

  • Related to Backport #14592: osd crashes when handling a stale pg-create message (hammer) added

#44 Updated by Kefu Chai almost 3 years ago

@Steve,

i am sorry for the latency, was on another issue last week. the trimming behaviour is exactly what we expect. the reason why only one osdmap trimming was observed is that, the max osdmap epoch was 95 by the time we stopped.

so could you wait a little bit longer before collecting the log? and create pool snapshots as usual while you are waiting so that more osdmap updates are expected? if the monitor always trim the osdmap, and osd follows, and we are not able to reproduce this issue with this setting. then we have two options:

  1. switch to the options of
    osd_pg_epoch_persisted_max_stale=10
    osd_map_cache_size=20
    osd_map_max_advance=10
    paxos_service_trim_min=10
    mon_min_osdmap_epochs=20
    
  2. or revert the options in the test cluster to the ones in your production cluster, and perform the test again to reproduce the issue.

#45 Updated by Kefu Chai almost 3 years ago

the crash you spotted is root caused at #14592

#46 Updated by David Zafman almost 3 years ago

  • Related to Bug #14213: osd/OSD.cc: 2940: FAILED assert(0 == "Missing map in load_pgs") added

#47 Updated by David Zafman almost 3 years ago

A fix related to this tracker caused 14213.

#48 Updated by Steve Taylor almost 3 years ago

Attached is a log from osd.0 in a new deployment where I've performed some tests with rbd snapshots. Based on the mon log there should be some of the desired map update paths here, but I'm not 100% positive. The problem I'm seeing is one that we've seen for a while, and that's that we see the mon segfault nearly every time we create a snapshot. Tom mentioned above about a month ago that we were seeing this issue, and it still persists. One problem we're looking at is that I don't see how we can upgrade if we do get a fix for this osd map cache issue without the mon segfault issue resolved.

The stack trace on the mon segfault looks the same as the one referenced at http://tracker.ceph.com/issues/13748#note-13, so the assumption is that we're seeing that same issue. Every 0.94.5 deployment we've tested has exhibited that segfault issue, mainly when we create rbd snapshots.

Hopefully the attached log gives you info you need for this osd map cache problem.

#49 Updated by Kefu Chai almost 3 years ago

  • Related to Bug #13748: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade added

#50 Updated by Kefu Chai almost 3 years ago

the assumption is that we're seeing that same issue. Every 0.94.5 deployment we've tested has exhibited that segfault issue, mainly when we create rbd snapshots.

i just fixed #13748, but it only happens when we are using syslog for cluster log. in other words, when we have following setting for monitor:

[mon]
        mon cluster log to syslog = true

guess it prints a cluster log when creating a pool snapshot. and we failed to lookup the facility for the log entry when we are about to write to the syslog.

Steve, before the fix lands on hammer, we can move on with our testing by specifying a log file for the "clog" and disabling the syslog for clog:

[mon]
  mon cluster log file = /var/log/ceph/cluster.mon.$id.log
  mon cluster log to syslog = false

#51 Updated by Steve Taylor almost 3 years ago

The log setting change worked. No more mon segfaults. I have a new osd log for you too, but I'm unable to upload it. I keep getting an error when I try. It just says, "error" without any additional information.

If you need a new log file, I can send it to you another way or maybe I'll be able to upload it here later.

#52 Updated by Kefu Chai almost 3 years ago

Steve, you can either send it over to me (), or post it here. but before that, i want to confirm with you: where we are now?

I saw the mon trim 65..95 at 10:20 this morning and immediately saw osd.0 log the same trim.

last time the log you sent to me shows that the mon and osd trimmed the osdmaps as expected

and i suggested to either use more aggressive options to trim the osdmap or revert to the options used in your production cluster, and wait longer to reproduce the issue.

#53 Updated by Steve Taylor almost 3 years ago

I just sent you the new osd log via email.

What I did for this test was write a bunch of RBD data and take lots of RBD snapshots over the course of several hours. During this time I also added new osds to the cluster, so there should have been osdmap updates happening frequently throughout the process. I tailed the mon log during this process and saw it issuing trims. I continued to let it run all day in this mode and collected the log I just sent the next morning.

The osdmap trimming options in this cluster match what we use in production. My understanding of the purpose of this test is either to verify the fix you made as working in our environment or not, so the fact that the previous logs show things working as expected is a good sign, correct?

#54 Updated by Kefu Chai almost 3 years ago

My understanding of the purpose of this test is either to verify the fix you made as working in our environment or not,

the purpose of the test is to find out why the osd failed to trim the osdmaps by walking through the steps detailed by http://tracker.ceph.com/issues/13990#note-26. my change at http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-hammer-13990/pool/main/c/ceph/ prints log messages whenever the reference count of an osdmap is increased or decreased. and the http://tracker.ceph.com/attachments/download/2144/grep.py script is designed to preprocess the log messages to find out the differences in the number of add-refs and dec-refs.

so the fact that the previous logs show things working as expected is a good sign, correct?

to me, it's a mixed feeling. the problem goes away without a root cause, and it could return without an omen.

but by inspecting the log you sent to me, the OSDs are trimming osdmaps:

$ grep  osd_map ceph-osd.0.log.1-20160217 | grep -- '-->' | awk '{print $13}'| cut -d')' -f1 | uniq
304..334
314..335
314..336
314..337
314..338
314..339
314..340
314..341
314..342
314..343
...
364..503
364..504
364..505
364..506
364..507
364..508
364..509
364..510
364..511
364..512

the osd is trimming its osdmaps, that's why the oldest epoch number is increasing from 304 to 364. which means 60 osdmaps were removed.
$ grep  osd_map ceph-osd.0.log.1-20160217 | grep -- '-->' | awk '{print $13}'| cut -d')' -f1 | uniq | awk -F'[^0-9]+'  '{print $2-$1 }'
30
21
22
23
24
25
26
27
28
29
...
139
140
141
142
143
144
145
146
147
148

but the number of osdmaps held by that OSD was increasing. and the last osdmap message from monitor was

$ grep 'mon\.' ceph-osd.0.log.1-20160217|grep osd_map | tail -n1
2016-02-16 15:29:54.456292 7f4a6513d700  1 -- 10.3.20.163:6814/15131 <== mon.0 10.3.20.169:6789/0 1382 ==== osd_map(482..483 src has 364..483) v3 ==== 2468+0+0 (3379488231 0 0) 0xad5e640 con 0x9dd0000

in which, the monitor was still holding the osdmap back to epoch #364. which is ancient from the point of view of the settings mentioned by #13990-44 , where mon_min_osdmap_epochs is merely 20. so per the analysis in #13990-26, we need to move back to step 2 and focus on the monitor side. put in other words, we need to revisit

to verify that the monitor trims its own osdmap, increase log level of the leader monitor (use "ceph -s" to figure out the leader monitor)

> $ ceph tell mon.a injectargs '--debug-paxos 10'
> $ tailf /var/log/ceph/mon.a.log | grep trim
> 

and collect the monitor log instead.

#55 Updated by Steve Taylor almost 3 years ago

I already have the corresponding mon log with --debug-paxos 10. I didn't send it because I couldn't get the osd log to go, apparently due to its size, and the mon log is about twice as large. I'll attempt to send it privately.

#56 Updated by Vikhyat Umrao almost 3 years ago

Steve Taylor wrote:

I already have the corresponding mon log with --debug-paxos 10. I didn't send it because I couldn't get the osd log to go, apparently due to its size, and the mon log is about twice as large. I'll attempt to send it privately.

Hello Steve,

If you want you can send them via Red Hat case Portal. Then I will provide them to Kefu. I will update the Red Hat Case how you can send large files to Red Hat.

#57 Updated by Kefu Chai almost 3 years ago

i received the mon log from Steve.

the monitor was holding 20 osdmaps (856-876), right before the log is collected.

$ grep osdmap ceph-mon.mon-eng-05-03.log.5 | tail -n1
2016-02-18 06:39:17.839603 7f9b77817700 10 mon.mon-eng-05-03@0(leader).paxosservice(osdmap 856..876) post_refresh

this matches with expectation when the setting is

osd_map_cache_size=20

that's to say, the monitor is trimming as expected.

Steve, may i have the log from 10.3.20.169 instead? as we can see from #13990-54, it is holding over 100 osdmaps.

#58 Updated by Steve Taylor almost 3 years ago

mon-eng-05-03 is 10.3.20.169. This is a test cluster with only 1 mon.

#59 Updated by Kefu Chai almost 3 years ago

oh, i see. Steve, could you help send me the mon log and osd log collected in the same time period?

  • the mon log you sent to me is collected from 2016-02-17 06:35:47 ~ 2016-02-18 06:39:17, and per this log, the mon was holding more and more osdmap
  • the osd log you sent to me is collected from 2016-02-16 13:14:35 ~ 2016-02-16 22:42:42, while per this log, the mon was trimming as expected.

#60 Updated by Steve Taylor almost 3 years ago

I noticed the mon log was a day off when I made my last update. Sorry about that. Unfortunately this test was old enough that the logs had rolled by the time I noticed, so I'll have to run the test again and create new logs.

#61 Updated by Kefu Chai almost 3 years ago

thanks in advance!

#62 Updated by Steve Taylor almost 3 years ago

Mon and osd logs uploaded to support case 01549888. They're too big to upload here.

#63 Updated by Vikhyat Umrao almost 3 years ago

Steve Taylor wrote:

Mon and osd logs uploaded to support case 01549888. They're too big to upload here.

Thanks Steve , will provide these logs to Kefu.

#64 Updated by Kefu Chai almost 3 years ago

Steve,

so the osdmaps are being trimmed fine on both OSD and monitor (v0.94.5) with the given settings. here could be two paths,

  1. we'd suggest that you use the latest hammer LTS (v0.94.6) with the tested options (see #13990-44),
  2. if you'd prefer using the old options, could you test 0.94.6 with the default settings?

#65 Updated by Steve Taylor almost 3 years ago

I ran another test with 0.94.5-183-ge1b9208. I created a simple script that created and deleted a bunch of rbd snapshots without any additional I/O on the cluster and ran 16 instances of it simultaneously for a little over an hour. I had osd_map_cache_size at 20, but by 70 minutes or so into this test osd.0 had over 2,000 osd maps in its cache.

Over the course of this time I watched the osd map cache grow steadily. There were intermittent occurrences where the number would drop by 30 at a time, so it did appear to be trimming occasionally.

When it got over 2,000 I started killing the instances of the script. By the time I got them all killed there were 2,120 cached osd maps on osd.0. I found 5 snapshots that had been created by the scripts and not deleted because I killed them, so I deleted those manually. That caused the osd map cache on osd.0 to grow to 2,125, one for each snapshot I deleted manually, which is what I expected.

After this I let the cluster sit idle for over an hour. Osd.0's map cache stayed at 2,125 the whole time, so I collected the logs from mon-eng-05-03 and osd.0 again. I'll upload those to support case 01549888 again. It's now been another 20 minutes or something since I collected the logs, and osd.0's map cache is still holding 2,125 maps.

If these logs also aren't helpful, I'll try 0.94.6 and see what I get.

#66 Updated by Steve Taylor almost 3 years ago

Update--it's now been approximately 20 hours since my latest test finished and my osd map cache still has 2,125 maps. Not a single one has been trimmed.

#67 Updated by Kefu Chai almost 3 years ago

downloaded the log, will check them out tomorrow.

fwiw, OSD does not trim osdmap periodically, instead, osdmap trimming is triggered when new osdmaps are received. and new osdmaps are circulated around only when there are some activity/change happening in the cluster.

#68 Updated by Steve Taylor almost 3 years ago

As I watched this test it was apparent that osd maps were getting trimmed occasionally, but certainly not every time a new map epoch was generated. That's an obvious point since the map cache grew to over 2,000 maps in a little over an hour. It's still at 2,125 now after sitting all weekend, so that matches your description.

It appeared when I was watching it that every time the osd map cache trimmed it was trimmed by 30 maps. It frequently grew by 100+ in between trims though, so it wasn't even close to keeping up. I don't know if the root cause is reference leaks, reference destruction timing, flawed trimming logic, or something else, but I'm confident that this test reproduced the behavior we're seeing in production.

We've been restarting all 1,480 osds one at time in our primary production cluster every two weeks for the last three months to work around this problem. If we don't do that, osds eventually get to a point where they trim hundreds of thousands of maps at once, go unresponsive while that trim is in process, and get marked down by their peers. This gets ugly in a hurry if we happen to have multiple osds do this at the same time in multiple failure domains.

#69 Updated by Steve Taylor almost 3 years ago

I just performed a simple test where I manually created and deleted a single rbd snapshot and waited 15 minutes or so to see if any trimming happened in osd.0's osd map cache. The number of maps increased to 2,126 and then 2,127 when I created and deleted the snapshot, and after 15 minutes nothing had trimmed.

I don't know if this is interesting, but it appears to reproduce the behavior and the logs are much smaller.

https://api.access.redhat.com/rs/cases/01549888/attachments/47d2d757-ece6-4d23-a46d-781fc524a22b
https://api.access.redhat.com/rs/cases/01549888/attachments/d71b38d8-ba2d-4c31-a46a-697aa12fc765

#70 Updated by Kefu Chai almost 3 years ago

Steve, the logs are helpful.

but I'm confident that this test reproduced the behavior we're seeing in production.

agreed.

by inspecting the ceph-mon.mon-eng-05-03.log, the monitor kept 44 osdmaps at last. so the monitor is trimming as expected.

2016-03-03 11:53:54.474554 7f9b76205700 10 mon.mon-eng-05-03@0(leader).paxosservice(osdmap 6108..6322) trim from 6108 to 6278

and there was a guy (i can not tell if it was an OSD or a monitor, because debug-ms=1 was not enabled) kept sending osdmap message whose oldest_map is 2906

2016-03-03 11:38:27.874424 7f2d80b7f700  3 osd.0 6321 handle_osd_map epochs [6322,6322], i have 6321, src has [2906,6322]

if it's osdmap from the guy whose oldestmap is 2906, the osd in question won't trim its osdmap cache. because 2906 is way too small in comparison to the oldest_map cached in osd.0.

but occasionally, another guys sends an osdmap message with larger oldest_map. and at the very beginning, osd.0 is able to keep up with the increasing number of osdmap by trimming up to the oldest_map.

2016-03-03 09:58:05.251103 7f2d80b7f700  3 osd.0 2952 handle_osd_map epochs [2953,2953], i have 2952, src has [2906,2953]
2016-03-03 09:58:05.251109 7f2d80b7f700 10 osd.0 2952 handle_osd_map  got inc map for epoch 2953
2016-03-03 09:58:05.252228 7f2d80b7f700 20 osd.0 2952  removing old osdmap epoch 2904
2016-03-03 09:58:05.252246 7f2d80b7f700 20 osd.0 2952  removing old osdmap epoch 2905

and

2016-03-03 10:29:38.296907 7f2d82382700  3 osd.0 3095 handle_osd_map epochs [3095,3096], i have 3095, src has [2968,3096]
2016-03-03 10:29:38.296918 7f2d82382700 10 osd.0 3095 handle_osd_map  got inc map for epoch 3096
2016-03-03 10:29:38.297767 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2955
2016-03-03 10:29:38.297779 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2956
2016-03-03 10:29:38.297783 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2957
2016-03-03 10:29:38.297786 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2958
2016-03-03 10:29:38.297790 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2959
2016-03-03 10:29:38.297802 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2960
2016-03-03 10:29:38.297806 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2961
2016-03-03 10:29:38.297810 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2962
2016-03-03 10:29:38.297813 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2963
2016-03-03 10:29:38.297816 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2964
2016-03-03 10:29:38.297820 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2965
2016-03-03 10:29:38.297823 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2966
2016-03-03 10:29:38.297827 7f2d82382700 20 osd.0 3095  removing old osdmap epoch 2967

but it fails to catch up with the pace at last

2016-03-03 10:35:43.100400 7f2d82382700  3 osd.0 3396 handle_osd_map epochs [3397,3397], i have 3396, src has [3219,3397]
2016-03-03 10:35:43.100411 7f2d82382700 10 osd.0 3396 handle_osd_map  got inc map for epoch 3397
2016-03-03 10:35:43.101183 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2968
2016-03-03 10:35:43.101198 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2969
2016-03-03 10:35:43.101206 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2970
2016-03-03 10:35:43.101213 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2971
2016-03-03 10:35:43.101220 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2972
2016-03-03 10:35:43.101226 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2973
2016-03-03 10:35:43.101233 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2974
2016-03-03 10:35:43.101239 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2975
2016-03-03 10:35:43.101244 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2976
2016-03-03 10:35:43.101250 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2977
2016-03-03 10:35:43.101257 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2978
2016-03-03 10:35:43.101264 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2979
2016-03-03 10:35:43.101271 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2980
2016-03-03 10:35:43.101278 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2981
2016-03-03 10:35:43.101284 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2982
2016-03-03 10:35:43.101291 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2983
2016-03-03 10:35:43.101299 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2984
2016-03-03 10:35:43.101306 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2985
2016-03-03 10:35:43.101313 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2986
2016-03-03 10:35:43.101320 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2987
2016-03-03 10:35:43.101326 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2988
2016-03-03 10:35:43.101333 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2989
2016-03-03 10:35:43.101339 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2990
2016-03-03 10:35:43.101346 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2991
2016-03-03 10:35:43.101352 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2992
2016-03-03 10:35:43.101359 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2993
2016-03-03 10:35:43.101365 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2994
2016-03-03 10:35:43.101372 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2995
2016-03-03 10:35:43.101379 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2996
2016-03-03 10:35:43.101386 7f2d82382700 20 osd.0 3396  removing old osdmap epoch 2997

as it turns out the maximum number of each trimming is 30. as Steve put

It frequently grew by 100+ in between trims though

so questions

  1. who was sending the oldest_map=2906, "debug-ms=1" would help with this? Steve, could you help on this?
  2. why we can only trim 30 osdmaps each time? Steve, could you post the output of
    ceph daemon /path/to/osd.0.asok config show
    

meanwhile, i will continue the investigation.

if i fail to find out the root cause, our last resort would be #13990-26 with http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-hammer-13990/pool/main/c/ceph/.

#71 Updated by Steve Taylor almost 3 years ago

New logs with debug-ms=1:

https://api.access.redhat.com/rs/cases/01549888/attachments/04e45b4a-6ae7-4b6f-9347-741fa977e8f2
https://api.access.redhat.com/rs/cases/01549888/attachments/d9fcd2ae-1ad6-4c61-b558-1c3745d9fbd5

I created and deleted two more rbd snapshots. The number of cached osd maps on osd.0 is now 2,131.

stc@str-eng-05-04:~$ sudo ceph daemon /var/run/ceph/ceph-osd.0.asok config show {
"name": "osd.0",
"cluster": "ceph",
"debug_none": "0\/5",
"debug_lockdep": "0\/1",
"debug_context": "0\/1",
"debug_crush": "1\/1",
"debug_mds": "1\/5",
"debug_mds_balancer": "1\/5",
"debug_mds_locker": "1\/5",
"debug_mds_log": "1\/5",
"debug_mds_log_expire": "1\/5",
"debug_mds_migrator": "1\/5",
"debug_buffer": "0\/1",
"debug_timer": "0\/1",
"debug_filer": "0\/1",
"debug_striper": "0\/1",
"debug_objecter": "0\/1",
"debug_rados": "0\/5",
"debug_rbd": "0\/5",
"debug_rbd_replay": "0\/5",
"debug_journaler": "0\/5",
"debug_objectcacher": "0\/5",
"debug_client": "0\/5",
"debug_osd": "20\/20",
"debug_optracker": "0\/5",
"debug_objclass": "0\/5",
"debug_filestore": "1\/3",
"debug_keyvaluestore": "1\/3",
"debug_journal": "1\/3",
"debug_ms": "1\/1",
"debug_mon": "1\/5",
"debug_monc": "0\/10",
"debug_paxos": "1\/5",
"debug_tp": "0\/5",
"debug_auth": "1\/5",
"debug_crypto": "1\/5",
"debug_finisher": "1\/1",
"debug_heartbeatmap": "1\/5",
"debug_perfcounter": "1\/5",
"debug_rgw": "1\/5",
"debug_civetweb": "1\/10",
"debug_javaclient": "1\/5",
"debug_asok": "1\/5",
"debug_throttle": "1\/1",
"debug_refs": "0\/0",
"debug_xio": "1\/5",
"host": "localhost",
"fsid": "ee53f297-e75d-4d5c-b705-0fa6b473e944",
"public_addr": "10.3.20.163:0\/0",
"cluster_addr": "10.3.21.163:0\/0",
"public_network": "10.3.20.0\/24",
"cluster_network": "10.3.21.0\/24",
"num_client": "1",
"monmap": "",
"mon_host": "10.3.20.169",
"lockdep": "false",
"lockdep_force_backtrace": "false",
"run_dir": "\/var\/run\/ceph",
"admin_socket": "\/var\/run\/ceph\/ceph-osd.0.asok",
"daemonize": "false",
"pid_file": "",
"chdir": "\/",
"max_open_files": "0",
"restapi_log_level": "",
"restapi_base_url": "",
"fatal_signal_handlers": "true",
"log_file": "\/var\/log\/ceph\/ceph-osd.0.log",
"log_max_new": "1000",
"log_max_recent": "10000",
"log_to_stderr": "false",
"err_to_stderr": "true",
"log_to_syslog": "false",
"err_to_syslog": "false",
"log_flush_on_exit": "true",
"log_stop_at_utilization": "0.97",
"clog_to_monitors": "default=true",
"clog_to_syslog": "false",
"clog_to_syslog_level": "info",
"clog_to_syslog_facility": "default=daemon audit=local0",
"mon_cluster_log_to_syslog": "default=false",
"mon_cluster_log_to_syslog_level": "info",
"mon_cluster_log_to_syslog_facility": "daemon",
"mon_cluster_log_file": "default=\/var\/log\/ceph\/ceph.$channel.log cluster=\/var\/log\/ceph\/ceph.log",
"mon_cluster_log_file_level": "info",
"enable_experimental_unrecoverable_data_corrupting_features": "",
"xio_trace_mempool": "false",
"xio_trace_msgcnt": "false",
"xio_trace_xcon": "false",
"xio_queue_depth": "512",
"xio_mp_min": "128",
"xio_mp_max_64": "65536",
"xio_mp_max_256": "8192",
"xio_mp_max_1k": "8192",
"xio_mp_max_page": "4096",
"xio_mp_max_hint": "4096",
"xio_portal_threads": "2",
"key": "",
"keyfile": "",
"keyring": "\/var\/lib\/ceph\/osd\/ceph-0\/keyring",
"heartbeat_interval": "5",
"heartbeat_file": "",
"heartbeat_inject_failure": "0",
"perf": "true",
"ms_type": "simple",
"ms_tcp_nodelay": "true",
"ms_tcp_rcvbuf": "0",
"ms_tcp_prefetch_max_size": "4096",
"ms_initial_backoff": "0.2",
"ms_max_backoff": "15",
"ms_crc_data": "true",
"ms_crc_header": "true",
"ms_die_on_bad_msg": "false",
"ms_die_on_unhandled_msg": "false",
"ms_die_on_old_message": "false",
"ms_die_on_skipped_message": "false",
"ms_dispatch_throttle_bytes": "104857600",
"ms_bind_ipv6": "false",
"ms_bind_port_min": "6800",
"ms_bind_port_max": "7300",
"ms_bind_retry_count": "3",
"ms_bind_retry_delay": "5",
"ms_rwthread_stack_bytes": "1048576",
"ms_tcp_read_timeout": "900",
"ms_pq_max_tokens_per_priority": "16777216",
"ms_pq_min_cost": "65536",
"ms_inject_socket_failures": "0",
"ms_inject_delay_type": "",
"ms_inject_delay_msg_type": "",
"ms_inject_delay_max": "1",
"ms_inject_delay_probability": "0",
"ms_inject_internal_delays": "0",
"ms_dump_on_send": "false",
"ms_dump_corrupt_message_level": "1",
"ms_async_op_threads": "2",
"ms_async_set_affinity": "true",
"ms_async_affinity_cores": "",
"inject_early_sigterm": "false",
"mon_data": "\/var\/lib\/ceph\/mon\/ceph-0",
"mon_initial_members": "mon-eng-05-03",
"mon_sync_fs_threshold": "5",
"mon_compact_on_start": "false",
"mon_compact_on_bootstrap": "false",
"mon_compact_on_trim": "true",
"mon_osd_cache_size": "10",
"mon_tick_interval": "5",
"mon_subscribe_interval": "300",
"mon_delta_reset_interval": "10",
"mon_osd_laggy_halflife": "3600",
"mon_osd_laggy_weight": "0.3",
"mon_osd_adjust_heartbeat_grace": "true",
"mon_osd_adjust_down_out_interval": "true",
"mon_osd_auto_mark_in": "false",
"mon_osd_auto_mark_auto_out_in": "true",
"mon_osd_auto_mark_new_in": "true",
"mon_osd_down_out_interval": "300",
"mon_osd_down_out_subtree_limit": "rack",
"mon_osd_min_up_ratio": "0.3",
"mon_osd_min_in_ratio": "0.3",
"mon_osd_max_op_age": "32",
"mon_osd_max_split_count": "32",
"mon_osd_allow_primary_temp": "false",
"mon_osd_allow_primary_affinity": "false",
"mon_stat_smooth_intervals": "2",
"mon_lease": "5",
"mon_lease_renew_interval": "3",
"mon_lease_ack_timeout": "10",
"mon_clock_drift_allowed": "0.05",
"mon_clock_drift_warn_backoff": "5",
"mon_timecheck_interval": "300",
"mon_accept_timeout": "10",
"mon_pg_create_interval": "30",
"mon_pg_stuck_threshold": "300",
"mon_pg_warn_min_per_osd": "30",
"mon_pg_warn_max_per_osd": "300",
"mon_pg_warn_max_object_skew": "10",
"mon_pg_warn_min_objects": "10000",
"mon_pg_warn_min_pool_objects": "1000",
"mon_cache_target_full_warn_ratio": "0.66",
"mon_osd_full_ratio": "0.95",
"mon_osd_nearfull_ratio": "0.85",
"mon_allow_pool_delete": "true",
"mon_globalid_prealloc": "10000",
"mon_osd_report_timeout": "900",
"mon_force_standby_active": "true",
"mon_warn_on_old_mons": "true",
"mon_warn_on_legacy_crush_tunables": "true",
"mon_warn_on_osd_down_out_interval_zero": "true",
"mon_warn_on_cache_pools_without_hit_sets": "true",
"mon_min_osdmap_epochs": "500",
"mon_max_pgmap_epochs": "500",
"mon_max_log_epochs": "500",
"mon_max_mdsmap_epochs": "500",
"mon_max_osd": "10000",
"mon_probe_timeout": "2",
"mon_slurp_timeout": "10",
"mon_slurp_bytes": "262144",
"mon_client_bytes": "104857600",
"mon_daemon_bytes": "419430400",
"mon_max_log_entries_per_event": "4096",
"mon_reweight_min_pgs_per_osd": "10",
"mon_reweight_min_bytes_per_osd": "104857600",
"mon_health_data_update_interval": "60",
"mon_health_to_clog": "true",
"mon_health_to_clog_interval": "3600",
"mon_health_to_clog_tick_interval": "60",
"mon_data_avail_crit": "5",
"mon_data_avail_warn": "30",
"mon_data_size_warn": "16106127360",
"mon_config_key_max_entry_size": "4096",
"mon_sync_timeout": "60",
"mon_sync_max_payload_size": "1048576",
"mon_sync_debug": "false",
"mon_sync_debug_leader": "-1",
"mon_sync_debug_provider": "-1",
"mon_sync_debug_provider_fallback": "-1",
"mon_inject_sync_get_chunk_delay": "0",
"mon_osd_min_down_reporters": "1",
"mon_osd_min_down_reports": "3",
"mon_osd_force_trim_to": "0",
"mon_mds_force_trim_to": "0",
"crushtool": "crushtool",
"mon_debug_dump_transactions": "false",
"mon_debug_dump_location": "\/var\/log\/ceph\/ceph-osd.0.tdump",
"mon_inject_transaction_delay_max": "10",
"mon_inject_transaction_delay_probability": "0",
"mon_sync_provider_kill_at": "0",
"mon_sync_requester_kill_at": "0",
"mon_force_quorum_join": "false",
"mon_keyvaluedb": "leveldb",
"mon_debug_unsafe_allow_tier_with_nonempty_snaps": "false",
"paxos_stash_full_interval": "25",
"paxos_max_join_drift": "10",
"paxos_propose_interval": "1",
"paxos_min_wait": "0.05",
"paxos_min": "500",
"paxos_trim_min": "250",
"paxos_trim_max": "500",
"paxos_service_trim_min": "250",
"paxos_service_trim_max": "500",
"paxos_kill_at": "0",
"clock_offset": "0",
"auth_cluster_required": "cephx",
"auth_service_required": "cephx",
"auth_client_required": "cephx",
"auth_supported": "",
"cephx_require_signatures": "false",
"cephx_cluster_require_signatures": "false",
"cephx_service_require_signatures": "false",
"cephx_sign_messages": "true",
"auth_mon_ticket_ttl": "43200",
"auth_service_ticket_ttl": "3600",
"auth_debug": "false",
"mon_client_hunt_interval": "3",
"mon_client_ping_interval": "10",
"mon_client_ping_timeout": "30",
"mon_client_hunt_interval_backoff": "2",
"mon_client_hunt_interval_max_multiple": "10",
"mon_client_max_log_entries_per_message": "1000",
"mon_max_pool_pg_num": "65536",
"mon_pool_quota_warn_threshold": "0",
"mon_pool_quota_crit_threshold": "0",
"client_cache_size": "16384",
"client_cache_mid": "0.75",
"client_use_random_mds": "false",
"client_mount_timeout": "300",
"client_tick_interval": "1",
"client_trace": "",
"client_readahead_min": "131072",
"client_readahead_max_bytes": "0",
"client_readahead_max_periods": "4",
"client_snapdir": ".snap",
"client_mountpoint": "\/",
"client_notify_timeout": "10",
"osd_client_watch_timeout": "30",
"client_caps_release_delay": "5",
"client_quota": "false",
"client_oc": "true",
"client_oc_size": "209715200",
"client_oc_max_dirty": "104857600",
"client_oc_target_dirty": "8388608",
"client_oc_max_dirty_age": "5",
"client_oc_max_objects": "1000",
"client_debug_force_sync_read": "false",
"client_debug_inject_tick_delay": "0",
"client_max_inline_size": "4096",
"client_inject_release_failure": "false",
"fuse_use_invalidate_cb": "false",
"fuse_allow_other": "true",
"fuse_default_permissions": "true",
"fuse_big_writes": "true",
"fuse_atomic_o_trunc": "true",
"fuse_debug": "false",
"fuse_multithreaded": "true",
"client_try_dentry_invalidate": "true",
"client_die_on_failed_remount": "true",
"client_check_pool_perm": "true",
"crush_location": "",
"objecter_tick_interval": "5",
"objecter_timeout": "10",
"objecter_inflight_op_bytes": "104857600",
"objecter_inflight_ops": "1024",
"objecter_completion_locks_per_session": "32",
"objecter_inject_no_watch_ping": "false",
"journaler_allow_split_entries": "true",
"journaler_write_head_interval": "15",
"journaler_prefetch_periods": "10",
"journaler_prezero_periods": "5",
"journaler_batch_interval": "0.001",
"journaler_batch_max": "0",
"mds_data": "\/var\/lib\/ceph\/mds\/ceph-0",
"mds_max_file_size": "4398046511104",
"mds_cache_size": "100000",
"mds_cache_mid": "0.7",
"mds_max_file_recover": "32",
"mds_mem_max": "1048576",
"mds_dir_max_commit_size": "10",
"mds_decay_halflife": "5",
"mds_beacon_interval": "4",
"mds_beacon_grace": "15",
"mds_enforce_unique_name": "true",
"mds_blacklist_interval": "1440",
"mds_session_timeout": "60",
"mds_revoke_cap_timeout": "60",
"mds_recall_state_timeout": "60",
"mds_freeze_tree_timeout": "30",
"mds_session_autoclose": "300",
"mds_health_summarize_threshold": "10",
"mds_reconnect_timeout": "45",
"mds_tick_interval": "5",
"mds_dirstat_min_interval": "1",
"mds_scatter_nudge_interval": "5",
"mds_client_prealloc_inos": "1000",
"mds_early_reply": "true",
"mds_default_dir_hash": "2",
"mds_log": "true",
"mds_log_skip_corrupt_events": "false",
"mds_log_max_events": "-1",
"mds_log_events_per_segment": "1024",
"mds_log_segment_size": "0",
"mds_log_max_segments": "30",
"mds_log_max_expiring": "20",
"mds_bal_sample_interval": "3",
"mds_bal_replicate_threshold": "8000",
"mds_bal_unreplicate_threshold": "0",
"mds_bal_frag": "false",
"mds_bal_split_size": "10000",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "10000",
"mds_bal_split_bits": "3",
"mds_bal_merge_size": "50",
"mds_bal_merge_rd": "1000",
"mds_bal_merge_wr": "1000",
"mds_bal_interval": "10",
"mds_bal_fragment_interval": "5",
"mds_bal_idle_threshold": "0",
"mds_bal_max": "-1",
"mds_bal_max_until": "-1",
"mds_bal_mode": "0",
"mds_bal_min_rebalance": "0.1",
"mds_bal_min_start": "0.2",
"mds_bal_need_min": "0.8",
"mds_bal_need_max": "1.2",
"mds_bal_midchunk": "0.3",
"mds_bal_minchunk": "0.001",
"mds_bal_target_removal_min": "5",
"mds_bal_target_removal_max": "10",
"mds_replay_interval": "1",
"mds_shutdown_check": "0",
"mds_thrash_exports": "0",
"mds_thrash_fragments": "0",
"mds_dump_cache_on_map": "false",
"mds_dump_cache_after_rejoin": "false",
"mds_verify_scatter": "false",
"mds_debug_scatterstat": "false",
"mds_debug_frag": "false",
"mds_debug_auth_pins": "false",
"mds_debug_subtrees": "false",
"mds_kill_mdstable_at": "0",
"mds_kill_export_at": "0",
"mds_kill_import_at": "0",
"mds_kill_link_at": "0",
"mds_kill_rename_at": "0",
"mds_kill_openc_at": "0",
"mds_kill_journal_at": "0",
"mds_kill_journal_expire_at": "0",
"mds_kill_journal_replay_at": "0",
"mds_journal_format": "1",
"mds_kill_create_at": "0",
"mds_inject_traceless_reply_probability": "0",
"mds_wipe_sessions": "false",
"mds_wipe_ino_prealloc": "false",
"mds_skip_ino": "0",
"max_mds": "1",
"mds_standby_for_name": "",
"mds_standby_for_rank": "-1",
"mds_standby_replay": "false",
"mds_enable_op_tracker": "true",
"mds_op_history_size": "20",
"mds_op_history_duration": "600",
"mds_op_complaint_time": "30",
"mds_op_log_threshold": "5",
"mds_snap_min_uid": "0",
"mds_snap_max_uid": "65536",
"mds_verify_backtrace": "1",
"mds_action_on_write_error": "1",
"osd_compact_leveldb_on_mount": "false",
"osd_max_backfills": "1",
"osd_min_recovery_priority": "0",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_agent_max_ops": "4",
"osd_agent_min_evict_effort": "0.1",
"osd_agent_quantize_effort": "0.1",
"osd_agent_delay_time": "5",
"osd_find_best_info_ignore_history_les": "false",
"osd_agent_hist_halflife": "1000",
"osd_agent_slop": "0.02",
"osd_uuid": "00000000-0000-0000-0000-000000000000",
"osd_data": "\/var\/lib\/ceph\/osd\/ceph-0",
"osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
"osd_journal_size": "10000",
"osd_max_write_size": "90",
"osd_max_pgls": "1024",
"osd_client_message_size_cap": "524288000",
"osd_client_message_cap": "100",
"osd_pg_bits": "6",
"osd_pgp_bits": "6",
"osd_crush_chooseleaf_type": "1",
"osd_pool_default_crush_rule": "-1",
"osd_pool_default_crush_replicated_ruleset": "0",
"osd_pool_erasure_code_stripe_width": "4096",
"osd_pool_default_size": "3",
"osd_pool_default_min_size": "2",
"osd_pool_default_pg_num": "2048",
"osd_pool_default_pgp_num": "2048",
"osd_pool_default_erasure_code_directory": "\/usr\/lib\/ceph\/erasure-code",
"osd_pool_default_erasure_code_profile": "plugin=jerasure technique=reed_sol_van k=2 m=1 ",
"osd_erasure_code_plugins": "jerasure lrc isa",
"osd_allow_recovery_below_min_size": "true",
"osd_pool_default_flags": "0",
"osd_pool_default_flag_hashpspool": "true",
"osd_pool_default_flag_nodelete": "false",
"osd_pool_default_flag_nopgchange": "false",
"osd_pool_default_flag_nosizechange": "false",
"osd_pool_default_hit_set_bloom_fpp": "0.05",
"osd_pool_default_cache_target_dirty_ratio": "0.4",
"osd_pool_default_cache_target_full_ratio": "0.8",
"osd_pool_default_cache_min_flush_age": "0",
"osd_pool_default_cache_min_evict_age": "0",
"osd_hit_set_min_size": "1000",
"osd_hit_set_max_size": "100000",
"osd_hit_set_namespace": ".ceph-internal",
"osd_tier_default_cache_mode": "writeback",
"osd_tier_default_cache_hit_set_count": "4",
"osd_tier_default_cache_hit_set_period": "1200",
"osd_tier_default_cache_hit_set_type": "bloom",
"osd_tier_default_cache_min_read_recency_for_promote": "1",
"osd_map_dedup": "true",
"osd_map_max_advance": "10",
"osd_map_cache_size": "20",
"osd_map_message_max": "100",
"osd_map_share_max_epochs": "100",
"osd_inject_bad_map_crc_probability": "0",
"osd_inject_failure_on_pg_removal": "false",
"osd_op_threads": "4",
"osd_peering_wq_batch_size": "20",
"osd_op_pq_max_tokens_per_priority": "4194304",
"osd_op_pq_min_cost": "65536",
"osd_disk_threads": "1",
"osd_disk_thread_ioprio_class": "",
"osd_disk_thread_ioprio_priority": "-1",
"osd_recovery_threads": "1",
"osd_recover_clone_overlap": "true",
"osd_op_num_threads_per_shard": "2",
"osd_op_num_shards": "5",
"osd_read_eio_on_bad_digest": "true",
"osd_recover_clone_overlap_limit": "10",
"osd_backfill_scan_min": "64",
"osd_backfill_scan_max": "512",
"osd_op_thread_timeout": "15",
"osd_op_thread_suicide_timeout": "150",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_snap_trim_thread_timeout": "3600",
"osd_snap_trim_thread_suicide_timeout": "36000",
"osd_snap_trim_sleep": "0.5",
"osd_scrub_thread_timeout": "60",
"osd_scrub_thread_suicide_timeout": "60",
"osd_scrub_finalize_thread_timeout": "600",
"osd_scrub_invalid_stats": "true",
"osd_remove_thread_timeout": "3600",
"osd_remove_thread_suicide_timeout": "36000",
"osd_command_thread_timeout": "600",
"osd_age": "0.8",
"osd_age_time": "0",
"osd_command_thread_suicide_timeout": "900",
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "20",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_pg_max_concurrent_snap_trims": "2",
"osd_heartbeat_min_healthy_ratio": "0.33",
"osd_mon_heartbeat_interval": "30",
"osd_mon_report_interval_max": "120",
"osd_mon_report_interval_min": "5",
"osd_pg_stat_report_interval_max": "500",
"osd_mon_ack_timeout": "30",
"osd_default_data_pool_replay_window": "45",
"osd_preserve_trimmed_log": "false",
"osd_auto_mark_unfound_lost": "false",
"osd_recovery_delay_start": "0",
"osd_recovery_max_active": "1",
"osd_recovery_max_single_start": "5",
"osd_recovery_max_chunk": "8388608",
"osd_copyfrom_max_chunk": "8388608",
"osd_push_per_object_cost": "1000",
"osd_max_push_cost": "8388608",
"osd_max_push_objects": "10",
"osd_recovery_forget_lost_objects": "false",
"osd_max_scrubs": "1",
"osd_scrub_begin_hour": "0",
"osd_scrub_end_hour": "24",
"osd_scrub_load_threshold": "40",
"osd_scrub_min_interval": "1.2096e+06",
"osd_scrub_max_interval": "2.4192e+06",
"osd_scrub_interval_randomize_ratio": "0.5",
"osd_scrub_chunk_min": "5",
"osd_scrub_chunk_max": "25",
"osd_scrub_sleep": "1",
"osd_deep_scrub_interval": "1.5552e+07",
"osd_deep_scrub_stride": "1048576",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_scan_list_ping_tp_interval": "100",
"osd_auto_weight": "false",
"osd_class_dir": "\/usr\/lib\/rados-classes",
"osd_open_classes_on_start": "true",
"osd_check_for_log_corruption": "false",
"osd_use_stale_snap": "false",
"osd_rollback_to_cluster_snap": "",
"osd_default_notify_timeout": "30",
"osd_kill_backfill_at": "0",
"osd_pg_epoch_persisted_max_stale": "10",
"osd_min_pg_log_entries": "3000",
"osd_max_pg_log_entries": "10000",
"osd_pg_log_trim_min": "100",
"osd_op_complaint_time": "30",
"osd_command_max_records": "256",
"osd_max_pg_blocked_by": "16",
"osd_op_log_threshold": "5",
"osd_verify_sparse_read_holes": "false",
"osd_debug_drop_ping_probability": "0",
"osd_debug_drop_ping_duration": "0",
"osd_debug_drop_pg_create_probability": "0",
"osd_debug_drop_pg_create_duration": "1",
"osd_debug_drop_op_probability": "0",
"osd_debug_op_order": "false",
"osd_debug_scrub_chance_rewrite_digest": "0",
"osd_debug_verify_snaps_on_info": "false",
"osd_debug_verify_stray_on_activate": "false",
"osd_debug_skip_full_check_in_backfill_reservation": "false",
"osd_debug_reject_backfill_probability": "0",
"osd_debug_inject_copyfrom_error": "false",
"osd_enable_op_tracker": "true",
"osd_num_op_tracker_shard": "32",
"osd_op_history_size": "20",
"osd_op_history_duration": "600",
"osd_target_transaction_size": "30",
"osd_failsafe_full_ratio": "0.97",
"osd_failsafe_nearfull_ratio": "0.9",
"osd_pg_object_context_cache_count": "64",
"osd_debug_pg_log_writeout": "false",
"threadpool_default_timeout": "60",
"threadpool_empty_queue_max_wait": "2",
"leveldb_write_buffer_size": "8388608",
"leveldb_cache_size": "134217728",
"leveldb_block_size": "0",
"leveldb_bloom_size": "0",
"leveldb_max_open_files": "0",
"leveldb_compression": "true",
"leveldb_paranoid": "false",
"leveldb_log": "",
"leveldb_compact_on_mount": "false",
"kinetic_host": "",
"kinetic_port": "8123",
"kinetic_user_id": "1",
"kinetic_hmac_key": "asdfasdf",
"kinetic_use_ssl": "false",
"rocksdb_compact_on_mount": "false",
"rocksdb_write_buffer_size": "0",
"rocksdb_target_file_size_base": "0",
"rocksdb_cache_size": "0",
"rocksdb_block_size": "0",
"rocksdb_bloom_size": "0",
"rocksdb_write_buffer_num": "0",
"rocksdb_background_compactions": "0",
"rocksdb_background_flushes": "0",
"rocksdb_max_open_files": "0",
"rocksdb_compression": "",
"rocksdb_paranoid": "false",
"rocksdb_log": "\/dev\/null",
"rocksdb_level0_file_num_compaction_trigger": "0",
"rocksdb_level0_slowdown_writes_trigger": "0",
"rocksdb_level0_stop_writes_trigger": "0",
"rocksdb_disableDataSync": "true",
"rocksdb_disableWAL": "false",
"rocksdb_num_levels": "0",
"rocksdb_wal_dir": "",
"rocksdb_info_log_level": "info",
"osd_client_op_priority": "63",
"osd_recovery_op_priority": "1",
"osd_recovery_op_warn_multiple": "16",
"osd_mon_shutdown_timeout": "5",
"osd_max_object_size": "107374182400",
"osd_max_object_name_len": "2048",
"osd_max_attr_name_len": "100",
"osd_max_attr_size": "0",
"osd_objectstore": "filestore",
"osd_debug_override_acting_compat": "false",
"osd_bench_small_size_max_iops": "100",
"osd_bench_large_size_max_throughput": "104857600",
"osd_bench_max_block_size": "67108864",
"osd_bench_duration": "30",
"memstore_device_bytes": "1073741824",
"filestore_omap_backend": "leveldb",
"filestore_debug_disable_sharded_check": "false",
"filestore_wbthrottle_enable": "true",
"filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_btrfs_ios_start_flusher": "500",
"filestore_wbthrottle_btrfs_ios_hard_limit": "5000",
"filestore_wbthrottle_btrfs_inodes_start_flusher": "500",
"filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_xfs_ios_start_flusher": "500",
"filestore_wbthrottle_xfs_ios_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "500",
"filestore_wbthrottle_btrfs_inodes_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_hard_limit": "5000",
"filestore_index_retry_probability": "0",
"filestore_debug_inject_read_err": "false",
"filestore_debug_omap_check": "false",
"filestore_omap_header_cache_size": "1024",
"filestore_max_inline_xattr_size": "0",
"filestore_max_inline_xattr_size_xfs": "65536",
"filestore_max_inline_xattr_size_btrfs": "2048",
"filestore_max_inline_xattr_size_other": "512",
"filestore_max_inline_xattrs": "0",
"filestore_max_inline_xattrs_xfs": "10",
"filestore_max_inline_xattrs_btrfs": "10",
"filestore_max_inline_xattrs_other": "2",
"filestore_sloppy_crc": "false",
"filestore_sloppy_crc_block_size": "65536",
"filestore_max_alloc_hint_size": "1048576",
"filestore_max_sync_interval": "5",
"filestore_min_sync_interval": "0.01",
"filestore_btrfs_snap": "true",
"filestore_btrfs_clone_range": "true",
"filestore_zfs_snap": "false",
"filestore_fsync_flushes_journal_data": "false",
"filestore_fiemap": "false",
"filestore_fadvise": "true",
"filestore_xfs_extsize": "true",
"filestore_journal_parallel": "false",
"filestore_journal_writeahead": "false",
"filestore_journal_trailing": "false",
"filestore_queue_max_ops": "50",
"filestore_queue_max_bytes": "104857600",
"filestore_queue_committing_max_ops": "500",
"filestore_queue_committing_max_bytes": "104857600",
"filestore_op_threads": "2",
"filestore_op_thread_timeout": "60",
"filestore_op_thread_suicide_timeout": "180",
"filestore_commit_timeout": "600",
"filestore_fiemap_threshold": "4096",
"filestore_merge_threshold": "10",
"filestore_split_multiple": "2",
"filestore_update_to": "1000",
"filestore_blackhole": "false",
"filestore_fd_cache_size": "128",
"filestore_fd_cache_shards": "16",
"filestore_dump_file": "",
"filestore_kill_at": "0",
"filestore_inject_stall": "0",
"filestore_fail_eio": "true",
"filestore_debug_verify_split": "false",
"journal_dio": "true",
"journal_aio": "true",
"journal_force_aio": "false",
"keyvaluestore_queue_max_ops": "50",
"keyvaluestore_queue_max_bytes": "104857600",
"keyvaluestore_debug_check_backend": "false",
"keyvaluestore_op_threads": "2",
"keyvaluestore_op_thread_timeout": "60",
"keyvaluestore_op_thread_suicide_timeout": "180",
"keyvaluestore_default_strip_size": "4096",
"keyvaluestore_max_expected_write_size": "16777216",
"keyvaluestore_header_cache_size": "4096",
"keyvaluestore_backend": "leveldb",
"journal_max_corrupt_search": "10485760",
"journal_block_align": "true",
"journal_write_header_frequency": "0",
"journal_max_write_bytes": "10485760",
"journal_max_write_entries": "100",
"journal_queue_max_ops": "300",
"journal_queue_max_bytes": "33554432",
"journal_align_min_size": "65536",
"journal_replay_from": "0",
"journal_zero_on_create": "false",
"journal_ignore_corruption": "false",
"journal_discard": "false",
"rados_mon_op_timeout": "0",
"rados_osd_op_timeout": "0",
"rbd_op_threads": "1",
"rbd_op_thread_timeout": "60",
"rbd_non_blocking_aio": "true",
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
"rbd_concurrent_management_ops": "10",
"rbd_balance_snap_reads": "false",
"rbd_localize_snap_reads": "false",
"rbd_balance_parent_reads": "false",
"rbd_localize_parent_reads": "true",
"rbd_readahead_trigger_requests": "10",
"rbd_readahead_max_bytes": "524288",
"rbd_readahead_disable_after_bytes": "52428800",
"rbd_clone_copy_on_read": "false",
"rbd_blacklist_on_break_lock": "true",
"rbd_blacklist_expire_seconds": "0",
"rbd_request_timed_out_seconds": "30",
"rbd_default_format": "1",
"rbd_default_order": "22",
"rbd_default_stripe_count": "0",
"rbd_default_stripe_unit": "0",
"rbd_default_features": "3",
"nss_db_path": "",
"rgw_max_chunk_size": "524288",
"rgw_override_bucket_index_max_shards": "0",
"rgw_bucket_index_max_aio": "8",
"rgw_enable_quota_threads": "true",
"rgw_enable_gc_threads": "true",
"rgw_data": "\/var\/lib\/ceph\/radosgw\/ceph-0",
"rgw_enable_apis": "s3, swift, swift_auth, admin",
"rgw_cache_enabled": "true",
"rgw_cache_lru_size": "10000",
"rgw_socket_path": "",
"rgw_host": "",
"rgw_port": "",
"rgw_dns_name": "",
"rgw_content_length_compat": "false",
"rgw_script_uri": "",
"rgw_request_uri": "",
"rgw_swift_url": "",
"rgw_swift_url_prefix": "swift",
"rgw_swift_auth_url": "",
"rgw_swift_auth_entry": "auth",
"rgw_swift_tenant_name": "",
"rgw_swift_enforce_content_length": "false",
"rgw_keystone_url": "",
"rgw_keystone_admin_token": "",
"rgw_keystone_admin_user": "",
"rgw_keystone_admin_password": "",
"rgw_keystone_admin_tenant": "",
"rgw_keystone_accepted_roles": "Member, admin",
"rgw_keystone_token_cache_size": "10000",
"rgw_keystone_revocation_interval": "900",
"rgw_s3_auth_use_rados": "true",
"rgw_s3_auth_use_keystone": "false",
"rgw_admin_entry": "admin",
"rgw_enforce_swift_acls": "true",
"rgw_swift_token_expiration": "86400",
"rgw_print_continue": "true",
"rgw_remote_addr_param": "REMOTE_ADDR",
"rgw_op_thread_timeout": "600",
"rgw_op_thread_suicide_timeout": "0",
"rgw_thread_pool_size": "100",
"rgw_num_control_oids": "8",
"rgw_num_rados_handles": "1",
"rgw_zone": "",
"rgw_zone_root_pool": ".rgw.root",
"rgw_region": "",
"rgw_region_root_pool": ".rgw.root",
"rgw_default_region_info_oid": "default.region",
"rgw_log_nonexistent_bucket": "false",
"rgw_log_object_name": "%Y-%m-%d-%H-%i-%n",
"rgw_log_object_name_utc": "false",
"rgw_usage_max_shards": "32",
"rgw_usage_max_user_shards": "1",
"rgw_enable_ops_log": "false",
"rgw_enable_usage_log": "false",
"rgw_ops_log_rados": "true",
"rgw_ops_log_socket_path": "",
"rgw_ops_log_data_backlog": "5242880",
"rgw_usage_log_flush_threshold": "1024",
"rgw_usage_log_tick_interval": "30",
"rgw_intent_log_object_name": "%Y-%m-%d-%i-%n",
"rgw_intent_log_object_name_utc": "false",
"rgw_init_timeout": "300",
"rgw_mime_types_file": "\/etc\/mime.types",
"rgw_gc_max_objs": "32",
"rgw_gc_obj_min_wait": "7200",
"rgw_gc_processor_max_time": "3600",
"rgw_gc_processor_period": "3600",
"rgw_s3_success_create_obj_status": "0",
"rgw_resolve_cname": "false",
"rgw_obj_stripe_size": "4194304",
"rgw_extended_http_attrs": "",
"rgw_exit_timeout_secs": "120",
"rgw_get_obj_window_size": "16777216",
"rgw_get_obj_max_req_size": "4194304",
"rgw_relaxed_s3_bucket_names": "false",
"rgw_defer_to_bucket_acls": "",
"rgw_list_buckets_max_chunk": "1000",
"rgw_md_log_max_shards": "64",
"rgw_num_zone_opstate_shards": "128",
"rgw_opstate_ratelimit_sec": "30",
"rgw_curl_wait_timeout_ms": "1000",
"rgw_copy_obj_progress": "true",
"rgw_copy_obj_progress_every_bytes": "1048576",
"rgw_data_log_window": "30",
"rgw_data_log_changes_size": "1000",
"rgw_data_log_num_shards": "128",
"rgw_data_log_obj_prefix": "data_log",
"rgw_replica_log_obj_prefix": "replica_log",
"rgw_bucket_quota_ttl": "600",
"rgw_bucket_quota_soft_threshold": "0.95",
"rgw_bucket_quota_cache_size": "10000",
"rgw_expose_bucket": "false",
"rgw_frontends": "fastcgi, civetweb port=7480",
"rgw_user_quota_bucket_sync_interval": "180",
"rgw_user_quota_sync_interval": "86400",
"rgw_user_quota_sync_idle_users": "false",
"rgw_user_quota_sync_wait_time": "86400",
"rgw_multipart_min_part_size": "5242880",
"rgw_olh_pending_timeout_sec": "3600",
"rgw_user_max_buckets": "1000",
"mutex_perf_counter": "false",
"throttler_perf_counter": "true",
"internal_safe_to_start_threads": "true"
}

#72 Updated by Kefu Chai almost 3 years ago

30 is the osd_target_transaction_size. i am preparing a changeset to force the OSD trim osdmaps even if the incoming osdmap does not have new maps but its oldest_map is great enough to trigger the trimming.

#73 Updated by Steve Taylor almost 3 years ago

I spent some time looking at the OSD code yesterday. If I'm reading OSD::handle_osd_map() correctly, then the problem here seems to be that our OSDs are getting into a state where superblock.oldest_map is the same as m->oldest_map because all OSDs are stuck with the same oldest map and therefore prevent each other from trimming their caches unnecessarily.

In the log I sent 6 days ago I can see incoming osdmaps that do have new maps but trimming is still not triggered. It's only when m->oldest_map > superblock.oldest_map that anything gets trimmed.

Once all of the OSDs have the same superblock.oldest_map, it looks like they will all fail to trim until one of them gets a full map from a mon and starts sending updates to its peers. At that point they all start trimming because they recognize that they no longer need to hold onto those old epochs.

Would it be helpful to introduce a condition where an OSD could request a full map from a mon in the event that it's keeping around more than osd_map_cache_size+osd_target_transaction_size maps or something? It looks to me like these OSDs are keeping way too many maps cached because all of the other OSDs are keeping them cached and it takes one of them getting a full map from a mon to start trimming.

#74 Updated by Kefu Chai almost 3 years ago

all OSDs are stuck with the same oldest map and therefore prevent each other from trimming their caches unnecessarily.

i tend to agree with you, Steve. but not all OSDs, but quite a few of them. the larger the cluster is, the more OSDs are likely to get stuck by a small "m->oldest_map" from its peers. and the larger the cluster is, the smaller "m->oldest_map" an OSD is likely to get. in short, the larger cluster, the more OSD gets impacted, and the more stale osdmaps they are holding.

that's why i failed to reproduce your issue with my mini cluster.

until one of them gets a full map from a mon and starts sending updates to its peers.

i'd say it does not matter if it is a full map or an incremental map. as long as

  • it has new maps that can be consumed by this OSD (i.e. no skip maps, or corrupted maps)
  • m->oldest_map > superblock.oldest_map

but the probability of fulfilling both these conditions are lower if the scale of the cluster is getting larger. OSD nodes are generous when sharing osdmaps with each other, so they can keep up with the latest osdmap from monitor. but the shared osdmap messages do not always come with an updated "oldest_map" from monitor.

  1. the monitor as the source of the updated osdmaps, sends the first osdmap message, in which, oldest_map = first_committed.
  2. while the 1st OSD forwarding the osdmap does not send the oldest_map as it is, instead, because the PGs served by this OSD still cache some stale osdmaps, the "oldest_map" in the forwarded osdmap message is greater than the one in the originally received message from monitor
  3. when the updated osdmap message is circulated around in the cluster, the oldest_map in the osdmap message keeps increasing. that's why we have following log line, in which, the src OSD was holding osdmap[2906..6322].
    2016-03-03 11:38:27.874424 7f2d80b7f700  3 osd.0 6321 handle_osd_map epochs [6322,6322], i have 6321, src has [2906,6322]
    
  4. and the condition is getting worse in a large cluster. and a small "osd_target_transaction_size" slows the trimming down even more. it hinders the osd from trimming all the stale maps even if it is lucky enough to get an osdmap message with a great enough "oldest_map" from an OSD which is closer to monitor.

Would it be helpful to introduce a condition where an OSD could request a full map from a mon in the event that it's keeping around more than osd_map_cache_size+osd_target_transaction_size maps or something

it would be helpful, but it could put much more workload on the shoulder of the monitor. the ceph cluster is decentralized, so OSDs can gossip with each other to circulate the interesting news around by themselves, even if there are only a few of them have the news from monitor at the very beginning. this design helps to improve the availability of cluster as a whole.

but yes, it's a signal that the osd trimming fails to keep up with the speed of its accumulation. the idea i put in #13990-72 is aggressive enough to help with this case, as the chance of follow scenario is quite low:

  1. great enough m->oldest_map
  2. but the message gets dropped because it does not comes with new maps

so instead of committing a single trimming transaction each time a new osdmap message is handled, maybe we should send multiple transactions to remove the stale maps?

#75 Updated by Kefu Chai almost 3 years ago

  • Status changed from New to Need Review

#76 Updated by Steve Taylor almost 3 years ago

I made the change locally and deployed a custom build to the OSD hosts in my test environment. I restarted all of the OSDs in the cluster a few times until the osdmap cache on osd.0 got down to 20 maps, which is the configured osdmap cache size. Then I re-ran my test from last week that caused that osdmap cache to get above 2,000 maps. The cache climbed to 370 maps before it started trimming, but it almost immediately trimmed to 28 while the test was still running and creating new map epochs. At that point it had been running for less than 10 minutes. It then began to climb again and got just below 500 before it trimmed down to 320. It's been running for 45 minutes now and has been bouncing around between 300 and 500 ever since that point.

It definitely looks better than my previous test last week, but it's still not really getting close to the cache size of 20. Is this expected? It does appear to be keeping up where it is, but it's still way above the cache size.

#77 Updated by Kefu Chai almost 3 years ago

thank you Steve, i appreciate your help and testing. it's expected. that change is simpler than the one i was thinking about yesterday, but it still sucks in two perspectives:

  1. it can starve the normal user osd ops when trimming a huge amount of osdmaps.
  2. it does not trim all stale maps, instead it trims half of them at most if the number of stale maps exceeds the osd_target_transaction_size
also
  1. monitor will not trim the osdmaps referenced by dirty PGs. and the status of PGs are reported by OSDs at the interval defined by "conf.osd_mon_report_interval_min", "conf.osd_mon_report_interval_max" and "osd_pg_stat_report_interval_max", in your case, it's 5, 120 and 500 seconds respectively. if the cluster is experiencing frequent changes. that delay could be relatively significant also. not to mention the delay of monitor paxos. for the minimal of last epoch pg clean, see the output of
    $ ceph pg dump --format=json-pretty | grep min_last_epoch_clean
    $ ceph pg dump --format=json-pretty | grep reported_epoch
    
  2. due to the nature of osdmap circulation (the gossip protocol), an OSD has less chance to get an MOSDMap message triggering trimming if it is topologically far from the monitor. and the larger the cluster, the more chance an OSD is "far" from a monitor.

i am preparing another change, which still suffers from two factors above. but it will trim all stale osdmaps without interfering with the normal osd ops too much.

#78 Updated by Kefu Chai almost 3 years ago

steve, the new changeset in https://github.com/ceph/ceph/pull/8017 is targeting master.

and its backport to hammer is
https://github.com/tchaikov/ceph/commit/70a1c6df77e2977d2b2fded738a7b80a5cbbaa1b , in case you want to try it out.

#79 Updated by Sage Weil almost 3 years ago

Is an orthogonal fix to add a trim_thru epoch (instead of inferring it from oldest_map) to the message? That way one OSD's failure to trim won't prevent it from sharing with others that they are allowed to trim...

Or, we can make the oldest_map populated with the trim_thru epoch, regardless of what has been locally trimmed. THe field name becomes a bit confusing, but we can rename it.

#80 Updated by Steve Taylor almost 3 years ago

Kefu,

I built your new change earlier today and deployed it to my test cluster. It's been running for about 6 hours now with results similar to what I saw with the previous iteration. The noticeable difference is that osd.0 typically seems to trim by larger numbers, which matches your description of the fix. It still almost always gets up to ~500 prior to trimming, but it typically trims down to 150 or so now instead of 300. I've seen it trim all the way down to 20 though.

I decided to try injecting an osd_map_cache_size of 1,000 to all of my OSDs to see what effect it would have. Pretty much to validate that the trimming behavior I'm seeing isn't due to the cache size of 20 being ignored. I didn't see any effect immediately (osd.0's cache was still trimming at right around 500 every time), but eventually it got up to 1,000 and trimmed from there. I injected 20 again and it immediately went back to the old behavior.

We're still obviously not seeing the cache get down to its target size very often, but this is a huge improvement over what we've been experiencing. I'm not sure what to expect in our production cluster with 1,480 OSDs, but so far this is looking very acceptable on my 88-OSD test cluster with 16 threads on a single client creating and deleting RBD snapshots as quickly as possible. This should be relatively worse than the production load, but it's also a smaller cluster, so this problem should affect it less.

I'm happy to continue testing if you have additional changes based on Sage's comments or whatever. We aren't building our own Ceph binaries for production, so we'll likely have to wait for 0.94.7 before we get this change into QA and start testing it for real.

#81 Updated by Kefu Chai almost 3 years ago

Or, we can make the oldest_map populated with the trim_thru epoch, regardless of what has been locally trimmed. THe field name becomes a bit confusing, but we can rename it.

thanks, @sage. will give it a try.

@steve, will update this pr once this fix is ready. thanks.

#82 Updated by Kefu Chai almost 3 years ago

i updated https://github.com/ceph/ceph/pull/8017 with a commit to populate the max_oldest_map as sage suggested at #13990-79. will prepare a hammer backport later on.

#83 Updated by Kefu Chai almost 3 years ago

steve, i prepared a hammer backport for the change at https://github.com/tchaikov/ceph/commit/74aa13a54f4a0e82b3677bc3890eb345ea5a3e99, in case you want to test it.

#84 Updated by Steve Taylor almost 3 years ago

I pulled down the latest changes and deployed a new build to my test cluster today. The cache trim doesn't appear to be keeping up as well as it did in the last test. I deployed the new build to all three of my OSD hosts and restarted all of the OSDs, one host at a time. I left the client load in place throughout this entire process.

By the time the new build was running on all 88 OSDs in my cluster, osd.0's osdmap cache already had over 30,000 cached maps (size is still 20), but it trimmed a bit at a time until it was down to about 850 after about 20 minutes or so. It then began to climb again until I restarted osd.0 a second time as its cache approached 2,000 maps. It then trimmed down to about 300 and stayed around the 300-500 range I had seen previously for several minutes, at which point it started climbing again. It's been going for about an hour now since the deployment and is at 1,500 and still climbing. So far it doesn't look to me like this build is keeping up as well with its cache trimming.

Do you want some logs? I haven't generated debug logs this time around, but I can if they're useful.

#85 Updated by Sage Weil almost 3 years ago

  • Status changed from Need Review to Pending Backport

#86 Updated by Kefu Chai almost 3 years ago

steve,

i'd like to have the log on monitor and osd

  • on monitor: debug-paxos=10
  • on osd: debug-ms=1

thank you in advance.

and i am using

osd_pg_epoch_persisted_max_stale=10
osd_map_cache_size=20
osd_map_max_advance=10
osd_mon_report_interval_min=10
osd_pg_stat_report_interval_max=10
paxos_service_trim_min=10
mon_min_osdmap_epochs=20

for speeding up the trimming in my test cluster.

#87 Updated by Nathan Cutler almost 3 years ago

  • Copied to Backport #15193: hammer: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added

#88 Updated by Steve Taylor almost 3 years ago

I've injected those config settings to the mon and all osds in my test cluster, as well as the debug log settings on the mon and osd.0. I'll let it run for a while and send you the logs.

This build has been running on my test cluster for about 18 hours now, and osd.0 has just under 50,000 cached maps. It stayed between 300 and 500 for several days with the previous build. I'll give you another update when I send the logs.

#90 Updated by Kefu Chai almost 3 years ago

the monitor was trimming fine. the number of osdmaps was under 30.

$ grep trim ceph-mon.mon-eng-05-03.log | grep osdmap | cut -d' ' -f 6 | sed s'/)//' | awk -F"."  '{print $3-$1}' | tail
24
28
32
32
24
28
32
32
24
29

and on the osd side, the received osdmap message's oldest_map is very closed to newest_map (30 apart). which is expected:

$ grep "<==" ceph-osd.0.log | grep osd_map | tail
0+0 (522977030 0 0) 0xd1e86c0 con 0xee67180
2016-03-18 13:14:23.347387 7ff527059700  1 -- 10.3.21.163:6889/131872 <== osd.45 10.3.21.164:6855/184211 14145 ==== osd_map(566881..566881 src has 566851..566881) v3 ==== 3181+0+0 (1507650724 0 0) 0x116fe6c0 con 0xed63600
2016-03-18 13:14:23.364797 7ff527059700  1 -- 10.3.21.163:6889/131872 <== osd.44 10.3.21.164:6852/183870 30850 ==== osd_map(566880..566881 src has 566851..566881) v3 ==== 6314+0+0 (603421176 0 0) 0x10e9c400 con 0xed63760
2016-03-18 13:14:23.583016 7ff527059700  1 -- 10.3.21.163:6889/131872 <== osd.61 10.3.21.164:6866/185377 17799 ==== osd_map(566881..566881 src has 566851..566881) v3 ==== 3181+0+0 (1507650724 0 0) 0x11496d80 con 0xed634a0
2016-03-18 13:14:23.635387 7ff527059700  1 -- 10.3.21.163:6889/131872 <== osd.64 10.3.21.165:6809/157635 12896 ==== osd_map(566880..566881 src has 566851..566881) v3 ==== 6314+0+0 (603421176 0 0) 0x122a18c0 con 0xe1692e0

also i plotted the time sequence of the newest_map - oldest_map

and the time sequence of oldest_map

both of them look sane.

steve, could you enable debug-osd=10 and upload the osd log again. many thanks.

#91 Updated by Steve Taylor almost 3 years ago

OSD log with debug-osd=10 uploaded at https://api.access.redhat.com/rs/cases/01549888/attachments/2b0d52d6-af85-422c-9612-fc24b6fedd5e. That OSD is now holding over 266,000 maps in its cache. I grew from 261,000 to 264,000 during the period covered by the log.

#92 Updated by Kefu Chai almost 3 years ago

this one is different from the previous symptoms. the osd never trimmed the osdmaps. the local oldest_map written to super block was always 501056, even if the oldest_map in the received MOSDMap message was much larger than this number. for example, following message didn't trigger the trimming at all.

2016-03-21 07:43:36.536916 7ff527059700  1 -- 10.3.21.163:6889/131872 <== osd.62 10.3.21.164:6862/184799 94296 ==== osd_map(762412..762412 src has 762391.
.762412) v3 ==== 3485+0+0 (729543111 0 0) 0x12fa6d00 con 0xed62f20
2016-03-21 07:43:36.536947 7ff527059700 10 osd.0 762411 do_waiters -- start
2016-03-21 07:43:36.536950 7ff527059700 10 osd.0 762411 do_waiters -- finish
2016-03-21 07:43:36.537119 7ff527059700  3 osd.0 762411 handle_osd_map epochs [762412,762412], i have 762411, src has [762391,762412]
2016-03-21 07:43:36.537125 7ff527059700 10 osd.0 762411 handle_osd_map  got inc map for epoch 762412
2016-03-21 07:43:36.538143 7ff527059700 10 osd.0 762411  advance to epoch 762412 (<= newest 762412)
2016-03-21 07:43:36.538158 7ff527059700  7 osd.0 762412 advance_map epoch 762412
2016-03-21 07:43:36.538165 7ff527059700 10 osd.0 762412 write_superblock sb(ee53f297-e75d-4d5c-b705-0fa6b473e944 osd.0 b0fefab9-00e3-442f-b4c6-eaece569a258 e762412 [501056,762412] lci=[500920,762412])

so, i suspect that we are leaking osdmaps somewhere in osd. could you follow the 3rd,4th,5th and 6th steps of #13990-26? i am copying it as following with some editings:

  1. .
  2. .
  3. replace an OSD with the test build (http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-hammer-13990/pool/main/c/ceph/), i'd suggest install the deb package using "dpkg" manually, instead of using ceph-deploy, as the ceph cli will fail to work with this test build (see #13990-34 and #13990-35)
  4. watch the output of monitor until it trims the osdmaps. wait for 10 minutes, and collect the log of the OSD with test build.
  5. upload the log file,
  6. if the log file is too large, we can preprocess it with the attached python script
    > python grep.py /var/log/ceph/osd.1.log 14 > /tmp/bt.log
    > 

    in which, the 13505 is the epoch of the oldest osdmap in objectstore. to figure out its epoch number:
    > $ find current/meta/ . -name 'osdmap*' -exec basename {} \; | sort | uniq | head -n1
    > osdmap.14__0_6417091C__none
    > 

    here "14" is the epoch number of the oldest osdmap in cache.

the test build includes the latest you were testing last time, and will also print verbose log when acquiring/releasing the osdmaps in cache.

#93 Updated by Bryan Stillwell almost 3 years ago

We're also seeing this issue and was wondering if there are any updates on when it will get fixed?

I'm seeing 42,000+ osdmaps on some of my OSDs:

# find /var/lib/ceph/osd/ceph-575/current/meta/ -name 'osdmap*' | wc -l
42704

With each osdmap averaging around 600KB, that's 24GB of space:

# du -sh /var/lib/ceph/osd/ceph-575/current/meta
24G    /var/lib/ceph/osd/ceph-575/current/meta

We're running hammer 0.94.6.

#94 Updated by Kefu Chai over 2 years ago

grep -C6 update_waiting_for_pg out/osd.4.log | grep -A5 '#199'
2016-04-26 09:33:28.471563 7fa07ba89700 -1 osdmap: =#198 7 => #199 7
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::dispatch_sessions_waiting_on_map()+0x14f) [0x6b6a7f]
 3: (OSD::consume_map()+0x5bf) [0x66995f]
 4: (OSD::handle_osd_map(MOSDMap*)+0x1a0e) [0x66ba9e]
--
2016-04-26 09:33:29.444836 7fa06664d700 -1 osdmap: =#198 3 => #199 12
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::ms_fast_dispatch(Message*)+0x187) [0x65ef27]
 3: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb8a026]
 4: (Pipe::reader()+0x1aea) [0xbb6faa]
--
2016-04-26 09:33:29.977869 7fa05f5dd700 -1 osdmap: =#198 2 => #199 13
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::ms_fast_dispatch(Message*)+0x187) [0x65ef27]
 3: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb8a026]
 4: (Pipe::reader()+0x1aea) [0xbb6faa]
--
2016-04-26 09:33:31.129628 7fa059880700 -1 osdmap: =(NULL) => #199 14
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x8d0) [0x642730]
 2: (OSD::ms_fast_dispatch(Message*)+0x187) [0x65ef27]
 3: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb8a026]
 4: (Pipe::reader()+0x1aea) [0xbb6faa]
--
2016-04-26 09:33:33.452062 7fa07ba89700 -1 osdmap: =#199 7 => #200 6
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::dispatch_sessions_waiting_on_map()+0x14f) [0x6b6a7f]
 3: (OSD::consume_map()+0x5bf) [0x66995f]
 4: (OSD::handle_osd_map(MOSDMap*)+0x1a0e) [0x66ba9e]
--
2016-04-26 09:33:34.465743 7fa05f5dd700 -1 osdmap: =#199 4 => #200 9
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::ms_fast_dispatch(Message*)+0x187) [0x65ef27]
 3: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb8a026]
 4: (Pipe::reader()+0x1aea) [0xbb6faa]
--
2016-04-26 09:33:34.528829 7fa06664d700 -1 osdmap: =#199 3 => #200 11
 ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
 1: (OSD::update_waiting_for_pg(OSD::Session*, OSDMapRef)+0x6fa) [0x64255a]
 2: (OSD::ms_fast_dispatch(Message*)+0x187) [0x65ef27]
 3: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb8a026]
 4: (Pipe::reader()+0x1aea) [0xbb6faa]
bash-4.2$ grep -b3 'OSD::Session::~Session' out/osd.4.log | grep -A3 '#198'
364265996-2016-04-26 09:33:26.972170 7fa07ba89700 -1 osdmap: -#198 12
364266056- ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
364266133: 1: (OSD::Session::~Session()+0x5a) [0x68f4fa]
364266180: 2: (OSD::Session::~Session()+0x9) [0x68f719]
bash-4.2$ grep -b3 'OSD::Session::~Session' out/osd.4.log | grep -A3 '#199'
bash-4.2$ grep -b3 'OSD::Session::~Session' out/osd.4.log | grep -A3 '#200'
368465437-2016-04-26 09:33:35.460745 7fa07ba89700 -1 osdmap: -#200 12
368465497- ceph version 0.94.6-258-g6688c7a (6688c7af3073479c4994846f295cfcd1bf045ec1)
368465574: 1: (OSD::Session::~Session()+0x5a) [0x68f4fa]
368465621: 2: (OSD::Session::~Session()+0x9) [0x68f719]

and the time span is long enough to cover the life cycle of the Session.

2016-04-26 14:48:31.615597 7fa078a83700 -1 osdmap: =(NULL) => #288 11
i was running then:
$ for i in `seq 300`; do ./rbd snap create rbd/rbd-image-0@rbd-snap-$i; done
$ for i in `seq 300`; do ./rbd snap rm  rbd/rbd-image-0@rbd-snap-$i; done

the osdmap#199 was assigned to OSD::Session, and never got released after that.

this problem reminds me of #14135

#95 Updated by Kefu Chai over 2 years ago

(gdb) p (*(('OSD::Session' *)0x5c49800))
$26 = {<RefCountedObject> = {_vptr.RefCountedObject = 0xff2780 <vtable for OSD::Session+16>, nref = {val = 1}, cct = 0x4b82000}, entity_name = {type = 8, id = "admin",
    type_id = "client.admin"}, caps = {grants = std::vector of length 1, capacity 1 = {{match = {auid = -1, pool_name = "", is_nspace = false, nspace = "", object_prefix = ""},
        spec = {allow = {val = 255 '\377'}, class_name = "", class_allow = ""}}}}, auid = 0, con = {px = 0x0}, wstate = {lock = {name = "WatchConState", id = -1,
      recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {
            __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, nlock = 0, locked_by = 0, cct = 0x0, logger = 0x0},
    watches = std::set with 0 elements}, session_dispatch_lock = {name = "Session::session_dispatch_lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {
      __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, nlock = 0, locked_by = 0, cct = 0x0, logger = 0x0},
  waiting_on_map = {<std::__cxx11::_List_base<std::tr1::shared_ptr<OpRequest>, std::allocator<std::tr1::shared_ptr<OpRequest> > >> = {
      _M_impl = {<std::allocator<std::_List_node<std::tr1::shared_ptr<OpRequest> > >> = {<__gnu_cxx::new_allocator<std::_List_node<std::tr1::shared_ptr<OpRequest> > >> = {<No data fields>}, <No data fields>}, _M_node = {<std::__detail::_List_node_base> = {_M_next = 0x5c49998, _M_prev = 0x5c49998}, _M_data = 0}}}, <No data fields>},
  osdmap = std::tr1::shared_ptr (count 1, weak 1) 0x4f14d80, waiting_for_pg = std::map with 0 elements, sent_epoch_lock = {name = "Session::sent_epoch_lock", id = -1,
    recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, nlock = 0, locked_by = 0, cct = 0x0, logger = 0x0},
  last_sent_epoch = 0, received_map_lock = {name = "Session::received_map_lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 0,
        __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, nlock = 0, locked_by = 0, cct = 0x0, logger = 0x0}, received_map_epoch = 0}

2016-04-28 14:19:14.186796 7f495c42b700  1 osd.0 206 ms_handle_reset con 0x5072580 session 0x5c49800

(gdb) p (*(PipeConnection*)0x5072580)
$28 = {<Connection> = {<RefCountedObject> = {_vptr.RefCountedObject = 0x1003cc8 <vtable for PipeConnection+16>, nref = {val = 1}, cct = 0x4b82000}, lock = {
      name = "Connection::lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2,
          __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, nlock = 0,
      locked_by = 0, cct = 0x0, logger = 0x0}, msgr = 0x4bce000, priv = 0x5c49800, peer_type = 8, peer_addr = {type = 0, nonce = 741692480, {addr = {ss_family = 2,
          __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, addr4 = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 16777343},
          sin_zero = "\000\000\000\000\000\000\000"}, addr6 = {sin6_family = 2, sin6_port = 0, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {
              __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}}, last_keepalive_ack = {tv = {
        tv_sec = 0, tv_nsec = 0}}, features = 55169095435288575, failed = true, rx_buffers_version = 0, rx_buffers = std::map with 0 elements}, pipe = 0x0}

(gdb) p (*(PipeConnection*)0x5072580).priv
$30 = (RefCountedObject *) 0x5c49800

/// so PipeConnection is still holding the session even after the session is reset (as the peer has reset the TCP connection)

i will continue checking the life cycle of PipeConnection.

#96 Updated by Kefu Chai over 2 years ago

i am tracking a message which is holding a Connection reference. and the connection is 0x4c20880 per the analysis this time.

(gdb) p op_tracker.history.arrived
$26 = std::set with 20 elements = {[0] = {first = {tv = {tv_sec = 1461899885, tv_nsec = 300862058}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4535900}, [1] = {first = {tv = {tv_sec = 1461899885,
        tv_nsec = 329180686}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4535a40}, [2] = {first = {tv = {tv_sec = 1461899885, tv_nsec = 351494267}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x4535b80}, [3] = {first = {tv = {tv_sec = 1461899961, tv_nsec = 186232316}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4535cc0}, [4] = {first = {
      tv = {tv_sec = 1461899966, tv_nsec = 363835585}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfadc0}, [5] = {first = {tv = {tv_sec = 1461899973, tv_nsec = 338892221}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfa8c0}, [6] = {first = {tv = {tv_sec = 1461899984, tv_nsec = 607502971}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x45375c0}, [7] = {first = {
      tv = {tv_sec = 1461899989, tv_nsec = 73728621}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfc940}, [8] = {first = {tv = {tv_sec = 1461899990, tv_nsec = 902038482}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x4534500}, [9] = {first = {tv = {tv_sec = 1461899995, tv_nsec = 142282053}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4536940}, [10] = {first = {
      tv = {tv_sec = 1461899996, tv_nsec = 424468779}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfbe00}, [11] = {first = {tv = {tv_sec = 1461900013, tv_nsec = 394524550}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x4537700}, [12] = {first = {tv = {tv_sec = 1461900015, tv_nsec = 897184842}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfcd00}, [13] = {first = {
      tv = {tv_sec = 1461900025, tv_nsec = 625109825}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4537d40}, [14] = {first = {tv = {tv_sec = 1461900030, tv_nsec = 636461575}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x5360500}, [15] = {first = {tv = {tv_sec = 1461900047, tv_nsec = 976718805}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4537e80}, [16] = {first = {
      tv = {tv_sec = 1461900053, tv_nsec = 286816706}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4536300}, [17] = {first = {tv = {tv_sec = 1461900054, tv_nsec = 835373979}},
    second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfd0c0}, [18] = {first = {tv = {tv_sec = 1461900058, tv_nsec = 741772904}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4dfb540}, [19] = {first = {
      tv = {tv_sec = 1461900070, tv_nsec = 483376921}}, second = std::tr1::shared_ptr (count 2, weak 0) 0x4537c00}}

(gdb) p (((OpRequest*)0x4535cc0)->request)->connection
$27 = {px = 0x4c20880}

so, it's the OpTracker who is holding the message.

OpTracker->OpHistory->Message->PipeConnection->Session->OSDMap

and OpHistory does not cleanup Request unless its size exceeds osd_op_history_size or the oldest completed op is older than now-osd_op_history_duration. so we need to reset the connection when unregistering an completed op. as we don't care about the connection when dumping the op history.

voilĂ .

#97 Updated by Kefu Chai over 2 years ago

  • Status changed from Pending Backport to Need Test

https://github.com/ceph/ceph/pull/8828

and this change is being tested in lab.

the fix does address the issue, but it alleviates it =)

will continue the investigation.

#98 Updated by Kefu Chai over 2 years ago

  • Related to Bug #15517: msg/simple: PipeConnection leak added

#99 Updated by Kefu Chai over 2 years ago

in OSD::handle_osd_map()

  if (!superblock.oldest_map || skip_maps)
    superblock.oldest_map = first;

if the cluster's osdmap is moving way ahead of an OSD, the OSD could miss some osdmaps at seeing the osdmaps carried by the MOSDMap message (very likely one shared by a monitor at random). in this case, skip_maps is true. hence superblock.oldest_map is reset to first, which is the first osdmap in the MOSDMap message. but we can only remove 30 osdmaps by default. so it's very likely we can not remove all the old maps [superblock.oldest_map, min(msg->oldest_map, cached_maps)]. and we hope the next MOSDMap message could help us zap more old osdmap epochs, but it can not! because the superblock.oldest_map has been set to the msg->first. so the range of osdmap in [superblock.oldest_map+30, msg->first] will stay with us after the leap.

this only happens in a large cluster where the distribution of OSD updated by OSDMap is spatio-temporal sparse, which leads to the skip_maps.

#100 Updated by Kefu Chai over 2 years ago

https://github.com/ceph/ceph/pull/8990 targeting master.

will test its hammer backport in lab also.

#101 Updated by Sage Weil over 2 years ago

  • Status changed from Need Test to Pending Backport
  • Backport changed from hammer to hammer,jewel

#102 Updated by Loic Dachary over 2 years ago

  • Copied to Backport #15856: jewel: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added

#103 Updated by Kefu Chai over 2 years ago

every OSD::Session holds an osdmap, and OSD talks with each other over OSD::Session also. in a large cluster, there is chance that an OSD::Session is used very occasionally if the connected OSDs are not chatty.

  1. OSD.1 sends a request to OSD.2
  2. OSD.2 sets the latest osdmap (osdmap.23066) to the session over which the request is received
  3. OSD.2 dispatches the request and the request is handled
  4. OSD.1 and OSD.2 are back to peace, but the OSD::Session is still holding osdmap.23066
  5. time elapses. and OSD.1 and OSD.2 stare at each other quietly making sure its peer is still alive.
  6. and the latest osdmap is now osd.32768, the osdmap.23066 is still being held by the OSD::Session.

hence the "leak".

#104 Updated by Kefu Chai over 2 years ago

  • Status changed from Pending Backport to Need Review

#105 Updated by Kefu Chai over 2 years ago

note to myself: should revisit https://github.com/ceph/ceph/pull/8828

OpRequest::_dump() =>
  Message::get_orig_source() => 
    Message::get_orig_source_inst().name =>
      Message::get_source_inst(get_source(), get_source_addr()) =>
        Message::get_source_addr() =>

    if (connection)
      return connection->get_peer_addr();
    return entity_addr_t();

but what we print is the "name" field. which comes from "Message::get_source()"

Message::get_source() => entity_name_t(header.src)

so we don't lose anything if we reset Message::connection.

will post a revised version of PR#9108.

#106 Updated by Kefu Chai over 2 years ago

#108 Updated by Kefu Chai over 2 years ago

  • Copied to Bug #15879: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added

#109 Updated by Steve Taylor over 2 years ago

I cherry picked the Hammer backports of these three pull requests into my Hammer clone and built a 0.94.7 with these fixes. I deployed that build to a new test cluster with a single mon and 88 OSDs across three hosts (24, 32, 32). I ran a script to create a bunch of small RBDs with filesystems and content on them, then ran my script that creates a deletes RBD snapshots like mad to reproduce this issue.

I haven't set osd_map_cache_size, so I should be using the default value of 500. The script has been running for about 90 minutes so far, and the cached osdmap count on osd.0 has held steady between 1,000 and 1,500. Obviously this is more than the 500 it should be, but it's MUCH better than the hundreds of thousands I've seen using this script in the past.

I'll continue to let it run and see if anything changes, but so far it's looking good.

#110 Updated by Kefu Chai over 2 years ago

  • Status changed from Need Review to Pending Backport

#111 Updated by Xiaoxi Chen over 2 years ago

Kefu,
We seems also hit this issue, in some OSD we have 75GB+ (327839) old osd maps in meta dir. Now we are on the way to jewel, as hammer->jewel need to "chown", these tons of map really trouble us.

Have some questions for you,
1. Can restart the OSDs(then OSD::session will re-create) cleanup the old map? ---> seems not.
2. Can we delete the meta prior to upgrade?
3. Can upgrade to jewel (10.2.2 with this fix) cleanup the old map? ---> also seems not , I upgraded some of the node, but still seeing a lot in that node.
4. How to zap the old map?
Really thanks for fixing such a bug!

Xiaoxi

#112 Updated by Nathan Cutler over 2 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF