Project

General

Profile

Bug #39955

After upgrade to Nautilus 14.2.1 mon DB is growing too fast when state of cluster is not active+clean

Added by Марк Коренберг over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
insights module
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I did not track before, but after upgrade, any non-"active+clean" state lead to rapid growth of monitor DB. For example in one hour it grew to 20 GB. and in normal state is is kus 40 MB.

FS usage graph is attached

Снимок экрана от 2019-05-16 23-47-50.png View (66.8 KB) Марк Коренберг, 05/16/2019 06:48 PM

Снимок экрана от 2019-05-16 23-59-43.png View (64.5 KB) Марк Коренберг, 05/16/2019 07:00 PM

Снимок экрана от 2019-06-13 21-30-42.png View (44.6 KB) Марк Коренберг, 06/13/2019 04:31 PM

Снимок экрана от 2019-06-13 21-31-46.png View (53 KB) Марк Коренберг, 06/13/2019 04:31 PM

Снимок экрана от 2019-06-13 21-57-25.png View (44.7 KB) Марк Коренберг, 06/13/2019 04:58 PM

Снимок экрана от 2019-06-14 11-56-54.png View (34.2 KB) Марк Коренберг, 06/14/2019 06:57 AM

monmap (409 Bytes) Марк Коренберг, 06/15/2019 11:52 AM

osdmap (14.6 KB) Марк Коренберг, 06/15/2019 11:52 AM

mgrmap (12.3 KB) Марк Коренберг, 06/15/2019 11:52 AM

mdsmap (1.04 KB) Марк Коренберг, 06/15/2019 11:52 AM

dump_paxos (24 Bytes) Марк Коренберг, 06/15/2019 11:52 AM

crushmap (2.21 KB) Марк Коренберг, 06/15/2019 11:52 AM

dump_keys (155 KB) Марк Коренберг, 06/15/2019 11:52 AM


Related issues

Duplicated by mgr - Bug #40385: Ceph mgr `insights` uses mon DB as a storage. Duplicate 06/15/2019

History

#1 Updated by Марк Коренберг over 1 year ago

red triangular spikes -- `ceph mon compact`

#2 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Component(RADOS) Monitor added

#3 Updated by Neha Ojha over 1 year ago

  • Priority changed from Normal to Immediate

#4 Updated by Neha Ojha over 1 year ago

Are you still seeing this issue? Can you tell us what exactly was happening, when you saw non-"active+clean" state, in this example after 18:30? The output of ceph status will be useful. I do see that after compaction(red spikes), the usage goes down, does it come back to normal?

#5 Updated by Марк Коренберг over 1 year ago

1. Yes, I still seeing this issue. It can be triggered by rebooting a node.
2. I just rebooted my node after upgrading (you can see diconnected blue line since no meterics were collected during reboot)
3. No, compaction just do it's work correctly. Problem not in RocksDB, but in extra content in it.
4. I've written Prometheus collector that collects size of /var/lib/ceph/mon directory. See one more picture.

#7 Updated by Марк Коренберг over 1 year ago

This one rapid growth happened after rebooting an OSD node. Interesting, this node has filestore OSDs and Nautilus run `chown -R` at the start of an OSD., so that's why cluster state was not active+clean too long.

Growth also happens on other OSDs restart, but they are bluestore and restart quickly.

Also, `ceph osd out` DOES NOT lead to DB growth.

#8 Updated by Марк Коренберг over 1 year ago

Is any way to dump Monitor DB contents ?

#9 Updated by Josh Durgin over 1 year ago

With a monitor stopped, you can run e.g.

ceph-monstore-tool /var/lib/ceph/mon/ceph-a/ dump-keys > key_file
- I expect you'll see some large difference in types of keys present before and after the mon db growth.

#11 Updated by Марк Коренберг over 1 year ago

Well, I swtched off one node, and saw that after roughly 2 minutes, DB became to grow.

Next, I shut down one os the mons, copied it's dir and started it back, next, I switched on node again. DB returned to initial size. But now I have a copy of bloated DB.

Next, I run:

ceph-monstore-tool /var/lib/ceph/mon/zzzzzzzzzzz compact

It did compaction, bit size did not reduce significantly.

Unfortunately, I seems to spoil db by command `ceph-monstore-tool /var/lib/ceph/mon/zzzzzzzzzzz rewrite-crush`, but I'm not sure what it does.

I did all possible dumps using the command (saved to separate files) and will attach them, as well as DB itself.

#12 Updated by Марк Коренберг over 1 year ago

#13 Updated by Марк Коренберг over 1 year ago

.tar.xz with mon DB

https://drive.google.com/file/d/1vUnKlJ2DXIN5sG2EKrJe3PFPVD_f8qpg

Also, Ive compiled RocksDB tools ( https://github.com/facebook/rocksdb.git ) and run:

./ldb --db=zzzzzzzzzzz/store.db dump --hex > somedump

gives about 4 GB in size (since every byte is represented as two hex symbols)

Please help me in DB content analysis.

#14 Updated by Марк Коренберг over 1 year ago

$ ./ldb --db=zzzzzzzzzzz/store.db scan | wc -l
7201

$ ./ldb --db=zzzzzzzzzzz/store.db scan | wc -c
203477

$ ./ldb --db=zzzzzzzzzzz/store.db dump --hex | wc -l
7112 somedump.hex

$ ./ldb --db=zzzzzzzzzzz/store.db dump --hex | wc -c
4298956894

#15 Updated by Марк Коренберг over 1 year ago

This is a list of top N key with their value size in MB:

b'logm\x00full_33478572' 14.130637168884277
b'logm\x00full_33478573' 14.130672454833984
b'paxos\x0080459057' 14.13127326965332
b'paxos\x0080459059' 14.132275581359863
b'paxos\x0080459061' 14.131123542785645
b'paxos\x0080459045' 13.510034561157227
b'logm\x00full_33478568' 12.28122329711914
b'logm\x00full_33478569' 12.281198501586914
b'logm\x00full_33478570' 12.281255722045898
b'logm\x00full_33478571' 12.281230926513672
b'paxos\x0080459048' 12.2816743850708
b'paxos\x0080459050' 12.28283405303955
b'paxos\x0080459053' 12.281706809997559
b'logm\x00full_33478564' 11.733987808227539
b'logm\x00full_33478565' 11.733963012695312
b'logm\x00full_33478566' 11.734020233154297
b'logm\x00full_33478567' 11.73399543762207
b'paxos\x0080459022' 11.197884559631348
b'paxos\x0080459034' 11.736038208007812
b'paxos\x0080459037' 11.7344388961792
b'paxos\x0080459041' 11.73559856414795
b'paxos\x0080459043' 11.734471321105957
b'logm\x00full_33478558' 10.03214168548584
b'logm\x00full_33478559' 10.03214168548584
b'logm\x00full_33478560' 10.03217887878418
b'logm\x00full_33478561' 10.032154083251953
b'logm\x00full_33478562' 10.032154083251953
b'logm\x00full_33478563' 10.032129287719727
b'paxos\x0080458989' 10.116427421569824
b'paxos\x0080459005' 10.661810874938965
b'paxos\x0080459023' 10.032552719116211
b'paxos\x0080459025' 10.034005165100098
b'paxos\x0080459027' 10.03262996673584
b'paxos\x0080459029' 10.03256607055664
b'paxos\x0080459032' 10.032605171203613
b'logm\x00full_33478540' 9.096428871154785
b'logm\x00full_33478541' 9.096427917480469
b'logm\x00full_33478542' 9.096403121948242
b'logm\x00full_33478543' 9.096403121948242
b'logm\x00full_33478544' 9.096402168273926

#16 Updated by Марк Коренберг over 1 year ago

I figured out. It was mgr module "insights". It writes tons of data in that case. Disabling module helps. So:

Please update documentation: add tha enabling this module leads to enormously HUGE mon DB size increase in non active+clean state.

The most scary thing that enabling this module in normal state does nothing. And during catastrophe it makes things much worse than they could be. I had to stop my cluster unexpectedly during minor operations due to this bug. It's very dangerous module.

#17 Updated by Josh Durgin over 1 year ago

  • Project changed from RADOS to mgr
  • Category set to insights module
  • Assignee set to Noah Watkins

Noah can you take a look?

#18 Updated by Noah Watkins over 1 year ago

Hi,

It is conceivable that a bug in insights could cause a huge number of keys to be created, but I'm not seeing any evidence of this. The only keys related to insights in your mon keys file dump sare

mon_config_key / mgr/insights/health_history/2019-06-14_06
mon_config_key / mgr/insights/health_history/2019-06-14_07
mon_config_key / mgr/insights/health_history/2019-06-14_08
mon_config_key / mgr/insights/health_history/2019-06-14_09
mon_config_key / mgr/insights/health_history/2019-06-15_11

is it possible to look in your database to see how much space is being occupied by those key/value pairs?

Is it possible to reenable the insights module and collect the monitor log with debug logging turned up?

#19 Updated by Nathan Cutler over 1 year ago

  • Related to Bug #40385: Ceph mgr `insights` uses mon DB as a storage. added

#20 Updated by Nathan Cutler over 1 year ago

  • Related to deleted (Bug #40385: Ceph mgr `insights` uses mon DB as a storage.)

#21 Updated by Nathan Cutler over 1 year ago

  • Duplicated by Bug #40385: Ceph mgr `insights` uses mon DB as a storage. added

#22 Updated by Josh Durgin over 1 year ago

The logm keys being that large suggests there were many messages sent to the cluster log in a short span of time. Is it possible for the insights module to do this? (As an aside, keeping the central log in the monitor db is not ideal, but a bigger project to fix and tougher to backport than limiting log producers).

#23 Updated by Noah Watkins over 1 year ago

Josh, can you speak a little more about the types of messages related to `logm` keys? It's conceivable that some issue with the insights module is causing a lot of communication, but it would help to narrow down what I'm looking for.

#24 Updated by Josh Durgin over 1 year ago

It's storing log messages for the central log, which are also output to the ceph.log file. Is there anything output there from insights when an OSD goes down/up?

#25 Updated by Noah Watkins over 1 year ago

Oh, you mean log messages like debug/info etc... I was thinking you meant that with any communication with the monitor log messages would be generated as a side affect. Is there a different variant of logging that sends logs to the monitor vs the local mgr daemon log?

#26 Updated by Josh Durgin over 1 year ago

As I understand it, within a mgr module self.log goes to the mgr's local log file, and self.cluster_log() would go to the monitor. It appears the latter isn't used very much, so if it's coming from insights it'd be an indirect effect, e.g. from adding/removing a health warning, or running a monitor command (which would be added to the monitor's audit log), or something else I'm not thinking of now.

Марк Коренберг - do you have any ceph.log or audit.log from a time period when you were seeing the mon store grow?

#27 Updated by Noah Watkins over 1 year ago

Thanks Josh. I'll do a quick audit with this info.

#28 Updated by Noah Watkins over 1 year ago

While we don't track much data in the insights module, it looks like the issue here is that the state managed by the insights module is frequently flushed to the monitor's key-value store which the manager implements as a monitor `config-key set` command. That fits with Josh's hint about it being an indirect source.

We can do two things to improve the situation:

(1) decrease the frequency of flushing from 10 seconds to ??? even a couple minutes is probably fine.
(2) the state tracked are various health checks that have been observed. checking for changes since last write should be very effective.

#29 Updated by Kefu Chai about 1 year ago

  • Assignee changed from Noah Watkins to Kefu Chai

#30 Updated by Kefu Chai about 1 year ago

  • Assignee changed from Kefu Chai to Brad Hubbard

#31 Updated by Brad Hubbard about 1 year ago

  • Status changed from New to 4

Noah Watkins wrote:

While we don't track much data in the insights module, it looks like the issue here is that the state managed by the insights module is frequently flushed to the monitor's key-value store which the manager implements as a monitor `config-key set` command. That fits with Josh's hint about it being an indirect source.

We can do two things to improve the situation:

(1) decrease the frequency of flushing from 10 seconds to ??? even a couple minutes is probably fine.
(2) the state tracked are various health checks that have been observed. checking for changes since last write should be very effective.

I've only just strted lookingat this code but I guess you could test (1) by removing /usr/share/ceph/mgr/insights/health.pyc and changing the following line in /usr/share/ceph/mgr/insights/health.py.

PERSIST_PERIOD = datetime.timedelta(seconds = 10)

Change it to the following.

PERSIST_PERIOD = datetime.timedelta(seconds = 120)

Let us know how it goes.

#32 Updated by Patrick Donnelly 12 months ago

  • Status changed from 4 to New

#33 Updated by Brad Hubbard 12 months ago

  • Status changed from New to Need More Info

#34 Updated by Sage Weil 12 months ago

  • Priority changed from Immediate to Urgent

#35 Updated by Sage Weil 10 months ago

  • Target version set to v15.0.0

#36 Updated by Sage Weil 10 months ago

  • Status changed from Need More Info to Resolved

I think this is okay now?

Also available in: Atom PDF