Bug #45706
open
Memory usage in buffer_anon showing unbounded growth in osds on EC pool. (14.2.9)
Description
Hi,
Re these threads in the mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DPBVNJQXXIP6LB72ALXYZZESWTWNGYVV/ and https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBYJJ3TVLKGYZWNDU7YSUQSTPHYMHGU7/
We see a number of OSDs with apparently unbounded memory growth in buffer_anon in clusters upgraded to Nautilus. We have Nautilus (14.2.9) installed on all nodes; some were newly installed since the upgrade to Nautilus, and some were upgraded from Mimic. We see these issues across both types of node.
mempool dump from one such OSD:
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5629372,
                "bytes": 45034976
            },
            "bluestore_cache_data": {
                "items": 127,
                "bytes": 65675264
            },
            "bluestore_cache_onode": {
                "items": 8275,
                "bytes": 4634000
            },
            "bluestore_cache_other": {
                "items": 2967913,
                "bytes": 62469216
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 145,
                "bytes": 100920
            },
            "bluestore_writing_deferred": {
                "items": 335,
                "bytes": 13160884
            },
            "bluestore_writing": {
                "items": 1406,
                "bytes": 5379120
            },
            "bluefs": {
                "items": 1105,
                "bytes": 24376
            },
            "buffer_anon": {
                "items": 13705143,
                "bytes": 40719040439
            },
            "buffer_meta": {
                "items": 6820143,
                "bytes": 600172584
            },
            "osd": {
                "items": 96,
                "bytes": 1138176
            },
            "osd_mapbl": {
                "items": 59,
                "bytes": 7022524
            },
            "osd_pglog": {
                "items": 491049,
                "bytes": 156701043
            },
            "osdmap": {
                "items": 107885,
                "bytes": 1723616
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 29733053,
            "bytes": 41682277138
        }
    }
}
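To make the imbalance in a dump like this easy to spot, the pools can be ranked by their share of the total. The following is a minimal sketch, assuming the `mempool.by_pool` / `mempool.total` layout shown above; the function name `rank_pools` is hypothetical, and `SAMPLE` is a trimmed copy of the dump (three pools plus the reported total), not the full output.

```python
def rank_pools(dump, top=3):
    """Return (pool_name, bytes, share_of_total) for the largest mempools."""
    mempool = dump["mempool"]
    total = mempool["total"]["bytes"]
    # Sort pools by byte count, largest first.
    pools = sorted(
        mempool["by_pool"].items(),
        key=lambda kv: kv[1]["bytes"],
        reverse=True,
    )
    return [
        (name, stats["bytes"], stats["bytes"] / total if total else 0.0)
        for name, stats in pools[:top]
    ]


# Trimmed copy of the dump above: only three pools are kept, while the
# total is the one reported for the whole OSD.
SAMPLE = {
    "mempool": {
        "by_pool": {
            "buffer_anon": {"items": 13705143, "bytes": 40719040439},
            "buffer_meta": {"items": 6820143, "bytes": 600172584},
            "osd_pglog": {"items": 491049, "bytes": 156701043},
        },
        "total": {"items": 29733053, "bytes": 41682277138},
    }
}

if __name__ == "__main__":
    for name, nbytes, share in rank_pools(SAMPLE):
        print(f"{name:12s} {nbytes:>14d} bytes  {share:6.1%}")
```

With the figures from this report, buffer_anon accounts for nearly all of the total mempool bytes, with buffer_meta a distant second.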
PERF DUMP excerpt:
"prioritycache": {
    "target_bytes": 4294967296,
    "mapped_bytes": 38466584576,
    "unmapped_bytes": 425984,
    "heap_bytes": 38467010560,
    "cache_bytes": 134217728
},
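The gap between the target and the actual heap size in this excerpt can be put into numbers. This is a minimal sketch of the arithmetic using only the counters quoted above; the helper name `overshoot` is hypothetical, and it makes no assumption about how the OSD itself evaluates these counters.

```python
GIB = 1024 ** 3

# Counters copied from the perf dump excerpt above.
PRIORITYCACHE = {
    "target_bytes": 4294967296,
    "mapped_bytes": 38466584576,
    "unmapped_bytes": 425984,
    "heap_bytes": 38467010560,
    "cache_bytes": 134217728,
}


def overshoot(pc):
    """Ratio of heap size to the target; a value above 1 means over target."""
    return pc["heap_bytes"] / pc["target_bytes"]


if __name__ == "__main__":
    print(f"target: {PRIORITYCACHE['target_bytes'] / GIB:.2f} GiB")
    print(f"heap:   {PRIORITYCACHE['heap_bytes'] / GIB:.2f} GiB")
    print(f"ratio:  {overshoot(PRIORITYCACHE):.2f}x")
```

For this OSD the target is 4 GiB, while the heap is close to nine times that size.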
I note that this seems to be correlated with those OSDs also losing contact with the cluster in general (for us, though perhaps not in the other cases). We see lines like this in our ceph-osd logs for some of them:
2020-05-22 03:42:17.396 7fece7912700 1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-05-22 02:42:17.398734) 10.1.50.28:0/453364 >> [v2:10.1.50.33:6912/55198749,v1:10.1.50.33:6925/55198749] conn(0x564a1b406000 0x564a8d02e580 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.50.33:6912/55198749,v1:10.1.50.33:6925/55198749] is using msgr V1 protocol
2020-05-22 03:42:18.940 7fecf9e45700 -1 --2
(note that other OSDs on the same host have no problems with their connections, so this isn't a network issue - and the issue goes away with a restart, along with the extra memory usage).
The extra memory usage, in particular, causes high load on our host nodes, including extensive swapping and, once, an OOM killer invocation.
Updated by Zac Medico over 3 years ago
There's this buffer::list::rebuild buffer_anon leak fix in the master branch that may solve the issue:
Updated by Greg Farnum almost 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)