Bug #51626

open

OSD uses all host memory (80g) on startup due to pg_split

Added by Tor Martin Ølberg almost 3 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
16.2.5,15.2.4,15.2.13
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading from 15.2.4 to 15.2.13, some OSDs fail to start.

The OSDs that fail to start appear to be failing because a pg_split is being attempted on a specific pool.

Exporting the troublesome PGs from the OSD allows it to start, but results in data loss. Re-importing the PGs triggers the fault state again.
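The export/remove/import cycle above can be sketched with ceph-objectstore-tool against a stopped OSD. This is a hedged sketch, not the exact commands used: the OSD id (19, taken from the log below) and the PG id (2.1a) are placeholders that must be adapted to the affected PGs.

```shell
# Stop the OSD before touching its object store (placeholder id 19).
systemctl stop ceph-osd@19

# Export the troublesome PG to a file, then remove it so the OSD can start.
# "2.1a" is a hypothetical PG id for illustration.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 \
    --pgid 2.1a --op export --file /tmp/pg.2.1a.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 \
    --pgid 2.1a --op remove --force

systemctl start ceph-osd@19

# Re-importing the PG (with the OSD stopped again) reproduces the fault state:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 \
#     --op import --file /tmp/pg.2.1a.export
```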

Looking at perf dumps, buffer_anon appears to be consuming most of the memory and is never being flushed.
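For reference, the memory observations above can be reproduced through the OSD's admin socket; a minimal sketch, assuming osd.19 (the daemon from the log below) and that jq is available:

```shell
# Per-mempool accounting; buffer_anon is reported under mempool.by_pool.
ceph daemon osd.19 dump_mempools | jq '.mempool.by_pool.buffer_anon'

# Full perf counter dump, as referenced in the description.
ceph daemon osd.19 perf dump
```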

During the fault state, the OSD output (with log level 50/50) is:

2021-07-11T11:02:59.252+0200 7f3a0754b700 10 osd.19 151095 do_waiters -- finish
2021-07-11T11:02:59.252+0200 7f3a0754b700 20 osd.19 151095 tick last_purged_snaps_scrub 2021-07-11T00:02:32.071851+0200 next 2021-07-12T00:02:32.071851+0200
2021-07-11T11:03:00.280+0200 7f3a0754b700 10 osd.19 151095 tick
2021-07-11T11:03:00.280+0200 7f3a0754b700 10 osd.19 151095 do_waiters -- start
2021-07-11T11:03:00.280+0200 7f3a0754b700 10 osd.19 151095 do_waiters -- finish
2021-07-11T11:03:00.280+0200 7f3a0754b700 20 osd.19 151095 tick last_purged_snaps_scrub 2021-07-11T00:02:32.071851+0200 next 2021-07-12T00:02:32.071851+0200
2021-07-11T11:03:00.720+0200 7f39ee519700 30 osd.19 151095 heartbeat_entry woke up
2021-07-11T11:03:00.720+0200 7f39ee519700 30 osd.19 151095 heartbeat
2021-07-11T11:03:00.720+0200 7f39ee519700 30 osd.19 151095 heartbeat: daily_loadavg 4.08824
2021-07-11T11:03:00.720+0200 7f39ee519700 30 osd.19 151095 heartbeat checking stats
2021-07-11T11:03:00.720+0200 7f39ee519700  5 osd.19 151095 heartbeat osd_stat(store_statfs(0x5382fd0000/0x40000000/0x101e0bfe000, data 0x94f7c3851b/0x951dc20000, compress 0x22bb0bce/0x4bff0000/0xa30e0000, omap 0x20e2f8, meta 0x947efd08), peers [] op hist [])
2021-07-11T11:03:00.724+0200 7f39ee519700 20 osd.19 151095 check_full_status cur ratio 0.675189, physical ratio 0.675189, new state none 
2021-07-11T11:03:00.724+0200 7f39ee519700 30 osd.19 151095 heartbeat lonely?
2021-07-11T11:03:00.724+0200 7f39ee519700 30 osd.19 151095 heartbeat done
2021-07-11T11:03:00.724+0200 7f39ee519700 30 osd.19 151095 heartbeat_entry sleeping for 3.5
2021-07-11T11:03:01.280+0200 7f3a0754b700 10 osd.19 151095 tick
2021-07-11T11:03:01.280+0200 7f3a0754b700 10 osd.19 151095 do_waiters -- start
2021-07-11T11:03:01.280+0200 7f3a0754b700 10 osd.19 151095 do_waiters -- finish
2021-07-11T11:03:01.280+0200 7f3a0754b700 20 osd.19 151095 tick last_purged_snaps_scrub 2021-07-11T00:02:32.071851+0200 next 2021-07-12T00:02:32.071851+0200

So far we've tried compiling from source: master, 16.2.5, 15.2.13, and 15.2.4, all of which exhibit the same issue.

Log file can be found here (can't attach it because it's 20 MB):
https://drive.google.com/file/d/1VZpwnx6VDlWZKdOTpNIVF1zxQtpoKSzQ/view?usp=sharing


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #53729: ceph-osd takes all memory before oom on boot (Resolved, Nitzan Mordechai)
