Bug #48732: Marking OSDs out causes mon daemons to crash following tcmalloc: large alloc - RADOS - Ceph

Added by Wes Dillingham over 3 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Severity: 3 - minor
Component(RADOS): Monitor

Description

On a 14.2.11 zero-load cluster I am taking some osd servers out of service.
I began marking OSDs out in preparation and encountered an issue:

Immediately after marking an OSD out, the quorum became very laggy. The following state was recorded close to when the quorum became unresponsive to client requests (which lasted about five minutes):

HEALTH_WARN no active mgr; Reduced data availability: 6 pgs inactive, 35 pgs peering; Degraded data redundancy: 2920/16956 objects degraded (17.221%), 8 pgs degraded; 1/3 mons down

Looking at the mon log I saw the following:
3 instances of " ceph-mon: tcmalloc: large alloc 8589934592 bytes"
followed by the mon going down:
Dec 30 12:20:50 redacted systemd: Stopping Ceph cluster monitor daemon...
Dec 30 12:20:50 redacted ceph-mon: 2020-12-30 12:20:50.017 7f9ec4f21700 -1 received signal: Terminated from /usr/lib/systemd/systemd --system --deserialize 25 (PID: 1) UID: 0

This cascaded to a second monitor going down shortly thereafter. The mon daemon could be restarted immediately, and after about five minutes the cluster's monitors and mgrs all came back, but it's unclear to me why simply marking an OSD out would cause the mon daemons to crash. The mgr daemons never crashed.
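For anyone checking their own mon logs for the same symptom, here is a minimal sketch that pulls the allocation sizes out of `tcmalloc: large alloc` messages and converts them to GiB. The function name and the sample lines are illustrative assumptions; only the message format is taken from the log excerpt above.

```python
import re

# Matches messages such as: "ceph-mon: tcmalloc: large alloc 8589934592 bytes"
LARGE_ALLOC = re.compile(r"tcmalloc: large alloc (\d+) bytes")


def large_allocs_gib(log_lines):
    """Return the size in GiB of every tcmalloc large-alloc message found."""
    sizes = []
    for line in log_lines:
        match = LARGE_ALLOC.search(line)
        if match:
            sizes.append(int(match.group(1)) / 2**30)
    return sizes


# Sample lines modelled on the excerpt in this report (hypothetical input).
sample = [
    " ceph-mon: tcmalloc: large alloc 8589934592 bytes",
    "Dec 30 12:20:50 redacted systemd: Stopping Ceph cluster monitor daemon...",
]
print(large_allocs_gib(sample))  # [8.0]
```

The 8589934592-byte request reported here is exactly 8 GiB (8 * 2^30), a single allocation far larger than anything a monitor would normally request.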

Looking at the memory graphs of the mon servers, there was always plenty of free memory and load never seemed to spike. The highest number of PGs per OSD in the cluster is 176. All OSDs are spinning disks on BlueStore.
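The PGs-per-OSD figure can be cross-checked from the JSON form of `ceph osd df`. The sketch below assumes that output contains a top-level `nodes` list whose entries carry `name` and `pgs` fields; that layout is an assumption and should be verified against the actual output on the cluster. The sample JSON is hypothetical.

```python
import json


def max_pgs_per_osd(osd_df_json):
    """Return (osd_name, pg_count) for the OSD holding the most PGs.

    Assumes the layout {"nodes": [{"name": ..., "pgs": ...}, ...]},
    which should be checked against real `ceph osd df --format json` output.
    """
    nodes = json.loads(osd_df_json)["nodes"]
    busiest = max(nodes, key=lambda node: node["pgs"])
    return busiest["name"], busiest["pgs"]


# Hypothetical sample; real output carries many more fields per OSD.
sample = '{"nodes": [{"name": "osd.0", "pgs": 150}, {"name": "osd.1", "pgs": 176}]}'
print(max_pgs_per_osd(sample))
```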

I can try and recreate this issue if it helps.

Related issues 1 (0 open — 1 closed)

Updated by Wes Dillingham over 3 years ago

This seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1826450; our circumstances are highly similar.

Updated by Neha Ojha over 3 years ago

Status changed from New to Need More Info

It would be great if you could share a reproducer for this, or reproduce it and capture monitor logs with debugging enabled.
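One way to raise monitor debug logging before a reproduction attempt is sketched below. The specific subsystems and levels (`debug_mon 20`, `debug_paxos 20`, `debug_ms 1`) are assumptions, since this ticket does not name them. The helper names are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed debug levels; the ticket does not specify which ones are wanted.
DEBUG_LEVELS = {"debug_mon": "20", "debug_paxos": "20", "debug_ms": "1"}


def debug_commands(levels=None):
    """Build one `ceph config set mon <option> <value>` command per option."""
    levels = DEBUG_LEVELS if levels is None else levels
    return [["ceph", "config", "set", "mon", option, value]
            for option, value in levels.items()]


def apply(commands, dry_run=True):
    """Print the commands, or run them against the cluster if dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(debug_commands())  # dry run: prints the commands only
```

The levels should be reverted after the logs are captured, because verbose monitor logging grows log files quickly.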

Updated by Dan van der Ster almost 3 years ago

Wes, did you ever find out more about the root cause of this? We saw something similar today in #50587

Updated by Dan van der Ster almost 3 years ago

Related to Bug #50587: mon election storm following osd recreation: huge tcmalloc and ceph::msgr::v2::FrameAssembler backtraces added

Updated by Wes Dillingham almost 3 years ago

Hello Dan and Neha. Shortly after filing this bug I went on paternity leave, but I have returned today. I will attempt to recreate this issue in the hope that it is still helpful.

Updated by Dan van der Ster almost 3 years ago

Wes Dillingham wrote:

Hello Dan and Neha. Shortly after filing this bug I went on paternity leave, but I have returned today. I will attempt to recreate this issue in the hope that it is still helpful.

Welcome back. In our case we found it was due to a negative progress bug; see the comments in #50587. I wonder if this explains what you saw?
