Tasks #11847

open

OSD crashes under cached cluster benchmark

Added by Mark Korondi almost 9 years ago. Updated almost 9 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

I set up a cluster with around 204 OSDs. During continuous benchmarking (set up a cache tier, move hosts around in the CRUSH map, wait for HEALTH_OK, tear down the cache tier, loop), several OSDs go down. I checked one of the logs on the OSD hosts:

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7fe049204130]
 3: (gsignal()+0x37) [0x7fe047c1e5d7]
 4: (abort()+0x148) [0x7fe047c1fcc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fe0485229b5]
 6: (()+0x5e926) [0x7fe048520926]
 7: (()+0x5e953) [0x7fe048520953]
 8: (()+0x5eb73) [0x7fe048520b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (Thread::create(unsigned long)+0x8a) [0xba93ba]
 11: (Pipe::accept()+0x3883) [0xca6663]
 12: (Pipe::reader()+0x1a1f) [0xcaa11f]
 13: (Pipe::Reader::entry()+0xd) [0xcacd5d]
 14: (()+0x7df5) [0x7fe0491fcdf5]
 15: (clone()+0x6d) [0x7fe047cdf1ad]

The system:

# lsb_release --all
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.1 (Maipo)
Release:        7.1
Codename:       Maipo

# rpm -qa | grep ^ceph-
ceph-0.94.1-0.el7.x86_64
ceph-common-0.94.1-0.el7.x86_64
ceph-radosgw-0.94.1-0.el7.x86_64

ObjDump: https://drive.google.com/open?id=0B93VwrIsrOpHZzZMdGt4WTFCY2s&authuser=0

Actions #1

Updated by Loïc Dachary almost 9 years ago

  • Project changed from Stable releases to Ceph

Actions #2

Updated by Haomai Wang almost 9 years ago

I guess this is the OS thread limit. You need to increase the thread limit for the OSD.
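
To check this guess on an affected host, the sketch below prints the Linux limits that commonly cap thread creation. The backtrace shows an assert in Thread::create called from Pipe::accept, which is consistent with thread creation failing, but the ticket does not confirm the cause. The helper names `read_int` and `thread_limits` are made up for this example. The values reported for RLIMIT_NPROC are those of the user running the script, so it should be run as the user that runs ceph-osd.

```python
import resource
from pathlib import Path


def read_int(path):
    """Return the integer stored in a /proc file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_limits():
    """Collect limits that can cap the number of threads on a Linux host."""
    # RLIMIT_NPROC counts threads as well as processes for the calling user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "threads-max": read_int("/proc/sys/kernel/threads-max"),  # system-wide thread cap
        "pid_max": read_int("/proc/sys/kernel/pid_max"),          # every thread consumes a PID
        "nproc_soft": soft,
        "nproc_hard": hard,
    }


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
```

The limits applied to an already running daemon can also be read from /proc/<pid>/limits (the "Max processes" row), which may differ from the values of an interactive shell.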

Actions #3

Updated by Kefu Chai almost 9 years ago

Yeah, I have the same guess as Haomai. Mark, this looks like a dup of #10988. We should probably find a way to throttle the number of threads in use.
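
As a rough illustration of what throttling thread usage could mean, here is a minimal sketch in Python. It is not Ceph code: Ceph is written in C++, and the class name `ThrottledSpawner` and its `spawn` method are hypothetical. A semaphore caps the number of live worker threads, so a caller that hits the cap gets `None` back and can refuse or defer the work instead of aborting.

```python
import threading


class ThrottledSpawner:
    """Start worker threads only while fewer than max_threads are running."""

    def __init__(self, max_threads):
        self._slots = threading.BoundedSemaphore(max_threads)

    def spawn(self, target, *args):
        """Run target(*args) in a new thread; return the Thread, or None at capacity."""
        # Non-blocking acquire: fail fast instead of waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            return None

        def run():
            try:
                target(*args)
            finally:
                # Free the slot even if target raises.
                self._slots.release()

        thread = threading.Thread(target=run)
        try:
            thread.start()
        except RuntimeError:
            # The OS refused to create the thread; give the slot back.
            self._slots.release()
            raise
        return thread
```

With such a cap, an accept path could reject or queue a new connection when `spawn` returns `None`, keeping the process below the OS limit rather than asserting when thread creation fails.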
