Bug #23372

osd: segfault

Added by Nokia ceph-users about 6 years ago. Updated over 5 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a 5-node cluster with 5 mons and 120 OSDs.

One of the OSDs (osd.7) crashed with the following logs:

    -4> 2018-03-14 22:14:01.748116 7f37b586a700  5 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_files.cc:307] [JOB 16] Delete db/006322.sst type=2 #6322 -- OK

    -3> 2018-03-14 22:14:01.748124 7f37b586a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1521065641748121, "job": 16, "event": "table_file_deletion", "file_number": 6322}
    -2> 2018-03-14 22:14:01.748130 7f37b586a700  5 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_files.cc:307] [JOB 16] Delete db/006276.sst type=2 #6276 -- OK

    -1> 2018-03-14 22:14:01.748134 7f37b586a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1521065641748133, "job": 16, "event": "table_file_deletion", "file_number": 6276}
     0> 2018-03-14 22:49:29.198238 7f37bf87e700 -1 *** Caught signal (Segmentation fault) **
in thread 7f37bf87e700 thread_name:safe_timer

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa3c611) [0x5633ee9df611]
2: (()+0xf5e0) [0x7f37c6b3f5e0]
3: [0x563400080000]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 0 bluestore
   1/ 0 bluefs
   0/ 0 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.7.log

What additional information is required to debug this issue?
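Since nearly every subsystem in the dump above is at 0/0, one step that would likely help (an assumption on my part, not something requested in the ticket) is raising the debug levels for the subsystems involved in the crash before the next occurrence, e.g. with a ceph.conf fragment such as:

```ini
[osd]
debug osd = 20
debug rocksdb = 5
debug bluestore = 20
```

The same levels can be applied to a running daemon with `ceph tell osd.7 injectargs '--debug-osd 20'`. A core dump (e.g. from the abrt spool directory) together with the matching ceph-debuginfo package would also make the truncated backtrace readable.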

ceph-osd.121.txt View (4.53 KB) Nokia ceph-users, 03/22/2018 10:53 AM

History

#1 Updated by Nokia ceph-users about 6 years ago

Nokia ceph-users wrote:

We have a 5-node cluster with 5 mons and 120 OSDs.

One of the OSDs (osd.7) crashed with the following logs:
[...]

What additional information is required to debug this issue?

The crash reproduced in a similar cluster (340 OSDs) with Luminous 12.2.2:

cn2.chn6us1c1.cdn ~# abrt-cli list --since 1521543718
id ca4e01c701cd3a2e50e4ec1e1176aa14f012aff5
reason:         ceph-osd killed by SIGABRT
time:           Tue 20 Mar 2018 09:10:16 PM UTC
cmdline:        /usr/bin/ceph-osd -f --cluster ceph --id 121 --setuser ceph --setgroup ceph
package:        ceph-osd-12.2.2-0.el7
uid:            167 (ceph)
count:          1
Directory:      /var/spool/abrt/ccpp-2018-03-20-21:10:16-45896

cn2.chn6us1c1.cdn /var/log/ceph# zgrep boot ceph.log-20180321.gz
2018-03-20 21:12:14.915185 mon.cn1 mon.0 10.50.35.71:6789/0 309391 : cluster [INF] osd.121 10.50.35.72:6906/707645 boot
2018-03-20 21:12:14.783543 mon.cn1 mon.0 10.50.35.71:6789/0 309390 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-03-20 21:12:14.915185 mon.cn1 mon.0 10.50.35.71:6789/0 309391 : cluster [INF] osd.121 10.50.35.72:6906/707645 boot
2018-03-20 21:12:15.108860 mon.cn1 mon.0 10.50.35.71:6789/0 309727 : cluster [WRN] Health check update: Degraded data redundancy: 3857957/1306619130 objects degraded (0.295%), 121 pgs unclean, 121 pgs degraded, 121 pgs undersized (PG_DEGRADED)
Attached the ceph-osd.121 log file. Please raise a ticket if needed.

Attaching the OSD logs.

#2 Updated by Patrick Donnelly almost 6 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from OSD crashed in Luminous 12.2.4 to osd: segfault
  • Source set to Community (user)
  • Release deleted (luminous)
  • Component(RADOS) OSD added

#3 Updated by Josh Durgin almost 6 years ago

  • Project changed from RADOS to bluestore

#4 Updated by Sage Weil over 5 years ago

  • Status changed from New to Can't reproduce
