Bug #24678

ceph-mon segmentation fault after setting pool size to 1 on degraded cluster

Added by Sergey Burdakov almost 6 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
ceph-mon Segmentation fault
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are unable to start any of the 3 monitors after changing the pool size from 3 to 1. The cluster was in a degraded state (2 of 3 OSD nodes down), and bug #24423 was also present.
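For reference, the sequence that preceded the crash can be replayed as below. The pool name and monitor host are taken from the audit log entries; the systemd unit name assumes a standard `ceph-mon@<hostname>` deployment. This is an operational sketch against a live cluster, not a confirmed minimal reproducer (the issue was later closed as "Can't reproduce"):

```shell
# On a degraded Mimic 13.2.0 cluster (2 of 3 OSD nodes down),
# reduce the replica count of the CephFS data pool from 3 to 1.
# The audit log shows this command dispatching and finishing
# just before the monitor segfaults in OSDMapMapping::_build_rmap.
ceph osd pool set cephfs_data size 1

# After the crash, restarting the monitor service reproduces the
# segfault during startup, in the same cpu_tp worker thread.
systemctl restart ceph-mon@cdp4
```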

    -7> 2018-06-27 18:45:58.929 7f7637878700  4 mgrc handle_mgr_map Active mgr is now 10.14.88.207:6801/569195
    -6> 2018-06-27 18:46:00.245 7f763b880700  0 mon.cdp4@0(leader) e1 handle_command mon_command({"var": "size", "prefix": "osd pool set", "pool": "cephfs_data", "val": "1"} v 0) v1
    -5> 2018-06-27 18:46:00.245 7f763b880700  0 log_channel(audit) log [INF] : from='client.114290 -' entity='client.admin' cmd=[{"var": "size", "prefix": "osd pool set", "pool": "cephfs_data", "val": "1"}]: dispatch
    -4> 2018-06-27 18:46:00.321 7f7637878700  4 mgrc handle_mgr_map Got map version 92
    -3> 2018-06-27 18:46:00.321 7f7637878700  4 mgrc handle_mgr_map Active mgr is now 10.14.88.207:6801/569195
    -2> 2018-06-27 18:46:00.325 7f7637878700  0 log_channel(audit) log [INF] : from='client.114290 -' entity='client.admin' cmd='[{"var": "size", "prefix": "osd pool set", "pool": "cephfs_data", "val": "1"}]': finished
    -1> 2018-06-27 18:46:00.325 7f7637878700  0 log_channel(cluster) log [DBG] : osdmap e934: 64 total, 28 up, 47 in
     0> 2018-06-27 18:46:00.333 7f763d884700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f763d884700 thread_name:cpu_tp

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (()+0x4b22a0) [0x5608f564c2a0]
 2: (()+0x11390) [0x7f7646d02390]
 3: (OSDMapMapping::_build_rmap(OSDMap const&)+0x114) [0x7f764759f204]
 4: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x7f764759f531]
 5: (ParallelPGMapper::Job::finish_one()+0xf5) [0x7f764759f635]
 6: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*, ThreadPool::TPHandle&)+0x5c) [0x7f764759f6bc]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x8f7) [0x7f76473fdc37]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f76473feb60]
 9: (()+0x76ba) [0x7f7646cf86ba]
 10: (clone()+0x6d) [0x7f7645a1c41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Stopping and starting the monitor service/target via systemctl also leads to a segfault, with the following stack trace.

   -20> 2018-06-27 19:58:28.391 7fd0c8edf180  4 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:/var/lib/ceph/mon/ceph-cdp4/store.db/MANIFEST-019582 succeeded,manifest_file_number is 19582, next_file_number is 19585, last_sequence is 7451606, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 19580

   -19> 2018-06-27 19:58:28.391 7fd0c8edf180  4 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 19581

   -18> 2018-06-27 19:58:28.391 7fd0c8edf180  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1530118708394381, "job": 1, "event": "recovery_started", "log_files": [19583]}
   -17> 2018-06-27 19:58:28.391 7fd0c8edf180  4 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/db_impl_open.cc:551] Recovering log #19583 mode 2
   -16> 2018-06-27 19:58:28.391 7fd0c8edf180  4 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/version_set.cc:2863] Creating manifest 19585

   -15> 2018-06-27 19:58:28.395 7fd0c8edf180  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1530118708395928, "job": 1, "event": "recovery_finished"}
   -14> 2018-06-27 19:58:28.395 7fd0c8edf180  5 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/db_impl_files.cc:380] [JOB 2] Delete /var/lib/ceph/mon/ceph-cdp4/store.db//MANIFEST-019582 type=3 #19582 -- OK

   -13> 2018-06-27 19:58:28.395 7fd0c8edf180  5 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/db_impl_files.cc:380] [JOB 2] Delete /var/lib/ceph/mon/ceph-cdp4/store.db//019583.log type=0 #19583 -- OK

   -12> 2018-06-27 19:58:28.395 7fd0c8edf180  4 rocksdb: [/build/ceph-13.2.0/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x55a24fd08000
   -11> 2018-06-27 19:58:28.395 7fd0c8edf180  0 starting mon.cdp4 rank 0 at public addr 10.14.88.204:6789/0 at bind addr 10.14.88.204:6789/0 mon_data /var/lib/ceph/mon/ceph-cdp4 fsid 04176392-32d2-11e8-a537-00259074f012
   -10> 2018-06-27 19:58:28.395 7fd0c8edf180  0 starting mon.cdp4 rank 0 at 10.14.88.204:6789/0 mon_data /var/lib/ceph/mon/ceph-cdp4 fsid 04176392-32d2-11e8-a537-00259074f012
    -9> 2018-06-27 19:58:28.399 7fd0c8edf180  0 mon.cdp4@-1(probing).mds e14 print_map
e14
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   14
flags   12
created 2018-06-10 01:03:29.512343
modified        2018-06-27 17:53:21.155983
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    0
last_failure_osd_epoch  905
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=90956}
failed  
damaged 
stopped 
data_pools      [5]
metadata_pool   8
inline_data     disabled
balancer        
standby_count_wanted    1
90956:  10.14.88.207:6800/1768662562 'cdp7' mds.0.9 up:active seq 11
104121: 10.14.88.208:6800/3625827821 'cdp8' mds.0.0 up:standby-replay seq 1 (standby for rank 0)

    -8> 2018-06-27 19:58:28.399 7fd0c8edf180  0 mon.cdp4@-1(probing).osd e934 crush map has features 288514051259236352, adjusting msgr requires
    -7> 2018-06-27 19:58:28.399 7fd0c8edf180  0 mon.cdp4@-1(probing).osd e934 crush map has features 288514051259236352, adjusting msgr requires
    -6> 2018-06-27 19:58:28.399 7fd0c8edf180  0 mon.cdp4@-1(probing).osd e934 crush map has features 1009089991638532096, adjusting msgr requires
    -5> 2018-06-27 19:58:28.399 7fd0c8edf180  0 mon.cdp4@-1(probing).osd e934 crush map has features 288514051259236352, adjusting msgr requires
    -4> 2018-06-27 19:58:28.399 7fd0c8edf180  4 mgrc handle_mgr_map Got map version 92
    -3> 2018-06-27 19:58:28.399 7fd0c8edf180  4 mgrc handle_mgr_map Active mgr is now 10.14.88.207:6801/569195
    -2> 2018-06-27 19:58:28.399 7fd0c8edf180  4 mgrc reconnect Starting new session with 10.14.88.207:6801/569195
    -1> 2018-06-27 19:58:28.403 7fd0c8edf180  0 mon.cdp4@-1(probing) e1  my rank is now 0 (was -1)
     0> 2018-06-27 19:58:28.415 7fd0b66a7700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd0b66a7700 thread_name:cpu_tp

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (()+0x4b22a0) [0x55a24e5b92a0]
 2: (()+0x11390) [0x7fd0bfb25390]
 3: (OSDMapMapping::_build_rmap(OSDMap const&)+0x1d5) [0x7fd0c03c22c5]
 4: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x7fd0c03c2531]
 5: (ParallelPGMapper::Job::finish_one()+0xf5) [0x7fd0c03c2635]
 6: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*, ThreadPool::TPHandle&)+0x5c) [0x7fd0c03c26bc]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x8f7) [0x7fd0c0220c37]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7fd0c0221b60]
 9: (()+0x76ba) [0x7fd0bfb1b6ba]
 10: (clone()+0x6d) [0x7fd0be83f41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by Patrick Donnelly over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
  • Component(RADOS) Monitor added

#2 Updated by Josh Durgin over 5 years ago

  • Priority changed from Normal to High

#3 Updated by Josh Durgin over 4 years ago

  • Status changed from New to Can't reproduce
