Bug #21977

null map from OSDService::get_map in advance_pg

Added by Yuri Weinstein over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/jewel-x
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2017-10-30_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/
Jobs: 1791436
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-10-30_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/1791436/teuthology.log

2017-10-30T07:41:21.182 INFO:tasks.ceph.osd.0.ovh062.stderr:2017-10-30 07:41:21.163762 7fe63f602d00 -1 osd.0 28 log_to_monitors {default=true}
2017-10-30T07:41:21.183 INFO:tasks.ceph.osd.0.ovh062.stderr:*** Caught signal (Segmentation fault) **
2017-10-30T07:41:21.183 INFO:tasks.ceph.osd.0.ovh062.stderr: in thread 7fe622d52700 thread_name:tp_peering
2017-10-30T07:41:21.189 INFO:tasks.ceph.osd.0.ovh062.stderr: ceph version 12.2.1-454-g6166148 (61661480780e555fc501aec7c32163596e1e18d3) luminous (stable)
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 1: (()+0xa11b79) [0x7fe63ec38b79]
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 2: (()+0x10330) [0x7fe63c72d330]
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7fe63e71c8d3]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7fe63e780047]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7fe63ec7bd0e]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fe63ec7cbf0]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 7: (()+0x8184) [0x7fe63c725184]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 8: (clone()+0x6d) [0x7fe63b814ffd]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr:2017-10-30 07:41:21.167301 7fe622d52700 -1 *** Caught signal (Segmentation fault) **
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: in thread 7fe622d52700 thread_name:tp_peering
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr:
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: ceph version 12.2.1-454-g6166148 (61661480780e555fc501aec7c32163596e1e18d3) luminous (stable)
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 1: (()+0xa11b79) [0x7fe63ec38b79]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 2: (()+0x10330) [0x7fe63c72d330]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7fe63e71c8d3]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7fe63e780047]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7fe63ec7bd0e]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fe63ec7cbf0]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 7: (()+0x8184) [0x7fe63c725184]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 8: (clone()+0x6d) [0x7fe63b814ffd]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Copied to RADOS - Backport #23870: luminous: null map from OSDService::get_map in advance_pg Resolved

History

#1 Updated by Yuri Weinstein about 6 years ago

Seems to be persisting; see:

http://qa-proxy.ceph.com/teuthology/teuthology-2018-02-05_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/2154473/teuthology.log

2018-02-05T16:01:19.860 INFO:tasks.ceph.osd.3.ovh035.stderr:2018-02-05 16:01:19.831595 7f5af0032d00 -1 osd.3 pg_epoch: 15 pg[1.1( v 15'5 (0'0,15'5] local-lis/les=14/14 n=3 ec=13/13 lis/c 13/13 les/c/f 14/14/0 13/13/13) [3,0] r=0 lpr=0 crt=15'5 lcod 0'0 mlcod 0'0 unknown] update_store_on_load setting bit width to 2
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr:2018-02-05 16:01:19.833550 7f7af2485d00 -1 osd.2 51 log_to_monitors {default=true}
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr:*** Caught signal (Segmentation fault) **
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr: in thread 7f7ad53be700 thread_name:tp_peering
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: ceph version 12.2.2-742-g239b1ae (239b1ae5e19c16e976c2045fef5ad65f1f727278) luminous (stable)
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: 1: (()+0xa1aaf9) [0x7f7af1aa8af9]
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: 2: (()+0x10330) [0x7f7aef594330]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7f7af1586ea3]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f7af15ed957]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7f7af1aec56e]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f7af1aed450]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 7: (()+0x8184) [0x7f7aef58c184]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 8: (clone()+0x6d) [0x7f7aee67c03d]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr:2018-02-05 16:01:19.838810 7f7ad53be700 -1 *** Caught signal (Segmentation fault) **
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: in thread 7f7ad53be700 thread_name:tp_peering
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr:
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: ceph version 12.2.2-742-g239b1ae (239b1ae5e19c16e976c2045fef5ad65f1f727278) luminous (stable)
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 1: (()+0xa1aaf9) [0x7f7af1aa8af9]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 2: (()+0x10330) [0x7f7aef594330]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7f7af1586ea3]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f7af15ed957]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7f7af1aec56e]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f7af1aed450]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 7: (()+0x8184) [0x7f7aef58c184]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 8: (clone()+0x6d) [0x7f7aee67c03d]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#3 Updated by Josh Durgin almost 6 years ago

From the latest logs, the peering thread id does not appear at all in the log until the crash.

I'm wondering whether we can reproduce this on smithi or mira? Perhaps it is related to an anomalously slow disk in ovh.

#4 Updated by Sage Weil almost 6 years ago

  • Subject changed from "Caught signal (Segmentation fault)" in upgrade:jewel-x-luminous to null map from OSDService::get_map in advance_pg
  • Status changed from New to Fix Under Review
  • Assignee set to Sage Weil

advance_pg ran before init() published the initial map to OSDService.
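A minimal, hypothetical sketch of the race (this is illustrative code, not Ceph's actual OSD implementation; the class and function names below only mirror the ones in the backtrace): a peering worker thread calls advance_pg before init() has published the initial OSDMap, so get_map returns null, and dereferencing that null map segfaults. The fix amounts to tolerating the not-yet-published case instead of dereferencing unconditionally.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

struct OSDMap { int epoch; };
using OSDMapRef = std::shared_ptr<const OSDMap>;

// Stand-in for OSDService: maps become visible only once published.
class Service {
  std::mutex lock;
  std::map<int, OSDMapRef> maps;  // published maps, keyed by epoch
public:
  void publish_map(OSDMapRef m) {  // done by init()
    std::lock_guard<std::mutex> g(lock);
    maps[m->epoch] = m;
  }
  OSDMapRef get_map(int epoch) {   // returns null before init() publishes
    std::lock_guard<std::mutex> g(lock);
    auto it = maps.find(epoch);
    return it == maps.end() ? nullptr : it->second;
  }
};

// Before the fix: the returned map was dereferenced unconditionally,
// crashing when the peering thread ran ahead of init().
// After the fix: bail out and let the work be retried later.
bool advance_pg(Service& svc, int epoch) {
  OSDMapRef next = svc.get_map(epoch);
  if (!next)
    return false;  // init() hasn't published the initial map yet
  return next->epoch == epoch;
}
```

The ordering bug is that nothing forced publish_map to happen-before the first advance_pg; the null check makes the early call harmless instead of fatal.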

#6 Updated by Sage Weil almost 6 years ago

  • Backport set to luminous

#7 Updated by Sage Weil almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #23870: luminous: null map from OSDService::get_map in advance_pg added

#9 Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved
