Bug #21977

null map from OSDService::get_map in advance_pg

Added by Yuri Weinstein over 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
10/30/2017
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/jewel-x
Component(RADOS):
Pull request ID:

Description

Run: http://pulpito.ceph.com/teuthology-2017-10-30_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/
Jobs: 1791436
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-10-30_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/1791436/teuthology.log

2017-10-30T07:41:21.182 INFO:tasks.ceph.osd.0.ovh062.stderr:2017-10-30 07:41:21.163762 7fe63f602d00 -1 osd.0 28 log_to_monitors {default=true}
2017-10-30T07:41:21.183 INFO:tasks.ceph.osd.0.ovh062.stderr:*** Caught signal (Segmentation fault) **
2017-10-30T07:41:21.183 INFO:tasks.ceph.osd.0.ovh062.stderr: in thread 7fe622d52700 thread_name:tp_peering
2017-10-30T07:41:21.189 INFO:tasks.ceph.osd.0.ovh062.stderr: ceph version 12.2.1-454-g6166148 (61661480780e555fc501aec7c32163596e1e18d3) luminous (stable)
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 1: (()+0xa11b79) [0x7fe63ec38b79]
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 2: (()+0x10330) [0x7fe63c72d330]
2017-10-30T07:41:21.199 INFO:tasks.ceph.osd.0.ovh062.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7fe63e71c8d3]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7fe63e780047]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7fe63ec7bd0e]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fe63ec7cbf0]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 7: (()+0x8184) [0x7fe63c725184]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: 8: (clone()+0x6d) [0x7fe63b814ffd]
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr:2017-10-30 07:41:21.167301 7fe622d52700 -1 *** Caught signal (Segmentation fault) **
2017-10-30T07:41:21.200 INFO:tasks.ceph.osd.0.ovh062.stderr: in thread 7fe622d52700 thread_name:tp_peering
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr:
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: ceph version 12.2.1-454-g6166148 (61661480780e555fc501aec7c32163596e1e18d3) luminous (stable)
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 1: (()+0xa11b79) [0x7fe63ec38b79]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 2: (()+0x10330) [0x7fe63c72d330]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7fe63e71c8d3]
2017-10-30T07:41:21.201 INFO:tasks.ceph.osd.0.ovh062.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7fe63e780047]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7fe63ec7bd0e]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fe63ec7cbf0]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 7: (()+0x8184) [0x7fe63c725184]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: 8: (clone()+0x6d) [0x7fe63b814ffd]
2017-10-30T07:41:21.202 INFO:tasks.ceph.osd.0.ovh062.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Copied to RADOS - Backport #23870: luminous: null map from OSDService::get_map in advance_pg Resolved

History

#1 Updated by Yuri Weinstein about 1 year ago

This seems to be persisting; see

http://qa-proxy.ceph.com/teuthology/teuthology-2018-02-05_04:23:02-upgrade:jewel-x-luminous-distro-basic-ovh/2154473/teuthology.log

2018-02-05T16:01:19.860 INFO:tasks.ceph.osd.3.ovh035.stderr:2018-02-05 16:01:19.831595 7f5af0032d00 -1 osd.3 pg_epoch: 15 pg[1.1( v 15'5 (0'0,15'5] local-lis/les=14/14 n=3 ec=13/13 lis/c 13/13 les/c/f 14/14/0 13/13/13) [3,0] r=0 lpr=0 crt=15'5 lcod 0'0 mlcod 0'0 unknown] update_store_on_load setting bit width to 2
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr:2018-02-05 16:01:19.833550 7f7af2485d00 -1 osd.2 51 log_to_monitors {default=true}
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr:*** Caught signal (Segmentation fault) **
2018-02-05T16:01:19.862 INFO:tasks.ceph.osd.2.ovh035.stderr: in thread 7f7ad53be700 thread_name:tp_peering
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: ceph version 12.2.2-742-g239b1ae (239b1ae5e19c16e976c2045fef5ad65f1f727278) luminous (stable)
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: 1: (()+0xa1aaf9) [0x7f7af1aa8af9]
2018-02-05T16:01:19.867 INFO:tasks.ceph.osd.2.ovh035.stderr: 2: (()+0x10330) [0x7f7aef594330]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7f7af1586ea3]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f7af15ed957]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7f7af1aec56e]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f7af1aed450]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 7: (()+0x8184) [0x7f7aef58c184]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: 8: (clone()+0x6d) [0x7f7aee67c03d]
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr:2018-02-05 16:01:19.838810 7f7ad53be700 -1 *** Caught signal (Segmentation fault) **
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: in thread 7f7ad53be700 thread_name:tp_peering
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr:
2018-02-05T16:01:19.868 INFO:tasks.ceph.osd.2.ovh035.stderr: ceph version 12.2.2-742-g239b1ae (239b1ae5e19c16e976c2045fef5ad65f1f727278) luminous (stable)
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 1: (()+0xa1aaf9) [0x7f7af1aa8af9]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 2: (()+0x10330) [0x7f7aef594330]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x153) [0x7f7af1586ea3]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f7af15ed957]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x7f7af1aec56e]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f7af1aed450]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 7: (()+0x8184) [0x7f7aef58c184]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: 8: (clone()+0x6d) [0x7f7aee67c03d]
2018-02-05T16:01:19.869 INFO:tasks.ceph.osd.2.ovh035.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#3 Updated by Josh Durgin 11 months ago

From the latest logs, the peering thread id does not appear at all in the log until the crash.

I'm wondering if we can reproduce this on smithi or mira. Perhaps it is related to an anomalously slow disk in ovh.

#4 Updated by Sage Weil 11 months ago

  • Subject changed from "Caught signal (Segmentation fault)" in upgrade:jewel-x-luminous to null map from OSDService::get_map in advance_pg
  • Status changed from New to Need Review
  • Assignee set to Sage Weil

advance_pg ran before init() published the initial map to OSDService.

#6 Updated by Sage Weil 11 months ago

  • Backport set to luminous

#7 Updated by Sage Weil 11 months ago

  • Status changed from Need Review to Pending Backport

#8 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23870: luminous: null map from OSDService::get_map in advance_pg added

#9 Updated by Nathan Cutler 11 months ago

  • Status changed from Pending Backport to Resolved
