Bug #8646

closed

OSD: assert in share_map() when marked down by an OSDMap

Added by Greg Farnum almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

0> 2014-06-09 18:06:22.922629 7fcfab369700 -1 osd/OSD.cc: In function 'void OSDService::share_map(entity_name_t, Connection*, epoch_t, OSDMapRef&, epoch_t*)' thread 7fcfab369700 time 2014-06-09 18:06:22.921311
osd/OSD.cc: 4781: FAILED assert(osd->is_active() || osd->is_stopping())
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x58f) [0x6351df]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x182) [0x635442]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
6: (()+0x8182) [0x7fcfc4a7d182]
7: (clone()+0x6d) [0x7fcfc2e1e30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

This is from a custom build, but the issue exists in master. We call share_map() in OSD::dequeue_op(), but we might be dequeuing after the OSD state has changed to STATE_WAITING_FOR_HEALTHY. I think the fix is just to make the share_map() call conditional on the OSD actually being in STATE_ACTIVE.
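A minimal sketch of the guard proposed above, using a simplified mock rather than the real Ceph classes. Everything here (MockOsd, OsdState, the maps_shared counter, the bool return value) is hypothetical and only illustrates the idea; it is not the actual patch, and whether the rest of the op should still be processed while waiting for healthy is left open.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the OSD states named in this report.
enum class OsdState { Active, WaitingForHealthy, Stopping };

// Hypothetical mock; the real OSD/OSDService classes are far larger.
struct MockOsd {
  OsdState state = OsdState::Active;
  int maps_shared = 0;  // counts successful share_map() calls

  bool is_active() const { return state == OsdState::Active; }
  bool is_stopping() const { return state == OsdState::Stopping; }

  // Same precondition as the failed assert in the backtrace.
  void share_map() {
    assert(is_active() || is_stopping());
    ++maps_shared;
  }

  // Guarded dequeue: an op queued while active may be dequeued after the
  // state changed, so only share the map if we are still active.
  // Returns true if share_map() was called. Op processing itself is omitted.
  bool dequeue_op() {
    if (!is_active()) {
      return false;  // skip map sharing instead of tripping the assert
    }
    share_map();
    return true;
  }
};
```

With this guard, a dequeue that happens after the transition to the waiting-for-healthy state skips share_map() instead of hitting the assert, while dequeues in the active state behave as before.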

Actions #1

Updated by Sahana Lokeshappa almost 10 years ago

Steps to reproduce:

Ceph cluster with 8 nodes, 3 OSDs per node.

While client I/O is in progress, run: ceph osd thrash 101

One of the OSDs crashed with an assert in share_map().

osd.22 was marked down by the monitor on epoch 122.
2014-06-09 18:06:17.817630 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.22 DOWN
2014-06-09 18:06:17.817639 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.8 IN
2014-06-09 18:06:17.817643 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.11 OUT
2014-06-09 18:06:17.817710 7f421da58700 0 log [INF] : osdmap e122: 24 osds: 17 up, 16 in

At map epoch 125, osd.22 complained about being wrongly marked down and started waiting for healthy.

-452> 2014-06-09 18:06:22.880416 7fcfb4b7c700 1 osd.22 124 ms_handle_reset con 0x6fe8dc0 session 0
-451> 2014-06-09 18:06:22.880433 7fcfb637f700 0 log [WRN] : map e125 wrongly marked me down
-450> 2014-06-09 18:06:22.880440 7fcfb637f700 1 osd.22 125 start_waiting_for_healthy

-1> 2014-06-09 18:06:22.922590 7fcfac36b700 1 osd.22 pg_epoch: 125 pg[4.11d( empty local-les=109 n=0 ec=108 les/c 109/109 123/123/120) [3,14] r=-1 lpr=123 pi=108-122/4 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
0> 2014-06-09 18:06:22.922629 7fcfab369700 -1 osd/OSD.cc: In function 'void OSDService::share_map(entity_name_t, Connection*, epoch_t, OSDMapRef&, epoch_t*)' thread 7fcfab369700 time 2014-06-09 18:06:22.921311
osd/OSD.cc: 4781: FAILED assert(osd->is_active() || osd->is_stopping())
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x58f) [0x6351df]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x182) [0x635442]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
6: (()+0x8182) [0x7fcfc4a7d182]
7: (clone()+0x6d) [0x7fcfc2e1e30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #2

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Resolved