Bug #8646

OSD: assert in share_map() when marked down by an OSDMap

Added by Greg Farnum about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
Start date: 06/23/2014
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

0> 2014-06-09 18:06:22.922629 7fcfab369700 -1 osd/OSD.cc: In function 'void OSDService::share_map(entity_name_t, Connection*, epoch_t, OSDMapRef&, epoch_t*)' thread 7fcfab369700 time 2014-06-09 18:06:22.921311
osd/OSD.cc: 4781: FAILED assert(osd->is_active() || osd->is_stopping())
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x58f) [0x6351df]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x182) [0x635442]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
6: (()+0x8182) [0x7fcfc4a7d182]
7: (clone()+0x6d) [0x7fcfc2e1e30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

This is from a custom build, but the issue exists in master. We call share_map() in OSD::dequeue_op(), but we might be dequeuing after the OSD state has changed to STATE_WAITING_FOR_HEALTHY. I think the fix is just to call share_map() only when the OSD is actually in STATE_ACTIVE.
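
The proposed guard can be sketched roughly as follows. This is a minimal illustration, not the Ceph code: `MiniOSD`, `OsdState`, and the `maps_shared` counter are hypothetical stand-ins, and only the assert condition is taken from the trace above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the real Ceph types (assumption).
enum class OsdState { Active, WaitingForHealthy, Stopping };

struct MiniOSD {
  OsdState state = OsdState::Active;
  int maps_shared = 0;  // counts share_map() calls, for illustration only

  bool is_active() const { return state == OsdState::Active; }
  bool is_stopping() const { return state == OsdState::Stopping; }

  // Keeps the precondition seen in the trace:
  // assert(osd->is_active() || osd->is_stopping())
  void share_map() {
    assert(is_active() || is_stopping());
    ++maps_shared;
  }

  // Dequeue path: share the map only when the OSD is actually active,
  // so an op dequeued after a state change never reaches the assert.
  void dequeue_op() {
    if (is_active()) {
      share_map();
    }
    // ... op handling would continue here ...
  }
};
```

With this shape, calling `dequeue_op()` on an instance whose state is `OsdState::WaitingForHealthy` skips `share_map()` entirely instead of aborting.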

Associated revisions

Revision fde99e69 (diff)
Added by Somnath Roy about 5 years ago

OSD: adjust share_map() to handle the case that the osd is down

The assert was hit while the OSD was waiting to become healthy
in handle_osd_map(). This can happen when I/O is in progress and
OSDs are forcibly marked down, for example by the osd thrash command.
So the fix could be to just return from here instead of asserting.

Fixes: #8646

Signed-off-by: Somnath Roy <>
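
The approach this commit message describes, returning early instead of asserting, might look roughly like the sketch below. It does not reproduce the actual diff: `SketchService`, `SvcState`, the `bool` return value, and the epoch comparison are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-ins; not the real OSDService interface (assumption).
enum class SvcState { Active, WaitingForHealthy, Stopping };

struct SketchService {
  SvcState state = SvcState::Active;
  std::uint32_t current_epoch = 0;

  // Returns true if a newer map would be sent to the peer.
  bool share_map(std::uint32_t peer_epoch) const {
    if (state != SvcState::Active) {
      // Marked down or not yet healthy: return instead of asserting.
      return false;
    }
    // Share only when the peer's map is older than ours.
    return peer_epoch < current_epoch;
  }
};
```

Placing the check in the callee means every caller is covered, including any path that dequeues an op after the state has already changed.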

History

#1 Updated by Sahana Lokeshappa about 5 years ago

Steps to reproduce:

Use a Ceph cluster with 8 nodes, 3 OSDs per node.

While client I/O is in progress, run the command: ceph osd thrash 101

One of the OSDs crashed with the assert in share_map().

osd.22 was marked down by the monitor at epoch 122:
2014-06-09 18:06:17.817630 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.22 DOWN
2014-06-09 18:06:17.817639 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.8 IN
2014-06-09 18:06:17.817643 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd e122 osd.11 OUT
2014-06-09 18:06:17.817710 7f421da58700 0 log [INF] : osdmap e122: 24 osds: 17 up, 16 in

At map epoch 125, osd.22 complained about being wrongly marked down:

-452> 2014-06-09 18:06:22.880416 7fcfb4b7c700 1 osd.22 124 ms_handle_reset con 0x6fe8dc0 session 0
-451> 2014-06-09 18:06:22.880433 7fcfb637f700 0 log [WRN] : map e125 wrongly marked me down
-450> 2014-06-09 18:06:22.880440 7fcfb637f700 1 osd.22 125 start_waiting_for_healthy

-1> 2014-06-09 18:06:22.922590 7fcfac36b700 1 osd.22 pg_epoch: 125 pg[4.11d( empty local-les=109 n=0 ec=108 les/c 109/109 123/123/120) [3,14] r=-1 lpr=123 pi=108-122/4 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
0> 2014-06-09 18:06:22.922629 7fcfab369700 -1 osd/OSD.cc: In function 'void OSDService::share_map(entity_name_t, Connection*, epoch_t, OSDMapRef&, epoch_t*)' thread 7fcfab369700 time 2014-06-09 18:06:22.921311
osd/OSD.cc: 4781: FAILED assert(osd->is_active() || osd->is_stopping())
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x58f) [0x6351df]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x182) [0x635442]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
6: (()+0x8182) [0x7fcfc4a7d182]
7: (clone()+0x6d) [0x7fcfc2e1e30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#2 Updated by Sage Weil about 5 years ago

  • Status changed from New to Resolved
