Bug #16308

closed

mon: osd.1 marked down for no apparent reason

Added by Sage Weil almost 8 years ago. Updated almost 8 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
Monitor
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-06-14 11:48:33.039760 7fe7e481e700 10 mon.a@0(leader).osd e14 preprocess_query osd_alive(want up_thru 14 have 14) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:33.039779 7fe7e481e700 10 mon.a@0(leader).osd e14 preprocess_alive want up_thru 14 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:33.039787 7fe7e481e700  7 mon.a@0(leader).osd e14 prepare_update osd_alive(want up_thru 14 have 14) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:33.039796 7fe7e481e700  7 mon.a@0(leader).osd e14 prepare_alive want up_thru 14 have 14 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.091714 7fe7e481e700 10 mon.a@0(leader).osd e15 committed, telling random osd.1 172.21.15.36:6805/1851 all about it
2016-06-14 11:48:34.092194 7fe7e481e700  7 mon.a@0(leader).osd e15 _reply_map 14 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.092201 7fe7e481e700  5 mon.a@0(leader).osd e15 send_latest to osd.1 172.21.15.36:6805/1851 start 14
2016-06-14 11:48:34.092204 7fe7e481e700  5 mon.a@0(leader).osd e15 send_incremental [14..15] to osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.092207 7fe7e481e700 10 mon.a@0(leader).osd e15  osd.1 should have epoch 14
2016-06-14 11:48:34.134437 7fe7e481e700 10 mon.a@0(leader).osd e15 preprocess_query osd_alive(want up_thru 15 have 15) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.134441 7fe7e481e700 10 mon.a@0(leader).osd e15 preprocess_alive want up_thru 15 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.134443 7fe7e481e700  7 mon.a@0(leader).osd e15 prepare_update osd_alive(want up_thru 15 have 15) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:34.134445 7fe7e481e700  7 mon.a@0(leader).osd e15 prepare_alive want up_thru 15 have 15 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:35.159764 7fe7e481e700 10 mon.a@0(leader).osd e16 committed, telling random osd.1 172.21.15.36:6805/1851 all about it
2016-06-14 11:48:35.160218 7fe7e481e700  7 mon.a@0(leader).osd e16 _reply_map 15 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:35.160228 7fe7e481e700  5 mon.a@0(leader).osd e16 send_latest to osd.1 172.21.15.36:6805/1851 start 15
2016-06-14 11:48:35.160236 7fe7e481e700  5 mon.a@0(leader).osd e16 send_incremental [15..16] to osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:35.160246 7fe7e481e700 10 mon.a@0(leader).osd e16  osd.1 should have epoch 15
2016-06-14 11:48:48.116530 7fe7e481e700 10 mon.a@0(leader).osd e17 preprocess_query osd_alive(want up_thru 17 have 17) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:48.116551 7fe7e481e700 10 mon.a@0(leader).osd e17 preprocess_alive want up_thru 17 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:48.116559 7fe7e481e700  7 mon.a@0(leader).osd e17 prepare_update osd_alive(want up_thru 17 have 17) v1 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:48.116568 7fe7e481e700  7 mon.a@0(leader).osd e17 prepare_alive want up_thru 17 have 17 from osd.1 172.21.15.36:6805/1851
2016-06-14 11:48:48.737515 7fe7e501f700  2 mon.a@0(leader).osd e17  osd.1 DOWN
2016-06-14 11:48:48.872492 7fe7e481e700 10 mon.a@0(leader).osd e18  adding osd.1 to down_pending_out map

I can't see in the log how/why osd.1 got marked down.

/a/sage-2016-06-14_04:22:11-rados-master---basic-smithi/258317
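When hunting for the cause of a transition like this, it can help to pull out only the lines that mention the OSD and look at what immediately precedes the DOWN entry. The following is a minimal sketch, not part of Ceph or teuthology: the regular expression is inferred solely from the monitor log lines quoted above, and the helper name `events_before_down` and its `context` parameter are hypothetical.

```python
import re

# Layout inferred from the log excerpt in this report (an assumption, not a
# documented format):
#   <date> <time> <thread> <level> mon.<name>@<rank>(<state>).osd e<epoch> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>\d+) "
    r"(?P<who>mon\.\S+)\.osd e(?P<epoch>\d+)\s+(?P<msg>.*)$"
)


def events_before_down(lines, osd_id, context=3):
    """Return up to `context` parsed lines mentioning osd.<osd_id> that
    precede its first 'DOWN' entry, followed by the DOWN entry itself.

    Hypothetical helper for illustration. Returns [] if no DOWN entry exists.
    """
    # \b keeps osd.1 from matching osd.10, osd.11, and so on.
    mention = re.compile(rf"\bosd\.{osd_id}\b")
    parsed = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and mention.search(m.group("msg")):
            parsed.append(m.groupdict())
    for i, rec in enumerate(parsed):
        if rec["msg"].strip() == f"osd.{osd_id} DOWN":
            return parsed[max(0, i - context):i + 1]
    return []


# Three lines copied from the description above.
sample = [
    "2016-06-14 11:48:48.116568 7fe7e481e700  7 mon.a@0(leader).osd e17 prepare_alive want up_thru 17 have 17 from osd.1 172.21.15.36:6805/1851",
    "2016-06-14 11:48:48.737515 7fe7e501f700  2 mon.a@0(leader).osd e17  osd.1 DOWN",
    "2016-06-14 11:48:48.872492 7fe7e481e700 10 mon.a@0(leader).osd e18  adding osd.1 to down_pending_out map",
]

result = events_before_down(sample, 1)
for rec in result:
    print(rec["ts"], rec["thread"], rec["level"], rec["msg"])
```

Run against the excerpt, this shows that the DOWN entry was logged from thread 7fe7e501f700, while the surrounding osd_alive handling ran on 7fe7e481e700, which is one way to narrow down which code path to inspect next.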


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #16332: whitelist "wrongly marked me down" in lfn-upgrade-infernalis.yaml and lfn-upgrade-hammer.yaml (Resolved, Yuri Weinstein, 06/15/2016)

Actions #1

Updated by Sage Weil almost 8 years ago

also /a/sage-2016-06-14_04:22:11-rados-master---basic-smithi/258496

Actions #2

Updated by Sage Weil almost 8 years ago

and /a/sage-2016-06-14_04:22:11-rados-master---basic-smithi/258523

Actions #3

Updated by Samuel Just almost 8 years ago

ceph-qa-suite change 1b7552c9cb331978cb0bfd4d7dc4dcde4186c176 marks the osds down manually to eliminate the restart->wait_for_clean race. The right answer is to whitelist the resulting "wrongly marked me down" message.

Actions #4

Updated by Samuel Just almost 8 years ago

  • Related to Bug #16332: whitelist "wrongly marked me down" in lfn-upgrade-infernalis.yaml and lfn-upgrade-hammer.yaml added
Actions #5

Updated by Samuel Just almost 8 years ago

  • Status changed from New to Rejected