Bug #8387

closed

osd: skipping missing maps broken

Added by karan singh almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Problem: 50+ OSDs are getting marked out of the cluster and are down. The cluster is degraded. The logs of the failed OSDs show weird entries that are generated continuously.

Logs from an OSD (debug logging enabled on the OSDs):

2014-05-19 17:09:49.185921 7fcbae674700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.104:6826/777047378 pipe(0x447aa80 sd=129 :0 s=1 pgs=0 cs=0 l=0 c=0x7c165c0).fault with nothing to send, going to standby
2014-05-19 17:09:49.187704 7fcbb0a98700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.102:6803/1081057589 pipe(0x7ddd280 sd=131 :48417 s=1 pgs=0 cs=0 l=0 c=0x912b180).connect claims to be 192.168.1.102:6803/2072057589 not 192.168.1.102:6803/1081057589 - wrong node!
2014-05-19 17:09:49.187817 7fcbb0a98700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.102:6803/1081057589 pipe(0x7ddd280 sd=131 :48417 s=1 pgs=0 cs=0 l=0 c=0x912b180).fault with nothing to send, going to standby
2014-05-19 17:09:49.193429 7fcbb8fba700 1 osd.159 pg_epoch: 226656 pg[3.5c( v 76227'43 (0'0,76227'43] local-les=221836 n=1 ec=71845 les/c 221836/221836 226656/226656/226656) [78,12] r=-1 lpr=226656 pi=71845-226655/6176 crt=0'0 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:49.200057 7fcbb99bb700 1 osd.159 pg_epoch: 226656 pg[2.2ef( empty local-les=221905 n=0 ec=1 les/c 221905/221905 226656/226656/225867) [77,20] r=-1 lpr=226656 pi=221879-226655/156 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:49.211625 7fcbb99bb700 1 osd.159 pg_epoch: 226656 pg[0.7a1( empty local-les=221905 n=0 ec=1 les/c 221905/221905 226622/226622/226216) [66,88,121] r=-1 lpr=226622 pi=629-226621/7373 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:49.213734 7fcbb8fba700 1 osd.159 pg_epoch: 226656 pg[2.497( empty local-les=221807 n=0 ec=1 les/c 221807/221807 226656/226656/226436) [2,151] r=-1 lpr=226656 pi=490-226655/9134 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:49.237009 7fcbb99bb700 1 osd.159 pg_epoch: 226656 pg[0.525( empty local-les=221906 n=0 ec=1 les/c 221906/221906 226637/226637/226561) [21,117,73] r=-1 lpr=226637 pi=338-226636/14696 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:52.355092 7fcbb8fba700 1 osd.159 pg_epoch: 226656 pg[0.2db( empty local-les=221894 n=0 ec=1 les/c 221894/221894 226620/226620/226615) [129,106,158] r=-1 lpr=226620 pi=629-226619/16410 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:52.374646 7fcbb109e700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.107:6828/1519061057 pipe(0x7dda580 sd=549 :48520 s=1 pgs=0 cs=0 l=0 c=0x7c14d00).connect claims to be 192.168.1.107:6828/1486062762 not 192.168.1.107:6828/1519061057 - wrong node!
2014-05-19 17:09:52.374728 7fcbb109e700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.107:6828/1519061057 pipe(0x7dda580 sd=549 :48520 s=1 pgs=0 cs=0 l=0 c=0x7c14d00).fault with nothing to send, going to standby
2014-05-19 17:09:54.871888 7fcbb8fba700 1 osd.159 pg_epoch: 226656 pg[2.510( empty local-les=221820 n=0 ec=1 les/c 221820/221820 226597/226597/226597) [68,25,95] r=-1 lpr=226597 pi=625-226596/7959 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:54.872696 7fcbb99bb700 1 osd.159 pg_epoch: 226656 pg[0.290( empty local-les=221906 n=0 ec=1 les/c 221906/221906 226656/226656/226637) [117,127] r=-1 lpr=226656 pi=108400-226655/8477 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-05-19 17:09:54.883318 7fcbae876700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.105:6851/3429031235 pipe(0x7ddf800 sd=174 :0 s=1 pgs=0 cs=0 l=0 c=0x912d960).fault with nothing to send, going to standby

More logs: http://pastebin.com/0DRuiGRS
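To see which peers trigger the "wrong node!" messages, the log can be tallied with a short script. This is only a rough sketch: the regular expression is derived from the excerpt above, and the function name `stale_peers` is made up for illustration. As far as I understand, the number after the slash in each address is the peer's nonce, so a mismatch suggests the peer restarted and this OSD still holds its old address.

```python
import re
from collections import Counter

# Pattern taken from the messenger lines in the excerpt, e.g.
# "...connect claims to be 192.168.1.102:6803/2072057589
#  not 192.168.1.102:6803/1081057589 - wrong node!"
WRONG_NODE = re.compile(
    r"connect claims to be (?P<actual>\S+) not (?P<expected>\S+) - wrong node!"
)


def stale_peers(lines):
    """Count 'wrong node!' messages per (expected, actual) address pair."""
    counts = Counter()
    for line in lines:
        match = WRONG_NODE.search(line)
        if match:
            counts[(match.group("expected"), match.group("actual"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python stale_peers.py < osd.159.log
    for (expected, actual), n in stale_peers(sys.stdin).most_common():
        print(f"{n:6d}  expected {expected}  got {actual}")
```

Running it over the full OSD log should show whether the mismatches are concentrated on a few hosts or spread across the cluster.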

  # ceph -v
  ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
  # ceph osd stat
  osdmap e357073: 165 osds: 112 up, 165 in
  flags noout
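For reference, the number of down OSDs can be derived from the `ceph osd stat` line above. A minimal sketch, assuming the output keeps the format shown here (the function name `parse_osd_stat` is only illustrative):

```python
import re

# Format assumed from the output above:
# "osdmap e357073: 165 osds: 112 up, 165 in"
OSD_STAT = re.compile(
    r"osdmap e(?P<epoch>\d+): (?P<total>\d+) osds: "
    r"(?P<up>\d+) up, (?P<osds_in>\d+) in"
)


def parse_osd_stat(text):
    """Return epoch, total, up, in and down counts from 'ceph osd stat' text."""
    match = OSD_STAT.search(text)
    if match is None:
        raise ValueError("unrecognised 'ceph osd stat' output")
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # OSDs that exist in the map but are not reported as up.
    fields["down"] = fields["total"] - fields["up"]
    return fields


if __name__ == "__main__":
    stat = parse_osd_stat("osdmap e357073: 165 osds: 112 up, 165 in")
    print(stat["down"])  # 165 - 112 = 53
```

That gives 53 OSDs down, matching the "50+" figure in the description, while all 165 are still counted as in because the noout flag is set.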

I have tried:

1. Restarting the problematic OSDs, but no luck.
2. One host has 4 affected OSDs, so I restarted the entire host, but no luck; the OSDs are still down and logging the same messages:

2014-05-19 17:16:00.140440 7fcbb008e700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.111:6815/727016887 pipe(0x447ee00 sd=262 :40743 s=1 pgs=0 cs=0 l=0 c=0x7c127e0).fault with nothing to send, going to standby
2014-05-19 17:16:00.140815 7fcbaf987700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.102:6802/982001016 pipe(0xa4f5000 sd=267 :43043 s=1 pgs=0 cs=0 l=0 c=0x77fdee0).connect claims to be 192.168.1.102:6802/1848051066 not 192.168.1.102:6802/982001016 - wrong node!
2014-05-19 17:16:00.140846 7fcbaf987700 0 -- 192.168.1.110:6819/58592 >> 192.168.1.102:6802/982001016 pipe(0xa4f5000 sd=267 :43043 s=1 pgs=0 cs=0 l=0 c=0x77fdee0).fault with nothing to send, going to standby
2014-05-19 17:16:00.141851 7fcbb8fba700 1 osd.159 pg_epoch: 236857 pg[2.40c( empty local-les=221836 n=0 ec=1 les/c 221836/221836 236795/236795/236795) [61,46,78] r=-1 lpr=236795 pi=221835-236794/489 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray

3. The disks show no errors; there are no messages in dmesg or /var/log/messages.

4. There was a bug like this in the past, http://tracker.ceph.com/issues/4006; I don't know whether it has come back in Firefly.

5. No recent activity was performed on the cluster, except creating some pools and keys for Cinder/Glance integration.

6. The nodes have enough free resources for the OSDs.
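One more observation that may be relevant: the `pg_epoch` values in the osd.159 log lines are far below the osdmap epoch reported by `ceph osd stat`. A small sketch to measure that gap, assuming the log format shown in this report (the helper names are hypothetical, and the two outputs were not necessarily captured at the same moment, so the result is only approximate):

```python
import re

# "pg_epoch: 236857" as it appears in the PG state lines of the OSD log.
PG_EPOCH = re.compile(r"pg_epoch: (\d+)")


def newest_pg_epoch(lines):
    """Return the highest pg_epoch found in the log lines, or None."""
    epochs = [
        int(match.group(1))
        for line in lines
        for match in PG_EPOCH.finditer(line)
    ]
    return max(epochs) if epochs else None


def epoch_lag(cluster_epoch, lines):
    """Return how many epochs the newest logged pg_epoch trails the cluster."""
    newest = newest_pg_epoch(lines)
    return None if newest is None else cluster_epoch - newest


if __name__ == "__main__":
    log_lines = [
        "2014-05-19 17:09:49.193429 7fcbb8fba700 1 osd.159 pg_epoch: 226656 pg[3.5c(",
        "2014-05-19 17:16:00.141851 7fcbb8fba700 1 osd.159 pg_epoch: 236857 pg[2.40c(",
    ]
    # 357073 is the osdmap epoch from the 'ceph osd stat' output in this report.
    print(epoch_lag(357073, log_lines))
```

With the values in this report, the newest logged epoch (236857) trails the cluster osdmap epoch (357073) by 120216, so the OSD appears to still be working through a large backlog of maps.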

Please suggest a solution to this bug.
