Bug #4006

osd: repeating 'wrong node' message in log

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two users now (paravoid and xiaoxi in #ceph) have reported seeing repeated "... - wrong node!" messages in the OSD logs. This is likely caused by a msgr peer being incorrectly cleaned up as we advance through OSDMaps.

Reported on both the bobtail and master branches just prior to the wip-osd-msgr merge, which hopefully fixes it... need to confirm.
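For illustration only, here is a minimal C++ sketch of the kind of check that produces the "wrong node!" line. This is not the actual Ceph messenger code; EntityAddr and check_peer are hypothetical stand-ins. The connecting side dials the address/nonce it has recorded for a peer, the remote end reports the address it actually has, and a mismatch means the recorded entry is stale (e.g. the peer restarted with a new nonce, or the entry was not cleaned up while advancing through OSDMaps), so the connection is dropped and retried and the message repeats.

#include <iostream>
#include <string>

// Hypothetical stand-in for Ceph's entity_addr_t: an ip:port plus the
// per-process nonce (the number after the '/', e.g. 6801/14617 in the logs).
struct EntityAddr {
  std::string ip_port;
  long nonce;
  bool operator==(const EntityAddr& o) const {
    return ip_port == o.ip_port && nonce == o.nonce;
  }
};

std::ostream& operator<<(std::ostream& os, const EntityAddr& a) {
  return os << a.ip_port << "/" << a.nonce;
}

// peer_addr: the (possibly stale) address we looked up for the peer and dialed.
// claimed_addr: the address the remote end reports when the connection is made.
bool check_peer(const EntityAddr& peer_addr, const EntityAddr& claimed_addr) {
  if (!(claimed_addr == peer_addr)) {
    std::cout << "connect claims to be " << claimed_addr
              << " not " << peer_addr << " - wrong node!\n";
    return false;  // connection dropped; a retry hits the same stale entry
  }
  return true;
}

int main() {
  EntityAddr dialed{"10.214.136.12:6801", 14617};   // stale mapping we kept
  EntityAddr claimed{"10.214.136.12:6801", 22425};  // peer restarted: new nonce
  check_peer(dialed, claimed);
  return 0;
}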


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #4246: osd/msgr: rebound to an addr a peer osd already had open (Resolved, Sage Weil, 02/22/2013)

Actions #1

Updated by Ken Franklin about 11 years ago

I am also seeing this message in the radosgw.log file for 0.56-623. The error appears when restarting rgw and again during my s3 tests. The radosgw process died after the second occurrence. I restarted and have not seen it when I re-ran the tests.

The test system is burnupi59

radosgw.log snippet after restart...

2013-02-08 08:25:10.523532 7fa070d63780 0 ceph version 0.56-623-g27fb0e6 (27fb0e63053a581b67a79718876e89fea0026d7a), process radosgw, pid 22692
2013-02-08 08:25:10.537994 7fa05bfff700 2 garbage collection: start
2013-02-08 08:25:13.540366 7fa0607f8700 0 -- 10.214.136.12:0/1022711 >> 10.214.136.12:6801/14617 pipe(0x130a8e0 sd=11 :53701 s=1 pgs=0 cs=0 l=1).connect claims to be 0.0.0.0:6801/22425 not 10.214.136.12:6801/14617 - wrong node!
2013-02-08 08:25:13.540438 7fa0607f8700 0 -- 10.214.136.12:0/1022711 >> 10.214.136.12:6801/14617 pipe(0x130a8e0 sd=11 :53701 s=1 pgs=0 cs=0 l=1).fault
2013-02-08 08:25:13.540619 7fa0607f8700 0 -- 10.214.136.12:0/1022711 >> 10.214.136.12:6801/14617 pipe(0x130a8e0 sd=11 :53703 s=1 pgs=0 cs=0 l=1).connect claims to be 0.0.0.0:6801/22425 not 10.214.136.12:6801/14617 - wrong node!

During the test run...

2013-02-08 14:36:47.925090 7fd780546700 0 -- 10.214.136.12:0/1026412 >> 10.214.136.12:6801/22425 pipe(0x1e8ba40 sd=10 :54047 s=1 pgs=0 cs=0 l=1).connect claims to be 10.214.136.12:6801/26144 not 10.214.136.12:6801/22425 - wrong node!
2013-02-08 14:36:52.450874 7fd780546700 0 -- 10.214.136.12:0/1026412 >> 10.214.136.12:6801/22425 pipe(0x1e8ba40 sd=10 :54051 s=1 pgs=0 cs=0 l=1).connect claims to be 10.214.136.12:6801/26144 not 10.214.136.12:6801/22425 - wrong node!
2013-02-08 14:36:52.451766 7fd780445700 0 -- 10.214.136.12:0/1026412 >> 10.214.136.12:6804/22665 pipe(0x7fd778005840 sd=11 :0 s=1 pgs=0 cs=0 l=1).fault
2013-02-08 14:37:07.437996 7fd783e4e700 -1 Initialization timeout, failed to initialize

2013-02-08 16:19:10.730797 7f3b089f3780 0 ceph version 0.56-623-g27fb0e6 (27fb0e63053a581b67a79718876e89fea0026d7a), process radosgw, pid 37107
2013-02-08 16:19:10.745500 7f3af37fe700 2 garbage collection: start
2013-02-08 16:19:10.869732 7f3b089f3780 10 allocated request req=0x2664ba0

Actions #2

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Resolved

This is believed to be fixed with the merge of wip-osd-msgr (and wip-bobtail-osd-msgr backport). If anybody sees this with v0.57+ or v0.56.3+, reopen!

Actions #3

Updated by Bill Reid about 11 years ago

I don't have the radosgw logs, but I had a widespread occurrence of this same issue running ceph version 0.57 (9a7a9d06c0623ccc116a1d3b71c765c20a17e98e).

ceph-osd.74.log:2013-03-04 09:45:41.473058 7f08edb34700 0 -- 10.9.2.134:6823/6232 >> 10.9.2.132:6813/4014 pipe(0x14058f00 sd=69 :36497 s=1 pgs=837 cs=4 l=0).connect claims to be 10.9.2.132:6813/4704 not 10.9.2.132:6813/4014 - wrong node!
ceph-osd.74.log:2013-03-04 09:45:54.682499 7f09093eb700 0 -- 10.9.2.134:6823/6232 >> 10.9.2.133:6801/29645 pipe(0x272d5b80 sd=54 :59450 s=1 pgs=1154 cs=4 l=0).connect claims to be 10.9.2.133:6801/29403 not 10.9.2.133:6801/29645 -

Multiple OSDs across multiple nodes. Restarting the entire cluster seemed to fix the problem.

/var/log/ceph$ zgrep -l "wrong node" *
ceph-osd.74.log
ceph-osd.75.log
ceph-osd.77.log
ceph-osd.78.log
ceph-osd.79.log
ceph-osd.80.log
ceph-osd.81.log
ceph-osd.83.log
ceph-osd.84.log
ceph-osd.86.log
ceph-osd.87.log
ceph-osd.89.log
ceph-osd.90.log
ceph-osd.91.log
ceph-osd.93.log
ceph-osd.94.log
ceph-osd.95.log

I have all the logs available if needed.

Actions #4

Updated by Dan Mick about 11 years ago

Logs from Bill are in cephdrop as 'wer-logs-4006-4Mar13.tgz'

Actions #5

Updated by Ian Colle about 11 years ago

  • Status changed from Resolved to 12

Reopened.

Actions #6

Updated by Dan Mick about 11 years ago

IRC user Kioob reports similar issues tonight. He is still seeing these errors four hours after the last relevant OSD event. Something must not be cleaning up connection info somewhere.

Actions #7

Updated by Sage Weil about 11 years ago

I am hopeful that 881e9d850c6762290f8be24da9e74b9dc112f1c9 resolves this. It's in next branch, but I was going to wait a few days before putting it in bobtail.

Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Resolved

I think this was #4246 (and/or the previous OSD msgr refactoring).
