Bug #6429 (closed): msg/Pipe.cc: 1029: FAILED assert(m)

Added by Jens-Christian Fischer over 10 years ago. Updated about 10 years ago.

Status: Can't reproduce
Priority: High
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Not sure what happened: I had restarted a server with 9 OSDs when, all of a sudden, around 30 of our OSD processes died.

We are on a mixed 0.61.7 / 0.61.8 environment.

Attached is the log of one of the OSDs.


Files

osd-crash.txt (56.5 KB) Jens-Christian Fischer, 09/27/2013 07:00 AM
ceph-osd.6.log.1.gz (376 KB) ceph-osd.6.log.1.gz Andrei Mikhailovsky, 10/01/2013 01:04 PM
ceph-all-200.tar (7.94 MB) ceph-all-200.tar Andrei Mikhailovsky, 10/02/2013 01:29 AM
ceph-all-201.tar (12.5 MB) ceph-all-201.tar Andrei Mikhailovsky, 10/02/2013 01:29 AM

Related issues 1 (0 open, 1 closed)

Has duplicate Ceph - Bug #6476: lost around 30 OSDs (of 75) at once (Duplicate, 10/04/2013)

Actions #1

Updated by Sage Weil over 10 years ago

  • Subject changed from suddenly lost 30 of 72 OSDs to msg/Pipe.cc: 1029: FAILED assert(m)
  • Status changed from New to Need More Info

I assume they restarted okay? I believe this problem has been fixed in dumpling. It is a rare race having to do with reconnects inside the OSD cluster.
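For context, here is a minimal illustrative sketch of the kind of failure pattern such a reconnect race can produce. This is not the actual Ceph msg/Pipe.cc code; the FakePipe type, its members, and the queue/locking model are invented for this example, and it only shows how an ack handler that expects a queued message can find a null pointer and trip an assert(m)-style check when a reconnect tears the queue down first:

    // Illustrative sketch only -- not the actual Ceph msg/Pipe.cc code.
    // FakePipe and its members are invented to show the failure pattern:
    // an ack handler expects a queued message for every ack, but a
    // concurrent reconnect path clears the queue first, so assert(m) trips.
    #include <cassert>
    #include <mutex>
    #include <queue>

    struct Message {};

    struct FakePipe {
      std::mutex lock;
      std::queue<Message*> sent;  // messages sent but not yet acknowledged

      void queue_sent(Message* m) {
        std::lock_guard<std::mutex> g(lock);
        sent.push(m);
      }

      // Ack from the peer: the invariant is "there is a queued message to ack".
      void handle_ack() {
        std::lock_guard<std::mutex> g(lock);
        Message* m = sent.empty() ? nullptr : sent.front();
        assert(m);  // analogous to the FAILED assert(m) reported here
        sent.pop();
        delete m;
      }

      // Reconnect path: drops the unacked queue. If this runs between the peer
      // sending an ack and handle_ack() processing it, the invariant above no
      // longer holds.
      void drop_unacked_on_reconnect() {
        std::lock_guard<std::mutex> g(lock);
        while (!sent.empty()) {
          delete sent.front();
          sent.pop();
        }
      }
    };

    int main() {
      FakePipe p;
      p.queue_sent(new Message);
      p.drop_unacked_on_reconnect();  // reconnect wins the race
      p.handle_ack();                 // assert fires, mirroring the crash
    }

In the real messenger the ordering between the reconnect and the ack handling is timing-dependent, which would fit a rare, hard-to-reproduce crash like this one.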

Actions #2

Updated by Andrei Mikhailovsky over 10 years ago

Sage, not sure if my logs relate to this bug, but I've been asked on #ceph to add them to this bug. I am on Ubuntu 12.04 with the 3.8 backport kernel, using Ceph 0.67.3. After restarting both of my OSD servers I had ceph-osd crashes which caused hung tasks on VMs running on the Ceph cluster. Attaching the log file from one of the crashed ceph-osd processes. The time reference is 2013-09-30 19:38 / 2013-09-30 19:39.

I had about 10 OSDs out of 16 showing this behaviour around the same time (about a 30-minute difference between the first and the last crash).

Actions #3

Updated by Sage Weil over 10 years ago

Hey, do you have the log on the other OSD too? (It's on 192.168.168.201, and should have the string "192.168.168.201:6806/18002851 >>" somewhere in the log.)

Thanks!

Actions #4

Updated by Andrei Mikhailovsky over 10 years ago

Sage, here you go. Two files with tarred logs from that date. The file with 200 in the name is from the 192.168.169.200 OSD server; the one with 201 in the name is from the 192.168.169.201 OSD server.

The OSD server 192.168.169.200 had all of its ceph-osd processes crash, whereas 192.168.169.201 only had osd16 crash.

Please let me know if you need any more information.

Actions #5

Updated by Sage Weil over 10 years ago

  • Assignee set to Sage Weil
Actions #6

Updated by Sage Weil about 10 years ago

  • Status changed from Need More Info to Can't reproduce