Bug #6429: msg/Pipe.cc: 1029: FAILED assert(m) - Ceph - Ceph

Bug #6429

closed

msg/Pipe.cc: 1029: FAILED assert(m)

Added by Jens-Christian Fischer over 10 years ago. Updated about 10 years ago.

Status:

Can't reproduce

Priority:

High

Assignee:

Sage Weil

Category:

OSD

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

I'm not sure what happened: I had restarted a server with 9 OSDs when, all of a sudden, around 30 of our OSD processes died.

We are on a mixed 0.61.7 / 0.61.8 environment.

Attached is the log of one of the OSDs.

Files

osd-crash.txt (56.5 KB), Jens-Christian Fischer, 09/27/2013 07:00 AM
ceph-osd.6.log.1.gz (376 KB), Andrei Mikhailovsky, 10/01/2013 01:04 PM
ceph-all-200.tar (7.94 MB), Andrei Mikhailovsky, 10/02/2013 01:29 AM
ceph-all-201.tar (12.5 MB), Andrei Mikhailovsky, 10/02/2013 01:29 AM

Related issues 1 (0 open — 1 closed)

Updated by Sage Weil over 10 years ago

Subject changed from suddenly lost 30 of 72 OSDs to msg/Pipe.cc: 1029: FAILED assert(m)
Status changed from New to Need More Info

I assume they restarted okay? I believe this problem has been fixed in dumpling. It is a rare race having to do with reconnects inside the OSD cluster.

Updated by Andrei Mikhailovsky over 10 years ago

File ceph-osd.6.log.1.gz added

Sage, I'm not sure if my logs relate to this bug, but I've been asked on #ceph to add them here. I am on Ubuntu 12.04 with a 3.8 backport kernel, using ceph 0.67.3. After restarting both of my OSD servers I had ceph-osd crashes, which caused hung tasks on VMs running on the ceph cluster. I'm attaching the log file from one of the crashed ceph-osd processes. The time reference is 2013-09-30 19:38 / 2013-09-30 19:39.

About 10 of my 16 OSDs showed this behaviour around the same time (roughly 30 minutes between the first and the last crash).

Updated by Sage Weil over 10 years ago

Hey, do you have the log on the other OSD too? (It's on 192.168.168.201, and should have the string "192.168.168.201:6806/18002851 >>" somewhere in the log.)

Thanks!
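
For anyone needing to locate that log, here is a minimal sketch of one way to search a directory of OSD logs, plain or gzip-compressed, for the quoted string. The helper names `open_log` and `find_matches` and the `ceph-osd.*.log*` file pattern are assumptions made for illustration (the pattern is modelled on the attached ceph-osd.6.log.1.gz); they are not part of Ceph.

```python
import gzip
from pathlib import Path

# The peer address string quoted in the comment above.
NEEDLE = "192.168.168.201:6806/18002851 >>"


def open_log(path):
    """Open a plain or gzip-compressed log file as text (hypothetical helper)."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def find_matches(log_dir, needle=NEEDLE):
    """Yield (file name, line number, line) for each log line containing needle.

    Assumes OSD logs are named like ceph-osd.<id>.log, optionally rotated
    and compressed (for example ceph-osd.6.log.1.gz).
    """
    for path in sorted(Path(log_dir).glob("ceph-osd.*.log*")):
        with open_log(path) as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    yield path.name, lineno, line.rstrip("\n")
```

Calling `list(find_matches("/var/log/ceph"))` on the 192.168.168.201 host would list the matching lines, assuming the logs live in the usual /var/log/ceph directory; adjust the path if they are stored elsewhere.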

Updated by Andrei Mikhailovsky over 10 years ago

File ceph-all-200.tar added
File ceph-all-201.tar added

Sage, here you go: two files with tarred logs from that date. The file with 200 in the name is from the 192.168.169.200 OSD server; the one with 201 in the name is from the 192.168.169.201 OSD server.

On 192.168.169.200 all of the ceph-osd processes crashed, whereas on 192.168.169.201 only osd16 crashed.

Please let me know if you need any more information.
