Bug #6429
closed
msg/Pipe.cc: 1029: FAILED assert(m)
Added by Jens-Christian Fischer over 10 years ago.
Updated about 10 years ago.
Description
Not sure what happened: I had restarted a server with 9 OSDs when all of a sudden around 30 of our OSD processes died.
We are on a mixed 0.61.7 / 0.61.8 environment.
Attached is the log of one of the OSDs.
Files
- Subject changed from suddenly lost 30 of 72 OSDs to msg/Pipe.cc: 1029: FAILED assert(m)
- Status changed from New to Need More Info
I assume they restarted okay? I believe this problem has been fixed in dumpling. It is a rare race having to do with reconnects inside the osd cluster.
Sage, not sure if my logs relate to this bug, but I've been asked on #ceph to add it to this bug. I am on ubuntu 12.04 with a 3.8 backport kernel using ceph 0.67.3. After restarting both of my osd servers I had ceph-osd crashes which caused hung tasks on VMs running on the ceph cluster. Attaching the log file from one of the crashed ceph-osd processes. The time reference is 2013-09-30 19:38 / 2013-09-30 19:39.
I had about 10 osds out of 16 showing this behaviour around the same time (about 30 minutes between the first and the last crash).
Hey, do you have the log on the other OSD too? (It's on 192.168.168.201, and should have the string "192.168.168.201:6806/18002851 >>" somewhere in the log.)
Thanks!
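For anyone else trying to locate the matching peer log, a minimal sketch of how one might grep the OSD logs for that peer-address string. The log path and filenames below are hypothetical (real Ceph logs usually live under /var/log/ceph/); a sample line is created first purely for demonstration:

```shell
# Hypothetical log directory and sample line, for demonstration only.
mkdir -p /tmp/ceph-logs
echo '2013-09-30 19:38:12 ... 192.168.168.201:6806/18002851 >> 192.168.169.200:6800/1234 pipe(...).reader' \
  > /tmp/ceph-logs/ceph-osd.16.log

# Search every OSD log for the peer string; -H prints the matching
# filename so you know which osd's log contains it.
grep -H '192.168.168.201:6806/18002851 >>' /tmp/ceph-logs/ceph-osd.*.log
```

For rotated, compressed logs, `zgrep` with the same pattern works on the `.gz` files.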
Sage, here you go. Two files with tarred logs from that date. The file with 200 in the name is from the 192.168.169.200 osd server; the one with 201 in the name is from the 192.168.169.201 osd server.
The osd server 192.168.169.200 had all of its ceph-osd processes crash, whereas 192.168.169.201 only had osd16 crash.
Please let me know if you need any more information.
- Assignee set to Sage Weil
- Status changed from Need More Info to Can't reproduce