Bug #6429
closed
msg/Pipe.cc: 1029: FAILED assert(m)
0%
Description
Not sure what happened. I had restarted a server with 9 OSDs when, all of a sudden, around 30 of our OSD processes died.
We are running a mixed 0.61.7 / 0.61.8 environment.
Attached is the log of one of the OSDs.
Files
Updated by Sage Weil over 10 years ago
- Subject changed from suddenly lost 30 of 72 OSDs to msg/Pipe.cc: 1029: FAILED assert(m)
- Status changed from New to Need More Info
I assume they restarted okay? I believe this problem has been fixed in dumpling. It is a rare race having to do with reconnects inside the OSD cluster.
Updated by Andrei Mikhailovsky over 10 years ago
- File ceph-osd.6.log.1.gz added
Sage, not sure if my logs relate to this bug, but I've been asked on #ceph to add them here. I am on Ubuntu 12.04 with the 3.8 backport kernel, using ceph 0.67.3. After restarting both of my OSD servers I had ceph-osd crashes, which caused hung tasks on VMs running on the ceph cluster. Attaching the log file from one of the crashed ceph-osd processes. The time reference is 2013-09-30 19:38 / 2013-09-30 19:39.
I had about 10 OSDs out of 16 showing this behaviour around the same time (about 30 minutes between the first and the last crash).
Updated by Sage Weil over 10 years ago
Hey, do you have the log on the other OSD too? (It's on 192.168.168.201, and should have the string "192.168.168.201:6806/18002851 >>" somewhere in the log.)
Thanks!
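For anyone hunting for that string across rotated logs, here is a minimal sketch of one way to do it. It scans plain or gzip-compressed log files for a literal substring. The helper name `find_lines` and the command-line layout are assumptions made for illustration; nothing here is part of Ceph itself.

```python
import gzip
import sys
from pathlib import Path


def find_lines(log_path, needle):
    """Yield (line_number, line) for each line in log_path that contains needle."""
    path = Path(log_path)
    # Rotated logs are often gzip-compressed; choose the opener by file suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    # errors="replace" keeps the scan going if a log contains stray bytes.
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if needle in line:
                yield number, line.rstrip("\n")


if __name__ == "__main__":
    # The search string is the one quoted in the comment above.
    needle = "192.168.168.201:6806/18002851 >>"
    for log in sys.argv[1:]:
        for number, line in find_lines(log, needle):
            print(f"{log}:{number}: {line}")
```

Saved under a hypothetical name such as `find_peer.py`, it would be run with the log files as arguments, for example `python find_peer.py ceph-osd.6.log.1.gz`.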
Updated by Andrei Mikhailovsky over 10 years ago
- File ceph-all-200.tar added
- File ceph-all-201.tar added
Sage, here you go. Two files with tarred logs from that date. The file with 200 in the name is from the 192.168.169.200 OSD server; the one with 201 in the name is from the 192.168.169.201 OSD server.
All of the ceph-osd processes on 192.168.169.200 crashed, whereas on 192.168.169.201 only osd16 crashed.
Please let me know if you need any more information.
Updated by Sage Weil about 10 years ago
- Status changed from Need More Info to Can't reproduce