Bug #7216
closed
ASSERT AuthMonitor::update_from_paxos on 0.72.2
Added by Grigory Gorelov over 10 years ago.
Updated about 10 years ago.
Description
Greetings.
Today I restarted my cluster and all three monitors failed to start. The whole output is attached, but I believe the main thing is:
mon.srv2@-1(probing).paxosservice(auth 251..288) refresh upgraded, format 0 -> 1
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' ...
mon/AuthMonitor.cc: 153: FAILED assert(ret == 0)
I've found that this problem has happened before, in different clusters with different users, but it was solved by patching the mon. I cannot do this with the 0.72.2 version.
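For context, here is a standalone sketch of the failing pattern (a hypothetical illustration, not the actual Ceph source): on startup, update_from_paxos() replays the auth states committed through Paxos, reading every version between the last one applied and the last one committed back from the monitor's store, and asserts that each read succeeds. If one version is missing, perhaps lost around the format 0 -> 1 conversion shown in the log, the replay dies exactly like the backtrace above.

    #include <cassert>
    #include <cstdio>
    #include <map>
    #include <string>

    using version_t = unsigned long long;

    // Stand-in for the monitor's on-disk key/value store, keyed by
    // version. Suppose the incremental for 252 never made it to disk.
    std::map<version_t, std::string> store = {
        {251, "auth incremental 251"},
        {253, "auth incremental 253"},
        {288, "auth incremental 288"},
    };

    int get_version(version_t v, std::string *out) {
        auto it = store.find(v);
        if (it == store.end())
            return -2;  // Ceph returns -ENOENT in this case
        *out = it->second;
        return 0;
    }

    int main() {
        version_t keys_ver = 251;        // last auth state already applied
        version_t last_committed = 288;  // per "paxosservice(auth 251..288)"
        while (keys_ver < last_committed) {
            std::string bl;
            int ret = get_version(keys_ver + 1, &bl);
            // Presumably the "patching the mon" mentioned above relaxed
            // this check; in 0.72.2 it is a hard assert (AuthMonitor.cc:153).
            assert(ret == 0);
            std::printf("applied auth incremental %llu\n", keys_ver + 1);
            ++keys_ver;
        }
        return 0;
    }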
All my data is in a coma now, and I very much hope for help.
The output, monitor data, and ceph.conf are attached below.
Thank you.
My respects for the best distributed fs at this time =)
Also, I should add that this was a clean install of 0.72.2. No upgrades, no migrations.
- Assignee set to Joao Eduardo Luis
- Priority changed from Normal to Urgent
- Project changed from devops to Ceph
- Category set to Monitor
- Status changed from New to Need More Info
Is there a full log for this monitor, as well as for the other 2 monitors?
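If not, adding something along these lines to ceph.conf should capture a verbose log on the next start (a sketch; the exact log path is just an example):

    [mon]
        debug mon = 20
        debug paxos = 20
        debug auth = 20
        log file = /var/log/ceph/$name.log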
I'm sorry to say there isn't. There is nothing related to ceph in /var/log.
Unfortunately I've been unable to reproduce this locally.
Can you provide a list of the steps you took in order to trigger this? And ceph versions you might have upgraded from and to? You mentioned it was a clean 0.72.2 install, so I'm assuming you didn't have a cluster prior to that. Can you please confirm this?
My steps are:
1. Installed ceph 0.72.2 on three servers.
2. Created some RBD images.
3. Ran qemu-kvm on them.
4. Rebooted one of the servers, and its monitor didn't start.
5. The same happened with the other two monitors.
If you cannot reproduce this, please tell me your configuration.
./configure flags, kernel, and versions, if possible.
I'll try to build the same environment and run the monitor in it.
Thank you.
I've opened ssh for you:
<redacted>
Once you're logged in, you can ssh to these three servers:
ssh root@10.0.0.1, pass "1" (server 1's mon data was destroyed during my experiments)
ssh root@10.0.0.2, pass "1"
ssh root@10.0.0.3, pass "1"
I've reproduced the bug on a clean server:
1. Downloaded ceph-0.72.2.tar.gz and unpacked it.
2. Installed snappy-1.1.0.
3. Installed libedit-20130712.3.1.
4. Ran ./configure (with no extra flags).
5. Ran make -j4.
6. Copied mon.srv3 to /home/ceph_mon.
7. Assigned 10.0.0.3/24 to eth0.
8. Copied ceph.conf to /etc/ceph/ceph.conf.
9. Ran ceph-mon -i srv3 -d.
And the assert occurred.
Are you reusing a previous store, from a previously problematic cluster?
No; "clean server" here means there was nothing on it except a Gentoo stage3 installation.
I'm sorry to say all my data is now considered lost. I like the Ceph architecture very much but cannot use it due to bugs. I will wait a few years for it to reach stability.
Thank you for your work; I hope Ceph will become the de facto standard in the distributed storage area.
- Status changed from Need More Info to New
- Status changed from New to Can't reproduce