Project

General

Profile

Bug #4851

leveldb: hang on leveldb::DBImpl::MakeRoomForWrite(bool)

Added by Wido den Hollander over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Monitor
Target version:
Start date:
04/29/2013
Due date:
% Done:

0%

Spent time:
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

While testing with the next branch (50e58b9f49382d690f5a22af80f6981f1c12d4c3) I stumbled upon the problem that creating for example a auth key took a long time:

ceph auth get-or-create XXXX

That took 5 minutes, but I wasn't able to record it with logs, so later on I tried to remove the same key which took 4 minutes:

ceph auth del client.openstack

In this case I have 3 monitors where this command went to "mon2":

2013-04-29 14:22:15.299854 7fd523711700  1 -- 10.23.24.9:6789/0 <== client.? 10.23.24.4:0/29190 5 ==== mon_command(auth del client.openstack v 0) v1 ==== 73+0+0 (2842674342 0 0) 0xc2af000 con 0xad8a580
2013-04-29 14:22:15.299991 7fd523711700  0 mon.mon2@1(peon) e1 handle_command mon_command(auth del client.openstack v 0) v1
2013-04-29 14:22:15.376928 7fd523711700 10 mon.mon2@1(peon).auth v764 preprocess_query mon_command(auth del client.openstack v 0) v1 from client.? 10.23.24.4:0/29190
2013-04-29 14:22:15.396140 7fd523711700 10 mon.mon2@1(peon) e1 forward_request 828 request mon_command(auth del client.openstack v 0) v1
2013-04-29 14:22:15.396161 7fd523711700  1 -- 10.23.24.9:6789/0 --> mon.0 10.23.24.8:6789/0 -- forward(mon_command(auth del client.openstack v 0) v1) to leader v1 -- ?+0 0xc2ab300
2013-04-29 14:26:09.397449 7fd523f12700 10 mon.mon2@1(leader) e1  requeue for self tid 828 mon_command(auth del client.openstack v 0) v1
2013-04-29 14:26:09.432538 7fd523f12700  0 mon.mon2@1(leader) e1 handle_command mon_command(auth del client.openstack v 0) v1
2013-04-29 14:26:12.245758 7fd523711700 10 mon.mon2@1(leader).auth v764 preprocess_query mon_command(auth del client.openstack v 0) v1 from client.? 10.23.24.4:0/29190
2013-04-29 14:26:12.261239 7fd523711700 10 mon.mon2@1(leader).auth v764 prepare_update mon_command(auth del client.openstack v 0) v1 from client.? 10.23.24.4:0/29190

Here you see that "mon2" forwarded the request to mon1 (10.23.24.8).

Looking in mon1's logs I see:

2013-04-29 14:27:19.907293 7fdf507ac700  1 -- 10.23.24.8:6789/0 <== mon.1 10.23.24.9:6789/0 5337 ==== forward(mon_command(auth del client.openstack v 0) v1) to leader v1 ==== 343+0+0 (3205198275 0 0) 0x116d9000 con 0x2e409a0
2013-04-29 14:27:19.907431 7fdf507ac700  0 mon.mon1@0(leader) e1 handle_command mon_command(auth del client.openstack v 0) v1

Looking at the timestamps this is very weird. How can mon1 report that this happened at 14:27 while mon2 reports it happend at 14:22 ~ 14:26?

I verified, the clocks on all 3 monitors are synced, there is no difference between them.

The logs from both mon1 and mon2 are attached.

mon1.log.gz (911 KB) Wido den Hollander, 04/29/2013 06:01 AM

mon2.log.gz (840 KB) Wido den Hollander, 04/29/2013 06:01 AM

ceph-mon.londensekaai.log.gz - Single monitor setup logs (540 KB) Wido den Hollander, 05/01/2013 01:30 PM

bt.txt View (21.7 KB) Wido den Hollander, 05/01/2013 02:18 PM

bt-mon2.txt View - Backtrace mon2 02-05-2013 (24.3 KB) Wido den Hollander, 05/02/2013 10:56 AM

bt-mon3.txt View - Backtrace mon3 02-05-2013 (15.8 KB) Wido den Hollander, 05/02/2013 10:56 AM

bt-mon1.txt View - Backtrace mon1 03-05-2013 (8.52 KB) Wido den Hollander, 05/03/2013 02:25 AM


Related issues

Related to Ceph - Bug #4896: leveldb: stuck on leveldb::DBImpl::CompactRange() Resolved 05/02/2013

History

#1 Updated by Greg Farnum over 6 years ago

  • Priority changed from Normal to High

This might be Urgent, but somebody needs to evaluate it.

#2 Updated by Ian Colle over 6 years ago

  • Priority changed from High to Urgent

#3 Updated by Sage Weil over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

#4 Updated by Sage Weil over 6 years ago

The time skew is just because mon1 was way behind.. the message it received is in sequence with the other sent by mon2 at 14:22:15. and it appears to be discarding them correctly, as per the election epoch check in PaxosService::dispatch() (debug paxos = 10 would have shown it).

Not sure why mon1 was so slow/behind, though.. was that machine heavily loaded? According to the mon2 log, the command was eventually processed with mon2 became leader.

#5 Updated by Sage Weil over 6 years ago

  • Priority changed from Urgent to High

#6 Updated by Sage Weil over 6 years ago

  • Status changed from In Progress to Need More Info

#7 Updated by Wido den Hollander over 6 years ago

No, I checked the load already with Joao yesterday, but they weren't loaded at all.

The load was about 0.5 at the time. (on all 3 monitors)

#8 Updated by Wido den Hollander over 6 years ago

I'm now also seeing this on a single-monitor cluster running the next branch (fe68afe9d10bc5d49a05a8bafa644d57783447cf)

I set debug mon to 10 and I'm seeing this:

2013-05-01 22:02:59.729488 7fac88328700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x40b0c80 sd=52 :6789 s=0 pgs=0 cs=0 l=0).accept sd=52 [2a00:f10:113:0:8c83:d0ce:9d38:513f]:57595/0
2013-05-01 22:02:59.730086 7fac88328700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113:0:3dc5:47e0:3e88:a9f]:6800/10651 osd protocol 0
2013-05-01 22:03:01.330416 7fac8862b700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x40b0000 sd=58 :6789 s=0 pgs=0 cs=0 l=0).accept sd=58 [2a00:f10:113::1]:34885/0
2013-05-01 22:03:01.330936 7fac8862b700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113::1]:0/31782 client protocol 0
2013-05-01 22:03:02.552704 7fac8973c700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x2d58780 sd=6 :6789 s=0 pgs=0 cs=0 l=0).accept sd=6 [2a00:f10:113:0:1823:b27d:ba75:ba36]:34866/0
2013-05-01 22:03:02.554701 7fac8973c700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113:0:8c31:9f28:1aee:2661]:6806/22318 osd protocol 0
2013-05-01 22:03:02.731089 7fac88328700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x2d58c80 sd=52 :6789 s=0 pgs=0 cs=0 l=0).accept sd=52 [2a00:f10:113:0:8c83:d0ce:9d38:513f]:57599/0
2013-05-01 22:03:02.731674 7fac88328700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113:0:3dc5:47e0:3e88:a9f]:6800/10651 osd protocol 0
2013-05-01 22:03:05.732590 7fac88328700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x2d58a00 sd=52 :6789 s=0 pgs=0 cs=0 l=0).accept sd=52 [2a00:f10:113:0:8c83:d0ce:9d38:513f]:57600/0
2013-05-01 22:03:05.733300 7fac88328700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113:0:3dc5:47e0:3e88:a9f]:6800/10651 osd protocol 0
2013-05-01 22:03:08.734375 7fac88328700  1 -- [2a00:f10:113::1]:6789/0 >> :/0 pipe(0x2d58280 sd=52 :6789 s=0 pgs=0 cs=0 l=0).accept sd=52 [2a00:f10:113:0:8c83:d0ce:9d38:513f]:57602/0
2013-05-01 22:03:08.734967 7fac88328700 10 mon.londensekaai@0(leader) e1 ms_verify_authorizer [2a00:f10:113:0:3dc5:47e0:3e88:a9f]:6800/10651 osd protocol 0

This keeps going on and on.

Simple commands like 'ceph -s' hang for ever without anything in the logs.

I'll try some more debugging, but for now the mon isn't responding to anything.

#9 Updated by Sage Weil over 6 years ago

full mon log starting from a ceph-mon restart would be ideal. thanks!

#10 Updated by Wido den Hollander over 6 years ago

Sage Weil wrote:

full mon log starting from a ceph-mon restart would be ideal. thanks!

I added logs with this settings:
debug mon = 20
debug paxos = 10
debug ms = 1

I tried a couple of commands, but they are all hanging, even communicating with the mon via the admin socket hangs.

In this case it is a single mon setup, but also running the 'next' branch from a couple of hours ago.

#11 Updated by Sage Weil over 6 years ago

can you attach with gdb and do 'thread apply all bt'?

#12 Updated by Wido den Hollander over 6 years ago

Sage Weil wrote:

can you attach with gdb and do 'thread apply all bt'?

Done! I've added the output.

#13 Updated by Sage Weil over 6 years ago

it looks like leveldb is doing a compaction. is the processing using disk? how big is store.db?

#14 Updated by Wido den Hollander over 6 years ago

Sage Weil wrote:

it looks like leveldb is doing a compaction. is the processing using disk? how big is store.db?

The disk isn't doing anything. No I/O at all.

The store.db directory is 349MB in size.

#15 Updated by Sage Weil over 6 years ago

can you try wip-paranoid with --mon-leveldb-paranoid when starting ceph-mon, and see what happens? that enables leveldb's internal fsck-like checks

#16 Updated by Wido den Hollander over 6 years ago

Sage Weil wrote:

can you try wip-paranoid with --mon-leveldb-paranoid when starting ceph-mon, and see what happens? that enables leveldb's internal fsck-like checks

I tried that on the in this case single monitor setup and that one is working fine again.

I have another 3 monitor setup which is a different case, but I have to check them to see what's going on.

The initial report came from that 3 monitor setup btw.

#17 Updated by Sage Weil over 6 years ago

  • Priority changed from High to Urgent

#18 Updated by Wido den Hollander over 6 years ago

So, back to the three monitors which aren't working yet.

What I noticed is this:

2013-05-02 19:48:55.827094 7ff3a02e2700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.55:6800/925 osd protocol 0
2013-05-02 19:48:56.009319 7ff39d6b7700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.55:6809/1840 osd protocol 0
2013-05-02 19:48:56.535218 7ff39e1c1700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.56:6803/20280 osd protocol 0
2013-05-02 19:48:57.042281 7ff3a675f700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.59:6812/5571 osd protocol 0
2013-05-02 19:48:57.072107 7ff3a05e5700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.59:6803/4985 osd protocol 0
2013-05-02 19:48:57.170968 7ff39fbdb700 10 mon.mon1@0(leader) e1 ms_verify_authorizer 10.23.24.53:6806/7829 osd protocol 0

So mon1 thinks he is the leader (debug mon = 10)

Checking the log of mon2 shows me:

2013-05-02 19:49:38.066233 7fe489e2f700  1 mon.mon2@1(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-02 19:49:38.066274 7fe489e2f700  1 mon.mon2@1(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-02 19:49:38.066317 7fe489e2f700  1 mon.mon2@1(electing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere
2013-05-02 19:49:38.066369 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1
2013-05-02 19:49:38.066389 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066407 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1
2013-05-02 19:49:38.066426 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066444 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1
2013-05-02 19:49:38.066461 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066479 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066496 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1
2013-05-02 19:49:38.066514 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066532 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.066550 7fe489e2f700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1
2013-05-02 19:49:38.073925 7fe489e2f700  5 mon.mon2@1(electing).elector(5871) election timer expired
2013-05-02 19:49:38.073953 7fe489e2f700  5 mon.mon2@1(electing).elector(5871) start -- can i be leader?
2013-05-02 19:49:38.077971 7fe489e2f700  1 mon.mon2@1(electing).elector(5871) init, last seen epoch 5871
2013-05-02 19:49:38.354572 7fe480ab9700 10 mon.mon2@1(electing) e1 ms_verify_authorizer 10.23.24.52:6803/28014 osd protocol 0
2013-05-02 19:49:38.355098 7fe48962e700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1
2013-05-02 19:49:38.573330 7fe4800af700 10 mon.mon2@1(electing) e1 ms_verify_authorizer 10.23.24.54:6806/25159 osd protocol 0
2013-05-02 19:49:38.573920 7fe48962e700  5 mon.mon2@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1

So mon2 calls a election and then it doesn't continue?

mon3 doesn't respond to anything, it didn't log anything in the last 3 hours and isn't responding on the admin socket at all.

A processlist even shows the mon never started correctly:

root     23703  0.5  3.9 316152 158580 ?       Ssl  10:58   3:05 /usr/bin/ceph-mon -i mon3 --pid-file /var/run/ceph/mon.mon3.pid -c /etc/ceph/ceph.conf
root     23725  0.0  0.1  46552  7508 ?        S    10:58   0:00 /usr/bin/python /usr/sbin/ceph-create-keys -i mon3
root     24228  0.0  0.1  59020  5384 ?        Sl   10:58   0:00 ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.mon3.asok mon_status

The backtrace of mon3 shows it is hanging on writing to the leveldb and it's also trying to accept the incoming admin socket request.

Another leveldb issue?

The level db sizes:
  • mon1: 536MB
  • mon2: 960MB
  • mon3: 2.1GB

So the size of mon3 is kind of odd.

The Ceph version on all monitors:

ceph version 0.60-773-gfe68afe (fe68afe9d10bc5d49a05a8bafa644d57783447cf)

I attached a backtrace of the thread states from both mon2 and mon3.

#19 Updated by Sage Weil over 6 years ago

  • Subject changed from mon: Commands are hanging for a long time to leveldb: hang on leveldb::DBImpl::MakeRoomForWrite(bool)

so far, --mon-leveldb-paranoid seems to work around this.

#20 Updated by Sage Weil over 6 years ago

  • Priority changed from High to Urgent

#21 Updated by Sage Weil over 6 years ago

procedure for generating a leveldb transaction log that we can (hopefully) use to reproduce this:

- install wip_mon_dump_ops
- stop mon
- make copy of mon data dir
- start mon with 'mon debug dump transactions = true' and 'mon leveldb paranoid = false'
- wait for the hang
- archive up the original mon data dir, the final mon data dir, and the tdata log file in /var/log/ceph (it may be big)

thanks!

#22 Updated by Sage Weil over 6 years ago

leveldb log with successful quorum: http://pastebin.com/raw.php?i=SQzePEn2

#23 Updated by Sage Weil over 6 years ago

(05:45:49 PM) mikedawson: sagewk: working -> http://pastebin.com/raw.php?i=SQzePEn2 hung -> http://pastebin.com/raw.php?i=cjzDTFS9
(05:46:26 PM) mikedawson: difference was starting with starting with 'mon compact on start = true' made it hang

#24 Updated by Sage Weil over 6 years ago

http://pastebin.com/raw.php?i=mmVeZ4ik

notice that hte thread doing all the work (the background thread) is the logger thread according to the gdb dump. i think we are losing it when the fork() happens.

#25 Updated by Sage Weil over 6 years ago

  • Status changed from Need More Info to Need Review

wip-leveldb-reopen

needs testing and review

#26 Updated by Wido den Hollander over 6 years ago

Sage Weil wrote:

wip-leveldb-reopen

needs testing and review

I build this morning with that branch and it all seems to work now.

ceph version 0.60-773-gfe68afe (fe68afe9d10bc5d49a05a8bafa644d57783447cf)

mon1 became the leader again and mon2 and mon3 are both in peon state.

I upgraded them by still tarring up their data dir

dpkg -i /root/*.deb && service ceph stop && tar czf /root/mon-data.tar.gz /var/lib/ceph/mon && service ceph start

I didn't try 'wip_mon_dump_ops' since it seems that most fixes which went in there also went into 'wip-leveldb-reopen' after a quick review.

Mon settings:

[mon]
    # Issue #4815
    mon leveldb paranoid = false
    mon debug dump transactions = true

These are the default settings in config_opts.h btw, but I configured them anyway.

#27 Updated by Wido den Hollander over 6 years ago

Ok, my conclusion was a bit pre-mature. mon1 just went down (while the other took over) and a bt showed it was hanging again on MakeRoomForWrite(bool).

The backtrace is attached.

I stopped mon1, left it out for about 10 minutes and after starting again it became the leader again and thus joined the quorum.

#28 Updated by Sage Weil over 6 years ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF