Bug #15233
Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang.
Status: Closed
Description
I'm sorry if this is noise, but #14483 is not solved for me, and it seems that commenting on an already-resolved issue doesn't attract any attention.
Let me know if you need any more input.
Updated by Jason Dillaman about 8 years ago
I assume this is on 0.94.6? Can you please provide the full "debug rbd=20" logs pre/post hang or can you attach to the hung QEMU process and provide the full backtrace (thread apply all bt)?
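For reference, one way to get that level of client-side logging is via the `[client]` section of ceph.conf on the hypervisor before starting the guest (a sketch only; the log-file path is an example, and QEMU must be restarted to pick the settings up):

```
[client]
    debug rbd = 20
    log file = /var/log/ceph/qemu.$pid.log
```

The backtrace can then be captured from the running process with something like `gdb -p <qemu-pid> -batch -ex "thread apply all bt"`.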
Updated by Jason Dillaman about 8 years ago
- Priority changed from Normal to High
Updated by Jason Dillaman about 8 years ago
- Status changed from New to Need More Info
Updated by Christian Theune about 8 years ago
- File qemu-gdb.log qemu-gdb.log added
Thanks, sorry for the delay, I must have missed the notification.
You can find the RBD client log output and some GDB output attached.
librados isn't built with debug symbols, so this won't be too helpful. However, something I noticed while attached with GDB during the flatten was this:
[New Thread 0x7fad884f5700 (LWP 629)]
[New Thread 0x7fad1c1c2700 (LWP 630)]
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
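As an aside: SIGPIPE stops like these are routine for a networked client such as librados (a peer closed a socket mid-write), and GDB can be told to pass the signal through without pausing so it doesn't interrupt the session (standard GDB commands, not specific to this bug):

```
(gdb) handle SIGPIPE nostop noprint pass
```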
In the client log, you can see:
- the new image being created (test24.root)
- the image being opened and used
There is some more traffic interleaved, as the host in our development cluster also runs other workloads. I couldn't find a reliable filter to remove those entries without risking the loss of relevant information; maybe you can filter them more effectively than I can.
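One crude way to narrow a mixed client log to a single image's traffic is to collect the thread ids of every log line that mentions the image, then keep all lines from those threads. This is a sketch under assumptions: the image name (`test24.root` here) appears verbatim in at least one line per relevant thread, and the thread id is the third whitespace-separated token, as in typical Ceph client log lines.

```python
import re

def filter_log(lines, needle):
    """Keep lines from any thread that ever mentions `needle`.

    Ceph client log lines look like
    "2016-04-05 09:48:03.160300 7fadbf0bf700 20 librbd: ...";
    the third token is the thread id. Replaying every line from the
    threads that touch the image of interest loses less context than
    a plain grep for the image name.
    """
    tid_re = re.compile(r'^\S+ \S+ ([0-9a-f]+)\b')
    tids = set()
    for line in lines:
        m = tid_re.match(line)
        if m and needle in line:
            tids.add(m.group(1))
    return [l for l in lines
            if (m := tid_re.match(l)) and m.group(1) in tids]
```

Lines from unrelated threads that never name the image are still dropped, so this remains a heuristic rather than a guaranteed-complete filter.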
Hope this helps - let me know if even more input would help.
Christian
Updated by Christian Theune about 8 years ago
Oh, and yes: this is 0.94.6 (with jemalloc)
Updated by Christian Theune about 8 years ago
It seems the client log went missing. Here it is (again). Ah, it's too big: Redmine says the max size is 73.4 MiB, but your nginx rejects even 1.5 MiB. Alright.
Let's put it here instead:
http://shared00.fe.rzob.gocept.net/~ctheune/ceph-15233-client.log.xz
Updated by Jason Dillaman about 8 years ago
Thanks -- so what client applications does this log output include? Does this log combine the rbd CLI and qemu? What specific commands do you run to repeat the hang? I don't see any logs for a "flatten" request in the provided client log but I do see the thread pool hang warning:
2016-04-05 09:48:03.160300 7fadbf0bf700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7fadb8dac700' had timed out after 60
Updated by Jason Dillaman about 8 years ago
- Status changed from Need More Info to 12
- Priority changed from High to Urgent
Updated by Jason Dillaman about 8 years ago
- Status changed from 12 to In Progress
- Assignee set to Jason Dillaman
- Backport set to hammer,infernalis
Updated by Jason Dillaman about 8 years ago
- Copied to Backport #15414: hammer: Flattening an rbd image with a running Qemu instances causes librbd worker threads to hang. added
Updated by Jason Dillaman about 8 years ago
- Copied to Backport #15415: infernalis: Flattening an rbd image with a running Qemu instances causes librbd worker threads to hang. added
Updated by Jason Dillaman about 8 years ago
- Status changed from In Progress to Need More Info
@Christian: It just occurred to me that the fix for the original issue, while flagged as resolved, is still forthcoming in the 0.94.7 release (http://tracker.ceph.com/issues/14611). Another (semi-)related issue (http://tracker.ceph.com/issues/15033) is also forthcoming and should close another possible deadlock. I am going to move this back to "Need More Info" for now, pending a re-test on 0.94.7.
Thanks.
Updated by Jason Dillaman almost 8 years ago
@Christian: v0.94.7 is now available. Any chance you can retest this issue and see if you can still repeat it?
Updated by Christian Theune almost 8 years ago
Looks like this works now. Previously I could reliably trigger VMs getting stuck after flattening, and that hasn't happened in my tests. I'll start using cloning more widely again, which will show whether this holds up in a larger environment. Thanks!
Updated by Jason Dillaman almost 8 years ago
- Status changed from Need More Info to Duplicate
Awesome news! I am going to close this ticket for now -- please re-open it if the issue re-occurs.