Bug #15233


Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang.

Added by Christian Theune about 8 years ago. Updated almost 8 years ago.

Status: Duplicate
Priority: Urgent
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source: other
Backport: hammer,infernalis
Regression: No
Severity: 3 - minor

Description

I'm sorry if this is noise, but #14483 is not solved for me, and commenting on an issue that is already marked as fixed doesn't seem to attract any attention.

Let me know if you need any more input.


Files

qemu-gdb.log (50.6 KB), Christian Theune, 04/05/2016 08:08 AM

Related issues 2 (0 open, 2 closed)

Copied to rbd - Backport #15414: hammer: Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang. (Duplicate, Jason Dillaman)
Copied to rbd - Backport #15415: infernalis: Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang. (Duplicate, Jason Dillaman)
Actions #1

Updated by Jason Dillaman about 8 years ago

I assume this is on 0.94.6? Can you please provide full "debug rbd=20" logs from before and after the hang, or attach to the hung QEMU process and provide the full backtrace (thread apply all bt)?
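
For reference, a minimal sketch of how one might capture both pieces of information. The [client] section and log path below are common defaults rather than anything specified in this ticket, and <qemu-pid> is a placeholder:

# In /etc/ceph/ceph.conf, enable verbose librbd logging for client processes:
[client]
    debug rbd = 20
    log file = /var/log/ceph/client.$pid.log

# Once the guest hangs, attach gdb to the QEMU process and dump every thread:
gdb -batch -p <qemu-pid> -ex 'thread apply all bt' > qemu-gdb.log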

Actions #2

Updated by Jason Dillaman about 8 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Jason Dillaman about 8 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Christian Theune about 8 years ago

Thanks, sorry for the delay, I must have missed the notification.

You can find the RBD client log output and some GDB output attached.

librados isn't built with debug symbols here, so this won't be too helpful. However, while attached with GDB during the flatten, I noticed this:

[New Thread 0x7fad884f5700 (LWP 629)]
[New Thread 0x7fad1c1c2700 (LWP 630)]

Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) 
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) 
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) 
Continuing.
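
As an aside, the repeated SIGPIPE stops above are just gdb breaking on a signal the process would otherwise handle or ignore; gdb can be told to pass it through silently. A standard gdb setting, not something used in this session:

(gdb) handle SIGPIPE nostop noprint pass
# gdb now delivers SIGPIPE to the inferior without stopping,
# so the session is no longer interrupted on every broken-pipe write().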

In the client log, you can see:

  • the new image being created (test24.root)
  • the image being opened and used

There is some more traffic interleaved, since the host in our development cluster also runs other workloads. I couldn't find a reliable filter to remove those entries without risking losing relevant information; maybe you can filter them more effectively than I can.
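
One crude filter, assuming the image name appears in the lines of interest (many librbd log entries reference an image by internal pointer or object prefix instead, so this can drop relevant lines; file names are hypothetical):

# Keep only the lines that mention the image by name:
grep 'test24.root' ceph-client.log > ceph-client-filtered.log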

Hope this helps - let me know if even more input would help.

Christian

Actions #5

Updated by Christian Theune about 8 years ago

Oh, and yes: this is 0.94.6 (with jemalloc)

Actions #6

Updated by Christian Theune about 8 years ago

It seems the client log went missing, so here it is again. Ah, it's too big: Redmine says the maximum size is 73.4 MiB, but your nginx rejects even 1.5 MiB. Alright, let's put it here instead:

http://shared00.fe.rzob.gocept.net/~ctheune/ceph-15233-client.log.xz

Actions #7

Updated by Jason Dillaman about 8 years ago

Thanks -- so what client applications does this log output include? Does this log combine the rbd CLI and qemu? What specific commands do you run to repeat the hang? I don't see any logs for a "flatten" request in the provided client log but I do see the thread pool hang warning:

2016-04-05 09:48:03.160300 7fadbf0bf700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7fadb8dac700' had timed out after 60
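
For context, the sequence that triggers this class of hang is flattening a clone while a guest is actively using it. A minimal reproduction sketch with standard rbd commands; the pool and parent image names are placeholders, not the reporter's exact commands:

rbd snap create rbd/golden@base           # snapshot the parent image
rbd snap protect rbd/golden@base          # clones require a protected snapshot
rbd clone rbd/golden@base rbd/test24.root
# boot a QEMU guest from rbd/test24.root, then, while it is running:
rbd flatten rbd/test24.root               # copy all parent data into the clone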

Actions #8

Updated by Jason Dillaman about 8 years ago

  • Status changed from Need More Info to 12
  • Priority changed from High to Urgent
Actions #9

Updated by Jason Dillaman about 8 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Jason Dillaman
  • Backport set to hammer,infernalis
Actions #10

Updated by Jason Dillaman about 8 years ago

  • Copied to Backport #15414: hammer: Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang. added
Actions #11

Updated by Jason Dillaman about 8 years ago

  • Copied to Backport #15415: infernalis: Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang. added
Actions #12

Updated by Jason Dillaman about 8 years ago

  • Status changed from In Progress to Need More Info

@Christian: It just occurred to me that the fix for the original issue, while flagged as resolved, is still forthcoming in the 0.94.7 release (http://tracker.ceph.com/issues/14611). Another (semi-)related fix (http://tracker.ceph.com/issues/15033), which should close another possible deadlock, is also forthcoming. I am going to move this back to "Need More Info" for now, pending a re-test on 0.94.7.

Thanks.

Actions #13

Updated by Jason Dillaman almost 8 years ago

@Christian: v0.94.7 is now available. Any chance you can retest this issue and see if you can still repeat it?

Actions #14

Updated by Christian Theune almost 8 years ago

Awesome, thanks. I'll check.

Actions #15

Updated by Christian Theune almost 8 years ago

Looks like this works now. Previously I could reliably get VMs stuck after flattening; that hasn't happened in my tests. I'll start using cloning more widely again, which will show whether this holds up in a larger environment. Thanks!

Actions #16

Updated by Jason Dillaman almost 8 years ago

  • Status changed from Need More Info to Duplicate

Awesome news! I am going to close this ticket for now -- please re-open it if the issue recurs.
