Bug #15233
Flattening an rbd image with a running QEMU instance causes librbd worker threads to hang.
Status: Closed
Description
I'm sorry if this is noise, but #14483 is not solved for me, and it seems that commenting on an already-resolved issue doesn't attract any attention.
Let me know if you need any more input.
Updated by Jason Dillaman about 8 years ago
I assume this is on 0.94.6? Can you please provide the full "debug rbd=20" logs pre/post hang or can you attach to the hung QEMU process and provide the full backtrace (thread apply all bt)?
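For reference, one way to get that level of client-side logging is via the `[client]` section of ceph.conf on the hypervisor before starting the guest (a sketch only; the log-file path is an example, and QEMU must be restarted to pick the settings up):

```
[client]
    debug rbd = 20
    log file = /var/log/ceph/qemu.$pid.log
```

The backtrace can then be captured from the running process with something like `gdb -p <qemu-pid> -batch -ex "thread apply all bt"`.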
Updated by Jason Dillaman about 8 years ago
- Priority changed from Normal to High
Updated by Jason Dillaman about 8 years ago
- Status changed from New to Need More Info
Updated by Christian Theune about 8 years ago
- File qemu-gdb.log qemu-gdb.log added
Thanks, sorry for the delay, I must have missed the notification.
You can find the RBD client log output and some GDB output attached.
librados isn't built with debug symbols, so this won't be too helpful. However, something I noticed while attached with GDB during the flatten was this:
[New Thread 0x7fad884f5700 (LWP 629)]
[New Thread 0x7fad1c1c2700 (LWP 630)]
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
Program received signal SIGPIPE, Broken pipe.
0x00007fadc60afbcd in write () from /lib64/libpthread.so.0
(gdb) Continuing.
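As an aside: SIGPIPE stops like these are routine for a networked client such as librados (a peer closed a socket mid-write), and GDB can be told to pass the signal through without pausing so it doesn't interrupt the session (standard GDB commands, not specific to this bug):

```
(gdb) handle SIGPIPE nostop noprint pass
```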
In the client log, you can see:
- the new image being created (test24.root)
- the image being opened and used
There is some more traffic interleaved, as the host in our development cluster also runs other workloads. I couldn't find a reliable filter to remove those entries without risking the loss of relevant information; maybe you can filter them more effectively than I can.
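One crude way to narrow a mixed client log to a single image's traffic is to collect the thread ids of every log line that mentions the image, then keep all lines from those threads. This is a sketch under assumptions: the image name (`test24.root` here) appears verbatim in at least one line per relevant thread, and the thread id is the third whitespace-separated token, as in typical Ceph client log lines.

```python
import re

def filter_log(lines, needle):
    """Keep lines from any thread that ever mentions `needle`.

    Ceph client log lines look like
    "2016-04-05 09:48:03.160300 7fadbf0bf700 20 librbd: ...";
    the third token is the thread id. Replaying every line from the
    threads that touch the image of interest loses less context than
    a plain grep for the image name.
    """
    tid_re = re.compile(r'^\S+ \S+ ([0-9a-f]+)\b')
    tids = set()
    for line in lines:
        m = tid_re.match(line)
        if m and needle in line:
            tids.add(m.group(1))
    return [l for l in lines
            if (m := tid_re.match(l)) and m.group(1) in tids]
```

Lines from unrelated threads that never name the image are still dropped, so this remains a heuristic rather than a guaranteed-complete filter.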
Hope this helps - let me know if even more input would help.
Christian
Updated by Christian Theune about 8 years ago
Oh, and yes: this is 0.94.6 (with jemalloc)
Updated by Christian Theune about 8 years ago
It seems the client log went missing. Here it is (again). Ah, it's too big: Redmine says the max size is 73.4 MiB, but your nginx rejects even 1.5 MiB. Alright.
Let's put it here instead:
http://shared00.fe.rzob.gocept.net/~ctheune/ceph-15233-client.log.xz
Updated by Jason Dillaman about 8 years ago
Thanks -- so what client applications does this log output include? Does this log combine the rbd CLI and qemu? What specific commands do you run to repeat the hang? I don't see any logs for a "flatten" request in the provided client log but I do see the thread pool hang warning:
2016-04-05 09:48:03.160300 7fadbf0bf700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7fadb8dac700' had timed out after 60
Updated by Jason Dillaman about 8 years ago
- Status changed from Need More Info to 12
- Priority changed from High to Urgent
Updated by Jason Dillaman about 8 years ago
- Status changed from 12 to In Progress
- Assignee set to Jason Dillaman
- Backport set to hammer,infernalis
Updated by Jason Dillaman about 8 years ago
- Copied to Backport #15414: hammer: Flattening an rbd image with a running Qemu instances causes librbd worker threads to hang. added
Updated by Jason Dillaman about 8 years ago
- Copied to Backport #15415: infernalis: Flattening an rbd image with a running Qemu instances causes librbd worker threads to hang. added
Updated by Jason Dillaman about 8 years ago
- Status changed from In Progress to Need More Info
@Christian: It just occurred to me that the fix for the original issue, while flagged as resolved, is still forthcoming in the 0.94.7 release (http://tracker.ceph.com/issues/14611). Another (semi-)related issue (http://tracker.ceph.com/issues/15033) is also forthcoming and should close another possible deadlock. I am going to move this back to "Need More Info" for now, pending a re-test on 0.94.7.
Thanks.
Updated by Jason Dillaman almost 8 years ago
@Christian: v0.94.7 is now available. Any chance you can retest this issue and see if you can still repeat it?
Updated by Christian Theune almost 8 years ago
Looks like this works now. Previously I could reliably trigger VMs getting stuck after flattening, and that hasn't happened in my tests. I'll start using cloning more widely again, which will show whether this holds up in a larger environment. Thanks!
Updated by Jason Dillaman almost 8 years ago
- Status changed from Need More Info to Duplicate
Awesome news! I am going to close this ticket for now -- please re-open it if the issue re-occurs.