Project

General

Profile

Bug #17913

librbd io deadlock after host lost network connectivity

Added by Dan van der Ster over 3 years ago. Updated over 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:
Crash signature:

Description

During a recent network outage we found one VM with a totally deadlocked attached rbd volume. The full ceph-client.log is attached, but the interesting part is below.

The qemu-kvm lost network at 2016-11-08 09:06:21.834225. Then a few minutes later:

2016-11-08 09:14:06.136622 7f6a8f2dc700  0 -- 10.16.105.66:0/1983621456 >> 188.184.36.166:6789/0 pipe(0x7f79846ef000 sd=1034 :0 s=1 pgs=0 cs=0 l=1 c=0x7f797751f440).fault
2016-11-08 09:14:31.375413 7f6a54523700 -1 librbd::ImageWatcher: 0x7f7983aa4ac0 image watch failed: 140159521141248, (107) Transport endpoint is not connected
2016-11-08 09:14:39.567941 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481051 rbd_data.6ae581d1314c088.00000000000156ee [set-alloc-hint object_size 4194304 write_size 4194304,write 4153344~40960] 4.47b99b35 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.201:6897/3581993, failed lossy con, dropping message 0x7f79847f0300
2016-11-08 09:14:42.024813 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481168 rbd_data.6ae581d1314c088.0000000000038ee5 [set-alloc-hint object_size 4194304 write_size 4194304,write 507904~4096] 4.91a10f63 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6817/3677267, failed lossy con, dropping message 0x7f79766e5700
2016-11-08 09:14:51.332193 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481353 rbd_data.6ae581d1314c088.0000000000027804 [set-alloc-hint object_size 4194304 write_size 4194304,write 3637248~4096] 4.85873218 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.222:6866/3426863, failed lossy con, dropping message 0x7f7977250000
2016-11-08 09:14:52.344017 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481432 rbd_data.6ae581d1314c088.0000000000020030 [set-alloc-hint object_size 4194304 write_size 4194304,write 1597440~45056] 4.3418d482 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.229:6928/3546026, failed lossy con, dropping message 0x7f7977dbb400
2016-11-08 09:15:08.629234 7f6a94dff700 -1 librbd::ImageWatcher: 0x7f796f2b4fc0 image watch failed: 140159514277632, (107) Transport endpoint is not connected
2016-11-08 09:15:08.641940 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481923 rbd_data.6ae581d1314c088.0000000000038d3e [set-alloc-hint object_size 4194304 write_size 4194304,write 983040~4096] 4.56cc8de3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.231:6940/4190058, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:15:08.643951 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481925 rbd_data.6ae581d1314c088.0000000000038e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 827392~8192] 4.fa7a45b3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6910/3679986, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:16:23.371172 7f795232f700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f6a93dfd700' had timed out after 60

The client was running librbd 0.94.9, as is the Ceph cluster.

ceph-client.1669.log.gz (52.5 KB) Dan van der Ster, 11/15/2016 01:44 PM

ceph.log.gz (138 KB) Christian Theune, 03/29/2017 02:52 PM

thread-apply-bt-all.txt View (58.4 KB) Christian Theune, 03/29/2017 03:21 PM

History

#1 Updated by Jason Dillaman over 3 years ago

  • Status changed from New to Need More Info

Am I correct in assuming that it never regained functionality after the network was restored? Any chance you collected a core dump from the deadlocked process?

#2 Updated by Dan van der Ster over 3 years ago

We can't say exactly when the network was restored, but know it was only down for a few minutes.

And unfortunately we don't have a core dump :(

#3 Updated by Dan van der Ster about 3 years ago

This happened again after a network outage yesterday (again 0.94.9 librbd):

2017-01-31 16:07:13.650311 7f8e25a82700 -1 librbd::ImageWatcher: 0x7f9cfed84d00 image watch failed: 140312834614016, (107) Transport endpoint is not connected
2017-01-31 16:09:00.503499 7f9ce4fc2700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f8e25281700' had timed out after 60
2017-01-31 16:09:05.503584 7f9ce4fc2700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f8e25281700' had timed out after 60
2017-01-31 16:09:10.503634 7f9ce4fc2700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f8e25281700' had timed out after 60

No core dump available.

#4 Updated by Dan van der Ster about 3 years ago

It happened again:

2017-02-03 11:38:56.185795 7fa6029a9700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7fa5fe09f700' had timed out after 60

and we got a backtrace:

(gdb) thread 2607
[Switching to thread 2607 (Thread 0x7fa5fe09f700 (LWP 8486))]
#0  0x00007fa60ec91e24 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fa60ec91e24 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00007fa615f725c0 in RWLock::get_read (this=0x7fa61ca0a1f8) at common/RWLock.h:76
#2  0x00007fa615fb0aac in RLocker (lock=..., this=<synthetic pointer>) at common/RWLock.h:130
#3  librbd::aio_write (ictx=0x7fa61ca0a000, off=654269480960, len=8257536,
    buf=0x7fa62ea34000 "000 (2538309.000.000) 01/29 12:41:25 Job submitted from host: <188.184.92.231:4080?addrs=188.184.92.231-4080&noUDP&sock=3450_78c5_70>\n    DAG Node: Job6\n...\n000 (2538310.000.000) 01/29 12:41:25 Job su"..., c=0x7fa62de77380, op_flags=0) at librbd/internal.cc:3388
#4  0x00007fa615f6a6b9 in Context::complete (this=0x7fa6266a5740, r=<optimized out>) at include/Context.h:65
#5  0x00007fa616084766 in ThreadPool::worker (this=0x7fa61be90340, wt=0x7fa61c9fd4d0) at common/WorkQueue.cc:128
#6  0x00007fa6160857e0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:397
#7  0x00007fa60ec8edc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fa60e9bd73d in clone () from /lib64/libc.so.6

We're gcore'ing that kvm process. Where can we post it? (It's several 10's of GB in size).

#5 Updated by Jason Dillaman about 3 years ago

@Dan van der Ster:

Please use the "ceph-post-file" utility to upload the core dump along with a listing of installed packages+versions on that host.

Thanks!

#6 Updated by Dan van der Ster about 3 years ago

Hi Jason -- our security officer is hesitating to let me post the machine memory dump. Could we meet on IRC and I can run what you need?

#7 Updated by Jason Dillaman about 3 years ago

@Dan van der Ster:

If you can install all necessary debug packages and get a complete gdb core backtrace via "thread apply all bt", that would be a good start.

#8 Updated by Christian Theune about 3 years ago

I'm seeing a similar issue but I'm not sure we're actually loosing network at that moment. However, I get the same output in the client log and have a stuck VM available.

I'll try to see to get a core dump right now. No security issues on my side as it's a testing machine, but we see a similar issue in production, too.

#9 Updated by Christian Theune about 3 years ago

BTW: this is also on hammer (0.94.7).

#10 Updated by Christian Theune about 3 years ago

Grr. My setup doesn't have debug symbols. I'm trying to get them but I may not be able to apply them to the currently running process. We'll see.

Something I'm not seeing in my logs is the `heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f6a93dfd700' had timed out after 60` message. Is that something you could relate to whether this is the same issue?

I'll try to find out what other environmental things were happening at the point in time when the VM got stuck.

#11 Updated by Christian Theune about 3 years ago

Here's what we see on our statistics for the VM at the time when it got stuck:

https://stats.flyingcircus.io/grafana/dashboard/snapshot/UZb8rvC3mIZe6m7Ya6LywQUwN3X0MDfF

#12 Updated by Christian Theune about 3 years ago

Alright. I could not prove any network issue - nothing in the kernel logs of the Qemu machine. However, our (development) cluster experienced a period of slow requests at the same time. That might be related. I'm attaching the relevant general Ceph log for that time. If needed I can extract more information from OSDs if you like.

That's it for today. I'll see whether I can get a gdb trace tomorrow.

#13 Updated by Christian Theune about 3 years ago

Recompiling did get me symbols on the running process. I hope those make sense and aren't just accidental garbage (I don't know whether anything checks the consistency of that stuff).

Anyway, I'm attaching the output of "thread apply all bt".

#14 Updated by Christian Theune about 3 years ago

Hey, @jdillaman, on IRC you mentioned no threads were stuck, but we lost touch. Is your indication that this isn't a Ceph issue but something else? (I.e. maybe a guest issue?)

#15 Updated by Jason Dillaman about 3 years ago

@Christian: I just think you need to recreate and get a fresh set of backtraces. Since you had the log message of "heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f6a93dfd700' had timed out after 60", you should have had a stuck thread.

#16 Updated by Christian Theune about 3 years ago

Ah, that was maybe a confusion. My instance of this error only had the first message:

librbd::ImageWatcher: 0x7f796f2b4fc0 image watch failed: 140159514277632, (107) Transport endpoint is not connected

But not the heartbeat one. Maybe we hav a different issue here. I guess the logging level should not be an issue here if the first message made it into the log but not the second ... right?

#17 Updated by Christian Theune almost 3 years ago

I think I found the culprit and I also think this may have been completely independent from Ceph. In our case, I found some VMs were stuck after having had a qemu guest agent "freeze" called semi-unsuccessfully (it returned an error but was frozen) and then the following thaw commands failed, too. This explains why the VM wasn't seeing IO any longer but was functional and all components like Ceph, Qemu itself, etc. appeared mysteriously fine. I applied a fix in our environment but maybe this was different from the original report in this issue after all ...

#18 Updated by Jason Dillaman over 2 years ago

  • Status changed from Need More Info to Rejected
  • Affected Versions v0.51 added

Also available in: Atom PDF