Bug #45456

The read and write operations of libceph are blocked waiting for IO to complete, causing a deadlock

Added by 伟 宋 almost 4 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

When libceph logs "mon2 10.36.4.65:6789 socket closed", services using rbd hang: the affected processes slowly enter D (uninterruptible sleep) state and eventually become zombies.

ceph version: 12.2.11

root@mgt11:/var/log# uname -a
Linux mgt11 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Log info:
<6>2020-04-30T19:19:58.530503+08:00 mgt11 kernel: [ 52.601531] EXT4-fs (rbd0): mounted filesystem with ordered data mode. Opts: (null)
<6>2020-04-30T19:20:02.230426+08:00 mgt11 kernel: [ 56.301070] IPv6: ADDRCONF: calibdcd013e339: link is not ready
<6>2020-04-30T19:20:02.230447+08:00 mgt11 kernel: [ 56.302392] IPv6: ADDRCONF: calibdcd013e339: link becomes ready
<6>2020-04-30T19:20:07.358406+08:00 mgt11 kernel: [ 61.428217] rbd: rbd4: capacity 8589934592 features 0x0
<6>2020-04-30T19:20:07.714439+08:00 mgt11 kernel: [ 61.787747] EXT4-fs (rbd4): mounted filesystem with ordered data mode. Opts: (null)
<6>2020-05-01T17:17:52.876321+08:00 mgt11 kernel: [79111.101347] libceph: osd0 down
<6>2020-05-01T17:17:52.876356+08:00 mgt11 kernel: [79111.101349] libceph: osd3 down
<4>2020-05-01T17:17:53.660257+08:00 mgt11 kernel: [79111.883634] libceph: mon2 10.36.4.65:6789 socket closed (con state OPEN)
<6>2020-05-01T17:17:53.660291+08:00 mgt11 kernel: [79111.883646] libceph: mon2 10.36.4.65:6789 session lost, hunting for new mon
<6>2020-05-01T17:17:57.288267+08:00 mgt11 kernel: [79115.511787] libceph: osd0 up
<6>2020-05-01T17:17:57.288302+08:00 mgt11 kernel: [79115.511788] libceph: osd3 up
<4>2020-05-01T17:17:57.288308+08:00 mgt11 kernel: [79115.512220] libceph: osd4 10.36.4.63:6805 socket closed (con state OPEN)
<4>2020-05-01T17:17:57.288311+08:00 mgt11 kernel: [79115.513001] libceph: osd4 10.36.4.63:6805 socket closed (con state NEGOTIATING)
<6>2020-05-01T17:18:12.392226+08:00 mgt11 kernel: [79130.617727] libceph: mon0 10.36.4.63:6789 session established
<4>2020-05-01T17:18:32.360251+08:00 mgt11 kernel: [79150.583404] libceph: mon0 10.36.4.63:6789 socket closed (con state OPEN)
<6>2020-05-01T17:18:32.360276+08:00 mgt11 kernel: [79150.583415] libceph: mon0 10.36.4.63:6789 session lost, hunting for new mon
<6>2020-05-01T17:18:38.248254+08:00 mgt11 kernel: [79156.473562] libceph: mon0 10.36.4.63:6789 session established
<4>2020-05-01T17:19:50.440272+08:00 mgt11 kernel: [79228.663308] libceph: mon0 10.36.4.63:6789 socket closed (con state OPEN)
<6>2020-05-01T17:19:50.440299+08:00 mgt11 kernel: [79228.663320] libceph: mon0 10.36.4.63:6789 session lost, hunting for new mon
<6>2020-05-01T17:19:50.440302+08:00 mgt11 kernel: [79228.665152] libceph: mon1 10.36.4.64:6789 session established
<3>2020-05-01T17:24:29.992333+08:00 mgt11 kernel: [79508.213884] INFO: task jbd2/rbd3-8:5884 blocked for more than 120 seconds.
<3>2020-05-01T17:24:29.992371+08:00 mgt11 kernel: [79508.213968] Not tainted 4.15.0-45-generic #48-Ubuntu
<3>2020-05-01T17:24:29.992375+08:00 mgt11 kernel: [79508.214018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>2020-05-01T17:24:29.992377+08:00 mgt11 kernel: [79508.214111] jbd2/rbd3-8 D 0 5884 2 0x80000000
<4>2020-05-01T17:24:29.992379+08:00 mgt11 kernel: [79508.214130] Call Trace:

<3>2020-05-01T17:26:30.824325+08:00 mgt11 kernel: [79629.045535] INFO: task jbd2/rbd0-8:8780 blocked for more than 120 seconds.
<3>2020-05-01T17:26:30.824358+08:00 mgt11 kernel: [79629.045611] Not tainted 4.15.0-45-generic #48-Ubuntu
<3>2020-05-01T17:26:30.824360+08:00 mgt11 kernel: [79629.045661] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>2020-05-01T17:26:30.824361+08:00 mgt11 kernel: [79629.045724] jbd2/rbd0-8 D 0 8780 2 0x80000000
<4>2020-05-01T17:26:30.824362+08:00 mgt11 kernel: [79629.045727] Call Trace:
<4>2020-05-01T17:26:30.824363+08:00 mgt11 kernel: [79629.045735] __schedule+0x291/0x8a0
<4>2020-05-01T17:26:30.824364+08:00 mgt11 kernel: [79629.045739] ? blk_mq_run_hw_queue+0x29/0x90
<4>2020-05-01T17:26:30.824365+08:00 mgt11 kernel: [79629.045741] ? bit_wait+0x60/0x60
<4>2020-05-01T17:26:30.824366+08:00 mgt11 kernel: [79629.045743] schedule+0x2c/0x80
<4>2020-05-01T17:26:30.824368+08:00 mgt11 kernel: [79629.045746] io_schedule+0x16/0x40
<4>2020-05-01T17:26:30.824369+08:00 mgt11 kernel: [79629.045748] bit_wait_io+0x11/0x60
<4>2020-05-01T17:26:30.824370+08:00 mgt11 kernel: [79629.045749] __wait_on_bit+0x4c/0x90
<4>2020-05-01T17:26:30.824371+08:00 mgt11 kernel: [79629.045751] out_of_line_wait_on_bit+0x90/0xb0
<4>2020-05-01T17:26:30.824373+08:00 mgt11 kernel: [79629.045755] ? bit_waitqueue+0x40/0x40
<4>2020-05-01T17:26:30.824374+08:00 mgt11 kernel: [79629.045757] __wait_on_buffer+0x32/0x40
<4>2020-05-01T17:26:30.824375+08:00 mgt11 kernel: [79629.045759] jbd2_journal_commit_transaction+0xd88/0x1720
<4>2020-05-01T17:26:30.824376+08:00 mgt11 kernel: [79629.045762] ? __switch_to_asm+0x34/0x70
<4>2020-05-01T17:26:30.824377+08:00 mgt11 kernel: [79629.045766] kjournald2+0xc8/0x270
<4>2020-05-01T17:26:30.824377+08:00 mgt11 kernel: [79629.045767] ? kjournald2+0xc8/0x270
<4>2020-05-01T17:26:30.824378+08:00 mgt11 kernel: [79629.045769] ? wait_woken+0x80/0x80
<4>2020-05-01T17:26:30.824379+08:00 mgt11 kernel: [79629.045773] kthread+0x121/0x140
<4>2020-05-01T17:26:30.824381+08:00 mgt11 kernel: [79629.045774] ? commit_timeout+0x20/0x20
<4>2020-05-01T17:26:30.824381+08:00 mgt11 kernel: [79629.045776] ? kthread_create_worker_on_cpu+0x70/0x70
<4>2020-05-01T17:26:30.824382+08:00 mgt11 kernel: [79629.045778] ret_from_fork+0x35/0x40
<3>2020-05-01T17:26:30.824383+08:00 mgt11 kernel: [79629.045806] INFO: task mysqld:22816 blocked for more than 120 seconds.
<3>2020-05-01T17:26:30.824384+08:00 mgt11 kernel: [79629.045867] Not tainted 4.15.0-45-generic #48-Ubuntu
<3>2020-05-01T17:26:30.824385+08:00 mgt11 kernel: [79629.045916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>2020-05-01T17:26:30.824386+08:00 mgt11 kernel: [79629.045977] mysqld D 0 22816 10800 0x00000000
<4>2020-05-01T17:26:30.824387+08:00 mgt11 kernel: [79629.045979] Call Trace:
<4>2020-05-01T17:26:30.824388+08:00 mgt11 kernel: [79629.045982] __schedule+0x291/0x8a0
<4>2020-05-01T17:26:30.824388+08:00 mgt11 kernel: [79629.045984] schedule+0x2c/0x80
<4>2020-05-01T17:26:30.824393+08:00 mgt11 kernel: [79629.045986] io_schedule+0x16/0x40
<4>2020-05-01T17:26:30.824395+08:00 mgt11 kernel: [79629.045989] wait_on_page_bit_common+0xd8/0x160
<4>2020-05-01T17:26:30.824396+08:00 mgt11
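The hung-task reports above show jbd2 and mysqld stuck in D (uninterruptible sleep) waiting on rbd I/O. A generic way to enumerate such processes on the affected host (a diagnostic sketch, not part of the original report) is:

```shell
# List processes in uninterruptible sleep (state D), the state the
# report describes; the wchan column shows where each task is sleeping.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# With root, dump kernel stacks of all blocked tasks into dmesg,
# which produces call traces like the ones quoted above:
#   echo w > /proc/sysrq-trigger
```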

History

#1 Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to Linux kernel client
  • Category deleted (librbd)

#2 Updated by Ilya Dryomov almost 3 years ago

  • Status changed from New to Need More Info

It's not just the monitor; this looks more like a general connectivity issue. Note that the monitor session was successfully re-established.
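For a connectivity issue like this, the libceph kernel client's debugfs interface can show the current monitor session and any in-flight OSD requests. A minimal check (a sketch, assuming debugfs is mounted at /sys/kernel/debug and the commands run as root):

```shell
# The kernel client exposes per-cluster state under debugfs:
#   monc - current monitor session state
#   osdc - in-flight OSD requests; entries stuck here would explain
#          the blocked I/O seen in the hung-task traces
for d in /sys/kernel/debug/ceph/*/; do
    echo "== $d"
    cat "$d/monc" "$d/osdc" 2>/dev/null || true
done
```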
