Bug #4348

OSD slow request leads to RBD clients stalled/delayed

Added by Ivan Kudryavtsev about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi. I have the following problem, and I believe it can be reproduced. One of my OSD logs is full of messages like:

2013-03-05 06:55:07.146109 7fd3a7abf700 0 log [WRN] : slow request 6335.099822 seconds old, received at 2013-03-05 05:09:32.046218: osd_op(client.11689.1:97589574 rb.0.2cd1.238e1f29.000000000c7e [write 1454080~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146111 7fd3a7abf700 0 log [WRN] : slow request 6335.099794 seconds old, received at 2013-03-05 05:09:32.046246: osd_op(client.11689.1:97589575 rb.0.2cd1.238e1f29.000000000c7e [write 1507328~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146113 7fd3a7abf700 0 log [WRN] : slow request 6335.099767 seconds old, received at 2013-03-05 05:09:32.046273: osd_op(client.11689.1:97589576 rb.0.2cd1.238e1f29.000000000c7e [write 1609728~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146114 7fd3a7abf700 0 log [WRN] : slow request 6335.099068 seconds old, received at 2013-03-05 05:09:32.046972: osd_op(client.11689.1:97589577 rb.0.2cd1.238e1f29.000000000c7e [write 1675264~4096] 2.28656d03) currently reached pg

However, this OSD is on the same host as several other OSDs and has the same configuration as they do, yet no other OSD log shows anything like this, so I suppose this particular OSD is misbehaving and it's an OSD bug.

I use the kernel RBD client.

On the client side I see corresponding messages like:
[643232.462554] libceph: tid 97589149 timed out on osd22, will reset osd

and all VMs that (I suppose) tried to write to that OSD become stalled, because the host is unable to kick the OSD out and keeps trying to work with it. After an OSD restart the problem goes away. Moreover, rbd map/unmap cannot operate correctly on the stalled /dev/rbd* devices.

To be clear: the cluster cannot tell that this OSD is misbehaving, since it is never marked down, but it clearly is misbehaving when it accumulates that many of these messages.

The best I can do for now is to pipe the log through an analyzer and restart the OSD whenever I detect this condition, until the bug is fixed.
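
A minimal sketch of such a watchdog (the log path, threshold, and restart command here are assumptions for my setup, not a tested script):

#!/bin/bash
# Hypothetical watchdog: restart one OSD once its log accumulates too many
# "slow request" warnings. Adjust OSD id, log path and threshold as needed.
OSD_ID=22
LOG=/var/log/ceph/ceph-osd.${OSD_ID}.log
THRESHOLD=100

while true; do
    # only look at the recent tail of the log
    COUNT=$(tail -n 2000 "$LOG" | grep -c "slow request")
    if [ "$COUNT" -ge "$THRESHOLD" ]; then
        /etc/init.d/ceph restart osd.${OSD_ID}
        sleep 600    # give the restarted OSD time to recover
    fi
    sleep 60
done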

Actions #1

Updated by Ivan Kudryavtsev about 11 years ago

BTW, I'm using Zabbix to monitor the OSD hosts, and on the problematic host I've seen slightly higher iowait over the last few days, around 0.3-1.5, while the other hosts show 0.1-0.5. But a detailed "iostat" didn't show anything extraordinary, so I missed it. All drives are identical and sit on the same Adaptec 6405, so nothing pointed to a hardware problem.

Actions #2

Updated by Ivan Kudryavtsev about 11 years ago

A few more details.

The problem stalls the system when Ceph either scrubs (so I disabled scrubbing a few days ago, because I had no time to investigate during peak hours) or grows (early this morning I added 3 more OSDs and hit it again during rebalancing); generally it appears whenever Ceph is under high load. Under low load the cluster works fine despite the problem (I tuned pdflush on the client to avoid big, long sync processes, which kept things from locking up until I started growing the cluster).
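
For reference, the pdflush tuning on the client was along these lines (the values shown are illustrative, not necessarily the exact ones I used):

# /etc/sysctl.d/30-rbd-writeback.conf -- illustrative values
# Flush dirty pages earlier and in smaller batches so a single sync
# never queues a huge burst of writes against one RBD device.
vm.dirty_background_ratio = 2
vm.dirty_ratio = 5
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500

Apply with "sysctl -p /etc/sysctl.d/30-rbd-writeback.conf".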

Actions #3

Updated by Ivan Kudryavtsev about 11 years ago

After I started the expansion (5:00 AM), I got the following messages in my client's kernel log:

Mar 5 05:10:20 tsk-vps-node-04 kernel: [636904.715559] libceph: osd18 down
Mar 5 05:10:20 tsk-vps-node-04 kernel: [636904.715562] libceph: osd22 down
Mar 5 05:10:24 tsk-vps-node-04 kernel: [636908.708570] libceph: osd22 up
Mar 5 05:10:24 tsk-vps-node-04 kernel: [636908.708576] libceph: osd16 down
Mar 5 05:10:45 tsk-vps-node-04 kernel: [636929.712269] libceph: osd20 down

But I didn't even touch osd.18, osd.22, osd.16, or osd.20, and they were marked down. What is going on?

Next come these messages:

Mar 5 05:10:50 tsk-vps-node-04 kernel: [636934.703926] libceph: tid 97588271 timed out on osd12, will reset osd
Mar 5 05:10:55 tsk-vps-node-04 kernel: [636939.694933] libceph: osd16 up
Mar 5 05:11:00 tsk-vps-node-04 kernel: [636944.700388] libceph: tid 97588346 timed out on osd21, will reset osd
Mar 5 05:11:32 tsk-vps-node-04 kernel: [636976.132798] libceph: osd18 up
Mar 5 05:11:50 tsk-vps-node-04 kernel: [636994.682574] libceph: tid 97589149 timed out on osd22, will reset osd
Mar 5 05:12:00 tsk-vps-node-04 kernel: [637004.585952] libceph: osd20 up
Mar 5 05:12:50 tsk-vps-node-04 kernel: [637054.661229] libceph: tid 97589149 timed out on osd22, will reset osd
Mar 5 05:13:50 tsk-vps-node-04 kernel: [637114.639889] libceph: tid 97589149 timed out on osd22, will reset osd

And after that:

Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184607] INFO: task kvm:29204 blocked for more than 120 seconds.
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184652] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184720] kvm D ffff883f0641d960 0 29204 1 0x00000000
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184724] ffff883f0641d960 0000000000000082 00000000000004c0 ffff881ff352d960
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184726] 0000000000013840 ffff8825c6769fd8 0000000000013840 ffff8825c6769fd8
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184729] 0000000000013840 ffff883f0641d960 0000000000013840 ffff8825c6768010
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184731] Call Trace:
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184739] [<ffffffff81014b55>] ? read_tsc+0x5/0x16
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184743] [<ffffffff810c85a6>] ? __lock_page+0x63/0x63
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184748] [<ffffffff81365ba7>] ? io_schedule+0x84/0xc3
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184750] [<ffffffff810c85af>] ? sleep_on_page+0x9/0xd
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184753] [<ffffffff813649e9>] ? __wait_on_bit+0x3e/0x6f
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184756] [<ffffffff8101548d>] ? native_sched_clock+0x28/0x33
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184758] [<ffffffff810c875a>] ? wait_on_page_bit+0x6a/0x70
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184762] [<ffffffff8105ace7>] ? autoremove_wake_function+0x2a/0x2a
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184764] [<ffffffff810c9a95>] ? grab_cache_page_write_begin+0xb4/0xc6
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184769] [<ffffffff81132191>] ? block_write_begin+0x22/0x72
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184771] [<ffffffff811348e4>] ? I_BDEV+0x8/0x8
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184773] [<ffffffff810c7d67>] ? generic_file_buffered_write+0x121/0x28e
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184776] [<ffffffff8105d56a>] ? __remove_hrtimer+0x2a/0x7f
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184778] [<ffffffff8105dac3>] ? hrtimer_cancel+0xc/0x16
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184781] [<ffffffff810c8ead>] ? __generic_file_aio_write+0x1cc/0x200
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184783] [<ffffffff81134d1b>] ? blkdev_aio_write+0x47/0x92
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184788] [<ffffffff8110b6d7>] ? do_sync_write+0x88/0xc1
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184791] [<ffffffff81176e68>] ? security_file_permission+0x18/0x2d
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184793] [<ffffffff8110c0e5>] ? vfs_write+0xa4/0xff
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184795] [<ffffffff8110c1a8>] ? sys_pwrite64+0x68/0x99
Mar 5 05:14:31 tsk-vps-node-04 kernel: [637155.184798] [<ffffffff8136cbe9>] ? system_call_fastpath+0x16/0x1b

OK, this is just the block device hanging.

BTW, the Ceph version is:
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)

Actions #4

Updated by Ivan Kudryavtsev about 11 years ago

My OSD tree is:

# id    weight  type name       up/down reweight
-1      30      pool default
-11     21              datacenter dc-1
-10     21                      room room-1
-8      9                               rack rack-1
-4      9                                       host ceph-osd-2-1
1       1                                               osd.1   up      1
5       1                                               osd.5   up      1
6       1                                               osd.6   up      1
9       1                                               osd.9   up      1
12      1                                               osd.12  up      1
15      1                                               osd.15  up      1
16      1                                               osd.16  up      1
17      1                                               osd.17  up      1
18      1                                               osd.18  up      1
-9      12                              rack rack-5
-5      12                                      host ceph-osd-1-1
2       1                                               osd.2   up      1
7       1                                               osd.7   up      1
8       1                                               osd.8   up      1
11      1                                               osd.11  up      1
13      1                                               osd.13  up      1
19      1                                               osd.19  up      1
20      1                                               osd.20  up      1
21      1                                               osd.21  up      1
22      1                                               osd.22  up      1
27      1                                               osd.27  up      1
28      1                                               osd.28  up      1
29      1                                               osd.29  up      1
-7      9               datacenter b-1
-6      9                       room b-1-room-1
-3      9                               rack b-1-rack-1
-2      9                                       host ceph-osd-3-1
0       1                                               osd.0   up      1
3       1                                               osd.3   up      1
4       1                                               osd.4   up      1
10      1                                               osd.10  up      1
14      1                                               osd.14  up      1
23      1                                               osd.23  up      1
24      1                                               osd.24  up      1
25      1                                               osd.25  up      1
26      1                                               osd.26  up      1

All pools have size 3, so, as you can see from the logs above, there should be no situation where all 3 OSDs from different crush branches become unavailable at once.

Actions #5

Updated by Ivan Kudryavtsev about 11 years ago

I also took perf dumps for all OSDs two days earlier. I can post an OSD latency table here, but I could not find any anomalies in the data across OSDs.

Actions #6

Updated by Sage Weil about 11 years ago

is 'journal aio = true' in your cluster? (ceph-osd -i NNN --show-config | grep journal_aio)

Actions #7

Updated by Ivan Kudryavtsev about 11 years ago

root@ceph-osd-1-1:~# ceph-osd -i 22 --show-config | grep journal_aio
journal_aio = false

Here is how my storage stack is organized:

I have the journal on tmpfs:
tmpfs on /srv/ceph/osd22.journal type tmpfs (rw,nosuid,nodev,size=1G)

[osd]
    osd op threads = 4
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id.journal/journal
    osd journal size = 995
    journal dio = false
    osd class dir = /usr/lib/rados-classes
    keyring = /etc/ceph/keyring.$name
    filestore fiemap = false

Next, the OSD data is on XFS, on a bcache device with writeback enabled, so I have moderate I/O to the HDD and a lot of I/O to the SSD; see below:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3,65    0,00    4,16    0,47    0,00   91,70

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               4,60         0,00         0,02          0          1
sdb             314,58         0,00         3,66          0        219
sdf              56,23         0,00         0,30          0         17
sde               0,13         0,00         0,00          0          0
sdg               0,07         0,00         0,00          0          0

sda is the system disk
sdb is the SSD cache
the remaining devices are backend storage

All devices are connected to the Adaptec 6405 RAID controller as simple volumes.
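
For reference, the bcache cache mode can be checked through sysfs (the device name bcache0 is just an example):

cat /sys/block/bcache0/bcache/cache_mode
# prints the available modes with the active one in brackets, e.g.
# writethrough [writeback] writearound none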

Actions #8

Updated by Ivan Kudryavtsev about 11 years ago

The Adaptec controller has a BBU.

Actions #9

Updated by Sage Weil about 11 years ago

is it possible to reproduce this with
debug ms = 1
debug osd = 20
?
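
for example, in ceph.conf:

[osd]
    debug ms = 1
    debug osd = 20

or injected into the running daemon (syntax shown as an illustration; substitute the right OSD id and adjust to your version):

ceph tell osd.22 injectargs '--debug-osd 20 --debug-ms 1'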

Actions #10

Updated by Ivan Kudryavtsev about 11 years ago

Unfortunately, no, because it's in production. The question is: how did the OSD get into a state with a massive amount of "slow requests"? I believe the later behavior is connected to that.

Several days have passed since the incident, and that OSD's log has no more such messages. On the day of the incident there were several thousand of them in the log file.

Actions #11

Updated by Sage Weil about 11 years ago

  • Project changed from rbd to Ceph
Actions #12

Updated by Sage Weil about 11 years ago

Ivan, are you still seeing this problem?

Actions #13

Updated by Sage Weil about 11 years ago

  • Status changed from New to Resolved

Oh, I just noticed this is 0.56.2. Upgrade to 0.56.4 and the stalls will go away.
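
A minimal sketch of the upgrade on a Debian-based host (assuming the ceph.com apt repository for 0.56.x is already configured; do one host at a time):

apt-get update && apt-get install ceph    # pulls the 0.56.4 packages
/etc/init.d/ceph restart                  # restart the daemons on this host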

Actions #14

Updated by Ivan Kudryavtsev almost 11 years ago

After upgrading to
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
it no longer behaves as before. It works OK; at least there are no more such messages in the logs.
