Bug #4079

osd: journal aio deadlock

Added by Alex Elder over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I don't really understand this yet. I have seen it occurring
with the new request code. I thought there could be a
connection with http://tracker.ceph.com/issues/3936 but maybe
that was just wishful thinking.

Starting last week I found that my normal rbd xfstests run
is hitting a problem around the time it runs test 074 or 075.
When this problem occurs, the test seems to be hung. However,
if you give it enough time you see that some forward progress
is made, it just takes a HUGE amount of time before tests
complete (i.e., hours rather than seconds or minutes).

I also notice that log messages like those shown below keep
showing up (the "Mounting Filesystem" and "Ending clean mount"
messages are normal when running these tests).

This is a fairly serious problem, and it is preventing me
from completing my normal test runs (well, within a reasonable
amount of time).

[ 4761.797182] XFS (rbd2): Mounting Filesystem
[ 4765.918310] XFS (rbd2): Ending clean mount
[ 4767.139085] XFS (rbd2): Mounting Filesystem
[ 4771.009442] XFS (rbd2): Ending clean mount
[ 5665.686530] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[ 6101.597699] XFS (rbd2): Mounting Filesystem
[ 6101.899587] XFS (rbd2): Ending clean mount
[ 6875.996583] XFS (rbd2): Mounting Filesystem
[ 6880.217723] XFS (rbd2): Ending clean mount
[ 7860.758422] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[ 8759.535196] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[10927.275859] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[12131.669510] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[14428.139118] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[15326.917246] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[16954.044979] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[16954.070327] libceph: osd0 10.214.133.23:6800 socket error on read
[19011.107288] libceph: osd0 10.214.133.23:6800 socket closed (con state OPEN)
[21159.809958] XFS (rbd2): Mounting Filesystem
[21164.052011] XFS (rbd2): Ending clean mount

Associated revisions

Revision e5940da9 (diff)
Added by Sage Weil over 7 years ago

os/FileJournal: fix aio self-throttling deadlock

This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.

Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.

Fixes: #4079
Signed-off-by: Sage Weil <>
Reviewed-by: Samuel Just <>

Revision 3569489b (diff)
Added by Sage Weil over 7 years ago

os/FileJournal: fix aio self-throttling deadlock

This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.

Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.

Fixes: #4079
Signed-off-by: Sage Weil <>
Reviewed-by: Samuel Just <>
(cherry picked from commit e5940da9a534821d0d8f872c13f9ac26fb05a0f5)

History

#1 Updated by Alex Elder over 7 years ago

Below is a chart with the test number, the time from an example
run with results in line with what's expected, and the time from
a slow run I've seen. Not all tests are slowed down, and that
may give some information about what is affected by the problem.
Some variability in test times is normal, but it's normally a
few seconds off, not a factor of 100.

Test    Normal  Slow
----    ------  ----
001      4s     123s
002      0s     0s
003      0s     0s
004      5s     510s
005      1s     0s
006      2s     1s
007      1s     1s
008      5s     3s
009      4s     485s
011      1s     6s
012      2s     16s
013      103s   9642s
014      219s   5381s
015      3s     203s
017      33s    5760s
019      3s     499s
020      3s     3s
021      2s     489s
026      15s    550s
027      15s    560s
028      25s    506s
029      2s     459s
031      51s    7736s
032      47s    855s
033      29s    3279s
034      4s     511s
041      77s    5271s
046      14s    512s
047      25s    507s
048      1s     0s
050      28s    3591s
051      1s     1s
052      3s     496s
053      3s     488s
054      15s    2434s
056      14s    552s
061      13s    549s
063      15s    551s
064      60s    1021s
065      40s    983s
066      4s     499s
067      6s     927s
069      7s     484s
070      27s    4876s
074      329s   3076s
075      363s   39375s
076      41s    3352s
078      70s    1330s
079      6s     495s
084      59s    136s
085      4s     498s
086      154s   15205s
087      52s    4976s
088      0s     0s
089      8s     16s
091      862s   15408s
092      4s     431s
100      18s    303s
103      2s     431s
105      3s     457s
108      13s    728s
110      38s    2217s
116      5s     488s
117      33s    5670s
118      7s     457s
119      10s    1394s
120      19s    472s
121      10s    488s
124      7s     530s
126      0s     0s
129      5s     435s
130      56s    807s
131      1s     1s
132      21s    468s
133      1149s  84964s
134      5s     469s
135      4s     474s
137      16s    489s
138      53s    2348s
139      51s    2390s
140      47s    2349s
141      3s     464s
164      0s     1s
165      1s     0s
166      2s     464s
167      333s   8033s
182      18s    535s
184      0s     0s
187      11s    1773s
188      20s    810s
189      6s     601s
190      4s     439s
192      40s    66s
194      3s     453s
196      6s     474s
199      3s     469s
201      58s    3615s
203      4s     464s
214      1s     36s
215      2s     2s
216      55s    3922s
220      6s     938s
221      1s     1s
222      4s     92s
223      31s    949s
224      39s    451s
225      30s    997s
226      96s    8298s
227      614s   76177s
234      14s    780s
236      1s     1s
237      0s     0s
238      4s     485s
241      73s    242s
243      1s     0s
244      13s    1424s
245      0s     0s
246      0s     1s
247      93s    2689s
248      0s     0s
249      2s     19s
253      11s    978s
257      0s     1s
258      1s     34s
259      51s    560s
261      2s     92s
262      5s     509s
269      419s   5961s
277      5s     498s
278      6s     835s
280      5s     505s
285      1s     0s
286      11s    57s

#2 Updated by Alex Elder over 7 years ago

  • Status changed from New to In Progress

I tried running just tests 74 and 75 this morning, and running
just those two (twice each, as well as doing that several times
in a row) did not reproduce this behavior. Test 74 takes about
290+ seconds for me and test 75 takes roughly 230+ seconds.

I just tried running tests 50-80 (which ends up running some
tests we normally don't run from that range) and it looks like
I've hit it.

Tests 55 and 57-60 were not run. Test 62 (normally not run) had
errors. Tests 63-72 ran successfully (including 68, which is
normally not run). 73 produced an output mismatch. Then test
74 took 1524 seconds (rather than 290) and test 75 is still
running.

Looking at the client machine, this is the exact process
running right now:

0 0 19660 18907 20 0 24820 21216 sleep_ D ? 0:03 /var/lib/xfstests/ltp/fsx -d -N 10000 -l 10485760 -S 0 -P /var/lib/xfstests 075.2

That seems to be operating on this file:
/tmp/cephtest/test_dir.96W7qOfsGy/fsx/075.2
And attempting to read that file hangs on the read, or did
initially.

Using dynamic debug was showing no activity in rbd.

I just turned on dynamic debug for libceph and got some
output. I'll go take a look at that.

#3 Updated by Alex Elder over 7 years ago

I haven't done a lot of this, but here's something I find in
the client log after turning on all debug output for the
libceph kernel module.

This message gets sent to osd 1:
[40948.126518] libceph: ----- ffff88020cb9a200 to osd1 42=osd_op len 139+0+32768 -----

Above that we see:
[40948.126510] libceph: calc_object_layout 'rb.0.1008.238e1f29.000000000095' pgid 2.7938354d
[40948.126515] libceph: send_request ffff88020cbcfc00 tid 74425 to osd1 flags 36

So this is a WRITE ONDISK request with 32KB of data to object *00095.
Note transaction id is 74425.

That message gets written successfully in 0.450628 seconds:

[40948.946834] libceph: prepare_write_message ffff88020cb9a200 seq 14941 type 42 len 139+0+32768 0 pgs
[40949.010656] libceph: prepare_write_message front_crc 3377833246 middle_crc 0
[40949.046656] libceph: try_write out_kvec_bytes 202
[40949.080050] libceph: write_partial_kvec ffff880221d7b030 202 left
[40949.115100] libceph: write_partial_kvec ffff880221d7b030 0 left in 0 kvecs ret = 1
[40949.180145] libceph: write_partial_msg_pages ffff880221d7b030 msg ffff88020cb9a200 page 0/0 offset 0
[40949.250134] libceph: write_partial_msg_pages ffff880221d7b030 msg ffff88020cb9a200 done
[40949.321297] libceph: prepare_write_message_footer ffff880221d7b030
[40949.359660] libceph: write_partial_kvec ffff880221d7b030 13 left
[40949.397462] libceph: write_partial_kvec ffff880221d7b030 0 left in 0 kvecs ret = 1

Nothing really happens then for about 3 seconds, until:

[40952.142028] libceph: ceph_sock_data_ready on ffff880221d7b030 state = 5, queueing work

The tag we get is 7, which indicates an incoming message, and then
we learn the message is type 43, CEPH_MSG_OSD_OPREPLY.

[40952.626335] libceph: ===== ffff88020cb9ba00 14941 from osd1 43=osd_opreply len 117+0 (2055398074 0 0) =====
[40952.688569] libceph: handle_reply ffff88020cb9ba00 tid 74425 result 0

It's the same transaction id, 74425 as the write we just sent.

Should I expect a 32KB write request to take 2.744566 seconds
to complete?

#4 Updated by Alex Elder over 7 years ago

And finally test 75 completed, taking 18,723 seconds,
more than 5 hours.

Again, forward progress, absurdly slow.

I could use some help looking at what's going on.

#5 Updated by Alex Elder over 7 years ago

As Sage suggested, I tried with an older kernel. Actually,
I just tried it with the newly updated master branch, which
is based on 3.8-rc5 and includes basically everything in
the testing branch except the new rbd request code.
3ebc21f7 libceph: fix messenger CONFIG_BLOCK dependencies

These tests ran: 50-54 56 61 63-72 74-76 78 79

And the time for tests 74 and 75 (and 76):
074 324s
075 253s
076 31s

I then switched to the testing branch.
07741308 rbd: add barriers near done flag operations

The same tests were run. Results:
074 307s
075 199s
076 30s

I'm going to keep trying. Obviously it doesn't always
happen.

#6 Updated by Alex Elder over 7 years ago

I should also mention, I'm running against the current "next"
branch from the ceph repository. At that moment, that is:
0942e005 OSD: always activate_map in advance_pgs, only send messages if up

And for what it's worth, I'm running with this for teuthology:
0c663ca8 schedule_suite.sh: ensure ceph and kernel branches exist

#7 Updated by Alex Elder over 7 years ago

Is there any chance clock skew could have an impact?
I just got these messages. I don't suppose this is a
problem but I thought I'd mention it.

INFO:teuthology.task.ceph.mon.b.err:2013-02-14 15:34:32.740303 7f73b1fa8700 -1 mon.b@1(peon).paxos(mdsmap active c 1..5) lease_expire from mon.0 10.214.131.10:6789/0 is 0.053851 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.b.err:2013-02-14 15:34:32.740423 7f73b1fa8700 -1 mon.b@1(peon).paxos(monmap active c 1..1) lease_expire from mon.0 10.214.131.10:6789/0 is 0.053820 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.b.err:2013-02-14 15:34:32.740526 7f73b1fa8700 -1 mon.b@1(peon).paxos(auth active c 1..17) lease_expire from mon.0 10.214.131.10:6789/0 is 0.053785 seconds in the past; mons are laggy or clocks are too skewed

#8 Updated by Alex Elder over 7 years ago

I am running xfstests 258 and 259 over rbd, repeating
10 times. The first three iterations took about 1 second
and 53 seconds for the tests, respectively. Then on the
fourth iteration it started taking about 10 seconds and
300+ seconds for them.

So I did some debug tracing on the client.

Here is an example, with extraneous stuff pulled out:

[42111.355376] libceph: send_request ffff880110be8400 tid 48315 to osd4 flags 36
[42111.453278] libceph: ----- ffff88020b8f0c00 to osd4 42=osd_op len 139+0+8192 -----
[42115.456282] libceph: prepare_write_message ffff88020b8f0c00 seq 3186 type 42 len 139+0+8192 0 pgs

This means it took 0.097902 seconds from the osd client reporting
it was going to send the request until the messenger saw it, and
then 4.003004 more seconds between the messenger queueing the
message and beginning to put it on the wire.

Note that after reporting this message (----- fff88...) there is
a test_and_set_bit() to see if we need to reschedule con_work().

Could this be yet another read/write barrier issue? If so
we seem to have some GREAT hardware for finding these.

I'm going to try an experiment to see if I can prove this
happens.

#9 Updated by Alex Elder over 7 years ago

Well, it absolutely looks like it can happen.
I found cases where different CPUs set and cleared
the WRITE_PENDING bit.

I'm going to add some memory barriers to see if I
can make the delays go away. I seem to be able to
reproduce the slowness (300 seconds rather than 50)
on test 259 fairly reliably.

#10 Updated by Alex Elder over 7 years ago

For a while I was pretty hopeful, because it appeared that
with these barriers in place I was getting slowness less
often. I had been getting it pretty reproducibly after one or
two passes running tests 258 and 259. The first time I ran
with barriers in place it didn't get slow for 10 consecutive
passes. The next one got fairly slow during the second pass.
The last time I ran them I got 8 passes of 1 second/55 seconds
or so, then pass 9 with 1 second/148 seconds, then the 10th
pass took 146 seconds for test 258 (rather than about 1 second)
and 425 seconds for test 259.

So it's possible the barriers are improving things, but
it's not at all conclusive. And even if they are, there
is more to it than just that making things slow.

#11 Updated by Alex Elder over 7 years ago

I don't think the barrier improved anything.
All of those state changes should be happening
under protection of a mutex, so there should
be no need for a barrier.

The next thing I found yesterday was that 0-length
requests were handled differently in the new request
code. Previously they were ignored, whereas the new
code ended up submitting them as 0-length object
object requests. I believe that should have been
harmless, but just to be sure I made a change to
ignore 0-length requests. Once again, things
look OK for a while, then they get slow.

I have been noticing that when things get slow
there is a "mkfs.xfs" process running, and it
appears to be spending a lot of time in one of
two places. Example stack traces of each are
below. I'm using kdb this morning to try to
learn more detail about what's going on.

root@plana27:~# cat /proc/17552/stack
[<ffffffff81314f31>] blkdev_issue_flush+0x141/0x1a0
[<ffffffff811bd45e>] blkdev_fsync+0x3e/0x50
[<ffffffff811b443d>] do_fsync+0x5d/0x90
[<ffffffff811b4840>] sys_fsync+0x10/0x20
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
root@plana27:~#

root@plana27:~# cat /proc/7240/stack
[<ffffffff811c13b7>] do_blockdev_direct_IO+0x1947/0x2060
[<ffffffff811c1b25>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811bd047>] blkdev_direct_IO+0x57/0x60
[<ffffffff81131b80>] generic_file_direct_write+0xd0/0x190
[<ffffffff81131f1d>] __generic_file_aio_write+0x2dd/0x3b0
[<ffffffff811bd912>] blkdev_aio_write+0x52/0xc0
[<ffffffff81183843>] do_sync_write+0xa3/0xe0
[<ffffffff81183ef3>] vfs_write+0xb3/0x180
[<ffffffff811843ba>] sys_pwrite64+0x9a/0xa0
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
root@plana27:~#

#12 Updated by Alex Elder over 7 years ago

After lots of failed attempts to fix this, I have what
I hope is something helpful.

I've added dout() calls to basically trace the process
of a request as it moves through rbd. I just capture
10 seconds worth of this during a period where I find
mkfs.xfs is stuck waiting for completion of direct I/O.

Here's that process:
16959 ? D 0:00 /sbin/mkfs.xfs -f -l su=32k /dev/rbd3

Here is its kernel stack trace before I started capturing
the rbd tracing info:
[<ffffffff811c13b7>] do_blockdev_direct_IO+0x1947/0x2060
[<ffffffff811c1b25>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811bd047>] blkdev_direct_IO+0x57/0x60
[<ffffffff81131b80>] generic_file_direct_write+0xd0/0x190
[<ffffffff81131f1d>] __generic_file_aio_write+0x2dd/0x3b0
[<ffffffff811bd912>] blkdev_aio_write+0x52/0xc0
[<ffffffff81183843>] do_sync_write+0xa3/0xe0
[<ffffffff81183ef3>] vfs_write+0xb3/0x180
[<ffffffff811843ba>] sys_pwrite64+0x9a/0xa0
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

And here it is after that 10 seconds had passed (it's
the same as before).
[<ffffffff811c13b7>] do_blockdev_direct_IO+0x1947/0x2060
[<ffffffff811c1b25>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811bd047>] blkdev_direct_IO+0x57/0x60
[<ffffffff81131b80>] generic_file_direct_write+0xd0/0x190
[<ffffffff81131f1d>] __generic_file_aio_write+0x2dd/0x3b0
[<ffffffff811bd912>] blkdev_aio_write+0x52/0xc0
[<ffffffff81183843>] do_sync_write+0xa3/0xe0
[<ffffffff81183ef3>] vfs_write+0xb3/0x180
[<ffffffff811843ba>] sys_pwrite64+0x9a/0xa0
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


Here is the tracing information that got dumped to
the log, with some added annotation. It walks through
an entire image request lifetime.

[ 2192.531728] rbd: rbd_osd_write_callback: ffff88020b259d00 result 0 512/512
[ 2192.531734] rbd: rbd_obj_request_complete: ffff88020b259d00 callback ffffffffa09d9e10
[ 2192.531738] rbd: rbd_img_obj_callback: img ffff88020ca36240 obj ffff88020b259d00
[ 2192.531741] rbd: rbd_img_obj_callback: which 0 next_compl 0 req_count 1
[ 2192.531760] rbd: rbd_img_request_complete: ffff88020ca36240
[ 2192.531763] rbd: rbd_img_request_destroy(ffff88020ca36240)
[ 2192.531765] rbd: rbd_img_obj_request_del(ffff88020ca36240, ffff88020b259d00) 0
[ 2192.531768] rbd: rbd_obj_request_destroy(ffff88020b259d00)

(Partial log of activity above, ignore it for now.)

First, a new image request gets created in response to a request
from the block layer. The data structure representing it is
located at ffff88020ca36840.
[ 2192.531845] rbd: rbd_img_request_create(ffff880219805800, 0x1bc01000, 0x1000, true) -> ffff88020ca36840

That request is split into object requests, each of which
goes to one object containing rbd image data. There is
only one in this case. The object request is at address
ffff88020b258700, and is added to the image request.
[ 2192.531851] rbd: rbd_img_request_fill_bio(ffff88020ca36840, ffff88020ca37d40)
[ 2192.531857] rbd: rbd_obj_request_create("rb.0.1021.238e1f29.00000000006f", 0x1000, 0x1000, 1) -> ffff88020b258700
[ 2192.531866] rbd: rbd_img_obj_request_add(ffff88020ca36840, ffff88020b258700) 0

The image request is submitted, and that causes all of its
object requests to be sent to their target osds.
[ 2192.531868] rbd: rbd_img_request_submit(ffff88020ca36840)
[ 2192.531871] rbd: rbd_obj_request_submit(ffff88020ca3e960, ffff88020b258700)

Now we wait until we get notified of completion. Eventually,
net/ceph/osd_client.c:handle_reply() gets called when the
messenger receives a message destined for the osd client.
Once that has done its processing, it calls the request's
r_callback function, which will always be rbd_osd_req_callback()
for rbd requests.

That function fans out to one of several functions depending
on which type of request it was. In this case it was a
write:
[ 2197.581086] rbd: rbd_osd_write_callback: ffff88020b258700 result 0 4096/4096

The osd write request will now be marked done, so the object
request is completed:
[ 2197.581091] rbd: rbd_obj_request_complete: ffff88020b258700 callback ffffffffa09d9e10

Now that the object request is complete, its callback function
is called. For object requests attached to image requests,
this will be rbd_img_obj_callback():
[ 2197.581094] rbd: rbd_img_obj_callback: img ffff88020ca36840 obj ffff88020b258700

And this is object request 0 for that image, which is the
next one we're waiting to complete for this image request,
and there is only one such object request.
[ 2197.581098] rbd: rbd_img_obj_callback: which 0 next_compl 0 req_count 1

Since that was the last (only) object request for this
image, we've completed the entire image request, so its
completion routine is called:
[ 2197.581109] rbd: rbd_img_request_complete: ffff88020ca36840

This image request has no callback routine, so we're done.
We just drop a reference on this image request, which
turns out to be the last reference, so it gets destroyed.
[ 2197.581112] rbd: rbd_img_request_destroy(ffff88020ca36840)

...including destroying all its component object requests.
[ 2197.581115] rbd: rbd_img_obj_request_del(ffff88020ca36840, ffff88020b258700) 0
[ 2197.581118] rbd: rbd_obj_request_destroy(ffff88020b258700)

And that's it.

Now we start again with the next one.

[ 2197.581185] rbd: rbd_img_request_create(ffff880219805800, 0x1bc02000, 0x1000, true) -> ffff88020ca37a40
[ 2197.581191] rbd: rbd_img_request_fill_bio(ffff88020ca37a40, ffff88020ca37d40)
[ 2197.581197] rbd: rbd_obj_request_create("rb.0.1021.238e1f29.00000000006f", 0x2000, 0x1000, 1) -> ffff88020b259700
[ 2197.581206] rbd: rbd_img_obj_request_add(ffff88020ca37a40, ffff88020b259700) 0
[ 2197.581209] rbd: rbd_img_request_submit(ffff88020ca37a40)
[ 2197.581211] rbd: rbd_obj_request_submit(ffff88020ca3e960, ffff88020b259700)


OK, now let's look at the important two lines, which are
consecutive in the tracing output:

[ 2192.531871] rbd: rbd_obj_request_submit(ffff88020ca3e960, ffff88020b258700)
[ 2197.581086] rbd: rbd_osd_write_callback: ffff88020b258700 result 0 4096/4096

The time difference between these is 5.049215 seconds.

The first function consists of this:
return ceph_osdc_start_request(osdc, obj_request->osd_req, false);

And the second one is as described above, called back
immediately after receiving a response from the request
sent to the osd.

This is why I think I can conclude that something's going
on in the osd (or rados) that is causing these long delays.

I will try to do the same thing, but with full debug logging
so we can see all the messenger and osd client activity between
those two lines I highlighted above.

#13 Updated by Alex Elder over 7 years ago

The process:
20767 ? D 0:00 /sbin/mkfs.xfs -f -l su=32k /dev/rbd3

Its "before" kernel stack trace:
[<ffffffff811c13b7>] do_blockdev_direct_IO+0x1947/0x2060
[<ffffffff811c1b25>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811bd047>] blkdev_direct_IO+0x57/0x60
[<ffffffff81131b80>] generic_file_direct_write+0xd0/0x190
[<ffffffff81131f1d>] __generic_file_aio_write+0x2dd/0x3b0
[<ffffffff811bd912>] blkdev_aio_write+0x52/0xc0
[<ffffffff81183843>] do_sync_write+0xa3/0xe0
[<ffffffff81183ef3>] vfs_write+0xb3/0x180
[<ffffffff81184232>] sys_write+0x52/0xa0
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Its "after" kernel stack trace:
[<ffffffff811c13b7>] do_blockdev_direct_IO+0x1947/0x2060
[<ffffffff811c1b25>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811bd047>] blkdev_direct_IO+0x57/0x60
[<ffffffff81131b80>] generic_file_direct_write+0xd0/0x190
[<ffffffff81131f1d>] __generic_file_aio_write+0x2dd/0x3b0
[<ffffffff811bd912>] blkdev_aio_write+0x52/0xc0
[<ffffffff81183843>] do_sync_write+0xa3/0xe0
[<ffffffff81183ef3>] vfs_write+0xb3/0x180
[<ffffffff81184232>] sys_write+0x52/0xa0
[<ffffffff81666f19>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

And below is my analysis of the very verbose trace info.
We wait about 5 seconds for a response from the osd.

    --------------------
    New rbd image request: ffff88020b582a80
        offset: 0x25808000 = 629178368
        length: 0x40000 = 262144 (256KB)
        write: yes
    --------------------
[ 3953.583136] rbd:  rbd_img_request_create(ffff880219805800, 0x25808000, 0x40000, true) -> ffff88020b582a80

[ 3953.583140] rbd:  rbd_img_request_fill_bio(ffff88020b582a80, ffff88020b583980)
    --------------------
    New rbd object request: ffff880225f2a700
        object: "rb.0.1021.238e1f29.000000000096",
        offset: 0x8000 = 32768
        length: 0x40000 = 262144 (256KB)
        type: 1 = OBJ_REQUEST_BIO
    --------------------
[ 3953.583145] rbd:  rbd_obj_request_create("rb.0.1021.238e1f29.000000000096", 0x8000, 0x40000, 1) -> ffff880225f2a700

        --------------------
        OSD request allocated, including request and response messages
        request message is ffff880225f2bd00
        --------------------
[ 3953.583152] libceph:  ceph_msg_new ffff880225f2b900 front 512
[ 3953.583155] libceph:  ceph_msg_new ffff880225f2bd00 front 208

    --------------------
    object request added to image request, and submitted
    --------------------
[ 3953.583159] rbd:  rbd_obj_request_get ffff880225f2a700 (was 1)
[ 3953.583162] rbd:  rbd_img_obj_request_add(ffff88020b582a80, ffff880225f2a700) 0
[ 3953.583164] rbd:  rbd_img_request_submit(ffff88020b582a80)
[ 3953.583167] rbd:  rbd_obj_request_submit(ffff88020ca3e960, ffff880225f2a700)

        --------------------
        ceph_osdc_start_request() registers and maps the request
        --------------------
[ 3953.583171] libceph:  __register_request ffff88020f185800 tid 65536
[ 3953.583174] libceph:   first request, scheduling timeout
[ 3953.583179] libceph:  map_request ffff88020f185800 tid 65536
[ 3953.583183] libceph:  calc_object_layout 'rb.0.1021.238e1f29.000000000096' pgid 2.526e76bf
[ 3953.583190] libceph:  map_request tid 65536 pgid 2.76bf osd1 (was osd-1)
[ 3953.583193] libceph:  __remove_osd_from_lru ffff88020b595000

        --------------------
        __send_request() hands the request message to the messenger
        --------------------
[ 3953.583196] libceph:  send_request ffff88020f185800 tid 65536 to osd1 flags 36
                --------------------
                ceph_con_send() calls con->ops->get(con)
                --------------------
[ 3953.583199] libceph:  ceph_con_send: 2: mutex_lock(ffff88020b595030)
[ 3953.583202] libceph:  get_osd ffff88020b595000 1 -> 2
                --------------------
                ceph_con_send() puts the message on the out queue
                --------------------
[ 3953.583207] libceph:  ----- ffff880225f2bd00 to osd1 42=osd_op len 139+0+262144 -----
[ 3953.583210] libceph:  ceph_con_send: 2: mutex_unlock(ffff88020b595030)

                --------------------
                queue_con_delay() calls con->ops->get(con) (again)
                ...and ensures con_work will run
                --------------------
[ 3953.583212] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.583220] libceph:  queue_con_delay ffff88020b595030 0
        --------------------
        rbd drops an object request reference; done for now
        --------------------
[ 3953.583224] rbd:  rbd_obj_request_put ffff880225f2a700 (was 2)

--------------------
Now the messenger needs to send the message
--------------------
[ 3953.583232] libceph:  con_work: 2: mutex_lock(ffff88020b595030)

--------------------
There is nothing incoming
--------------------
[ 3953.583236] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.583238] libceph:  try_read tag 1 in_base_pos 0
[ 3953.583245] libceph:  try_read done on ffff88020b595030 ret 0

--------------------
The only outgoing request is the one we built above, ffff880225f2bd00
--------------------
[ 3953.583247] libceph:  try_write start ffff88020b595030 state 5
[ 3953.583250] libceph:  try_write out_kvec_bytes 0
[ 3953.583253] libceph:  prepare_write_message ffff880225f2bd00 seq 2 type 42 len 139+0+262144 0 pgs
[ 3953.583258] libceph:  prepare_write_message front_crc 541896034 middle_crc 0
[ 3953.583260] libceph:  try_write out_kvec_bytes 193

--------------------
Write out the header and middle
--------------------
[ 3953.583263] libceph:  write_partial_kvec ffff88020b595030 193 left
[ 3953.583270] libceph:  write_partial_kvec ffff88020b595030 0 left in 0 kvecs ret = 1

--------------------
Write out some of the bio data
--------------------
[ 3953.583273] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 page 0/0 offset 0
[ 3953.583399] libceph:  try_write done on ffff88020b595030 ret 0
--------------------
con_work() is done, call con->ops->put(con)
--------------------
[ 3953.583405] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
messenger gets notified the socket has space to write
--------------------
[ 3953.583959] libceph:  ceph_sock_write_space ffff88020b595030 queueing write work

--------------------
con_work() fires again
--------------------
[ 3953.583965] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.583973] libceph:  queue_con_delay ffff88020b595030 0
[ 3953.583985] libceph:  con_work: 0: mutex_lock(ffff88020b595030)

--------------------
nothing to read
--------------------
[ 3953.583990] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.583992] libceph:  try_read tag 1 in_base_pos 0
[ 3953.583998] libceph:  try_read done on ffff88020b595030 ret 0

--------------------
write out some more of the bio data from the same message
--------------------
[ 3953.584001] libceph:  try_write start ffff88020b595030 state 5
[ 3953.584003] libceph:  try_write out_kvec_bytes 0

[ 3953.584007] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 page 13/0 offset 2086
[ 3953.584104] libceph:  try_write done on ffff88020b595030 ret 0
[ 3953.584108] libceph:  con_work: 0: mutex_unlock(ffff88020b595030)
[ 3953.584111] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
the socket has space to write again
--------------------
[ 3953.584928] libceph:  ceph_sock_write_space ffff88020b595030 queueing write work
[ 3953.584933] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.584940] libceph:  queue_con_delay ffff88020b595030 0
[ 3953.584952] libceph:  con_work: 0: mutex_lock(ffff88020b595030)

--------------------
nothing to read
--------------------
[ 3953.584956] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.584959] libceph:  try_read tag 1 in_base_pos 0
[ 3953.584965] libceph:  try_read done on ffff88020b595030 ret 0
[ 3953.584967] libceph:  try_write start ffff88020b595030 state 5
[ 3953.584969] libceph:  try_write out_kvec_bytes 0

--------------------
write out some more of the bio data from the same message
--------------------
[ 3953.584973] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 page 25/0 offset 718
[ 3953.585097] libceph:  try_write done on ffff88020b595030 ret 0
[ 3953.585101] libceph:  con_work: 0: mutex_unlock(ffff88020b595030)
[ 3953.585104] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
and again--more space, nothing to read, write more from the bio
--------------------
[ 3953.585546] libceph:  ceph_sock_write_space ffff88020b595030 queueing write work
[ 3953.585550] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.585556] libceph:  queue_con_delay ffff88020b595030 0
[ 3953.586209] libceph:  con_work: 0: mutex_lock(ffff88020b595030)
[ 3953.586214] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.586216] libceph:  try_read tag 1 in_base_pos 0
[ 3953.586222] libceph:  try_read done on ffff88020b595030 ret 0
[ 3953.586225] libceph:  try_write start ffff88020b595030 state 5
[ 3953.586227] libceph:  try_write out_kvec_bytes 0
[ 3953.586230] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 page 40/0 offset 94
[ 3953.586412] libceph:  try_write done on ffff88020b595030 ret 0
[ 3953.586416] libceph:  con_work: 0: mutex_unlock(ffff88020b595030)
[ 3953.586419] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
and again--more space, nothing to read, write the rest of the bio
--------------------
[ 3953.586620] libceph:  ceph_sock_write_space ffff88020b595030 queueing write work
[ 3953.586625] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.586633] libceph:  queue_con_delay ffff88020b595030 0
[ 3953.586645] libceph:  con_work: 0: mutex_lock(ffff88020b595030)
[ 3953.586649] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.586652] libceph:  try_read tag 1 in_base_pos 0
[ 3953.586658] libceph:  try_read done on ffff88020b595030 ret 0
[ 3953.586660] libceph:  try_write start ffff88020b595030 state 5
[ 3953.586663] libceph:  try_write out_kvec_bytes 0
[ 3953.586666] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 page 61/0 offset 958
[ 3953.586690] libceph:  write_partial_msg_pages ffff88020b595030 msg ffff880225f2bd00 done

--------------------
now write out the message footer
--------------------
[ 3953.586692] libceph:  prepare_write_message_footer ffff88020b595030
[ 3953.586695] libceph:  write_partial_kvec ffff88020b595030 13 left
[ 3953.586701] libceph:  write_partial_kvec ffff88020b595030 0 left in 0 kvecs ret = 1

--------------------
the footer is out; nothing else to write, and we're done
--------------------
[ 3953.586704] libceph:  try_write nothing else to write.
[ 3953.586706] libceph:  try_write done on ffff88020b595030 ret 0
[ 3953.586710] libceph:  con_work: 0: mutex_unlock(ffff88020b595030)
[ 3953.586712] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
messenger gets notified the socket has incoming data
--------------------
[ 3953.802155] libceph:  ceph_sock_data_ready on ffff88020b595030 state = 5, queueing work

--------------------
con_work reads in the tag
--------------------
[ 3953.802161] libceph:  get_osd ffff88020b595000 2 -> 3
[ 3953.802170] libceph:  queue_con_delay ffff88020b595030 0
[ 3953.802187] libceph:  con_work: 0: mutex_lock(ffff88020b595030)
[ 3953.802192] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.802194] libceph:  try_read tag 1 in_base_pos 0
[ 3953.802201] libceph:  try_read got tag 8

--------------------
the tag is CEPH_MSGR_TAG_ACK, acknowledging our message
--------------------
[ 3953.802204] libceph:  prepare_read_ack ffff88020b595030
[ 3953.802210] libceph:  got ack for seq 2 type 42 at ffff880225f2bd00
[ 3953.802213] libceph:  ceph_msg_remove
[ 3953.802216] libceph:  put_osd ffff88020b595000 3 -> 2
[ 3953.802219] libceph:  prepare_read_tag ffff88020b595030
[ 3953.802221] libceph:  try_read start on ffff88020b595030 state 5
[ 3953.802224] libceph:  try_read tag 1 in_base_pos 0
[ 3953.802228] libceph:  try_read done on ffff88020b595030 ret 0
[ 3953.802231] libceph:  try_write start ffff88020b595030 state 5
[ 3953.802233] libceph:  try_write out_kvec_bytes 0
[ 3953.802236] libceph:  try_write nothing else to write.
[ 3953.802238] libceph:  try_write done on ffff88020b595030 ret 0
[ 3953.802241] libceph:  con_work: 0: mutex_unlock(ffff88020b595030)
[ 3953.802244] libceph:  put_osd ffff88020b595000 2 -> 1

... now nothing for a while ...
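The pattern traced above -- ceph_sock_write_space queues work, con_work takes the con mutex, finds nothing to read, and resumes the partial write of the same message from a saved page/offset cursor -- is a plain resume-on-writable loop. A minimal standalone sketch of that shape (all names here are hypothetical illustrations, not the kernel code):

```c
#include <stddef.h>

/* A message with a write cursor, like the page/offset pair that
 * write_partial_msg_pages advances in the trace above. */
struct out_msg {
	size_t len;	/* total bytes to send */
	size_t off;	/* bytes already sent */
};

/* One con_work pass: send as much as the socket will take right now
 * (capacity models free space in the socket buffer). The caller
 * requeues the work item when the write-space callback fires again,
 * and the next pass resumes from m->off. */
static size_t con_work_write(struct out_msg *m, size_t capacity)
{
	size_t n = m->len - m->off;

	if (n > capacity)
		n = capacity;
	m->off += n;	/* resume from here on the next pass */
	return n;
}
```

This is why the trace shows the same msg pointer at increasing page/offset positions across several con_work invocations: each pass drains whatever the socket accepts and parks the cursor until the next write-space notification.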

--------------------
mon_client isn't hunting, so sends out a keepalive
--------------------
[ 3954.443971] libceph:  monc delayed_work
[ 3954.443978] libceph:  con_keepalive ffff88020ca3e410
[ 3954.443981] libceph:  ceph_con_keepalive: 0: mutex_lock(ffff88020ca3e410)
[ 3954.443986] libceph:  ceph_con_keepalive: 0: mutex_unlock(ffff88020ca3e410)
[ 3954.443991] libceph:  queue_con_delay ffff88020ca3e410 0

--------------------
call __send_subscribe()
--------------------
[ 3954.443995] libceph:  ceph_x_should_authenticate want=39 need=0 have=39
[ 3954.443998] libceph:  ceph_x_is_authenticated want=39 need=0 have=39
[ 3954.444001] libceph:  __send_subscribe sub_sent=0 exp=0 want_osd=2
[ 3954.444004] libceph:  __schedule_delayed after 2000
--------------------
send out the monitor message
--------------------
[ 3954.444010] libceph:  con_work: 0: mutex_lock(ffff88020ca3e410)
[ 3954.444014] libceph:  try_read start on ffff88020ca3e410 state 5
[ 3954.444016] libceph:  try_read tag 1 in_base_pos 0
[ 3954.444023] libceph:  try_read done on ffff88020ca3e410 ret 0
[ 3954.444026] libceph:  try_write start ffff88020ca3e410 state 5
[ 3954.444028] libceph:  try_write out_kvec_bytes 0
[ 3954.444030] libceph:  prepare_write_keepalive ffff88020ca3e410
[ 3954.444033] libceph:  try_write out_kvec_bytes 1
[ 3954.444035] libceph:  write_partial_kvec ffff88020ca3e410 1 left
[ 3954.444044] libceph:  write_partial_kvec ffff88020ca3e410 0 left in 0 kvecs ret = 1
[ 3954.444047] libceph:  try_write nothing else to write.
[ 3954.444049] libceph:  try_write done on ffff88020ca3e410 ret 0
[ 3954.444052] libceph:  con_work: 0: mutex_unlock(ffff88020ca3e410)

... and nothing for 4 more seconds ...

--------------------
now osd_client times out
--------------------
[ 3958.576811] libceph:  timeout
[ 3958.576817] libceph:  request_next_osdmap have 7
[ 3958.576823] libceph:   tid 65536 is slow, will send keepalive on osd1

--------------------
handle_timeout() calls ceph_con_keepalive()
--------------------
[ 3958.576826] libceph:  con_keepalive ffff88020b595030
[ 3958.576829] libceph:  ceph_con_keepalive: 2: mutex_lock(ffff88020b595030)
[ 3958.576833] libceph:  ceph_con_keepalive: 2: mutex_unlock(ffff88020b595030)
[ 3958.576836] libceph:  get_osd ffff88020b595000 1 -> 2
[ 3958.576840] libceph:  queue_con_delay ffff88020b595030 0
[ 3958.576844] libceph:  __send_queued

--------------------
con_work() sends out the keepalive
--------------------
[ 3958.576848] libceph:  con_work: 2: mutex_lock(ffff88020b595030)
[ 3958.576852] libceph:  try_read start on ffff88020b595030 state 5
[ 3958.576855] libceph:  try_read tag 1 in_base_pos 0
[ 3958.576861] libceph:  try_read done on ffff88020b595030 ret 0
[ 3958.576864] libceph:  try_write start ffff88020b595030 state 5
[ 3958.576866] libceph:  try_write out_kvec_bytes 0
[ 3958.576868] libceph:  prepare_write_keepalive ffff88020b595030
[ 3958.576871] libceph:  try_write out_kvec_bytes 1
[ 3958.576873] libceph:  write_partial_kvec ffff88020b595030 1 left
[ 3958.576892] libceph:  write_partial_kvec ffff88020b595030 0 left in 0 kvecs ret = 1
[ 3958.576894] libceph:  try_write nothing else to write.
[ 3958.576897] libceph:  try_write done on ffff88020b595030 ret 0
[ 3958.576900] libceph:  con_work: 2: mutex_unlock(ffff88020b595030)
[ 3958.576903] libceph:  put_osd ffff88020b595000 2 -> 1
--------------------
messenger gets notified the socket has incoming data
--------------------
[ 3958.623971] libceph:  ceph_sock_data_ready on ffff88020b595030 state = 5, queueing work

--------------------
con_work reads in the tag, it's CEPH_MSGR_TAG_MSG
--------------------
[ 3958.623977] libceph:  get_osd ffff88020b595000 1 -> 2
[ 3958.623986] libceph:  queue_con_delay ffff88020b595030 0
[ 3958.624002] libceph:  con_work: 0: mutex_lock(ffff88020b595030)
[ 3958.624006] libceph:  try_read start on ffff88020b595030 state 5
[ 3958.624009] libceph:  try_read tag 1 in_base_pos 0
[ 3958.624016] libceph:  try_read got tag 7

--------------------
read in the message header--it's our response
--------------------
[ 3958.624018] libceph:  prepare_read_message ffff88020b595030
[ 3958.624021] libceph:  read_partial_message con ffff88020b595030 msg           (null)
[ 3958.624027] libceph:  got hdr type 43 front 117 data 0
[ 3958.624030] libceph:  ceph_con_in_msg_alloc: 0: mutex_unlock(ffff88020b595030)
[ 3958.624034] libceph:  get_osd ffff88020b595000 2 -> 3

--------------------
finally, the response to our request arrives
--------------------

[ 3958.624037] libceph:  get_reply tid 65536 ffff880225f2b900
[ 3958.624040] libceph:  ceph_con_in_msg_alloc: 0: mutex_lock(ffff88020b595030)

--------------------
read in the message data
--------------------
[ 3958.624043] libceph:  get_osd ffff88020b595000 3 -> 4
[ 3958.624053] libceph:  read_partial_message got msg ffff880225f2b900 117 (4140793455) + 0 (0) + 0 (0)
[ 3958.624058] libceph:  put_osd ffff88020b595000 4 -> 3
[ 3958.624063] libceph:  process_message: 0: mutex_unlock(ffff88020b595030)
[ 3958.624070] libceph:  ===== ffff880225f2b900 2 from osd1 43=osd_opreply len 117+0 (4140793455 0 0) =====
[ 3958.624076] libceph:  handle_reply ffff880225f2b900 tid 65536 result 0
[ 3958.624079] libceph:   dropping con_filling_msg ref ffff88020b595030
[ 3958.624082] libceph:  put_osd ffff88020b595000 3 -> 2

--------------------
result was good
--------------------
[ 3958.624084] libceph:  handle_reply result 0 bytes 0
[ 3958.624087] libceph:  handle_reply tid 65536 flags 37
[ 3958.624089] libceph:  __unregister_request ffff88020f185800 tid 65536
[ 3958.624092] libceph:  moving osd to ffff88020b595000 lru
[ 3958.624094] libceph:  __move_osd_to_lru ffff88020b595000
[ 3958.624096] libceph:   no requests, canceling timeout
--------------------
and now we're back in rbd, getting called back
--------------------
[ 3958.624101] rbd:  rbd_osd_req_callback: ffff88020f185800 ffff880225f2b900
[ 3958.624104] rbd:  rbd_osd_write_callback: ffff880225f2a700 result 0 262144/262144
[ 3958.624107] rbd:  rbd_obj_request_complete: ffff880225f2a700 callback ffffffffa09d9e10
[ 3958.624110] rbd:  rbd_img_obj_callback(ffff880225f2a700) 0
[ 3958.624113] rbd:  rbd_img_obj_callback: img ffff88020b582a80 obj ffff880225f2a700
[ 3958.624116] rbd:  rbd_img_obj_callback: which 0 next_compl 0 req_count 1
[ 3958.624132] rbd:  rbd_img_request_complete: ffff88020b582a80
[ 3958.624135] rbd:  rbd_img_request_put ffff88020b582a80 (was 1)
[ 3958.624138] rbd:  rbd_img_request_destroy(ffff88020b582a80)
[ 3958.624141] rbd:  rbd_img_obj_request_del(ffff88020b582a80, ffff880225f2a700) 0
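The "got tag 8" and "got tag 7" lines in the traces above are the messenger's one-byte framing tags (CEPH_MSGR_TAG_ACK and CEPH_MSGR_TAG_MSG); try_read is essentially a small dispatcher on that byte. Roughly (tag values are from the Ceph wire protocol; the enum and function are illustrative sketches, not the kernel's code):

```c
/* One-byte tag values from the Ceph messenger wire protocol. */
#define CEPH_MSGR_TAG_MSG	7	/* a message header follows */
#define CEPH_MSGR_TAG_ACK	8	/* peer acks messages up to a seq */
#define CEPH_MSGR_TAG_KEEPALIVE	9	/* no payload */

enum next_read { READ_MSG_HEADER, READ_ACK_SEQ, READ_TAG };

/* Sketch of the branch try_read takes after reading one tag byte. */
static enum next_read dispatch_tag(unsigned char tag)
{
	switch (tag) {
	case CEPH_MSGR_TAG_MSG:
		return READ_MSG_HEADER;	/* prepare_read_message */
	case CEPH_MSGR_TAG_ACK:
		return READ_ACK_SEQ;	/* prepare_read_ack */
	default:
		return READ_TAG;	/* e.g. keepalive: read the next tag */
	}
}
```

So the first trace (tag 8) drops into prepare_read_ack and removes the acked message, while the second (tag 7) drops into prepare_read_message and allocates the reply via get_reply, exactly as logged.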

#14 Updated by Alex Elder over 7 years ago

Here is the yaml file I used for reproducing this.

interactive-on-error:
roles:
- [mon.a, mon.c, osd.0, osd.1, osd.2]
- [mon.b, mds.a, osd.3, osd.4, osd.5]
- [client.0]
kernel:
  mon:
    branch: wip-4079
    kdb: true
  client:
    branch: wip-4079
    kdb: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
tasks:
- ceph:
    branch: next
    conf:
        global:
            debug ms: 10
        osd:
            debug osd: 20
- rbd:
    all:
- rbd.xfstests:
    client.0:
        count: 5
        tests: 258-259

#15 Updated by Ian Colle over 7 years ago

  • Assignee changed from Alex Elder to Josh Durgin

#16 Updated by Alex Elder over 7 years ago

Another possible clue. Things seemed to be going reasonably
well on this run, but sometime during test 78 things went bad,
and we also got lots of "mons are laggy or clocks are too skewed"
messages.

First, this is a set of possibly typical timings for these tests:

. . .
INFO:teuthology.task.rbd.client.0.out:070     22s
INFO:teuthology.task.rbd.client.0.out:074     295s
INFO:teuthology.task.rbd.client.0.out:075     240s
INFO:teuthology.task.rbd.client.0.out:076     37s
INFO:teuthology.task.rbd.client.0.out:078     72s
INFO:teuthology.task.rbd.client.0.out:079     4s
INFO:teuthology.task.rbd.client.0.out:084     59s
INFO:teuthology.task.rbd.client.0.out:085     4s
INFO:teuthology.task.rbd.client.0.out:086     160s
. . .

In this run, tests 075 and 076 show roughly the expected elapsed
times (240 and 37 seconds), but test 078 (which ran after the
messages below) took 707 seconds rather than the usual 72 or so,
and test 079 took 213 seconds rather than the expected 4.

. . .
INFO:teuthology.task.rbd.client.0.out:070     23s
INFO:teuthology.task.rbd.client.0.out:074     326s
INFO:teuthology.task.rbd.client.0.out:075     240s
INFO:teuthology.task.rbd.client.0.out:076     37s
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:12:49.873079 7fb1ba56e700 -1 mon.a@1(peon).paxos(monmap active c 1..1) lease_expire from mon.0 10.214.133.20:6789/0 is 0.100789 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:12:49.873082 7f72de10a700 -1 mon.c@2(peon).paxos(monmap active c 1..1) lease_expire from mon.0 10.214.133.20:6789/0 is 0.100793 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:00.061164 7fb1ba56e700 -1 mon.a@1(peon).paxos(logm updating c 132..634) lease_expire from mon.0 10.214.133.20:6789/0 is 0.847412 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:00.061199 7f72de10a700 -1 mon.c@2(peon).paxos(logm updating c 132..634) lease_expire from mon.0 10.214.133.20:6789/0 is 0.847448 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:00.588403 7fb1ba56e700 -1 mon.a@1(peon).paxos(monmap active c 1..1) lease_expire from mon.0 10.214.133.20:6789/0 is 0.544320 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:00.588527 7fb1ba56e700 -1 mon.a@1(peon).paxos(osdmap active c 1..5) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463277 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:00.588650 7fb1ba56e700 -1 mon.a@1(peon).paxos(auth active c 1..10) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463295 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:00.588762 7fb1ba56e700 -1 mon.a@1(peon).paxos(mdsmap active c 1..5) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463101 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:00.588384 7f72de10a700 -1 mon.c@2(peon).paxos(monmap active c 1..1) lease_expire from mon.0 10.214.133.20:6789/0 is 0.544301 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:00.588489 7f72de10a700 -1 mon.c@2(peon).paxos(osdmap active c 1..5) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463238 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:00.588624 7f72de10a700 -1 mon.c@2(peon).paxos(auth active c 1..10) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463268 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:00.588847 7f72de10a700 -1 mon.c@2(peon).paxos(mdsmap active c 1..5) lease_expire from mon.0 10.214.133.20:6789/0 is 0.463186 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.a.err:2013-02-22 19:13:01.181790 7fb1ba56e700 -1 mon.a@1(peon).paxos(pgmap updating c 124..626) lease_expire from mon.0 10.214.133.20:6789/0 is 0.889894 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.ceph.mon.c.err:2013-02-22 19:13:01.181809 7f72de10a700 -1 mon.c@2(peon).paxos(pgmap updating c 124..626) lease_expire from mon.0 10.214.133.20:6789/0 is 0.889913 seconds in the past; mons are laggy or clocks are too skewed
INFO:teuthology.task.rbd.client.0.out:078     707s
INFO:teuthology.task.rbd.client.0.out:079     213s
INFO:teuthology.task.rbd.client.0.out:084     100s
INFO:teuthology.task.rbd.client.0.out:085     204s
INFO:teuthology.task.rbd.client.0.out:086     4306s
. . .

#17 Updated by Sage Weil over 7 years ago

  • Assignee deleted (Josh Durgin)
  • Priority changed from High to Urgent

#18 Updated by Ian Colle over 7 years ago

  • Assignee set to Sage Weil

#19 Updated by Sage Weil over 7 years ago

  • Subject changed from krbd: requests become extremely slow following socket error to osd: journal aio deadlock

Tracked this down to a hang in the journal aio code: the throttle code in write_thread_entry().
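For context, the kind of byte throttle that write_thread_entry() blocks on can be sketched like this (a minimal standalone sketch, not Ceph's actual Throttle class): take() waits until in-flight bytes drop under the limit, so if the aio completion path that calls put() can never run, the journal write thread waits forever, which is the deadlock shape named in the new subject.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical minimal byte throttle. */
struct throttle {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	uint64_t max;	/* budget limit */
	uint64_t cur;	/* bytes currently in flight */
};

static void throttle_init(struct throttle *t, uint64_t max)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->cond, NULL);
	t->max = max;
	t->cur = 0;
}

/* Called by the write thread before submitting aio. Blocks until the
 * bytes fit under the limit; if the completion thread that calls
 * throttle_put() cannot make progress, this waits forever. */
static void throttle_take(struct throttle *t, uint64_t bytes)
{
	pthread_mutex_lock(&t->lock);
	while (t->cur + bytes > t->max)
		pthread_cond_wait(&t->cond, &t->lock);
	t->cur += bytes;
	pthread_mutex_unlock(&t->lock);
}

/* Called from aio completion to return budget and wake waiters. */
static void throttle_put(struct throttle *t, uint64_t bytes)
{
	pthread_mutex_lock(&t->lock);
	t->cur -= bytes;
	pthread_cond_broadcast(&t->cond);
	pthread_mutex_unlock(&t->lock);
}
```

The symptom upstream (requests crawling rather than failing outright) matches this: the journal isn't dead, it just makes progress only when something eventually releases budget.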

#20 Updated by Josh Durgin over 7 years ago

Alex saw a slowdown in one of his xfstests runs against ceph master. Logs and a short description of the setup are on plana29 in /home/ubuntu/bug_4079.

#21 Updated by Sage Weil over 7 years ago

  • Project changed from rbd to Ceph

#22 Updated by Sage Weil over 7 years ago

Not this bug.

Also, facepalm: the reason I can't reproduce it is because we turned off journal aio by default until we fix this. Need to put journal aio = true in the config to trigger it...

#23 Updated by Sage Weil over 7 years ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF