Project

General

Profile

Actions

Bug #4598

closed

kernel messenger: fix bogus asserts

Added by Alex Elder about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sage informed me of four crashes in nightly tested last night.

I believe they are due to an erroneous assertion. I have a
fix and will test when it's done building.

Actions #1

Updated by Alex Elder about 11 years ago

I haven't posted this for review yet, I want to test it first.

[PATCH] libceph: fix broken data length assertions

It's OK for the result of a read to come back with fewer bytes than
were requested. So don't trigger a BUG in that case when
initializing the data cursor.

This resolves:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <>
---
net/ceph/messenger.c | 4 +--
1 file changed, 2 insertions(
), 2 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d4e46d8..24f3aba 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@ -833,7 +833,7 @ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data *data,

BUG_ON(!data->pages);
BUG_ON(!data->length);
- BUG_ON(length != data->length);
+ BUG_ON(length > data->length); /* short reads are OK */
cursor->resid = length;
page_count = calc_pages_for(data->alignment, (u64)data->length);
@ -905,7 +905,7 @ static void ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data *data,
pagelist = data->pagelist;
BUG_ON(!pagelist);
- BUG_ON(length != pagelist->length);
+ BUG_ON(length > pagelist->length); /* short reads are OK */
if (!length)
return; /* pagelist can be assigned but empty */
--
1.7.9.5
Actions #2

Updated by Alex Elder about 11 years ago

My initial tests of my fix did not produce the same crash, so
I posted that for review. Now, after a bit longer though I
see there's another crash to investigate. I believe it's of
the same sort--a failure of an assertion that's not quite
correct.

Actions #3

Updated by Alex Elder about 11 years ago

OK, another assertion failure, but this one caught a problem.

ceph_msg_data_pages_advance() was allowing the page_offset field
of a page array cursor to reach PAGE_SIZE. But a page offset is
by definition less than that.

Generally it worked, but if an I/O ended up not sending/receiving
a complete page, yet the resulting offset was on a page boundary,
the page offset would not get reset to 0 as it should.

I'm about to test a fix, and will post that for review as a
separate patch under this same issue.

Actions #4

Updated by Alex Elder about 11 years ago

One more assertion failure. This one again is a correct
assertion so it's found another real problem in the code.

Here's a specific example of the troublesome case:
If page array data has a non-zero alignment, and a cursor
gets initialized for a length of exactly one page, the
"last_piece" field for the cursor will be initialized to
true, even though a second page will contain the last
piece. As a result, ceph_msg_data_pages_next() will
return a non-zero page offset and the entire residual
byte count, which will exceed the size of a page.

This is more generally true for any non-zero offset
coupled with a page-size or less length such that
their sum exceeds a page.

The fix is to take into account the offset for the
first page, when the cursor gets initialized. Any
page after the first will have a 0 offset, so this
is not required.

Currently pagelist and bio data do not use the
data alignment field (but they probably should;
certainly bio should).

I'll test this some more overnight but I'm about
to post the patch for review because I'm pretty
certain this is the problem I just hit.

Actions #5

Updated by Alex Elder about 11 years ago

The following has been posted for review:
[PATCH] libceph: account for alignment in pages cursor

Also, the previous fix had been posted also:
[PATCH] libceph: page offset must be less than page size
(Sage already reviewed the first two.)

Actions #6

Updated by Alex Elder about 11 years ago

I ran several passes through a bunch of fs
tests over the last day or so and there are
no more assertions triggered (neither valid
nor bogus ones).

My only concern as of last night was that while
running the "suites/ffsb.sh" workunit I saw a
lot of connection closed messages that I didn't
remember noticing before (something like 5-10
messages per second). So I spent some more
time really analyzing the cursor code and couldn't
really find anything else amiss.

So I ran another test using old kernel messenger code
(the master branch, which is quite old but it's something).

There I also found a period during which there
were lots of consecutive messages--again on the order
of 5-10 per second. The osd logs showed no crc errors
like I saw while tracking a recent bug.

So at this point I'm pretty convinced I've fixed the
various remaining issues with the newly-added messenger
code, which supports a single generic data item and
message cursor per message.

As soon as I can get a review on the last patch I'll
commit it. Meanwhile I'll start up another batch of
testing of the code overnight.

Actions #7

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF