Project

General

Profile

Bug #4450

libceph: many socket "socket closed" issues on socket error

Added by Alex Elder almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

When running xfstests over rbd with error injection:

    conf:
        global:
            ms inject socket failures: 5000

I am now seeing many consecutive messages like this after
an injected socket failure. I suspect there is something
in my recent series of commits that is causing something to
get out of sync following a connection reset, but I haven't
figured out what. For reference, below is stuff I've
committed in the last week or so that puts a bound on where
I think the problem lies. It's probably more recent than this
but this at least defines something.

f6a8ba1 libceph: more cleanup of write_partial_msg_pages()
9856a63 libceph: kill message trail
7e24d23 libceph: kill osd request r_trail
b4ae545 libceph: have osd requests support pagelist data
58516b5 libceph: let osd ops determine request data length
6e8cc21 libceph: implement pages array cursor
fc9026f libceph: implement bio message data item cursor
7e14d6d libceph: use data cursor for message pagelist
1eb7d1c libceph: prepare for other message data item types
cc07389 libceph: start defining message data cursor
6e98d40 libceph: abstract message data
8cc0f56 libceph: be explicit about message data representation
23f4e9a libceph: define ceph_msg_has_*() data macros
3f76347 libceph: define and use ceph_crc32c_page()
fb2177d libceph: define and use ceph_tcp_recvpage()
473a03c libceph: encapsulate reading message data
86dc530 libceph: small write_partial_msg_pages() refactor
7c1775e libceph: consolidate message prep code
5478489 libceph: use local variables for message positions
b0e22c9 libceph: don't clear bio_iter in prepare_write_message()
00a1dc2 libceph: activate message data assignment checks
c61ffd1 libceph: set response data fields earlier
1ec4a63 libceph: record message data byte length
d66d17f ceph: only set message data pointers if non-empty
9c5f95a libceph: isolate other message data fields
e7817c8 libceph: set page info with byte length
634a072 libceph: isolate message page field manipulation
c9582368 libceph: record byte count not page count
f1e83e8 libceph: simplify new message initialization
714b861 libceph: advance pagelist with list_rotate_left()
7ac3ec3 libceph: define and use in_msg_pos_next()
3bfaa45 libceph: kill args in read_partial_message_bio()
c4e8b22 libceph: change type of ceph_tcp_sendpage() "more"
479066e libceph: minor byte order problems in read_partial_message()
1e82218 libceph: define CEPH_MSG_MAX_MIDDLE_LEN

History

#1 Updated by Alex Elder almost 8 years ago

Three more things. If there are no connection failures
injected this problem does not occur.

When socket error injection is active, I am used to seeing
a handful of these messages, two or three at a time, or
maybe dozens of them between file system mounts (but typically
10 seconds or more apart). The behavior I'm talking about
is hundreds of reports over the course of a second or two.

Oddly, I ran some tests with some patches Josh just
reported today in e-mail that he reviewed, and the
problem did not seem to be there. It's possible this
behavior is in some of my newest patches, which I have not
yet posted for review.

(I've been trying to do a lot at once so I could have
gotten confused about that. I'll need to verify this
before going too far with the testing branch.)

#2 Updated by Ian Colle almost 8 years ago

  • Priority changed from Normal to High

#3 Updated by Alex Elder almost 8 years ago

  • Status changed from New to In Progress

I did a little work bisecting this about a week ago.
I'm not 100% trusting the result though, so I'm going
to try to reproduce those results using a subset of
them. The reason I want to be sure is that it is normal
to get some "socket closed* messages, but I'm trying
to find out what triggers the high volume of them.

So I'm going to see if I can narrow the test case on
this, verify its occurrence changed in recent commits
(among the last 20 or so), and zoom in to find the
cause.

For the record, here's what I found before.

{2004} elder@speedy-> git bisect log
git bisect start
  1. good: [d6c0dd6b0c196979fa7b34c1d99432fcb1b7e1df] libceph: fix decoding of pgids
    git bisect good d6c0dd6b0c196979fa7b34c1d99432fcb1b7e1df
  2. bad: [f6a8ba10a049940608fc814ff9650fff39e99885] libceph: more cleanup of write_partial_msg_pages()
    git bisect bad f6a8ba10a049940608fc814ff9650fff39e99885
  3. good: [83ca14fdd35821554058e5fd4fa7b118ee504a33] libceph: add support for HASHPSPOOL pool flag
    git bisect good 83ca14fdd35821554058e5fd4fa7b118ee504a33
  4. good: [1e822187b52cd990557121a7953396cb57fcf975] libceph: define CEPH_MSG_MAX_MIDDLE_LEN
    git bisect good 1e822187b52cd990557121a7953396cb57fcf975
  5. good: [7c1775e2ce1dcd3794642cc40182bdfd5e2026c0] libceph: consolidate message prep code
    git bisect good 7c1775e2ce1dcd3794642cc40182bdfd5e2026c0
  6. good: [cc07389943444278fc3c0c2f8bd02c46f46320aa] libceph: start defining message data cursor
    git bisect good cc07389943444278fc3c0c2f8bd02c46f46320aa
  7. good: [6e8cc2145bfb6f10e82e7f6a2751fb35a7a87cee] libceph: implement pages array cursor
    git bisect good 6e8cc2145bfb6f10e82e7f6a2751fb35a7a87cee
  8. bad: [b4ae545af75c92dcc5bf8ed95a5453c3cf8a57a4] libceph: have osd requests support pagelist data
    git bisect bad b4ae545af75c92dcc5bf8ed95a5453c3cf8a57a4
  9. good: [58516b574b7dd9289248609a81fcdb692cc11e01] libceph: let osd ops determine request data length
    git bisect good 58516b574b7dd9289248609a81fcdb692cc11e01 {2005} elder@speedy->

#4 Updated by Alex Elder almost 8 years ago

Just a little update.

I've been testing the "testing" branch with groups of the
xfstests I use to see if I could limit what's required to
reproduce the problem. So far I haven't hit the problem.

So far I've done these groups:

 1-9 11-15 017 19-21
 26-29 31-34 41 46-48 50-54 56 61 63-67 69 70
 74
 75
 76 78 79 84-89
 91

I'm doing them in blocks that take under about 10 minutes
to run.

I was going to try a complete run on a separate cluster
while I wait for these, but I am unable to allocate 3
nodes.

#5 Updated by Alex Elder almost 8 years ago

  • Status changed from In Progress to 12
  • Priority changed from High to Normal

OK, I've tested this in groups using the current ceph-client
"testing" branch:
06fb6a9 ceph: fix buffer pointer advance in ceph_sync_write
And whatever the current ceph "next" branch is, along with this
teuthology:
4d28a65 stop ignoring osd leaks

Server-side socket injection was set to:
ms inject socket failures: 2500
In no case did I see this problem occur. I also ran two complete
runs of xfstests and again, I saw no evidence of an inordinate
amount of "socket closed" messages showing on the client console.

I'm dropping the priority of this to "normal" and am going to
wait to see if I see it show up again before looking at it again.
As such I'm going to mark it "Verified" for now, but if it doesn't
show up in a week or so I think it can go to "Can't reproduce."

#6 Updated by Alex Elder almost 8 years ago

I seem to have hit this problem again while attempting to test
my most recent changes. It arose while running xfstest 70 but
that may not necessarily be the issue.

So this may be a new problem in code not yet committed. I'll
track it down before committing those new changes.

#7 Updated by Alex Elder almost 8 years ago

I am bisecting this today. I'm going to take notes here to
make sure I don't screw it up.

I started with the "wip-reviewed" branch I was testing before,
which contains the 13 patches that Josh reviewed but which I
have not committed yet:

addc14c libceph: make message data be a pointer
811fde6 libceph: use only ceph_msg_data_advance()
950510c libceph: kill last of ceph_msg_pos
fbdcddf libceph: kill most of ceph_msg_pos
98f8eb9 libceph: use cursor resid for loop condition
b227450 libceph: collapse all data items into one
3856487 libceph: get rid of read helpers
9f7d792 libceph: no outbound zero data
bec9c87 libceph: use cursor for inbound data pages
b88051a libceph: kill ceph message bio_iter, bio_seg
04655f9 libceph: use cursor for bio reads
42573a7 libceph: record residual bytes for all message data types
15b53ed libceph: drop pages parameter

I was able to reproduce the problem while running test 74, having
provided the following yaml file (which corresponds to the last
commit, addc14c):

interactive-on-error:
roles:
- [mon.a, mon.c, osd.0, osd.1, osd.2]
- [mon.b, mds.a, osd.3, osd.4, osd.5]
- [client.0]
kernel:
  mon:
    branch: wip-reviewed
    kdb: true
  client:
    branch: wip-reviewed
    kdb: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
tasks:
- install:
- ceph:
    branch: next
    conf:
        global:
            ms inject socket failures: 2500
        mon:
            debug mon: 10
            debug paxos: 
        osd:
            debug ms: 1
            debug journal: 20
            osd debug op order: true
- rbd:
    all:
- rbd.xfstests:
    client.0:
        count: 1
        tests: 56 61 63-67 69 70 74-76 78 79

#8 Updated by Alex Elder almost 8 years ago

Verified that I reproduced the problem when reducing the list of
tests to this (again, while running test 74):
tests: 70 74-76
So from here on out that's what I'll be running.

#9 Updated by Alex Elder almost 8 years ago

After a few iterations I have concluded it's this patch
where the problems start to appear:
811fde6 libceph: use only ceph_msg_data_advance()
I am repeating the test (and iterating several times
over it) with its predecessor commit to verify that.

Meanwhile I'll try to see what's going on with that one.

#10 Updated by Alex Elder almost 8 years ago

OK, I screwed up, and that's why I was taking notes.

The bad commit was this one, NOT the one I said before:

950510c libceph: kill last of ceph_msg_pos

This one changed some details of handling message
CRC's so it's a reasonable place for something like
this to go wrong.

Its predecessor commit was this:
fbdcddf libceph: kill most of ceph_msg_pos
I am going to commit up to and including that commit
to the "testing" branch. Then I'll try to fix this
CRC one.

#11 Updated by Alex Elder almost 8 years ago

I have confirmed that the osds are reporting crc errors
at the time these repeated connection closes are occurring.

2013-03-29 08:12:15.779075 7f4e14865700 0 bad crc in data 3831343678 != exp 621
212944
2013-03-29 08:12:15.800400 7f4e14865700 0 bad crc in data 3831343678 != exp 387
0950775
2013-03-29 08:12:15.820265 7f4e14865700 0 bad crc in data 3831343678 != exp 273
6929673
2013-03-29 08:12:15.843569 7f4e14865700 0 bad crc in data 1244014471 != exp 231
8485120
2013-03-29 08:12:25.531742 7f4e14865700 0 bad crc in data 2131376592 != exp 160
9367935
2013-03-29 08:12:25.595346 7f4e14865700 0 bad crc in data 3947293918 != exp 146
7719174
2013-03-29 08:12:25.621507 7f4e14865700 0 bad crc in data 2131376592 != exp 119

#12 Updated by Alex Elder almost 8 years ago

OK, I found the problem. It was subtle.

The patch in question got rid of the last remaining
piece of the "ceph_msg_pos" structure, which kept
track of whether the crc needed to be computed.

Instead, in the new code when the send or receive
routine requests the next piece of a data item to
be sent or received, it is informed whether the
result is moving onto a new piece. (A piece is
a page for a pagelist or a page array, or a
segment of a bio in a bio list.)

That was effectively what was happening previously;
the did_page_crc flag was cleared when you moved
onto a new piece, and set after the crc for the
piece had been calculated.

The subtle difference is that in the sending case,
the crc for the piece is calculated before
actually sending the data. The problem occurs if
the attempt to send returns 0, in which case
sending stops and resumes again next time con_work()
is called.

The old code set that did_page_crc flag when the
crc was calculated, meaning it would not be done
again (even when 0 bytes got sent on a previous
attempt). In the new code, sending 0 bytes does
not change which piece of data you're operating on,
and this means the need_crc flag doesn't get cleared.
So the crc would get computed twice if an attempt to
send a new piece sent 0 bytes.

The solution is easy though--just hold off calculating
the crc for a piece until after we know at least one
byte has been sent.

I'm going to keep this bug fix as a separate patch
to follow the one with the bug, and will post it
for review shortly.

#13 Updated by Alex Elder almost 8 years ago

  • Status changed from 12 to Fix Under Review

I have posted this patch for review:

[PATCH] libceph: don't add to crc unless data sent

In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered. Because it was computed before sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result. This would occasionally happen in
the evnet of a connection failure.

The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.

So don't calculate the crc for a piece until after we know at
least one byte of it has been sent. That will avoid this problem.

This resolves:
http://tracker.ceph.com/issues/4450

Signed-off-by: Alex Elder <>

#14 Updated by Alex Elder almost 8 years ago

  • Status changed from Fix Under Review to Resolved

The following has been committed to the ceph-client "testing"
branch:
0267c91 libceph: don't add to crc unless data sent

Also available in: Atom PDF