Project

General

Profile

Actions

Subtask #3741

closed

Tasks #3755: krbd: use new request tracking code for sync object operations

krbd: rework request tracking code

Added by Alex Elder over 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

This is actually work that's mostly complete, but it never
got a bug assigned to it.

In order to handle layering we need to be able to submit
requests that come back from an OSD with ENOENT to other
OSD associated with a parent image. I tried doing this
with the existing code but after a while I concluded it
was just not structured in a way that was conducive to this.

For one thing, it assumed all requests were initiated by
the Linux bio layer, whereas we need to initiate requests
at the point an osd request completes.

So after some discussion with Josh et al we concluded that
re-implementing how this was done was warranted. I sent
around a design and set about implementing it.

As mentioned before, this is nearly complete. Further
updates will define what's left to do.

Actions #1

Updated by Alex Elder over 11 years ago

The new code for tracking rbd requests works, but
there are some error paths that still need to be
fleshed out.

Here are some specific cases:
- The (new) function rbd_obj_request_submit() calls
  ceph_osdc_start_request().  The cleanup for the
  case where that returns an error needs to be done.
- The (new) function rbd_image_request_submit() calls
  rbd_obj_request_submit(), and if the latter returns
  an error it is not yet properly handled.

There are a few other outstanding questions I've got
in the new code (some of which probably also apply on
the current/"old" code):
- If ceph_osd_wait_request() returns an error, is the
  reassert_version value returned in the osd request
  valid?
- When an osd request response indicates an error, is its
  reassert_version value valid?
- When an osd read request returns with an error, what
  is the meaning of the length field?  Is it valid?
- When an osd read request returns, I would like to
  validate some fields that are not now touched,
  such as the offset, payload_len, truncate_size and
  truncate_seq.
- Similarly, on a write request I'd like to perform
  whatever validation I can on the response.
- What happens if the length in a response to a
  write request is less than requested?  Does that
  indicate a zero-fill to the end of the original
  length should be done?
- There are some special zero-length requests (like
  flush).  We should deal with those.

Actions #2

Updated by Ian Colle over 11 years ago

  • Project changed from rbd to Linux kernel client
Actions #3

Updated by Ian Colle over 11 years ago

  • Parent task set to #3755
Actions #4

Updated by Alex Elder over 11 years ago

I did some testing yesterday and found that I got I/O errors
while running xfstests. This was unexpected; I thought this
code was nearly done (and I did test it before).

I made a few minor changes, and have run tests again. This time
I got through xfstest 075 without error.

HOWEVER I just got a crash. First I got a warning in the
rcu code (weird), but that was followed less than a second later
by a crash in kmem_cache_alloc(). Based on this I suspect there's
a memory allocation/free problem somewhere.

Still looking.

Actions #5

Updated by Alex Elder over 11 years ago

I am leaving shortly for a few hours. In reviewing this
new code I find a few things that make it a little hard
make sure certain things are getting allocated and freed
at all the right times. I have some changes I'm working
through that will improve on that, but I still need to
look them over a bit more before testing with them.

Basically they involve replacing an array of object request
structures with an array of pointers to them. That makes
it easier to get rid of object requests when their last
reference is dropped. It also will make it much more
natural to have object requests (such as rbd image header
reads) stand on their own, without being connected to
an rbd image request (which is used for I/O).

I'll continue tonight and tomorrow morning.

Actions #6

Updated by Alex Elder over 11 years ago

My full test run isn't complete but I seem to have resolved
whatever problem I was hitting yesterday. I have not yet
completed the switchover to using pointers instead of structs
in the image request structure; I'm still finishing that up.

Actions #7

Updated by Alex Elder over 11 years ago

Unfortunately my system crashed after an hour or so. The
crash was in the network driver, and a little analysis
suggested it could be some sort of weird bug, so I restarted
to try again.

This time, however, I'm looking at the slab allocation stats
and it looks like there's a memory leak, at least in the
slab named "kmalloc-192", where I show over 5 million objects
and growing right now. That might be a hint at the source
of the leak. (A ceph_msg_header is 53 bytes; a ceph_msg is
208 bytes, and so on.) I'm turning on some memory debugging
I put together last month.

Actions #8

Updated by Alex Elder over 11 years ago

For some reason my tests started hanging on Friday when
I added memory debug code for catching leaks and reuses.
I went through many iterations trying to figure out what
caused this, and was surprised that even what I thought
was working was also hanging. This morning I learned
that something in the ceph "next" branch had caused this.

I think I'm back to working with what was my best and
latest version of the code before this occurred now, but
I'm a little afraid I may have back-tracked. This episode
has been quite unhelpful.

I finally looked at the memory leak test results though
and find that there is only one spot that seems to be
leaking, and it's only 24 bytes, and is not something that
happens a lot (it's the snapshot context).

This process produces a lot of output on the console so
I'm trying not to run too many tests, in order to get
complete information about allocations and frees. I
may have to do more though.

Actions #9

Updated by Alex Elder over 11 years ago

I spent the day trying to find the memory leak and finally
found it. The structure being leaked was a bio. It was
not caught by my kmalloc debug code because these are
allocated using bio_alloc(), which won't use my clever
macros.

I noticed that the "kmalloc-192" slab cache was growing
larger and larger over time. So I wanted to look at
what was consuming that slab cache.

The way I found it was through miring through raw memory
using kdb.
- I looked through all the kmem caches until I found the
data structure describing "kmalloc-192"
- I figured out that the per-node partial list linked
together a set of page structures on their lru links.
- I learned that for pages in a slab cache, the third
8-byte value in the page structure points to the
first free object in the page. That gave me a pointer
within the page.
- I looked at the 192-byte chunks of that page, and
after some head scratching and heavy duty looking
through data structures, figured out that the most
common thing there looked like a struct bio.

Given that, I looked at the rbd code and, sure enough,
each object request was given a bio structure (or list
of them) but it was not freed when the object request
completed.

I'm waiting for a build to complete so I can test it.

Actions #10

Updated by Alex Elder over 11 years ago

OK, I ran a test and got a crash. The bio built for
an object request gets handed off to an osd request.
I need to work through the ownership life cycle for
that.

But not tonight...

Actions #11

Updated by Alex Elder over 11 years ago

I found the source of my trouble, and in the process understood
a little more about some subtlety in bio reference counting
that exists between the osd client (in the libceph kernel
module) and rbd. The rbd client needs to take a reference
to the first bio in the list for any bio list it passes in
an osd request. Later, the osd client is responsible for
dropping that reference. I created a few new tracker entries
to suggest improving that.

The leak was due to resetting the osd request to NULL before
calling ceph_osdc_put_request(), which prevented that function
from dropping the bio reference.

Actions #12

Updated by Alex Elder over 11 years ago

  • Status changed from New to In Progress

Considering this "is actually work that's mostly complete"
I'm (finally) marking it "In Progress."

This code is functioning correctly and nearly complete.

I have a series of about 13 patches right now, about half
of which are adding the new code in stages and the other
half which are deleting the code that is thereby made obsolete.

I am doing some final touch-ups and giving it a final review
myself before I send it all out for review. I expect to
have it posted in the next two days.

Actions #13

Updated by Alex Elder over 11 years ago

  • Status changed from In Progress to Fix Under Review

I've posted this code for review. I continue to do testing.

Actions #14

Updated by Alex Elder about 11 years ago

  • Status changed from Fix Under Review to 7
Actions #15

Updated by Alex Elder about 11 years ago

My testing on this code is nearly complete. However, I'm going
to hold off on pushing this (along with the changes associated
with 3754, 3755 and 3877) to the testing branch for two reasons:
- I just updated the testing branch to be based on a much newer
version of the Linux kernel, and I'd like to give that change
alone one or more passes through the nightly test suite before
adding more to it.
- There are a few more known problems (3940, 3937, 3427, and
possibly 3950) that I'd like to get reviewed and committed
before I push them to the testing branch.

Actions #16

Updated by Alex Elder about 11 years ago

  • Status changed from 7 to Resolved

commit 9ac90ea3d8dd6ab82f3665a132ca29e6ada56ad8
Author: Alex Elder <>
Date: Thu Nov 22 00:00:08 2012 -0600

rbd: new request tracking code
Actions

Also available in: Atom PDF