Bug #5043

Oops in remove_osd

Added by Sandon Van Ness about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature:

Description

Stack output:

Stack traceback for pid 29892
0xffff88022140bf20 29892 2 1 6 R 0xffff88022140c3a8 *kworker/6:1
ffff88020ad79b48 0000000000000018 ffff88022140c640 ffff88020ad79b68
ffffffffa00bfa0c ffff88020a285000 ffff8802244a4950 ffff88020ad79b98
ffffffffa00bfcff ffff88020a285030 ffff8802244a4950 0000000000000005
Call Trace:
[<ffffffffa00bfa0c>] ? __remove_osd+0x3c/0xa0 [libceph]
[<ffffffffa00bfcff>] ? __reset_osd+0x12f/0x170 [libceph]
[<ffffffffa00c0e3e>] ? osd_reset+0x7e/0x2b0 [libceph]
[<ffffffffa00b8e81>] ? con_work+0x571/0x2d40 [libceph]
[<ffffffff816724c0>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff81074aa1>] ? process_one_work+0x161/0x4b0
[<ffffffff810b8c05>] ? trace_hardirqs_on_caller+0x105/0x190
[<ffffffff81074b12>] ? process_one_work+0x1d2/0x4b0
[<ffffffff81074aa1>] ? process_one_work+0x161/0x4b0
[<ffffffff81076778>] ? worker_thread+0x118/0x340
[<ffffffff81076660>] ? manage_workers+0x320/0x320
[<ffffffff8107c46a>] ? kthread+0xea/0xf0
[<ffffffff8107c380>] ? flush_kthread_work+0x1a0/0x1a0
[<ffffffff8167b4ac>] ? ret_from_fork+0x7c/0xb0
[<ffffffff8107c380>] ? flush_kthread_work+0x1a0/0x1a0

This was the job:

kernel: &id001
  kdb: true
  sha1: b5b09be30cf99f9c699e825629f02e3bce555d44
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        filestore flush min: 0
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 6c1e4791782ce2b3e101ee80640d896bcda684de
  s3tests:
    branch: next
  workunit:
    sha1: 6c1e4791782ce2b3e101ee80640d896bcda684de
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC0IExVUuQEnwBzUbN6+jgEb/qLLsSDFLAe0OPv+R3Q3uUZn+4QjB+FQ3sMMEwaGpEhUfbpWn3xZbCmi4NngTsgPDjImYJPMeMecaxvVXqAJt4IM6gdBN0415lrKXtbXaGBJCmeFFDB+xN+JN6Qhk8DWN62DnJB8MS5vKn9u0S3HvtGzY/QrnutT3AX9I4097isbTepLFRC4n2CoAC9srXaxAprFgLgOIYHm386B3W7yK3yfIImWofvZxYpPCsr/7ws8DSgdRX9eDucGQfY2WcYKCEIgZclGPBhExm7q/ahHUqYKrzWY3RD93AuXJVgOJk5Yp8C1Ryx8vyJkAxBW4nb
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDuXajaQgHe9XnbLOzI8WWFYVz6+TnOiTzbkIJPGOZpzQEjnUtJraQIEt5ABSeovMjiEj+V4XvunfyuSmEd0H9giRSyjmCHTPGlpndfTeCdVtCBpNqf5GkUqHaEY1Hp57XPbya2rGlwtFm0NeIDYx6pfkejKnsTOUqwhgUb6950TRhjHQhMjFgyALSyfAm/4y6vGZfjm57+yyih6XgDkqWiiQ6Y/aJVR2n+iCzvqEzV7JSCU+Brn+k8IQLHho1fadYqc5PjYct5BaVlHcP6c+T8nJE/DvqGwZ4gQaVJcuWJiDfLOPPYo1g/0AFicxauLwVNJ6HFR9FjLLGtGU+2DcVN
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDp3cwfZhOipCot6NiKX4cRMn4zx43QY0+5HdqzCQU2y7OrOJt3d0qvifnZPyeq8/d+aW2WL2OM8m4taz380JsP0SLmlpY8D0pGY/tN0pQDqIFd8EboMtKY6tR8unQrVzuczMqup/tkKSfdRp0zAeTiJ8qH7l9MaVcOw6WfRACb8f7APJE2gVRBrzPAdbqKzAphTRzZSz0cq722AX7XQDPT2dz7NoTp5Tk7xaQdDu2II+78B1H27IWdyYeonfy17yf9N+IA2Xzna/g5zu8apg7UvzyFmHunLyjr78dhPtR39201A0QJ5x5Qli9/UaB3LwiqnbCiGfx4xWFazdUFzxiD
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh

It's on plana45, which will not be reclaimed until the end of the week at the earliest, if someone wants to check it out.

History

#1 Updated by Alex Elder about 7 years ago

  • Status changed from New to In Progress
  • Assignee set to Alex Elder

I'm taking this for the time being.

This is in rb_erase(), which suggests the osd client's red-black
tree got corrupted.

I'll dig a little deeper to see if I can find out more.

#2 Updated by Alex Elder about 7 years ago

The ceph_osd_client in question is osdc @ 0xffff8802244a4950

The ceph_osd is osd @ 0xffff88020a285000
(struct rb_root *) &osdc->osds.rb_node is ffff8802218a1018 -> {
long unsigned int __rb_parent_color = 0000000000000001
struct rb_node * rb_right = ffff880063513018 ->
struct rb_node * rb_left = ffff880209db4818
}

(struct rb_node *) &osd->node is 0xffff88020a285018 = {
long unsigned int __rb_parent_color = ffff880209db2819
struct rb_node * rb_right = 0000000000000000
struct rb_node * rb_left = 0000000000000000
}

The osd getting removed has empty requests and linger requests
lists, and otherwise looks completely fine (though it does have
two references).

struct ceph_osd @ 0xffff88020a285000 -> {
atomic_t o_ref = { 00000002 }
(pad) = 00000000
struct ceph_osd_client * o_osdc = ffff8802244a4950

int                        o_osd = 00000005
int o_incarnation = 00000001
struct rb_node o_node = {
parent_color = ffff880209db2819
right = 0000000000000000
left = 0000000000000000
}
struct ceph_connection o_con = {
ffff88020a285000 ffffffffa00cf1c0
...
0000000000000000
0000000000000032
}
struct list_head o_requests = ffff88020a285538 = {
next = ffff88020a285538
prev = ffff88020a285538
}
struct list_head o_linger_requests
ffff88020a285548 = {
next = ffff88020a285548
prev = ffff88020a285548
}
struct list_head o_osd_lru ffff88020a285558 = {
next = ffff88020a285558
prev = ffff88020a285558
}
struct ceph_auth_handshake o_auth =
ffff88020a285568 = {
struct ceph_authorizer * authorizer = 0
void * authorizer_buf = 0
size_t authorizer_buf_len = 0
void * authorizer_reply_buf = 0
size_t authorizer_reply_buf_len = 0
}
long unsigned int lru_ttl = 00000000
int o_marked_for_keepalive = 00000000
struct list_head o_keepalive_item @ ffff88020a285590 = {
next = ffff88020a285590
prev = ffff88020a285590
}
};

#3 Updated by Alex Elder about 7 years ago

I think I may have just found something.

It appears as though access to the osd client's osd red-black
tree is supposed to be protected using osdc->request_mutex.

An osd is added to that tree only in __insert_osd(). That
function is called only by __map_request(), which does no
additional locking of its own. __map_request() is called from
three places: twice in kick_requests() and once in
ceph_osdc_start_request().

In ceph_osdc_start_request(), osdc->request_mutex is held during
the __map_request() call. In kick_requests(), osdc->request_mutex
is taken within that function before calling __map_request()

So all insertions are protected by the request mutex.

An osd is removed from an osd client's tree only in __remove_osd().
There are three callers of __remove_osd(): remove_all_osds();
remove_old_osds(); and __reset_osd(). The first two of these
take and hold osdc->request_mutex while __remove_osd() is called.

But __reset_osd() does not. That function is called in two
places: __kick_osd_requests() and reset_changed_osds(). The
first is reached with osdc->request_mutex already held, so
__remove_osd() is protected on that path. But
reset_changed_osds() does not take the mutex, and the one place
it is called, kick_requests(), drops the mutex before making
the call.

So...

I think what happened is that kick_requests() got called, ultimately
changing the osd client's red-black tree, concurrent with another
call (in this case a call to __remove_osd() but there are other
possible failure modes), and that ultimately led to the corruption
in the osd client's osd red-black tree.

The problem can be seen in this commit:
e6d50f67 libceph: always reset osds when kicking
In that, the call to reset_changed_osds() was moved inside
kick_requests() rather than being called after it.

However, even there, the kick_requests() call was only protected
by osdc->map_sem (taken for writing), and that situation seems
to go back to antiquity (to when net/ceph/osd_client.c first got
added to the tree).

One last point. It's possible that osdc->map_sem was intended
to protect updates to the red-black tree. However that's wrong
too, because there appear to be cases where the semaphore is
taken for reading when in fact the tree could get updated.

I think the simplest fix is to move the call to
reset_changed_osds() in kick_requests() up a few lines, to
before the request mutex is dropped. The call would then follow
a call to ceph_monc_request_next_osdmap(), but I think that's
fine because that appears to be an asynchronous request to the
monitor and shouldn't impact the osd client.

#4 Updated by Alex Elder about 7 years ago

  • Status changed from In Progress to Fix Under Review

The following patch has been posted for review.

[PATCH] libceph: must hold mutex for reset_changed_osds()

#5 Updated by Alex Elder about 7 years ago

  • Status changed from Fix Under Review to Resolved

The following has been committed to the "testing" branch
of the ceph-client git repository:

14d2f38 libceph: must hold mutex for reset_changed_osds()

It has also been queued to send to Linus for the 3.10
Linux release cycle.
