Bug #10889


krbd: sent out of order write

Added by Yuri Weinstein about 9 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Low
Assignee: Ilya Dryomov
Category: libceph
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.front.sepia.ceph.com/teuthology-2015-02-12_20:00:06-krbd-giant-testing-basic-multi/
Job: ['755352']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-12_20:00:06-krbd-giant-testing-basic-multi/755352/

Assertion: osd/ReplicatedPG.cc: 1955: FAILED assert(0 == "out of order op")
ceph version 0.87-159-gccb0914 (ccb0914f76da23acdd7374233cd1939ab80ef3c8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb82c4b]
 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1815) [0x8803a5]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x34af) [0x883d2f]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x611) [0x81f881]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x686b45]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x37b) [0x6870db]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x821) [0xb72d51]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb74e60]
 9: (()+0x8182) [0x7f73b9015182]
 10: (clone()+0x6d) [0x7f73b758138d]

Related issues: 1 (0 open, 1 closed)

Related to Linux kernel client - Bug #12691: ffsb osd thrash test - osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op") (Resolved, Ilya Dryomov, 08/13/2015)

Actions #1

Updated by Sage Weil about 9 years ago

  • Project changed from Ceph to Linux kernel client
  • Subject changed from FAILED assert(0 == "out of order op") in krbd-giant-testing-basic-multi run to krbd: sent out of order write
Actions #2

Updated by Sage Weil about 9 years ago

2015-02-13 11:08:23.489822 7f73903df700  1 -- 10.214.135.8:6805/31779 <== client.4111 10.214.131.10:0/1389565972 76 ==== osd_op(client.4111.1:32938 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 4096000~98304] 0.50046e9a ondisk+write e62) v4 ==== 208+0+98304 (3323975164 0 0) 0x493b700 con 0x4d72940
2015-02-13 11:08:23.498053 7f739af21700  1 -- 10.214.135.8:6805/31779 --> 10.214.131.10:0/1389565972 -- osd_op_reply(32934 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 2523136~524288] v62'1652 uv1652 ondisk = 0) v6 -- ?+0 0x62902c0 con 0x4d72940
2015-02-13 11:08:23.508830 7f739af21700  1 -- 10.214.135.8:6805/31779 --> 10.214.131.10:0/1389565972 -- osd_op_reply(32935 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 3047424~524288] v62'1653 uv1653 ondisk = 0) v6 -- ?+0 0x511b180 con 0x4d72940
2015-02-13 11:08:23.516003 7f739af21700  1 -- 10.214.135.8:6805/31779 --> 10.214.131.10:0/1389565972 -- osd_op_reply(32936 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 3571712~524288] v62'1654 uv1654 ondisk = 0) v6 -- ?+0 0x66df180 con 0x4d72940
2015-02-13 11:08:23.516795 7f739af21700  1 -- 10.214.135.8:6805/31779 --> 10.214.131.10:0/1389565972 -- osd_op_reply(32938 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 4096000~98304] v62'1655 uv1655 ondisk = 0) v6 -- ?+0 0x4c22840 con 0x4d72940
2015-02-13 11:08:29.692828 7f73903df700  1 -- 10.214.135.8:6805/31779 <== client.4111 10.214.131.10:0/1389565972 196 ==== osd_op(client.4111.1:32775 rb.0.100c.238e1f29.000000000089 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~372736] 0.50046e9a RETRY=1 ondisk+retry+write e63) v4 ==== 208+0+372736 (102131416 0 0) 0x5dd4a80 con 0x4d72940

Actions #3

Updated by Josh Durgin about 9 years ago

  • Backport deleted (giant)

Like Sage mentioned via email, this is a kernel client issue that shouldn't block a giant point release.

Actions #4

Updated by Ilya Dryomov about 9 years ago

  • Assignee set to Ilya Dryomov

We try to be careful to stick requests onto the req_unsent list in such a way that they are always sent in tid order, so this looks like a bug somewhere around kick_requests(). Not much to go on, though...
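
For readers unfamiliar with the code, here is a minimal standalone illustration of the invariant described above (this is not the libceph implementation; the struct and helper names are made up): requests are inserted into the unsent list so that the list stays sorted by tid, and are therefore sent in tid order even if they were requeued out of order.

/*
 * Illustration only -- not the libceph code.  Inserting into the unsent
 * list keeps it sorted by tid, so traversal sends requests in tid order.
 */
#include <stdio.h>
#include <stdlib.h>

struct request {
	unsigned long long tid;
	struct request *next;
};

/* Insert 'req' so that the singly linked list stays sorted by tid. */
static void enqueue_in_tid_order(struct request **head, struct request *req)
{
	struct request **p = head;

	while (*p && (*p)->tid < req->tid)
		p = &(*p)->next;
	req->next = *p;
	*p = req;
}

int main(void)
{
	struct request *unsent = NULL;
	unsigned long long tids[] = { 3, 1, 2 };	/* e.g. requeued out of order */

	for (int i = 0; i < 3; i++) {
		struct request *r = malloc(sizeof(*r));
		r->tid = tids[i];
		enqueue_in_tid_order(&unsent, r);
	}
	for (struct request *r = unsent; r; r = r->next)
		printf("send tid %llu\n", r->tid);	/* prints 1, 2, 3 */
	return 0;
}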

Actions #5

Updated by Josh Durgin about 9 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Benoît Canet almost 9 years ago

I spotted two things in the logs:

  • No OSD log gives a reference to the first try of this write.
  • The kern.log of the client describes the connection as closed around the same time.

The write must have been internally dropped and postponed for retry by the client code because the connection had been closed.

This leads me to a naive question.

After the socket is closed and reopened in con_work():

	if (fault)
		con_fault(con);
	mutex_unlock(&con->mutex);

	if (fault)
		con_fault_finish(con);

what kind of guarantees do we have that ceph_osdc_start_request() will not take osdc->request_mutex before osd_reset() takes it, insert its request into the unsent queue, and have it submitted to us before osd_reset() does the same for the already submitted requests?

A race condition producing an out-of-order request?

Actions #7

Updated by Ilya Dryomov almost 9 years ago

  • Regression set to No

I think the idea is that both are critical sections protected by request_mutex. Either ceph_osdc_start_request() queues a request to the old OSD "session", in which case osd_reset() would catch it and requeue it, or it queues the request to the new "session" with all of the old requests already requeued. If ceph_osdc_start_request() grabs the mutex right before osd_reset(), it will successfully queue the request, but it won't be sent over the wire because con_work() for that "session" is going to be blocked on the mutex. Once the mutex is released, osd_reset() will proceed and requeue both the old requests and this new request.

That said, it's very possible that something along these lines happened here.
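
To make the intended serialization easier to follow, here is a rough, compilable sketch using a userspace mutex as a stand-in for request_mutex; the function names echo the kernel ones, but the bodies are simplified assumptions, not the real implementations.

/*
 * Sketch of the serialization described above.  A pthread mutex stands in
 * for request_mutex; the bodies are placeholders, not the real kernel code.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t request_mutex = PTHREAD_MUTEX_INITIALIZER;

static void queue_request(const char *which_session)
{
	printf("queued to %s session\n", which_session);
}

static void requeue_old_then_new(void)
{
	printf("requeue old requests (low tids), then new ones (high tids)\n");
}

/* Mirrors ceph_osdc_start_request(): queue under the lock. */
static void start_request_sketch(void)
{
	pthread_mutex_lock(&request_mutex);
	/* Either this lands on the old session (osd_reset() will requeue it)
	 * or on the new session (old requests were already requeued first). */
	queue_request("current");
	pthread_mutex_unlock(&request_mutex);
}

/* Mirrors osd_reset(): tear down and requeue, also under the lock. */
static void osd_reset_sketch(void)
{
	pthread_mutex_lock(&request_mutex);
	requeue_old_then_new();
	pthread_mutex_unlock(&request_mutex);
}

int main(void)
{
	start_request_sketch();	/* may win the race for the mutex... */
	osd_reset_sketch();	/* ...but the requeue still restores tid order */
	return 0;
}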

Actions #8

Updated by Benoît Canet almost 9 years ago

ceph_con_send() will be called by ceph_osdc_start_request(), and since the connection is not closed anymore it will list_add_tail(&msg->list_head, &con->out_queue); then, when osd_reset() kicks, it will have no choice but to queue the previous messages in con->out_queue after the newer one. Then con_work() will send them in the reverse order.

Am I missing something?

Actions #9

Updated by Benoît Canet almost 9 years ago

Everything in messenger.c uses con->mutex.

Actions #10

Updated by Ilya Dryomov almost 9 years ago

osd_reset() will clear out_queue before kicking and then requeue the old ones and the new one (in that order).

Actions #11

Updated by Benoît Canet almost 9 years ago

I don't think this ticket is caused by the race I pointed to, but I want to push my own reasoning until it breaks.

The following code is outside of con->mutex:

	if (fault)
		con_fault_finish(con);

Also:

	ceph_msgr_wq = alloc_workqueue("ceph-msgr", WQ_MEM_RECLAIM, 0);

The max_active of 0 means we can have up to 256 different work items picked up at the same time.

What would prevent the workqueue from processing an item just added to con->out_queue before the reset is done?

Since ceph_con_send() does:

	/* if there wasn't anything waiting to send before, queue
	 * new work */
	if (con_flag_test_and_set(con, CON_FLAG_WRITE_PENDING) == 0)
		queue_con(con);

Actions #12

Updated by Benoît Canet almost 9 years ago

OK, I found it :)

con->work is the same work item for a given connection, so its executions are serialized onto the same worker; no worries.
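
For completeness, the workqueue property being relied on can be demonstrated with a tiny out-of-tree module sketch (the module itself is hypothetical; only the workqueue API calls are real): queueing the same work_struct again while it is still pending is a no-op, and on current kernels the same item never runs concurrently with itself, which is what serializes con_work() per connection.

/*
 * Hypothetical demo module -- not Ceph code.  Shows that a single
 * work_struct, like con->work, is effectively serialized.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static struct workqueue_struct *wq;
static struct work_struct con_work_item;	/* one item, like con->work */

static void con_work_sketch(struct work_struct *w)
{
	pr_info("con_work_sketch: start\n");
	msleep(100);				/* pretend to do socket I/O */
	pr_info("con_work_sketch: done\n");
}

static int __init demo_init(void)
{
	wq = alloc_workqueue("demo-msgr", WQ_MEM_RECLAIM, 0);
	if (!wq)
		return -ENOMEM;
	INIT_WORK(&con_work_item, con_work_sketch);

	/* The second queue_work() while the item is still pending is a no-op;
	 * even if it landed while the first run was in flight, the two runs
	 * would not overlap for the same work_struct. */
	queue_work(wq, &con_work_item);
	queue_work(wq, &con_work_item);
	return 0;
}

static void __exit demo_exit(void)
{
	flush_workqueue(wq);
	destroy_workqueue(wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");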

Actions #13

Updated by Ronny Hegewald over 8 years ago

I ran into the same issue some months ago and tracked down the cause while debugging another issue.

Assume we have OSDs 1-5.

Now let's say we have 3 incoming requests that are assigned TIDs 1-3 and are mapped to 2 OSDs like this:

TID 1 -> osd.1
TID 2 -> osd.2
TID 3 -> osd.2

Now assume the connection to osd.1 breaks down before TID 1 can be sent, for example because osd.1 has crashed. In the meantime TIDs 2 and 3 are sent to osd.2.

As soon as the messenger is informed about a new osdmap because of the crashed OSD, TID 1 will be remapped to another OSD. If that OSD is osd.2, it will arrive as an out-of-order TID and will crash osd.2 with that stack trace.

Back then I wrote a patch that would assign new TIDs in kick_requests() to keep TIDs in order. But it turned out that this wasn't necessary for the problem I was investigating, so I haven't cleaned it up for submission. And since I'm not sure that approach is the right way anyway, and you have to activate additional debugging to run into this issue, it didn't seem that important to get it out.

If there is any interest I can send it to ceph-devel as an RFC patch as it is.
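
Purely to make the idea concrete (this is not the actual patch, and the surrounding code is guessed): reassigning a fresh tid when a request is requeued might look roughly like the sketch below, using the r_tid / last_tid fields from the pre-4.7 osd_client; the helper name and everything else are hypothetical.

/*
 * Hypothetical sketch only -- NOT the real patch.  The idea: a request
 * being remapped to a new OSD gets a fresh, strictly larger tid, so it
 * can never sort before requests that were already sent.
 */
#include <linux/ceph/osd_client.h>

static void requeue_with_fresh_tid(struct ceph_osd_client *osdc,
				   struct ceph_osd_request *req)
{
	/* caller is assumed to hold the osd client's request lock */
	req->r_tid = ++osdc->last_tid;	/* new tid, higher than any sent one */
	/* ...then reinsert into the unsent list as usual... */
}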

Actions #14

Updated by Ilya Dryomov over 8 years ago

It won't be an out-of-order tid on osd.2 because, while tids are per-client, the tid order is enforced per-object. OSDs generally won't have a problem servicing a request with, say, tid 1000 after a request with tid 1500, if the tid 1000 request just got remapped.

The code in question is being rewritten (more slowly than I'd like), but I still wanted to reproduce this to make sure it's fixed in the new version when it's ready. I did more than 100 runs with a yaml from #12691 on identically specced boxes with no luck. I've only tried the current testing kernel though; what remains is to try exactly the kernel it triggered on.

Actions #15

Updated by Ronny Hegewald over 8 years ago

It has been some time since I looked at the code, but you're right, it shouldn't be a problem. Nonetheless, I have been running into this assert quite a lot. I had added code to the kernel messenger to log the TIDs of the sent messages, and I could always trace these out-of-order asserts back to a wrong TID order on the kernel side.

I have to recheck why this happened in my case.

Actions #16

Updated by Ilya Dryomov over 8 years ago

May I ask why you haven't reported this issue, if you've been running into it "quite a lot"? There's a good chance it would've been fixed by now...

I'd like to see those logs and the diffs you applied to the kernel to generate them, if possible. If you could also provide matching OSD logs, that would be great.

Actions #17

Updated by Ilya Dryomov almost 8 years ago

  • Category set to libceph
  • Status changed from New to Need More Info
  • Priority changed from High to Low

The OSD client has been rewritten in 4.7.

Actions #18

Updated by Ilya Dryomov about 5 years ago

  • Status changed from Need More Info to Resolved

Haven't seen out-of-order ops on 4.7 and later kernels.
